[[PageOutline(2-3, Contents)]]

= Haiku Package Format =
This document specifies the Haiku Package (HPKG) file format, which was designed for efficient use by Haiku's package file system. It is somewhat inspired by the [http://code.google.com/p/xar/ XAR format] (separate TOC and data heap), but aims for greater compactness (the TOC is not stored as XML).

Three stacked format layers can be identified:
 - A generic container format for structured data.
 - An archive format specifying how file system data are stored in the container.
 - A package format, extending the archive format with attributes for package management.

== The Data Container Format ==
An HPKG file consists of four sections:

 Header:: Identifies the file as an HPKG file and provides access to the other sections.
 Heap:: Contains arbitrary (mostly unstructured) data referenced by the next two sections.
 TOC (table of contents):: The main section, containing structured data with references to unstructured data in the Heap.
 Package Attributes:: A section similar to the TOC. Rather than describing the data contained in the file, it specifies metadata of the package as a whole.

All numbers in the HPKG are stored in big endian format or [http://en.wikipedia.org/wiki/LEB128 LEB128] encoding.

=== Header ===
The header has the following structure:
{{{
struct hpkg_header {
	uint32	magic;
	uint16	header_size;
	uint16	version;
	uint64	total_size;

	// package attributes section
	uint32	attributes_compression;
	uint32	attributes_length_compressed;
	uint32	attributes_length_uncompressed;

	// TOC section
	uint32	toc_compression;
	uint64	toc_length_compressed;
	uint64	toc_length_uncompressed;
	uint64	toc_attribute_types_length;
	uint64	toc_attribute_types_count;
	uint64	toc_strings_length;
	uint64	toc_strings_count;
};
}}}

 magic:: The string 'hpkg' (B_HPKG_MAGIC).
 header_size:: The size of the header.
 version:: The version of the HPKG format the file conforms to. The current version is 1 (B_HPKG_VERSION).
 total_size:: The total file size.
 attributes_compression:: The compression algorithm used for the package attributes section.
 attributes_length_compressed:: The compressed size of the package attributes section. Equals attributes_length_uncompressed if the section is not compressed.
 attributes_length_uncompressed:: The uncompressed size of the package attributes section.
 toc_compression:: The compression algorithm used for the TOC section.
 toc_length_compressed:: The compressed size of the TOC section. Equals toc_length_uncompressed if the section is not compressed.
 toc_length_uncompressed:: The uncompressed size of the TOC section.
 toc_attribute_types_length:: The size of the attribute types subsection of the TOC section.
 toc_attribute_types_count:: The number of entries in the attribute types subsection of the TOC section.
 toc_strings_length:: The size of the strings subsection of the TOC section.
 toc_strings_count:: The number of entries in the strings subsection of the TOC section.

=== TOC ===
The TOC section contains a list of attribute trees. An attribute has a name, a data type, and a value, and can have child attributes. E.g.:

 - "shopping list" : string : "bakery"
   - "item" : string : "rye bread"
   - "item" : string : "bread roll"
     - "count" : int : 10
   - "item" : string : "cookie"
     - "count" : int : 5
 - "shopping list" : string : "hardware store"
   - "item" : string : "hammer"
   - "item" : string : "nail"
     - "size" : int : 10
     - "count" : int : 100

Attributes often share the same name and data type, particularly when a list of some kind is stored. To save space, each unique pair of name and data type is stored as an attribute type in a separate subsection and is referenced by an index.

A similar optimization exists for shared string attribute values. A string value used by more than one attribute is stored in the strings subsection and is referenced by an index as well.

Hence the TOC section consists of three subsections:

 Attribute types:: A table of attribute name, data type pairs.
 Strings:: A table of commonly used strings.
 Main TOC:: The attribute trees.

==== Attribute Types ====
The attribute types subsection consists of a list of attribute type entries terminated by a 0 byte. An attribute type entry is stored as:

 Attribute data type:: A uint8 specifying the data type.
 Attribute name:: A null-terminated UTF-8 string.

These are the specified data type values:

 ||0||B_HPKG_ATTRIBUTE_TYPE_INVALID||invalid||
 ||1||B_HPKG_ATTRIBUTE_TYPE_INT||signed integer||
 ||2||B_HPKG_ATTRIBUTE_TYPE_UINT||unsigned integer||
 ||3||B_HPKG_ATTRIBUTE_TYPE_STRING||UTF-8 string||
 ||4||B_HPKG_ATTRIBUTE_TYPE_RAW||raw data||

Each attribute type is implicitly assigned the (zero-based) index at which the respective entry appears in the list, i.e. the nth entry has the index n - 1. The attribute type is referenced by this index in the main TOC subsection.

==== Strings ====
The strings subsection consists of a list of null-terminated UTF-8 strings. The section itself is terminated by a 0 byte.

Each string is implicitly assigned the (zero-based) index at which it appears in the list, i.e. the nth string has the index n - 1. The string is referenced by this index in the main TOC subsection.

==== Main TOC ====
The main TOC subsection consists of a list of attribute entries terminated by a 0 byte. An attribute entry is stored as:

 Attribute tag:: An unsigned LEB128 encoded number.
 Attribute value:: The value of the attribute encoded as described below.
 Attribute child list:: Only if this attribute is marked to have children: A list of attribute entries terminated by a 0 byte.

The attribute tag encodes three pieces of information:
{{{(typeIndex << 3) + (encoding << 1) + hasChildren + 1}}}

Because of the final + 1, a valid tag is never 0, so a 0 byte unambiguously terminates a list of attribute entries.

 typeIndex:: The index of the attribute type.
 encoding:: Specifies the encoding of the attribute value as described below.
 hasChildren:: 1, if the attribute has children, 0 otherwise.
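
The tag layout above can be sketched in C as follows. This is only an illustration under stated assumptions, not code from Haiku: the names {{{hpkg_tag}}}, {{{read_uleb128}}}, {{{encode_tag}}} and {{{decode_tag}}} are hypothetical, and the encoding field is assumed to occupy exactly the two bits between {{{hasChildren}}} and {{{typeIndex}}}, as the shifts in the formula imply.

{{{
#!c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three tag fields (not a Haiku type). */
typedef struct {
	uint64_t type_index;   /* index into the attribute types table */
	unsigned encoding;     /* value encoding, 0..3 */
	int      has_children; /* 1 if a child list follows the value */
} hpkg_tag;

/*
 * Decode one unsigned LEB128 number from buf (at most len bytes).
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or does not fit into 64 bits.
 */
static size_t read_uleb128(const uint8_t *buf, size_t len, uint64_t *out)
{
	uint64_t value = 0;
	unsigned shift = 0;

	for (size_t i = 0; i < len && shift < 64; i++, shift += 7) {
		/* The low 7 bits carry payload, least significant group first. */
		value |= (uint64_t)(buf[i] & 0x7f) << shift;
		/* A clear high bit marks the last byte of the number. */
		if ((buf[i] & 0x80) == 0) {
			*out = value;
			return i + 1;
		}
	}
	return 0;
}

/* Compose a tag exactly as given by the formula in the text. */
static uint64_t encode_tag(uint64_t type_index, unsigned encoding,
	int has_children)
{
	return (type_index << 3) + ((uint64_t)encoding << 1)
		+ (has_children ? 1 : 0) + 1;
}

/*
 * Split a tag into its fields. Returns 0 if tag is the list terminator
 * (0), in which case *out is left untouched, and 1 otherwise.
 */
static int decode_tag(uint64_t tag, hpkg_tag *out)
{
	if (tag == 0)
		return 0;

	tag -= 1; /* undo the + 1 from the formula */
	out->has_children = (int)(tag & 1);
	out->encoding = (unsigned)((tag >> 1) & 0x3);
	out->type_index = tag >> 3;
	return 1;
}
}}}

A reader would typically call {{{read_uleb128}}} on the next bytes of the main TOC, pass the result to {{{decode_tag}}}, and stop descending into the current list when it reports the terminator.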
==== Attribute Values ====
A value of each of the data types can be encoded in different ways, as selected by the encoding value:

 - B_HPKG_ATTRIBUTE_TYPE_INT and B_HPKG_ATTRIBUTE_TYPE_UINT:
   ||0||B_HPKG_ATTRIBUTE_ENCODING_INT_8_BIT||int8/uint8||
   ||1||B_HPKG_ATTRIBUTE_ENCODING_INT_16_BIT||int16/uint16||
   ||2||B_HPKG_ATTRIBUTE_ENCODING_INT_32_BIT||int32/uint32||
   ||3||B_HPKG_ATTRIBUTE_ENCODING_INT_64_BIT||int64/uint64||
 - B_HPKG_ATTRIBUTE_TYPE_STRING:
   ||0||B_HPKG_ATTRIBUTE_ENCODING_STRING_INLINE||null-terminated UTF-8 string||
   ||1||B_HPKG_ATTRIBUTE_ENCODING_STRING_TABLE||unsigned LEB128: index into string table||
 - B_HPKG_ATTRIBUTE_TYPE_RAW:
   ||0||B_HPKG_ATTRIBUTE_ENCODING_RAW_INLINE||unsigned LEB128: size; followed by raw bytes||
   ||1||B_HPKG_ATTRIBUTE_ENCODING_RAW_HEAP||unsigned LEB128: size; unsigned LEB128: offset into heap||

=== Package Attributes ===
The package attributes section contains a list of attribute trees, just like the TOC section. Since the purpose of the section is to store metadata of the package as a whole, it will be relatively small and less repetitive (no or only short item lists). Therefore this section does not have attribute types and strings subsections. It directly stores a list of self-contained attribute entries terminated by a 0 byte. An entry has the following format:

 Attribute data type:: A uint8 specifying the data type of the attribute value.
 Has children:: A uint8: non-0 if the attribute has children, 0 otherwise.
 Attribute name:: A null-terminated UTF-8 string.
 Attribute value:: The value of the attribute encoded as described in the Attribute Values section.
 Attribute child list:: Only if this attribute is marked to have children: A list of attribute entries terminated by a 0 byte.

=== Section Compression ===
The TOC and the package attributes section can be compressed. The compression algorithm used is specified by the {{{toc_compression}}} or the {{{attributes_compression}}} field in the header, respectively.
The following values are defined:

 ||0||B_HPKG_COMPRESSION_NONE||no compression||
 ||1||B_HPKG_COMPRESSION_ZLIB||zlib (LZ77) compression||

== The Archive Format ==
This section specifies how file system objects (files, directories, symlinks) are stored in an HPKG file. It builds on top of the container format, defining the types of attributes, their order, and allowed values.

E.g. a "bin" directory, containing a symlink and a file:
{{{
bin           0  2009-11-13 12:12:09  drwxr-xr-x
  awk         0  2009-11-13 12:11:16  lrwxrwxrwx  -> gawk
  gawk   301699  2009-11-13 12:11:16  -rwxr-xr-x
}}}

could be represented by this attribute tree:

 - "dir:entry" : string : "bin"
   - "file:type" : uint : 1 (0x1)
   - "file:mtime" : uint : 1258110729 (0x4afd3f09)
   - "dir:entry" : string : "awk"
     - "file:type" : uint : 2 (0x2)
     - "file:mtime" : uint : 1258110676 (0x4afd3ed4)
     - "symlink:path" : string : "gawk"
   - "dir:entry" : string : "gawk"
     - "file:permissions" : uint : 493 (0x1ed)
     - "file:mtime" : uint : 1258110676 (0x4afd3ed4)
     - "data" : raw : size: 301699, offset: 0
     - "file:attribute" : string : "BEOS:APP_VERSION"
       - "file:attribute:type" : uint : 1095782486 (0x41505056)
       - "data" : raw : size: 680, offset: 301699
     - "file:attribute" : string : "BEOS:TYPE"
       - "file:attribute:type" : uint : 1296649555 (0x4d494d53)
       - "data" : raw : size: 35, offset: 302379

=== Attribute Types ===
The following attribute types are specified by the archive format. Any other attributes will be ignored.

==== B_HPKG_ATTRIBUTE_NAME_DIRECTORY_ENTRY ("dir:entry") ====
 - '''Type:''' string
 - '''Value:''' File name of the entry.
 - '''Allowed Values:''' Any valid file (not path!) name, except "." and "..".
 - '''Child Attributes:'''
   - B_HPKG_ATTRIBUTE_NAME_FILE_TYPE: The file type of the entry.
   - B_HPKG_ATTRIBUTE_NAME_FILE_PERMISSIONS: The file permissions of the entry.
   - B_HPKG_ATTRIBUTE_NAME_FILE_USER: The owning user of the entry.
   - B_HPKG_ATTRIBUTE_NAME_FILE_GROUP: The owning group of the entry.
   - B_HPKG_ATTRIBUTE_NAME_FILE_ATIME[_NANOS]: The entry's file access time.
   - B_HPKG_ATTRIBUTE_NAME_FILE_MTIME[_NANOS]: The entry's file modification time.
   - B_HPKG_ATTRIBUTE_NAME_FILE_CRTIME[_NANOS]: The entry's file creation time.
   - B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE: An extended file attribute associated with the entry.
   - B_HPKG_ATTRIBUTE_NAME_DATA: Only if the entry is a file: The file data.
   - B_HPKG_ATTRIBUTE_NAME_SYMLINK_PATH: Only if the entry is a symlink: The path the symlink points to.
   - B_HPKG_ATTRIBUTE_NAME_DIRECTORY_ENTRY: Only if the entry is a directory: The child entries in that directory.

==== B_HPKG_ATTRIBUTE_NAME_FILE_TYPE ("file:type") ====
 - '''Type:''' uint
 - '''Value:''' Type of the entry.
 - '''Allowed Values:'''
   ||0||B_HPKG_FILE_TYPE_FILE||file||
   ||1||B_HPKG_FILE_TYPE_DIRECTORY||directory||
   ||2||B_HPKG_FILE_TYPE_SYMLINK||symlink||
 - '''Default Value:''' B_HPKG_FILE_TYPE_FILE
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_PERMISSIONS ("file:permissions") ====
 - '''Type:''' uint
 - '''Value:''' File permissions.
 - '''Allowed Values:''' Any valid permission mask.
 - '''Default Value:'''
   - For files: 0644 (octal).
   - For directories: 0755 (octal).
   - For symlinks: 0777 (octal).
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_USER ("file:user") ====
 - '''Type:''' string
 - '''Value:''' Name of the user owning the file.
 - '''Allowed Values:''' Any non-empty string.
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_GROUP ("file:group") ====
 - '''Type:''' string
 - '''Value:''' Name of the group owning the file.
 - '''Allowed Values:''' Any non-empty string.
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_ATIME ("file:atime") ====
 - '''Type:''' uint
 - '''Value:''' File access time (seconds since the Epoch).
 - '''Allowed Values:''' Any value.
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_ATIME_NANOS ("file:atime:nanos") ====
 - '''Type:''' uint
 - '''Value:''' The nanoseconds fraction of the file access time.
 - '''Allowed Values:''' Any value in [0, 999999999].
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_MTIME ("file:mtime") ====
 - '''Type:''' uint
 - '''Value:''' File modification time (seconds since the Epoch).
 - '''Allowed Values:''' Any value.
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_MTIME_NANOS ("file:mtime:nanos") ====
 - '''Type:''' uint
 - '''Value:''' The nanoseconds fraction of the file modification time.
 - '''Allowed Values:''' Any value in [0, 999999999].
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_CRTIME ("file:crtime") ====
 - '''Type:''' uint
 - '''Value:''' File creation time (seconds since the Epoch).
 - '''Allowed Values:''' Any value.
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_CRTIME_NANOS ("file:crtime:nanos") ====
 - '''Type:''' uint
 - '''Value:''' The nanoseconds fraction of the file creation time.
 - '''Allowed Values:''' Any value in [0, 999999999].
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE ("file:attribute") ====
 - '''Type:''' string
 - '''Value:''' Name of the extended file attribute.
 - '''Allowed Values:''' Any valid attribute name.
 - '''Child Attributes:'''
   - B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE_TYPE: The type of the file attribute.
   - B_HPKG_ATTRIBUTE_NAME_DATA: The file attribute data.

==== B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE_TYPE ("file:attribute:type") ====
 - '''Type:''' uint
 - '''Value:''' Type of the file attribute.
 - '''Allowed Values:''' Any value in [0, 0xffffffff].
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_DATA ("data") ====
 - '''Type:''' raw
 - '''Value:''' Raw data of a file or attribute.
 - '''Allowed Values:''' Any value, if uncompressed, otherwise see below.
 - '''Child Attributes:'''
   - B_HPKG_ATTRIBUTE_NAME_DATA_COMPRESSION: The compression algorithm used for storing the data.
   - B_HPKG_ATTRIBUTE_NAME_DATA_SIZE: The size of the uncompressed data.

==== B_HPKG_ATTRIBUTE_NAME_DATA_COMPRESSION ("data:compression") ====
 - '''Type:''' uint
 - '''Value:''' ID of the data compression algorithm.
 - '''Allowed Values:'''
   ||0||B_HPKG_COMPRESSION_NONE||no compression||
   ||1||B_HPKG_COMPRESSION_ZLIB||zlib (LZ77) compression||
 - '''Default Value:''' B_HPKG_COMPRESSION_NONE
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_DATA_SIZE ("data:size") ====
 - '''Type:''' uint
 - '''Value:''' Size of the uncompressed data.
 - '''Allowed Values:''' Any value.
 - '''Default Value:''' Size of the compressed data.
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_DATA_CHUNK_SIZE ("data:chunk_size") ====
 - '''Type:''' uint
 - '''Value:''' Size of a compressed data chunk.
 - '''Allowed Values:''' Any value.
 - '''Default Value:'''
   - If not compressed: 0
   - If B_HPKG_COMPRESSION_ZLIB compressed: 64 * 1024
 - '''Child Attributes:''' none

==== B_HPKG_ATTRIBUTE_NAME_SYMLINK_PATH ("symlink:path") ====
 - '''Type:''' string
 - '''Value:''' The path the symlink refers to.
 - '''Allowed Values:''' Any valid symlink path.
 - '''Default Value:''' Empty string.
 - '''Child Attributes:''' none

=== TOC Attributes ===
The TOC can directly contain any number of attributes of the B_HPKG_ATTRIBUTE_NAME_DIRECTORY_ENTRY type, which in turn contain descendant attributes as specified in the previous section. Any other attributes are ignored.

=== Data Compression ===
Data referred to by a B_HPKG_ATTRIBUTE_NAME_DATA attribute are the raw data, if uncompressed. If compressed, the data have a special format that allows for fast random access.

==== B_HPKG_COMPRESSION_ZLIB ====
The original data are split into equally sized chunks and compressed individually.
The compressed data chunks are stored (in order) without padding, preceded by a uint64 array specifying the relative positions of the compressed data of each chunk. The positions are relative to the first byte following the position array. Since the first chunk is always at position 0, its array element is omitted. Therefore uncompressed data split into n chunks will have n - 1 position array elements.

== The Package Format ==
TODO...