Version 2 (modified by bonefish, 10 years ago)

Fixed typos and semantical errors. Extended the B_HPKG_COMPRESSION_ZLIB section -- allowing for uncompressed chunks.

Haiku Package Format

This document specifies the Haiku Package (HPKG) file format, which was designed for efficient use by Haiku's package file system. It is somewhat inspired by the XAR format (separate TOC and data heap), but aims for greater compactness (no XML for the TOC).

Three stacked format layers can be identified:

  • A generic container format for structured data.
  • An archive format specifying how file system data are stored in the container.
  • A package format, extending the archive format with attributes for package management.

The Data Container Format

An HPKG file consists of four sections:

Header
Identifies the file as an HPKG file and provides access to the other sections.
Heap
Contains arbitrary (mostly unstructured) data referenced by the next two sections.
TOC (table of contents)
The main section, containing structured data with references to unstructured data in the heap section.
Package Attributes
A section similar to the TOC. Rather than describing the data contained in the file, it specifies metadata of the package as a whole.

All numbers in the HPKG are stored in big endian format or LEB128 encoding.
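
As an illustration of the LEB128 encoding mentioned above, the following is a minimal sketch of an unsigned LEB128 decoder. The function name leb128_decode_unsigned and its return convention are assumptions made for this example; they are not taken from the Haiku sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one unsigned LEB128 number from buf (at most size bytes).
 * Stores the result in *value and returns the number of bytes consumed.
 * Returns 0 if the input is truncated or the encoding is longer than
 * 10 bytes (the most a 64 bit value can need). */
size_t leb128_decode_unsigned(const uint8_t *buf, size_t size, uint64_t *value)
{
	uint64_t result = 0;
	unsigned shift = 0;

	assert(buf != NULL && value != NULL);

	for (size_t i = 0; i < size; i++) {
		if (shift >= 64)
			return 0;	/* more than 10 bytes */

		/* the low 7 bits of each byte carry payload, least significant first */
		result |= (uint64_t)(buf[i] & 0x7f) << shift;

		/* a cleared high bit marks the final byte */
		if ((buf[i] & 0x80) == 0) {
			*value = result;
			return i + 1;
		}
		shift += 7;
	}

	return 0;	/* ran out of input before the final byte */
}
```

For instance, the byte sequence 0xE5 0x8E 0x26 decodes to 624485 and consumes three bytes.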

The header has the following structure:

struct hpkg_header {
	uint32	magic;
	uint16	header_size;
	uint16	version;
	uint64	total_size;

	// package attributes section
	uint32	attributes_compression;
	uint32	attributes_length_compressed;
	uint32	attributes_length_uncompressed;

	// TOC section
	uint32	toc_compression;
	uint64	toc_length_compressed;
	uint64	toc_length_uncompressed;

	uint64	toc_attribute_types_length;
	uint64	toc_attribute_types_count;
	uint64	toc_strings_length;
	uint64	toc_strings_count;
};
magic
The string 'hpkg' (B_HPKG_MAGIC).
header_size
The size of the header.
version
The version of the HPKG format the file conforms to. The current version is 1 (B_HPKG_VERSION).
total_size
The total file size.
attributes_compression
The compression algorithm used for the package attributes section.
attributes_length_compressed
The compressed size of the package attributes section. Equals attributes_length_uncompressed, if the section is not compressed.
attributes_length_uncompressed
The uncompressed size of the package attributes section.
toc_compression
The compression algorithm used for the TOC section.
toc_length_compressed
The compressed size of the TOC section. Equals toc_length_uncompressed, if the section is not compressed.
toc_length_uncompressed
The uncompressed size of the TOC section.
toc_attribute_types_length
The size of the attribute types subsection of the TOC section.
toc_attribute_types_count
The number of entries in the attribute types subsection of the TOC section.
toc_strings_length
The size of the strings subsection of the TOC section.
toc_strings_count
The number of entries in the strings subsection of the TOC section.
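
The header fields described above could be read as in the following sketch. It assumes that the fields are stored back to back without padding (80 bytes in total, going by the field widths in the structure), that all of them are big endian, and that the magic appears in the file as the four bytes 'h', 'p', 'k', 'g'. The names hpkg_header_info, hpkg_parse_header and read_be are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HPKG_HEADER_SIZE 80	/* assumed: sum of the field widths, no padding */

/* In-memory copy of the header fields (hypothetical helper type). */
struct hpkg_header_info {
	uint16_t header_size;
	uint16_t version;
	uint64_t total_size;
	uint32_t attributes_compression;
	uint32_t attributes_length_compressed;
	uint32_t attributes_length_uncompressed;
	uint32_t toc_compression;
	uint64_t toc_length_compressed;
	uint64_t toc_length_uncompressed;
	uint64_t toc_attribute_types_length;
	uint64_t toc_attribute_types_count;
	uint64_t toc_strings_length;
	uint64_t toc_strings_count;
};

/* Reads a big endian unsigned number of the given width (1 to 8 bytes). */
static uint64_t read_be(const uint8_t *p, size_t width)
{
	uint64_t value = 0;
	for (size_t i = 0; i < width; i++)
		value = (value << 8) | p[i];
	return value;
}

/* Returns 0 on success, -1 if the buffer is too short or the magic is wrong.
 * Checking the version and the section sizes is left to the caller. */
int hpkg_parse_header(const uint8_t *buf, size_t size,
	struct hpkg_header_info *out)
{
	assert(buf != NULL && out != NULL);

	if (size < HPKG_HEADER_SIZE || memcmp(buf, "hpkg", 4) != 0)
		return -1;

	out->header_size = (uint16_t)read_be(buf + 4, 2);
	out->version = (uint16_t)read_be(buf + 6, 2);
	out->total_size = read_be(buf + 8, 8);
	out->attributes_compression = (uint32_t)read_be(buf + 16, 4);
	out->attributes_length_compressed = (uint32_t)read_be(buf + 20, 4);
	out->attributes_length_uncompressed = (uint32_t)read_be(buf + 24, 4);
	out->toc_compression = (uint32_t)read_be(buf + 28, 4);
	out->toc_length_compressed = read_be(buf + 32, 8);
	out->toc_length_uncompressed = read_be(buf + 40, 8);
	out->toc_attribute_types_length = read_be(buf + 48, 8);
	out->toc_attribute_types_count = read_be(buf + 56, 8);
	out->toc_strings_length = read_be(buf + 64, 8);
	out->toc_strings_count = read_be(buf + 72, 8);
	return 0;
}
```

Reading the fields byte by byte, rather than casting the buffer to a struct, keeps the sketch independent of the host's byte order and alignment rules.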

TOC

The TOC section contains a list of attribute trees. An attribute has a name, a data type, and a value, and can have child attributes. E.g.:

  • "shopping list" : string : "bakery"
    • "item" : string : "rye bread"
    • "item" : string : "bread roll"
      • "count" : int : 10
    • "item" : string : "cookie"
      • "count" : int : 5
  • "shopping list" : string : "hardware store"
    • "item" : string : "hammer"
    • "item" : string : "nail"
      • "size" : int : 10
      • "count" : int : 100

Attributes often share the same name and data type, particularly when lists of some kind are stored. To save space, each unique name and data type pair is stored as an attribute type in a separate subsection and is referenced by an index.

A similar optimization exists for shared string attribute values. A string value used by more than one attribute is stored in the strings subsection and is referenced by an index as well.

Hence the TOC section consists of three subsections:

Attribute types
A table of attribute name, data type pairs.
Strings
A table of commonly used strings.
Main TOC
The attribute trees.

Attribute Types

The attribute types subsection consists of a list of attribute type entries terminated by a 0 byte. An attribute type entry is stored as:

Attribute data type
A uint8 specifying the data type.
Attribute name
A null-terminated UTF-8 string.

These are the specified data type values:

0  B_HPKG_ATTRIBUTE_TYPE_INVALID  invalid
1  B_HPKG_ATTRIBUTE_TYPE_INT      signed integer
2  B_HPKG_ATTRIBUTE_TYPE_UINT     unsigned integer
3  B_HPKG_ATTRIBUTE_TYPE_STRING   UTF-8 string
4  B_HPKG_ATTRIBUTE_TYPE_RAW      raw data

Each attribute type is implicitly assigned the (zero-based) index at which the respective entry appears in the list, i.e. the nth entry has the index n - 1. The attribute type is referenced by this index in the main TOC subsection.
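
A reader might build the index-to-type mapping as sketched below. The sketch assumes the whole (uncompressed) attribute types subsection is available in memory; the names hpkg_attribute_type and hpkg_parse_attribute_types are hypothetical. It relies on the fact that 0 is not a valid data type, so a 0 byte in the data type position can only be the list terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry of the attribute types table (hypothetical helper type).
 * The name points into the parsed buffer; nothing is copied. */
struct hpkg_attribute_type {
	uint8_t data_type;
	const char *name;
};

/* Parses the subsection in buf and fills types[0 .. count - 1], so that
 * the array position equals the implicitly assigned index.
 * Returns the number of entries, or -1 if the data are malformed or
 * more than max_types entries are present. */
int hpkg_parse_attribute_types(const uint8_t *buf, size_t size,
	struct hpkg_attribute_type *types, int max_types)
{
	size_t pos = 0;
	int count = 0;

	assert(buf != NULL && types != NULL);

	/* a 0 byte in place of the data type terminates the list */
	while (pos < size && buf[pos] != 0) {
		const uint8_t *name = buf + pos + 1;
		const uint8_t *end = memchr(name, 0, size - pos - 1);

		if (end == NULL || count == max_types)
			return -1;

		types[count].data_type = buf[pos];
		types[count].name = (const char *)name;
		count++;

		/* continue after the name's terminating null */
		pos = (size_t)(end - buf) + 1;
	}

	/* the terminating 0 byte must be present */
	return pos < size ? count : -1;
}
```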

Strings

The strings subsection consists of a list of null-terminated UTF-8 strings. The subsection itself is terminated by a 0 byte.

Each string is implicitly assigned the (zero-based) index at which it appears in the list, i.e. the nth string has the index n - 1. The string is referenced by this index in the main TOC subsection.

Main TOC

The main TOC subsection consists of a list of attribute entries terminated by a 0 byte. An attribute entry is stored as:

Attribute tag
An unsigned LEB128 encoded number.
Attribute value
The value of the attribute encoded as described below.
Attribute child list
Only if this attribute is marked to have children: A list of attribute entries terminated by a 0 byte.

The attribute tag encodes three pieces of information:

(typeIndex << 3) + (encoding << 1) + hasChildren + 1

typeIndex
The index of the attribute type.
encoding
Specifies the encoding of the attribute value as described below.
hasChildren
1, if the attribute has children, 0 otherwise.

Attribute Values

A value of each data type can be encoded in several ways. The encoding value in the attribute tag selects which one is used:

  • B_HPKG_ATTRIBUTE_TYPE_INT and B_HPKG_ATTRIBUTE_TYPE_UINT:
    0  B_HPKG_ATTRIBUTE_ENCODING_INT_8_BIT   int8/uint8
    1  B_HPKG_ATTRIBUTE_ENCODING_INT_16_BIT  int16/uint16
    2  B_HPKG_ATTRIBUTE_ENCODING_INT_32_BIT  int32/uint32
    3  B_HPKG_ATTRIBUTE_ENCODING_INT_64_BIT  int64/uint64
  • B_HPKG_ATTRIBUTE_TYPE_STRING:
    0  B_HPKG_ATTRIBUTE_ENCODING_STRING_INLINE  null-terminated UTF-8 string
    1  B_HPKG_ATTRIBUTE_ENCODING_STRING_TABLE   unsigned LEB128: index into string table
  • B_HPKG_ATTRIBUTE_TYPE_RAW:
    0  B_HPKG_ATTRIBUTE_ENCODING_RAW_INLINE  unsigned LEB128: size; followed by raw bytes
    1  B_HPKG_ATTRIBUTE_ENCODING_RAW_HEAP    unsigned LEB128: size; unsigned LEB128: offset into heap
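
For the integer encodings, the encoding value n corresponds to a width of 2^n bytes. The sketch below reads such a value, assuming it is stored big endian like the other fixed-width numbers in the format. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes occupied by an integer value with the given encoding
 * (0: 8 bit, 1: 16 bit, 2: 32 bit, 3: 64 bit). */
size_t hpkg_int_width(unsigned encoding)
{
	assert(encoding <= 3);
	return (size_t)1 << encoding;
}

/* Reads a B_HPKG_ATTRIBUTE_TYPE_UINT value (assumed big endian). */
uint64_t hpkg_read_uint(const uint8_t *buf, unsigned encoding)
{
	size_t width = hpkg_int_width(encoding);
	uint64_t value = 0;

	assert(buf != NULL);
	for (size_t i = 0; i < width; i++)
		value = (value << 8) | buf[i];
	return value;
}

/* Reads a B_HPKG_ATTRIBUTE_TYPE_INT value and sign-extends it to 64 bits. */
int64_t hpkg_read_int(const uint8_t *buf, unsigned encoding)
{
	uint64_t value = hpkg_read_uint(buf, encoding);
	unsigned bits = 8u * (unsigned)hpkg_int_width(encoding);

	/* if the top bit of the stored width is set, fill the upper bits */
	if (bits < 64 && (value >> (bits - 1)) != 0)
		value |= ~UINT64_C(0) << bits;

	/* relies on the usual two's complement conversion */
	return (int64_t)value;
}
```

With the 32 bit encoding, the bytes 0x4a 0xfd 0x3f 0x09 yield 1258110729.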

Package Attributes

The package attributes section contains a list of attribute trees, just like the TOC section. Since the purpose of the section is to store metadata of the package as a whole, it will be relatively small and less repetitive (no or only short item lists). Therefore this section does not have attribute types and strings subsections. It directly stores a list of self-contained attribute entries terminated by a 0 byte. An entry has the following format:

Attribute data type
A uint8 specifying the data type of the attribute value.
Has children
A uint8: non-zero, if the attribute has children, 0 otherwise.
Attribute name
A null-terminated UTF-8 string.
Attribute value
The value of the attribute encoded as described in the Main TOC section.
Attribute child list
Only if this attribute is marked to have children: A list of attribute entries terminated by a 0 byte.

Section Compression

The TOC section and the package attributes section can each be compressed. The compression algorithm is specified by the toc_compression and attributes_compression header fields, respectively. The following values are defined:

0  B_HPKG_COMPRESSION_NONE  no compression
1  B_HPKG_COMPRESSION_ZLIB  zlib (LZ77) compression

The Archive Format

This section specifies how file system objects (files, directories, symlinks) are stored in an HPKG file. It builds on top of the container format, defining the types of attributes, their order, and allowed values.

E.g. a "bin" directory, containing a symlink and a file:

bin           0  2009-11-13 12:12:09  drwxr-xr-x
  awk         0  2009-11-13 12:11:16  lrwxrwxrwx  -> gawk
  gawk   301699  2009-11-13 12:11:16  -rwxr-xr-x

could be represented by this attribute tree:

  • "dir:entry" : string : "bin"
    • "file:type" : uint : 1 (0x1)
    • "file:mtime" : uint : 1258110729 (0x4afd3f09)
    • "dir:entry" : string : "awk"
      • "file:type" : uint : 2 (0x2)
      • "file:mtime" : uint : 1258110676 (0x4afd3ed4)
      • "symlink:path" : string : "gawk"
    • "dir:entry" : string : "gawk"
      • "file:permissions" : uint : 493 (0x1ed)
      • "file:mtime" : uint : 1258110676 (0x4afd3ed4)
      • "data" : raw : size: 301699, offset: 0
      • "file:attribute" : string : "BEOS:APP_VERSION"
        • "file:attribute:type" : uint : 1095782486 (0x41505056)
        • "data" : raw : size: 680, offset: 301699
      • "file:attribute" : string : "BEOS:TYPE"
        • "file:attribute:type" : uint : 1296649555 (0x4d494d53)
        • "data" : raw : size: 35, offset: 302379

Attribute Types

The following attribute types are specified by the archive format. Any other attributes will be ignored.

B_HPKG_ATTRIBUTE_NAME_DIRECTORY_ENTRY ("dir:entry")

  • Type: string
  • Value: File name of the entry.
  • Allowed Values: Any valid file (not path!) name, save "." and "..".
  • Child Attributes:
    • B_HPKG_ATTRIBUTE_NAME_FILE_TYPE: The file type of the entry.
    • B_HPKG_ATTRIBUTE_NAME_FILE_PERMISSIONS: The file permissions of the entry.
    • B_HPKG_ATTRIBUTE_NAME_FILE_USER: The owning user of the entry.
    • B_HPKG_ATTRIBUTE_NAME_FILE_GROUP: The owning group of the entry.
    • B_HPKG_ATTRIBUTE_NAME_FILE_ATIME[_NANOS]: The entry's file access time.
    • B_HPKG_ATTRIBUTE_NAME_FILE_MTIME[_NANOS]: The entry's file modification time.
    • B_HPKG_ATTRIBUTE_NAME_FILE_CRTIME[_NANOS]: The entry's file creation time.
    • B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE: An extended file attribute associated with the entry.
    • B_HPKG_ATTRIBUTE_NAME_DATA: Only if the entry is a file: The file data.
    • B_HPKG_ATTRIBUTE_NAME_SYMLINK_PATH: Only if the entry is a symlink: The path the symlink points to.
    • B_HPKG_ATTRIBUTE_NAME_DIRECTORY_ENTRY: Only if the entry is a directory: A child entry in that directory.

B_HPKG_ATTRIBUTE_NAME_FILE_TYPE ("file:type")

  • Type: uint
  • Value: Type of the entry.
  • Allowed Values:
    0  B_HPKG_FILE_TYPE_FILE       file
    1  B_HPKG_FILE_TYPE_DIRECTORY  directory
    2  B_HPKG_FILE_TYPE_SYMLINK    symlink
  • Default Value: B_HPKG_FILE_TYPE_FILE
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_PERMISSIONS ("file:permissions")

  • Type: uint
  • Value: File permissions.
  • Allowed Values: Any valid permission mask.
  • Default Value:
    • For files: 0644 (octal).
    • For directories: 0755 (octal).
    • For symlinks: 0777 (octal).
  • Child Attributes: none
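
The defaults for the file type and the permissions could be applied as in this small sketch. The enum merely mirrors the B_HPKG_FILE_TYPE_* values listed above under local names, and hpkg_effective_permissions is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the B_HPKG_FILE_TYPE_* values. */
enum hpkg_file_type {
	HPKG_FILE_TYPE_FILE = 0,
	HPKG_FILE_TYPE_DIRECTORY = 1,
	HPKG_FILE_TYPE_SYMLINK = 2
};

/* Returns the permissions to use for an entry: the stored value if a
 * "file:permissions" attribute was present, the default for the entry's
 * file type otherwise. */
uint32_t hpkg_effective_permissions(unsigned file_type, int has_permissions,
	uint32_t permissions)
{
	assert(file_type <= HPKG_FILE_TYPE_SYMLINK);

	if (has_permissions)
		return permissions;

	switch (file_type) {
		case HPKG_FILE_TYPE_DIRECTORY:
			return 0755;
		case HPKG_FILE_TYPE_SYMLINK:
			return 0777;
		default:	/* HPKG_FILE_TYPE_FILE */
			return 0644;
	}
}
```

An entry without a "file:type" attribute would be passed in as HPKG_FILE_TYPE_FILE, since that is the default type.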

B_HPKG_ATTRIBUTE_NAME_FILE_USER ("file:user")

  • Type: string
  • Value: Name of the user owning the file.
  • Allowed Values: Any non-empty string.
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_GROUP ("file:group")

  • Type: string
  • Value: Name of the group owning the file.
  • Allowed Values: Any non-empty string.
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_ATIME ("file:atime")

  • Type: uint
  • Value: File access time (seconds since the Epoch).
  • Allowed Values: Any value.
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_ATIME_NANOS ("file:atime:nanos")

  • Type: uint
  • Value: The nanoseconds fraction of the file access time.
  • Allowed Values: Any value in [0, 999999999].
  • Default Value: 0
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_MTIME ("file:mtime")

  • Type: uint
  • Value: File modification time (seconds since the Epoch).
  • Allowed Values: Any value.
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_MTIME_NANOS ("file:mtime:nanos")

  • Type: uint
  • Value: The nanoseconds fraction of the file modification time.
  • Allowed Values: Any value in [0, 999999999].
  • Default Value: 0
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_CRTIME ("file:crtime")

  • Type: uint
  • Value: File creation time (seconds since the Epoch).
  • Allowed Values: Any value.
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_CRTIME_NANOS ("file:crtime:nanos")

  • Type: uint
  • Value: The nanoseconds fraction of the file creation time.
  • Allowed Values: Any value in [0, 999999999].
  • Default Value: 0
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE ("file:attribute")

  • Type: string
  • Value: Name of the extended file attribute.
  • Allowed Values: Any valid attribute name.
  • Child Attributes:
    • B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE_TYPE: The type of the file attribute.
    • B_HPKG_ATTRIBUTE_NAME_DATA: The file attribute data.

B_HPKG_ATTRIBUTE_NAME_FILE_ATTRIBUTE_TYPE ("file:attribute:type")

  • Type: uint
  • Value: Type of the file attribute.
  • Allowed Values: Any value in [0, 0xffffffff].
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_DATA ("data")

  • Type: raw
  • Value: Raw data of a file or attribute.
  • Allowed Values: Any value, if uncompressed, otherwise see below.
  • Child Attributes:
    • B_HPKG_ATTRIBUTE_NAME_DATA_COMPRESSION: The compression algorithm used for storing the data.
    • B_HPKG_ATTRIBUTE_NAME_DATA_SIZE: The size of the uncompressed data.
    • B_HPKG_ATTRIBUTE_NAME_DATA_CHUNK_SIZE: The size of an uncompressed data chunk.

B_HPKG_ATTRIBUTE_NAME_DATA_COMPRESSION ("data:compression")

  • Type: uint
  • Value: ID of the data compression algorithm.
  • Allowed Values:
    0  B_HPKG_COMPRESSION_NONE  no compression
    1  B_HPKG_COMPRESSION_ZLIB  zlib (LZ77) compression
  • Default Value: B_HPKG_COMPRESSION_NONE
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_DATA_SIZE ("data:size")

  • Type: uint
  • Value: Size of the uncompressed data.
  • Allowed Values: Any value.
  • Default Value: Size of the compressed data.
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_DATA_CHUNK_SIZE ("data:chunk_size")

  • Type: uint
  • Value: Size of an uncompressed data chunk.
  • Allowed Values: Any value.
  • Default Value:
    • If not compressed: 0
    • If B_HPKG_COMPRESSION_ZLIB compressed: 64 * 1024
  • Child Attributes: none

B_HPKG_ATTRIBUTE_NAME_SYMLINK_PATH ("symlink:path")

  • Type: string
  • Value: The path the symlink refers to.
  • Allowed Values: Any valid symlink path.
  • Default Value: Empty string.
  • Child Attributes: none

TOC Attributes

The TOC can directly contain any number of attributes of the B_HPKG_ATTRIBUTE_NAME_DIRECTORY_ENTRY type, which in turn contain descendant attributes as specified in the previous section. Any other attributes are ignored.

Data Compression

Data referred to by a B_HPKG_ATTRIBUTE_NAME_DATA attribute will be the raw data, if uncompressed. If compressed, the data have a special format that allows for fast random access.

B_HPKG_COMPRESSION_ZLIB

The original data are split into equally sized chunks, each of which is compressed individually. The compressed data chunks are stored (in order) without padding, preceded by a uint64 array specifying the relative offsets of the compressed data of each chunk. The offsets are relative to the first byte following the offset array. Since the first chunk is always at offset 0, its array element is omitted. Therefore uncompressed data split into n chunks will have n - 1 offset array elements.

The size of each stored chunk is implied by the difference between consecutive offsets (for the last chunk, by the difference to the total size). A chunk stored in compressed form is always shorter than the uncompressed chunk. If the chunk data could not be compressed, the data are stored uncompressed and the size of the stored chunk will be equal to the uncompressed chunk size.
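
The random access this layout permits can be sketched as follows. The sketch rests on several assumptions: the offset array is stored big endian, the "total size" is the size of the stored data including the offset array, and the last chunk may be shorter than the chunk size when the data size is not a multiple of it. The names hpkg_chunk and hpkg_locate_chunk are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Location of one chunk within the stored data (hypothetical helper type). */
struct hpkg_chunk {
	uint64_t offset;		/* from the start of the stored data */
	uint64_t stored_size;
	uint64_t uncompressed_size;
	int is_compressed;		/* 0: the chunk is stored as is */
};

static uint64_t read_be64(const uint8_t *p)
{
	uint64_t value = 0;
	for (int i = 0; i < 8; i++)
		value = (value << 8) | p[i];
	return value;
}

/* data points to the stored data (offset array first); only the offset
 * array is accessed. stored_total is the stored size, uncompressed_total
 * the "data:size" value, chunk_size the "data:chunk_size" value.
 * Returns 0 on success, -1 on invalid arguments or inconsistent offsets. */
int hpkg_locate_chunk(const uint8_t *data, uint64_t stored_total,
	uint64_t uncompressed_total, uint64_t chunk_size, uint64_t index,
	struct hpkg_chunk *out)
{
	assert(data != NULL && out != NULL);
	if (chunk_size == 0 || uncompressed_total == 0)
		return -1;

	uint64_t count = (uncompressed_total + chunk_size - 1) / chunk_size;
	uint64_t table_size = (count - 1) * 8;	/* n - 1 uint64 elements */
	if (index >= count || table_size > stored_total)
		return -1;

	/* offsets are relative to the first byte after the offset array */
	uint64_t chunks_end = stored_total - table_size;
	int is_last = index + 1 == count;
	uint64_t start = index == 0
		? 0 : read_be64(data + (size_t)(index - 1) * 8);
	uint64_t end = is_last
		? chunks_end : read_be64(data + (size_t)index * 8);
	if (start > end || end > chunks_end)
		return -1;

	out->offset = table_size + start;
	out->stored_size = end - start;
	out->uncompressed_size = is_last
		? uncompressed_total - index * chunk_size : chunk_size;
	/* a stored size equal to the uncompressed size means "not compressed" */
	out->is_compressed = out->stored_size < out->uncompressed_size;
	return 0;
}
```

A reader would then either copy the chunk verbatim or pass it to a zlib decompression routine, depending on is_compressed.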

The Package Format

TODO...