Metakit database. Specification of the format used to serialize the in-memory representation to a file.
The foundation for this description can be found at http://www.equi4.com/metakit/format.html
See also the tcl based metakit reader at http://www.equi4.com/pub/sk/readkit.tcl
All data is stored using the endianness of the platform creating the database file. Platforms reading a file written with an endianness different from their own will convert automatically, on the fly.
See also the questions at the end.
Generic format
- Header
- Scattered meta-data and content
Header
- Magic number, 2 bytes, also used to determine the endianness of the file: 'J', 'L' (little endian), or 'L', 'J' (big endian).
- Reserved area, 2 bytes. The first byte is 0x1a for a valid header. The second byte is 0x80 for a valid old header. What counts as "old" is not known.
- Pointer to structural information, 4 bytes, an offset counted from the beginning of the database itself, i.e. the first byte of the header. We cannot say counted from the beginning of the file as the database may be embedded in a larger file, usually attached.
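The header layout above can be sketched in Python as follows. This is a minimal sketch, not taken from the Metakit sources: the function name `parse_header` and the returned dictionary keys are invented for illustration, and the mapping of the magic bytes to a byte order simply follows the description above without having been checked against real files.

```python
import struct

def parse_header(data: bytes, base: int = 0) -> dict:
    """Parse the 8-byte header located at offset `base` (start of the database)."""
    header = data[base:base + 8]
    if len(header) != 8:
        raise ValueError("truncated header")
    magic, reserved, raw_pointer = header[0:2], header[2:4], header[4:8]
    # Byte order as stated in the description above (an assumption, not verified).
    if magic == b"JL":
        order = "<"   # little endian
    elif magic == b"LJ":
        order = ">"   # big endian
    else:
        raise ValueError("not a Metakit header: bad magic number")
    if reserved[0] != 0x1A:
        raise ValueError("not a valid header: first reserved byte is not 0x1a")
    (offset,) = struct.unpack(order + "I", raw_pointer)
    return {
        "byte_order": "little" if order == "<" else "big",
        "old_header": reserved[1] == 0x80,
        # Offset counted from the beginning of the database ...
        "structure_offset": offset,
        # ... and the same location counted from the beginning of the file.
        "structure_position": base + offset,
    }
```

The `base` parameter is there because the database may be embedded in a larger file, so the header is not necessarily at file offset 0.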
Structural information
- Format code. Size unknown.
- Structure definition string. Defines the structure of all views contained in the file.
- Per view information about its size, as the number of rows contained in it, followed by the location and size of the data for all its columns. The location of a column is specified as offset relative to the beginning of the database. The size is length of the data area for the column, in bytes. Each integer value is stored using the Octet-packed Integer format described later. For columns which are internally actually represented as two physical columns (i.e. the binary types) the system will store two location/size pairs.
Structure definition string
- The character set underlying this string is unicode, the character encoding is UTF-8.
- Contains the description of all views.
- The descriptions of the individual views are separated by commas (,).
- A description for a view is of the format "viewName [ column-descriptions ]".
- The column descriptions are separated by commas (,).
- A column description can either be of the simple format "columnName : columnType", or a full-fledged view description. The name of such a sub-view is then the name of the column.
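To make the grammar above concrete, here is a small recursive-descent parser in Python. It is only a sketch of the rules as listed here: the names `tokenize`, `parse_fields` and `parse_structure` are invented, and the example string in the usage note is made up rather than taken from a real database.

```python
import re

# A token is one of the punctuation characters [ ] , : or a run of other characters.
_TOKEN = re.compile(r"\s*([\[\],:]|[^\[\],:\s]+)")

def tokenize(text: str) -> list:
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_fields(tokens: list, i: int):
    """Parse a comma-separated list of descriptions; return (fields, next index)."""
    fields = []
    while i < len(tokens) and tokens[i] != "]":
        name = tokens[i]
        i += 1
        if name in ("[", ",", ":"):
            raise ValueError("expected a name, got %r" % name)
        if i + 1 < len(tokens) and tokens[i] == ":":
            # Simple column: "columnName : columnType"
            fields.append((name, tokens[i + 1]))
            i += 2
        elif i < len(tokens) and tokens[i] == "[":
            # View or sub-view: "name [ column-descriptions ]"
            sub_fields, i = parse_fields(tokens, i + 1)
            if i >= len(tokens) or tokens[i] != "]":
                raise ValueError("missing ] after %r" % name)
            fields.append((name, sub_fields))
            i += 1
        else:
            raise ValueError("expected : or [ after %r" % name)
        if i < len(tokens) and tokens[i] == ",":
            i += 1
    return fields, i

def parse_structure(text: str) -> list:
    tokens = tokenize(text)
    fields, i = parse_fields(tokens, 0)
    if i != len(tokens):
        raise ValueError("unbalanced ]")
    return fields
```

For example, `parse_structure("people[name:S,age:I,phones[number:S]]")` yields one top-level view named `people` whose third column is itself a list of columns, mirroring the sub-view rule.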
Column data types
The first six datatypes are only for columns which are not subviews. In a column description (see above) they are all represented by a single character. See the description of the column formats below to find the character representing the type.
- String - S
- Unicode, UTF-8 encoded. Column data are the concatenated strings, with the terminating zero bytes separating each item.
- Integer - I
- Adaptive Integer. Uses 1/2/4/8/16/32 bits per value, depending on the largest absolute value used in the column.
- Floats - F
- 32-bit floating point - 4 bytes per entry.
- Double - D
- 64-bit floating point - 8 bytes per entry.
- Binary - B
- Untyped binary data - Stored in two columns. The first column contains the actual data, without separators. The second column stores the sizes of the individual entries, as adaptive integers.
- Memo - M
- Like binary, except that the data column actually contains offsets into the database to the locations where the data is stored. This makes for better usage of the available space. When adding data to a memo view existing data doesn't have to be moved, and resizing happens only for the comparatively small table of offsets and not for a large blob of binary data. Even better when entries are deleted or set to NULL. The offsets are relative to the beginning of the database. The offsets are stored as adaptive integers.
- Subview
- Each entry in a subview column contains the location and size for each column in the subview, stored as adaptive integers. This is similar to memos, except that the referred areas in the file are not unstructured blobs. Another difference is that the location/size data is stored in a single column, and not separated into two. The latter would be difficult to handle, given that a subview may contain more than one column.
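As an illustration of the string and binary layouts described in the list above, the following Python sketch splits raw column data into individual entries. The function names are invented, and the sizes passed to `split_binary_column` are assumed to have been decoded from the size column already; decoding adaptive integers is not shown because the bit-level layout is not specified here.

```python
def split_string_column(data: bytes) -> list:
    """Split a string column: concatenated UTF-8 strings, each ending in a zero byte."""
    pieces = data.split(b"\x00")
    # The terminator of the last string leaves one empty piece at the end.
    if pieces and pieces[-1] == b"":
        pieces.pop()
    return [piece.decode("utf-8") for piece in pieces]

def split_binary_column(data: bytes, sizes: list) -> list:
    """Split a binary data column using the entry sizes from the second column."""
    entries, pos = [], 0
    for size in sizes:
        entries.append(data[pos:pos + size])
        pos += size
    if pos != len(data):
        raise ValueError("sizes do not add up to the length of the data column")
    return entries
```

Note that a string column needs no separate size column because the zero bytes delimit the entries, whereas binary data may itself contain zero bytes and therefore relies on explicit sizes.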
Octet-packed Integer
Stored as octets with the most significant bits first. Each octet stores 7 bits of the original integer value in its bits 0 to 6. Bit 7 is a flag and is set for the last octet in the value. For all other octets bit 7 is not set. If the first octet has the value 0x00 the stored number is negative and the 1's complement of the number in the following octets has to be taken to find the actual stored value.
Note: The original description uses the term byte to describe the format. As, according to escargo, bytes are not always 8 bits, we now use the more accurate term octet instead. There's also Tcl code to play with octet-packed integers.
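The rules above translate fairly directly into code. The following Python sketch follows the description in this section only; the names `encode_packed` and `decode_packed` are invented, and the handling of negative numbers is my reading of the "1's complement" rule rather than something checked against Metakit itself.

```python
def encode_packed(value: int) -> bytes:
    """Encode an integer as an octet-packed integer."""
    negative = value < 0
    if negative:
        value = ~value            # 1's complement, now non-negative
    groups = [value & 0x7F]       # collect 7-bit groups, least significant first
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()              # most significant bits are stored first
    groups[-1] |= 0x80            # bit 7 marks the last octet
    return (b"\x00" if negative else b"") + bytes(groups)

def decode_packed(data: bytes, pos: int = 0):
    """Decode one octet-packed integer at `pos`; return (value, next position)."""
    negative = data[pos] == 0x00  # a leading zero octet marks a negative number
    if negative:
        pos += 1
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:          # flag set: this was the last octet
            break
    return (~value if negative else value), pos
```

Returning the next position makes it easy to read a sequence of such integers back to back, as in the per view information. Truncated input raises an `IndexError` in this sketch.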
Free space
There is no explicit handling of unused space in the file. All areas which are not reached through the chain of file offsets beginning in the header are considered unused, and can be reused. The in-memory management data structures will most certainly maintain a free-space pool so that such areas can be reused quickly. This information won't be saved, however.
garynkill: I have used Metakit quite frequently in my program, but after I appended an integer property to the view's field names, my program started suffering from segmentation faults. Why is that? I just did this:
mk::view layout db.dbtable1 { table1:I }
Questions and musings
- Metakit databases can be attached to an executable (*pack), or a script (*kit). This means the beginning of the database is not known, only the end. It is not clear from the above how to jump from the end of the file to its header.
- It is not quite clear what is stored as the size of a column. The description above tries to make a reasonable assumption.
- It is unclear for the two binary column types, i.e. the two types stored as two columns, how this affects the offset/size information in the structural information. Does that mean that for a column of type B or M the per view information will contain two offset/size pairs? Or only one offset/size pair, with the referenced location somehow split? I guess that it is the former, and described it that way above.
- For integer columns it is not clear if each entry is stored as adaptive integer, or if all entries are of the same size, the size chosen according to the largest absolute value. In case of the latter it is not clear where the information about the size of entries is stored, so that we can separate the values from each other at the bit level.
- Offsets are always described as relative to the beginning of the database, i.e. the first byte of the header, which is at offset 0. This is actually not explicitly stated in the original description. So they could be relative to the location of where the offset is stored. I consider this unlikely, still, the original description does not rule this out.
- The contents of a column for a subview are not described at all in the original. My description above is a reasonable guess.
- Given the format above it could make sense to represent a string column as two columns. One containing the string data (= the current column), the other containing byte-length and character length of each string. That would make it easier to jump to a particular row/entry in the column. Of course, maybe Metakit creates this information internally after reading a file, and is just not writing it. This is redundant information after all.
MR: After spending some time in the hex editor today repairing a database, I can shed some light on how Metakit knows where the start of the database is. When opening a file, Metakit will look at the last sixteen bytes of the file, which are composed of four 32-bit numbers, the second of which is the size of the database portion of the file, and the last of which is the offset into the database file containing the top level structure information.
As an example, I have a datafile of length 13782175. The last sixteen bytes in hex are: 80000000 000148e5 80000369 000013fd. The length then is 000148e5, or 84197 decimal. I can then calculate the start location as filesize - length - 16, or 13782175 - 84197 - 16 = 13697962. The extra "-16" there is for the 16 byte footer at the end of the file. The first byte of the Metakit structure information would be located at offset 000013fd, or 5117 decimal. In other words, it is at 13697962 + 5117 = 13703079. At that location, I find three bytes (presumably the 'format code') which are 80 06 b1, followed by the structure definition string.
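The arithmetic in this example can be written down as a short Python sketch. The function name `locate_database` is invented, and reading the footer as big-endian numbers is an assumption based on the hex dump above, where the bytes read directly as the quoted values; the meaning of the first and third numbers is not known here, so they are ignored.

```python
import struct

FOOTER_SIZE = 16

def locate_database(footer: bytes, file_size: int):
    """Return (database start, structure position) as absolute file offsets."""
    if len(footer) != FOOTER_SIZE:
        raise ValueError("the footer must be exactly 16 bytes")
    # Four 32-bit numbers, assumed big endian; only the 2nd and 4th are used.
    _, db_length, _, structure_offset = struct.unpack(">4I", footer)
    db_start = file_size - db_length - FOOTER_SIZE
    return db_start, db_start + structure_offset

# The footer and file size quoted in the example above.
footer = bytes.fromhex("80000000" "000148e5" "80000369" "000013fd")
print(locate_database(footer, 13782175))   # (13697962, 13703079)
```

With a real file, the footer would be obtained by seeking to 16 bytes before the end and reading the remainder.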