Category: MySQLDevelopment

MySQL Internals MyISAM

← Back to MySQL Internals overview page

Contents

[edit] MyISAM Storage Engine

[edit] MyISAM Record Structure

[edit] Introduction

When you say:

CREATE TABLE Table1 ...

MySQL creates files named Table1.MYD ("MySQL Data"), Table1.MYI ("MySQL Index"), and Table1.frm ("Format"). These files will be in the directory:

/<datadir>/<database>/

For example, if you use Linux, you might find the files in the /usr/local/var/test directory (assuming your database name is test). if you use Windows, you might find the files in the \mysql\data\test\ directory.

Let's look at the .MYD Data file (MyISAM SQL Data file) more closely. There are three possible formats fixed, dynamic, and packed. First, let's discuss the fixed format.

Unlike most DBMSs, MySQL doesn't store on disk using pages. Therefore you will not see filler space between rows. (Reminder: This does not refer to BDB and InnoDB tables, which do use pages).

The minimal record header is a set of flags:

The length of the record header is thus:

(1 + number of NULL columns + 7) / 8 bytes

After the header, all columns are stored in the order that they were created, which is the same order that you would get from SHOW COLUMNS.

Here's an example. Suppose you say:

CREATE TABLE Table1 (column1 CHAR(1), column2 CHAR(1), column3 CHAR(1));
INSERT INTO Table1 VALUES ('a', 'b', 'c');
INSERT INTO Table1 VALUES ('d', NULL, 'e');

A CHAR(1) column takes precisely one byte (plus one bit of overhead that is assigned to every column I'll describe the details of column storage later). So the file Table1.MYD looks like this:

Hexadecimal Display of Table1.MYD file

F1 61 62 63 00 F5 64 20 65 00              ... .abc..d e.

Here's how to read this hexadecimal-dump display:

(It's probably easier to understand the flag setting if you restate F5 as 11110101 binary, and (a) notice that the third flag bit from the right is on, and (b) remember that the first flag bit is the X bit.)

There are complications the record header is more complex if there are variable-length fields but the simple display shown in the example is exactly what you'd see if you looked at the MySQL Data file with a debugger or a hexadecimal file dumper.

So much for the fixed format. Now, let's discuss the dynamic format.

The dynamic file format is necessary if rows can vary in size. That will be the case if there are BLOB columns, or "true" VARCHAR columns. (Remember that MySQL may treat VARCHAR columns as if they're CHAR columns, in which case the fixed format is used.) A dynamic row has more fields in the header. The important ones are "the actual length", "the unused length", and "the overflow pointer". The actual length is the total number of bytes in all the columns. The unused length is the total number of bytes between one physical record and the next one. The overflow pointer is the location of the rest of the record if there are multiple parts.

For example, here is a dynamic row:

 
 03                 start of header - Block type, see mi_dynrec.c, _mi_get_block_info()
 04, 00             actual length
 0c                 block length
 01, fc             flags + overflow pointer
 ****               data in the row
 ************       unused bytes
                    <-- next row starts here)
 

In the example, the actual length and the unused length are short (one byte each) because the table definition says that the columns are short if the columns were potentially large, then the actual length and the unused length could be two bytes each, three bytes each, and so on. In this case, actual length plus unused length is 10 hexadecimal (sixteen decimal), which is a minimum.

In a dynamic row, there is no deleted bit. Instead, deleted rows are marked with a block of type 0.

As for the third format packed we will only say briefly that:

For details, see the source files /myisam/mi_statrec.c (for fixed format), /myisam/mi_dynrec.c (for dynamic format), and /myisam/mi_packrec.c (for packed format).

Note: Internally, MySQL uses a format much like the fixed format which it uses for disk storage. The main differences are:

  1. BLOB values have a length and a memory pointer rather than being stored inline.
  2. "True VARCHAR" (a column storage which will be fully implemented in version 5.0) will have a 16-bit length plus the data.
  3. All integer or floating-point numbers are stored with the low byte first. Point (3) does not apply for ISAM storage or internals.

[edit] Physical Attributes of Columns

Next I'll describe the physical attributes of each column in a row. The format depends entirely on the data type and the size of the column, so, for every data type, I'll give a description and an example.

All the types are defined within the include/mysql_com.h file within the enum_field_types enumerated structure. Here's a sample of the key values and corresponding numbers:

 MYSQL_TYPE_BIT                  16
 MYSQL_TYPE_BLOB                 252
 MYSQL_TYPE_DATE                 10
 MYSQL_TYPE_DATETIME             12
 MYSQL_TYPE_DECIMAL              0
 MYSQL_TYPE_DOUBLE               5
 MYSQL_TYPE_ENUM                 247
 MYSQL_TYPE_FLOAT                4
 MYSQL_TYPE_GEOMETRY             255
 MYSQL_TYPE_INT24                9
 MYSQL_TYPE_LONG                 3
 MYSQL_TYPE_LONGLONG             8
 MYSQL_TYPE_LONG_BLOB            251
 MYSQL_TYPE_MEDIUM_BLOB          250
 MYSQL_TYPE_NEWDATE              14
 MYSQL_TYPE_NEWDECIMAL           246
 MYSQL_TYPE_NULL                 6
 MYSQL_TYPE_SET                  248
 MYSQL_TYPE_SHORT                2
 MYSQL_TYPE_STRING               254
 MYSQL_TYPE_TIME                 11
 MYSQL_TYPE_TIMESTAMP            7
 MYSQL_TYPE_TINY                 1
 MYSQL_TYPE_TINY_BLOB            249
 MYSQL_TYPE_VARCHAR              15
 MYSQL_TYPE_VAR_STRING           253
 MYSQL_TYPE_YEAR                 13

CHAR

VARCHAR

Important: MySQL almost always stores multi-byte binary numbers with the low byte first. This is called "little-endian" numeric storage; it's normal on Intel x86 machines; MySQL uses it even for non-Intel machines so that databases will be portable. TINYINT

SMALLINT

MEDIUMINT

INT

BIGINT

FLOAT

DOUBLE PRECISION

REAL

DECIMAL

NUMERIC

BOOL

DATE

DATETIME

TIME

TIMESTAMP

YEAR

SET

ENUM

Warning: Because TINYBLOB's preceding length is one byte long (the size of a TINYINT) and MEDIUMBLOB's preceding length is three bytes long (the size of a MEDIUMINT), it's easy to think there's some sort of correspondence between the BLOB and INT types. There isn't a BLOB's preceding length is not four bytes long (the size of an INT). TINYBLOB

TINYTEXT

BLOB

TEXT

MEDIUMBLOB

MEDIUMTEXT

LONGBLOB

LONGTEXT

[edit] Where to Look For More Information

References:

Most of the formatting work for MyISAM columns is visible in the program /sql/field.cc in the source code directory. And in the MyISAM directory, the files that do formatting work for different record formats are: /myisam/mi_statrec.c, /myisam/mi_dynrec.c, and /myisam/mi_packrec.c.

[edit] The .MYI file

A .MYI file for a MyISAM table contains the table's indexes.

The .MYI file has two parts: the header information and the key values. So the next sub-sections will be "The .MYI Header" and "The .MYI Key Values".

The .MYI Header

A .MYI file begins with a header, with information about options, about file sizes, and about the "keys". In MySQL terminology, a "key" is something that you create with CREATE [UNIQUE] INDEX.

Program files which read and write .MYI headers are in the ./myisam directory: mi_open.c has the routines that write each section of the header, mi_create.c has a routine that calls the mi_open.c routines in order, and myisamdef.h has structure definitions corresponding to what we're about to describe.

These are the main header sections:

Section                       Occurrences
 -------                       -----------
 state                         Occurs 1 time
 base                          Occurs 1 time
 keydef (including keysegs)    Occurs once for each key
 recinfo                       Occurs once for each field
 

Now we will look at each of these sections, showing each field.

We are going to use an example table throughout the description. To make the example table, we executed these statements:

  CREATE TABLE T (S1 CHAR(1), S2 CHAR(2), S3 CHAR(3));
  CREATE UNIQUE INDEX I1 ON T (S1);
  CREATE INDEX I2 ON T (S2,S3);
  INSERT INTO T VALUES ('1', 'aa', 'b');
  INSERT INTO T VALUES ('2', 'aa', 'bb');
  INSERT INTO T VALUES ('3', 'aa', 'bbb');
  DELETE FROM T WHERE S1 = '2';

We took a hexadecimal dump of the resulting file, T.MYI.

In all the individual descriptions below, the column labeled Dump From Example File has the exact bytes that are in T.MYI. You can verify that by executing the same statements and looking at a hexadecimal dump yourself. With Linux this is possible using od -h T.MYI; with Windows you can use the command-line debugger.

Along with the typical value, we may include a comment. The comment usually explains why the value is what it is. Sometimes the comment is derived from the comments in the source code.

state

This section is written by mi_open.c, mi_state_info_write().

Name                         Size Dump From Example File  Comment
 ----                         ---- ----------------------  -------
 
 file_version                  4   FE FE 07 01             from myisam_file_magic
 options                       2   00 02                   HA_OPTION_COMPRESS_RECORD
                                                           etc.
 header_length                 2   01 A2                   this header example has
                                                           0x01A2 bytes
 state_info_length             2   00 B0                   = MI_STATE_INFO_SIZE
                                                           defined in myisamdef.h
 base_info_length              2   00 64                   = MI_BASE_INFO_SIZE
                                                           defined in myisamdef.h
 base_pos                      2   00 D4                   = where the base
                                                           section starts
 key_parts                     2   00 03                   a key part is a column
                                                           within a key
 unique_key_parts              2   00 00                   key-parts+unique-parts
 keys                          1   02                      here are 2 keys --
                                                           I1 and I2
 uniques                       1   00                      number of hash unique
                                                           keys used internally
                                                           in temporary tables
                                                           (nothing to do with
                                                           'UNIQUE' definitions)
 language                      1   08                      "language for indexes"
 max_block_size                1   01
 fulltext_keys                 1   00                      # of fulltext keys.
                                                           = 0 if version <= 4.0
 not_used                      1   00                      to align to 8-byte
                                                           boundary
 
 state->open_count             2   00 01
 state->changed                1   39                      set if table updated;
                                                           reset if shutdown (so
                                                           one can examine this
                                                           to see if there was an
                                                           update without proper
                                                           shutdown)
 state->sortkey                1   FF                      "sorted by this key"
                                                           (not used)
 state->state.records          8   00 00 00 00 00 00 00 02 number of actual,
                                                           un-deleted, records
 state->state.del              8   00 00 00 00 00 00 00 01 # of deleted records
 state->split                  8   00 00 00 00 00 00 00 03 # of "chunks" (e.g.
                                                           records or spaces left
                                                           after record deletion)
 state->dellink                8   00 00 00 00 00 00 00 07 "Link to next removed
                                                           "block". Initially =
                                                           HA_OFFSET_ERROR
 state->state.key_file_length  8   00 00 00 00 00 00 0c 00 2048
 state->state.data_file_length 8   00 00 00 00 00 00 00 15 = size of .MYD file
 state->state.empty            8   00 00 00 00 00 00 00 00
 state->state.key_empty        8   00 00 00 00 00 00 00 00
 state->auto_increment         8   00 00 00 00 00 00 00 00
 state->checksum               8   00 00 00 00 00 00 00 00
 state->process                4   00 00 09 E6             from getpid(). process
                                                           of last update
 state->unique                 4   00 00 00 0B             initially = 0
 state->status                 4   00 00 00 00
 state->update_count           4   00 00 00 04             updated for each write
                                                           lock (there were 3
                                                           inserts + 1 delete,
                                                           total 4 operations)
 state->key_root               8   00 00 00 00 00 00 04 00 offset in file where
                                                           I1 keys start, can be
                                                           = HA_OFFSET_ERROR
                                   00 00 00 00 00 00 08 00 state->key_root occurs
                                                           twice because there
                                                           are two keys
 state->key_del                8   FF FF FF FF FF FF FF FF delete links for keys
                                                           (occurs many times if
                                                           many delete links)
 state->sec_index_changed      4   00 00 00 00             sec_index = secondary
                                                           index (presumably)
                                                           not currently used
 state->sec_index_used         4   00 00 00 00             "which extra indexes
                                                           are in use"
                                                           not currently used
 state->version                4   3F 3F EB F7             "timestamp of create"
 state->key_map                8   00 00 00 03             "what keys are in use"
 state->create_time            8   00 00 00 00 3F 3F EB F7 "time when database
                                                           created" (actually:
                                                           time when file made)
 state->recover_time           8   00 00 00 00 00 00 00 00 "time of last recover"
 state->check_time             8   00 00 00 00 3F 3F EB F7 "time of last check"
 state->rec_per_key_rows       8   00 00 00 00 00 00 00 00
 state->rec_per_key_parts      4   00 00 00 00             (key_parts = 3, so
                                   00 00 00 00              rec_per_key_parts
                                   00 00 00 00              occurs 3 times)
 

base

This section is written by mi_open.c, mi_base_info_write(). The corresponding structure in myisamdef.h is MI_BASE_INFO.

In our example T.MYI file, the first byte of the base section is at offset 0x00d4. That's where it's supposed to be, according to the header field base_pos (above).

Name                         Size Dump From Example File  Comment
 ----                         ---- ----------------------  -------
 
 base->keystart               8    00 00 00 00 00 00 04 00 keys start at offset
                                                           1024 (0x0400)
 base->max_data_file_length   8    00 00 00 00 00 00 00 00
 base->max_key_file_length    8    00 00 00 00 00 00 00 00
 base->records                8    00 00 00 00 00 00 00 00
 base->reloc                  8    00 00 00 00 00 00 00 00
 base->mean_row_length        4    00 00 00 00
 base->reclength              4    00 00 00 07             length(s1)+length(s2)
                                                           +length(s3)=7
 base->pack_reclength         4    00 00 00 07
 base->min_pack_length        4    00 00 00 07
 base->max_pack_length        4    00 00 00 07
 base->min_block_length       4    00 00 00 14
 base->fields                 4    00 00 00 04             4 fields: 3 defined,
                                                           plus 1 extra
 base->pack_fields            4    00 00 00 00
 base->rec_reflength          1    04
 base->key_reflength          1    04
 base->keys                   1    02                      was 0 at start
 base->auto_key               1    00
 base->pack_bits              2    00 00
 base->blobs                  2    00 00
 base->max_key_block_length   2    04 00                   length of block = 1024
                                                           bytes (0x0400)
 base->max_key_length         2    00 10                   including length of
                                                           pointer
 base->extra_alloc_bytes      2    00 00
 base->extra_alloc_procent    1    00
 base->raid_type              1    00
 base->raid_chunks            2    00 00
 base->raid_chunksize         4    00 00 00 00
 [extra] i.e. filler          6    00 00 00 00 00 00
 

keydef

This section is written by mi_open.c, mi_keydef_write(). The corresponding structure in myisamdef.h is MI_KEYDEF.

This is a multiple-occurrence structure. Since there are two indexes in our example (I1 and I2), we will see that keydef occurs two times below. There is a subordinate structure, keyseg, which also occurs multiple times (once within the keydef for I1 and two times within the keydef for I2).

Name                         Size Dump From Example File  Comment
 ----                         ---- ----------------------  -------
 
 /* key definition for I1 */
 
 keydef->keysegs              1    01                      there is 1 keyseg (for
                                                           column S1).
 keydef->key_alg              1    01                      algorithm = Rtree or
                                                           Btree
 keydef->flag                 2    00 49                   HA_NOSAME +
                                                           HA_SPACE_PACK_USED +
                                                           HA_NULL_PART_KEY
 keydef->block_length         2    04 00                   i.e. 1024
 key def->keylength           2    00 06                   field-count+sizeof(S1)
                                                           sizeof(ROWID)
 keydef->minlength            2    00 06
 keydef->maxlength            2    00 06
   /* keyseg for S1 in I1 */
   keyseg->type               1    01                      /* I1(S1) size(S1)=1,
                                                              column = 1 */
                                                           = HA_KEYTYPE_TEXT
   keyseg->language           1    08
   keyseg->null_bit           1    02
   keyseg->bit_start          1    00
   keyseg->bit_end            1    00
   [0] i.e. filler            1    00
   keyseg->flag               2    00 14                   HA_NULL_PART +
                                                           HA_PART_KEY
   keyseg->length             2    00 01                   length(S1) = 1
   keyseg->start              4    00 00 00 01             offset in the row
   keyseg->null_pos           4    00 00 00 00
 
 /* key definition for I2 */
 
 keydef->keysegs              1    02                      keysegs=2, for columns
                                                           S2 and S3
 keydef->key_alg              1    01                      algorithm = Rtree or
                                                           Btree
 keydef->flag                 2    00 48                   HA_SPACE_PACK_USED +
                                                           HA_NULL_PART_KEY
 keydef->block_length         2    04 00                   i.e. 1024
 key def->keylength           2    00 0B                   field-count+ sizeof(all fields)+
                                                             sizeof(RID)
 keydef->minlength            2    00 0B
 keydef->maxlength            2    00 0B
   /* keyseg for S2 in I2 */
   keyseg->type               1    01                      /* I2(S2) size(S2)=2,
                                                              column = 2 */
   keyseg->language           1    08
   keyseg->null_bit           1    04
   keyseg->bit_start          1    00
   keyseg->bit_end            1    00
   [0] i.e. filler            1    00
   keyseg->flag               2    00 14                   HA_NULL_PART +
                                                           HA_PART_KEY
   keyseg->length             2    00 02                   length(S2) = 2
   keyseg->start              4    00 00 00 02
   keyseg->null_pos           4    00 00 00 00
   /* keyseg for S3 in I2 */
   keyseg->type               1    01                      /* I2(S3) size(S3)=3,
                                                              column = 3 */
   keyseg->language           1    08
   keyseg->null_bit           1    08
   keyseg->bit_start          1    00
   keyseg->bit_end            1    00
   [0] i.e. filler            1    00
   keyseg->flag               2    00 14                   HA_NULL_PART +
                                                           HA_PART_KEY
   keyseg->length             2    00 03                   length(S3) = 3
   keyseg->start              4    00 00 00 04
   keyseg->null_pos           4    00 00 00 00
 

recinfo

The recinfo section is written by mi_open.c, mi_recinfo_write(). The corresponding structure in myisamdef.h is MI_COLUMNDEF.

This is another multiple-occurrence structure. It appears once for each field that appears in a key, including an extra field that appears at the start and has flags (for deletion and for null fields).

Name                         Size Dump From Example File  Comment
 ----                         ---- ----------------------  -------
 
 recinfo->type                2    00 00                   extra
 recinfo->length              2    00 01
 recinfo->null_bit            1    00
 recinfo->null_pos            2    00 00
 
 recinfo->type                2    00 00                   I1 (S1)
 recinfo->length              2    00 01
 recinfo->null_bit            1    02
 recinfo->null_pos            2    00 00
 
 recinfo->type                2    00 00                   I2 (S2)
 recinfo->length              2    00 02
 recinfo->null_bit            1    04
 recinfo->null_pos            2    00 00
 
 recinfo->type                2    00 00                   I2 (S3)
 recinfo->length              2    00 03
 recinfo->null_bit            1    08
 recinfo->null_pos            2    00 00
 

We are now at offset 0xA2 within the file T.MYI. Notice that the value of the third field in the header, header_length, is 0xA2. Anything following this point, up till the first key value, is filler.

The .MYI Key Values

And now we look at the part which is not the information header: we look at the key values. The key values are in blocks (MySQL's term for pages). A block contains values from only one index. To continue our example: there is a block for the I1 key values, and a block for the I2 key values.

According to the header information (state->key_root above), the I1 block starts at offset 0x0400 in the file, and the I2 block starts at offset 0x0800 in the file.

At offset 0x0400 in the file, we have this:

Name                         Size Dump From Example File  Comment
 ----                         ---- ----------------------  -------
 
 (block header)               2    00 0E                   = size (inclusive)
                                                           (first bit of word =
                                                           0 meaning this is a
                                                           B-Tree leaf, see the
                                                           mi_test_if_nod macro)
 (first key value)            2    01 31                   Value is "1" (0x31).
 (first key pointer)          2-8  00 00 00 00             Pointer is to Record
                                                           #0000. pointer length
                                                           is = base->rec_reflength
 (second key value)           2    01 33                   Value is "3" (0x33).
 (second key pointer)         2-8  00 00 00 02             Pointer is to Record
                                                           #0002.
 (junk)                       1010 .. .. .. .. .. .. ..    rest of the 1024-byte
                                                           block is unused
 

At offset 0x0800 in the file, we have this:

Name                         Size Dump From Example File  Comment
 ----                         ---- ----------------------  -------
 
 (block header)               2    00 18                   = size (inclusive)
 (first key value)            7    01 61 61 01 62 20 20    Value is "aa/b  "
 (first key pointer)          2-8  00 00 00 00             Pointer is to Record
                                                           #0000.
 (second key value)           7    01 61 61 01 62 62 62    Value is "aa/bbb"
 (second key pointer)         2-8  00 00 00 02             Pointer is to Record
                                                           #0002.
 (junk)                       1000 .. .. .. .. .. .. ..    rest of the 1024-byte
                                                           block is unused
 

From the above illustrations, these facts should be clear:

These facts are not illustrated, but are also clear:

[edit] MyISAM Files

Some notes about MyISAM file handling:

[edit] MyISAM Dynamic Data File Layout

Variable length records are contained in "frames". A record can be put in one or more frames, also called the record "parts" or "blocks". The sense of the frames is to allow reusage of the space of deleted records. Starting with an empty data file, records are put in a single frame each, unless a record is bigger than the maximum frame size (16MB - 4). When a record is deleted, its frame is marked deleted. When a record is inserted after this, it reuses the old frame. If the new record is smaller, the frame is split. The unused part is marked deleted. If a new record is bigger than the old frame, the frame is filled with the record as much as fits. The rest is inserted in other old frames, or, if non is available, in a new frame at end of file.

[edit] Layout of the Record Storage Frame (Record Part, Record Block)

 MI_MIN_BLOCK_LENGTH   20            /* 20 bytes are required for the biggest frame type: Deleted block. */
 MI_MAX_BLOCK_LENGTH   16777212      /* 16MB - 4, max 3 bytes for length available, 4-byte aligned. */
 MI_DYN_ALIGN_SIZE     4             /* Frames start a 4-byte boundaries. */
 Part header[0] (decimal/hexadecimal, one byte):
   0/00: Deleted block
         block_len 3 bytes [1-3]
         next_filepos 8 bytes [4-11]
         prev_filepos 8 bytes [12-19]
         => header length 20
   1/01: Full small record, full block
         rec_len,data_len,block_len 2 bytes [1-2]
         => header length 3
   2/02: Full big record, full block
         rec_len,data_len,block_len 3 bytes [1-3]
         => header length 4
   3/03: Full small record, unused space
         rec_len,data_len 2 bytes [1-2]
         unused_len 1 byte [3]
         => header length 4
   4/04: Full big record, unused space
         rec_len,data_len 3 bytes [1-3]
         unused_len 1 byte [4]
         => header length 5
   5/05: Start small record
         rec_len 2 bytes [1-2]
         data_len,block_len 2 bytes [3-4]
         next_filepos 8 bytes [5-12]
         => header length 13
   6/06: Start big record
         rec_len 3 bytes [1-3]
         data_len,block_len 3 bytes [4-6]
         next_filepos 8 bytes [7-14]
         => header length 15
   7/07: End small record, full block
         data_len,block_len 2 bytes [1-2]
         => header length 3
   8/08: End big record, full block
         data_len,block_len 3 bytes [1-3]
         => header length 4
   9/09: End small record, unused space
         data_len 2 bytes [1-2]
         unused_len 1 byte [3]
         => header length 4
  10/0A: End big record, unused space
         data_len 3 bytes [1-3]
         unused_len 1 byte [4]
         => header length 5
  11/0B: Continue small record
         data_len,block_len 2 bytes [1-2]
         next_filepos 8 bytes [3-10]
         => header length 11
  12/0C: Continue big record
         data_len,block_len 3 bytes [1-3]
         next_filepos 8 bytes [4-11]
         => header length 12
  13/0D: Start giant record
         rec_len 4 bytes [1-4]
         data_len,block_len 3 bytes [5-7]
         next_filepos 8 bytes [8-15]
         => header length 16


block_len does not include the header length except of deleted blocks. In deleted blocks block_len includes the header length.

data_len is the number of bytes within this block that are part of the record.

rec_len is the total record length including the data lengths of all belonging blocks.

In deleted blocks next_filepos and prev_filepos make a doubly linked list over all deleted blocks. At list start, prev_filepos is HA_OFFSET_ERROR (all bits set). At list end, next_filepos is HA_OFFSET_ERROR (all bits set).

In non-deleted blocks next_filepos points to the next part of the record.

The read function for the block header of dynamic records is mi_dynrec.c:_mi_get_block_info().

[edit] Record contents

[edit] Packed record layout

 pack bits (!= NULL bits!): One bit per packable column:
              FIELD_BLOB: Set if blob is empty. No blob data in record.
              FIELD_SKIP_ZERO: Set if all bytes zero, no data in record.
              FIELD_SKIP_ENDSPACE,
              FIELD_SKIP_PRESPACE: Set if some blanks stripped in record.
              => The "pack bits" are rounded up to the next byte boundary.
 Record contents from the in-memory representation. Each field is copied verbatim unless packed according to the "pack bits" paragraph.

[edit] In-memory record layout

 null bits:   One bit per column that can be NULL.
              The "null bits" are rounded up to the next byte boundary.
              the number of "null bytes" are also referred as data_offset.
 fields:      One field per column. No separators, length fields, etc.
              Length (pack_length) and layout of the fields depend on the
              field type:
              -- to be added --

Note: The "in-memory record layout" is used by the SQL layer. It is independent from storage engines. All storage engines have to accept and produce this in-memory format at the handler interface.

[edit] MyISAM Compressed Data File Layout

This chapter describes the layout for the data file of compressed MyISAM tables.

[edit] Huffman compression

MyISAM compression is based on Huffman compression. In his article from 1952 Huffman proved that his algorithm uses the least possible number of bits to encode a sequence of messages. The number of bits assigned to each message depends on its probability to appear in the sequence.

Huffman did not specify exactly, what those "messages" are. One could take all possible values - say of a table column - as "messages". But if there are too many of them, the code tables could become bigger than the uncompressed table. One would need to specify every possible value once and the code tree with its indexes and offsets. Not to forget the effort to step through big binary trees for every value and - on the encoding side - the comparison of each value against the already collected distinct values.

The usual way to define "Huffman messages" is to take the possible 256 values, which a byte can express, as the "messages". That way the code trees are of limited size. On the other hand, the theoretical maximum compression is 1:8 (12.5%) in this case.

[edit] The myisampack Program

myisampack tries both ways to compress the column values. When starting to analyze the existing uncompressed data, it collects distinct column values up to a limit of 8KB. If there are more, it falls back to byte value compression for this column.

This means also that myisampack may use different algorithms for different columns. Besides a couple of other tricks, myisampack determines for every column if distinct column value compression or byte value compression is better. After that it tries to combine byte value compression trees of different columns into one or more code trees. This means that finally we may have less code trees than columns. Therefore the column information in the file header contains the number of the code tree used for each column. Some columns might not need a code tree at all. This happens for columns which have the same value in all records.

[edit] Record and Blob Length Encoding

Since the compressed data file should be usable for read-only purposes by the MySQL database management system, every record starts on a byte boundary. For easier handling by the system, every record begins with a length information for the compressed record and a length information for the total size of all uncompressed blobs of this record. Both lengths are encoded in 1 to 5 bytes, depending on its value.

A length from 1 to 253 bytes is represented in one byte. A length of 254 to 65535 bytes (64KB-1) is represented by three bytes. The first contains the value 254 and the next two bytes contain the plain length. The low order byte goes first. A length of 65536 to 4294967295 bytes (4GB-1) is represented by five bytes. The first contains the value 255 and the next four bytes contain the plain length. The low order byte goes first.

The encoded compressed record length does not include these length bytes. It tells the number of bytes which follow behind the length bytes for this record.

[edit] Code Tree Representation

The code trees are binary trees. Every node has exactly two children. The children can be leaves or branches. A leaf contains one original, uncompressed value. A branch contains a pointer to another node. The Huffman codes represent the navigation through the tree. Every left branch gets a 0 bit, every right branch gets a 1 bit.

The in-memory representation of the trees are two unsigned integers per node. Each describes either a leaf value or an offset (in unsigned integers relative from this node) to another node. To distinguish values from offsets, the 15th bit (decimal value 32768) is set together with offsets. This is safe as the size of the trees is limited by either having a maximum of 256 elements for byte value compression or 4096 elements for distinct column value compression.

The representation of the trees in the compressed data file is almost the same. But instead of writing all bits of the unsigned integers, only as many bits are written as are required to represent the highest value or offset respectively. One more bit per value/offset is written in advance, to distinguish both. The number of bits required per value and per offset is computed in advance and part of the code tree description.

[edit] Usage of the Index File

While the header of the compressed data file contains a lot of information, there are still some things which need to be taken from the index file. These are the number of columns of the table and the length of each column. The latter is required for columns with suppressed leading spaces or suppressed trailing spaces or zeros.

[edit] Tricks

As already mentioned, myisampack uses some tricks to decrease the amount of data to be encoded. These cope with leading and trailing spaces or zeros or with all blank or NULL fields.

I do not describe these in detail here. They do not materialize in the compressed data files other than the later mentioned field and pack types. They are however important to know for decoding the records.

[edit] Detailed Specification of the Decoding:

Below follows the detailed specification of the encoding:

Datafile fixed header (32 bytes):

4 byte  magic number
4 byte  total header length (fixed + column info + code trees)
4 byte  minimum packed record length
4 byte  maximum packed record length
4 byte  total number of elements in all code trees
4 byte  total number of bytes collected for distinct column values
2 byte  number of code trees
1 byte  maximum number of bytes required to represent record+blob lengths
1 byte  record pointer length, number of bytes for compressed data file length
4 byte  zeros

Column Information. For every column in the table:

5 bits  field type
    FIELD_NORMAL          0
    FIELD_SKIP_ENDSPACE   1
    FIELD_SKIP_PRESPACE   2
    FIELD_SKIP_ZERO       3
    FIELD_BLOB            4
    FIELD_CONSTANT        5
    FIELD_INTERVALL       6
    FIELD_ZERO            7
    FIELD_VARCHAR         8
    FIELD_CHECK           9

6 bits  pack type as a set of flags
    PACK_TYPE_SELECTED      1
    PACK_TYPE_SPACE_FIELDS  2
    PACK_TYPE_ZERO_FILL     4

5 bits  if pack type contains PACK_TYPE_ZERO_FILL
    minimum number of trailing zero bytes in this column
        else
    number of bits to encode the number of
    packed bytes in this column (length_bits)

x bits  number of the code tree used to encode this column
    x is the minimum number of bits required to represent the highest
    tree number.

Alignment:

x bits  alignment to the next byte border

Code Trees. For every tree:

1 bit   compression type
    0 = byte value compression
        8 bits  minimum byte value coded by this tree
        9 bits  number of byte values encoded by this tree
        5 bits  number of bits used to encode the byte values
        5 bits  number of bits used to encode offsets to next tree elements
    1 = distinct column value compression
        15 bits number of distinct column values encoded by this tree
        16 bits length of the buffer with all column values
        5 bits  number of bits used to encode the index of the column value
        5 bits  number of bits used to encode offsets to next tree elements
For each code tree element:
    1 bit   IS_OFFSET
    x bits  the announced number of bits for either a value or an offset
x bits  alignment to the next byte border
If compression by distinct column values:
    The number of 8-bit values that make up the column value buffer

Compressed Records. For every record:

1-5 bytes  length of the compressed record in bytes
    1. byte  0..253 length
             254    length encoded in the next two bytes little endian
             255    length encoded in the next  x  bytes little endian
                    x = 3  for pack file version 1
                    x = 4  for pack file version > 1
1-5 bytes  total length of all expanded blobs of this record
    1. byte  0..253 length
             254    length encoded in the next two bytes little endian
             255    length encoded in the next  x  bytes little endian
                    x = 3  for pack file version 1
                    x = 4  for pack file version > 1
For every column:
    If pack type includes PACK_TYPE_SPACE_FIELDS,
        1 bit   1 = spaces only, 0 = not only spaces
    In case the field type is of:
        FIELD_SKIP_ZERO
            1 bit   1 = zeros only, 0 = not only zeros
                    In the latter case
                        x bits  the Huffman code for every byte
        FIELD_NORMAL
            x bits  the Huffman code for every byte
        FIELD_SKIP_ENDSPACE
            If pack type includes PACK_TYPE_SELECTED,
                1 bit   1 = more than min endspace, 0 = not more
                        In the former case
                            x bits  nr of extra spaces, x = length_bits
            else
                x bits  nr of extra spaces, x = length_bits
            x bits  the Huffman code for every byte
        FIELD_SKIP_PRESPACE
            If pack type includes PACK_TYPE_SELECTED,
                1 bit   1 = more than min prespace, 0 = not more
                        In the former case
                            x bits  nr of extra spaces, x = length_bits
            else
                x bits  nr of extra spaces, x = length_bits
            x bits  the Huffman code for every byte
        FIELD_CONSTANT or FIELD_ZERO or FIELD_CHECK
            nothing for these
        FIELD_INTERVALL
            x bits  the Huffman code for the buffer index of the column value
        FIELD_BLOB
            1 bit   1 = blob is empty, 0 = not empty
                    In the latter case
                        x bits  blob length, x = length_bits
                        x bits  the Huffman code for every byte
        FIELD_VARCHAR
            1 bit   1 = varchar is empty, 0 = not empty
                    In the latter case
                        x bits  blob length, x = length_bits
                        x bits  the Huffman code for every byte
    x bits  alignment to the next byte border

[edit] MyISAM Key Cache

The MyISAM Key Cache

[edit] MyISAM Concurrent Insert

     Session1                             Session2
     ========                             ========
  +------------+                       +------------+
  | TABLE      |                       | TABLE      |
  |            |   +---------------+   |            |
  |          s --->| TABLE_SHARE   |<--- s          |
  |    file    |   |               |   |    file    |
  +-----|------+   |               |   +-----|------+
        |          +---------------+         |
        v                                    v
  +------------+                       +------------+
  | ha_myisam  |                       | ha_myisam  |
  |            |                       |            |
  |    file    |                       |    file    |
  +-----|------+                       +-----|------+
        |                                    |
        v                                    v
  +------------+                       +------------+
  | MI_INFO    |                       | MI_INFO    |
  |            |   +---------------+   |            |
  |          s --->| MYISAM_SHARE  |<--- s          |
  |            |   |               |   |            |
  | state -+--------> state.state <--------+- state |
  |        |   |   |               |   |   |        |
  |        v   |   |               |   |   v        |
  | save_state |   |               |   | save_state |
  |            |   |               |   |            |
  |   dfile    |   |   kfile       |   |   dfile    |
  +-----|------+   +-----|---------+   +-----|------+
        |                |                   |
        |                v                   |
        |              .MYI                  |
        |                                    |
        +------------> .MYD <----------------+

MI_INFO::state may either point to MI_INFO::save_state or to MYISAM_SHARE::state.state. The latter is the normal case. Amongst others, state contains data_file_length.

To support concurrent inserts, every statement starts with copying MYISAM_SHARE::state.state to MI_INFO::save_state and lets MI_INFO::state point to the copy. This is done in mi_get_status(). This is called from the hook THR_LOCK::(*get_status)(). Some of the hooks are explained in thr_lock.c. (*get_status)() is called when a thread gets a lock.

The copy back is done in mi_update_status(), which is called from mi_lock_database() when unlocking from a write lock. This, in turn, is called from ha_myisam::external_lock() from unlock_external() from mysql_unlock_tables(). Until 5.1 this was done after thr_multi_unlock(). So it was possible that another thread (or even multiple of them) could thr_lock the table and work with it before the first thread updates the status. In 6.0 the order is reversed. The status should now be accurate when another thread acquires a thr_lock.

However, with concurrent inserts, the trick is that read locks are allowed to proceed while a concurrent insert holds a write lock. So it can copy outdated information when entering the lock. Since it works on its local copy of the state, it won't notice rows that are made available through mi_update_status() after it got the lock.

But there is another chance to miss the row(s). See also Bug#36618 (myisam insert not immediately visible to select from another client). When the concurrent insert ends, it reports success to its client application before it unlocks the table. So there is non-deterministic time span between the seemingly successful ended insert and the final update of the MyISAM status.

Retrieved from "http://forge.mysql.com/wiki/MySQL_Internals_MyISAM"

This page has been accessed 16,236 times. This page was last modified 16:53, 18 June 2009.

Find

Browse
MySQLForge
Main Page
Current events
Recent changes
Random page
Help
Edit
Edit this page
Editing help
This page
Discuss this page
Post a comment
Printable version
Context
Page history
What links here
Related changes
My pages
Special pages
New pages
File list
Statistics
Bug reports
More...