MySQL Internals MyISAM Key Cache
MySQL_Internals_MyISAM_Key_Cache
Contents |
[edit] MyISAM Key Cache
This chapter describes the key cache as of MySQL 5.1.19 (see Bug #17332).
[edit] Introduction
The purpose of the MyISAM Key Cache is to hold parts (or all) of one or more MyISAM index files in memory. Whenever MyISAM requests some portion of file data, the key cache fetches the enclosing file block(s) into one (or more) cache block(s). When MyISAM writes, The cache may need to fetch the file blocks that are affected by the new data before overwriting the file. Only if full file blocks are overwritten, the fetch is omitted. Of course, file blocks that are already in the cache don't need to be read again.
Normally, at end of each SQL statement, the changed blocks for every affected index file are flushed (written to the file) but kept in the cache for later use. This per statement flush can be suppressed by the table option DELAY_KEY_WRITE, the system or session variable DELAY_KEY_WRITE, or the startup option --delay-key-write.
When the last thread closes a MyISAM table, the MyISAM table share is closed too. In this case the changed blocks for the index file are flushed (written to the file) and freed. This cannot be suppressed. However no flush is done for temporary tables.
[edit] Nomenclature
[edit] Index Blocks, File Blocks, and Cache Blocks
The MyISAM index file is organized in "index blocks". "Index blocks" can be of different size even within the same file.
The key cache does not distinguish between leaf blocks and branch blocks. One can however supply an integer argument per block that is used to decide how early a block may be selected for eviction. This could be used to keep branch blocks longer in the cache.
The cached data are organized in "cache blocks". A "cache block" is a contiguous range of memory addresses. Its purpose is to hold data from the index file. The blocks of the index file, whose data are copied in the "cache blocks" are called "file blocks". "File blocks" and "cache blocks" are of the same size. This size is fixed per key cache. The first "file block" starts at file offset zero. The "file blocks" lay in the file without gaps and without overlap. Since the file size does not need to be a multiple of the "file block" size, the last "file block" can be partially filled with file data.
Since "file blocks" and "cache blocks" are tightly coupled, I often speak of "blocks" where the distinction is obvious or unimportant.
However, there is one important difference. Usually the number of configured "cache blocks" does not match the number of "file blocks". This is especially true because the file can grow over time, while the number of "cache blocks" remains constant until it is explicitly reconfigured. The most important case is that the number of "cache blocks" is smaller than the number of "file blocks". In this case it can happen that all "cache blocks" are assigned to "file blocks" when another "file block" needs to be accessed. Then one of the cached "file blocks" needs to be "evicted" from the cache. For the selection, which "block" to "evict", a "least recently used" ("LRU") mechanism is used.
The mechanism is not pure "least recently used". There is a "hot" and a "warm" area. Blocks in the "hot" area are kept longer in the cache.
"File blocks" and "index blocks" do not necessarily match. They do not need to be of the same size nor do they need to be aligned. An "index block" can start in the middle of a "file block" and spread over two or more "file blocks". A "file block" can start in the middle of an "index block" and spread over two or more "index blocks". However, by default, "file blocks" have a size of the smallest possible "index block" (1K). Since "index blocks" are aligned to their smallest possible size (1K), in most cases "file blocks" are aligned to "index blocks" and don't spread over two or more of them. However, it is common for "index blocks" to be bigger than 1K. So a "file block" often contains a fragment of an "index block" only.
If "file blocks" are configured bigger than the smallest "index block" size, then it can happen that a "file block" contains data from more than one index. But it is from the same file and thus from the same table. So we could have data from different indexes of the same table in a "file block".
But the notion of "index blocks" is irrelevant for the key cache. It would work for record accesses of arbitrary lengths too. Whenever the key cache is asked for an arbitrary fragment of data, it loads the "file block(s)" which surround that fragment. To emphasize its generality, I will omit references to the "index file" and speak of a cached "file" only.
[edit] File Descriptor, Position, and File/Pos
A "file descriptor" is a non-negative integer value. It is a unique descriptor of an open file in a process. However, in theory, multiple file descriptors can reference the same file. The key cache has no mechanism to detect this uncommon situation. It relies completely on the caller that there is exactly one file descriptor for one cached file at any given point in time. If there would be two or more file descriptors for the same file, multiple copies of the same file block could exist in the cache. If one cache block is changed (dirty), the other block(s) wouldn't notice it and deliver old data to their caller. If two or more cache blocks of the same file block would be dirty, changes would be lost in the file because the last flushed dirty block would overwrite the formerly flushed dirty blocks in the file.
A certain "file block" within a certain file is identified by the "file descriptor" and the "position" of the block within this file. The "position" is the offset of the block from the file beginning in bytes. The identifying combination of "file descriptor" and "position" is often written as "file/pos".
[edit] Write Burst
When the dirty blocks of a file are "flushed" (written to the file), the changed blocks for this file are collected in an array, which is ordered in ascending "position". The blocks in the array are then written to the file in a row. I call this a "write burst". The array is limited in size. A big file with many changed blocks may require multiple "write bursts" when it is flushed.
By writing in sweeps (pseudo sequential writing of ascending block positions) we saw a 10x speedup when flushing all data in the cache.
[edit] Data Structures
The root structure for every key cache is 'KEY_CACHE'. The default key cache is always present. Its root structure is the static variable 'dflt_key_cache_var', referenced by the pointer 'dflt_key_cache'. A running MySQL server can have more key caches. Their root structures will be allocated when they are created.
For every file block that is in the cache, or is requested to go in the cache, there exists a 'hash_link' structure that holds the file descriptor and the position of the block in the file. It can be seen as a representative of the file block.
For every cache block there exists a 'block_link' structure and a block buffer. A cache block which is in use is assigned to a file block (hash_link). The hash_link points back to the block_link. During eviction of a file block there are two hash_links pointing at a block_link. The one that the block is still assigned to (the block_link points back to it) and the one that is requested to go in the cache.
The structures are split to allow an efficient reassignment of cache blocks to file blocks. Assume a well-used cache, which has lesser cache blocks than the sum of file blocks in all index files which it caches. All cache blocks are in use. When a file block is requested, which is not in the cache, the least recently used file block must be evicted. Imagine two threads trying to access the same block at (almost) the same time. Note that the threads are synchronised on the cache structures by a mutex. But they release the mutex when they need to wait for I/O or for something else.
T1: Wants to read file block (1,10)
T2: Wants to read file block (1,10)
T1: Gets a new hash_link for (1,10)
T1: Notices there is no free block, flushes another block (2,20)
(Assuming here that block 2,20 is dirty)
T2: Finds the hash_link for (1,10)
T2: Notices it is in flush, starts waiting
T1: Reassigns the cache block to file block (1,10)
T1: Reads data from file into the block buffer
T1: Signals T2
T2: Awakes with file block in cache
Now imagine the same with a common structure for file block and cache block:
T1: Wants to read file block (1,10)
T2: Wants to read file block (1,10)
T1: Does not find the block (1,10) in the cache
T1: Notices there is no free block, flushes another block (2,20)
(Assuming here that block 2,20 is dirty)
T2: Does not find the block (1,10) in the cache
T2: Notices there is no free block, flushes another block (3,30)
(Assuming here that block 3,30 is dirty)
T1: Still does not find the block (1,10) in the cache
T1: Reassigns the flushed cache block (2,20) to file block (1,10)
T1: Reads data from file into the block buffer
T2: Now finds the block (1,10) in the cache, but not read from file yet
T2: Releases flushed block (3,30) back to the LRU ring, starts waiting
T1: Signals T2
T2: Awakes with file block in cache
This way we would have needed to search for the block in the cache twice per thread and would do an unnecessary I/O (the flush of (3,30)). However this would only apply if the least recently blocks are dirty. If T1 finds the least recently used block clean, it would immediately reassign it and start reading. T2 would find the block and wait. A background thread could flush least recently blocks regularly to minimize the mentioned overhead. But that's all theory. We have the split between hash_link and block_link. This seems to be the most efficient solution for this problem.
The hash_links are organized in a hash structure (hence the name). Requests for file data are done by file/pos. By the hash the right hash_link can be found quickly. Free hash_links are in the free_hash_list chain.
The block_links are organized in a couple of structures. The most prominent one is the LRU ring. A block can also be in the free_block_list chain, in the changed_blocks hash, or in the file_blocks hash. The latter three are mutual exclusive. A block is either free or contains valid changed or unchanged data. The first two are also mutual exclusive. A free block cannot be in the LRU ring and vice versa.
The linkage of block_links is done as follows:
- The LRU ring uses (**prev_used, *next_used). - The free_block_list uses (*next_used). - The chains of the changed_blocks and file_blocks hashes use (**prev_changed, *next_changed).
Blocks that are in the LRU ring, are always also in either the changed_blocks hash or the file_blocks hash. The reverse is not true. Blocks are removed from the LRU ring when they are requested for use (read, write, flush) and linked back after use. But they stay in one of the hashes.
[edit] Block status
Blocks can have a combination of zero or more of the below bits:
- BLOCK_ERROR An error occurred when performing file I/O.
- BLOCK_READ The cache block contains a copy of the file block.
- BLOCK_IN_SWITCH The block is selected for eviction.
- BLOCK_REASSIGNED The block is selected for eviction or to be freed.
- BLOCK_IN_FLUSH The block is selected for flush.
- BLOCK_CHANGED The cache block contents is changed over the file
block contents. The file block contents is wrong
until the block is flushed.
- BLOCK_IN_USE The block is not unused (neither free, nor never
used). This is needed for debugging only.
- BLOCK_IN_EVICTION The block is selected for eviction. This happens
only if the LRU ring was empty when a new block
was requested. See the section "BLOCK_IN_EVICTION"
for further explanation.
- BLOCK_IN_FLUSHWRITE A flush operation is writing this block currently.
A writer must wait until done.
- BLOCK_FOR_UPDATE Buffer contents is being changed currently.
A flusher must wait until done.
[edit] Locking
All key cache locking is done with a single mutex per key cache: keycache->cache_lock. This mutex is locked almost all the time when executing code in mf_keycache.c. However it is released for I/O and some copy operations.
The cache_lock is also released when waiting for some event. Waiting and signalling is done via condition variables. In most cases the thread waits on its thread->suspend condition variable. Every thread has a my_thread_var structure, which contains this variable and a '*next' and '**prev' pointer. These pointers are used to insert the thread into a wait queue.
NOTE: Since there is only one pair of queue pointers per thread, a thread can be in one wait queue only. A thread is in a wait queue only when it is waiting.
Before starting to wait on its condition variable with pthread_cond_wait(), the thread enters itself to a specific wait queue with link_into_queue() (double linked with '*next' + '**prev') or wait_on_queue() (single linked with '*next').
Another thread, when releasing a resource, looks up the waiting thread in the related wait queue. It sends a signal with pthread_cond_signal() to the waiting thread.
NOTE: Depending on the particular wait situation, either the sending thread removes the waiting thread from the wait queue with unlink_from_queue() or release_whole_queue() respectively, or the waiting thread removes itself.
There is one exception from this locking scheme. Each block has a reference to a condition variable (condvar). It holds a reference to the thread->suspend condition variable, if that thread is waiting for the block. When that thread is signalled, the reference is cleared. This is similar to the above, but it clearly means that only one thread can wait for a particular block. There is no queue in this case. Strangely enough block->condvar is used for waiting for the assigned hash_link only. More precisely it is used to wait for all requests to be unregistered from the assigned hash_link.
[edit] Wait queues
When a thread needs to wait for another thread, it links its my_thread_var structure into a wait queue and waits on its 'suspend' condition variable. There are two types of wait queues:
- Double linked (**prev, *next) in a ring. This allows a thread to be unlinked from the middle of the ring. The operations are: - link_into_queue() - unlink_from_queue()
- Single linked (*next) in a NULL terminated chain. Only the thread at the chain head can be unlinked. The operations are: - wait_on_queue() - release_whole_queue() which unlinks all threads and signals each.
The other thread, when releasing the resource, looks up a waiting thread in the related wait queue. It sends a signal to the waiting thread and removes it from the wait queue with unlink_from_queue() or in release_whole_queue() respectively.
NOTE: Since there is only one pair of queue pointers per thread, a thread can be in one wait queue only. A thread is in a wait queue only when it is waiting.
The resize_queue contains threads that are waiting for resizing the key cache. Only one thread at a time can resize the cache. The thread that starts resizing is unlinked from the resize_queue. After flushing, if it needs to wait for other threads to complete their I/O, it is linked into the waiting_for_resize_cnt queue. The resize_queue serves two purposes:
1. Threads that want to do a resize wait there if in_resize is set.
This is not used in the server. The server refuses a second resize
request if one is already active. keycache->in_init is used for the
synchronization. See set_var.cc.
2. Threads that want to access blocks during resize wait here during
the re-initialization phase.
When the resize is done, all threads on the queue are signalled. Hypothetical resizers can compete for resizing, and read/write requests will restart to request blocks from the freshly resized cache. If the cache has been resized too small, it is disabled and 'can_be_used' is false. In this case read/write requests bypass the cache. Since they increment and decrement 'cnt_for_resize_op', the next resizer can wait on the queue 'waiting_for_resize_cnt' until all I/O finished.
The queue waiting_for_resize_cnt is used by a resizer to wait for pending I/O after the flush phase. No I/O should be active when memory for the cache structures is re-allocated and/or the key_cache_block_size is changed. I/O operations start with inc_counter_for_resize_op() and end with dec_counter_for_resize_op(). The last I/O, which decrements cnt_for_resize_op to zero, wakes the resizer. Synchronization on this queue is done with wait_on_queue()/release_whole_queue().
The queue waiting_for_hash_link is used in the very improbable case that a thread wants to access a file block which is not in the cache and all hash_links are in use already.
The queue waiting_for_block is used when a thread wants to access a file block which is not in the cache and the LRU ring is empty. This means that all blocks are currently in use for read, write, and/or flush.
Every block has two wait queues:
block->wqueue[COND_FOR_REQUESTED] is used to synchronize secondary requestors for a block. This happens when more than one thread wants to access the same file block before it has been read into the block buffer. The first thread that requests the file block acquires a cache block and starts reading the file block into it. Before it starts to read it releases the cache_lock. This gives other threads the chance to request the same file block. These need to wait on block->wqueue[COND_FOR_REQUESTED] until the read is complete.
block->wqueue[COND_FOR_REQUESTED] is also used by threads that flush cache blocks. It might happen that a flush is due while another thread is writing into the cache block. The flusher has to wait until this is complete.
block->wqueue[COND_FOR_SAVED] is used to synchronize threads that wait for a block to finish flushing, eviction, or becoming free.
[edit] Access Methods
There are five operations that request one or more blocks:
- key_cache_read()
- key_cache_insert()
- key_cache_write()
- flush_key_blocks()
- resize_key_cache()
When one of the first three requests a block, its hash_link is marked in use (hash_link->requests++). After they are done with the block, they release the hash_link and signal threads that may be waiting for it (remove_reader()). Waiting for hash_link->requests happens during eviction and freeing (wait_for_readers()).
hash_link->requests protects a block against reassignment and free. As long as a block points to a hash_link with a request, eviction and free wait for the request to go away before the final reassignment/free. hash_link->requests does not protect a block against flush. Neither against explicit flush, nor against implicit flush during eviction.
A similar thing is block->requests. It counts how many threads want to do something with a block. Whenever a thread wants to read, write, insert (load), flush, or free a block, it increments the count and takes the block off the LRU ring. This means that the block is implicitly protected against eviction.
After a thread is done with a block, it decrements block->requests. The thread that decrements it to zero puts the block back in the LRU ring (unless it was a 'free' operation).
block->requests protect a block against eviction. Even against being selected for eviction. block->requests does not protect a block against flush or free.
In summary the three above mentioned standard functions implicitly increment both hash_link->requests and block->requests. This protects the blocks against reuse for other purposes. Only flush is allowed in parallel. But this does not modify the block contents nor reassign it for different purposes.
However, the key cache does not prohibit two write operations to change the same block at the same time. As mentioned above, there is a single cache_lock (mutex). So the blocks cannot be modified at exactly the same time. But if a write operation changes a part of the cache block only, a read of the rest might be necessary. In this case the cache_lock is released and the other write could do its work.
So be warned that proper locking is required in the caller of the key cache operations.
The first three access functions allow to read/write arbitrary data size. This means that they could request multiple file blocks. This is done in a loop.
The functions take the cache_lock if the cache is initialized. They do this before the loop. They do not release the lock for every iteration. They hold the lock unless doing I/O or copying data from the block buffer.
After entering the lock, they check for a resize in progress. If the flush phase is over, they wait for completion of the resize (the re-initialization). After the flush phase the resizer waits for I/Os to complete. It would not be nice to start another I/O at this moment. Otherwise the resizer could easily starve.
Thereafter 'cnt_for_resize_op' is incremented so that another resizer can wait for this I/O. The count is decremented before the function returns. Incrementing/decrementing is not done in every loop iteration. But it is always done in the cache_lock. So the code after "no_key_cache:" looks somewhat complicated.
Since the count is incremented/decremented within the cache_lock and every I/O is counted, the resizer can wait for pending I/O before starting the re-initialization.
In key_cache_insert() and key_cache_write(), if page_st is PAGE_WAIT_TO_BE_READ then we must wait for the block in switch to be reassigned and read by another thread. Otherwise we 1.) might not know the correct hash_link to unregister our request from, and 2.) might write new data into the wrong block, or 3.) might write into a block that is read from file afterwards.
Proposal:
keycache->can_be_used is not changed during resize. It is set when the cache is [re-]initialized with proper parameters. It is cleared when the cache is [re-]initialized too small and thus disabled. The semantics of keycache->can_be_used is the same as (disk_blocks > 0). I propose to get rid of it.
Proposal:
Also there is keycache->disk_blocks and keycache->blocks. Both 'int'. I propose to get rid of one of them.
Proposal:
Change 'wrmode' parameter of find_key_block() to 'mode'. Have modes MODE_READ, MODE_WRITE, and MODE_INSERT. For the latter return NULL if the block is already in the cache or another thread is reading it already. Unregister the hash_link request in this case.
Proposal:
In cases where PAGE_WAIT_TO_BE_READ would be returned, wait in find_key_block() until the block is read. The respective branch in read_block() as well as all respective handling in key_cache_read(), key_cache_write(), and key_cache_insert() can be dropped.
[edit] Resizing
Resizing consists of two major phases:
- The flush phase: Flush all dirty cache blocks and free all blocks.
- The re-initialization phase: re-allocate memory structures and set new
parameters, most notable key_cache_block_size.
The basic operation is so that
- the cache_lock (mutex) is taken,
- the cache is marked in resize and in flush for resize,
- all dirty (changed) blocks are flushed,
- all blocks are freed,
- the cache is unmarked flush for resize,
- the resizer waits for I/O that bypassed the cash in the flush phase,
- all cache structures (except the root KEY_CACHE and the mutex) are freed,
- the cache structures are reallocated with the new size,
- the cache structures are initialized so that it is usable again,
- new parameters are set,
- all threads waiting for end of resize are awaken, and
- the cache_lock is released.
Resizing the key cache is implemented in an online fashion. The normal MyISAM operation is not stopped.
Here follow a couple of problems and their solutions:
[edit] Resizing 1: no new writes, no new reads
During resize we do not allow to let new blocks in the cache.
The old strategy (before 5.1.19) was to disallow new changed blocks in the cache while allowing new clean blocks. This allowed the following to happen:
On a write request we tried to find the block in the cache. If it was not there, we immediately went to write directly. If it was present and clean, we freed the block and wrote directly. If it was present and dirty, we freed it and wrote directly. Note that this worked only with MyISAM typical alignment and key block size less or equal to the smallest MyISAM index block.
When going to write directly, we release the cache_lock. This allows another thread to read the same block. It will not find it in the cache, but it may find a free block or evict a clean block. In both cases it can start reading the block from file without waiting for anything first. If the second thread is "in luck", it reads the block (with old contents) from file before the first thread even started to write.
When the first thread completes its write, we have a new block in file and an old block in cache. And since the the read request was done by MyISAM, it worked with an old block.
So we do not allow to let new clean block in the cache during resize either.
Alternatively it would be possible to allow new dirty block to go in during resize too. But this could make resizing an infinite process.
Note that no two threads can access the same index block at the same time because they have to have a "key_root_lock" on the index. A new read of an index block should only be possible after the other thread has all writes completed and unlocked the "key_root_lock". However, if key_cache_block_size is big enough to contain data from two or more indexes, it can fail badly.
It is implemented so: During resize
- read requests for a non-cached block read from file directly,
- write requests for a non-cached block write to file directly,
- read requests for a dirty block use the block,
- write requests for a dirty block use the block,
- read requests for a clean block use the block,
- write requests for a clean block free the block and write to file directly.
[edit] Resizing 2: flush_key_blocks_int() and key_cache_write()
If a key_cache_write() modifies the buffer contents while it is being written (flushed) to file, then the write finishes first, because it did not release the cache_lock. (Even if it did, it set BLOCK_CHANGED before modifying the buffer). When the flusher acquires the cache_lock after the write, it clears BLOCK_CHANGED. This allows the block to be evicted or freed without another flush. The change of the last writer is lost. In the worst case parts of the changes are included, leaving an inconsistent block behind.
The above could even happen in normal cache operation. Even with 1K cache block size. The normal flush after a statement takes place after the thr_lock has been released. Another thread can modify the cache blocks while the flush is going on.
Different situation, same cause: Assume a cache with all blocks clean. A writer evicts a block. It wants to overwrite a part of the block only. So it has to read the new block first. It releases the cache_lock for reading. A resizer sets resize_in_flush. It does not find dirty blocks. It starts to free all blocks. When it comes to the one the writer is reading, it has to wait until the writer releases the block. The writer awakes from reading, gets cache_lock, modifies the block, and wakes the waiting resizer. The resizer awakes inside free_block(). It frees the dirty block.
A pair of block status flags fixes the above problems: BLOCK_IN_FLUSHWRITE, BLOCK_FOR_UPDATE. The former is set by a flusher right before it releases the cache_lock for writing the block. The latter is set by a writer before it releases the cache_lock. Both check for the other flag and wait on block->wqueue[COND_FOR_REQUESTED] (flusher) or block->wqueue[COND_FOR_SAVED] (writer).
BLOCK_IN_FLUSHWRITE: This is set by the flusher right before the cache_lock is released for writing. It is cleared right after writing. key_cache_write() can proceed if BLOCK_IN_FLUSH is set but not if BLOCK_IN_FLUSHWRITE is set. If BLOCK_IN_FLUSHWRITE is set, it waits on block->wqueue[COND_FOR_SAVED].
BLOCK_FOR_UPDATE: This is set in key_cache_write() as soon as possible before the cache_lock is released for waiting or copying the buffer contents. It is cleared after the buffer is finally changed. It is used by a flusher to skip the block in a flush burst. flush_key_blocks_int() repeats it search for changed blocks and will finally find the skipped block again. To avoid an infinite loop, it remembers one of the blocks marked BLOCK_FOR_UPDATE and waits on block->wqueue[COND_FOR_REQUESTED] for its modification to complete. When signaling COND_FOR_REQUESTED, key_cache_write() does not wake threads waiting for the block to be flushed. BLOCK_FOR_UPDATE is also used to prevent free_block() to be called. The block requires a(nother) flush before being freed.
[edit] Resizing 3: BLOCK_IN_FLUSH
When flushing a file, one block is selected from the changed_blocks hash and then all blocks that belonged to this file are selected from the changed_blocks hash. These blocks are marked for flush and put in an array for a write burst. The array is sorted by increasing write position. Every block is written. Before every write the cache_lock is released and taken back after the write. Only during these write periods parallel cache operations are possible. After taking back the mutex, all threads waiting for the block are signalled.
Before selecting each block it must be checked if the block had already been marked for flush before. At the end of most writing SQL statements, during unlock of a MyISAM table, or during the final close of the table, the index blocks for this table are flushed. This flush can happen in parallel with the big flush from the cache resizing. So it is possible that two threads flush the same file.
[edit] Resizing 4: BLOCK_IN_EVICTION (Eviction with empty LRU ring)
When a block is marked BLOCK_IN_FLUSH by a flushing thread, a request is registered against the block. This unlinks it from the LRU ring so that it cannot be selected for eviction. If it had been selected for eviction before, it had been marked BLOCK_IN_SWITCH and thus would not be selected for flush.
However there was one exception. If the LRU ring was empty and a block was requested for eviction, then the next block that is put on the LRU ring is immediately attached to the requesting hash_link and the requesting thread is signalled. From this moment until the requesting thread awakes, a flusher could select the block for flush. The block was not marked in any way during this interval. Though it is not on the LRU ring, it could still be selected for flush. If it is unchanged, it could even immediately be freed by the flusher. When the evicter awakes, it could find a block in flush, a free block, or even a block which has been reused for something else after being free.
NOTE: This horrible scenario can only happen if the LRU ring got empty. In normal operation this happens only if there are less cache blocks than threads. Every thread works on one block at a time only. This applies to key_cache_read(), key_cache_insert() and key_cache_write().
An exception is flush_key_blocks(). During flush a thread can take a lot of blocks from the LRU ring. The same goes with resize_key_cache().
In other words:
When a block needs to be evicted, the least recently used block is taken from the LRU ring. If the LRU ring is empty, the evicting thread needs to wait for a block to be put on the LRU ring. (BTW. Even free_block() puts the block on the LRU ring and takes it back again without intermediate release of the cache_lock.)
A block is put in the LRU ring by link_block(). Before it links the block in the LRU ring, it checks if threads are waiting for a block. If so, it does not link the block in the LRU ring, but assigns it to the hash_link of a waiting thread instead. Then it registers a request on the block for every thread waiting on the same hash_link and wakes them all.
In this case nothing was set in block_link. The most obvious change to do would be to set the flag BLOCK_IN_SWITCH. But this cannot be done in link_block(). Multiple threads may be waiting for the block. Only the first one must execute the eviction. All others have to wait. The logic is so that the first thread sets BLOCK_IN_SWITCH and the others wait when they see it.
The outcome was that nothing in the block indicated that it was selected for eviction until the first of the requesting threads set BLOCK_IN_SWITCH.
When the thread that called link_block() releases the cache_lock, another thread continues to run. This could be another thread but one of those waiting for the selected block. This thread could want to do all sort of things with the selected block. For example free it.
free_block() could not detect that the block has been selected for eviction. In theory it could detect the additional request(s) on the block (one request must be registered on the block anyway before it can be freed). But this was not done in the original code.
This problem is fixed with a new block->status flag: BLOCK_IN_EVICTION. It is set by link_block() when the block is not linked in the LRU ring, but assigned to a requesting hash_link. This is checked it in free_block() after unreg_request(). If set, it stops modifying the block in any way.
The flag also to prevents calling free_block() for a block that has been selected for eviction already.
Proposal:
We could save a lot of hassles in the code, like the special handling in link-block() (which does not link the block in the LRU ring in this case), and the uncommon use of thread->opt_info, if the cache behaves much more stupid in the extreme case of empty free list, all blocks in use, _and_ empty LRU ring.
If there is no block available, the thread needs to wait on waiting_for_block. This is no change yet.
But instead of special handling in link_block() and free_block() I would just put the block in the LRU ring or in the free list respectively and wake all threads from waiting_for_block.
The threads that awake restart their search for a block from the beginning (get hash_link, ...). The first thread that awakes finds the block and starts to use it. Later awaking threads do either fail and wait again, or they are in luck if they wanted to access the same block as the other thread, or there is yet another block available meanwhile.
Admittedly this could produce more CPU load in a situation where the system is short of cache blocks anyway. But OTOH this is a very seldom situation. And it requires a far too small configured cache.
[edit] Resizing 5: flush_key_blocks_int()
Changed blocks are always in the changed_block hash. There is one exception. Dirty blocks in eviction, which need to be flushed, are temporarily put into a "first_in_switch" chain within flush_key_blocks_int(). But since flush_key_blocks_int() waits until their eviction is done (which makes them clean blocks), this is irrelevant here.
Unfortunately there is another chance for failure. A second flusher could move the block from that "first_in_switch" chain to its own "first_in_switch" chain. The first flusher could then believe all changed blocks of "its" file are clean now and finish prematurely.
Proposal:
I strongly recommend to get rid of the temporary "first_in_switch" chain. Using BLOCK_IN_FLUSH also during eviction makes this possible.
Flushing of the same file by two threads (the table "owner" and the resizer) can also lead to a situation where one has to wait for the block and the other one frees it after use. The fix for this is to check if the block still references a hash_link.
While freeing clean blocks, the file_blocks chain is traversed. The 'next' block could move from its place in the chain while free_block() has to wait for readers. It could even be marked for to be freed (BLOCK_REASSIGNED) by another thread. To address this are two nested loops. The outer loop runs until no more blocks to be freed are found. The inner loop traverses the file_blocks chain. Whenever a block needs to be freed, the state of the 'next' block is saved and checked after free_block(). If it changed, the loop is restarted because the 'next' block may no longer be at the old position in the current chain. Also we need to skip all blocks that are marked for eviction or for to be freed.
It could even happen that a write started before the start of the flush. The write could still be in progress on a clean block. We have to skip this too. Whenever we found something to skip, we need to restart the flush from the beginning (checking for dirty blocks). These writes could move a block from the file_blocks hash to the changed_blocks hash.
[edit] LOAD INDEX INTO CACHE
Caveats of the old implementation:
- All indexes must have the same block size.
- All blocks of the table are flushed with FLUSH_RELEASE first.
- IGNORE LEAVES won't work well with big cache blocks (see below).
- If index file is bigger than cache, it could evict self-loaded blocks.
Advantage of the old implementation:
- Bigger read size.
mysql_preload_keys() needs to take a TL_READ_NO_INSERT lock. Otherwise concurrent inserts could modify index blocks in the cache, but mi_preload() reads directly from file, bypassing the cache. In theory it should not overwrite cache blocks that have valid data already. But the dirty blocks could be evicted (with flush) after mi_preload() read the data and before key_cache_insert() tries to load that block.
Since mi_preload() does not take a "key_root_lock", it could become a secondary requestor of a block that key_cache_write() is writing. Hence key_cache_write() must also wake readers. This is required when big key cache blocks contain data from two or more indexes.
key_cache_insert() fills blocks with PAGE_TO_BE_READ only. Pages with PAGE_WAIT_TO_BE_READ will be read by another thread anyway. Pages with PAGE_READ are in the cache already.
Index data start after the index file header. This is a multiple of the maximum index block size of this table. The data is loaded in chunks of 32K by default. If key_cache_block_size > maximum index block size of this table, for every chunk the first and last of the affected cache blocks is filled partially only. Every succeeding chunk finds its first block in the cache already (partially filled from the previous chunk, status PAGE_READ), and would ignore the block (see the last paragraph).
Because we allow reads while loading, it could happen that a reader finds one of those partially filled blocks. If it wants data from this block behind what has been loaded so far, it would report an error.
To get around this, a cache block must contain as much data as is available for it in the file. This means that every block except the last one of the file must be completely filled.
Consequently, key_cache_insert() needs to read the block itself, if the caller supplies less data than the block size.
This clearly slows down the loading. It happens very often in normal operation with a big key_cache_block_size.
There is no way around this for LOAD INDEX ... IGNORE LEAVES.
Note that IGNORE LEAVES may not work as expected anyway if cache blocks are bigger than the smallest index blocks. Depending on the difference in size we may not save much. The key cache cannot read partial cache blocks. If key_cache_insert() is given a part of a cache block only, it will read the whole block itself. To avoid double reading of the index file, we always must supply a full cache block to key_cache_insert(). But to check if we have branch index blocks, we must work on index blocks. The required algorithm is more complex than the one currently existing in mi_preload().
Proposal:
Without IGNORE LEAVES we could do a first chunk that fills the rest of the first file block only. After that the read is done aligned with the cache blocks.
Proposal:
MyISAM hands over to key cache a list of (position, length) for the blocks to load. Key cache computes maximum chunks to read while excluding file blocks already in cache. Key cache stops when all blocks are used by this file, or, in other words, when as many blocks for this file have been requested as there are cache blocks. So it would not evict pages loaded by itself.
[edit] Ideas and Open Problems
[edit] BLOCK_ERROR
Eviction of a changed block: If my_pwrite() fails, block is marked BLOCK_ERROR, but reassigned anyway. This makes the evicting thread mark 'its' table as crashed. This may be the wrong table. The block contents is not changed, but the block is assigned to the wrong table.
If this block is found by another thread (who did not notice the crash of the table yet) it finds it correctly assigned to the hash_link but not marked BLOCK_READ. It concludes that it must be a block in switch and behave as a secondary reader (PAGE_WAIT_TO_BE_READ). Fortunately BLOCK_ERROR is still set. So it wouldn't wait, but report an error.
I propose to 1. abstain from reassigning the block in error, 2. leave it in the changed_blocks hash so it is due for another attempt to flush, 3. do not put it back to the LRU ring so it cannot be evicted again, 4. set BLOCK_ERROR, 5. clear BLOCK_IN_SWITCH and let the evicter select another block.
When a thread tries to access that block (read, write, insert, flush), the error would be reported back to the caller and the right table would be marked crashed. At flush the block would be tried to write again. If this fails again, it would be freed (FLUSH_RELEASE) or left untouched (FLUSH_KEEP).
If the error flush happens in a resize operation, the cache should stay in the 'resize_in_flush' state until all erroneous blocks are flushed by the right thread(s). Direct I/O is allowed in this state.
A potential problem of my proposal could raise if all blocks get marked BLOCK_ERROR. All attempts to evict a block would stall. A detection of this situation could be implemented. Any attempt to evict a block would then be redirected to bypass the cache.
To provoke these errors, error injection is required.
[edit] global_blocks_changed
Used for SHOW STATUS like 'Key_blocks_not_flushed'. It is incremented/decremented together with blocks_changed. So it shows the current number of dirty blocks.
But global_blocks_changed is cleared in reset_key_cache_counters(). So it can overflow on the next decrement.
And global_blocks_changed is not cleared in end_key_cache(). This should be safe if it was counted correctly and not reset by reset_key_cache_counters(). Otherwise a wrong count could survive resize_key_cache() if a too small cache size has been requested.
This sounds hypothetical, but I saw:
| Key_blocks_not_flushed | 103067 | | Key_blocks_unused | 0 | | Key_blocks_used | 29 |
in several test runs. This should not normally be possible.
I found a source of incorrect counting, introduced by myself. Now SHOW STATUS looks reasonable. But my question is still valid: What do we want to show with SHOW STATUS like 'Key_blocks_not_flushed'? Is it correct to handle it differently from 'blocks_changed'?
I propose to get rid of 'global_blocks_changed', use 'blocks_changed' instead and not to clear this count in reset_key_cache_counters(). Or, to get less changes elsewhere in the server, get rid of 'blocks_changed' and count 'global_blocks_changed' where 'blocks_changed' is counted now.
[edit] reg_requests()
I propose to change the name to reg_block_request() to avoid confusion with hash_link requests. Use singular as it is always called for one request only. Also it would be nice to have hash_link->h_requests and block->b_requests for easier searching of one of these variables.
[edit] remove_reader()
This function takes a 'block' as argument, though its purpose is to decrement a request from a hash_link and to wake threads waiting on the hash_link. A 'block' is needed because the condition to signal is block->condvar. This means that the function can be used only after a block is assigned to the hash_link. The mere fact that the hash_link points to a block is insufficient because it could be a block in eviction for this hash_link. In this case block->condvar cannot be used because it belongs to another hash_link. One would signal the wrong thread. Moreover block->condvar is only used for the purpose to wait for the referenced hash_link to become "unrequested".
The function would be more flexible if it takes a 'hash_link' and the condvar is located in hash_link. This would come handy in key_cache_insert() when it detects that a block for the file/pos is already in eviction. It could just unregister its request an proceed. Now it needs to wait until the block is assigned before it can unregister.
For a "casual reader" (or a "beginner") it might even be easier to understand what remove_reader() and wait_for_readers() really do.
Additionally it would be more inline with other wait/signal events if we would use a wait queue instead of a condvar. Though there should never be more than one thread waiting for a hash_link to become "unrequested". The overhead would be small, but another anomaly would vanish thus saving a casual reader some thinking.
[edit] Background key cache flusher
Every N seconds ensure than the M least recently blocks are "clean".
[edit] Multiple resize flushers
Since index files could be on different disks we may want to allow for multiple threads doing the flush.
[edit] Maximum cache size
Enlarge some variables to allow key_buffer_size > 4G.
[edit] Block hashes
The changed_blocks and file_blocks hashes are indexed by file descriptor. Each of their chains can contain blocks for several files. It might be more efficient to have file_heads in these chains and blocks for one file only chained below the file_heads. This would make flushing faster. Flushing is done per file and so could use up a whole chain without caring for skipping other files blocks.
file_blocks[i] -> file_head(17) -> file_head(21)
| |
v v
block_link
|
v
block_link
[edit] hash_links hash
Similar could be done for hash_links.
hash_root[i] -> hash_link_file(17) -> hash_link_file(21)
| |
v v
hash_link_pos(0)
|
v
hash_link_pos(1024)
However one can get the same effect by just making key_cache->file_blocks[] as big as max number of file descriptors.
But even if we make the hash as big as file descriptors are available, we can have quite long chains of hash_links for each file. Beginning with a certain size of the key cache it might become better to have a more complex access structure. Either another hash per file, or an ordered chain with jump pointers. We should profile get_hash_link() to see if we have a problem there.
[edit] my_thread_var
The key cache uses the condition variable of the my_thread_var structure for waiting. This could also be used elsewhere in the MySQL server. We should check that stray signals cannot confuse the key cache code at any place (that is, check if wait loops are always used correctly). And there should be a server-wide rule how to use the 'next' and 'prev' pointers of my_thread_var. The key cache code cannot protect against abuse of these pointers. It may even be good to make the wait_for_event()/signal_event() function pair global mysys functions to be used for all waiting/signalling that is based on my_thread_var.