Category: MySQLDevelopment

MySQL Internals Algorithms

← Back to MySQL Internals overview page

Contents

[edit] Important Algorithms and Structures

MySQL uses many different algorithms and structures. This chapter tries to describe some of them.

[edit] The Item Class

To us, the word Item means more than just thingamabob; it is a technical term with a precise definition in the context of our source code. Item is a class. Each instance of the Item class has:

All of the following SQL thingamabobs are modeled in the Item class:

In the function category we include operators such as + and ||, because operators are merely functions that return values. We also include operators such as = and LIKE, which are operators that return boolean values. Consider the following statement:

SELECT UPPER(column1) FROM t WHERE column2 = @x;

For this statement, MySQL will need to store a list of items for the select list ('column1' column reference and UPPER function), and a list of items for the WHERE clause ('column2' column reference and '@x' variable and '=' operator).

Terminology: an Item instance in a MySQL program roughly corresponds to a "site", which according to the standard_SQL definition is "a place that holds an instance of a value of a specified data type", Another word that you'll see often in MySQL code is "field", which means column reference, and the Item_field subclass is generally for column values that occur for the intersection of a row and column in a table.

MySQL's Item class is defined in .../sql/item.h, and its subclasses are defined in .../sql/item*.h (that is, in item.h, item_cmpfunc.h, item_func.h, item_geofunc.h, item_row.h, item_strfunc.h, item_subselect.h, item_sum.h, item_timefunc.h). Page-width limitations prevent us from displaying the whole tree, but these are the main Item subclasses, and the subclasses of the subclasses:

Item_ident (Item_field, Item_ref)
Item_null
Item_num (Item_int, Item_real)
Item_param
Item_string (Item_static_string_func, Item_datetime, Item_empty_string)
Item_hex_string (Item_bin_string)
Item_result_field (all "item_func.h" "item_subselect.h" "item_sub.h" classes)
Item_copy_string
Item_cache (Item_cache_int, Item_cache_real, Item_cache_str, Item_cache_row)
Item_type_holder
Item_row

There's no formal classification of subclasses, but the main distinctions are by use (field, parameter, function) and by data type (num, string).

So, how does MySQL use items? You'll find that nearly every .cc program in the /sql directory makes some use of the Item class and its subclasses, so this list of programs is only partial and very general:

sql_parse.cc:     Makes new items in add_field_to_list()
item_sum.cc:      Uses item_func subclasses for COUNT, AVG, SUM
item_buff.cc:     Where buffers for item values can be stored
item_cmpfunc.cc:  Comparison functions with item_func subclasses
item_create.cc    For creating items that the lex might use
item_subselect.cc Subqueries are another type of function
mysqld.cc:        When main() ends, it uses clean_up() for items
opt_range.cc:     Uses field, compare-condition, and value subclasses
procedure.cc:     Notice Procedure * has a pointer to an item list
protocol.cc:      Uses send_fields() to pass item values back to users
sys_var.cc:       System variables have Item associations too
sql_base.cc:      Thread-specific Item searchers like find_field_in_table()
sql_class.cc:     Look at cleanup_after_query()
sql_delete.cc     This (like sql_insert.cc etc.) has field references
sql_error.cc      Has one of many examples of SHOW's use of items
sql_lex.cc:       Notice "add...to_list" functions
sql_select.cc     The largest program that uses items, apparently
udf_example.cc    The comments in this program are extensive

Whenever there's a need for an SQL operation that assigns, compares, aggregates, accepts, sends, or validates a site, you'll find a MySQL use of Item and its subclasses.

[edit] How MySQL Does Sorting (filesort)

MySQL has two filesort algorithms for sorting and retrieving results. The original method uses only the ORDER BY columns. The modified method uses not just the ORDER BY columns, but all the columns used in the query.

The optimizer selects which filesort algorithm to use. Prior to MySQL 4.1, it uses the original algorithm. As of MySQL 4.1, it normally uses the modified algorithm except when BLOB or TEXT columns are involved, in which case it uses the original algorithm.

The original filesort algorithm works as follows:

  1. Read all rows according to key or by table scanning. Rows that do not match the WHERE clause are skipped.
  2. For each row, store a pair of values in a buffer (the sort key and the row pointer). The size of the buffer is the value of the sort_buffer_size system variable.
  3. When the buffer gets full, run a qsort (quicksort) on it and store the result in a temporary file. Save a pointer to the sorted block. (If all pairs fit into the sort buffer, no temporary file is created.)
  4. Repeat the preceding steps until all rows have been read.
  5. Do a multi-merge of up to MERGEBUFF (7) regions to one block in another temporary file. Repeat until all blocks from the first file are in the second file.
  6. Repeat the following until there are fewer than MERGEBUFF2 (15) blocks left.
  7. On the last multi-merge, only the pointer to the row (the last part of the sort key) is written to a result file.
  8. Read the rows in sorted order by using the row pointers in the result file. To optimize this, we read in a big block of row pointers, sort them, and use them to read the rows in sorted order into a row buffer. The size of the buffer is the value of the read_rnd_buffer_size system variable. The code for this step is in the sql/records.cc source file.

One problem with this approach is that it reads rows twice: One time when evaluating the WHERE clause, and again after sorting the pair values. And even if the rows were accessed successively the first time (for example, if a table scan is done), the second time they are accessed randomly. (The sort keys are ordered, but the row positions are not.)

The modified filesort algorithm incorporates an optimization such that it records not only the sort key value and row position, but also the columns required for the query. This avoids reading the rows twice. The modified filesort algorithm works like this:

  1. Read the rows that match the WHERE clause.
  2. For each row, record a tuple of values consisting of the sort key value and row position, and also the columns required for the query.
  3. Sort the tuples by sort key value
  4. Retrieve the rows in sorted order, but read the required columns directly from the sorted tuples rather than by accessing the table a second time.

Using the modified filesort algorithm, the tuples are longer than the pairs used in the original method, and fewer of them fit in the sort buffer (the size of which is given by sort_buffer_size). As a result, it is possible for the extra I/O to make the modified approach slower, not faster. To avoid a slowdown, the optimization is used only if the total size of the extra columns in the sort tuple does not exceed the value of the max_length_for_sort_data system variable. (A symptom of setting the value of this variable too high is that you should see high disk activity and low CPU activity.)

[edit] Bulk Insert

The logic behind bulk insert optimization is simple.

Instead of writing each key value to B-tree (that is, to the key cache, although the bulk insert code doesn't know about the key cache), we store keys in a balanced binary (red-black) tree, in memory. When this tree reaches its memory limit, we write all keys to disk (to key cache, that is). But since the key stream coming from the binary tree is already sorted, inserting goes much faster, all the necessary pages are already in cache, disk access is minimized, and so forth.

[edit] How MySQL Does Caching

MySQL has the following caches. (Note that the some of the filenames contain an incorrect spelling of the word cache.)

A shared cache for all B-tree index blocks in the different NISAM files. Uses hashing and reverse linked lists for quick caching of the most recently used blocks and quick flushing of changed entries for a specific table. (mysys/mf_keycash.c)

This is used for quick scanning of all records in a table. (mysys/mf_iocash.c and isam/_cash.c)

This holds the most recently used tables. (sql/sql_base.cc)

For quick lookup (with reverse name resolving). This is a must when you have a slow DNS. (sql/hostname.cc)

To allow quick change between databases, the last used privileges are cached for each user/database combination. (sql/sql_acl.cc)

Many uses of GROUP BY or DISTINCT cache all found rows in a HEAP table. (This is a very quick in-memory table with hash index.)

For every full join in a SELECT statement the rows found are cached in a join cache. (A full join here means there were no keys that could be used to find rows for the next table in the list.) In the worst case, one SELECT query can use many join caches.

[edit] How MySQL Uses the Join Buffer Cache

Basic information about the join buffer cache:

Assume you have the following join:

Table name      Type
t1              range
t2              ref
t3              ALL

The join is then done as follows:

- While rows in t1 matching range
 - Read through all rows in t2 according to reference key
  - Store used fields from t1, t2 in cache
  - If cache is full
    - Read through all rows in t3
      - Compare t3 row against all t1, t2 combinations in cache
        - If row satisfies join condition, send it to client
    - Empty cache

- Read through all rows in t3
 - Compare t3 row against all stored t1, t2 combinations in cache
   - If row satisfies join condition, send it to client

The preceding description means that the number of times table t3 is scanned is determined as follows:

S = size-of-stored-row(t1,t2)
C = accepted-row-combinations(t1,t2)
scans = (S * C)/join_buffer_size + 1

Some conclusions:


A slightly more user-friendly explanation of join buffering is available here: http://s.petrunia.net/blog/?p=18

[edit] How MySQL Handles FLUSH TABLES

[edit] Full-text Search

MySQL uses Ranking with Vector Spaces for ordinary full-text queries.

Rank, also known as relevance rank, also known as relevance measure, is a number that tells us how good a match is.

Vector Space, which MySQL sometimes calls "natural language", is a well-known system based on a metaphor of lines that stretch in different dimensions (one dimension per term) for varying distances (one distance unit per occurrence of term). The value of thinking of it this way is: once you realize that term occurrences are lines in a multi-dimensional space, you can apply basic trigonometry to calculate "distances", and those distances are equatable with similarity measurements. A comprehensible discussion of vector space technology is here: http://www.miislita.com/term-vector/term-vector-1.html. And a text which partly inspired our original developer is here: ftp://ftp.cs.cornell.edu/pub/smart/smart.11.0.tar.Z ("SMART").

But let's try to describe the classic formula:

w = tf * idf

This means "weight equals term frequency times inverse of document frequency", or "increase weight for number of times term appears in one document, decrease weight for number of documents the term appears in". (For historical reasons we're using the word "weight" instead of "distance", and we're using the information-retrieval word "document" throughout; when you see it, think of "the indexed part of the row".)

For example: if "rain" appears three times in row #5, weight goes up; but if "rain" also appears in 1000 other documents, weight goes down.

MySQL uses a variant of the classic formula, and adds on some calculations for "the normalization factor". In the end, MySQL's formula looks something like:

w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)

Where:

dtf     is the number of times the term appears in the document
sumdtf  is the sum of (log(dtf)+1)'s for all terms in the same document
U       is the number of Unique terms in the document
N       is the total number of documents
nf      is the number of documents that contain the term

The formula has three parts: base part, normalization factor, global multiplier.

The base part is the left of the formula, "(log(dtf)+1)/sumdtf".

The normalization factor is the middle part of the formula. The idea of normalization is: if a document is shorter than average length then weight goes up, if it's average length then weight stays the same, if it's longer than average length then weight goes down. We're using a pivoted unique normalization factor. For the theory and justification, see the paper "Pivoted Document Length Normalization" by Amit Singhal and Chris Buckley and Mandar Mitra ACM SIGIR'96, 21-29, 1996: http://ir.iit.edu/~dagr/cs529/files/handouts/singhal96pivoted.pdf. The word "unique" here means that our measure of document length is based on the unique terms in the document. We chose 0.0115 as the pivot value, it's PIVOT_VAL in the MySQL source code header file myisam/ftdefs.h.

If we multiply the base part times the normalization factor, we have the term weight. The term weight is what MySQL stores in the index.

The global multiplier is the final part of the formula. In the classic Vector Space formula, the final part would be the inverse document frequency, or simply

log(N/nf)

We have replaced it with

log((N-nf)/nf)

This variant is more often used in "probabilistic" formulas. Such formulas try to make a better guess of the probability that a term will be relevant. To go back to the old system, look in myisam/ftdefs.h for "#define GWS_IN_USE GWS_PROB" (i.e. global weights by probability) and change it to "#define GWS_IN_USE GWS_IDF" (i.e. global weights by inverse document frequency).

Then, when retrieving, the rank is the product of the weight and the frequency of the word in the query:

R = w * qf;

Where:

w      is the weight (as always)
qf     is the number of times the term appears in the query

In vector-space speak, the similarity is the product of the vectors.

And R is the floating-point number that you see if you say: SELECT MATCH(...) AGAINST (...) FROM t.

To sum it up, w, which stands for weight, goes up if the term occurs more often in a row, goes down if the term occurs in many rows, goes up / down depending whether the number of unique words in a row is fewer / more than average. Then R, which stands for either Rank or Relevance, is w times the frequency of the term in the AGAINST expression.

The Simplest Possible Example

First, make a fulltext index. Follow the instructions in the "MySQL Full-Text Functions" section of the MySQL Reference Manual. Succinctly, the statements are:

CREATE TABLE articles (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       title VARCHAR(200),
       body TEXT,
       FULLTEXT (title,body)
     );
INSERT INTO articles (title,body) VALUES
     ('MySQL Tutorial','DBMS stands for DataBase ...'),
     ('How To Use MySQL Well','After you went through a ...'),
     ('Optimizing MySQL','In this tutorial we will show ...'),
     ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
     ('MySQL vs. YourSQL','In the following database comparison ...'),
     ('MySQL Security','When configured properly, MySQL ...');

Now, let's look at the index.

There's a utility for looking at the fulltext index keys and their weights. The source code is myisam/myisam_ftdump.c, and the executable comes with the binary distribution. So, if exedir is where the executable is, and datadir is the directory name that you get with "SHOW VARIABLES LIKE 'datadir%'", and dbname is the name of the database that contains the articles table, then this works:

>/exedir/myisam_ftdump /datadir/dbname/articles 1 -d
       b8            0.9456265 1001
       f8            0.9560229 comparison
      140            0.8148246 configured
        0            0.9456265 database
       f8            0.9560229 database
        0            0.9456265 dbms
        0            0.9456265 mysql
       38            0.9886308 mysql
       78            0.9560229 mysql
       b8            0.9456265 mysql
       f8            0.9560229 mysql
      140            1.3796179 mysql
       b8            0.9456265 mysqld
       78            0.9560229 optimizing
      140            0.8148246 properly
       b8            0.9456265 root
      140            0.8148246 security
       78            0.9560229 show
        0            0.9456265 stands
       b8            0.9456265 tricks
        0            0.9456265 tutorial
       78            0.9560229 tutorial
       f8            0.9560229 yoursql

Let's see how one of these numbers relates to the formula.

The term 'tutorial' appears in document 0. The full document is "MySQL Tutorial / DBMS stands for DataBase ...". The word "tutorial" appears once in the document, so dtf = The word "for" is a stopword, so there are only 5 unique terms in the document ("mysql", "tutorial", "dbms", "stands", "database"), so U = Each of these terms appears once in the document, so sumdtf is the sum of log(1)+1, five times. So, taking the first two parts of the formula (the term weight), we have:

(log(dtf)+1)/sumdtf * U/(1+0.0115*U)

which is

(log(1)+1)/( (log(1)+1)*5) * 5/(1+0.0115*5)

which is

0.9456265

which is what myisam_ftdump says. So the term weight looks good.

Now, what about the global multiplier? Well, myisam_ftdump could calculate it, but you'll see it with the mysql client. The total number of rows in the articles table is 6, so N = And "tutorial" occurs in two rows, in row 0 and in row 78, so nf = So, taking the final (global multiplier) part of the formula, we have:

log((N-nf)/nf)

which is

log((6-2)/2)

which is

0.6931472

So what would we get for row 0 with a search for 'tutorial'? Well, first we want w, so: Multiply the term weight of tutorial (which is 0.9456265) times the global multiplier (which is 0.6931472). Then we want R, so: Multiply w times the number of times that the word 'tutorial' appears in the search (which is 1). In other words, R = Here's the proof:

mysql> select round(0.9456265 * 0.6931472 * 1, 7) as R;
 +-----------+
 | R         |
 +-----------+
 | 0.6554583 |
 +-----------+
 1 row in set (0.00 sec)
 
 mysql> select round(match(title,body) against ('tutorial'), 7) as R
    -> from articles limit 1;
 +-----------+
 | R         |
 +-----------+
 | 0.6554583 |
 +-----------+
 1 row in set (0.00 sec)
 

You'll need memory

The MySQL experience is that many users appreciate the full-text precision or recall, that is, the rows that MySQL returns are relevant and the rows that MySQL misses are rare, in the judgment of some real people. That means that the weighting formula is probably justifiable for most occasions. Since it's the product of lengthy academic research, that's understandable.

On the other hand, there are occasional complaints about speed. Here, the tricky part is that the formula depends on global factors -- specifically N (the number of documents) and nf (the number of documents that contain the term). Every time that insert/update/delete occurs for any row in the table, these global weight factors change for all rows in the table.

If MySQL was a search engine and there was no need to update in real time, this tricky part wouldn't matter. With occasional batch runs that redo the whole index, the global factors can be stored in the index. Search speed declines as the number of rows increases, but search engines work.

However, MySQL is a DBMS. So when updates happen, users expect the results to be visible immediately. It would take too long to replace the weights for all keys in the fulltext index, for every single update/insert/delete. So MySQL only stores the local factors in the index. The global factors are more dynamic. So MySQL stores an in-memory binary tree of the keys. Using this tree, MySQL can calculate the count of matching rows with reasonable speed. But speed declines logarithmically as the number of terms increases.

Weighting in boolean mode

The basic idea is as follows: In an expression of the form A or B or (C and D and E), either A or B alone is enough to match the whole expression, whereas C, D, and E should all match. So it's reasonable to assign weight 1 to each of A, B, and (C and D and E). Furthermore, C, D, and E each should get a weight of 1/3.

Things become more complicated when considering boolean operators, as used in MySQL full-text boolean searching. Obviously, +A +B should be treated as A and B, and A B - as A or B. The problem is that +A B can not be rewritten in and/or terms (that's the reason why thisextendedset of operators was chosen). Still, approximations can be used. +A B C can be approximated as A or (A and (B or C)) or as A or (A and B) or (A and C) or (A and B and C). Applying the above logic (and omitting mathematical transformations and normalization) one gets that for +A_1 +A_2 ... +A_N B_1 B_2 ... B_M the weights should be: A_i = N, B_j=1 if N==0, and, otherwise, in the first rewriting approach B_j = B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1)).

The second expression gives a somewhat steeper increase in total weight as number of matched B_j values increases, because it assigns higher weights to individual B_j values. Also, the first expression is much simpler, so it is the first one that is implemented in MySQL.

[edit] FLOAT and DOUBLE data types and their representation.

The MySQL Reference Manual has a discussion of floating-point numbers in Section 11.2 Numeric Types, including details about the storage. Let us now take up the story from where the MySQL Reference Manual leaves off.

The following discussion concentrates on the case where no display width and decimals are given. This means that FLOAT is stored as whatever the C type float is and REAL or DOUBLE [PRECISION] is stored as whatever the C type double is. The field length is selected by the MySQL code.

This document was created when [http://bugs.mysql.com/4457 Bug#4457] (Different results in SQL-Statements for the same record) was fixed at the end of August 2004. Until then there was some confusion in the double-to-string conversion at different places in the code.

The bugfix for [http://bugs.mysql.com/4937 Bug#4937] (INSERT + SELECT + UNION ALL + DATE to VARCHAR(8) conversion problem) produced a conversion function which was a promising approach to the conversion problems. Unfortunately it was only used for direct field conversions and not for function results etc. It did not take small numbers (absolute value less than 1) and negative numbers into account. It did not take the limited precision of float and double data types into account. The bugfix was developed in two steps: The first attempt looked like this (in principle):

length= sprintf(buff, "%.*g", field_length, nr);
if (length > field_length)
  length= sprintf(buff, "%.*g", field_length-5, nr);

If the libc conversion produces too many characters, the precision is reduced by the space required for the scientific notation (1.234e+05). Thus the printf() conversion is forced to switch to the scientific notation, since the value would not fit otherwise. Or, if it was scientific already, the precision is reduced and also uses less space. I left out some important stuff around limit checking just to show the idea. This simple algorithm should work quite well in most cases, but has been discarded for the sake of performance. The double call to the slow printf() conversion %g didn't seem reasonable, though it would only be used for extreme values and small fields. During my explorations of the code I didn't find places where float or double were to be converted into small fields. Remember that I talk only of conversions where field length and precision are not given. In this case a sufficient field length is selected at several places, except for a bug where it was selected wrongly. If a field length is given, a different conversion is used anyway. But since the code is quite complex, I don't claim to grasp it in full, and therefore may be in error. So let us look further:

The second attempt to fix the bug looked like this:

bool use_scientific_notation=TRUE;
if (field_length < 32 && nr > 1)
{
  double e[]={1, 1e1, 1e2, 1e4, 1e8, 1e16 }, p=1;
  for (int i=sizeof(e), j=1<<i-- ; j; i--,  j>>=1 )
  {
    if (field_length & j)
      p*=e[i];
  }
  use_scientific_notation=(p < nr);
}
length= sprintf(buff, "%.*g", use_scientific_notation ?
                              field_length-5 : field_length, nr);

Here we evaluate if the string representation of a given number fits into field_length characters. If not, we reduce the precision to make it fit. Again, I left out important details. For example, the evaluation is done only once per field for the sake of performance. The downside here is the unconditional reduction of precision for field length > 31 (which doesn't really matter), for negative numbers and for small numbers (absolute value less than 1).

Both algorithms do not take the limited precision of float and double values into account. This could lead to conversions with ridiculous bogus precision output. For example a value of 0.7 converted with %.30g will give a lot of digits, which pretend to tell about deviations from the value 0.7 and are completely absurd: 0.699999988079071044921875. To understand more about the %g conversion, I quote from a comment introduced in the source at the beginning of bugfixing #4937 (this comment was removed because it mainly describes, how the printf() conversion works, but I think it's valuable enough to include it here):

/*
   Let's try to pretty print a floating point number. Here we use
   '%-*.*g' conversion string:
     '-' stands for right-padding with spaces, if such padding will take
   place
     '*' is a placeholder for the first argument, field_length, and
   signifies minimal width of result string. If result is less than
   field length it will be space-padded. Note, however, that we'll not
   pass spaces to Field_string::store(const char *, ...), due to
   strcend in the next line.
     '.*' is a placeholder for DBL_DIG and defines maximum number of
   significant digits in the result string. DBL_DIG is a hardware
   specific C define for maximum number of decimal digits of a floating
   point number, such that rounding to hardware floating point
   representation and back to decimal will not lead to loss of
   precision. That is: if DBL_DIG is 15, number 123456789111315 can be
   represented as double without precision loss.  As one can judge from
   this description, choosing DBL_DIG here is questionable, especially
   because it introduces a system dependency.
     'g' means that conversion will use [-]ddd.ddd (conventional) style,
   and fall back to [-]d.ddde[+|i]ddd (scientific) style if there is not
   enough space for all digits.
   Maximum length of result string (not counting spaces) is (I guess)
   DBL_DIG + 8, where 8 is 1 for sign, 1 for decimal point, 1 for
   exponent sign, 1 for exponent, and 4 for exponent value.
   XXX: why do we use space-padding and trim spaces in the next line?
 */
 sprintf(to,"%-*.*g",(int) field_length,DBL_DIG,nr);
 to=strcend(to,' ');
 

There is one small misapprehension in the comment. %g does not switch to scientific notation when there is 'not enough space for all digits'. As the commentator says, the field length gives the minimal output length. printf() happily outputs more characters if required to produce a result with 'precision' digits. In fact it switches to scientific when the value can no longer be represented by 'precision' digits in conventional notation. The man page says "Style e is used if the exponent from its conversion is less than -4 or greater than or equal to the precision." In explanation, a precision of 3 digits can print a value of 345 in conventional notation, but 3456 needs scientific notation, as it would require 4 digits (a precision of 4) in conventional notation. Thus, it is printed as 3.46e+03 (rounded).

Since we don't want spaces in the output, we should not give a field length, but always use "%.*g". However, the precision matters, as seen above. It is worth its own paragraph.

Since MySQL uses the machine-dependent binary representation of float and double to store values in the database, we have to care about these. Today, most systems use the IEEE standard 754 for binary floating-point arithmetic. It describes a representation for single precision numbers as 1 bit for sign, 8 bits for biased exponent and 23 bits for fraction and for double precision numbers as 1-bit sign, 11-bit biased exponent and 52-bit fraction. However, we can not rely on the fact that every system uses this representation. Luckily, the ISO C standard requires the standard C library to have a header float.h that describes some details of the floating point representation on a machine. The comment above describes the value DBL_DIG. There is an equivalent value FLT_DIG for the C data type float.

So, whenever we print a floating-point value, we must not specify a precision above DBL_DIG or FLT_DIG respectively. Otherwise we produce a bogus precision, which is wrong. For the honor of the writer of the first attempt above, I must say that his complete algorithm took DBL_DIG into account, if however only for the second call to sprintf(). But FLT_DIG has never been accounted for. At the conversion section of the code, it was not even known whether the value came from a float or double field.

My attempt to solve the problems tries to take all this into account. I tried to concentrate all float/double-to-string conversions in one function, and to bring the knowledge about float versus double to this function wherever it is called. This solution managed to keep the test suite happy while solving the new problem of [http://bugs.mysql.com/4457 Bug#4457]. Luckily the first problem was not big, as the test cases have been very carefully selected, so that they succeed as long as the machine uses IEEE 754.

Nevertheless, the function is still not perfect. It is not possible to guess how many significant digits a number has. Given that, it is not simple to tell how long the resulting string would be. This applies to numbers with an absolute value smaller then 1. There are probably ways to figure this out, but I doubt that we would win in terms of performance over the simple solution of the first attempt, and besides we might cause new bugs. The compromise taken here is to accept that the resulting string may exceed the destination field length by five characters in the worst case.

if (nr < 0.0)
{
  abs_nr= -nr;
  extra_space= 1;
}
else
{
  abs_nr= nr;
  extra_space= 0;
}
precision= is_float ? FLT_DIG : DBL_DIG;
if (precision > field_length)
  precision= field_length;

if (! initialized)
{
  /* Better switch to scientific too early than too late. */
  double mult;
  mult= 1e0;
  for (length= DBL_DIG; length++)
    mult/= 1e1;
  mult= 1e1 - mult;

  double val;
  val= 1.0;
  for (int idx= DBL_DIG+1; idx++)
  {
    DBUG_PRINT("info",("double_to_string_conv: big[%d] %.*g",
                       idx, DBL_DIG+3, val));
    big_number[idx]= val;
    val*= mult;
  }
  small_number[0]= 1e0;
  small_number[1]= 1e0;
  small_number[2]= 1e0;
  small_number[3]= 1e-1;
  small_number[4]= 1e-2;
  small_number[5]= 1e-3;
  small_number[6]= 1e-4;
  /* %g switches to scientific when exponent < -4. */
  for (int idx= DBL_DIG+1; idx++)
    small_number[idx]= 1e-4;
  initialized= TRUE;
}

use_scientific_notation= (abs_nr != 0.0) &&
                         ((abs_nr >   big_number[precision]) ||
                          (abs_nr < small_number[precision]));

if (use_scientific_notation)
{
  if (((nr >= 0.0) && ((nr >= 1e+100) || (nr <= 1e-100))) ||
      ((nr < 0.0) && ((nr <= -1e+100) || (nr >= -1e-100))))
    extra_space+= 6; /* .e+100 or .e-100 */
  else
    extra_space+= 5; /* .e+99  or .e-99 */
}

if (field_length < extra_space)
  precision= 0;
else if (precision > (field_length - extra_space))
  precision= field_length - extra_space;

length= sprintf(buff, "%.*g", precision, nr);

This solution takes performance into account by initializing the limiting numbers arrays only once into static space. It copes with negative numbers and tries to decide even over small numbers. The latter has only small implications, as the prefix 0.000 is exactly the same size as the postfix e-100. But knowing if scientific notation will be selected by sprintf() allows for saving one digit when the exponent is larger than -100.

The calculations for the big number array are less precise than in the second attempt, but faster. The precision is sufficient for the guess whether sprintf() uses scientific notation. There may be number to field length combinations which exploit the gap, but these won't emerge anyway as I found no situation where this function is called with small field lengths. Remember again that it is not called with user-supplied field lengths.

However in the current stable releases (including gamma) we have some places where the field length is too small by one character. Thus, the precision is sometimes one digit smaller than DBL_DIG would allow for. Consequently, we cannot use the simple algorithm in the stable releases. There is a chance of doing it in a development release, though.

Addendum:

There turned out to be a new solution to the "big number array" problem. We have a statically initialized array log_10, which holds the necessary values. But I did not check whether these values are safe. Even if computed by the compiler, they could carry values slightly above the decimal powers, which would be bad. In this case we needed to initialize by 9.99999999e+xxx, where the number of nines is equal to DBL_DIG. This must be protected by #if DBL_DIG == yy, so that a new DBL_DIG on a new platform is detected. And the array is of limited length. We must at least protect it by a DBUG_ASSERT(sizeof(log_10)/sizeof(log_10[0]) > DBL_DIG).

But all of this is probably completely unnecessary, since we are only speaking of cases where no user-supplied field length is given. So MySQL selects the field length on its own. So it is totally possible, indeed highly desirable, that MySQL selects a field length, which allows for a maximum of precision for all possible values. And these are DBL_DIG+7 or FLT_DIG+6 respectively as far as IEEE 754 is used. In this case we can have values of about +/-1e-307 to +/-1e+308 for double and +/-1e-37 to +/-1e+38 for float. That is, for example -1.<DBL_DIG-1 digits>e+100. For cases where a precision above IEEE 754 is possible, we may need +8 instead. We can detect this with #if DBL_MAX_10_EXP >= So using a field length of DBL_DIG+8 in all cases should be sufficient for a simple sprintf(buff, "%.*g", DBL_DIG, nr) or sprintf(buff, "%.*g", FLT_DIG, nr), respectively. To be safe, we should not use the machine dependent constants everywhere, but instead concentrate them into definitions like these:

 #if (DBL_MAX_10_EXP > 9999) || (DBL_MIN_10_EXP < -9999)
 #  error "Need new definition for UNSPECIFIED_DOUBLE_FIELD_LENGTH"
 #elif (DBL_MAX_10_EXP > 999) || (DBL_MIN_10_EXP < -999)
 #  define UNSPECIFIED_DOUBLE_FIELD_LENGTH (DBL_DIG+8)
 #else
 #  define UNSPECIFIED_DOUBLE_FIELD_LENGTH (DBL_DIG+7)
 #endif
 
 #if (FLT_MAX_10_EXP > 999) || (FLT_MIN_10_EXP < -999)
 #error "Need new definition for UNSPECIFIED_FLOAT_FIELD_LENGTH"
 #elif (FLT_MAX_10_EXP > 99) || (FLT_MIN_10_EXP < -99)
 #  define UNSPECIFIED_FLOAT_FIELD_LENGTH (FLT_DIG+7)
 #else
 #  define UNSPECIFIED_FLOAT_FIELD_LENGTH (FLT_DIG+6)
 #endif

These definitions should be used wherever an item or field of type float or double without an explicit field length specification is encountered. We have to propagate these lengths though all derived items and fields and we have to select the maximum of all field lengths wherever in two or more of them are used in an expression or a function.

We need to treat the precision (DBL_DIG/FLT_DIG) similarly, but have to select the minimum in expressions or functions.

[edit] Threads

Threads in mysqld can run at four different priorities, defined in mysql_priv.h:

      #define INTERRUPT_PRIOR 10
      #define CONNECT_PRIOR    9
      #define WAIT_PRIOR    8
      #define QUERY_PRIOR    6

Some threads try to set their priority; others don't. These calls are passed along to pthread_setschedparam() if the native threading library implements it.

The different threads are:

In InnoDB, all thread management is handled through os/os0thread.c InnoDB's threads are:

InnoDB's internal os_thread_set_priority() function implements three priorities (Background, normal, and high) but only on windows. The function is a no-op on unix.

[edit] Character/Collation Sets

Character sets are used by MySQL when storing information, both to ensure that the information is stored (and returned) in the correct format, but also for the purposes of collation and sorting. Each character set supports one or more collations, and so these are collectively known as Collation Sets, rather than character sets.

Character sets are recorded against individual tables and returned as part of the field data. For example, the MYSQL_FIELD data type definition includes the field charsetnr:

typedef struct st_mysql_field {
  char *name;                 /* Name of column */
  char *org_name;             /* Original column name, if an alias */
  char *table;                /* Table of column if column was a field */
  char *org_table;            /* Org table name, if table was an alias */
  char *db;                   /* Database for table */
  char *catalog;              /* Catalog for table */
  char *def;                  /* Default value (set by mysql_list_fields) */
  unsigned long length;       /* Width of column (create length) */
  unsigned long max_length;   /* Max width for selected set */
  unsigned int name_length;
  unsigned int org_name_length;
  unsigned int table_length;
  unsigned int org_table_length;
  unsigned int db_length;
  unsigned int catalog_length;
  unsigned int def_length;
  unsigned int flags;         /* Div flags */
  unsigned int decimals;      /* Number of decimals in field */
  unsigned int charsetnr;     /* Character set */
  enum enum_field_types type; /* Type of field. See mysql_com.h for types */
} MYSQL_FIELD;

Character set and collation information are specific to a server version and installation, and are generated automatically from the sql/share/charsets/Index.xml file in the source distribution.

You can obtain a list of the available character sets configured within a server by running SHOW COLLATION, or by running a query on the INFORMATION_SCHEMA.COLLATION table. A sample of the information from that table has been provided here for reference.

Collation Id Charset Collation Default Sortlen
64 armscii8 armscii8_bin  ?? 1
32 armscii8 armscii8_general_ci Yes 1
65 ascii ascii_bin  ?? 1
11 ascii ascii_general_ci Yes 1
84 big5 big5_bin  ?? 1
1 big5 big5_chinese_ci Yes 1
63 binary binary Yes 1
66 cp1250 cp1250_bin  ?? 1
44 cp1250 cp1250_croatian_ci  ?? 1
34 cp1250 cp1250_czech_cs  ?? 2
26 cp1250 cp1250_general_ci Yes 1
50 cp1251 cp1251_bin  ?? 1
14 cp1251 cp1251_bulgarian_ci  ?? 1
52 cp1251 cp1251_general_cs  ?? 1
23 cp1251 cp1251_ukrainian_ci  ?? 1
51 cp1251 cp1251_general_ci Yes 1
67 cp1256 cp1256_bin  ?? 1
57 cp1256 cp1256_general_ci Yes 1
58 cp1257 cp1257_bin  ?? 1
29 cp1257 cp1257_lithuanian_ci  ?? 1
59 cp1257 cp1257_general_ci Yes 1
80 cp850 cp850_bin  ?? 1
4 cp850 cp850_general_ci Yes 1
81 cp852 cp852_bin  ?? 1
40 cp852 cp852_general_ci Yes 1
68 cp866 cp866_bin  ?? 1
36 cp866 cp866_general_ci Yes 1
96 cp932 cp932_bin  ?? 1
95 cp932 cp932_japanese_ci Yes 1
69 dec8 dec8_bin  ?? 1
3 dec8 dec8_swedish_ci Yes 1
98 eucjpms eucjpms_bin  ?? 1
97 eucjpms eucjpms_japanese_ci Yes 1
85 euckr euckr_bin  ?? 1
19 euckr euckr_korean_ci Yes 1
86 gb2312 gb2312_bin  ?? 1
24 gb2312 gb2312_chinese_ci Yes 1
87 gbk gbk_bin  ?? 1
28 gbk gbk_chinese_ci Yes 1
93 geostd8 geostd8_bin  ?? 1
92 geostd8 geostd8_general_ci Yes 1
70 greek greek_bin  ?? 1
25 greek greek_general_ci Yes 1
71 hebrew hebrew_bin  ?? 1
16 hebrew hebrew_general_ci Yes 1
72 hp8 hp8_bin  ?? 1
6 hp8 hp8_english_ci Yes 1
73 keybcs2 keybcs2_bin  ?? 1
37 keybcs2 keybcs2_general_ci Yes 1
74 koi8r koi8r_bin  ?? 1
7 koi8r koi8r_general_ci Yes 1
75 koi8u koi8u_bin  ?? 1
22 koi8u koi8u_general_ci Yes 1
47 latin1 latin1_bin  ?? 1
15 latin1 latin1_danish_ci  ?? 1
48 latin1 latin1_general_ci  ?? 1
49 latin1 latin1_general_cs  ?? 1
5 latin1 latin1_german1_ci  ?? 1
31 latin1 latin1_german2_ci  ?? 2
94 latin1 latin1_spanish_ci  ?? 1
8 latin1 latin1_swedish_ci Yes 1
77 latin2 latin2_bin  ?? 1
27 latin2 latin2_croatian_ci  ?? 1
2 latin2 latin2_czech_cs  ?? 4
21 latin2 latin2_hungarian_ci  ?? 1
9 latin2 latin2_general_ci Yes 1
78 latin5 latin5_bin  ?? 1
30 latin5 latin5_turkish_ci Yes 1
79 latin7 latin7_bin  ?? 1
20 latin7 latin7_estonian_cs  ?? 1
42 latin7 latin7_general_cs  ?? 1
41 latin7 latin7_general_ci Yes 1
43 macce macce_bin  ?? 1
38 macce macce_general_ci Yes 1
53 macroman macroman_bin  ?? 1
39 macroman macroman_general_ci Yes 1
88 sjis sjis_bin  ?? 1
13 sjis sjis_japanese_ci Yes 1
82 swe7 swe7_bin  ?? 1
10 swe7 swe7_swedish_ci Yes 1
89 tis620 tis620_bin  ?? 1
18 tis620 tis620_thai_ci Yes 4
90 ucs2 ucs2_bin  ?? 1
138 ucs2 ucs2_czech_ci  ?? 8
139 ucs2 ucs2_danish_ci  ?? 8
145 ucs2 ucs2_esperanto_ci  ?? 8
134 ucs2 ucs2_estonian_ci  ?? 8
146 ucs2 ucs2_hungarian_ci  ?? 8
129 ucs2 ucs2_icelandic_ci  ?? 8
130 ucs2 ucs2_latvian_ci  ?? 8
140 ucs2 ucs2_lithuanian_ci  ?? 8
144 ucs2 ucs2_persian_ci  ?? 8
133 ucs2 ucs2_polish_ci  ?? 8
131 ucs2 ucs2_romanian_ci  ?? 8
143 ucs2 ucs2_roman_ci  ?? 8
141 ucs2 ucs2_slovak_ci  ?? 8
132 ucs2 ucs2_slovenian_ci  ?? 8
142 ucs2 ucs2_spanish2_ci  ?? 8
135 ucs2 ucs2_spanish_ci  ?? 8
136 ucs2 ucs2_swedish_ci  ?? 8
137 ucs2 ucs2_turkish_ci  ?? 8
128 ucs2 ucs2_unicode_ci  ?? 8
35 ucs2 ucs2_general_ci Yes 1
91 ujis ujis_bin  ?? 1
12 ujis ujis_japanese_ci Yes 1
83 utf8 utf8_bin  ?? 1
202 utf8 utf8_czech_ci  ?? 8
203 utf8 utf8_danish_ci  ?? 8
209 utf8 utf8_esperanto_ci  ?? 8
198 utf8 utf8_estonian_ci  ?? 8
210 utf8 utf8_hungarian_ci  ?? 8
193 utf8 utf8_icelandic_ci  ?? 8
194 utf8 utf8_latvian_ci  ?? 8
204 utf8 utf8_lithuanian_ci  ?? 8
208 utf8 utf8_persian_ci  ?? 8
197 utf8 utf8_polish_ci  ?? 8
195 utf8 utf8_romanian_ci  ?? 8
207 utf8 utf8_roman_ci  ?? 8
205 utf8 utf8_slovak_ci  ?? 8
196 utf8 utf8_slovenian_ci  ?? 8
206 utf8 utf8_spanish2_ci  ?? 8
199 utf8 utf8_spanish_ci  ?? 8
200 utf8 utf8_swedish_ci  ?? 8
201 utf8 utf8_turkish_ci  ?? 8
192 utf8 utf8_unicode_ci  ?? 8
33 utf8 utf8_general_ci Yes 1

Note that it is the collation ID, not the character set ID, that is used to identify the unique combination of character set and collation. Thus, when requesting character set information using one of the character set functions in mysys/charset.c, such as get_charset(), different IDs may return the same base character set, but a different collation set.

The following functions provide an internal interface to the collation and character set information, enabling you to access the information by name or ID:

static uint get_collation_number_internal(const char *name)
uint get_collation_number(const char *name)
uint get_charset_number(const char *charset_name, uint cs_flags)
const char *get_charset_name(uint charset_number)
static CHARSET_INFO *get_internal_charset(uint cs_number, myf flags)
CHARSET_INFO *get_charset(uint cs_number, myf flags)
CHARSET_INFO *get_charset_by_name(const char *cs_name, myf flags)
CHARSET_INFO *get_charset_by_csname(const char *cs_name,
                                    uint cs_flags,
                                    myf flags)

The table below details the functions, the key argument that is supplied, and the return value.

Function Supplied Argument Return Value
get_collation_number_internal() Collation name Collation ID
get_collation_number() Collation name Collation ID
get_charset_number() Character set name Collation ID
get_charset_name() Collation ID Character set name
get_internal_charset() Collation ID Character datatype
get_charset() Collation ID Character datatype

An example of using the collation/character set functions is available in the extras/charset2html.c, which outputs an HTML version of the internal collation set table.

[edit] Error flags and functions

The following flags can be examined or set to alter the behavior during error handling:

thd->net.report_error is set in my_message_sql() if the error message was registered. (my_message_sql() is called by my_error(), my_printf_error(), my_message()).

Like net.report_error, but is always set to 1 in my_message_sql() if error was not caught by an error handler. Used by replication to see if a query generated any kind of errors.

Normally an error also generates a warning. The warning can be disabled by setting thd->no_warnings_for_error. (This allows one to catch all error messages generated by a statement)

This is set to in case likes INSERT IGNORE ... SELECT. In this case we ignore all not fatal errors generated by the select.

Set this if we should abort the current statement (and any multi-line statements) because something went fatally wrong. (for example, a stored procedure continue handler should not be able to catch this). This is reset by mysql_reset_thd_for_next_command().

Strict mode flag, which means that we should abort the statement if we get a warning. In the field::store function this changes the warning level from WARN to ERROR. In other cases, this flag is mostly tested with thd->really_abort_on_warning() to ensure we don't abort in the middle of an update with not transactional tables.

If set, we generate warning for field conversations (normal case for INSERT/UPDATE/DELETE). This is mainly set to 0 when doing internal copying of data between fields and we don't want to generate any conversion errors at any level.

Set in case of error in connection protocol or in case of 'kill'. In this case we should abort the query and kill the connection.

Error functions

This function returns 1 if a warning should be converted to an error, like in strict mode when all tables are transactional. The conversion is handled in sql_error.cc::push_warning().

Should be called if we want to abort the current statement and any multi-line statement.

Resets thd->net.report_error and thd->query_error.

[edit] Functions in the mysys Library

Functions in mysys: (For flags see my_sys.h)

Copy file from from to to.

Rename file from from to to.

Delete file name.

Delete from before rename of to to from. Copies state from old file to new file. If MY_COPY_TIME is set, sets old time.

Get and set working directory.

Make a unique temporary file name by using dir and adding something after pfx to make the name unique. The file name is made by adding a unique six character string and TMP_EXT after pfx. Returns pointer to malloc()'ed area for filename. Should be freed by free().

Use instead of open, open-with-create-flag, close, read, and write to get automatic error messages (flag MYF_WME) and only have to test for != MY_NABP</code>).

Same read-interface for streams as for files.

malloc(size,myflag) is mapped to these functions if not compiled with -DSAFEMALLOC.

Writes malloc() info on stdout if compiled with -DSAFEMALLOC.

Change size of file fd to newlength.

Writes message using error number (see mysys/errors.h) on stdout, or using curses, if MYSYS_PROGRAM_USES_CURSES() has been called.

Writes str on stdout, or using curses, if MYSYS_PROGRAM_USES_CURSES() has been called.

Start each program (in main()) with this.

Gives info about program. If infoflag & MY_CHECK_ERROR, prints if some files are left open. If infoflag & MY_GIVE_INFO, prints timing info and malloc() info about program.

Copy state from old file to new file. If MY_COPY_TIME is set, sets old time.

Returns filename of open file.

Copy name of directory from filename.

Test if dir_name is a hard path (starts from root).

Convert dirname according to system. On Windows, changes all characters to capitals and changes '/' to '\'.

Returns pointer to extension in filename.

Format a filename with replacement of library and extension and convert between different systems. The to and name parameters may be identical. Function doesn't change name if name != to. flag may be:

1 Force replace filenames library with 'dsk'
2 Force replace extension with 'form' */
4 Force unpack filename (replace ~ with home directory)
8 Pack filename as short as possible for output to user

All open requests should always use at least open(fn_format(temp_buffer, name, "", "", 4), ...) to unpack home and convert filename to system-form.

Copies directory and extension from name to toname if needed. Copying can be forced by same flags used in fn_format().

Compare if str matches wildstr. wildstr can contain '*' and '?' as wildcard characters. Returns 0 if str and wildstr match.

Get current date in a form ready for printing.

Makes in_pntr to a 5 char long string. All words that sound alike have the same string.

Use caching of keys in MISAM, PISAM, and ISAM. KEY_CACHE_SIZE is a good size. Remember to lock databases for optimal caching.

End key caching.

[edit] Bitmaps

Inside the mysys directory is a file named my_bitmap.c. It contains functions for manipulating bitmaps. Specifically there are functions for setup or teardown (bitmap_init, bitmap_free), for setting and clearing individual bits or whole sections of the bitmap (bitmap_set_bit, bitmap_fast_test_and_set, bitmap_clear_all, bitmap_set_all, bitmap_set_prefix, bitmap_set_above), and for performing comparisons and set operations on two bitmaps (bitmap_cmp, bitmap_intersect, bitmap_subtract, bitmap_union). Bitmaps are useful, so the functions are called from several places (opt_range.cc, slave.cc, mysqld.c, sql_insert.cc, log_event.cc, sql_show.cc) and we're expecting to make more use of them in the next version of MySQL, MySQL 5.1.

There are a few warnings and limitations that apply for the present bitmap implementation. First: the allocation is an integral number of bytes, and it is not possible to determine whether the last few bits are meaningful. Second: the whole bitmap might have to be protected by a mutex for manipulations; this is settable by passing appropriate flag values. Third: the bitmap is allocated with a 'uint' size, which means that ordinarily it can't have more than 2^32 bytes. Fourth: when unioning two bitmaps, they must be of the same size.

Retrieved from "http://forge.mysql.com/wiki/MySQL_Internals_Algorithms"

This page has been accessed 17,028 times. This page was last modified 13:22, 1 August 2007.

Find

Browse
MySQLForge
Main Page
Current events
Recent changes
Random page
Help
Edit
Edit this page
Editing help
This page
Discuss this page
Post a comment
Printable version
Context
Page history
What links here
Related changes
My pages
Special pages
New pages
File list
Statistics
Bug reports
More...