WL#2771: Usage of multi_read_range in nested loop join

Affects: 9.x — Status: Assigned — Priority: Low

With implementation of WL#2126 Multi_read_range, the current
nested loop join implementation can be optimized by reading
several values (ranges) in the inner loop at once. Since the
implementation of WL#2126 supports ranges of any order, duplicates,
and even overlapping ranges, the nested loop implementation
can take any values (of a certain batch size) from the outer 
loop and batch them in one call in the inner loop.

Note added by Trudy Pelzer, 2006-08-06
At the Dev-MT Offsite meeting in Santa Cruz, Brian
and Monty made the following time estimate for this
task:
- hash and merge joins (WL#2241/2771/no WL# yet); 6 mths
* This is three tasks:
** WL#2241 "One-Pass Hash Join"; already in progress (Timour 
estimates 2.5 mths left).
** WL#2771 "Usage of multi_read_range in nested loop join";
partial spec available (assigned to Sergey Petrunia, Monty and 
Brian estimate this is a 2 mth task).
** WL#? (new task, will take 4 mths).
* Current estimate: Brian and Monty agree that these tasks are 
interdependent and should thus take a total of 6 months.
Copyright (c) 2001-2007 by MySQL AB. All rights reserved.

Batched Key Access Method
=========================

Contents:
---------
1. Notation, definitions, assumptions
2. Nested loops join algorithms
2.1 Regular nested loops join
2.2 Batched nested loops join
2.3 Comparing batched and regular NLJ algorithms
2.4. Batched NLJ vs. Blocked NLJ
3. Batched key access procedure
3.1 Usage of the MRR interface in the BKA method
3.2 Possible data layouts in record buffers
3.3. Handling of outer joins and semi-joins
3.4. Mixing Batched NLJ with Blocked NLJ
4. Cost considerations
4.1 Number of records in a buffer
4.2 The number of different keys in a buffer
4.3 The cost of an n-way join using the BKA method
4.4 Optimal memory allocation for buffers
4.5. Search for an execution plan when using BKA
4.6. Open and partly open problems
References


1. Notation, definitions, assumptions
-------------------------------------
In this document we consider join queries over tables T1,...,Tn of the form
SELECT s1,...,sm FROM T1,...,Tn WHERE P(T1,...,Tn) (Q)

We are interested in the left-handed execution plans for such queries, where
all selection and projection operation are pushed down to the leaves of the
plan execution tree as far as possible. In such a plan the right operand for
any join operation is always a relational expression over only one of the
tables T1,...,Tn.
Let Ti (i>1) be the table accessed when evaluating the right operand of the (i-
1)-th join operation. Then the result set yielded by the left operand of this
join operation is called i-th partial join and denoted as Ri.

Any left-handed plan can be represented by:
- the order T1,...,Tn in which the tables are accessed
- methods A1,...,An to access tables T1,...,Tn respectively
- pushdown conditions PT1,...,PTn applied to the
accessed rows of table T1,...,Tn respectively
- pushdown conditions P1,P2,...,Pn applied to the rows of join operations
(P1=PT1)
- projections PJ1,...,PJn applied to results of the
previous selections.

Here is a graphical representation the tree of the relational expression for
this execution plan:

A1-S/P1-PJ1--A2-S/P2-PJ2-- ... --An-S/Pn-PJn
| | |
S/PT1 S/PT2 S/PTn
| | |
T1 T2 ... Tn

We say that condition PTi is a pushdown condition for the query Q pushed to
table Ti if:

- PTi contains no other column references to tables T1,...Tn except
references to table Ti;
- for any tuple t in <T1,...,Tn> we have:
PTi(t) => P(t)
with any interpretation of the predicates and functions occurred in P

Similarly, condition Pk is a pushdown condition for query Q, pushed to tables
T1,...Tk if:
- Pk contains no other column references to tables T1,...Tn except references
to tables T1,...,Tk;
- for any tuple t in <T1,...,Tn> we have:
Pk(t) => P(t).
with any interpretation of the predicates and functions occurred in P.

We say that a pushdown condition Pi over tables T1,...,Tk
is maximal if for any other such condition Pi we have
P(t) => P(t) for any tuple t from <T1,...,Tn> with any
interpretation of the predicates and functions occurred in P.

We assume that the pushdown conditions PT1,...,PTn and P1,...Pn used in query
plans are maximal.

NOTE. How to build maximal pushdown conditions is beyond the scope of this HLS.
Currently MySQL code contains a procedure that allows us to build such
conditions.

If T is a table or a result of a partial join then we denote its cardinality by
|T|.

Consider cardinalities of partial joins R1,...Rn
|R1|,...,|Rn|.

We have
|Rj|=|T1|*...*|Tj|*sel(Pj)
where Pj is the selectivity of the pushdown predicate Pj.
We always have
sel(Pj) <= sel(PT1)*...*sel(PTj).
How to calculate sel(Pj) is beyond the scope of this HLS.

We denote the number of different values for the column a of table T as V(T,a).
The number of different tuples comprised of the values of columns a1,...,am
from table T as V(T,<a1,...,am>).

Assuming Ai is the method to access table Ti we define fan(Ai) as follows:
fan(Ai)=(i>1) ? |Ri|/|R[i-1]| : |Ri|


2. Nested loops join algorithms
-------------------------------

2.1 Regular nested loops join
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The execution of an n-way join of tables T1,...,Tn by the regular Nested Loops
Join (NLJ) algorithm can be represented by the following schema with explicit
nested loops:

t=<>;
/* Table T1 is accessed by access method A1() */
Initialize access of T1 by A1();
DO /* Loop of Level 1 */
t1= Get next record of table T1 accessed by A1();
IF (!t1)
BREAK;
/* PT1 is the condition pushed to table T1 */
IF (!PT1(t1))
CONTINUE;
t= <t,t1>;
IF (!P1(t))
CONTINUE;
/* Table T2 is accessed from T1 by method A2(t1) */
Initialize access of T2 by method A2();
DO /* Nested Loop of Level 2 */
t2= Get next record of table T1 by A2(t);
IF (!t2)
BREAK;
/* PT2 is the condition pushed to table T2 */
IF (!PT2(t2))
CONTINUE;
t= <t,t2>;
IF (!P2(t))
CONTINUE;
...
/* Table Tn is accessed from T1,...T[n-1]
by method An(t1,...,t[n-1]) */
Initialize access of T2 by An(t1,...,t[n-1]);
DO /* Nested Loop of Level n */
tn= Get next record of table Tn by An(t);
IF (!tn)
BREAK;
/* PTn is the condition pushed to table Tn */
IF (!PTn(tn))
CONTINUE;
t= <t,tn>;
IF (!Pn(tn))
CONTINUE;
OUTPUT(t);
ENDDO /* Loop n */
...
ENDDO /* Loop 2 */
ENDDO /* Loop 1 */

This schema with nested loops obviously can be converted into a single call of
the following recursive schema:

PROCEDURE NestedLoopsJoin(
<T1,...,Tn> /* tables to join */
<A1,...,An> /* Method Ai is used to access table Ti */
<PT1,...,PTn> /* PTi is condition pushed to table Ti */
<P1,...,Pn> /* Pi is pushdown condition for i-th
partial join */
n /* number of tables to join */
t, /* partial join record */
k /* number of rows joined in t */
)
BEGIN
IF (k==n)
OUTPUT (t);
RETURN;
ENDIF;
k++;
Initialize access of Tk by Ak(t);
DO
tk= Get next record of table Tk by Ak(t);
IF (!tk)
RETURN;
IF (!PTk(tk))
CONTINUE;
t= <t,tk>;
IF (!Pk(t))
CONTINUE;
NestedLoopsJoin(<T1,...,Tn>,
<A1,...,An>,
<PT1,...,PTn>,
<P1,...,Pn.
n, t, k);
ENDDO
END /* NestedLoopsJoin */


2.2 Batched nested loops join
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When working by the schema of the Batched Nested Loops Join (Batched NLJ)
algorithm we accumulate the records of each partial join of tables T1,...,T[j-
1] in a buffer of a limited size before accessing records of table T[j]:

PROCEDURE BatchedNestedLoopsJoin(
<T1,...,Tn> /* tables to join */
<A1,...,An> /* Method Ai is used to access table Ti */
<PT1,...,PTn> /* PTi is condition pushed to table Ti */
<P1,...,Pn> /* Pi is pushdown condition for i-th
partial join */
<B1,...,B[n-1]> /* Bi is the buffer to save i-th
partial join */
n /* number of tables to join */
j /* number of the join operation */
)
BEGIN
t= 0; /* to store partial join record */
tj= 0; /* to store records read from Tj */
j++;
DO
IF (j < n &&
(!t && buffer is not empty ||
buffer B[j] is full))
BatchedNestedLoopsJoin(<T1,...,Tn>,
<A1,...,An>,
<PT1,...,PTn>,
<P1,...,Pn.
<B1,...,B[n-1]>
n, j);
Clean buffer Bj;
IF (!t)
RETURN;
ENDIF
IF (!tj)
IF (j > 1)
r= Read next record from buffer B[j-1];
IF (!t)
CONTINUE;
ENDIF;
Initialize access of Tj by Aj(t);
ENDIF
tj= Get next record of table Tj by Aj(t);
IF (!tj)
CONTINUE;
IF (!PTj(tj))
CONTINUE;
t= <t,tj>;
IF (!Pj(t))
CONTINUE;
IF (j < n)
Add record t into the buffer B[j];
ELSE
OUTPUT(t);
ENDDO
END /* BatchedNestedLoopsJoin */


2.3 Comparing batched and regular NLJ algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At the first glance we hardly can gain anything when employing Batched NLJ. For
any row from R[i-1] we still have to access the matched rows from Ti access
method Ai.
The accessed rows are appended to row t, filtered through the pushdown
condition Pi, after which the projection operation PJi is applied. In total we
have to perform
|R[i-1]| * fan(Ai) logical I/O operations to build the partial join Ri from the
partial join R[i-1]. With Batched NLJ we have to use additional memory for
buffers. However with batched NLJ at some conditions we could optimize access
to rows of the table Ti. For example we could access rows matched for records
from the buffer in the order of row IDs for MyISAM, and in the order of record
primary keys for InnoDB. In many cases it will give us a significant
performance improvement. Besides if Ai is an index based access method we can
benefit on the number of index accesses. For all occurrence of key k in the
buffer for records from R[i-1] we need only one index access.
Let <a1,...am> be the attributes from R[i-1] used by key access method Ai. Then
the number of index accesses
for records from buffer B[i-1] can be estimated as
T(B[i-1])/V(B[i-1], <a1,...am>).
The value of V(B[i-1], <a1,...am>) is always not less than 1 and we always gain
from less number of index accesses if this value is greater than 1.
Another case with evident benefits of the Batched NJL is provided by engines
with remote execution of access operation Ai. The NDB Cluster engine gives us a
typical example of such engines. There usage of the Batched NLJ methods allows
us to save a lot on transport costs since keys and accessed records are
transported in batch packages.

2.4. Batched NLJ vs. Blocked NLJ
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Blocked NLJ like the Batched NLJ also exploits the idea of accumulating
records of partial joins in separate buffers. When producing records of partial
join Rj by this algorithm we read records from table Tj and try to match them
against the records from partial join R[j-1] that are stored in buffer Bj. As
soon as a match of a record tj from Tj against a record t from the buffer for
Rj records is found and it was checked that the pushdown predicate Pn is true
for <t,tj>, the record <t,tj> is passed to produce records for the whole n-way
join. When its done we check possible matches of the record tj against all
remaining records in the buffer. We move to the next record from Tj when all
records in the buffer have been checked and all possible joins for them have
been produced. After the last record from Tj has produced all possible joins
with itself we refill the buffer from B[j-1] by new records from
R[j-1].
It makes sense to use the Blocked NLJ algorithm when joining records from Tj if
there is no good index-based method to access records of the table Tj and we
have to retrieve the whole table to check join matches.
Blocked NLJ schema benefits from the fact we read each record from Tj once per
each refill of the buffer B[j-1] by records from Rj, while when using an NLJ
schema we read record from Tj for each record from R[j-1].
The Blocked NLJ is a well known algorithm presented in many text-books, in
particular in [1].
The current MySQL code employs this algorithm in the frame of the regular NLJ
schema.
Yet we are not aware of any open description of the Batched NLJ algorithm.


3. Batched key access procedure
--------------------------------

3.1 Usage of the MRR interface in the BKA method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The MRR interface [2] allows us to exploit the optimizations proposed in the
section 2.3. At the same time it allows us to disengage ourselves from the
specifics of how the joined records are accessed.
When using the MRR interface the Batched NLJ procedure can be refined as
follows.

PROCEDURE BatchedNestedLoopsJoin(
<T1,...,Tn> /* tables to join */
<Ai,...,An> /* Index based method Ai is used to access
table Ti */
<PT1,...,PTn> /* PTi is condition pushed to table Ti */
<P1,...,Pn> /* Pi is pushdown condition for i-th
partial join */
<B1,...,B[n-1]> /* Bi is the buffer to save i-th
partial join */
n /* number of tables to join */
j /* number of the join operation */
)
BEGIN
t= 0; /* to store partial join record */
tj= 0; /* to store records read from Tj */
k=0; /* to store access keys */
j++;
Prepare to use keys for records from buffer B[j-1]
to access rows from Ti by index method Ai employing
MRR interface;
DO
IF (j < n &&
(!t && buffer is not empty ||
buffer B[j] is full))
BatchedNestedLoopsJoin(<T1,...,Tn>,
<A1,...,An>,
<PT1,...,PTn>,
<P1,...,Pn>,
<B1,...,B[n-1]>
n, j);
Clean buffer Bj;
IF (!t)
RETURN;
ENDIF
IF (!k)
tj,t= Read next record by MRR interface;
IF (!t)
CONTINUE;
ENDIF
t,k = Read next record from buffer B[j-1] with
the given key value(t,k);
IF (!k)
CONTINUE;
t= <t,tj>
if (!Pj(t))
CONTINUE;
IF (j < n)
Add record t into the buffer B[j];
ELSE
OUTPUT(t);
ENDDO
END /* BatchedNestedLoopsJoin */

The operators used in the pseudo-code above have the following meaning.

<< Prepare to use keys for records from buffer B[j-1]
to access rows from Ti by index method Ai employing
MRR interface >>
searches for equal access keys in the records of buffer
B[j-1], organize such records in subsequences and initialize the process of
getting records from Tj that can be accessed by different keys found in B[j-1].

<< tj,t= Read next record by MRR interface >>
reads the record tj returned by a call to the MRR function multi_range_read
into a buffer for records from Tj and sets t to the first record from B[j-1]
from the sequence of records that contains the key by which tj is accessed.

<< t,k = Read next record from buffer B[j-1] with the given
key value(t,k) >>
returns the record next in the sequence of records from B[j-1] with key k if
there is any remaining records; otherwise sets k to 0.

3.2 Possible data layouts in record buffers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To be able to build sequences of records from the buffer Bj with the same keys
we leave pointers to records at the very end of the buffer. The records
themselves are put into the buffer starting from the beginning of it. As they
are written there in a packed format each record generally can take variable
number of bytes. This happens when we place into the buffer the fields of the
variable size or such fields that can take null values.
If the image of the key can be found in the packed record then the offset to
this image in the record is placed just before the packed record itself unless
the offset is the same for all records.
Thus in this case we have the following data layout in the buffer when putting
records there.

| beginning of Bj
V
[*]|record1:...key1...|[*]|record2:...key2...| ...
| ^ ^ | ^ ^
| | | | | |
+------------ + +-------------+
| |
| | ...|[*]|[*]|
| | | | ^
| +---------------------+ | |
+-------------------------------------------------+ |
end of Bj
(LO1)
If the key we use is compound comprising of several fields then we dont have
its image in the record. In this case the key is built and placed just before
the record. We have to act similarly in the following situations:
- the key is the function of a record field
- the key is a field from one of the previous buffers Bi where i < j.

In these cases we use a different data layout in the buffer Bj.

| beginning of Bj
V
|key1|record1|key2|record2| ...
^ ^
| |
| |
| |
| | ...|[*]|[*]|
| | | | ^
| +-------------------------------+ | |
+------------------------------------------------+ |
end of Bj
(LO2)

Before calling the function multi_range_read_init to initiate the process of
accessing data from the table T[j+1] we build the sorting index for records
keys. As a result the state of the buffer Bj changes to the following one:

| beginning of Bj
V
|key1|record1|key2|record2| ...
^ ^
| |
| |
| |
| | ...|[*]|...|[*]|...|
| | | | ^
| +-----------------------+ | |
+--------------------------------------------+ |
end of Bj
(LO2)

Here the array of pointers to records at the end of the buffer is sorted by the
corresponding key values.
Let ptr(i) denote a pointer a the i-th position on in this array and key(ptr
(i)) denote the key that is associated with the record which the pointer refers
to. Then for any i and j such that i<j we have key(ptr(i))<=key(ptr(j)).
It means, in particular, that pointers referring to the keys with the key value
equal to k form a slice in the array of pointers.

If the length of a key is comparable with the average length of a record and
the number of distinct keys in buffer Bj is just a small fraction of the
number of records then it makes sense to store only one copy of a each key in
the buffer linking the records with the same key in a chain. We could use a
hash table allocating it at the end of buffer Bj to check whether a key has
been already met.
In this case the data layout in the buffer Bj would be like this one:

| linked records and keys --->
+---> ...
|
|key k|[*]first record with key k|
^
| ...
| |
| V
| |[*]i-th record with key k|
| |
| V
| |[*](i+1)-th record with key k|
| | ...
| ... |
+-------+ |
| V
|[*]last record with key k|
^
|
+---------+
|
| ... |[*][*]| ... | ... |[*][*]| ... |
^ | |
| ... |
+--------------------------+
<--- overflow area ---> <--- hash table entries --->
(LO3)

Here we assumes that a regular hashing with separate chaining is applied to
search for the key value occurred in earlier records.
After the buffer has been filled the hash table data structure is replaced by a
sorted array of pointers to the record chains with equal keys.

We can increase the number of records written into buffer Bj if only fields of
records from table Ti are put there while the fields of tables T1,...,T[i-1]
are stored in buffers B1,...,B[i-1].
In this case each record of buffer Bj is preceded by a reference to the fields
of the joined record from the table T[i-1] stored in B[i-1]. Such references
allow us to assemble the result records from the parts belonging to the base
tables T1,...Tn.
Here we have the following data layout assuming separate access keys are placed
just before links to the previous record parts.

| beginning of B[j-1]
|
| ...
V |
|key_j-1_1|[*]record_j-1_1| ...
^ |key_j-1_m|[*|record_j-
m|
| ^
| | ...|[*]|...|
| | | ^
+-------------------+-------------------+ |
^ | end of B[j-1]
| |
+------------+<-----+-------------+
^ | |
| beginning of Bj | | |
| +----------------+ | |
| | +-------+ |
V | | |
|key1|[*]record1|key2|[*]record2| ... |keyi|[*]recordi|
^ ^
| |
| | ...|[*]|...|[*]|...|
| | | | ^
| +--------------------+ | |
+--------------------------------------------+ |
end of Bj
(LO2)

If we write a record t into buffer Bj and this record has been accessed from
several records of buffer B[j-1] then it makes sense to store only one image of
the fields of record t. Other occurrence of records t in Bj could refer to this
image. In this case we have the following data layout:

| beginning of B[j-1]
|
| ...
V |
|key_j-1_1|[*]record_j-1_1| ...
^ |key_j-1_m|[*|record_j-
m|
| ^
| | ...|[*]|...|
| | | ^
+-------------------+-------------------+ |
^ | end of B[j-1]
| |
+------------+<-----+-------------+
^ | |
| beginning of Bj | | |
| +----------------+ | |
| | +-------+ |
V | | |
|key1|[*]record1|key2|[*][*]| ... |keyi|[*]recordi|
^ ^ ^ |
| | | |
| +------------+---+ ...|[*]|...|[*]|...|
| | | | ^
| +--------------------+ | |
+--------------------------------------------+ |
end of Bj
(LO2^)
Note that with layout LO2^ we have a kind of network in buffers B1,...,B[n-1].

3.3. Handling of outer joins and semi-joins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So far we assumed that all join operations in the execution plan are just
regular joins. What if some of them are outer joins or semi-joins?
When some of the join operations are outer joins and even when nested outer
joins operations are used we still can apply the technique with match flags for
nests of inner tables that we employ now in regular NLJ schema. One can get the
idea of this technique from [3].
The technique utilizes special bit flags for nest of inner tables in outer
joins. When using regular NLJ schema we can place this bits into join structure
that are interpreted at execution time. There cant be more that one flag for a
table when using the regular NLJ methods. With the Batched NLJ method we have
to store as many match bit flags as many records we have in the buffer used to
store rows of the table just preceding the nest of inner tables of an outer
join operation (the inner tables here never can be interleaved with other
tables).
With any layout above we can easily find an extra bitin a record
representation that can be used for the match flag.
As to semi-joins MRR functions themselves can take care of producing only one
first match when joining a table used as a right semi-join operand.

3.4. Mixing Batched NLJ with Blocked NLJ
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When performing an n-way join of tables T1,...,Tn we can mix joins executed by
the Batched NLJ schema with those executed by Blocked NLJ schema.
This can easily be seen from the fact that both algorithms are, in a way,
derivatives from a more general NLJ schema that uses buffers of records for
partial joins.
We could even easily incorporate Blocked NLJ method into the BKA procedure if
we introduced an implementation of the multi_range_read_next function from the
MRR interface that iterated through all matches for the records in the buffer
used for the left join operand.
Note.
Here we have to employ a modification of the current implementation of the
Blocked NLJ method that stores in buffers only fields from separate tables
T1,...,T[n-1] and never stores the records from partial joins as packed records.
Like the regular Blocked NLJ method the Batched NLJ allows us to materialize
records of any partial join in a buffer Yet, it makes sense to do without this
materialization in any case.


4. Cost considerations
----------------------

4.1 Number of records in a buffer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let |Bj| designate the size of the buffer Bj and lj stands for the average
length of a record in the buffer.
With data layout LO2 if the size of the keys is equal to kj then we can expect
on average the following number of records in buffer Bj:
|T(Bj)|=|Bj|/(lj+kj+(j>1?1:0)+size(void*))

To calculate the expected number of records in buffer Bj when a hash table is
used we have to know how many different keys are expected for N records.
Lets assume that keys in Bj are built from column a of table Tj. The whole
table Tj contains V(Tj,a) different keys. If the buffer Bj contains N records
then expected number of keys is equal to
V(Bj,a)= V(Tj,a)*(1-(1-1/V(Tj,a))^N)
Let oj be the average number of bytes used for each key
in the hash table.
Then we must have
|Bj|=(lj+size(void*))*N +
V(Tj,a)*(1-(1-1/V(Tj,a))^N)*(kj+oj)
(Here we make a random selection of keys for N records from V(Tj,a) different
values).
The root of the equation above yields the expected number of records in the
buffer.

4.2 The number of different keys in a buffer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assuming that keys in buffer Bj are built from column a of table Tj the
expected number of the different keys in the buffer V(Bj,a) depends on the
expected number of records in the buffer |T(Bj)|:
V(Bj,a)= V(Tj,a)*(1-(1-1/V(Tj,a))^|T(Bj)|)

If keys in buffer Bj are built from column ai belonging to one of the previous
tables Ti (i < j), then the expected number of different keys in Bj can be
calculated as follows.
First calculate the expected number of records in Bi that fill the full buffer
Bj. This number designated as |T(Bi,Bj)| can be extracted from the following
equation
|T(Bi,Bj)|*fan(Ri,Rj)= |T(Bi,Bj)|* |Rj|/|Ri|=|T(Bj)|.
Thus |T(Bi,Bj)|= |T(Bj)|* |Ri|/|Rj|
Now we can calculate V(Bj,a) using |T(Bi,Bj)|:
V(Bj,a)= V(Ti,a)*(1-(1-1/V(Ti,a))^|T(Bi,Bj)|)=
V(Ti,a)*(1-(1-1/V(Ti,a))^(|T(Bj)|* |Ri|/|Rj|))

4.3 The cost of an n-way join using the BKA method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We assume that the sizes of the buffers B1,...,B[n-1] used when performing a
join of tables T1,...,Tn by BKA method are |B1|,...,|Bn-1| correspondingly.
When we employ MRR-interface to produce records for buffer Bj we need an
additional buffer BA[j-1] of the size |BA[j-1]|.
The size of the additional buffer BA[j-1] is returned by a call of the function
multi_range_read_info to which we pass an expected number of different keys.
This function also returns the cost of filling buffer Bj in the following
structure:
typedef st_cost_vect
{
double io_count; /* number of I/O * */
double avg_io_cost; /* cost of an average I/O oper. */
double cpu_cost; /* cost of operations in CPU */
double mem_cost; /* cost of used memory */
double import_cost; /* cost of remote operations */
} COST_VECT
The total cost of producing records from the buffer B[j-1]
is calculated as:
C1j = f1 * io_count[j] * avg_io_cost[j] +
f2 * cpu_cost[j] +
f3 * mem_cost[j] +
f4 * import_cost[j]
where f1, f2, f3, f4 are system variables that allow to tune the cost of usage
of different resources.
The buffer B[j-1] is filled the number of times equal to
R[j-1]/|T(B[j-1]).
So the cost of accessing records of Tj is
Cj= C1j * R[j-1]/|T(B[j-1])
Here we have to add the CPU cost of filling the buffer B[j]
and applying pushdown predicates. This value is proportional to |Rj|. There is
additional CPU cost incurred by the actions to build an index for keys in the
buffer Bj. These actions are performed the number of times equal to R[j-1]/|T
(Bj)|.
We also have to add the cost of memory for the buffer Bj.

4.4 Optimal memory allocation for buffers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose we have to produce an n-way join of tables T1,...Tn
using BKA methods and suppose we have memory of size Mj to be used both for
buffer Bj and buffer BAj.
We dont know how the needed size of the buffer BAj is calculated as its done
in multi_range_read_info. Yet we can assume that this size is proportional to
the number of different keys passed to this function.
|BAj|=c*V(Bj,a).
Let Lj be equal to |Bj|/|T(Bj)|
Thus we have:
V(Bj,a)= V(Tj,a)*(1-(1-1/V(Tj,a))^(|Bj|/Lj))
|BAj|= c * V(Tj,a)*(1-(1-1/V(Tj,a))^(|Bj|/Lj))
Mj-|Bj|= c * V(Tj,a)*(1-(1-1/V(Tj,a))^(|Bj|/Lj))
From the last equation we can find the size of Bj.

Given memory of size M lets partition it into parts of size M1,...,M[n-1] in
the proportion:
M1:M2:...:M[n-1]=(|R1|*L1):(|R2|*L2):...:(|R[n-1]*L[n-1]).
It can be proven that such a partition of memory for buffers is optimal.

Unfortunately the problem of searching for a join order with the minimal cost
becomes very complicated if we allow using buffers of variable sizes M1,...,M[n-
1] such that
M1+...+M[n-1]=M.

Therefore when looking for a good plan we use buffers of fixed predefined sizes
M1,...,M[n-1].
After the best plan is chosen we calculate optimal sizes
of the buffers.
With this strategy we cannot guarantee that we find the cheapest plan. Yet the
cost of the chosen plan will not differ too much from the minimal cost.

4.5. Search for an execution plan when using BKA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whenever we can apply an index-based access method we can use BKA as well. With
fixed sized buffers the cost of every partial join can be estimated and an
extension of the current partial plan with minimal cost can be chosen. It is
guaranteed that this strategy brings us to the cheapest plan for a given table
order and given buffer sizes. If we compare costs of all tables order and
choose one with the plan of a minimal cost then we get the cheapest execution
plan for the query.
Consider the following property (A).
Given two cheapest plans p,p for two join orders T1,...,Tn and T1,...,Tn
such that they coincide at the beginning: Ti=Ti for each i from [1..j] then
the starting segments of these two plans of length j, p[1..j] and p[1..j] are
guaranteed to be the same.
This property holds for simple NLJ plans and for plans that employ BKA method
assuming sizes of the used buffers are fixed.
As it was noted before the property does not hold if we allow to choose the
most optimal partition of given memory for the query execution.
With the property (A) the greedy algorithm preserves its validity. Therefore,
we still can use this search algorithm if we add the BKA method to the set of
possible access methods (as far we are satisfied with a fixed partition of
memory used for buffers).
The plan search algorithm suggested in [4] that takes into account the
possibility of joining tables by the hash join method can be employed without
any changes if we add BKA as well.

4.6. Open and partly open problems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have to be able to get good estimates for |R1|...|Rn| to use formulas above.
This task is not easy by itself. We do provide ways of getting these estimates
in the current code but they may differ significantly from the real numbers.
More statistical data (e.g. histograms) could help here.
Yet we admit that having a good statistical data is equally important for any
access methods.
We would like also to attract an attention to the methods of getting values of V
(Tj,A). In general, attributes from A do not belong to the same table.
Therefore we cant hope to utilize any statistics to calculate such values.
Even in the cases when the attributes from A belong to the same table we are
not guaranteed to have statistics for it, as we may not have an index defined
for table T over these attributes. Statistics/histograms over non-index columns
would help here as well.
If we do not any statistical data for the value of V(Tj,a) where is a column
from table Tj, then we assume that V(Tj,a)=|Tj| (i.e. all values of a are
different). Yet this may bring us to yielding too pessimistic estimates.



References
----------
[1] Database Systems, The complete book,
Garcia-Molina, Ullman, Widom, 2002.
[2] WL#2474: Multi Range Read (MRR) interface.
[3] MySQl 5.0 Reference Manual: 7.2.10. Nested Join Optimization
[4] WL#2241: One-pass Hash Join.

You must be logged in to tag this worklog

No Comments yet

Votes

Not yet rated.
You must be logged in to vote.

Watches

0 members are watching this worklog
You must be logged in to track this worklog.

Provide Feedback

Please note:
HTML will be purified, but we allow for a number of HTML tags so that you have the flexibility to decorate your comment text to some extent. The comments allow the following HTML tags:

strong, b, em, blockquote, a, code, pre

To put code into your comment, simply encapsulate your code with
[code language="XXX"][/code], where XXX is any common language, for instance "PHP", "SQL", "C", etc.



You must be logged in to comment