WL#3740: Subquery optimization: Semijoin: Pull-out of inner tables

Affects: Server-6.0 — Status: Complete — Priority: Low

This is a subtask of WL#2980 "Subquery optimization: Semijoin".

It includes:
 - Conversion of subquery IN predicates to semi-join nests (i.e. TABLE_LISTs)
 - Elimination of nested semi-join nests.
 - A procedure that pulls out tables out of semi-join nests when this is 
   possible.
<contents>
1. Task scope
2. SJ applicability criteria
2.1 MAX_TABLES limitation
3. Subquery to semijoin conversion
3.1 Conversion steps and their location
3.2 The conversion procedure
3.1 Nested semi-join nests removal
3.2 Treatment of semi-join nests by other optimizations
4. The Pull-out procedure

</contents>

1. Task scope
=============
This WL entry covers:

- Conversion of subquery IN predicates to semi-join nests (i.e. TABLE_LISTs)
- Elimination of nested semi-join nests.
- A procedure that pulls out tables out of semi-join nests when this is
possible.

After this WL is implemented, the code will be in this state:

- Subqueries that can't be executed with SJ will be handled as before;
- Subqueries that can be executed by SJ and allow to eliminate all
semi-join nests (by pulling out all of the tables) will be handled
correctly.
- Subqueries that can be executed with SJ but do not allow for all tables
to be pulled out will not be executed, a query failure will be reported
(without any special error message).

That is, this WL will leave the subquery implementation in a state where some
subqueries are not handled. The next task, WL#3741, will restore server's
ability to handle all kinds of subqueries.


2. SJ applicability criteria
============================
(WL#2980 has a copy of this section contents)

A subquery predicate is replaced with semi-join nest if the following
conditions are met:

SUBQ-TO-SJ-CRITERIA:
1. Subquery predicate is of the form
(...) IN (SELECT ...)
(...) =ANY (SELECT ...)

2. Subquery is a single SELECT (no UNIONs)

3. Subquery does not have GROUP BY or ORDER BY

4. Subquery does not use aggregate functions or HAVING

5. Subquery predicate is at the AND-top-level of ON/WHERE clause

6. #tables-in-parent-query + #tables-in-subquery < MAX_TABLES

Note that the subquery may have DISTINCT or LIMIT clauses.

A more precise wording for condition #5 is:

The WHERE/ON clause can be converted to "subquery_predicate AND remainder"

while the check will be performed as follows:

Subquery is at the AND-top-level of the clause if we can reach it by starting
from the root Item* and descending into the arguments of Item_cond_and()
functions.


2.1 MAX_TABLES limitation
~~~~~~~~~~~~~~~~~~~~~~~~~
The condition #6 is:

#tables-in-parent-query + #tables-in-subquery < MAX_TABLES

The reason for this limitation is that we must not end up in a situation
where the total number of tables in a join is >= MAX_TABLES. If there are
several candidate subqueries and the total number of tables is >= MAX_TABLES
we'll have to make a choice which of the subqueries to convert to semi-joins.

If the "children" subqueries have "grandchildren" subqueries we may also have
a choice between pulling grandchild into the child or child into the parent.

We are not able to make an educated choice due to lack of information, in
particular we don't know subquery's output cardinality. We'll settle for this
set of preferences, in the order of their importance:

Parents/children
1. Pull the most deeply nested subqueries first;

Siblings
2.1. Prefer correlated subqueries over uncorrelated;
2.2. Prefer subqueries that have greater number of outer tables;

The preferences place some restrictions on how the subqueries to semi-join
nests conversion will need to be organized.
- The transformation must proceed as a depth-first traversal.
- In order to make the choice we will first need to run JOIN::prepare() for
all "sibling" subqueries, then collect and compare all candidates, and then
perform the transformation.

3. Subquery to semijoin conversion
==================================

3.1 Conversion steps and their location
---------------------------------------

Currently the execution proceeds according to this scenario:

parent_join->prepare()
{
where->fix_fields()
{
Item_subselect->fix_fields()
{
engine->prepare()
{
child_join->prepare()
{
// Preparation part #1 (fix_fields(), etc)
[row|single_value]_select_transformer(child_join)
{
Inject predicates into the child select,
convert IN to EXISTS
}
[ select_lex->fix_prepare_information();
// GROUP BY and PROCEDURE fix-ups; ](**)
}
}
}
}
}

parent_join->optimize();

parent_join->exec()
{
...
subq_predicate->val_int()
{
child_join->optimize();
child_join->exec();
}
}


The two notable properties are
1. IN->EXISTS conversion is performed "early"
2. Subqery's optimization is performed lazily

In order to use the strategy described in section 2.1 we must return out of
child_join->prepare() without doing the IN->EXISTS conversion. Since
subqueries are processed recursively, this applies to parent_join->prepare()
as well.

The actions marked (**) seem to have a prerequisite that we either have
perfomed IN->EXISTS conversion or have decided not to do it at all, so they
will have to be delayed as well.

This gives us the following strategy:

parent_join->prepare()
{
where->fix_fields()
{
Item_subselect->fix_fields()
{
engine->prepare()
{
child_join->prepare()
{
// Preparation part #1 (fix_fields(), etc)
if (this is a subquery that meets SUBQ-TO-SJ-CRITERIA #1-#5)
{
mark it for further processing // see below
}
else
{
[row|single_value]_select_transformer(child_join);
select_lex->fix_prepare_information();
// GROUP BY and PROCEDURE fix-ups;
}
}
}
}
}
}

if (this a top-level-select)
{
fix_subqueries()
{
// 1. fix children
for each marked subquery S
S->fix_subqueries();

// 2. Pick which subqueries to convert, and convert as many as we can
// Materialization-check:
// 3. Not coded as part of this WL: Pick which subqueries are to be
// handled via Materialization.
// 4. Finalize join->prepare() for the subqueries that we haven't
// converted:
for every marked subquery that wasn't processed
{
select_lex->fix_prepare_information();
// GROUP BY and PROCEDURE fix-ups;
}
}
}
parent_join->optimize();
parent_join->exec();
...

Things to note:
- The above is not completely in line with WL#1110
"Subquery optimization: Materialization". Timour's intent was to have the
"Materialization-check" (see above) at the beginning of JOIN::optimize()
and we've moved it out and ahead (but didn't move it over anything
significant?)
- In the current code subqueries are optimized lazily (if the subquery is
never invoked it is never optimized). For subqueries executed by SJ this
property will be lost as such subqueries are optimized/executed together
with their parent select.
- Conversion to semi-join nests will occur recursively, in depth-first
fashion.

3.2 The conversion procedure
----------------------------

In order to turn subquery predicate into a semi-join nest we'll need to do
the following:

* In the parent's WHERE/ON clause, replace Item_subselect with Item_int(1).
An alternative approach is to remove the Item_subselect item, i.e:
- replace it with NULL if the said item is the complete WHERE/ON clause,
or
- remove it from the list of arguments of its parent Item_cond_and and
remove the Item_cond_and if it becomes "unary" AND.
We'll make a choice between the two approaches in LLD or at coding time.

* Walk through child's tables and adjust table->map

* Adjust the parent_join->tables counter

* Create a TABLE_LIST join nest that describes the semi-join (its exact
definition will be given in the LLD)

* Attach this TABLE_LIST as a child of parent select's global TABLE_LIST or
of the appropriate outer join nest.

* Move the subquery's WHERE into semi-join's ON condition.
- For subquery's WHERE/ON conditions, recalculate all attributes that
involve table bitmaps:
= call update_used_tables()
= do something to update not_null_tables()
= adjust other attributes as necessary (to be covered in the LLD)

* Insert the IN equalities into semi-join's ON condition, i.e for
(oe1 .. oeN) IN (SELECT ie1, ... ieN)
create
oe1=ie1 AND ... AND oeN=ieN
and insert it into the semi-join's ON condition.

Note: the question whether semijoin's ON condition should be in
TABLE_LIST::on_expr is to be resolved in the LLD.
- Pro: this is the "natural place"
- Contra: all over the code the (on_expr!=NULL) check is used to check if
this is an outer join.

* Do nothing for "Item_refs".
When a subquery is located in the WHERE clause, references "out of the
subquery" are Item_direct_ref objects which are just call redirectors, and
consequently can be simply ignored.
(The reason for their creation is that we must have
Item_direct_ref(
Item_field(outer_tbl.col)).used_tables() == OUTER_REF_TABLE_BIT
while
Item_field(outer_tbl.col).used_tables() = outer_tbl.map
).

* (TODO: if any of the execution strategies will need more information to be
saved we could cache it here. One candidate is the is_correlated flag)

The above list is not inclusive. Details to be covered in the LLD.

Once these actions are performed we can discard the child st_select_*/JOIN
and operate only on the structures that are in the parent join.


3.1 Nested semi-join nests removal
----------------------------------

We do not need to keep nested semi-join TABLE_LISTs to produce correct query
results. This comes from the fact that for some nested semi-join

ot_i SJ (it_k SJ iit_j ON sj_cond2) ON sj_cond1

the result is defined as

all row combinations of ot_i
such that exists a matching row combination it_k
such that exists a matching row combination iit_j

which is the same as

all row combinations of ot_i
such that exist a matching row combination {it_k, iit_j}.

Consequently, we can make simplify_joins() to remove the nested semi-joins.

NOTE: This flattening will destroy e.g. information about whether the
grandchild subselect was uncorrelated. We don't need it now but may
need it for some further optimizations.


3.2 Treatment of semi-join nests by other optimizations
-------------------------------------------------------

This WL code will cause semi-join nests to be seen by the following
optimizations:

Equality Propagation
optimize_cond() can use semi-join nest's ON condition together with parent
select's WHERE condition.

Partition Pruning
For an inner table, its semi-join nest's ON condition must be used.

Aggregate Functions optimization, opt_sum_query call and code around it.
It seems we don't need to do anything here.

Constant table detection
Like with outer joins, constant table detection should detect constant inner
tables. The pull-out procedure will pull them out of the semijoin.

Ref-analysis
update_ref_and_keys() should process semi-join's ON expression. It is ok to
create candidate ref accesses that mix inner and outer tables (e.g. for
outer_tbl.key =inner_tbl.key two KEYUSE elements should be created).


4. The Pull-out procedure
=========================

Consider a semi-join

(ot1 ... otN) SJ (it1 , ... itK) ON sj_cond

Suppose we execute it as inner join. We will violate INNER-UNIQ-REQ when we
construct two row combinations

(ot1.row1 ... otN.row1, it1.rowX, it2.row1, it3.row1 ...)
(ot1.row1 ... otN.row1, it1.rowY, it2.row1, it3.row1, ...)

that have the same rows for outer tables but different rows for the inner
table. We can guarantee that this won't happen if there is a functional
dependency between the inner table and the outer tables:

it1.row = func(ot1.row, ... otN.row)

We can detect and use those two cases of functional dependency:

* the inner table has at most one row
* the inner table can be accessed using eq_ref(consts,outer-tables).

Presense of functional dependency allows us to pull the table out of the
semi-join nest.

The pull-out can be performed once we've detected the constant tables and
ran update_ref_and_keys(). These two actions are performed inside
make_join_statistics():

make_join_statistics()
{
...
update_ref_and_keys();
...
do {
for each non-const table
if (can make table constant)
const_tables |= table;
} while(const_tables was widened);
...
// do range analysis
}

We can use analogous approach to pull tables out of the semi-join nests:

for (each semi-join nest)
{
// Action #1: Mark the constant tables to be pulled out
pulled_tables= {};
for (each table T in semi-join nest)
{
if (T has at most one row)
{
pulled_tables.insert(T);
}
}

// Action #2: Find which tables we can pull out based on
// update_ref_and_keys() data. Note that pulling one table out
// can allow us to pull out some other tables too.
do {
pulled_a_table= FALSE;
for each possible ref access to some inner table T
{
if (ref access is eq_ref &&
all used tables are outer tables)
{
pulled_a_table=TRUE;
pulled_tables.insert(T)
}
}
} while (pulled_a_table);

// Remove pulled_tables from the join nest;
if (we've removed all of the tables)
destroy the nest;
else
{
//we're not capable of executing semi-join nests yet
report query failure;
}
}

NOTE:
It is possible that the above may be accomplished by another strategy: create
a graph of functional dependencies, topologically sort the graph, and then
retrieve all const tables in one pass.
This is potentially more efficient/elegant but it is harder to do:
- "standard" topological sort requires the graph to be acyclic. Functional
dependency graph is not (and there seems to be no trivial way to remove
the cycles).
- The graph edges are actually "multi-edges", they have one source and
several destinations.
This seems to make building the graph too complicated and not worth it?

<contents>
1. SJ applicability check
1.1 Finding out if we at the AND-top-level of ON/WHERE
2. Conversion
3. Interplay with other optimizations
4. Nested semi-joins removal
5. Table pull-out
6. Testcases
7. Schedule
8. Relationship with Prepared Statements
8.1 subquery to semi-join nest conversion
8.2 Nested semi-joins flattening
8.3 Table pullout
8.3.1 Implementation of per-execution table pull-out


</contents>


1. SJ applicability check
=========================

SUBQ-TO-SJ-CRITERIA (copying from the HLS) are:

1. Subquery predicate is of the form (...) IN/=ANY (SELECT ...)
2. Subquery is a single SELECT (no UNIONs)
3. Subquery does not have GROUP BY or ORDER BY
4. Subquery does not use aggregate functions or HAVING
5. Subquery predicate is at the AND-top-level of ON/WHERE clause
6. #tables-in-parent-query + #tables-in-subquery < MAX_TABLES

These conditions will be checked as follows:
#1: check if the subquery is an Item_in_subselect
#2: test(master_unit()->children == 1)
#3: single 'if'
#4: single 'if'

1.1 Finding out if we at the AND-top-level of ON/WHERE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The condition #5 is:

Subquery predicate is at the AND-top-level of ON/WHERE clause:

Currently it is not possible to determine out of item->fix_fields() if we're
at the AND-top-level or not.

We can check this as follows:
* Add THD::marker= 0;

* In setup_conds(), for WHERE/ON conditions do:
thd->marker= MAGIC_NO;
cond->fix_fields();
thd->marker= 0;

* In Item_cond_and:
- Save thd->marker into local variable and restore after each
arg_i->fix_fields() call.

* In Item_cond_or::fix_fields() and Item_func::fix_fields():
thd->marker= 0;

* In Item_subselect::fix_fields():
if (thd->marker == MAGIC_NO)
{ we're at AND-top-level of the WHERE or ON }

This will also check condition #6.

We'll collect a list of Item_subselects that are to be converted to
semi-joins. The list will be made of
in class JOIN: Item_subselect** sj_subqs;
in Item_subselect: Item** next;

We'll store pointers-to-pointers as we will need to replace Item_subselect
with Item_int(1) later on.

Startegy for the satisfying #6 is described in the HLD.


2. Conversion
=============

The subquery predicate will be converted to Item_int(1).

When wrapping the tables into semi-join nest, we'll need to keep the following
invariants:

* The semi-join nest will have
sj_nest->nested_join = {NESTED_JOIN structure with the list of the
children};
sj_nest->join_list= { table list this semi-join is in. This is either
embedding->nested_join->list or
select_lex->top_level_list };
sj_nest->next_leaf= NULL ; // nest can't be leaf

* When we insert a semi-join nest into the subquery we'll need to make sure
that all leaves (i.e. base tables) are connected via TABLE_LIST::next_leaf
chain.
* ::next_local is same as ::next_leaf but views are considered leaves.
* ::next_global - it seems we don't need to update this.

NOTE Look at information_schema and see what's with the Gluh's in_subquery
flag.

Re semi-join condition we have those considerations:
- semi-join nests need to be distinguishable from outer join nests and
from inner join nests.
- it seems there is no merit in placing semi-join's ON condition into the
sj_nest->on_expr, as on_expr is used to identify outer joins.

- Igor's idea is to add semi-joins' ON condition into the WHERE (and use some
flag in TABLE_LIST to tell semi-join nests from inner join nests).
This makes dealing with other optimizations simpler. On the other hand we
might need semi-join's ON condition for some forthcoming optimizations.


3. Interplay with other optimizations
=====================================

Make those optimizations treat semi-join nest's ON conditions as if they were
AND-parts of the WHERE:
* Equality Propagation
* Ref-analysis
* Constant table detection

Verify that those optimizations "properly ignore" semi-join nests:
* Aggregate Functions optimization


4. Nested semi-joins removal
============================

We can first do all the outer->inner conversions (and we will not miss
anything due to semi-join nest bounds), and then, at "Flatten nested joins"
phase we can remove the nested semi-joins.


5. Table pull-out
=================
See HLS, nothing to add here.


6. Testcases
============
Create a set of testcases that demonstrate that subqueries that are fully
pulled-out are executed correctly.
Query EXPLAINSs should show that the tables are moved into the upper SELECT.

Create a testcase with a subquery that can't be fully pulled out into the
parent and show that it produces an error.


7. Schedule
===========

2007-03-05 StartOfDay:
Implement the SJ applicability check 8 hrs
Implement the subquery to semi-join nest conversion
procedure. 12 hrs

2007-03-09 EndOfDay:
MS1: subquery predicates are converted to [possibly nested] semi-join
nests.

Implement flattening of nested semi-joins. 12 hrs
Adjust all "other optimizations" (see LLD text) to
handle the semi-join nests. 12 hrs

2007-03-19 EndOfDay:
MS2: semi-join nests are flattened, i.e. everything except the pull-out
prodcedure is done.

Implement PullOut procedure 4 hrs
Create a set of testcases (actually some of testcases 12 hrs
will probably be created for earlier milestones but
the delivery is in this milestone)

2007-03-23 EndOfDay
MS3: WL entry complete, the code is in the state as described in "Task scope"
section of the HLD.

Address some unforeseen problems that will pop up 16 hrs

2007-03-29 EndOfDay
MS4: WL entry complete


8. Relationship with Prepared Statements
========================================

8.1 subquery to semi-join nest conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(See WL#3813 for a general contract between PS runtime and the optimizer)

The conditions listed in SUBQ-TO-SJ-CRITERIA remain true (or false) for the
entire lifetime of the statement, so subquery predicate to semi-join nest
conversion should be done once per statement execution.

To achieve that, we'll need
- Block search/conversion attempts on non-1st execution
- Make sure the TABLE_LIST allocations and Item tree changes are performed on
statement's arena.

8.2 Nested semi-joins flattening
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To be done on per-statement arena.


8.3 Table pullout
~~~~~~~~~~~~~~~~~

1. According to the WL#3813 rule, we can pull out a table permanently if
it is functionally dependent on outer tables. This is because we can count
on corresponding UNIQUE/PK index not to be dropped.

2. We can't permanently pull out a 1-row constant table, because it may have
more than one row on a subsequent execution.

There are also cases when a table can be pulled with rule #1 once a table with
rule #2 was pulled out. It looks like there is little merit in having distinct
"permanent" and "per-execution" pull-out phases. A simpler solution of one
single phase that is invoked for every execution seems to be more appropriate.

8.3.1 Implementation of per-execution table pull-out
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Don't really pull the tables out, leave the SJ-nest as is.

// Join flattening:
for each sj-nest
{
Reset the bitmap of pulled-out tables;
Accumulate a bitmap of tables that are pulled out
if (!all tables are pulled out)
{
abort query with an error;
}
}


// Join optimizer, plan refinement:
Ignore the SJ-nests when processing outer join nests


EOF

You must be logged in to tag this worklog

No Comments yet

Votes

Not yet rated.
You must be logged in to vote.

Watches

0 members are watching this worklog
You must be logged in to track this worklog.

Provide Feedback

Please note:
HTML will be purified, but we allow for a number of HTML tags so that you have the flexibility to decorate your comment text to some extent. The comments allow the following HTML tags:

strong, b, em, blockquote, a, code, pre

To put code into your comment, simply encapsulate your code with
[code language="XXX"][/code], where XXX is any common language, for instance "PHP", "SQL", "C", etc.



You must be logged in to comment