SE API QFPD
[edit] Storage Engine API Extensions For Query Fragment Pushdown
This document summarizes the ideas and plans for MySQL storage engine API extensions primarily focused on supporting query fragment pushdown.
[edit] Version History
Date Author Comments
04-Sep-2009 Thava Initial version after discussions with Mikael Ronstrom,
Timour and Storage Engine API Team.
21-Jan-2009 Thava serverapi.h updates
05-Feb-2009 Thava serverapi.h updates
13-apr-2009 Thava Renamed the updated api file to aqtdict.h
[edit] Background
In many instances, it is much more efficient to push parts of query conditions to storage engines rather than doing it in server. Such conditions could be either trivial scan filters (e.g: "a = 30") or more complex (e.g. join conditions on different tables). Pushing down the constant scan filter was added recently for cluster storage engine. However, pushing down more complex conditions is required and not yet supported. This requirement is especially critical for cluster storage engine. Without this, all entire tables involved in the query are fetched to the server from different nodes before applying "joins", for example. This could be eliminated by pushing such conditions to storage engines to limit the amount of data shipped across nodes.
Current storage engine API exposes too many internal details, hence too fragile if anything changes internally. We need API to expose the data structure through abstract and stable interfaces. The new API should be very well documented so that it would be easier to understand and use.
[edit] Goals
The primary goals are mentioned below :
- Enable query fragment pushdown(QFPD) to storage engines. The immediate next step should support pushing down join conditions on tables belonging to same storage engine. More complex conditions would follow later. However the design would be easily extensible in future. The API Extensions would make use of easy-to-use and clearly documented interfaces. New AQT (Abstract Query Tree) data structure will be created that will be used as parameters for new API extensions.
- Extend the AQT interfaces that could be potentially be used by both server and/or storage engines. Such use is not only limited to QFPD. The benefit would be-- easier to use APIs across different server modules and storage engines. (This is long term goal).
[edit] Abstract Query Tree
The Abstract Query Tree is the core theme which enables easy-to-use and stable interface between server and storage engine. As a first step, this will be used to pass query fragments to storage engine for pushdown. In longer term, this is envisioned to be extensible and usable within server itself whenever appropriate.
[edit] Worklogs Reference
- See worklogs:
- [http://forge.mysql.com/worklog/task.php?id=4292 WL#4292 Pushdown of query fragments to storage engines]
- [http://forge.mysql.com/worklog/task.php?id=4533 WL#4533 QFPD - Abstract Query tree (AQT) ]
- [http://forge.mysql.com/worklog/task.php?id=4535 WL#4535 QFPD: Storage Engine Pushdown API (SEPD API)]
- [http://forge.mysql.com/worklog/task.php?id=4292 WL#4292 Pushdown of query fragments to storage engines]
- See Also:
- [http://forge.mysql.com/worklog/task.php?id=4536 WL#4536 QFPD: Query decomposition and optimization for query fragments]
- [http://forge.mysql.com/worklog/task.php?id=4537 WL#4537 QFPD: Virtual Tables]
- [http://forge.mysql.com/worklog/task.php?id=3288 WL#3288 Foreign Keys: Storage engine API for foreign key support]
[edit] Main Tasks
The main tasks involved are :
- Clearly document the existing storage engine API: Both in a separate document and in code.
- Define AQT data structure for expressions to pushdown
- Define new storage engine API extensions which make use of AQT for pushdown of join conditions
- Clearly document new API extensions
[edit] WL#4535 scratchpad
[edit] Example applications of AQT designs to imaginary and real storage engines
[edit] Engine capabilities
- Engine 1:
- support only 2-way joins
- all conditions are atomic
- no aggregation, group by, order by
- Engine 2:
- support N-way joins
- all conditions are atomic
- no aggregation, group by, order by
[edit] Application of AQT designs to the engines above
- Multiple iterators approach (by Thava)
- Engine 1:
my_se::pushdown_analysys(aqt *qf)
{
aqt_cond cur_join_cond = qf->get_join_cond();
if (cur_join_cond is not an atomic condition)
qf->mark_non_pd()
return;
aqt_abstract_table *cur_join_operand;
cur_join_operand = qf->get_left();
if (cur_join_operand->type() != AQT_BASE_TABLE)
qf->mark_non_pd()
return;
cur_join_operand = qf->get_right();
if (cur_join_operand is not a base table)
qf->mark_non_pd()
return;
}
- Engine 2:
my_se::pushdown_analysys(aqt *qf)
{
aqt_cond cur_join_cond = qf->get_join_cond();
if (cur_join_cond is not an atomic condition)
qf->mark_non_pd()
return;
.................
}
- Single abstract iterator (Timour)
TODO
[edit] How to expose Table as abstract Table (Serg's Draft)
struct TABLE
{
TABLE_SHARE *s;
TABLE *share_next, **share_prev;
};
class Public_Table : private TABLE {
char *get_name() { return s->table_name.str; }
Public_Table *next_in_list() { return (Public_Table *)next; }
bool GIS_Is_Supported () { return s->db_type->flags & HA_CAN_GEOMETRY; }
Public_Table *next_InnoDB_table() {
TABLE *n;
for (n=next; n; n=n->next)
if (n->s->db_type == innobase_hton_ptr)
return n;
return NULL;
}
void mark_as_pushdownable();
}
//*+++++++****************************************+
...
TABLE *table;
...
handler->AQT_push_down((Public_Table*)table);
Item_func()....
Public_Item_func() {
next_child(iterator_state *state) {
if (state->current == 0)
return arg0;
return arguments[state->current++ - 1];
}
}
[edit] List of Objects and functionalities that need abstraction
1) Table Instance. Currently: TABLE 2) Table Definition: Currently: TABLE_SHARE 3) Column Definition/Instance. Currently: Field 4) From clause TableJoin Specification. Currently: TABLE_LIST 5) Index Information 6) Key Information 7) Table JOIN instance : Currently JOIN 8) Query Execution Plan: Currently list of JOIN_TAB's 9) Expression: Currently: Item (it is fairly OK already)
[edit] Class Hierarchies of important classes
Item Class Hierarchy :
http://forge.mysql.com/w/images/1/16/ClassItem_inherit_graph.png
[edit] NDB Current pushdown implementation
We describe NDB implementation since this will be a good example
input for AQT implementation.
Current NDB pushdown is supported only for single table used as column
filters. There are certain restrictions on the type of filters.
Examples of supported pushdown where conditions:
a > 2 and b > 3; // i.e. select * from t where a>2 and b>3;
a > 2 and b > 3 and c > 4; // composite logical and/or OK
a > 2 or b > 3;
a > 2+3 and b < mod(10, 8); // constant functions are OK
str like '%hello%' ;
mydate > '2008-02-25' ; // date can be compared with const date string
a in (2, 3, 4); // rule is rewritten using OR
a in (2, 3+5, 10); // constant expressions are OK
a between 5 and 10; // rule rewritten using AND
Column Types Supported : string,real,decimal,int, varbin,
date, time, datetime, year
Other column types such as blob types are not supported.
Field Result Types Supported : String, real, int, decimal
The handler infrastructure for this push down condition is provided by handler
class COND *cond_push(const COND *cond) function which is implemented by NDB
engine. The COND structure is same as Item structure.
The handler function ha_ndbcluster::cond_push(COND *cond) which implements the
current pushdown method is fairly complicated (850+ lines of code) though the
problem specification may sound simple. Here is a simplified *pseudo* code of
current NDB implementation (omitting much of lower level details) :
// Return NULL on successful push else return rejected condition
COND * ha_ndbcluster::cond_push(COND *cond)
{
context = Initialize a dynamic context object which specifies
things like
(1) what Item Types are expected e.g. COND_ITEM, FUNC_ITEM
(2) what result types are expected e.g. INT_RESULT,REAL_RESULT
etc.
item->traverse_cond(&ndb_serialize_cond,
(void *) &context, Item::PREFIX);
if (context.supported) return NULL
else return cond;
}
The Item class (and all it's subclasses) provides a traverse_cond() virtual
function to traverse it's tree. For example, for Item_ident, this function will
be a trivial function to visit it's only node. For Item_func, it will invoke
the given function on it's operands in the specified order.
The ndb_serialize_cond() function is used to traverse the condition and
generate filter condition for NDB in it's own list format. We will focus on the
logic of traversing and evaluating of the pushdown condition is supported or not.
Pseudo code :
Step 0:
At the top level, boolean expression is expected.
context := { Expected Item types = FUNC_ITEM, COND_ITEM }
Note: FUNC_ITEM used for <, >, >=, <=, =, NOT, IN, BETWEEN
COND_ITEM used for AND, OR
Internally Item_cond is a subclass of Item_func.
However Item_cond is used only for logical AND/OR operations.
Step 1: Call ndb_serialize_cond() recursively on PREFIX order on item tree:
item->traverse_cond(&ndb_serialize_cond,
(void *) &context, Item::PREFIX);
ndb_serialize_cond(Item *item, Context *ctxt)
{
if ( item type or item result type is not one of currently expected
types as per context)
mark context as "unsupported" and return;
switch item type {
case COND_ITEM : Make sure it is AND/OR nothing else.
Push this AND/OR condition into NDB condition list.
ctxt holds same item types to expect next:
{Expect: FUNC_ITEM, COND_ITEM type nodes}
break;
case FUNC_ITEM : /* It is one of <, >, =, IN, BETWEEN, NOT etc */
[If the operator is one of IN, BETWEEN,
equivalent AND/OR conditions are pushed into NDB list.
The context holds enough information to remember
the states across calls. Further details are omitted
in the interest of brevity regarding IN, BETWEEN ops.]
Push this condition node into NDB list.
Set Context to expect the type of operands next:
{ItemType:string,real,decimal,int,field,varbin;}
{FieldResultType:- String, real, int, decimal}
break;
case FIELD_ITEM : assert field->type() is one of supported types.
Note: blob types are not supported here.
Push this node into NDB list including column num info.
Mark Context to expect:
{No more Field Items} // atmost 1 field per comp
{Expect compatible item types}
e.g. If field->type() is MYSQL_TYPE_LONG,
then expect INT_ITEM
case INT_ITEM : If (no field item has been processed before)
{ set context to expect: {Only Field Item next}}
else { /* both field and INT item processed now
Reset context to accept new conditions.
e.g. a < 5 and b > 3 */
Context := {Expect FUNC_ITEM or COND_ITEM types}
}
push this item into NDB serialized list.
case Item::DECIMAL_ITEM: /* Similar */
case Item::VARBIN_ITEM: /* Similar */
case Item::REAL_ITEM: /* Similar */
case Item::STRING_ITEM: /* Similar */
}
/** @file aqt_dict.h
@mainpage Abstract API Declarations for database objects.
AQT dictionary API provides abstract dictionary interface for database
objects such as Table, Column, Index, Database, etc.
Last updated: 2009 Apr 07
*/
/* Import standard portable declarations like int32 etc from my_global.h */
#include "my_global.h"
namespace aqtdict {
/* Imported class declarations */
class String;
template<typename E> class RList; /* Read only list */
class A_query_expr;
class A_object {
/* place holder here. definition imported later. */
};
/* List of classes exported */
class A_object;
class A_database;
class A_table;
class A_base_table;
class A_view_table;
class A_column;
class A_index;
class A_fkey;
class A_trigger;
class A_table_constraint;
class A_ukey;
class A_pkey;
class A_index_member;
/* Imported function declarations */
extern A_table *aqtdict_get_table(String *dbname, String *tablename);
extern A_base_table *aqtdict_get_base_table(String *dbname,
String *tablename);
extern A_column *aqtdict_get_column(String *dbname, String *tablename,
String *colname);
extern A_database *aqtdict_get_database(String *dbname);
extern A_view_table *aqtdict_get_view(String *dbname, String *viewname);
extern A_trigger *aqtdict_get_trigger(String *dbname, String *tablename,
String *triggername);
/* List of class definitions */
/**
@class A_database
@brief Abstract database class
*/
class A_database : public A_object {
virtual int32 get_database_name(); /*< Get database name */
virtual RList<String> *get_table_names(); /*< Get the list of tables*/
virtual RList<String> *get_view_names(); /*< Get the list of views */
};
/**
@class A_table
@brief Abstract Table base class
*/
class A_table : public A_object {
public :
virtual A_database *get_database(); /*< Get database */
virtual int32 get_table_name(String *str); /*< Get table name */
virtual int32 get_alias_name(String *str); /*< Get alias name */
virtual RList<A_column> *get_columns(); /*< Get column list */
enum A_table_type {
A_BASE_TABLE=1, /*< Base Table */
A_DERIVED_TABLE, /*< Derived Table */
A_VIEW_TABLE /*< View Table or View */
} ;
virtual A_table_type get_table_type(); /*< Get table type */
virtual int32 get_engine_name(String *str); /*< Get storage engine name*/
virtual int32 get_charset_name(String *str);/*< Get charset name */
virtual int32 get_collation_name(String *str);/*< Get collation name */
enum A_row_format {
A_ROW_FORMAT_DEFAULT, A_ROW_FORMAT_DYNAMIC, A_ROW_FORMAT_FIXED,
A_ROW_FORMAT_COMPRESSED, A_ROW_FORMAT_REDUNDANT,
A_ROW_FORMAT_COMPACT };
virtual A_row_format get_row_format(); /*< Get row format. */
};
/* This enum was copied from mysql_com.h */
enum base_field_type { MYSQL_TYPE_DECIMAL, MYSQL_TYPE_TINY,
MYSQL_TYPE_SHORT, MYSQL_TYPE_LONG,
MYSQL_TYPE_FLOAT, MYSQL_TYPE_DOUBLE,
MYSQL_TYPE_NULL, MYSQL_TYPE_TIMESTAMP,
MYSQL_TYPE_LONGLONG,MYSQL_TYPE_INT24,
MYSQL_TYPE_DATE, MYSQL_TYPE_TIME,
MYSQL_TYPE_DATETIME, MYSQL_TYPE_YEAR,
MYSQL_TYPE_NEWDATE, MYSQL_TYPE_VARCHAR,
MYSQL_TYPE_BIT,
MYSQL_TYPE_NEWDECIMAL=246,
MYSQL_TYPE_ENUM=247,
MYSQL_TYPE_SET=248,
MYSQL_TYPE_TINY_BLOB=249,
MYSQL_TYPE_MEDIUM_BLOB=250,
MYSQL_TYPE_LONG_BLOB=251,
MYSQL_TYPE_BLOB=252,
MYSQL_TYPE_VAR_STRING=253,
MYSQL_TYPE_STRING=254,
MYSQL_TYPE_GEOMETRY=255,
MAX_NO_FIELD_TYPES /* Should always be last */
};
class A_datatype : public A_object {
virtual base_field_type get_base_type();
/* Todo: Include type specific things here including :
- length for bit, INT types, char types
- (length, decimals) for float, decimal, "numeric" types
- charset, collation info for char, text, enum, set types
- value set for enum types
- value set for set type
*/
};
/**
@class A_column
@brief Abstract column base class.
This represents the column of an active table in database.
*/
class A_column : public A_object {
public:
virtual int get_column_name(String *str); /*< Get column name */
virtual A_table *get_table(); /*< Get associated table */
virtual bool is_nullable(); /*< Is column nullable? */
/* Get default value for this column in String form */
virtual int32 get_default_value_str(String *def);
/* Get column number. First column starts from 1. */
virtual int32 get_column_number();
virtual bool is_unique(); /*< Is unique column? */
virtual bool is_pkey(); /*< Is primarykey column? */
virtual bool is_sequence(); /*< Is auto increment col?*/
virtual A_datatype *get_datatype(); /*< Get SQL data type. */
/* Get Charset name. Applies to only char based types */
virtual int32 get_charset_name(String *str);
/* Get collation name. Applies to only char based types */
virtual int32 get_collation_name(String *str);
};
class A_base_table : public A_table {
public :
virtual bool is_temp_table(); /*< Is this temp table?*/
virtual RList<A_index> *get_indexes(); /*< Get index list */
virtual RList<A_trigger> *get_triggers(); /*< Get triggers list */
virtual int32 get_table_comment(String *str);/*< Get table comment. */
virtual int32 get_tablespace_name(String *str); /*< Get tablespace name*/
/*
Get all kinds of constraints for this table:
This includes: PrimaryKey, ForeignKey, Unique constraints.
*/
virtual RList<A_table_constraint> *get_constraints();
virtual A_pkey *get_primary_key(); /*< Get Primary Key. */
virtual RList<A_fkey> *get_foreign_keys();/*< Get Foreign keys */
/* Get the list of unique constraints. Includes pkey if exists */
virtual RList<A_ukey> *get_unique_constraints();
/* Get auto-increment starting value. */
virtual uint64 get_autoincrement();
/*
Get keyblock size in bytes. It is hint to storage engine.
Value 0 indicates use of default.
*/
virtual int32 get_key_block_size();
/* Todo: Get partition info. */
};
class A_view_table : public A_table {
public:
virtual A_query_expr *get_query_expr(); /*< Get query expr */
virtual int32 get_sql(String *str); /*< Get associated SQL */
};
/*
@class A_index
@brief Abstract class for Index.
Note: An index may internally be created due to:
1) "primary key" constraint specification
2) "unique" constraint specification
3) explicit index specification
Non-unique indexes are always created as part of explicit specification.
Explicit specification of index always creates new index and mysql
does not attempt to "share" internally generated indexes as part of
unique/pkey specification.
The index name and constraint name are same when the index is created
for the constraint (either primary key or unique key).
Currently foreign key specification does not auto generate index, hence
an index is not associated with any foreign key in referencing table.
*/
class A_index : public A_object {
public:
virtual int32 get_index_name(String *str);/*< Get index name */
virtual A_table *get_table(); /*< Get associated table */
virtual RList<A_index_member> *get_members();/*< Get index members */
virtual bool is_unique(); /*< Is unique index? */
virtual bool is_clustered(); /*< Is clustered index? */
virtual bool is_pkey_index(); /*< Is primary key index?*/
virtual bool is_system_generated(); /*< System generated index?*/
/*
This returns associated source constraint for this index.
Note: Single index can atmost be associated with single constraint:
This could be primary or unique constraint.
Non-unique index is not associated with any constraints
and will return NULL.
*/
virtual A_table_constraint *get_index_constraint();
/* Enum type for index types. */
enum enum_index_type { A_INDEX_TYPE_PRIMARY, A_INDEX_TYPE_UNIQUE,
A_INDEX_TYPE_MULTIPLE, A_INDEX_TYPE_FULLTEXT,
A_INDEX_TYPE_SPATIAL,
A_INDEX_TYPE_FOREIGN_KEY};
virtual enum_index_type get_index_type(); /*< Get index type. */
enum enum_index_algo { A_INDEX_ALGO_UNDEF=0,
A_INDEX_ALGO_BTREE,
A_INDEX_ALGO_RTREE,
A_INDEX_ALGO_HASH,
A_INDEX_ALGO_FULLTEXT };
virtual enum_index_algo get_index_algo(); /*< Get Index Algorithm*/
/*
Get keyblock size in bytes. It is hint to storage engine.
Value 0 indicates use of default.
*/
virtual int32 get_key_block_size();
};
class A_index_member : public A_object {
public :
virtual A_column *get_column(); /* Get associated column */
/*
Index may be created on leading part of column values.
Return value of 0 indicates entire column is used for index.
*/
virtual int32 get_index_member_partial_length();
/*
Index member can be ascending or descending.
Currently it is always used as ascending in mysql.
*/
virtual bool is_ascending();
};
/*
@class A_table_constraint
@brief Abstract table constraint clause.
Note: Check SQL Standard 5WD-02-Foundation-2003-09 4.17
Integrity Constraints for more description about the constraints.
Constraint := Table Constraint | Domain Constraint | Assertion
Table Constraint := Involves one or more columns in a table
Domain constraint := Restricted datatype e.g integer value > 3
can be used against any column
Assertion := Involves multiple tables
MySQL currently supports table constraints only.
MySQL assumes all constraints are non-deferrable and checked immediate.
[ SQL standard supports the notion of constraint being either deferrable
or not. If deferrable, the current constraint mode can be
"deferred" or "immediate". ]
TableConstraint :- unique | referential(foreign_key) | check
unique-constraint(column-list) :- defined by: "unique" | "primary key"
referential-constraint(referencing-columns) :-
specifies referenced-table(referenced-columns);
The <referenced-columns> should be of type "unique".
The match type (full,partial,simple) defines semantics of match when
some columns can be null.
The referential triggered action specifies what to do :
i.e. ON delete|update restrict|cascade|set-null|no-action
referenced table may be same as referencing table.
check-constraint : constraint applied per row on a given table.
currently parsed and ignored by MySQL.
unique key and primary key must be associated with an index.
Foreign key constraint does not require an associated index in
referencing table.
*/
class A_table_constraint : public A_object {
public:
/* Get associated base table. */
virtual A_base_table *get_base_table();
virtual int get_constraint_name(String *name); /*<Get constraint name */
/*
Currently all MySQL constraints are not deferrable.
so this always returns false for now.
*/
virtual bool is_deferrable();
/*
For mysql, all constraints are not deferrable and
always checked immediate. So this always returns false for now.
*/
virtual bool is_initially_deferred();
enum A_constraint_type {
UNIQUE_KEY_CONSTRAINT,
PRIMARY_KEY_CONSTRAINT,
FOREIGN_KEY_CONSTRAINT
};
/* Get constraint type : unique | primary_key | foreign_key */
virtual A_constraint_type get_constraint_type();
};
/*
@class A_fkey
@brief Abstract foreign key referential constraint
*/
class A_fkey : public A_table_constraint {
public:
/* Get associated referencing table. */
virtual A_base_table *get_referencing_table();
/* Get associated referenced table. */
virtual A_base_table *get_referenced_table();
/* Get referencing columns. */
virtual RList<A_column> *get_referencing_columns();
/* Get referenced columns. */
virtual RList<A_column> *get_referenced_columns();
/*
@enum A_fkey_match_type
@brief Match type for foreign key referential constraint.
Match type provides semantics of match when some
referencing/referenced columns could be null.
*/
enum A_fkey_match_type {
/*
Either all referencing columns are null or non-nulls. Every
non-null referencing column must match with a referenced column.
*/
FKEY_MATCH_FULL=1,
/* Non-null values of referencing cols must match with referenced.*/
FKEY_MATCH_PARTIAL,
/* If atleast one referencing column is NULL, no constraint imposed.
Otherwise all non-null columns must match.
*/
FKEY_MATCH_SIMPLE
};
virtual A_fkey_match_type get_match_type();
/*
@enum A_fkey_action_type
@brief Referential action types for implementing foreign keys.
The action specifies what to do after delete|update of the
referenced row. This could be one of:
restrict|cascade|set-null|no-action|set-default
The no-action is the default action type.
*/
enum A_fkey_action_type {
FKEY_NO_ACTION = 0, /* No action. */
FKEY_RESTRICT = 1, /* Reject update/delete */
FKEY_CASCADE = 2, /* Cascade update/delete */
FKEY_SET_NULL = 3, /* Set referencing columns as null*/
FKEY_SET_DEFAULT = 4 /* Set referencing cols to def value*/
};
/* Returns action to take when a row in referenced table is deleted. */
virtual A_fkey_action_type get_on_delete_action();
/* Returns action to take when a row in referenced table is updated. */
virtual A_fkey_action_type get_on_update_action();
};
/*
@class A_unique_constraint
@brief Abstract unique key constraint
*/
class A_ukey : public A_table_constraint {
public :
virtual A_index *get_unique_index();
};
/*
@class A_pkey
@brief Abstract primary key constraint
*/
class A_pkey : public A_ukey {
};
} // End namespace aqtdict