MySQL Storage Handler Interface
← Back to MySQL University main pagePresenter: Brian Aker Time: Thursday 15. March 2007, at 6am PST = 9am EST = 15 CET = 16 EET Estimated length of Session: presenter to fill in here - note max 3 hours Moderator: Patrik Backman Scribe: MC Brown Scribe (for the Heidelberg Developer's Conference): Stefan Attendees: To register for this Session - Please enter your name here: Alexander Nozdrin Andrey Hristov HakanKuecuekyilmaz Marc Alff Martin MC Brown Sergei Golubchik Ingo Strüwing Hartmut Holzgraefe Giuseppe Maxia Dmitri Lenev Eric Herman (will have to leave early) Axel Schwenke Lars Thalmann Chris Powers Omer BarNir IgnacioGalarza User:TimSmith ArjenLentz User:GuilhemBichot Oleksandr Byelkin User:DanielFischer User:UlfWendel User:JanKneschke GeorgiKodinov
[edit] MySQL_Storage_Handler_Interface
http://krow.net/talks/New%20Storage%20Engine%20Talk.pdf
[edit] First Let us Talk about a st_plugin
mysql_declare_plugin(memcache)
{
MYSQL_STORAGE_ENGINE_PLUGIN,
&memcache_storage_engine,
"MEMCACHE",
"Brian Aker, Tangent Org",
"Simple Interface for working with memcache as a storage engine",
PLUGIN_LICENSE_GPL,
memcache_init_func, /* Plugin Init */
memcache_done_func, /* Plugin Deinit */
0x0005,
NULL, /* status variables */
NULL, /* system variables */
NULL /* config options */
},
[edit] Simple Handlerton
static int example_init_func(void *p)
{
DBUG_ENTER("example_init_func");
if (!example_init)
{
example_hton= (handlerton *)p;
example_init= 1;
VOID(pthread_mutex_init(&example_mutex,MY_MUTEX_INIT_FAST));
(void) hash_init(&example_open_tables,system_charset_info,32,0,0,
(hash_get_key) example_get_key,0,0);
example_hton->state= SHOW_OPTION_YES;
example_hton->db_type= DB_TYPE_EXAMPLE_DB;
example_hton->create= example_create_handler;
example_hton->flags= HTON_CAN_RECREATE;
}
DBUG_RETURN(0);
}
[edit] A Little More Complex
federated_hton->db_type= DB_TYPE_FEDERATED_DB; federated_hton->commit= federated_commit; federated_hton->rollback= federated_rollback; federated_hton->create= federated_create_handler; federated_hton->panic= federated_db_end; federated_hton->flags= HTON_ALTER_NOT_SUPPORTED;
(If you want more complex, look at ha_ndbcluster.cc)
[edit] The handler
(think of it as a cursor)
- Implemented in handler.h
- Has both table level operations and cursor operations.
- There is more to it then can be covered in three hours (not surprising)
[edit] Table Object Methods
- base_ext()
- create()
- delete_table()
- rename_table()
- optimize(), repair(), etc...
[edit] bas_ext()
static const char *ha_myisam_exts[[[]] = {
".MYI",
".MYD",
NullS
};
const char **ha_myisam::bas_ext() const
{
return ha_myisam_exts;
}
[edit] create()
int ha_federated::create(const char *name, TABLE *table_arg,
HA_CREATE_INFO *create_info)
{
int retval;
FEDERATED_SHARE tmp_share; // Only a temporary share, to test the url
DBUG_ENTER("ha_federated::create");
if (!(retval= parse_url(&tmp_share, table_arg, 1)))
retval= check_foreign_data_source(&tmp_share, 1);
my_free((gptr) tmp_share.scheme, MYF(MY_ALLOW_ZERO_PTR));
DBUG_RETURN(retval);
}
[edit] delete_table()
(not everyone has one)
int ha_myisam::delete_table(const char *name)
{
return mi_delete_table(name);
}
[edit] rename_table()
void ha_heap::drop_table(const char *name)
{
heap_drop_table(file);
close();
}
[edit] Insert!
T@3 : | | | | | | | | | | >ha_archive::open T@3 : | | | | | | | | | | | >ha_archive::get_share T@3 : | | | | | | | | | | | <ha_archive::get_share T@3 : | | | | | | | | | | | >ha_archive::create_record_buffer T@3 : | | | | | | | | | | | <ha_archive::create_record_buffer T@3 : | | | | | | | | | | <ha_archive::open T@3 : | | | | >ha_archive::start_bulk_insert T@3 : | | | | <ha_archive::start_bulk_insert T@3 : | | | | | >ha_archive::write_row T@3 : | | | | | | | >ha_archive::pack_row T@3 : | | | | | | | <ha_archive::pack_row T@3 : | | | | | <ha_archive::write_row T@3 : | | | | >ha_archive::end_bulk_insert T@3 : | | | | <ha_archive::end_bulk_insert
[edit] write_row()
int ha_tina::write_row(byte * buf)
{
int size;
DBUG_ENTER("ha_tina::write_row");
if (share->crashed)
DBUG_RETURN(HA_ERR_CRASHED_ON_USAGE);
ha_statistic_increment(&SSV::ha_write_count);
if (table->timestamp_field_type & TIMESTAMP_AUTO_SET_ON_INSERT)
table->timestamp_field->set_time();
size= encode_quote(buf);
stats.records++;
DBUG_RETURN(0);
}
[edit] Read (Scan)
T@3 : | | | | | | | >ha_archive::info T@3 : | | | | | | | <ha_archive::info T@3 : | | | | | | | | >ha_archive::rnd_init T@3 : | | | | | | | | <ha_archive::rnd_init T@3 : | | | | | | | >ha_archive::rnd_next T@3 : | | | | | | | | | | >ha_archive::unpack_row T@3 : | | | | | | | | | | <ha_archive::unpack_row T@3 : | | | | | | | | <ha_archive::get_row T@3 : | | | | | | | <ha_archive::rnd_next T@3 : | | | | | | | >ha_archive::rnd_next T@3 : | | | | | | | <ha_archive::rnd_next
[edit] info()
int ha_archive::info(uint flag)
{
DBUG_ENTER("ha_archive::info");
stats.records= share->rows_recorded;
stats.deleted= 0;
/* Costs quite a bit more to get all information */
if (flag & HA_STATUS_TIME)
{
MY_STAT file_stat; // Stat information for the data file
VOID(my_stat(share->data_file_name, &file_stat, MYF(MY_WME)));
stats.mean_rec_length= table->s->reclength + buffer.alloced_length();
stats.data_file_length= file_stat.st_size;
stats.create_time= file_stat.st_ctime;
stats.update_time= file_stat.st_mtime;
stats.max_data_file_length= share->rows_recorded * stats.mean_rec_length;
}
stats.delete_length= 0;
stats.index_file_length=0;
if (flag & HA_STATUS_AUTO)
{
azflush(&archive, Z_SYNC_FLUSH);
stats.auto_increment_value= archive.auto_increment;
}
DBUG_RETURN(0);
}
[edit] rnd_init()
int
ha_innobase::rnd_init(
/*==================*/
/* out: 0 or error number */
bool scan) /* in: ???????? */
{
int err;
row_prebuilt_t* prebuilt = (row_prebuilt_t*) innobase_prebui
if (prebuilt->clust_index_was_generated) {
err = change_active_index(MAX_KEY);
} else {
err = change_active_index(primary_key);
}
start_of_scan = 1;
return(err);
}
[edit] rnd_next()
int
ha_innobase::rnd_next(
/*==================*/
/* out: 0, HA_ERR_END_OF_FILE, or error number */
mysql_byte* buf)/* in/out: returns the row in this buffer,
in MySQL format */
{
int error;
DBUG_ENTER("rnd_next");
statistic_increment(current_thd->status_var.ha_read_rnd_next_count,
&LOCK_status);
if (start_of_scan) {
error = index_first(buf);
if (error == HA_ERR_KEY_NOT_FOUND) {
error = HA_ERR_END_OF_FILE;
}
start_of_scan = 0;
} else {
error = general_fetch(buf, ROW_SEL_NEXT, 0);
}
DBUG_RETURN(error);
}
[edit] update_row()
T@3 : | | | | >ha_tina::info
T@3 : | | | | <ha_tina::info
T@3 : | | | | | >ha_tina::rnd_init
T@3 : | | | | | <ha_tina::rnd_init
T@3 : | | | | | >ha_tina::extra
T@3 : | | | | | <ha_tina::extra
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::update_row
<------------------Update occurs here
T@3 : | | | | <ha_tina::update_row
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::update_row
T@3 : | | | | <ha_tina::update_row
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::extra
T@3 : | | | | <ha_tina::extra
T@3 : | | | | >ha_tina::rnd_end
T@3 : | | | | | >ha_tina::write_meta_file
T@3 : | | | | | <ha_tina::write_meta_file
T@3 : | | | | <ha_tina::rnd_end
[edit] update_row()
int ha_tina::update_row(const byte * old_data, byte * new_data)
{
int size;
DBUG_ENTER("ha_tina::update_row");
ha_statistic_increment(&SSV::ha_read_rnd_next_count);
if (table->timestamp_field_type & TIMESTAMP_AUTO_SET_ON_UPDATE)
table->timestamp_field->set_time();
size= encode_quote(new_data);
if (chain_append())
DBUG_RETURN(-1);
DBUG_RETURN(0);
}
[edit] delete_row()
T@3 : | | | | >ha_tina::info
T@3 : | | | | <ha_tina::info
T@3 : | | | | | >ha_tina::rnd_init
T@3 : | | | | | <ha_tina::rnd_init
T@3 : | | | | | >ha_tina::extra
T@3 : | | | | | <ha_tina::extra
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::delete_row
<--------------------- delete occurs here
T@3 : | | | | <ha_tina::delete_row
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::rnd_next
T@3 : | | | | <ha_tina::rnd_next
T@3 : | | | | >ha_tina::extra
T@3 : | | | | <ha_tina::extra
T@3 : | | | | >ha_tina::rnd_end
T@3 : | | | | | >ha_tina::write_meta_file
T@3 : | | | | | <ha_tina::write_meta_file
T@3 : | | | | <ha_tina::rnd_end
[edit] delete_row()
int ha_tina::delete_row(const byte * buf)
{
DBUG_ENTER("ha_tina::delete_row");
ha_statistic_increment(&SSV::ha_delete_count);
if (chain_append())
DBUG_RETURN(-1);
stats.records--;
/* DELETE should never happen on the log table */
DBUG_ASSERT(!share->is_log_table);
DBUG_RETURN(0);
}
[edit] index_read()
int ha_memcache::read_range_first(const key_range *start_key,
const key_range *end_key,
bool eq_range, bool sorted)
{
DBUG_ENTER("ha_memcache::read_range_first");
int ret;
ret= find_row(table->record[[[0]], start_key->key,
start_key->length);
DBUG_RETURN(ret);
}
[edit] index_next()
int index_next(byte * buf)
{
return HA_ERR_END_OF_FILE;
}
[edit] What about deining features?
- tables_flags()
- index_flags()
- handlerton flags
[edit] Locks anyone?
- external_lock()
- store_lock()
[edit] store_lock()
THR_LOCK_DATA **ha_tina::store_lock(THD *thd,
THR_LOCK_DATA **to,
enum thr_lock_type lock_type)
{
if (lock_type != TL_IGNORE && lock.type == TL_UNLOCK)
lock.type=lock_type;
*to++= &lock;
return to;
}
[edit] external_lock()
(yes, its a historical, and now horrible name)
int ha_federated::external_lock(THD *thd, int lock_type)
{
int error= 0;
ha_federated *trx= (ha_federated *)thd->ha_data[[[ht->slot]];
DBUG_ENTER("ha_federated::external_lock");
if (lock_type != F_UNLCK)
{
if (!(thd->options & (OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN)))
{
error= connection_autocommit(TRUE);
trans_register_ha(thd, FALSE, ht);
}
else
{
DBUG_PRINT("info",("not autocommit"));
if (!trx)
{
error= connection_autocommit(FALSE);
thd->ha_data[[[ht->slot]]= this;
}
else
{
ha_federated *ptr;
for (ptr= trx; ptr; ptr= ptr->trx_next)
if (ptr == this)
break;
else if (!ptr->trx_next)
ptr->trx_next= this;
}
}
}
DBUG_RETURN(0);
}
[edit] Optimizer
- info()
- scan_time()
- read_time()
- records_in_range()
- pushdown conditions
[edit] What is missing?
- A lot!
- plug.in
- how to build as both a static and as a dynamic engine
- http://hg.tangent.org/skeleton-mysql-engine
- How to add a dynamic I_S (look at the memcache engine, it has one)
- Transactions... push down conditions...
[edit] Session notes
- Voice Recording: https://docsrva.mysql.com/MySQLU/MySQLU-Storage_Engine_API.ogg
[edit] Questions & Answers
- Hartmut, Slide 2: "Are there plans for a plugin dependency mechanism to determine init/deinit order? (like we have for extensions in PHP land?)''
- (Serg): Yes
- (Brian): Yes there is. We didn't do this in the first version, but there is a plan so that you can say 'we require this plugin before startup'.
- dlenev, Slide 5: 'How 'cheap' is the handler? I mean one may want to have several cursors for the same operation and this case 'cheapness' comes into play.'
- (Monty): The handler is relatively cheap to open again, especially if is already open. One current exception is the partition handler. Currently we don't support several cursors for the same handler, although it can easily be fixed with either cursor store/restore methods or a relatively cheap clone call. When using a handler as a cursor, the handler becomes a very lightweight dispatcher over the native handler implementation.
- TheK, Slide 11: 'Where did they get the handler for the fields?'
- (Brian): The fields themselves are members of tables, and tables are a variable inside of the handler. For example:
int ha_tina::encode_quote(byte *buf)
{
char attribute_buffer[1024];
String attribute(attribute_buffer, sizeof(attribute_buffer),
&my_charset_bin);
my_bitmap_map *org_bitmap= dbug_tmp_use_all_columns(table, table-
buffer.length(0);
for (Field **field=table->field ; *field ; field++)
{
const char *ptr;
const char *end_ptr;
/*
Write an empty string to the buffer in case of a NULL value.
Basically this is a safety check, as no one ensures that the
field content is cleaned up every time we use
in the code.
*/
if ((*field)->is_null())
{
buffer.append(STRING_WITH_LEN("\"\","));
(cheaper then current_thd())
continue;
}
- (Brian, cont.): You can see that it's extracting the fields, fields are a member of the table, and the table is a member of the handlerton.
- TheK : 'When did we get the handle to the table?'
- (Brian): When the handler is created, we populate the handler with a reference to the actual table itself.
- Hakan, Slide 14: 'Does rnd == random || rnd == read next data?'
- (Serg): It stands for Random, although the same functions are used for linear table reads.
- (Brian): Yeah, random.
- Timothy, Slide 15: 'Why are statistics incremented in each handler, not at the upper level? it seems error-prone, or different engines could report statistics differently, which could make it confusing.'
- (Serg): Correct, should be fixed (moved to ha_write_row()/etc wrappers). and it should be done in backward-compatible way (by defining ha_statistic_increment to be no-op)
- (Brian): The answer to that is an historical question. We didn't have all handler functions in all engines, so we implemented them in lower layers, rather than upper ones. But yes, this should be fixed so that they are in upper layers in the wrapper functions.
- dlenev, Slide 22: 'What is the difference between table_flags() and handlerton flags ? Are there any capabilities which are dynamical ?'
- (Serg): table_flags() is for dynamic capabilities, handlerton flags for static. You need to create/open a handler to access to the table_flags. So every flag that you need to know about in advance (i.e. that needs to be known before opening the table) should be in the handlerton flags.
- (Brian): index_flags are dynamic. I'm not aware of any engine at the moment that uses dynamic table_flags. Table flags should probably be merged with handlerton flags. It's cheap to access handlerton flags because they are global in nature.
- Marc Alff, Slide 25: 'Why providing the full thd->ha_data array to each engine, as opposed to provide only the engine slot and rely on the storage engine code to usethe proper ht->slot ?'
- (Brian): We don't have a special transaction object, which is why we have to use this current method.
- Sanja, Slide 25: 'How do you think external_lock() should be named?'
- (Brian): It should still exist, but I think it should be used only for people who interested ion something like the flock() call. But we should remove the transaction state from it. I think the transaction state should be handled by the handlerton for the storage engine. So external_locks() should exist, it just shouldn't be handling transactions.
- Kostja, Slide 25: 'We don't call store_lock/external_lock for temporary tables. Why?
- (Brian): We do not call store_lock/external_lock on internal temporary tables which are created either as memory or MyISAM.
- (Monty): We do call external_lock for all transactional temporary tables, it is only non-transational temporary tables where we don't call extgern_lock/unlock.
- Brian-> The difference in the above two answers is that Monty is referring to tmp tables created by users, I am talking about tmp tables created internally for use in GROUP BY, etc...
Questions and answers at the Heidelberg Developers Meeting
- write_row(): Why hasn't the statistics info (on/off) been moved to the layer above?
- For historical reasons; it should actually be moved above.
- update_row(): Does Falcon use the FIELD class?
- Yes.
