WL#4895: PERFORMANCE_SCHEMA Instrumenting Table IO

Affects: Server-9.x — Status: Un-Assigned — Priority: Very High

Add the necessary instrumentation for tracking
table input/output. Include row-level accesses
to temporary/persistent base tables and indexes.
We will treat table io as a "wait" so data will
appear in EVENTS_WAITS_CURRENT, in wait histories,
and in wait summaries. The EVENT_NAME column will be
"wait/io/table/..." -- compare "wait/io/file/...".

The big change is that EVENTS_WAITS_CURRENT has two rows.

Molecules and Atoms
-------------------

Other waits (mutexes, file io, net io, memory io) are
"atomic", that is, they do not include other waits,
they do not overlap with other waits. Table io can
be different; sometimes table io might include file io
or memory movement (for example "update row" might
be done with "read file" or might cause "memcpy").

We might logically conclude that table io is not
class = wait, it is a separate class in the hierarchy
statements -> stages -> table io -> waits.
But that would significantly increase the number of
performance_schema tables, and the amount of nesting.
We want to avoid such trouble for something that's
after all nearly atomic. We'll use the word "molecular"
for events that are low level like this. Table io is
a molecular wait. Possibly global read lock is also molecular.

Molecular events don't get their own set of tables,
but they do get their own rows in EVENTS_WAITS_CURRENT.
Therefore EVENTS_WAITS_CURRENT will usually have two rows:
#1 The latest wait event of any kind
#2 The latest table io wait event (may be same as #1).

For example, suppose that there is a 'row fetch' which
causes a 'file read'. EVENTS_WAITS_CURRENT might look like this:

Row# EVENT_NAME TIMER_START TIMER_END
---- ---------- ----------- ---------
1 wait/io/file/myisam/dfile 10001 10002
2 wait/io/table/sql/handler 10000 NULL

In this example a table io event started before a
file io event and has not finished, so it's clear
that the file io event is "nested" in the table io.

The table io event will not be discarded from
EVENTS_WAITS_CURRENT until it has finished.

In all other respects, table io treatment is the
same as for other waits.

What to instrument
------------------

Operations that affect whole tables -- CREATE,
ALTER, DROP -- are visible as statements, so
there's no need to monitor them separately.

Some operations that affect blocks -- "block read",
"block write" -- are visible as file io, so
there's no need to monitor them separately.
Other operations that affect blocks -- marking
"has been read" or "will be flushed" -- take no
time and are therefore not appropriate for
instrumenting.

Operations that affect data rows are:
'fetch' (or 'select'), 'insert' (or 'write')
'update', 'delete'.
We don't consider 'lock' to be a table io operation.
We don't include 'binlog' which shows up as file io.
If an insert triggers an insert, that's two operations.
If an update causes a cascade, that's two operations.
In another worklog task we say 'fetch' is a 'stage'.

Instrumenting in handler.cc
---------------------------

The Performance Schema instrumentation will be
in sql/handler.cc, in ha_write_row, ha_update_row,
and ha_delete_row. For example, the current
mysql-6.0-dtrace code for handler.cc is:
"
int handler::ha_write_row(uchar *buf)
{
int error;
Log_func *log_func= Write_rows_log_event::binlog_row_logging_function;
DBUG_ENTER("handler::ha_write_row");

MYSQL_INSERT_ROW_START(...); /* DTrace instrument start */
mark_trx_read_write();
error= write_row(buf); <-- REPLACE THIS WITH INSTRUMENTING CALL
MYSQL_INSERT_ROW_DONE(...); /* DTrace instrument end */

if (unlikely(error != 0))
DBUG_RETURN(error);
if (unlikely(error= binlog_log_row(table, 0, buf, log_func)))
DBUG_RETURN(error); /* purecov: inspected */
DBUG_RETURN(0);
}
"
We will replace "write_row(buf)" with "mysql_write_row(...)".
We will do the same in ha_update_row and ha_delete_row.

This means we have row io instrumentation in
one place, rather than one place per storage engine.

This also means we are instrumenting inside DTrace
instrumentation. Peter had been thinking that it would
be good to instrument in the same places as DTrace so we
could compare our results to DTrace results when
measuring the same things. But that's unimportant.

Sergei Golubchik thinks instrumenting in handler.cc is right.
Marc Alff sees nothing wrong with it.

However, the only operations are 'insert' (i.e. write),
'update', 'delete'. And only for individual rows.

Instrumenting bulk io (Deferred)
--------------------------------

In WL#4541 we see that the DTrace folk are going
to have probes for some of the same operations, and
will distinguish "update" from "multi_update".

In sql/handler.cc we see that there are places
where we could perhaps instrument, such as
ha_bulk_update_row.

We will not do that as part of this task.

We will assume that the number of rows involved
is always '1', or else doesn't matter.

Instrumenting 'fetch' (Deferred)
--------------------------------

We will not instrument rnd_pos(), or
or anything that involves 'fetch' (i.e. read).
Clearly 'fetch' is a table io operation, and
it's implemented in subclasses of handler.
But we won't touch storage-engine code for this
worklog task. This section only notes where it
could be done, perhaps as part of a later task.

Where DTrace folk do it:
The mysql-6.0-dtrace code, in
storage/falcon/ha_falcon.cpp, has
"
MYSQL_INDEX_READ_ROW_START(...);
instrumented code
MYSQL_INDEX_READ_ROW_DONE(...);
"
in several places. This START / DONE
instrumentation is also in sql/ha_ndbcluster.cc,
storage/maria/ha_maria.cc, and some
(not all) other storage engines.

Where Google folk do it:
The SHOW INDEX_STATISTICS code, in
storage/innobase/handler/ha_innodb.cc, has
"
ha_innobase::index_read()
row_search_for_mysql(...);
...
if (ret == DB_SUCCESS) {
...
rows_read++;
if (active_index >= 0 && active_index < MAX_KEY)
index_rows_read[active_index]++;
}
"

Apparently Google is thinking "we only care about
how many rows were read so only count the successes".
We, however, are thinking "we only care about how
much effort went into row-read operations so we
count both successes and failures".

Indexing index operations (Deferred)
------------------------------------

Indexes are like tables. That is, operations
are fetch/insert/update/delete, bytes can move,
and both rows and index keys may be in blocks.
So "table io" includes "index io", and when we say
"table io" we should mean io on both tables and their
associated indexes.

We won't have instrumentation for index key changes
(insert, update, delete) as part of this task.
This section only notes that it could be done,
as part of a later task.

Operations that affect index keys are:
'fetch', 'insert', 'update', 'delete'.
Thus we do not have separate designations
('fetch key', 'insert key', etc.) which would
distinguish index io from table io in the
OPERATION column. It might be evident
from EVENT_NAME that a fetch is of an index key.
We do not instrument block splits or merges.

When we monitor indexes, we will need to store
the index number (the OBJECT_NAME column will
be the table name).

Sergei Golubchik proposed we should at least
monitor index searches. They're visible at the
server level (although index insert/update/delete
operations are not), so it would be possible to
instrument 'fetch' operations only. There was no
support for the proposal.

EVENTS_WAITS_CURRENT
--------------------

We can find table io individual event information with:
SELECT * FROM EVENTS_WAITS_CURRENT;
The columns of EVENTS_WAITS_CURRENT are as described for WL#2360,
and we include the ones that will appear for WL#4678.
Some columns might be NULL for users with insufficient privileges.

THREAD_ID INTEGER Same as for mutexes.
EVENT_ID BIGINT Same as for mutexes.
EVENT_NAME VARCHAR(128) See below.
SOURCE VARCHAR(64) Same as for mutexes.
TIMER_START BIGINT Same as for mutexes.
TIMER_END BIGINT Same as for mutexes.
TIMER_WAIT BIGINT Same as for mutexes.
SPINS INTEGER See below.
OBJECT_SCHEMA VARCHAR(64) See below.
OBJECT_NAME VARCHAR(512) See below.
OBJECT_TYPE VARCHAR(64) See below.
OBJECT_INSTANCE_BEGIN BIGINT See below.
NESTING_EVENT_ID BIGINT Same as for mutexes.
OPERATION VARCHAR(16) See below.
NUMBER_OF_BYTES BIGINT See below.
FLAGS INTEGER NULL. See below.

EVENT_NAME. Following the "taxonomy" convention of WL#2360:
Class: wait
Order io
Family table
Genus sql
Species handler
Knowing that 'table io' is a subclass of
'io' which is a subclass of 'wait', uncontroversially
class/order/family = 'wait/io/table'.
The genus and species are arbitrary, they're based
on the fact that the current instrument is in
sql/handler.cc. If we we were instrumenting some
other method for a subclass of handler, it would
still be appropriate to have the same event name.
This does mean that when one says "GROUP BY EVENT_NAME"
one will get all events (fetch, insert, update, delete,
index read, etc.) in the same group, thus the same row.

SPINS. NULL for a typical table-io operation, but could
be the number of an index if later we instrument certain
index accesses. Certainly SPINS is not an appropriate column
name for an index number, so we are also considering using
OBJECT_INSTANCE_BEGIN for this purpose, or a new column.

OBJECT_SCHEMA. The name of the schema (database) that
contains the table or index.

OBJECT_NAME: The name of the table. Not a correlation name,
synonym, or view name. It would be desirable to specify
that it may not be a partition name partition name or
tablespace name, but we will say that's implementor-defined.
Not an index name, even if the operation affects an index.

OBJECT_TYPE. 'TABLE' (persistent base table), or
'TEMPORARY TABLE'. Even when we support indexes,
we'll say 'TABLE' because index operations will
be visible as "index fetch" in the OPERATION column.

OBJECT_INSTANCE_BEGIN could be a ROWID, if there is a
BIGINT identifier known at the time of the table io.
Otherwise it's NULL. Putting "Block number" or "Offset
within block" could be considered. However, we will
say this is implementor-defined. It can be NULL.

NESTING_EVENT_ID. Originally the wording here was:
"Table io is nested within stages
which are nested within statements. Undecided detail:
should we indicate that an atomic wait (e.g.
file io) is nested within a table io?"
Now see the later section "NESTING_EVENT_ID (Deferred)".

OPERATION is 'insert' or 'update' or 'delete'; later
'fetch' will be possible. In summaries a 'fetch' should
be counted as a 'read', and any other operation should
be counted as a 'write'.

NUMBER_OF_BYTES should be known. If a row has 17 bytes
including the size of a header, NUMBER_OF_BYTES = 17.
It might be unknown when the event starts, so often
NUMBER_OF_BYTES is NULL.

FLAGS. Provided that there is no significant overhead,
this will have things like "index read was done in
KEYREAD mode". Here is Sergei Golubchik's opinion:
"
Different extra hints are important. At least these:
HA_EXTRA_CACHE, HA_EXTRA_KEYREAD, HA_EXTRA_KEY_CACHE,
HA_EXTRA_WRITE_CACHE, HA_EXTRA_DELETE_CANNOT_BATCH,
HA_EXTRA_UPDATE_CANNOT_BATCH, but [... maybe more].
The above are values from enum ha_extra_function.
For flags we'll probably need to remove HA_EXTRA_ prefix or rename them
completely, using names clearer to the end user.
They're important, because they affect how the read, for example, or
delete is performed, and affect the spead. That is, without knowing
these flags the row (in the waits table) cannot be interpreted
correctly.
"
Meanwhile Marc Alff has proposed that this FLAGS column could
have data type SET('cache','keyread','...').
Unfortunately the meaning of the bits can depend
on the event's genus or operation, but Peter Gulutzan
thinks we should try to work out something that's readable.
That's for "later": he likes to think of cosmetic
enhancements as paint, and we should paint after
all the walls of the structure are up.
We will consider FLAGS content to be implementor-defined.

OTHER POSSIBLE COLUMNS: If we can guess that we will
need to read 1000 rows, and we know that we're on the
500th fetch, that's useful information for a progress
indicator. But perhaps it's more useful for a stage.

Generally we do not want to include metadata in
performance schema tables. So if you want to find
out (for example) what the tablespace is, look elsewhere.

NESTING_EVENT_ID (Deferred)
---------------------------

There is disagreement about NESTING_EVENT_ID.

When Peter uses the word "stages" he's thinking of
WL#4813 PERFORMANCE_SCHEMA Instrumenting Stages,
so let's think of an example with three levels:
stage=filesort, table-io=fetch, wait=read.

1. In EVENTS_WAITS_CURRENT for wait=read,
column NESTING_EVENT_ID_1 points to table-io=fetch,
column NESTING_EVENT_ID_2 points to stage=filesort.
Eventually there would be more column for higher levels.
(The row for table=io=fetch would be:
column NESTING_EVENT_ID_1 is NULL,
column NESTING_EVENT_ID_2 points to stage=filesort.)

2. In EVENTS_WAITS_CURRENT for wait=read,
column NESTING_EVENT_ID points to table-io=fetch.
There might have to be another column that says
what type of event NESTING_EVENT_ID is pointing
up to (table-io or stage).
The aggregation information in the stage=filesort
row contains "count(1)" for table-io=fetch, but
"count(0)" for wait=read, because aggregations
of stages are only for "one level below".

3. The filesort has the aggregation of
the atomic waits, but not of the table-io waits.
In EVENTS_WAITS_CURRENT for wait=read,
NESTING_EVENT_ID points to the stage=filesort.

Sergei Golubchik prefers '2'. At least, he indicates
that the column should contain whatever's latest and
at a higher level, regardless of what it is.

Peter prefers '3'. All ordinary waits will point
up to stages, and I think table-io is not important
enough to be worthy of a level with all the trimmings
that we see for stages.

Marc Alff has said he disagrees, his opinion will
appear in this space soon.

Since there is no agreement yet, we will do nothing
(so NESTING_EVENT_ID is NULL) and will try to
resolve the matter as part of a later worklog task.

Summaries
---------

The history and summary tables EVENTS_WAITS_HISTORY,
EVENTS_WAITS_HISTORY_LONG, EVENTS_SUMMARY_BY_THREAD...,
and so on are along the same lines as for other events.

There will not be a summary by table. That is
part of a different task which comes later. See "Object
Summaries" in WL#4816.

Double Timing (Deferred)
------------------------

By default, in SETUP_INSTRUMENTS, we have rows like:
NAME ENABLED TIMED
---- ------- -----
wait/io/table/sql/handler YES YES
That is: table io should be instrumented and timed.

With some storage engines table io is little more
than a memory copy, so the instrumentation might
take more time than the operation. The only slow
operations will be due to file io, so we'll see
this in EVENTS_WAITS_HISTORY:
EVENT_NAME OPERATION TIMER_START TIMER_END TIMER_WAIT
---------- --------- ----------- --------- ----------
wait/io/table.... fetch 100 200 100
wait/io/file... read 101 199 98
In this example the file io starts after the table io starts,
and the file io ends before the table io ends. So statements like
SELECT SUM(TIMER_WAIT) FROM EVENTS_WAITS_HISTORY;
are misleading, they count the same thing twice.

This does not affect EVENTS_WAITS_SUMMARY_BY_EVENT_NAME
or other aggregation which keep summaries separated.

Some possible solutions (pick one):

1. Recommend to users that they should set TIMED='NO'
for table instruments. This will cause TIMER_START and
TIMER_END and TIMER_WAIT values to be NULL or zero.

2. Add a new column TIMER_WAIT_OF_NESTED_EVENTS. This
goes up whenever a nested event finishes. The value in
a table io TIMER_WAIT column is calculated then as
(TIMER_END - TIMER_START) - TIMER_WAIT_OF_NESTED_EVENTS.

3. When calculating a global aggregate, rather than
using a simple SUM(), add only the events which have
no double counting. That is, skip wait events which
are flagged as "nested". This requires a new column.

Voting so far:
Sergei Golubchik prefers "3".

This matter has not been resolved, so for now we will
ignore table io in all timing aggregations. That is,
SUM_TIMER_WAIT and MIN_TIMER_WAIT and MAX_TIMER_WAIT
and AVG_TIMER_WAIT do not include timer information
from anything in the wait/io/table family. We will
have to resolve this in a later worklog task.

Global row io counters
----------------------

We want to be compatible with existing counters
that track global io. At least they might be
useful for checking validity of our results.
But if our instrumentation is in a different
place, our numbers will differ.

Slow log: ROWS_EXAMINED.

SHOW STATUS: Innodb_rows_read, Innodb_rows_updated,
Key_reads, Key_writes.

Google's SHOW ... STATISTICS patch:
number of rows fetched per index,
number of rows fetched or changed per table.

DTrace: see source code, and MySQL Reference Manual "Row-Level Probes"
http://dev.mysql.com/doc/refman/6.0/en/dba-dtrace-ref-rowlevel.html

Example from EVENTS_WAITS_CURRENT
---------------------------------

THREAD_ID: 2
EVENT_ID: 815
EVENT_NAME: wait/io/table/sql/handler
SOURCE: handler.cc:5509
TIMER_START: 405354593873242
TIMER_END: 405354595869556
TIMER_WAIT: 1996314
SPINS: NULL
OBJECT_SCHEMA: test
OBJECT_NAME: t1
OBJECT_TYPE: TABLE
OBJECT_INSTANCE_BEGIN: 200
NESTING_EVENT_ID: NULL
OPERATION: write_row
NUMBER_OF_BYTES: NULL
FLAGS: 0

References
----------

dev-private thread "WL#2360 Performance Schema, Instrumenting Table IO"
starting with
[mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=28725

You must be logged in to tag this worklog

No Comments yet

Votes

Not yet rated.
You must be logged in to vote.

Watches

0 members are watching this worklog
You must be logged in to track this worklog.

Provide Feedback

Please note:
HTML will be purified, but we allow for a number of HTML tags so that you have the flexibility to decorate your comment text to some extent. The comments allow the following HTML tags:

strong, b, em, blockquote, a, code, pre

To put code into your comment, simply encapsulate your code with
[code language="XXX"][/code], where XXX is any common language, for instance "PHP", "SQL", "C", etc.



You must be logged in to comment