Obfuscator
Obfuscator is a proposed project which would allow MySQL users to confidentially submit bug reports and support requests to MySQL AB without revealing any sensitive data. It is a part of Google Summmer of Code Ideas for 2008.
Contents |
[edit] Synopsis
Given a SQL query of interest, it is possible to determine what records and tables participate in the result set of the query. The query, the participating tables and records, and the result set can then be obfuscated and delivered to MySQL AB so that the support department can replicate the execution of the query without leaking any sensitive data outside of the submitting organization.
[edit] Benefits
- Semi-automatic bug submission in the manner described herein should result in increased quality of the bug reports. Even people who are not willing/unable to submit a simplifed test case can use the tool to create one.
- For paying customers, support issues can be resolved even if the customer is concerned about possible information leaks.
- For the rest of the community, anonymous reporting would allow users to submit feature requests asking for specific optimizations, along with queries exibiting the behavoir in question. Such reports can then be aggregated to determine which potential query optimizations will have the biggest impact on the community -- such optimizations can then be implemented in the server itself.
- For individual users, such a tool would provide the opportunity to post questions or usage scenarios to blogs or public mailing lists without revealing any sensitive company information.
- For MySQL, it would be possible to obtain real-life data sets and query workloads from customers and use those for benchmarking and testing purposes, to the benefit of the user who provided them and the database product as a whole.
[edit] Method of Operation
[edit] Input Data
The input of the Obfuscator is the SQL statement of interest and all other factors that influence the execution of that statement. Apart from the actual tables and data, all parameters of the server environment, in particular those that influence the optimizer, should also be collected.
[edit] Obfuscation Process
The goal of the obfuscation process is to remove any sensitive information from both the query and the database data. For the purpose, all participating records are obfuscated with a digest function so that their original properties (e.g. field sizes) are preserved. In case it is impossible to preserve the entire data set, at least its important statistical properties are calculated and recorded.
[edit] Review Process
In order for the reporting process to be facilitated, the final report should be human-readable and concise enough so that the DBA can review it and be convinced that no sensitive information is being leaked. The DBA can then request certain parts of the report to be further summarized or ommited altogether. Alternatively, the DBA may willing to relax the rules and approve more data to be submitted to MySQL AB
Further trust can be established if the tool and its method of operation are published as Open Source so that the DBA can review the documentation and the code of the tool and be satisfied with its privacy-preserving properties. A third-party audit may also increase the trust in the tool. Since the code of the tool will likely not fit on a single computer screen, a good design strategy will allow a stripped-down simplifed version can be offered to very demanding users.
[edit] Secure Submission
After review, the report is packaged into a suitable format and delivered to MySQL AB via a secure channel. The report can then be assigned to a specific bug report number, specific support ticket or a specific support contract.
[edit] Issue Replication
A tool can be constructed which takes the obfuscated report and recreates the original environment and query and executes them in order to arrive at the original result. If this operation succeeds, the bug can be considered replicated.
Furthermore, the tool can select a test machine running the proper architecture, compile a fresh copy of the MySQL server using the original compile flags and start it with a custom my.cnf in order to recreate the original environment as much as possible.
[edit] Diagram
[edit] Obfuscation Strategies
[edit] The Query
The query can be easily parsed so that all (possibly information revealing) table and column names are replaced with non-revealing generic names. Literal values are replaced with information regarding the relationship between the literal value and the data it is being compared to (e.g. literal not found in table, literal beyond maximum index value, etc.). For example, the all-revealing query:
SELECT bribes.senator_name FROM mafia_projects LEFT JOIN bribes ON (YEAR(mob_projects.bribe_given) = bribes.date_taken) WHERE secret_field = 'secret_value'
can be obfuscated to become:
SELECT char_field (width 20; not null; no index) FROM myisam_table (size 500, row_format: fixed) LEFT JOIN innodb_table (size 100) ON (YEAR(date_field(not null)) = int_field WHERE enum_field = valid_value_from_enum
An important goal for the query obfuscator would be to reduce each query to its constituent relational algebra operators. This would allow even significantly different queries to be reduced to their common form in order to statistically aggregate them.
[edit] The Metadata
A list of participating databases and tables can be obtained by parsing the query. The table metadata can then be queries to arrive at a table representation that is close enough to allow the original table structure to be recreated close enough. For example, the stock mysql.user table can become:
myisam_table( size 10 records, primary key on 2 columns; 3 char cols, not null, default default; 14 enum cols, 2 choices; );
Metadata should be recorded so that significant features of the table design ar emphasized and insignificant ones (e.g. blob columns which do not participate in the query) are summarized. This way even disparate database structures can be compared based on their common features (e.g. having a two-part primary key).
[edit] The Table Data
It is possible to use several levels of verbosity when recording data, based on secrecy requriements and the maximum desired report size. The options are listed in order of increasing usefullness:
1. Obfuscate all participating data in its entirety
This would allow almost all bugs to be replicated by providing an almost complete copy of the original execution environment.
2. Obfuscate all records examined by the query
This would allow many table-level bugs to be replicated, that is, bugs that demonstrate themselves on more than one row (e.g. query returning more or less rows than desired)
3. Obfuscate all records returned by the query
This would allow most record-level bugs to be replicated, that is, bugs that occur within a single row (e.g. a function returning an unexpected value)
4. Record table properties only (size, key distribution, etc.)
This would allow the behavoir of the optimizer to be replicated, even though the original data is not available. A supporting tool can be created that creates a new table with random data based on the recorded properties of the original table.
For certain classes of queries and bugs, it would be possible for the system to attempt to replicate the bug or the query result with progressively less information until it arrives at the smallest possible data set that exhibits the behavoir of interest. For bugs, this ammounts to automatic reduction towards the smallest possible test case.
To arrive at a futher reduction in the data being reported, columns that do not participate in the query can be omitted or partially reported (e.g. report only the average column width).
The tool will be able to estimate the size of the report -- if the size is too big for the DBA to be comfortable with, a reduction to a less-useful data collection method should be performed.
Finally, the WHERE clause of the query can be decomposed to arrive at the sets of records that participate in forming the final query result. If the query result is formed as an intersection of those individual sets (the AND operator), the tool may instead decide to collect each set independently, constructing a union of all the sets in order to capture more of the original context of the operation.
[edit] The Indexes
The goal when obfuscating the indexes will be to preserve all their properties that may influence the optimizer (and possibly all other index operations). Obfuscating individual records completely changes the values recorded in the index, so extra effort must be put into selecting suitable replacement values so that the properties of the index remain the same.
This is particularily so for indexes on integer columns. Replacing each integer with another number derived from a digest function completely flattens the distribution of values across the entire integer space. This may have considerable impact on range queries. Therefore, instead of using a digest functions, all integer values can be processed by a series of operations which preserve the behavoir of the index (for example, multiply by a constant random factor and then shift all numbers by another constant random factor to another point in the integer space). This way, if the original numbers pertained to salaries, the obfuscated numbers may end up dispersed over a wide range near the maximum possible integer value, which would remove any association between those numbers and any form of currency.
For b-tree indexes, it would be possible to preserve the structure of the tree by selecting such an obfuscation procedure which would place all values at the same tree positions they had in the original index. This way the tree will look almost the same, only storing different records. A drawback to implementing that would be that while any DBA can understand and accept the obfuscation powers of a strong digest function, it may be difficult for some to understand and audit a more complex transformation.
Primary keys which are AUTO_INCREMENT can be preserved as they are. By extension, tables whose only purpose is to link two other tables into a many-to-many relationship on AUTO_INCREMENT keys can be preserved as they are, since they will not reveal confidential information.
[edit] The Environment
If possible, the entire environment of the MySQL server should be recorded (compilation flags, configuration options, session variables). For the purpose of simplifying matters, the environment can be divided into two -- obviously not confidential items (e.g. the SQL mode) which would require no review by the DBA, and potentially confidential ones (e.g. partitioning schema) which can be included in the final report after DBA approval.
The environment from final report can then be automatically processed to produce a my.cnf file and a pointer to a specific binary build which can then used to replicate the bug (or other behavoir) being reported.
[edit] The Result Set
The result set of the query can be obfuscated using the same methods used for obfuscating table data. Ideally, when the obfuscated query is ran against the obfuscated data, it should produce the same obfuscated result. This test can be used to determine that the report being submitted is comprehensive enough.
Once this baseline has been established, for some queries it will be possible to repeat the same sequence with smaller and smaller subsets of the original data in order to determine the smaller data set that produces the same behavoir. Ideally this reduction step should be performed on the customer's side in order to reduce the amount of data being sent over.
[edit] Data Agregation
If the query obfuscator is carefully constructed, it would be possible to compare queries obtained from several sources. If a significant number of users participate in the program, either by manually submitting queries, submitting their entire query logs or allowing the server to submit each query on their behalf, the data thus obtained can be mined for specific usage patterns. For example, the prevalence of specific types of joins or functions "in the wild" can be estimated, which can in turn be used to direct development effort towards those parts of the server.
[edit] External Links
SummerOfCode2008Ideas MySQL Summer of Code Ideas for 2008
DBIx::MyParse or DBIx::MyParsePP will be used for parsing the queries.