Category:DataWarehousing
Contents |
Overview
Data Warehousing refers to using a database for online archiving and reporting purposes. Wikipedia:Data Warehousing It is a subset of On Line Analytical Processing (OLAP) Wikipedia:OLAP
Data Warehousing (DW) and Online Transaction Processing (OLTP) typically have opposite requirements:
- DW operations are mostly mass load (append), delete and read. OLTP is transactional.
- DW performance is concerned with time to load and throughput. OLTP is concerned with Queries Per Second (QPS)
- DW databases are very large compared to OLTP databases, so reloading data or rebuilding indexes may not be practical.
MySQL focuses on transactions and OLTP, so is not designed specifically for DW. However, many small to medium sized businesses successfully use MySQL for DW.
Existing MySQL 5.0 Data Warehousing Features
- Bulk Importing
- LOAD DATA INFILE, similar to Oracle SQL*Loader
- "connectors" for C/C , Java, PHP, Perl, Ruby, etc.
- Bulk Exporting
- statement-based, asynchronous replication to slave hosts to facilitate data dumping
- mysqldump or SELECT INTO OUTFILE
- file-system copy of databases or tables
- "connectors" for C/C , Java, PHP, Perl, Ruby, etc.
- Reporting
- Crystal Reports and Business Objects available
- friendly text and GUI query tools and manuals for DBAs and end-users
- fairly standard SQL syntax for ad hoc queries
- max-join-size option to prevent overly large JOINs
- --safe-updates option (aliased to i-am-a-dummy) limits number of updates and deletes in one statement
- Resource limits are stored in the max-questions, max-updates, and max-connections columns.
- Mass Storage
- multiple storage engines (table types) with different capabilities and performance features
- MyISAM table is non-transactional, fast and supports full-text search
- merge table
- archive table
- blackhole table for relay slaves and real-time summary counts using triggers on INSERT
- Commercial Table Types
- NitroEDB for MySQL has quick insert and pre-computed aggregates $20,000/server
- [1] BrightHouse for MySQL Available Now for Archiving Multi-Terabyte Data Warehouses
- multiple storage engines (table types) with different capabilities and performance features
- can symlink databases and partitions across multiple partitions
- Stored Procedures
- reduce network bandwidth. useful when summarizing many records.
- SELECT ... GROUP BY ... WITH ROLLUP()
- affordable
Missing Features or Problems with MySQL 5.0 Data Warehousing
MySQL is designed to be an OLTP database, not a DW. So one cannot realistically expect much in the way of DW support compared to say Terradata. However, the addition of a few features may satisfy some MySQL users (80/20 rule.)
- Problem: Indexes must persist and be available without rebuilding
- index should be forward-compatible to all future MySQL versions. Other databases version-number their indexes.
- never drop or rebuild an index without permission. ALTER TABLE behavior needs to never drop or rebuild an index unless requested.
- Problem: loading data must be easy and fast
- working on parallel loader for 5.1
- multi-threaded quick bulk loader to use all available disk and network io
- bulk loader should be able to filter in or out records
- bulk loader should support multiple hosts and databases
- bulk loader import format should be binary or text data
- incorporate all features from Oracle SQL*Loader, like DIRECT option to bypass indexes
- do log or binary logs should be complete for creating an identical slave.
- Problem: storage will likely need to be added on the fly because data warehousing data sets are very large
- database should be able to connect to Disk Space Partitions/Volumes flexibly while online
- flexible merge/partitioning to add/drop partitions while online while allowing SELECTs (not read-locked)
- tolerant of missing partitions, similar to IBM Red Brick
- Problem: raw data needs to be conveniently summarized
- IBM Red Brick allows a persistent summary table to be defined that is updated while loading and is available even after partitions are dropped
- Problem: data warehouses often use denormalized data
- add Materialized View Wikipedia: Wikipedia:Materialized View is a concrete view to pre-joined data. An index is not required.
- Problem: need support from standard ETL loading tools
- Pentaho Kettle - supported
- Informatica - unknown - $100,000 est.
- Ardent - unknown - $100,000 est.
- Ab Initio - unknown - $100,000 est.
- Talend - supported - Open Source and Commercial versions
- Problem: data may be loaded while data warehouse is in operation
- Need atomic online table name cutover after loading, like Oracle
- MySQL does have a rename table command, but seems to have limitations:
- "When you execute RENAME, you cannot have any locked tables or active transactions."
- "As long as two databases are on the same filesystem, you can use RENAME TABLE to move a table from one database to another"
- "As of MySQL 5.0.14, RENAME TABLE also works for views, as long as you do not try to rename a view into a different database."
Typical Small-scale Data Warehousing Scenario with MySQL 5.0
Feedback from MySQL conference attendees indicate that medium-sized companies typically have a data warehousing setup like this:
- 3 databases
- master production database using InnoDB storage engine
- replicated slave database using InnoDB storage engine (same schema as prod)
- fact database using MyISAM storage engine containing historical production and other data (different schema designed for data warehousing)
- replicate from production to slave in real-time.
- hourly or daily slave dump and load into fact schema
- link commonly-used canned data warehousing reports to Intranet
- distribute MySQL Query Browser or Quest TOAD for MySQL tool to users for ad hoc SQL queries against the fact database
Resources
--James Briggs 21:57, 12 November 2006 (UTC)
Articles in category "DataWarehousing"
There are 0 articles in this category.