Creating Large DB

Contents

[edit] Creating a large data set

[edit] Purpose

Testing for performance or new features often requires a large collection of data.

Creating such data is not always an option. You can easily generate records of repeating data, but that's not the same as data coming from the real world. Here we'll explain how to get the data from a large collection of data from a reliable source.


[edit] System requirements

The instructions in this article are suitable for a Unix operating system. They have been testes on several flavors of Linux and on Mac OS X 10.4.

Building the test database require some storage. Depending on the engine you choose, the storage requirements ranges from 2 to 17 GB.

container size notes
raw data 3.5 GB CSV compressed (zip)
MyISAM 13 GB with PK
MyISAM 8 GB partitioned, 1 index
InnoDB 16 GB with PK
InnoDB 13 GB partitioned, 1 index
Archive 2 GB no indexes

The examples given use a dedicated memory buffer of 2 GB.

[edit] Source

The source used for this exercise is provided by Bureau of Transportation Statistics. The data is packaged as CSV files, and they are available from July 1987 to July 2007 (at the time of writing). To get the data easily, you can use the following script:

#!/bin/sh
mkdir flightstats
cd flightstats

BASE_NAME=On_Time_On_Time_Performance
BASE_URL="http://transtats.bts.gov/Download/$BASE_NAME"

Y=1987
for M in `seq 7 12`
do
  echo $M
  wget ${BASE_URL}_${Y}_${M}.zip
done 


for Y in `seq 1988 2008`
do 
  echo $Y
  for M in `seq 1 12`
  do 
    echo $M
    wget ${BASE_URL}_${Y}_${M}.zip
  done 
done

Y=2009
for M in `seq 1 2`
  do 
    echo $M
    wget ${BASE_URL}_${Y}_${M}.zip
done

[edit] Creating the structure

The table description for the downloaded files include 57 fields. Of these, only 14 were used for the test table.

create table flightstats (
AirlineID int not null,
UniqueCarrier char(3) not null,
Carrier char(3) not null,
FlightDate date not null,
FlightNum char(5) not null,
TailNum char(8) not null,
ArrDelay double not null,
ArrTime datetime not null,
DepDelay double not null,
DepTime datetime not null,
Origin char(3) not null,
Dest char(3) not null,
Distance int not null,
Cancelled char(1) default 'n'
)

When given, the primary key is a composite natural key.

primary key (FlightDate, AirlineID, Carrier, UniqueCarrier, FlightNum, Origin, DepTime, Dest)


[edit] Loading the data into a table

Once the data has been saved to the server, you need to load it into the table. Since not all the fields were used, and a few columns need some manipulation to be loaded, you need to use a few extended features from the LOAD DATA INFILE command.

# load.sql
load data local infile '_FILENAME_' into table flightstats
fields terminated by ',' optionally enclosed by '"'
ignore 1 lines
(
@Year, @Quarter, @Month,
AirlineID, UniqueCarrier, Carrier, FlightDate,
@DayofMonth, @DayOfWeek, @Flights,
FlightNum, TailNum,
@ActualElapsedTime, @CRSElapsedTime, @AirTime, @ArrDel15, @ArrDel30, @ArrDelSys15,
@ArrDelSys30, @ArrDelay, @ArrTime, @ArrTimeBlk, @CRSArrTime, @DepDel15, @DepDel30,
@DepDelSys15, @DepDelSys30, @DepDelay, @DepTime, @DepTimeBlk, @CRSDepTime,
Origin,
@OriginCityName, @OriginState, @OriginStateFips, @OriginStateName, @OriginWac,
Dest,
@DestCityName, @DestState, @DestStateFips, @DestStateName, @DestWac,
Distance,
@DistanceGroup, @TaxiIn, @TaxiOut, @WheelsOff, @WheelsOn, @Cancelled, 
@CancellationCode, @Diverted, @CarrierDelay, @WeatherDelay, @NASDelay,
@SecurityDelay, @LateAircraftDelay)
SET ArrTime = concat(FlightDate, ' ', left(@ArrTime,2), ':', right(@ArrTime,2), ':00'),
    DepTime = concat(FlightDate, ' ', left(@DepTime,2), ':', right(@DepTime,2), ':00'),
    Cancelled=if(@Cancelled != 'n' and @Cancelled != 'y', 'n', @Cancelled),
    ArrDelay = if(@ArrDelay is null or @ArrDelay ="", 0, @ArrDelay),
    DepDelay = if(@DepDelay is null or @DepDelay ="", 0, @DepDelay)
;
show warnings;


# load.sh
if [ "$1" == "" ]
then
    echo "missing file name"
    exit
fi
F=$1
FN=`basename $1 .zip` 
FN="$FN.csv"
echo $FN
unzip $F $FN

FN=$FN perl -pe '$BEGIN{$fn = $ENV{FN}}; s/_FILENAME_/$fn/' load.sql | mysql  -v -v -v -t flightstats
rm $FN


And to run ./load.sh for each file you have in your flightstats folder, create this shell script

# batch_load.sh

for i in `ls flightstats`
do
       #see what we are executing
       echo "./load.sh flightstats/$i"
       #for each file, run this command
       `./load.sh flightstats/$i`
       #pause for 0.5 seconds
       `sleep 0.5s`
done

Retrieved from "http://forge.mysql.com/wiki/Creating_Large_DB"

This page has been accessed 4,720 times. This page was last modified 12:17, 3 March 2009.

Find

Browse
MySQLForge
Main Page
Current events
Recent changes
Random page
Help
Edit
Edit this page
Editing help
This page
Discuss this page
Post a comment
Printable version
Context
Page history
What links here
Related changes
My pages
Special pages
New pages
File list
Statistics
Bug reports
More...