Creating Large DB
Contents |
[edit] Creating a large data set
[edit] Purpose
Testing for performance or new features often requires a large collection of data.
Creating such data is not always an option. You can easily generate records of repeating data, but that's not the same as data coming from the real world. Here we'll explain how to get the data from a large collection of data from a reliable source.
[edit] System requirements
The instructions in this article are suitable for a Unix operating system. They have been testes on several flavors of Linux and on Mac OS X 10.4.
Building the test database require some storage. Depending on the engine you choose, the storage requirements ranges from 2 to 17 GB.
| container | size | notes |
| raw data | 3.5 GB | CSV compressed (zip) |
| MyISAM | 13 GB | with PK |
| MyISAM | 8 GB | partitioned, 1 index |
| InnoDB | 16 GB | with PK |
| InnoDB | 13 GB | partitioned, 1 index |
| Archive | 2 GB | no indexes |
The examples given use a dedicated memory buffer of 2 GB.
[edit] Source
The source used for this exercise is provided by Bureau of Transportation Statistics. The data is packaged as CSV files, and they are available from July 1987 to July 2007 (at the time of writing). To get the data easily, you can use the following script:
#!/bin/sh
mkdir flightstats
cd flightstats
BASE_NAME=On_Time_On_Time_Performance
BASE_URL="http://transtats.bts.gov/Download/$BASE_NAME"
Y=1987
for M in `seq 7 12`
do
echo $M
wget ${BASE_URL}_${Y}_${M}.zip
done
for Y in `seq 1988 2008`
do
echo $Y
for M in `seq 1 12`
do
echo $M
wget ${BASE_URL}_${Y}_${M}.zip
done
done
Y=2009
for M in `seq 1 2`
do
echo $M
wget ${BASE_URL}_${Y}_${M}.zip
done
[edit] Creating the structure
The table description for the downloaded files include 57 fields. Of these, only 14 were used for the test table.
create table flightstats ( AirlineID int not null, UniqueCarrier char(3) not null, Carrier char(3) not null, FlightDate date not null, FlightNum char(5) not null, TailNum char(8) not null, ArrDelay double not null, ArrTime datetime not null, DepDelay double not null, DepTime datetime not null, Origin char(3) not null, Dest char(3) not null, Distance int not null, Cancelled char(1) default 'n' )
When given, the primary key is a composite natural key.
primary key (FlightDate, AirlineID, Carrier, UniqueCarrier, FlightNum, Origin, DepTime, Dest)
[edit] Loading the data into a table
Once the data has been saved to the server, you need to load it into the table. Since not all the fields were used, and a few columns need some manipulation to be loaded, you need to use a few extended features from the LOAD DATA INFILE command.
# load.sql
load data local infile '_FILENAME_' into table flightstats
fields terminated by ',' optionally enclosed by '"'
ignore 1 lines
(
@Year, @Quarter, @Month,
AirlineID, UniqueCarrier, Carrier, FlightDate,
@DayofMonth, @DayOfWeek, @Flights,
FlightNum, TailNum,
@ActualElapsedTime, @CRSElapsedTime, @AirTime, @ArrDel15, @ArrDel30, @ArrDelSys15,
@ArrDelSys30, @ArrDelay, @ArrTime, @ArrTimeBlk, @CRSArrTime, @DepDel15, @DepDel30,
@DepDelSys15, @DepDelSys30, @DepDelay, @DepTime, @DepTimeBlk, @CRSDepTime,
Origin,
@OriginCityName, @OriginState, @OriginStateFips, @OriginStateName, @OriginWac,
Dest,
@DestCityName, @DestState, @DestStateFips, @DestStateName, @DestWac,
Distance,
@DistanceGroup, @TaxiIn, @TaxiOut, @WheelsOff, @WheelsOn, @Cancelled,
@CancellationCode, @Diverted, @CarrierDelay, @WeatherDelay, @NASDelay,
@SecurityDelay, @LateAircraftDelay)
SET ArrTime = concat(FlightDate, ' ', left(@ArrTime,2), ':', right(@ArrTime,2), ':00'),
DepTime = concat(FlightDate, ' ', left(@DepTime,2), ':', right(@DepTime,2), ':00'),
Cancelled=if(@Cancelled != 'n' and @Cancelled != 'y', 'n', @Cancelled),
ArrDelay = if(@ArrDelay is null or @ArrDelay ="", 0, @ArrDelay),
DepDelay = if(@DepDelay is null or @DepDelay ="", 0, @DepDelay)
;
show warnings;
# load.sh
if [ "$1" == "" ]
then
echo "missing file name"
exit
fi
F=$1
FN=`basename $1 .zip`
FN="$FN.csv"
echo $FN
unzip $F $FN
FN=$FN perl -pe '$BEGIN{$fn = $ENV{FN}}; s/_FILENAME_/$fn/' load.sql | mysql -v -v -v -t flightstats
rm $FN
And to run ./load.sh for each file you have in your flightstats folder, create this shell script
# batch_load.sh
for i in `ls flightstats`
do
#see what we are executing
echo "./load.sh flightstats/$i"
#for each file, run this command
`./load.sh flightstats/$i`
#pause for 0.5 seconds
`sleep 0.5s`
done