Terence Junjie LIU

A pseudo programmer's blog

Loading Big Files in MySQL

TerenceLiu98 / 2021-02-25


Recap

Recently I have been working on a project about financial news and index forecasting, which means I need to collect news from multiple sources, including traditional newspapers and so-called “internet media”. Crawling is hard because nearly every news provider has recently put up a subscription paywall. But somehow, I found a project called the “Global Database of Events, Language and Tone” (GDELT), and I am trying to get the news I need from that database.

Preprocessing

For some reason, the university’s network cannot access Google directly, so I just added a proxy for the server I am using. BTW, proxychains4 is so good that it has helped me a lot.
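
As a quick sketch (the SOCKS5 address 127.0.0.1:1080 is just a placeholder for whatever proxy you actually run), proxychains4 reads its proxy list from a config file and then wraps any command:

# in /etc/proxychains.conf, under [ProxyList]:
# socks5 127.0.0.1 1080

# then prefix any command that needs to reach Google:
proxychains4 curl -I https://www.google.com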

gdeltPyR and gdeltr2

These are two packages for collecting data from GDELT on two platforms (gdeltPyR is for Python and gdeltr2 is for R), and both are good packages that spare me from paying Google and Amazon, since the official ways of querying are BigQuery and Athena. You can check out the packages at these URLs:

https://github.com/linwoodc3/gdeltPyR
https://github.com/abresler/gdeltr2
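
To try them out, installation is straightforward; a minimal sketch (note that the PyPI package name of gdeltPyR is gdelt, and gdeltr2 is typically installed from GitHub):

# Python
pip install gdelt

# R, from GitHub
Rscript -e 'devtools::install_github("abresler/gdeltr2")'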

The GitHub repos provide some examples, so I will skip this part.

Loading data into MySQL

The reason for using a database is that the data is so large that reading it straight from .csv files would waste a lot of time. For now I have only collected the events data from 2014 to 2020, and it takes about 170 GB of storage. Hence, storing the data in a database may be a good choice for querying.
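
Note that LOAD DATA needs the target table to exist first. The sketch below creates a heavily cut-down events table with only a few of the columns; the real GDELT events schema has about 60 fields, so check the official codebook before creating yours:

mysql -u"[USERNAME]" -p"[PASSWORD]" [DATABASE_NAME] << EOF
CREATE TABLE IF NOT EXISTS events (
    GLOBALEVENTID   BIGINT PRIMARY KEY,   -- unique event id
    SQLDATE         INT,                  -- event date as YYYYMMDD
    Actor1Name      VARCHAR(255),
    Actor2Name      VARCHAR(255),
    EventCode       VARCHAR(4),           -- CAMEO event code
    GoldsteinScale  DOUBLE,
    AvgTone         DOUBLE,
    SOURCEURL       TEXT
);
EOF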

Loading such a huge file is not easy on any system, so I split the file into separate small files and load them one by one. One catch: after splitting, only the first chunk contains the CSV header, so the script strips the header before splitting instead of using IGNORE 1 LINES on every chunk (which would silently drop a real row from each chunk after the first). This is the shell script:

#!/bin/bash
# Usage: ./load.sh [FILE_BASENAME] [TABLE_NAME]
# expects a file named [FILE_BASENAME].csv in the current directory

function loadFile(){
        # $1 = chunk file, $2 = target table
        mysql -u"[USERNAME]" -p"[PASSWORD]" --local-infile=1 [DATABASE_NAME] << EOF
        SET GLOBAL local_infile = 'ON';
                LOAD DATA LOCAL INFILE "$1" INTO TABLE $2
                        COLUMNS TERMINATED BY ','
                        OPTIONALLY ENCLOSED BY '"'
                        ESCAPED BY '"'
                        LINES TERMINATED BY '\n';
exit
EOF
}

echo "data is: $1"
echo "table is: $2"

mkdir "$1"
mv "$1.csv" "$1/"
cd "$1/" || exit 1

echo "splitting file..."
# strip the header line before splitting: otherwise only the first chunk
# would contain it, and IGNORE 1 LINES would drop a real row from every
# other chunk
tail -n +2 "$1.csv" > body.csv
split -l 10000 -d body.csv data
rm -f body.csv
rename 's/$/\.csv/' data*
echo "the split is done"
echo "transferring the csv files into database [DATABASE_NAME] table $2"
sleep 3

for fileName in data*; do
        echo "begin loading file $fileName..."
        loadFile "$fileName" "$2"
        rm -f "$fileName"
        echo "finished loading $fileName and deleted it"
done

cd ..
rm -rf "$1/"
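
To use it, assuming the 2014–2020 events sit in events.csv and the target table is events (both names are placeholders), the script is run as:

bash load.sh events events

Since the whole point of the database was faster querying, it also pays to add an index on the date column once the load finishes; a minimal sketch:

mysql -u"[USERNAME]" -p"[PASSWORD]" [DATABASE_NAME] << EOF
CREATE INDEX idx_sqldate ON events (SQLDATE);          -- speed up date-range scans
SELECT COUNT(*) FROM events
        WHERE SQLDATE BETWEEN 20200101 AND 20201231;   -- e.g. all 2020 events
EOF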