Often, when setting up a new database, you will need to migrate data from an old database to MySQL. In the case of our bookstore, let’s suppose that a vendor has sent us a disk with a list of all of their books in a simple text file. Each record for each book is on a separate line, and each field of each record is separated by a vertical bar. Here’s what the fictitious vendor’s data text file looks like:
ISBN|TITLE|AUTHOR LAST|AUTHOR FIRST|COPYRIGHT DATE| 067973452X|Notes from Underground|Dostoevsky|Fyodor|August 1994| ...
Obviously, an actual vendor file would contain more fields and records than are shown here, but this is enough for our example. The first line contains descriptions of the fields in the records that follow. We don’t need to extract the first line; it’s just instructions for us. So, we’ll tell MySQL to ignore it when we enter our SQL statement.
As for the data, we must consider a few problems. First, the fields
are not in the order that they are found in our tables. We’ll have to tell
MySQL the order in which the data will be coming so that it can make
adjustments. The other problem is that this text table contains data for
both our books table and our authors
table. This is going to be a bit tricky, but we can deal with it. What
we’ll do is extract the author information only in one SQL statement, then
we’ll run a separate SQL statement to import the book information. Before
starting, we’ve copied the vendor’s file called
books.txt to a temporary directory (e.g.,
/tmp). Here we run a LOAD DATA INFILE statement from the
mysql client:
LOAD DATA INFILE '/tmp/books.txt' REPLACE INTO TABLE authors FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n' TEXT_FIELDS(col1, col2, col3, col4, col5) SET author_last = col3, author_first = col4 IGNORE col1, col2, col5, 1 LINES;
First, I should point out that the TEXT_FIELDS and the IGNORE clause for columns are not
available before version 4.1 of MySQL. The IGNORE
n LINES clause has been
around for a while, though. With IGNORE 1 LINES, the
first line of the text file containing the column headings will be
ignored. Going back to the first line in this SQL statement, we’ve named
the file to load and the table in which to load the data. The REPLACE flag
has the effect of the REPLACE statement
mentioned earlier. Of course, since the name fields aren’t set to unique,
there won’t be any duplicates as far as MySQL is concerned. In a real-life
situation, you would have to alter your table to prevent duplicates based
on the author’s name.
In the second line, we specify that fields are terminated by a
vertical bar and that lines are terminated by a carriage return
(\r) and a newline (\n). This is the
format for an MS-DOS text file. Unix files have only a
newline to terminate the line.
In the third line of the SQL statement, we create aliases for each column. In the fourth line, we name the table columns to receive data and set their values based on the aliases given in the previous line. In the final line, we tell MySQL to ignore the columns that we don’t want, as well as the top line, because it doesn’t contain data.
If you’re using an older version of MySQL that isn’t able to
ignore unwanted columns, you will have to perform a couple
of extra steps. There are a few different ways of doing this. One simple
way, if the table into which we’re loading data isn’t too large, is to add
three extra, temporary columns to authors that will
take in the unwanted fields of data from the text file and drop them
later. This would look like the following:
ALTER TABLE authors ADD COLUMN col1 VARCHAR(50), ADD COLUMN col2 VARCHAR(50), ADD COLUMN col5 VARCHAR(50); LOAD DATA INFILE '/tmp/books.txt' REPLACE INTO TABLE authors FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (col1, col2, author_last, author_first, col5); ALTER TABLE authors DROP COLUMN col1, DROP COLUMN col2, DROP COLUMN col5;
These statements will work, but they’re not as graceful as the more
straightforward statement shown earlier. In the second SQL statement here,
notice that the IGNORE clause specifies one line to be
ignored. The last line of the same statement lists the columns in the
authors table that are to receive the data and the
sequence in which they will be imported. In the third SQL statement,
having finished importing the data from the vendor’s text file, we now
delete the temporary columns with their unnecessary data by using a
DROP statement. There’s usually no recourse from
DROP, no undo. So take care in using it.
Once we manage to copy the list of authors into the
authors table from the text file, we need to load the
data for the books and find the correct author_id for
each book. We do this through the following:
LOAD DATA INFILE '/tmp/books.txt' IGNORE INTO TABLE books
FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n'
TEXT_FIELDS(col1, col2, col3, col4, col5)
SET isbn = col1, title = col2,
pub_year = RIGHT(col5, 4),
author_id =
(SELECT author_id
WHERE author_last = col3
AND author_first = col4)
IGNORE col3, col4, 1 LINES;In this SQL statement, we’ve added a couple of twists to get what we
need. On the fifth line, to extract the year from the copyright
field—which contains both the month and the year—we use the string
function RIGHT(). It captures the last four
characters of col5 as specified in the second
argument.
The sixth line starts a subquery that determines the
author_id based on data from the
authors table, where the author’s last and first names
match what is found in the respective aliases. The results of the column
selected within the parentheses will be written to the
author_id column.
Finally, we’re having MySQL ignore col3 and
col4, as well as the column heading line. The
IGNORE flag on the first line instructs MySQL to ignore
error messages, not to replace any duplicate rows, and to
continue executing the SQL statement. Doing this maneuver with earlier
versions of MySQL will require temporary columns or a temporary table
along the lines of the previous example. Actually, using a temporary table
is still a prudent method for staging data. After you’ve verified it, you
can execute an INSERT...SELECT statement (see Chapter 6).