Processing Text with Perl Functions - Page 6
August 29, 2001
In this article, we will learn how to effectively leverage Perl's
built-in text handling functions to process CSV files and perform
an e-mail merge.
Review
In the last article, we learned that Perl is a great language for
text processing. We established that
there are three different mechanisms for processing text. They
are regular expressions, built-in functions, and
loadable modules. We learned how to find and replace
strings within files with a simple but powerful recursive script.
The benefits of processing text with Perl functions
As you'll learn in this article, Perl includes many useful
built-in functions for processing text that reduce the amount of
code that you have to write. Many other languages include basic
token processing features, but do not go to the trouble of embedding
a set of common processing routines directly into the language.
Some would argue that defining a set of common routines violates
language purity. That may be true in some circles of thought, but
I would argue that adding these features has allowed Perl
programmers to focus more on getting real work done rather than
getting buried in the intricacies of language semantics. Of
course, the benefit of having a common set of routines is that,
well, you have a common set of routines that other Perl
programmers will understand and expect in your programs. This
makes it much easier to decipher the text processing portions of
code written by other programmers as opposed to having to
decipher the details of individual routines that different
programmers use in their code to do basically the same thing.
Again, this is why Perl is an ideal language for text processing.
It comes with all of the text processing capabilities that you
will probably ever need right out of the box.
Parsing a CSV file
One of the most common text processing operations that I've
performed over the years is reading in and writing out delimited
files. These files may come from Excel spreadsheets, databases,
system logs, and countless other sources.
Users often need to move data out of one program or
database and into another. As a programmer, I usually ask for a
CSV (comma-delimited) file as input and write a program to
import the data into the second application or database. Often
the process will be automated and happen on a routine basis.
The simplest way to process a delimited file is to use the Perl
split() function, which takes the delimiter pattern and the
string to process as arguments.
To demonstrate, I've created a spreadsheet in Excel and saved it
as a CSV file using a comma as a delimiter and double quotes to
surround the text fields. The contents are below:
"Name","Email","Phone","City","State","Zip"
"Jonathan Eisenzopf","eisen@pobox.com","703-555-1212","Reston","VA",20191
"John Bigboote","bigboote@yoyodyne.com","703-555-1213","Fairfax","VA",20814
With the comma as our delimiter, the split() call looks like
the following, where the $line variable contains the
delimited line:
my @list = split(/,/,$line);
If the value of $line contained the first line of
the CSV file listed above, split() would assign the
following values to the @list array:
"Name"
"Email"
"Phone"
"City"
"State"
"Zip"
Notice that the first argument of the split()
function call includes the delimiter, a comma, but is also
surrounded by forward slashes. That's because the first argument
is actually a pattern written in the syntax of the match operator,
that is, a regular expression. This feature can be useful if the
data files contain different delimiters.
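As a rough sketch of what that means in practice, the script below first splits the header line on a plain comma, then uses a character class to split lines that use a comma, a semicolon, or a tab. The mixed-delimiter lines and the variable names @mixed and @rows are made up for illustration; they are not part of the sample file.

```perl
use strict;
use warnings;

# Header line from the sample file: a plain comma pattern is enough.
my $line = '"Name","Email","Phone","City","State","Zip"';
my @list = split(/,/, $line);
print scalar(@list), " fields\n";

# Hypothetical lines that mix delimiters: comma, semicolon, and tab.
my @mixed = ("Reston,VA,20191", "Reston;VA;20191", "Reston\tVA\t20191");
my @rows;
foreach my $mixed_line (@mixed) {
    # The character class matches any one of the three delimiters.
    push @rows, [ split(/[,;\t]/, $mixed_line) ];
}
print join(' | ', @$_), "\n" foreach @rows;
```

Each of the three mixed lines yields the same three fields, because the pattern treats all three characters as delimiters.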
Something else that you'll probably want to do is get rid of the
leading and trailing double quote characters that surround the
contents of each field. We can do this by adding a statement
right after the split() that loops over each item of
the @list array and removes the quotes:
s/^"|"$//g foreach @list;
You might remember the search and replace operator from the last
article. The foreach statement modifier hands each item of
the array in turn to the search and replace operator, which removes
the double quote character if it exists at the beginning or end of
the string.
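Putting the two statements together, a minimal sketch of the whole parsing step might look like the following. It assumes each record sits on a single line and that no field contains an embedded comma; data with commas inside quoted fields would need a dedicated CSV parsing module such as Text::CSV from CPAN. The @lines and @records names are chosen here for illustration.

```perl
use strict;
use warnings;

# Records from the sample CSV file, one record per line.
my @lines = (
    '"Name","Email","Phone","City","State","Zip"',
    '"Jonathan Eisenzopf","eisen@pobox.com","703-555-1212","Reston","VA",20191',
    '"John Bigboote","bigboote@yoyodyne.com","703-555-1213","Fairfax","VA",20814',
);

my @records;
foreach my $line (@lines) {
    my @list = split(/,/, $line);   # naive split: assumes no commas inside fields
    s/^"|"$//g foreach @list;       # strip the leading and trailing quote, if any
    push @records, [@list];
}

# Print the name and e-mail address of each data row (skip the header).
foreach my $record (@records[1 .. $#records]) {
    print "$record->[0] <$record->[1]>\n";
}
```

Because foreach aliases each array element rather than copying it, the substitution changes the contents of @list in place.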
In case the regular expression looks a bit confusing, let's
examine it piece by piece. The caret character followed by a
double quote tells the search and replace operator (the s
character in front of the forward slash) to find a double quote
at the beginning of the string. The pipe character is an OR
operator. It's followed by another double quote and a dollar sign,
which mean "match a double quote at the end of the string". The
next forward slash closes the regular expression that we're
matching. Between the second and third forward slashes, we place
the characters that we want to replace the matched characters
with. In this case, we want to remove the characters, so
we don't put anything between them. We also have the global
modifier (g) tacked on to the end of the expression, which
means, "match it as many times as you can." Without it, only the
first match, the leading quote, would be removed.
So the whole expression reads, "At the beginning of the string find
a double quote, or find a double quote at the end of the string,
and keep searching the string until you find all of them."
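To see what the global modifier contributes, the short sketch below applies the same expression to one field with and without it. The $field value and the $once and $all variable names are picked purely for illustration.

```perl
use strict;
use warnings;

my $field = '"Reston"';

# Without g, substitution stops after the first match (the leading quote).
(my $once = $field) =~ s/^"|"$//;

# With g, both the leading and the trailing quote are removed.
(my $all = $field) =~ s/^"|"$//g;

print "once: $once\n";   # once: Reston"
print "all:  $all\n";    # all:  Reston
```

The copy-then-substitute idiom, (my $copy = $field) =~ s/.../.../, leaves the original $field untouched so that both results can be compared.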