Sorting and Searching for Duplicate Records - Page 9
August 29, 2001
Our next task is to sort the records by last and first name and
to eliminate duplicate records. We will eliminate the duplicate
that has the fewest fields of information; or if both have the
same amount of information, we will use the record that has the
most fields. If both records are identical, we'll only keep one.
This program will be a bit longer than the previous because of
the extra comparisons we have to do to check for duplicates, but
we'll step through it.
1 open(NEW1,"newlist.txt") or die "Cannot open newlist.txt for read\n";
2 open(TMP,">$$") or die "Cannot open $$ for write\n";
3 my @records = sort by_last_first();
4
5
6 for (my $i = 0; $i < @records; $i++) {
7 next unless $records[$i] =~ /\w/;
8 chomp($records[$i]);
9 my $skip = 0;
10 my @record1 = split(/,/,$records[$i]);
11 s/^"|"$//g foreach @record1;
12
13 for (my $j = 0; $j < @records; $j++) {
14 next unless $records[$j] =~ /\w/ && $j != $i;
15 chomp($records[$j]);
16 my @record2 = split(/,/,$records[$j]);
17 s/^"|"$//g foreach @record2;
18
19 if (
20 ($record1[0] eq $record2[0])
21 && ($record1[1] eq $record2[1])
22 ) {
23
24 my $rec1_field_cnt = 0;
25 my $rec1_char_cnt = 0;
26 foreach my $field (@record1[2..6]) {
27 $rec1_field_cnt++ if $field =~ /\S+/;
28 $rec1_char_cnt+= length($field);
29 }
30
31 my $rec2_field_cnt = 0;
32 my $rec2_char_cnt = 0;
33 foreach my $field (@record2[2..6]) {
34 $rec2_field_cnt++ if $field =~ /\S+/;
35 $rec2_char_cnt+= length($field);
36 }
37
38 if ($rec1_field_cnt < $rec2_field_cnt) {
39 $skip = 1;
40 last;
41 } elsif (($rec1_field_cnt == $rec2_field_cnt)
42 && ($rec2_char_cnt > $rec1_char_cnt)) {
43 $skip = 1;
44 last;
45 } elsif (($rec1_field_cnt == $rec2_field_cnt)
46 && ($rec1_char_cnt == $rec2_char_cnt)) {
47 $records[$j] = '"","","","","","",""';
48
49 }
50 }
51
52 if ($skip == 1) {
53 next;
54 } else {
55 print TMP '"', join('","',@record1), '"',"\n";
56 }
57 }
58
59 close(NEW1);
60 close(TMP);
61 rename $$, "newlist.txt";
62
63 sub by_last_first {
64 my @first = split(/,/,$a);
65 my @second = split(/,/,$b);
66 s/^"|"$//g foreach @first;
67 s/^"|"$//g foreach @second;
68 if ($first[1] eq $second[1]) {
69 return $first[0] cmp $second[0];
70 } else {
71 return $first[1] cmp $second[1];
72 }
73 }
The first thing we do is open the CSV file and read in the
contents on Lines 1 to 3. On Line 2, we create an output file
handle, which we'll use as a temporary file. On Line 3, we're
going to use the sort() function to sort the entire file
into an array by last and first name. In this case, we're using a
custom sort routine called by_last_first so that we can
control exactly how the lines get sorted. Let's jump down to
Lines 63 through 73 to get a closer look at this subroutine.
By default, the sort() function passes in the two strings
that are being compared. In our case, the strings are a line in
the CSV file, so the first thing we need to do is break apart the
lines into arrays that contain the field values. And that's
exactly what we do in Lines 64 to 67. The code should look
familiar, but you might be wondering where the $a
and $b variables came from. Well, those are the
strings that contain the two lines that we're comparing. These
values are set automatically by the sort() function.
In Lines 68 to 70, we use the cmp() function to
compare the last names of the two lines. If they are equal, we
also have to compare the first names in Line 66, otherwise, a
last name comparison on Line 71 will return the right results.
Ok, now that we have a list of sorted records, we need to
eliminate the duplicates. That's what the rest of the program
does in Lines 6 through 61. It may look like alot of code, so
I'll explain the algorithm in steps by line numbers. The code
loops through each line of the sorted records and splits them
apart into fields on Lines 6 through 11. So for each record, we
have to go back through the records to find a duplicate starting
on Line 13. If we do find a duplicate (Lines 19 to 22), then we
count the number of fields and characters for each record (Lines
24 to 36). If the duplicate record has more fields, then we set
the $skip flag to one, which tells the script not to
write the current record because we have a better one coming up
later. Now, if both records have the same number of fields with
values (Line 41), then we have to do a character count to figure
out which record has more data. If the second record does have
more characters (Line 42), then we skip the current record.
Lastly, if both records are identical, we zero out the record
we're doing the comparison with and keep on searching (Lines 45
to 47). Once the duplicate search for each record has been
completed, it's written to the temporary CSV file as long as the
skip flag was not set to 0 (Lines 52 to 56). Once the duplicate
search has been completed, the filehandles are closed and the
temporary file moved to newlist.txt.
Merging and Cleaning Data - Page 8
Weaving Magic With Regular Expressions
Sending Bulk E-mails - Page 10
|