Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions


WDVL Newsletter

Active Server Pages
JSP/Java Servlets
Microsoft SQL Server
Daily Backup
Dedicated Servers
Streaming Audio/Video
24-hour Support    

jobs.webdeveloper.com

Hiermenus


e-commerce
Partner With Us















Developer Channel
FlashKit.com
JavaScript.com
JavaScriptSource
Developer Jobs
ScriptSearch
StreamingMediaWorld
Web Developer's Journal
Web Developer's Virtual Library
WebDeveloper.com
Webreference
Web Hosts
XMLfiles.com

internet.com
IT
Developer
Internet News
Small Business
Personal Technology

Search internet.com
Advertise
Corporate Info
Newsletters
Tech Jobs
E-mail Offers


Sorting and Searching for Duplicate Records - Page 9

August 29, 2001

Our next task is to sort the records by last and first name and to eliminate duplicate records. We will eliminate the duplicate that has the fewest fields of information; or if both have the same amount of information, we will use the record that has the most fields. If both records are identical, we'll only keep one. This program will be a bit longer than the previous because of the extra comparisons we have to do to check for duplicates, but we'll step through it.

1  open(NEW1,"newlist.txt") or die "Cannot open newlist.txt for read\n";
2  open(TMP,">$$") or die "Cannot open $$ for write\n";
3  my @records = sort by_last_first();
4
5
6  for (my $i = 0; $i < @records; $i++) {
7      next unless $records[$i] =~ /\w/;
8      chomp($records[$i]);
9      my $skip = 0;
10     my @record1 = split(/,/,$records[$i]);
11     s/^"|"$//g foreach @record1;
12
13     for (my $j = 0; $j < @records; $j++) {
14 	 next unless $records[$j] =~ /\w/ && $j != $i;
15	 chomp($records[$j]);
16	 my @record2 = split(/,/,$records[$j]);
17	 s/^"|"$//g foreach @record2;
18
19	 if (
20	     ($record1[0] eq $record2[0])
21	     && ($record1[1] eq $record2[1])
22	     ) {
23
24	     my $rec1_field_cnt = 0;
25	     my $rec1_char_cnt = 0;
26	     foreach my $field (@record1[2..6]) {
27	 	 $rec1_field_cnt++ if $field =~ /\S+/;
28		 $rec1_char_cnt+= length($field);
29	     }
30
31	     my $rec2_field_cnt = 0;
32	     my $rec2_char_cnt = 0;
33	     foreach my $field (@record2[2..6]) {
34		 $rec2_field_cnt++ if $field =~ /\S+/;
35		 $rec2_char_cnt+= length($field);
36	     }
37
38	     if ($rec1_field_cnt < $rec2_field_cnt) {
39		 $skip = 1;
40		 last;
41	     } elsif (($rec1_field_cnt == $rec2_field_cnt)
42		      && ($rec2_char_cnt > $rec1_char_cnt)) {
43		 $skip = 1;
44		 last;
45	     } elsif (($rec1_field_cnt == $rec2_field_cnt)
46		      && ($rec1_char_cnt == $rec2_char_cnt)) {
47		 $records[$j] = '"","","","","","",""';
48
49 	  }
50     }
51
52     if ($skip == 1) {
53 	  next;
54     } else {
55	 print TMP '"', join('","',@record1), '"',"\n";
56     }
57 }
58
59 close(NEW1);
60 close(TMP);
61 rename $$, "newlist.txt";
62
63 sub by_last_first {
64     my @first = split(/,/,$a);
65     my @second = split(/,/,$b);
66     s/^"|"$//g foreach @first;
67     s/^"|"$//g foreach @second;
68     if ($first[1] eq $second[1]) {
69 	 return $first[0] cmp $second[0];
70     } else {
71	 return $first[1] cmp $second[1];
72     }
73 }

The first thing we do is open the CSV file and read in the contents on Lines 1 to 3. On Line 2, we create an output file handle, which we'll use as a temporary file. On Line 3, we're going to use the sort() function to sort the entire file into an array by last and first name. In this case, we're using a custom sort routine called by_last_first so that we can control exactly how the lines get sorted. Let's jump down to Lines 63 through 73 to get a closer look at this subroutine.

By default, the sort() function passes in the two strings that are being compared. In our case, the strings are a line in the CSV file, so the first thing we need to do is break apart the lines into arrays that contain the field values. And that's exactly what we do in Lines 64 to 67. The code should look familiar, but you might be wondering where the $a and $b variables came from. Well, those are the strings that contain the two lines that we're comparing. These values are set automatically by the sort() function. In Lines 68 to 70, we use the cmp() function to compare the last names of the two lines. If they are equal, we also have to compare the first names in Line 66, otherwise, a last name comparison on Line 71 will return the right results.

Ok, now that we have a list of sorted records, we need to eliminate the duplicates. That's what the rest of the program does in Lines 6 through 61. It may look like alot of code, so I'll explain the algorithm in steps by line numbers. The code loops through each line of the sorted records and splits them apart into fields on Lines 6 through 11. So for each record, we have to go back through the records to find a duplicate starting on Line 13. If we do find a duplicate (Lines 19 to 22), then we count the number of fields and characters for each record (Lines 24 to 36). If the duplicate record has more fields, then we set the $skip flag to one, which tells the script not to write the current record because we have a better one coming up later. Now, if both records have the same number of fields with values (Line 41), then we have to do a character count to figure out which record has more data. If the second record does have more characters (Line 42), then we skip the current record. Lastly, if both records are identical, we zero out the record we're doing the comparison with and keep on searching (Lines 45 to 47). Once the duplicate search for each record has been completed, it's written to the temporary CSV file as long as the skip flag was not set to 0 (Lines 52 to 56). Once the duplicate search has been completed, the filehandles are closed and the temporary file moved to newlist.txt.

Merging and Cleaning Data - Page 8
Weaving Magic With Regular Expressions
Sending Bulk E-mails - Page 10


Up to => Home / Authoring / Languages / Perl / Weave




Jupiter Online Media: internet.comearthweb.comDevx.commediabistro.comGraphics.com

Search:

Jupitermedia Corporation has two divisions: Jupiterimages and Jupiter Online Media

Jupitermedia Corporate Info


Legal Notices, Licensing, & Permissions, Privacy Policy.

Web Hosting | Newsletters | Tech Jobs | Shopping | E-mail Offers