[prog] Introduction and Perl: Flat File DB Query Question

Sabine Konhaeuser sjmk at gmx.net
Wed Mar 15 12:42:23 EST 2006


On Wednesday 15 March 2006 01:12, Jacinta Richardson wrote:
> Katherine Spice wrote:
> > $k = 0;
> > $found = '';
> > foreach $comparezip (@zip1) {
> >         if ($comparezip eq $rc_zip) {
> >                 $found = 'yes';
> >                 last;
> >         }
> >         $k++;
> > }
>
> I don't think you mean to initialise $k where you do.  Otherwise each time
> through the outer loop (once per line of the file) $k will be reset to 0.
>
> While this could have the effect intended as far as ensuring that each zip
> code is unique, it will have the unintended side-effect of slowing the
> program execution down a *lot* on large files.
>
> What this code does is search through a growing array for every new line in
> the file.  So, if your file is 100 lines long, then you'll perform this
> search 100 times, for array lengths of 0 to 99 units long.  In Comp. Sci
> parlance, this means this algorithm is O(N^2), which usually means it's not
> efficient for large N (the length of the file in this case).
>
> A better solution would be to use a hash:
>
>    my %seen;
>    while (<INPUTFILE>) { #begin while
>        chop;
>        ($rc_name, $street_address, $city, $state, $rc_zip) = split (/\|/);
>
>        # If we've already seen this do something.
>        if( $seen{$rc_zip}++ ) {
>               ....;
>               next;
>        }
>
>        # It's a new zip.
>
>
> The other points you raised highlight the real problems with this code.
>
> 	* the data file is opened and read twice (neither time with a
> 	  mode, so there's a security issue there).  Fortunately this is O(2N)
> 	  which is effectively the same as O(N) in the scheme of things.
>
> 	* The zip array and distance are not linked.  This would be a very
> 	  sensible place to use an array of hashes.
>
> 	* If we have to sort the array after creating it, why don't we just
> 	  use a hash to start with?
>
> If I have some spare time today, I might do a code review and suggest some
> greater changes to make things work better.
>
> All the best,
>
>      Jacinta

Thanks everyone for the input. 

For the duplication fix I went with the approach Jacinta sent me, which she 
also describes in the mail quoted above. 

That does exactly what I want. The solution looks so simple, yet when you 
don't know Perl all that well, it's rather hard to figure out. I really don't 
know why people say Perl is easy to learn... 
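For anyone following along, here is a minimal, self-contained sketch of the 
%seen hash approach from the quoted mail. The field names match the split() 
shown above, but the sample data and file layout are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;      # zip code => number of times we've seen it
my @unique;    # first occurrence of each zip, in file order

while (my $line = <DATA>) {
    chomp $line;
    my ($rc_name, $street_address, $city, $state, $rc_zip)
        = split /\|/, $line;

    # $seen{$rc_zip}++ is 0 (false) the first time a zip appears
    # and non-zero (true) on every later appearance, so duplicates
    # are skipped in a single pass over the file.
    next if $seen{$rc_zip}++;

    push @unique, $rc_zip;
}

print scalar(@unique), " unique zips\n";

__DATA__
Store A|1 Main St|Springfield|IL|62701
Store B|2 Oak Ave|Springfield|IL|62701
Store C|9 Elm Rd|Chicago|IL|60601
```

The whole trick is the post-increment: the hash lookup and the "have I seen 
this before?" test happen in one expression, and each test costs the same no 
matter how many lines have already been read.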

The other problem we are seeing is that some of the locators are getting quite 
slow. One of the files has over 3000 entries (store locations). After it hit 
the 1500 mark it became noticeably slower. These Perl scripts are rather old, 
and the folks who first created them probably never anticipated more than 50 
or so locations, all with unique zip codes.
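That slowdown is consistent with the O(N^2) scan Jacinta described, and her 
suggestion to link each zip to its record data with a hash would also fix it. 
A rough sketch of that idea, with made-up field names and data, keying a hash 
of records by zip so each lookup is constant-time instead of an array scan:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One pass to build the index: zip => hashref of record fields.
# After this, finding a store by zip never rescans the file or an array,
# which is what matters once there are thousands of entries.
my %store_for;

while (my $line = <DATA>) {
    chomp $line;
    my ($name, $street, $city, $state, $zip) = split /\|/, $line;
    next if exists $store_for{$zip};    # keep the first store per zip
    $store_for{$zip} = {
        name   => $name,
        street => $street,
        city   => $city,
        state  => $state,
    };
}

# Constant-time lookup by zip:
my $query = '60601';
if (my $store = $store_for{$query}) {
    print "$store->{name} in $store->{city}\n";
}

__DATA__
Store A|1 Main St|Springfield|IL|62701
Store C|9 Elm Rd|Chicago|IL|60601
```

With 3000 entries the difference is dramatic: the old approach does on the 
order of 3000 array scans while reading the file, the hash version does one 
cheap lookup per line.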

Apparently I'm not much of a Perl programmer, so I really do appreciate the 
input I receive. So much to learn.

Thanks.

-- 
Sabine Konhaeuser
