[prog] open file with hundreds of thousand lines....

Conor Daly c.daly at met.ie
Mon Oct 14 09:21:35 EST 2002


On Sun, Oct 13, 2002 at 12:50:27AM -0700 or thereabouts, Abel Pires da Silva wrote:
> Hi all,
> 
> I'm testing my simple proxy program (written in C),
> which opens a file containing hundreds of thousands of
> lines (about 500000 lines) for address comparison
> purposes (using the strstr() function).
> The program will go on to access the web server if the
> address is not in the 500000-line list.
> I put in a time function to count the time spent during the
> comparison (code from sneha, friendly...). It took
> only about 4 seconds to make the comparison up to the last
> line (the 500000th line).
> The problem is that my program becomes very slow accessing
> the server after making the comparison...
> 
> Note:
> The process was very fast when I used a file that has only
> *several* instead of *hundreds of thousands* of lines;
> the web server responded almost immediately...
> 
> What could be causing this?

I guess it's related to the "hundreds of thousands".  Could this be causing
other stuff to swap out to disk and have to be reloaded?  Are you reading
the 500k lines from disk each time?  Do your time count again, but this time
_after_ rebooting the proxy machine (to make sure your 500k lines are not
sitting in memory / cache), to see if there's a difference.  Also use the
time functions to get stats for the "accessing server" bit of the program.
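
To make that concrete, here is a minimal sketch of timing each phase
separately.  It assumes a POSIX system with clock_gettime(); the names
elapsed_seconds() and time_phase() and the callback shape are made up for
illustration and are not from your program.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Seconds between two clock readings (end is assumed to be later than start). */
double elapsed_seconds(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Run one phase of the request and report its wall-clock time on stderr.
 * `phase` is a hypothetical callback standing in for "do the comparison"
 * or "access the server". */
double time_phase(const char *label, void (*phase)(void *), void *arg)
{
    struct timespec t0, t1;
    double secs;

    assert(label != NULL && phase != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    phase(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = elapsed_seconds(&t0, &t1);
    fprintf(stderr, "%s: %.3f s\n", label, secs);
    return secs;
}
```

Wrapping the list scan and the server access in separate time_phase() calls
should show which of the two is actually slow.  On older systems you may need
to link with -lrt, or fall back to gettimeofday().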

One thing you might try (if it is down to such a large amount of data in
memory / on disk) is to split the list up by domain (.com, .net, .org) and
then by alphabetical section (a, b, c), and do the comparison on only the
matching section.  You then have 3 or 4 comparisons to make on the "Host"
line to decide which _section_ to scan.  To do this, you would need to have
the program create its sections from the 500k lines datafile at startup time
and then have each thread decide which section it needs and load _that_ from
disk.  Something like this pseudocode:

program start {

	load 500k lines {
		check line for domain {
			check line for first letter {
				temp file = /tmp/proxy/<first_letter>.<domain>
				write line to temp file
			}
		}
	}

	thread start {
		check "HOST" line for domain {
			check "HOST" line for first letter {
				temp file = /tmp/proxy/<first_letter>.<domain>
				comparison loop {
					read lines from temp file
					do comparison
				}
			}
		}
	}
}

Where your proxy program creates files like

/tmp/proxy/a.com
/tmp/proxy/b.com
/tmp/proxy/c.com
/tmp/proxy/a.org
/tmp/proxy/b.org
/tmp/proxy/c.org
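
A rough C sketch of the same idea follows.  It is only a sketch under some
assumptions: one bare host name per line in the list, host names already in
lower case, an exact strcmp() match rather than your strstr() test, and a
section directory that already exists and is empty.  The function names
bucket_path(), split_list() and host_is_listed() are made up for
illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Build "<dir>/<first_letter>.<domain>", e.g. "/tmp/proxy/e.com" for
 * "example.com".  Returns 0 on success, -1 if the host has no usable
 * first letter or domain, or if the result does not fit in `out`. */
int bucket_path(const char *dir, const char *host, char *out, size_t outsz)
{
    const char *dot, *p;
    char *q;
    int n;

    assert(dir != NULL && host != NULL && out != NULL);
    dot = strrchr(host, '.');                 /* domain = text after last dot */
    if (dot == NULL || dot == host || dot[1] == '\0')
        return -1;
    if (!isalnum((unsigned char)host[0]))
        return -1;
    for (p = dot + 1; *p != '\0'; p++)        /* keep '/' etc. out of the path */
        if (!isalnum((unsigned char)*p))
            return -1;

    n = snprintf(out, outsz, "%s/%c.%s", dir, host[0], dot + 1);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    for (q = out + strlen(dir); *q != '\0'; q++)   /* lower-case the file name */
        *q = (char)tolower((unsigned char)*q);
    return 0;
}

/* Startup: append each line of the big list to its section file.
 * Returns the number of lines written, or -1 if the list cannot be opened.
 * Reopening the section file per line is slow but keeps the sketch short. */
long split_list(const char *list_path, const char *dir)
{
    char line[512], path[600];
    long written = 0;
    FILE *in = fopen(list_path, "r");

    if (in == NULL)
        return -1;
    while (fgets(line, sizeof line, in) != NULL) {
        FILE *out;
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (bucket_path(dir, line, path, sizeof path) != 0)
            continue;                         /* not a bare host name: skip */
        out = fopen(path, "a");
        if (out == NULL)
            continue;
        fprintf(out, "%s\n", line);
        fclose(out);
        written++;
    }
    fclose(in);
    return written;
}

/* Per request: scan only the section matching the "Host" value.
 * Returns 1 if the host is listed, 0 otherwise (including no section file). */
int host_is_listed(const char *dir, const char *host)
{
    char line[512], path[600];
    int found = 0;
    FILE *in;

    if (bucket_path(dir, host, path, sizeof path) != 0)
        return 0;
    in = fopen(path, "r");
    if (in == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';
        found = (strcmp(line, host) == 0);
    }
    fclose(in);
    return found;
}
```

You would call split_list("yourlist.txt", "/tmp/proxy") once at startup and
host_is_listed("/tmp/proxy", host) in each thread.  One catch with sectioning
by first letter: "www.example.com" and "example.com" land in different
sections, so a substring match like strstr() will not carry over unchanged.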

Or is that completely crazy?

Conor (speculating...)
-- 
Conor Daly 
Met Eireann, Glasnevin Hill, Dublin 9, Ireland
Ph +353 1 8064276 Fax +353 1 8064247
------------------------------------
bofh.irmet.ie running RedHat Linux  9:04am  up 3 days, 20:53,  4 users,  load average: 0.02, 0.10, 0.08





