[Techtalk] Scraping a webpage from a website.

Telsa Gwynne hobbit at aloss.ukuu.org.uk
Fri May 2 17:01:57 EST 2003


On Fri, May 02, 2003 at 11:49:43AM -0400 or thereabouts, Jennifer Davis wrote:
> Hi.  I was wondering if there was a simple way to script and download a
> website so it can be placed somewhere else.  Essentially I need to
> retrieve a site for a non-profit provider that is disbanding.  I would
> assume a shell or perl script could do it, but my scripting skills are not
> quite there yet.  Thanks...

Do you mean you want a copy of a "complete" website? 

The first thing I'd do is ask the webmaster for a tarball of it :) 

Failing that, I don't think you need to write a script. wget already
exists. It has a pile of command-line options, some of which are
important for your sanity (limiting the recursion depth so you don't
end up mirroring the entire world-wide web, for example) and others
of which will keep the webserver admin happy (a maximum number of
tries per page; a wait between retrievals).
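
For instance, something along these lines (a sketch, not a recipe:
http://www.example.org/ stands in for the real site, and the numbers
are only starting points):

    wget --recursive --level=5 --no-parent \
         --convert-links --page-requisites \
         --tries=3 --wait=2 \
         http://www.example.org/

--level stops the endless recursion and --no-parent keeps it on the
one site; --convert-links and --page-requisites make the saved copy
browsable offline; --tries and --wait are the politeness knobs.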

Ethical questions like "whose content is it anyway?" I leave to
you.

Or do you mean you want a tool which will generate a site from
a pile of data? I once wrote myself a script which generated a
(valid HTML!) sitemap for my pages; it relied entirely on sed,
cat, and some very carefully-placed meta tags. So it can be
done :) I'm not sure I'd recommend the approach I took though!
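
Something in this spirit would do it (a sketch, not my actual
script; header.html, footer.html and the exact meta tag are
stand-ins for whatever you actually use):

    # Assumes every page carries a line like
    #   <meta name="description" content="What this page is about">
    cat header.html > sitemap.html      # doctype, <title>, opening <ul>
    for f in *.html; do
        [ "$f" = sitemap.html ] && continue   # don't list ourselves
        # pull the description out of the carefully-placed meta tag
        desc=$(sed -n 's/.*<meta name="description" content="\([^"]*\)".*/\1/p' "$f")
        echo "<li><a href=\"$f\">${desc:-$f}</a></li>" >> sitemap.html
    done
    cat footer.html >> sitemap.html     # closing </ul>, </body> etc.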

Telsa

