[Techtalk] Scraping a webpage from a website.

Telsa Gwynne hobbit at aloss.ukuu.org.uk
Fri May 2 17:01:57 EST 2003

On Fri, May 02, 2003 at 11:49:43AM -0400 or thereabouts, Jennifer Davis wrote:
> Hi.  I was wondering if there was a simple way to script and download a
> website so it can be placed somewhere else.  Essentially I need to
> retrieve a site for a non-profit provider that is disbanding.  I would
> assume a shell or perl script could do it, but my scripting skills are not
> quite there yet.  Thanks...

Do you mean you want a copy of a "complete" website? 

The first thing I'd do is ask the webmaster for a tarball of it :) 

Failing that, I don't think you need to write a script. wget already
exists. It has a pile of command-line options, some of which are
important for your sanity (for example, limiting the recursion depth
so you don't end up mirroring the entire world-wide web) and others
of which will keep the webserver admin happy (maximum number of tries
per page; time to wait between retrievals).
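As a rough sketch, an invocation along these lines covers the options
mentioned above (the URL is a placeholder, and you'd want to check the
wget manual for the exact behaviour of each flag on your version):

```shell
# Politely mirror a site with wget: bounded recursion, local links,
# a pause between fetches, and a retry limit. URL is a placeholder.
wget \
  --recursive \
  --level=5 \
  --no-parent \
  --page-requisites \
  --convert-links \
  --wait=2 \
  --tries=3 \
  http://example.org/
```

--convert-links rewrites the saved pages so they link to each other
locally, which matters if the copy is going to live somewhere else.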

Ethical questions like "whose content is it anyway?" I leave to you.

Or do you mean you want a tool which will generate a site from
a pile of data? I once wrote myself a script which generated a
(valid HTML!) sitemap for my pages which relied entirely on sed, 
cat, and some very carefully-placed meta tags. So it can be 
done :) I'm not sure I'd recommend the approach I took though! 
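The original script isn't shown, but the general idea can be sketched
with sed and per-page meta tags; the file names, the meta tag name, and
the sitemap layout here are all invented for illustration:

```shell
#!/bin/sh
# Hedged sketch of a sed-and-meta-tags sitemap generator.
# Each page carries a <meta name="description" ...> tag; sed pulls the
# description out and the loop stitches the entries into sitemap.html.

# Create a sample page so the sketch is self-contained (hypothetical).
cat > about.html <<'EOF'
<html><head><meta name="description" content="About the group"></head></html>
EOF

printf '<html><head><title>Sitemap</title></head><body><ul>\n' > sitemap.html
for page in *.html; do
    [ "$page" = sitemap.html ] && continue
    # Extract the content attribute of the description meta tag.
    desc=$(sed -n 's/.*<meta name="description" content="\([^"]*\)".*/\1/p' "$page")
    # Fall back to the file name if a page has no description.
    printf '<li><a href="%s">%s</a></li>\n' "$page" "${desc:-$page}" >> sitemap.html
done
printf '</ul></body></html>\n' >> sitemap.html
```

Fragile against line wrapping and attribute order, as sed-based HTML
parsing always is, which is presumably why the approach isn't one to
recommend.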

