[Techtalk] Robust link checker?
sam at myinternet.com.au
Tue Oct 16 14:06:29 EST 2001
I replied yesterday but forgot to cc the list.
On Mon, Oct 15, 2001 at 12:26:10PM -0400, Raven wrote:
> looking for a link checker program that can run through the site and
> give her a list of broken links so that she can fix them. The problem
> is, there are hundreds of thousands of links on the site, and all the
> programs she's tried so far have crashed, unable to handle a site of
> that size.
I could whip something up quite easily if you like; I have written
simple spiders in Perl and Java before, and this sounds quite similar.
I can't imagine what crud software she must have been using if it
couldn't handle lots of links - I mean, how much memory does it take
to store a link!? Maybe "Windows platform" is a hint.
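For what it's worth, here is roughly what I had in mind - a bare-bones
spider in Perl using LWP and HTML::LinkExtor. It's only a sketch: the
start URL, the timeout and the same-host rule are placeholders to adjust.

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Rough sketch of a breadth-first link checker: fetch each URL once,
# report failures, and only crawl deeper on HTML pages from the start host.
my $start = shift || 'http://www.example.com/';   # placeholder start page
my $ua    = LWP::UserAgent->new(timeout => 20);
my %seen;
my @queue = ($start);

while (my $url = shift @queue) {
    next if $seen{$url}++;
    my $res = $ua->get($url);
    unless ($res->is_success) {
        print "BROKEN: $url (", $res->status_line, ")\n";
        next;
    }
    # Only parse pages served as HTML from the same host we started on.
    next unless $res->content_type eq 'text/html';
    next unless URI->new($url)->host eq URI->new($start)->host;

    # HTML::LinkExtor resolves relative links against the page URL for us.
    my $parser = HTML::LinkExtor->new(undef, $url);
    $parser->parse($res->content);
    $parser->eof;
    for my $link ($parser->links) {
        my ($tag, %attrs) = @$link;
        my $href = $attrs{href} || $attrs{src};
        push @queue, $href if $href && $href =~ /^http/;
    }
}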
That said, I don't think we need to write code - all your base are belong to apt:
$ apt-cache search broken | grep -i links
htcheck - Utility for checking web site for dead/external links
linkchecker - check HTML documents for broken links
$ apt-cache show htcheck
Description: Utility for checking web site for dead/external links
ht://Check is more than a link checker. It's a console application
written for Linux systems in C++ and derived from the best search
engine available on the Internet for free (GNU GPL): ht://Dig.
It can retrieve information through HTTP/1.1 and store them in a MySQL
database, and it's particularly suitable for small Internet domains or intranets.
Its purpose is to help a webmaster managing one or more related sites:
after a "crawl", ht://Check gives back very useful summaries and
reports, including broken links, anchors not found, content-types and
HTTP status codes summaries, etc.
$ apt-cache show linkchecker
Description: check HTML documents for broken links
o recursive checking
o output can be colored or normal text, HTML, SQL, CSV or a sitemap
graph in GML or XML
o HTTP/1.1, FTP, mailto:, nntp:, news:, Gopher, Telnet and local
file links are supported
o restrict link checking with regular expression filters for URLs
o proxy support
o give username/password for HTTP and FTP authorization
o robots.txt exclusion protocol support
o i18n support
o command line interface
o (Fast)CGI web interface (requires HTTP server)
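linkchecker in particular looks like a one-liner to run. Something along
these lines should do it (the URL is just a placeholder, and check the man
page for the exact recursion and output options):

$ linkchecker http://www.example.com/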
There are probably others that my simple search missed.