[Techtalk] Robust link checker?

Sam Watkins sam at myinternet.com.au
Tue Oct 16 14:06:29 EST 2001


I replied yesterday but forgot to cc the list.

On Mon, Oct 15, 2001 at 12:26:10PM -0400, Raven wrote:
> looking for a link checker program that can run through the site and
> give her a list of broken links so that she can fix them.  The problem
> is, there are hundreds of thousands of links on the site, and all the
> programs she's tried so far have crashed, unable to handle a site of
> that size.

I could whip something up quite easily if you like; I have written
simple spiders in Perl and in Java before, and this sounds quite
similar.  I can't imagine what crud software she must have been using if
it couldn't handle lots of links - I mean, how much memory does it take
to store a link!?  Maybe "Windows platform" is a hint.
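
For what it's worth, here is roughly what I'd start from - an untested
sketch using LWP::UserAgent and HTML::LinkExtor from CPAN.  The timeout
and the same-host rule are just things I made up; tune to taste:

#!/usr/bin/perl
# Untested sketch of a breadth-first link checker.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $start = shift @ARGV or die "usage: $0 <start-url>\n";
my $host  = URI->new($start)->host;    # assumes an http:// start URL

my $ua = LWP::UserAgent->new(timeout => 15);
my %seen;                     # URLs already fetched or checked
my @queue = ($start);         # pages still to crawl

while (my $page = shift @queue) {
    next if $seen{$page}++;
    my $res = $ua->get($page);
    unless ($res->is_success) {
        print "BROKEN: $page (", $res->status_line, ")\n";
        next;
    }
    next unless $res->content_type eq 'text/html';

    # Collect link attributes; passing a base URL makes them absolute.
    my @links;
    my $parser = HTML::LinkExtor->new(
        sub { my ($tag, %attr) = @_; push @links, values %attr },
        $res->base,
    );
    $parser->parse($res->content);

    for my $link (@links) {
        my $u = URI->new($link);
        next unless $u->scheme && $u->scheme =~ /^https?$/;
        $u->fragment(undef);           # ignore #anchors
        if ($u->host eq $host) {
            push @queue, $u->as_string;    # recurse within the site
        } elsif (!$seen{$u->as_string}++) {
            # off-site links just get a HEAD request, no recursion
            my $r = $ua->head($u->as_string);
            print "BROKEN: ", $u->as_string,
                  " (", $r->status_line, ")\n" unless $r->is_success;
        }
    }
}

The only state it keeps is the %seen hash, so even a few hundred
thousand links should fit in tens of megabytes.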

That said, I don't think we even need to write code - all your base are
belong to Debian:

$ apt-cache search broken | grep -i links
htcheck - Utility for checking web site for dead/external links
linkchecker - check HTML documents for broken links

$ apt-cache show htcheck
Package: htcheck
<snip>
Description: Utility for checking web site for dead/external links
 ht://Check is more than a link checker.  It's a console application
 written for Linux systems in C++ and derived from the best search
 engine available on the Internet for free (GNU GPL): ht://Dig.
 .
 It can retrieve information through HTTP/1.1 and store them in a MySQL
 database, and it's particularly suitable for small Internet domains or
 Intranet.
 .
 Its purpose is to help a webmaster managing one or more related sites:
 after a "crawl", ht://Check gives back very useful summaries and
 reports, including broken links, anchors not found, content-types and
 HTTP status codes summaries, etc.

$ apt-cache show linkchecker
Package: linkchecker
<snip>
Description: check HTML documents for broken links
 Features:
  o recursive checking
  o multithreaded
  o output can be colored or normal text, HTML, SQL, CSV or a sitemap
    graph in GML or XML
  o HTTP/1.1, FTP, mailto:, nntp:, news:, Gopher, Telnet and local
    file links are supported
  o restrict link checking with regular expression filters for URLs
  o proxy support
  o give username/password for HTTP and FTP authorization
  o robots.txt exclusion protocol support
  o i18n support
  o command line interface
  o (Fast)CGI web interface (requires HTTP server)


There are probably others that my simple search missed.




