[Techtalk] website link checking that does orphans?

Miriam English mim at miriam-english.org
Mon Apr 13 00:40:35 UTC 2015


Hi Akkana,

I'm not sure if this suits what you want, but I would use wget to 
download the whole site to my hard drive and then run linklint on that.

Wget can be told to mirror a site, save all text/html files with a ".html" 
extension, and adjust all links inside the pages so they point to the local 
downloaded copies and reflect the altered names. This avoids any problem 
linklint might have with PHP extensions. (It has been a long time since I 
used linklint, so I can't remember what peculiarities it might have in that 
respect.)

I'd use something like:

wget -m -k -p -E -l 0 -np -e robots=off --no-check-certificate \
     --user-agent="" "$@"
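
("$@" just expands to whatever arguments follow, so that line works as a 
tiny wrapper script; run by hand, you would put the start URL there instead, 
e.g. http://localhost/index.html.)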

-m      shortcut option equivalent to -r -N -l inf -nr
-r      recursive download
-N      don't re-retrieve files unless they are newer than the local copies
-l inf  infinite recursion depth
-nr     don't remove '.listing' files
-k      convert all links to point to the locally downloaded files
-p      get all parts of each page (pictures, stylesheets, etc.)
-E      save all text/html documents with a '.html' extension
-l 0    maximum recursion depth ('inf' or '0' means infinite; already implied 
by -m, but some older versions need it given explicitly)
-np     don't ascend to parent directories (useful if only part of a site is 
wanted)
-e robots=off  ignore robots.txt restrictions - should be used with care
--no-check-certificate  don't validate the server's SSL certificate
--user-agent=""  send an empty user-agent string instead of identifying as wget
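
Once wget has finished, run linklint over the downloaded copy. From memory 
(so please check linklint's documentation for the exact options), something 
along these lines should report broken links as well as orphans:

# "example.com" here is just whatever directory wget created for the mirror
linklint -doc linkdoc -orphan -root ./example.com /@

That should leave plain-text and HTML reports under ./linkdoc, including a 
list of files that nothing links to.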

Best wishes,

	- Miriam
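
P.S. Even without linklint, the mirror itself gives you a rough orphan list: 
anything in the server's document root that wget never fetched was not 
reachable by following links from the start page. A quick-and-dirty 
comparison might look like this (the paths are only examples, and note that 
-E renames files such as page.php to page.php.html, so those will show up 
as spurious mismatches):

( cd ./example.com   && find . -type f | sort ) > /tmp/fetched
( cd /var/www/htdocs && find . -type f | sort ) > /tmp/onserver
comm -13 /tmp/fetched /tmp/onserver   # on the server but never reached by the spider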


On 13/04/15 07:09, Akkana Peck wrote:
> Hi, all --
>
> I've inherited a website that's a complete mess -- I'm sure
> there are tons of orphans there, files that aren't linked to by
> anything, and I'd like to clean them up.
>
> But I'm totally striking out in finding a link checker that will
> tell me about not just broken links, but also orphans.
>
> I found linklint, which has a -orphan flag; but it can only check
> for orphans in a local directory, not with the -http flag. When
> I try using it, I get huge numbers of orphans because one of the
> directories it checks is .. -- in other words, it checks all files
> in my entire filesystem to see whether they're referenced by files
> in my web directory.
>
> Also, since it doesn't go through the web server, it will totally
> miss anything referenced from PHP.
>
> I need something where I can give it a URL, say,
> http://localhost/index.html, plus a directory, say, /var/www/htdocs
> or ~/public_html, and have it start at the URL, go through and
> spider everything accessible from there, and give me a report on
> broken links on the website, plus orphaned files in the directory
> that aren't accessed from the website. Bonus points if I can control
> whether it reports on broken links from external websites, or only
> broken links within localhost.
>
> This seems like such a basic need, and something that would be so
> simple to write, that I'm flabbergasted that I can't find anything
> to do it. But I've been googling for an hour (and this isn't the
> first time I've tried looking for something like this) and I haven't
> found anything that works.
>
>          ...Akkana
> _______________________________________________
> Techtalk mailing list
> Techtalk at linuxchix.org
> http://mailman.linuxchix.org/mailman/listinfo/techtalk
>

-- 
If you don't have any failures then you're not trying hard enough.
  - Dr. Charles Elachi, director of NASA's Jet Propulsion Laboratory
-----
Website: http://miriam-english.org
Blogs:   http://miriam-e.dreamwidth.org
          http://miriam-e.livejournal.com



