[Techtalk] compiling a bunch of HTML files

Devdas Bhagat devdas at dvb.homelinux.org
Thu Sep 9 18:01:04 EST 2004


On 09/09/04 13:35 +0300, Jaroslaw Fedevych (UALUG wrote:
> On Thu, Sep 09, 2004 at 11:11:22AM +0100, Noir wrote:
> > I have a bunch of (about 100!) HTML files that I'd
> > like to compile in one file. All the files have
> > frames, redundant pictures I'd like to exclude; they
> 
> Find out which files are actually content; sort them out;
> you can also write a simple perl thing which eliminates
> <img> tags (I bet there's already one-liner for you); cat
> them together.

Actually, there is no "simple" parser for HTML/SGML content. If you want
to use Perl, use HTML::Parser and script it.
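
To illustrate the "use a real parser and script it" approach, here is a
minimal sketch. It is written in Python with the standard library's
html.parser rather than Perl, purely as an illustration. The names
ImgStripper and strip_images are made up for this example. It re-emits
the markup as it goes and drops only <img> tags. It does not deal with
frames or decide which files are actual content.

```python
from html.parser import HTMLParser


class ImgStripper(HTMLParser):
    """Re-emit HTML as parsed, skipping <img> tags (hypothetical helper)."""

    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            # get_starttag_text() returns the tag exactly as it appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/> or <img ... />.
        if tag != "img":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "img":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_images(html):
    """Return the given HTML string with all <img> tags removed."""
    parser = ImgStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_images() on each file's contents and concatenating the
results would cover the "eliminate <img> tags, cat them together" part
of the suggestion quoted above. The context-dependent parts still need
a human.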

I don't think this can be done purely mechanically, since a lot of the
information is context-dependent.

If you just want the text, without the markup, then lynx -dump is a
good way of extracting it.
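
As a rough sketch of how the lynx -dump step could be scripted over many
files, the following Python wrapper shells out to lynx. The function
names are hypothetical. The -nolist flag, which suppresses the list of
link URLs that lynx appends to a dump, is my own addition and can be
dropped. It assumes lynx is installed and on the PATH.

```python
import shutil
import subprocess


def lynx_dump_command(path):
    """Build the argument list for dumping one HTML file as plain text."""
    # -dump writes the rendered page to stdout; -nolist (an assumption
    # about what is wanted here) omits the trailing list of link URLs.
    return ["lynx", "-dump", "-nolist", path]


def dump_text(path):
    """Run lynx on a local HTML file and return the rendered text."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(
        lynx_dump_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Looping dump_text() over the content files and joining the results
would give one plain-text file to start the layout from.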

Personally, I would extract the text content, insert images as needed
and then do the markup and layout.

Devdas Bhagat

