[Techtalk] compiling a bunch of HTML files
Devdas Bhagat
devdas at dvb.homelinux.org
Thu Sep 9 18:01:04 EST 2004
On 09/09/04 13:35 +0300, Jaroslaw Fedevych (UALUG wrote:
> On Thu, Sep 09, 2004 at 11:11:22AM +0100, Noir wrote:
> > I have a bunch of (about 100!) HTML files that I'd
> > like to compile in one file. All the files have
> > frames, redundant pictures I'd like to exclude; they
>
> Find out which files are actually content; sort them out;
> you can also write a simple perl thing which eliminates
> <img> tags (I bet there's already a one-liner for you); cat
> them together.
Actually, there is no "simple" parser for HTML/SGML content. If you want
to use Perl, use HTML::Parse and script it.
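
To illustrate the "use a real parser and script it" approach, here is a
minimal sketch in Python rather than Perl, using only the standard
library's html.parser. The names ImgStripper and strip_images are made up
for this example. It re-emits the markup as it reads it and drops only
the <img> tags; processing instructions and other unusual constructs are
not handled, so treat it as a starting point, not a finished tool.

```python
from html.parser import HTMLParser


class ImgStripper(HTMLParser):
    """Re-emit the input HTML, leaving out every <img> tag."""

    def __init__(self):
        # convert_charrefs=False: entities such as &amp; reach the
        # handle_entityref/handle_charref hooks and are written back as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            # get_starttag_text() returns the tag exactly as it was written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/> or <img ... />.
        if tag != "img":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "img":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_images(html):
    """Return html with all <img> tags removed (hypothetical helper)."""
    parser = ImgStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_images() on the contents of each file and concatenating the
results would cover the "eliminate <img> tags" step; deciding which files
are content and in what order they belong is still a manual job.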
I don't think this can be done purely mechanically, since a lot of the
information is context-dependent.
If you are just looking for the content, without markup, then lynx -dump
is a good way of extracting content.
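
If lynx is not available, a rough stand-in for the text extraction can be
sketched with Python's standard html.parser. This is not what lynx does
internally and it will not reproduce lynx's layout or link numbering; the
names TextExtractor, extract_text and the BLOCK tag list are assumptions
made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of a page, ignoring all markup."""

    SKIP = {"script", "style"}  # contents of these are never shown as text
    # Tags that start a new line in the output (an illustrative subset).
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes entities
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the visible text of html, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)
```

The output is plain text only, so headings, emphasis and images have to
be put back by hand afterwards.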
Personally, I would extract the text content, insert images as needed
and then do the markup and layout.
Devdas Bhagat