[Techtalk] Web page downloads with null characters

Thu Oct 23 09:26:15 EST 2003

On Thu, Oct 23, 2003 at 10:17:56AM +1000 or thereabouts, John Clarke wrote:
> On Wed, Oct 22, 2003 at 12:19:52 +0200, Hamster wrote:
> 
> > But what happens when you choose View Source (or your browser's equivalent)?
> 
> it's fine.  i see exactly the same content as when i fetch the page
> manually or with wget.

Same here. Renders correctly.

> it's starting to sound like a locale/charset problem.  what are your $LANG
> and $SUPPORTED?  on my rh7.3 machine (the one i use most) i have:

I didn't even know about SUPPORTED. Where's that from? What sets it?

> the default for rh9 is to use utf-8:
...
> i've removed utf-8 because it's known to break a few things.  i can't
> test the page in question on that box because it's my web/mail server
> and doesn't have X installed (and even if it did, it's 20km away and
> i'm only on a 56k dialup).
> 
> if your $LANG includes utf-8, can you try changing it to see if it makes
> a difference?

Not without restarting a whole pile of things I need open. But yes,
I think it's a charset/locale/font thing. I didn't realise you were
not using UTF-8. I assumed that as you were on RH, you were using
it. What does it break, in your experience? 

Slight digression: I remember, um, "teething troubles" when Red Hat
switched: scripts depending on LC_COLLATE=C went all funny, and 
a lot of man pages have/had spurious characters. Some people 
reported grep taking a very long time to run. And all my filenames
with non-ASCII characters (mostly ogg files from bands who don't
sing in English) turned into names like "heil?g_j?l.ogg" and 
"tolldy_t?_coch.ogg".

You can fix the first with "export LC_COLLATE=C" in your .bashrc
or at the top of your script.
You can fix the second with "LANG=C man whatever".
You can fix the third the same way: "LANG=C grep thingy".
And a friend wrote a program to correct filenames which had broken.
I don't know whether he's put it anywhere on the web. If not, he should :) 
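
A minimal sketch of that kind of filename repair, in Python. This is
not the program mentioned above. It assumes the broken names were
written as ISO-8859-1 bytes before the switch to UTF-8; the function
names repaired_name and repair_directory are made up for illustration.

```python
import os


def repaired_name(raw: bytes, legacy: str = "iso-8859-1") -> bytes:
    """Return a UTF-8 version of a filename given as raw bytes.

    Assumption: any name that is not valid UTF-8 was written in the
    `legacy` encoding (ISO-8859-1 unless told otherwise).
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (plain ASCII counts too)
    except UnicodeDecodeError:
        return raw.decode(legacy).encode("utf-8")


def repair_directory(directory: bytes, dry_run: bool = True) -> list:
    """List (old, new) name pairs for entries that need repair.

    With dry_run=False the entries are also renamed on disk.
    """
    changes = []
    # Passing bytes to os.listdir makes it return raw bytes names,
    # so undecodable names come through untouched.
    for old in os.listdir(directory):
        new = repaired_name(old)
        if new != old:
            changes.append((old, new))
            if not dry_run:
                os.rename(os.path.join(directory, old),
                          os.path.join(directory, new))
    return changes


# Hypothetical example name: an e-acute stored as the single byte 0xE9.
print(repaired_name(b"caf\xe9.ogg"))
```

Running repair_directory(b".") with the default dry_run=True only
reports what would change, which seems the safer first step.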

Is there anything else that you have found it breaks? I need UTF-8
because I want non-English characters. Since the change, most things
have been fixed: or at least the ones where people have filed bugs
about them. I know it's a pain to file one "man page has silly
characters" bug after another, but it seems to be the only way
to do it.

Anyway, yes, I bet this is a character encoding thing. I recently
had fun and games trying to work out why a web page in UTF-8
that I wrote on a UTF-8-happy machine worked here, but had 
weird characters after I had used scp to move it to another machine.
I was on the point of believing scp must strip out non-ASCII :) 
It turned out that the other machine assumed the file was ASCII
by default and the web server was serving it all as iso-8859-1.
I had to put AddDefaultCharset UTF-8 into a .htaccess file to 
make the validator happy again. And I had to correct my locale
on the box in order for vim to be able to correct all the 
characters. 
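
The effect can be reproduced in a few lines of Python. This is only
a sketch of the general mechanism (the bytes arrive unchanged, but
the reader decodes them with the wrong charset); the sample word and
the helper name as_seen_by are made up for illustration.

```python
def as_seen_by(text: str, stored: str = "utf-8",
               assumed: str = "iso-8859-1") -> str:
    """Show how `text`, saved in `stored`, looks when read as `assumed`."""
    return text.encode(stored).decode(assumed)


original = "café"               # hypothetical page content
garbled = as_seen_by(original)  # what an iso-8859-1 reader displays
print(garbled)

# Nothing was lost in transit: reversing the wrong decoding step
# gives back the original text.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == original)
```

The copy itself did nothing wrong here; telling the reader the right
charset is what makes the text display properly again.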

I am avoiding trying to _explain_ any of this because I really
don't understand it well. There seem to be very few docs which
are at my level on the subject. I'm just sticking to "things
I have seen happen". I keep hoping someone else will write some
decent "what is this all about, why, and how can I use it?"
documentation. If I understood it well enough, be sure I'd have
written it already: writing docs is a great way to learn more or
to realise you don't know enough yet!

Telsa