[Techtalk] umlauts

Anthony de Boer linuxchix at lists.leftmind.net
Tue Sep 2 14:34:24 EST 2003


Shirrell wrote:
>     A record containing O-umlaut SOMETIMES comes in
>     as a single character ascii 214, as one would expect,
>     and sometimes as 4 characters, \326.
>  ...
> Are there any accepted conventions for these characters.

There are *several* accepted conventions for characters above and beyond
the original ASCII.  Making sure the sender and recipient of the data
agree on the same set is important!
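
To see why that agreement matters, here is a minimal sketch in Python (my choice of language for illustration; nothing in the thread says what the original software was written in). It reads the single byte 214 under three ISO-8859 parts and reports which character each one yields. The helper name decode_under is made up for this example.

```python
import unicodedata

def decode_under(raw, codecs):
    """Decode the same bytes under each named codec (hypothetical helper)."""
    return {name: raw.decode(name) for name in codecs}

raw = bytes([214])  # the one-byte form of the record's O-umlaut

# ISO-8859-1 (Latin-1), ISO-8859-5 (Cyrillic), ISO-8859-7 (Greek)
results = decode_under(raw, ["iso8859_1", "iso8859_5", "iso8859_7"])

for codec, ch in results.items():
    # Print code point and Unicode name so the output stays plain ASCII.
    print(f"{codec}: U+{ord(ch):04X} {unicodedata.name(ch)}")
# iso8859_1: U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS
# iso8859_5: U+0436 CYRILLIC SMALL LETTER ZHE
# iso8859_7: U+03A6 GREEK CAPITAL LETTER PHI

# In UTF-8 the same character takes two bytes, so 214 alone is not it.
print("\u00d6".encode("utf-8"))  # b'\xc3\x96'
```

Same byte, three different letters, which is why the receiving side has to know which set the sender meant.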

I wrote some software several years ago that had to handle data in
various character encodings and convert between them.  I've managed to
retire many of the most-stressed braincells from that time, but ISO-10646,
ISO-8859, ISO-twothousandsomething, Unicode, some curveballs from the
American Library Association, and then display issues with customized
fonts are among the recollections that seep back.

One thing you may run into is that some of the encodings specify
swappable character sets in the "GL" and "GR" ("graphics left" and
"graphics right", aka lower and upper) halves of the 8-bit character
space, with an escape sequence to announce a change of code page.  A
sender that isn't sure you have the right page loaded may prefix a
character with that sequence, just to be sure you don't display the byte
as a Hebrew or Cyrillic character or somesuch.

On the other hand, if you mean the four characters are literally "\326",
then that's a Unix-convention octal representation of the decimal
character number 214 (3*64 + 2*8 + 6 = 214).  Something sitting between
the sender and you has escaped that character.
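
If the escaped form does turn out to be the literal four characters, undoing it is mechanical. The following is a rough Python sketch, not a drop-in fix: it assumes the records arrive as raw bytes, that the intended character set is ISO-8859-1, and that a backslash followed by three octal digits never occurs as legitimate data. The function name unescape_octal is made up for this example.

```python
import re

# A literal backslash followed by three octal digits, limited to 000-377
# so the value always fits in one byte.
OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")

def unescape_octal(data):
    """Replace each \\ooo escape with the single byte it names."""
    return OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), data)

escaped = rb"\326l"               # five bytes: backslash, 3, 2, 6, l
fixed = unescape_octal(escaped)

print(list(fixed))                # [214, 108]
# Assumption: the data is ISO-8859-1, where byte 214 is O-umlaut.
print(fixed.decode("iso8859_1") == "\u00d6l")  # True
```

Records that already carry the single byte 214 pass through unchanged, so both forms end up the same after this step.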

-- 
Anthony de Boer

