[Techtalk] umlauts

Tue Sep 2 21:15:50 EST 2003

On Tue, Sep 02, 2003 at 01:09:34PM -0400, Shirrell wrote:
> 
> 
> We were sent a file that contains umlauts.
> They have ascii values above 127.
> 
> A-umlaut is ascii 196 (octal 304).
> O-umlaut is ascii 214 (octal 326).
> U-umlaut is ascii 220 (octal 334), etc.
> 
> There seems to be little consistency in the way they are
> represented on our 3 different platforms: solaris 8,
> RedHat 8, and Windows XP .  
> 
> Questions:
> (1) Can you find such a character in VI, or using GREP ?
>     In RedHat vi the umlauts appear as the proper German
>     characters.  In Solaris vi they appear with a back slash
>     followed by the 3 octal numbers

Yes.  It depends on how the file is encoded.  RH handles this pretty
gracefully, because it uses Unicode.  RH vi probably converts the file
automatically to UTF-8, i.e. RH's preferred Unicode encoding (also a
very standard way of encoding text).  I don't know how you can type
those characters in vi, however, unless you have a keyboard that has
proper keys for all kinds of accents (but ask me about emacs... type
M-x list-input-methods).  Or search for the octal byte value with
grep.  Solaris probably doesn't understand a thing about Unicode, and
the file might be encoded in UTF-8, in which ü is a two-byte sequence
rather than a single byte.
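
As a rough illustration (a Python sketch, not anything from the
original setup), this shows what the umlauts look like as octal byte
values in Latin-1 versus UTF-8, plus a crude grep-like search for one
raw byte value.  The helper names byte_forms and lines_with_byte are
made up for this example.

```python
def byte_forms(ch):
    """Return (latin1, utf8) byte values of one character as octal strings."""
    latin1 = [format(b, "o") for b in ch.encode("latin-1")]
    utf8 = [format(b, "o") for b in ch.encode("utf-8")]
    return latin1, utf8


def lines_with_byte(data, value):
    """Return 1-based numbers of the lines in raw bytes `data` containing byte `value`."""
    lines = data.split(b"\n")
    # `value in line` tests for a single byte when `value` is an int.
    return [n for n, line in enumerate(lines, start=1) if value in line]


if __name__ == "__main__":
    for ch in "ÄÖÜ":
        latin1, utf8 = byte_forms(ch)
        print(ch, "Latin-1:", latin1, "UTF-8:", utf8)

    # Hypothetical sample data: three lines, the second holds a Latin-1 Ä.
    sample = b"plain line\nsome \xc4 text\nanother line"
    print(lines_with_byte(sample, 0o304))
```

If the file is Latin-1, looking for byte 304 (octal) finds Ä directly;
if it is UTF-8, the same character is the two bytes 303 204, so a
single-byte search will not match what you expect.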

> (2) Do the Fortran CHAR and ICHAR functions object to
>     values over 127 ?  This seems to be safe.  We have not
>     run this on RedHat as the program is written in Fortran 90 and
>     all I have is f77 on my linux machine.

Fortran 90 is very likely 8-bit safe.

> Are there any accepted conventions for these characters?

There are many.  Your file is probably (just a guess) encoded in
Latin-1 (ISO 8859-1), which is what the byte values you list suggest;
this is the accepted Western European encoding, but it is being
replaced by Latin-9 (ISO 8859-15), which adds the € (euro) sign (öäü
are encoded in the same way in both).  It could also be UTF-8, which
is better in a way, because UTF-8 can encode almost any script in the
same file.
RH started using UTF-8, which differs from all of the Latin
encodings, which are single-byte (8-bit); at first this was a bit of
a headache for me, but now I actually enjoy it because it means I can
write Cyrillic and Greek in the same document pretty easily.  UTF-8 is
a variable-length encoding: some characters are stored as single
bytes, others as sequences of 2 to 4 bytes.  Use libraries to handle
them.  See man iconv for some help on converting the text to other
encodings, and www.unicode.org for lengthier explanations.  This
mail, BTW, is encoded in UTF-8.
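
To make the "use libraries" advice concrete, here is a small Python
sketch under stated assumptions: it tries UTF-8 first and falls back
to Latin-1, then re-encodes as UTF-8, which is roughly what
"iconv -f ISO-8859-1 -t UTF-8" does for a Latin-1 file.  The function
names guess_decode and latin1_to_utf8 are invented for illustration,
and the fallback is only a heuristic, since any byte sequence at all
decodes as Latin-1 without error.

```python
def guess_decode(data):
    """Decode raw bytes as UTF-8 if they are valid UTF-8, else as Latin-1.

    Returns (text, encoding_name).  This is a heuristic, not a detector:
    Latin-1 never fails, so the fallback cannot tell Latin-1 from Latin-9.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"


def latin1_to_utf8(data):
    """Convert Latin-1 encoded bytes to UTF-8 encoded bytes."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    raw = b"J\xe4rvinen"  # Latin-1 bytes: 0xE4 is ä
    text, encoding = guess_decode(raw)
    print(text, encoding)

    # Variable length in UTF-8: ASCII is 1 byte, ü is 2 bytes, € is 3 bytes.
    for ch in "uü€":
        print(ch, len(ch.encode("utf-8")))
```

Decoding to text first and encoding again on output keeps the byte
counting inside the library, so the program itself never has to know
how many bytes a given character takes.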

HTH,
Eeva Järvinen

-- 
...women are not obedient, chaste, scented, and exquisitely apparelled by
nature.  They can only attain these graces, without which they may enjoy 
none of the delights of life, by the most tedious discipline.

                                                  V. Woolf, Orlando