[prog] Perl and Japanese filenames

Almut Behrens almut-behrens at gmx.net
Sat Oct 21 12:41:51 UTC 2006


Hi All,

I've been put in charge of migrating a Japanese customer site with many
Windows machines from an ancient version of Perl to something current
(not that I know much about Windows or Japanese, but being the
"perl girl" at the company where I work, such things also end up on my
desk... Well, I'm learning ;)

Anyway, at the moment they're using "jperl", which is a special
japanised Perl, based on version 5.003 (!). More precisely, this is a
patch which makes all the necessary modifications directly to the Perl
sources, such that the resulting perl binary can handle the legacy
encodings SJIS (roughly equivalent to Microsoft's CP932) or EUC-JP.

This patch was developed at a time when Perl didn't yet have any
Unicode functionality.  As Perl 5.8.x now comes with comprehensive
Unicode support, IO filters and stuff, this patch is no longer
maintained and, of course, doesn't apply to any recent version of
Perl.

Well, over the years, lots of little jperl-specific scripts have
accumulated at the site (several hundreds, the admins say...), so
ideally, they would not have to touch any of those, but rather just
roll out the new version of Perl (plus some compatibility module), and
everything should work as before.  At least, that's the plan.

As all the old scripts contain the statement "use I18N::Japanese;"
(that's how the specific jperl functionality is enabled in the binary),
I thought this Japanese.pm would be the ideal place to put my
compatibility code... (I18N::Japanese is not otherwise used in a
standard, recent version of Perl.)
Essentially, it involves saying "use encoding 'cp932';" [1] (the old
scripts are written in MS CP932), to make Perl parse any literal
strings, regexes, etc. in the script correctly and convert them to
Perl's internal Unicode format.  So far, so good.  Thing is, they
have code like this [2]:

system('mkdir "C:\Documents and Settings\All Users\スタート メニュー\プログラム\代表"');

This doesn't work, because the pathname being passed to system() is now
in Perl's internal Unicode format, instead of the CP932 that the
Windows side expects.  I'm not sure how to handle this properly.
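
To make the mismatch concrete, here is a minimal sketch (not taken from the
customer scripts) that prints the bytes of the last path component of the
example, once encoded as CP932 and once as UTF-8.  encode() comes from the
core Encode module; the helper name hex_bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Last path component of the example, as a Perl character string
# (the code points U+4EE3 U+8868).
my $name = "\x{4EE3}\x{8868}";

# hex_bytes() is a hypothetical helper, for illustration only:
# it renders each byte of a string as two hex digits.
sub hex_bytes {
    return join " ", map { sprintf("%02X", ord($_)) } split //, $_[0];
}

my $cp932 = encode("cp932", $name);   # what the Windows side expects
my $utf8  = encode("UTF-8", $name);   # a different byte sequence

print "characters:  ", length($name), "\n";
print "CP932 bytes: ", hex_bytes($cp932), "\n";
print "UTF-8 bytes: ", hex_bytes($utf8), "\n";
```

The CP932 line should show 91 E3 95 5C, i.e. the last four bytes of the
hex variant given in [2].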

What I've come up with so far is the following workaround:

use Encode "encode";

*CORE::GLOBAL::system = sub {
    # explicitly convert from Perl's internal unicode format
    # into legacy CP932 encoding
    my @args = map encode("cp932", $_), @_;
    CORE::system(@args);  # call original internal routine
};

This overrides/wraps Perl's internal system() function, in order to do
the required conversion of the arguments explicitly.  Although this
does work, essentially, I can't help thinking this is more cumbersome
than things typically need to be in Perl.  In particular, I would
have to write similar wrappers for all other functions that take a
filename argument (mkdir(), chdir(), open(), opendir(), rename(),
unlink(), glob() and friends...).  This can't be it!?
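
For illustration, the "similar wrappers" could be sketched roughly as
follows.  This is only a sketch under assumptions: it covers a handful of
the builtins, it assumes the overrides get installed before the old scripts
are compiled (e.g. from within the compatibility module), and to_cp932() is
a made-up helper name.  open(), opendir() and glob() are not attempted here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# to_cp932() is a hypothetical helper: character strings in, CP932 bytes out.
sub to_cp932 { return map { encode("cp932", $_) } @_ }

BEGIN {
    # One wrapper per builtin, each passing an explicit argument list to
    # the original, since e.g. CORE::mkdir(@args) would put @args in
    # scalar context.
    my %wrapper = (
        mkdir  => sub {
            my ($dir) = to_cp932($_[0]);
            return @_ > 1 ? CORE::mkdir($dir, $_[1]) : CORE::mkdir($dir);
        },
        chdir  => sub { my ($dir) = to_cp932($_[0]); return CORE::chdir($dir) },
        rmdir  => sub { my ($dir) = to_cp932($_[0]); return CORE::rmdir($dir) },
        rename => sub {
            my ($old, $new) = to_cp932(@_[0, 1]);
            return CORE::rename($old, $new);
        },
        unlink => sub { return CORE::unlink(to_cp932(@_)) },
    );

    # Install the wrappers globally, the same way as for system() above.
    no strict 'refs';
    *{"CORE::GLOBAL::$_"} = $wrapper{$_} for keys %wrapper;
}
```

Even this short list needs per-function argument handling, which is
exactly the kind of overhead described above.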

So, I'm wondering if I'm missing that magic incantation which would
somehow convert all filenames to the desired target encoding when
passing them to the respective system functions...

Any ideas?

Thanks,
Almut


[1] actually, I can't say "use encoding 'cp932'" in this case, because
in very recent versions of Perl this statement seems to be lexically
scoped (contrary to what's documented).  So, putting it in a
module wouldn't have any effect on the code that's "use"ing that module.
IOW, I have to write "require encoding; encoding->import('cp932');",
which is functionally equivalent, but without the implicit BEGIN{} block.

[2] as I'm not sure if these SJIS multi-byte/8-bit characters will
survive mail encoding/transport, just in case, here's the same string
in alternative encodings:

# CP932/SJIS in hex
system("mkdir \"C:\\Documents and Settings\\All Users\\\x83\x58\x83\x5E\x81\x5B\x83\x67 \x83\x81\x83\x6A\x83\x85\x81\x5B\\\x83\x76\x83\x8D\x83\x4F\x83\x89\x83\x80\\\x91\xE3\x95\x5C\"");
# as unicode codepoints
system("mkdir \"C:\\Documents and Settings\\All Users\\\x{30B9}\x{30BF}\x{30FC}\x{30C8} \x{30E1}\x{30CB}\x{30E5}\x{30FC}\\\x{30D7}\x{30ED}\x{30B0}\x{30E9}\x{30E0}\\\x{4EE3}\x{8868}\"");
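
As a quick sanity check that the two variants above describe the same
name, the Japanese part of the path can be decoded from CP932 and compared
with the code point version.  A minimal sketch using decode() from the
core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The Japanese path components from above, once as CP932 bytes ...
my $sjis = "\x83\x58\x83\x5E\x81\x5B\x83\x67 \x83\x81\x83\x6A\x83\x85\x81\x5B"
         . "\\\x83\x76\x83\x8D\x83\x4F\x83\x89\x83\x80"
         . "\\\x91\xE3\x95\x5C";

# ... and once as Unicode code points.
my $uni  = "\x{30B9}\x{30BF}\x{30FC}\x{30C8} \x{30E1}\x{30CB}\x{30E5}\x{30FC}"
         . "\\\x{30D7}\x{30ED}\x{30B0}\x{30E9}\x{30E0}"
         . "\\\x{4EE3}\x{8868}";

# Decoding the bytes as CP932 should give the same character string.
print decode("cp932", $sjis) eq $uni ? "same\n" : "different\n";
```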


