Does Perl's \w not match accented characters in UTF-8 locales, or am I doing something wrong?
I'm running AMD64 Ubuntu Hardy, with locales 2.7.9-4. The locales on my system are the default English ones, plus
en_CA, fr_CA, and fr_CA.UTF-8. (added to /var/lib/
I'm trying to write a Perl script that does some regex matching on UTF-8 (English and French) text. I'm finding that \w doesn't match accented vowels, even with LANG=fr_CA.utf8. (I can understand accented vowels not counting as English word characters, but they are definitely valid French word characters.) The fr_CA locale's LC_CTYPE definition seems to work, though, and somehow this hack (the commented-out setlocale call in the script below) actually works with UTF-8 text.
#!/usr/bin/perl -wCDS
# -CD = utf8 default for input/output. -CS = utf8 stdin/out/err
use locale;
use utf8;
use POSIX qw(locale_h);
#setlocale(LC_ALL, "fr_CA.ISO8859-1");
# we want accented chars to count as part of words,
# but Ubuntu's fr_CA.utf8 locale doesn't include accented characters as words.
my $eaccent = "é\n";
$eaccent =~ s/\w//;
print $eaccent;
print "word characters are: ", +(sort grep /\w/, map { chr } 0..255), "\n";
With setlocale commented out, the script outputs:
é (so the s/// didn't match it)
word characters are: _0123456789AaBb
With setlocale uncommented:
word characters are: µ_0123456789AaÁ
With 0..65535 instead of 0..255, you get a whole lot of word characters either way, but I think é is still not part of it.
I'm not familiar enough with how locales are supposed to work to report a bug right away, but it certainly seems to me that something's weird, either in the locale files or in perl. (I'm surprised setlocale works, too. Maybe I'm just lucky that EN and FR both fit in ISO 8859-1.)
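One way to narrow down whether the locale files or perl is responsible might be to run the same \w test with and without "use locale" in scope. This is only a minimal sketch, not a diagnosis: the helper names matches_w_plain and matches_w_locale are made up for illustration, and it assumes the script is saved as UTF-8 so that "use utf8" makes the literals character strings. The "locale" column depends on the perl version and on the LANG/LC_* settings it is run under, so it can be compared across LANG=fr_CA.utf8 and LANG=fr_CA.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 text
binmode(STDOUT, ':encoding(UTF-8)');   # print characters as UTF-8

# Hypothetical helper: \w with Perl's own rules (no "use locale" in scope).
sub matches_w_plain {
    my ($str) = @_;
    return $str =~ /\w/ ? 1 : 0;
}

{
    use locale;   # lexically scoped: affects only code compiled in this block

    # Hypothetical helper: the same test, but under the current locale's LC_CTYPE.
    sub matches_w_locale {
        my ($str) = @_;
        return $str =~ /\w/ ? 1 : 0;
    }
}

# Print one line per test character: 1 means \w matched, 0 means it did not.
for my $char ("é", "É", "à", "e") {
    printf "%s  plain: %d  locale: %d\n",
        $char, matches_w_plain($char), matches_w_locale($char);
}
```

If the two columns differ for é, that would suggest the difference comes from how perl applies the locale rather than from \w itself.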
thanks.
Question information
- Language: English
- Status: Solved
- Assignee: No assignee
- Solved by: Jonathan Marsden