When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What you expect \d, \s, or \w to match is more expansive now (see Know your character classes under different semantics). Most of us probably didn’t notice because the range of our inputs is limited.
To get the ASCII semantics back, you can use v5.14’s /a match flag to restore their pre-v5.8 meanings.
If you look for \d in later Perls, for example, you get a long list:
use charnames qw(:full);

foreach ( 0 .. 0x10_ffff ) {
    next unless chr =~ /\d/;
    printf qq(0x%02X --> %s\n), $_, charnames::viacode($_);
}
The number of matches you get depends on the version of Unicode included with that Perl:
0x30 --> DIGIT ZERO
0x31 --> DIGIT ONE
0x32 --> DIGIT TWO
0x33 --> DIGIT THREE
0x34 --> DIGIT FOUR
0x35 --> DIGIT FIVE
0x36 --> DIGIT SIX
0x37 --> DIGIT SEVEN
0x38 --> DIGIT EIGHT
0x39 --> DIGIT NINE
0x660 --> ARABIC-INDIC DIGIT ZERO
0x661 --> ARABIC-INDIC DIGIT ONE
0x662 --> ARABIC-INDIC DIGIT TWO
...
0x1D7FD --> MATHEMATICAL MONOSPACE DIGIT SEVEN
0x1D7FE --> MATHEMATICAL MONOSPACE DIGIT EIGHT
0x1D7FF --> MATHEMATICAL MONOSPACE DIGIT NINE
You can change this with the /a flag:
use charnames qw(:full);

foreach ( 0 .. 0x10_ffff ) {
    next unless chr =~ /\d/a;
    printf qq(0x%02X --> %s\n), $_, charnames::viacode($_);
}
That makes \d match only 0 to 9:
0x30 --> DIGIT ZERO
0x31 --> DIGIT ONE
0x32 --> DIGIT TWO
0x33 --> DIGIT THREE
0x34 --> DIGIT FOUR
0x35 --> DIGIT FIVE
0x36 --> DIGIT SIX
0x37 --> DIGIT SEVEN
0x38 --> DIGIT EIGHT
0x39 --> DIGIT NINE
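Where this tends to matter in practice is input validation. Here is a minimal sketch of the difference, assuming you want to accept only the ASCII digits; the subroutine names is_digits_unicode and is_digits_ascii are made up for this illustration:

```perl
use v5.14;

# Hypothetical helpers for illustration; the names are not from any module.
# Accepts any run of characters that Perl considers a digit.
sub is_digits_unicode { $_[0] =~ /\A\d+\z/  ? 1 : 0 }

# Accepts only the ASCII digits 0 to 9 because of the /a flag.
sub is_digits_ascii   { $_[0] =~ /\A\d+\z/a ? 1 : 0 }

# ARABIC-INDIC DIGIT ONE, TWO, THREE, created with escapes
# to avoid font and encoding problems
my $arabic_indic = "\x{0661}\x{0662}\x{0663}";

foreach my $input ( '123', $arabic_indic ) {
    printf "unicode: %d  ascii: %d\n",
        is_digits_unicode( $input ),
        is_digits_ascii( $input );
}
```

The plain ASCII string should pass both checks, while the Arabic-Indic digits should pass only the one without /a.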
The same goes for whitespace (\s) and word characters (\w).
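You can check that for yourself with a couple of characters from outside ASCII. This is only a sketch: class_report is a hypothetical helper, and the two sample characters (a Greek letter and a wide space) are my choice for illustration:

```perl
use v5.14;

# Hypothetical helper: report which classes a single character matches,
# with and without the /a flag.
sub class_report {
    my( $char ) = @_;
    return {
        word        => ( $char =~ /\w/  ? 1 : 0 ),
        word_ascii  => ( $char =~ /\w/a ? 1 : 0 ),
        space       => ( $char =~ /\s/  ? 1 : 0 ),
        space_ascii => ( $char =~ /\s/a ? 1 : 0 ),
    };
}

my %samples = (
    'GREEK SMALL LETTER ALPHA' => "\x{03B1}",
    'EM SPACE'                 => "\x{2003}",
);

foreach my $name ( sort keys %samples ) {
    my $report = class_report( $samples{$name} );
    printf "%-25s \\w: %d  \\w/a: %d  \\s: %d  \\s/a: %d\n",
        $name, @{$report}{ qw(word word_ascii space space_ascii) };
}
```

The alpha should count as a word character and the em space as whitespace only when /a is absent.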
That’s not all, though. There’s a problem with case-insensitive matches. Some of the “wide” characters lowercase into the ASCII range. Well, so far there’s exactly one, but that doesn’t mean there won’t be more later. The Kelvin sign, K (U+212A), lowercases to k (U+006B). To avoid problems with fonts, I create the characters with the \x{} sequences:
use v5.16;

my $lower_k = "\x{006B}";

if( $lower_k =~ /\x{212A}/ ) {
    say "Matches without case insensitivity";
}

if( $lower_k =~ /\x{212A}/i ) {
    say "Matches with case insensitivity";
}

if( $lower_k =~ /\x{212A}/ia ) {
    say "Matches with case insensitivity, /a";
}

if( $lower_k =~ /\x{212A}/iaa ) {
    say "Matches with case insensitivity, /aa";
}
The Kelvin sign matches with case insensitivity, even with the /a flag. However, it doesn’t match when you double up with /aa. That extra /a means, “no really, I mean ASCII”.
Matches with case insensitivity
Matches with case insensitivity, /a
If you’re upgrading a huge codebase and want this behavior on all the regexes in a lexical scope, including a whole file (see Know what creates a scope), you can Set default regular expression modifiers.
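As a rough sketch of what that looks like, the re pragma in v5.14 and later accepts a flag string that applies to every pattern compiled in the rest of the enclosing scope. The subroutine names here are hypothetical, and the bare block stands in for whatever scope (or file) you want to cover:

```perl
use v5.14;

# Compiled outside the block: the default semantics apply.
sub kelvin_matches_default {
    my( $string ) = @_;
    return $string =~ /\x{212A}/i ? 1 : 0;
}

{
    # Every regex compiled from here to the end of the block gets /aa.
    use re '/aa';

    sub kelvin_matches_ascii {
        my( $string ) = @_;
        return $string =~ /\x{212A}/i ? 1 : 0;
    }

    sub is_ascii_digits {
        my( $string ) = @_;
        return $string =~ /\A\d+\z/ ? 1 : 0;
    }
}

say "default: ", kelvin_matches_default( 'k' );
say "with use re '/aa': ", kelvin_matches_ascii( 'k' );
say "Arabic-Indic digits with use re '/aa': ", is_ascii_digits( "\x{0661}" );
```

None of the patterns inside the block carries an explicit /a or /aa, so the pragma is doing the work. Putting the use re line at the top of a file, outside any block, should cover every pattern in that file in the same way.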
Read perlre for the rest of the story.