Now that Perl is Unicode-aware (and it’s been that way for a long time, even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing. No matter which way you treat your strings, you get the same answer. With Unicode, however, Perl might now treat certain sequences of bytes as one character. The character and byte semantics have diverged. If you let Perl treat your data as character data when it really isn’t, you can run into problems. If you aren’t already doing something special, you’re probably using character semantics.
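Here’s a minimal sketch of that divergence. It assumes the core Encode module and uses two octets that happen to be the UTF-8 encoding of U+00E9; the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets: the UTF-8 encoding of U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
my $octets = "\xC3\xA9";

# Decoding turns those two octets into a one-character string.
my $chars = decode( 'UTF-8', $octets );

printf "octets: length %d\n", length $octets;    # 2
printf "chars:  length %d\n", length $chars;     # 1
```

Same data, two different answers, depending on whether Perl sees octets or characters.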
Whitespace
How many whitespace characters can you name? Think about that for a moment; don’t cheat by scanning ahead. If I was clever I’d have some sort of Javascript thing that would make you wait, or at least take 15 seconds to subvert, but all I have is this sentence.
Ready? How many did you get? Most people can name at least three ASCII whitespace characters. Some people can name four:
- space (0x20)
- carriage return (0x0D)
- newline (0x0A)
- horizontal tab (0x09)
There are a few others, though. If you’re an olde tyme Unix geek, you might also get the two lesser-known ones:
- vertical tab (0x0B)
- form feed (0x0C)
How can you get all of the ASCII whitespace if you didn’t already know what it was? You would think that you could just whip up a quick one-liner to do the trick:
$ perl5.10.1 -le 'for(0..127){next if chr =~ /\S/; printf qq(0x%02X\n), $_}'
0x09
0x0A
0x0C
0x0D
0x20
That’s only five of them, though. Perl’s \s character class apparently doesn’t match the vertical tab. It’s not exactly documented that way in perlre, which just says that \s matches whitespace. However, you can surmise that it’s different from the POSIX definition of whitespace because perlre documents the POSIX character class [[:space:]] as \s along with the vertical tab. Adjusting your one-liner to use the POSIX definition instead (and a negated binding operator) gets you that vertical tab (0x0B):
$ perl -le 'for(0..127){next if chr !~ /[[:space:]]/; printf qq(0x%02X\n), $_}'
0x09
0x0A
0x0B
0x0C
0x0D
0x20
It doesn’t stop there. You didn’t think that you’d get all those fancy characters in Unicode without a bunch of fancy whitespace to go with them, did you? First, adjust your one-liner to use a Unicode property (Item 76. Match Unicode characters and properties) instead of a character class:
$ perl -le 'for(0..127){next if chr !~ /\p{Space}/; printf qq(0x%02X\n), $_}'
0x09
0x0A
0x0B
0x0C
0x0D
0x20
You want to get more fancy, though, so you can abandon the one-liner. You want to get the character names too (Item 74. Specify Unicode characters by code point or name). Pull in the charnames module to turn the code point into the name:
use 5.010;
use charnames qw(:full);

foreach ( 0 .. 127 ) {
    next unless chr =~ /\p{Space}/;
    printf qq(0x%02X --> %s\n), $_, charnames::viacode($_);
}
Now you know what those numbers represent:
0x09 --> CHARACTER TABULATION
0x0A --> LINE FEED (LF)
0x0B --> LINE TABULATION
0x0C --> FORM FEED (FF)
0x0D --> CARRIAGE RETURN (CR)
0x20 --> SPACE
Okay, so how much whitespace can you find if you raise the upper bound of the loop to 0x10FFFF (the last Unicode “character” code point)?
0x09 --> CHARACTER TABULATION
0x0A --> LINE FEED (LF)
0x0B --> LINE TABULATION
0x0C --> FORM FEED (FF)
0x0D --> CARRIAGE RETURN (CR)
0x20 --> SPACE
0x85 --> NEXT LINE (NEL)
0xA0 --> NO-BREAK SPACE
0x1680 --> OGHAM SPACE MARK
0x180E --> MONGOLIAN VOWEL SEPARATOR
0x2000 --> EN QUAD
0x2001 --> EM QUAD
0x2002 --> EN SPACE
0x2003 --> EM SPACE
0x2004 --> THREE-PER-EM SPACE
0x2005 --> FOUR-PER-EM SPACE
0x2006 --> SIX-PER-EM SPACE
0x2007 --> FIGURE SPACE
0x2008 --> PUNCTUATION SPACE
0x2009 --> THIN SPACE
0x200A --> HAIR SPACE
0x2028 --> LINE SEPARATOR
0x2029 --> PARAGRAPH SEPARATOR
0x202F --> NARROW NO-BREAK SPACE
0x205F --> MEDIUM MATHEMATICAL SPACE
0x3000 --> IDEOGRAPHIC SPACE
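The program that produces a listing like that isn’t shown here, so this is only a sketch of one way to write it: the earlier loop with its upper bound raised to 0x10FFFF. The @spaces array and the final count line are my own additions, and the exact list depends on the Unicode version that ships with your perl (newer Unicode versions, for example, no longer treat U+180E as whitespace):

```perl
use 5.010;
use strict;
use warnings;
no warnings 'utf8';    # the range includes surrogates and noncharacters
use charnames qw(:full);

# Collect every code point whose character matches \p{Space}.
my @spaces = grep { chr($_) =~ /\p{Space}/ } 0 .. 0x10FFFF;

foreach my $ord (@spaces) {
    # viacode returns undef for unnamed code points, so supply a fallback.
    my $name = charnames::viacode($ord) // '(no name)';
    printf qq(0x%02X --> %s\n), $ord, $name;
}

printf qq(%d code points match\n), scalar @spaces;
```

It checks more than a million code points, so give it a second or two.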
Did you want to potentially match those extra characters? Did you even know they existed? And that’s just using the Unicode property \p{Space}. What happens if you use \s instead? It turns out that you get almost exactly the same thing with one difference: \s still doesn’t match the vertical tab.
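That result comes from a Perl 5.10 era perl. The perl5180delta notes say that \s started matching the vertical tab in Perl 5.18, so it’s worth checking what your own perl does. A small sketch, with variable names of my own choosing:

```perl
use strict;
use warnings;

my $vt = "\x0B";    # vertical tab (LINE TABULATION)

# Record whether each definition of whitespace matches the vertical tab.
my $s_matches     = $vt =~ /\s/          ? 1 : 0;
my $posix_matches = $vt =~ /[[:space:]]/ ? 1 : 0;
my $prop_matches  = $vt =~ /\p{Space}/   ? 1 : 0;

printf "perl %vd: \\s=%d [[:space:]]=%d \\p{Space}=%d\n",
    $^V, $s_matches, $posix_matches, $prop_matches;
```

If your code has to run on both sides of that change, don’t rely on \s either way for the vertical tab; name the characters you want explicitly.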
Perl 5.10 added the \h and \v character classes for horizontal and vertical whitespace. How do those hold up? Make a table of all the sorts of whitespace and how they match:
use 5.010;
use charnames qw(:full);

print <<"LEGEND";
s matches \\s, matches Perl whitespace
h matches \\h, horizontal whitespace
v matches \\v, vertical whitespace
p matches [[:space:]], POSIX whitespace
all characters match Unicode whitespace, \\p{Space}
LEGEND

printf qq(%s %s %s %s %-7s --> %s\n), qw( s h v p Ordinal Name );
print '-' x 50, "\n";

foreach my $ord ( 0 .. 0x10ffff ) {
    next unless chr($ord) =~ /\p{Space}/;
    my( $s, $h, $v, $posix ) =
        map { chr($ord) =~ m/$_/ ? 'x' : ' ' }
        ( qr/\s/, qr/\h/, qr/\v/, qr/[[:space:]]/ );
    printf qq(%s %s %s %s 0x%04X --> %s\n),
        $s, $h, $v, $posix, $ord, charnames::viacode($ord);
}
The output shows that there are several different definitions of whitespace:
s matches \s, matches Perl whitespace
h matches \h, horizontal whitespace
v matches \v, vertical whitespace
p matches [[:space:]], POSIX whitespace
all characters match Unicode whitespace, \p{Space}
s h v p Ordinal --> Name
--------------------------------------------------
x x   x 0x0009 --> CHARACTER TABULATION
x   x x 0x000A --> LINE FEED (LF)
    x x 0x000B --> LINE TABULATION
x   x x 0x000C --> FORM FEED (FF)
x   x x 0x000D --> CARRIAGE RETURN (CR)
x x   x 0x0020 --> SPACE
    x   0x0085 --> NEXT LINE (NEL)
  x     0x00A0 --> NO-BREAK SPACE
x x   x 0x1680 --> OGHAM SPACE MARK
x x   x 0x180E --> MONGOLIAN VOWEL SEPARATOR
x x   x 0x2000 --> EN QUAD
x x   x 0x2001 --> EM QUAD
x x   x 0x2002 --> EN SPACE
x x   x 0x2003 --> EM SPACE
x x   x 0x2004 --> THREE-PER-EM SPACE
x x   x 0x2005 --> FOUR-PER-EM SPACE
x x   x 0x2006 --> SIX-PER-EM SPACE
x x   x 0x2007 --> FIGURE SPACE
x x   x 0x2008 --> PUNCTUATION SPACE
x x   x 0x2009 --> THIN SPACE
x x   x 0x200A --> HAIR SPACE
x   x x 0x2028 --> LINE SEPARATOR
x   x x 0x2029 --> PARAGRAPH SEPARATOR
x x   x 0x202F --> NARROW NO-BREAK SPACE
x x   x 0x205F --> MEDIUM MATHEMATICAL SPACE
x x   x 0x3000 --> IDEOGRAPHIC SPACE
Digits
It’s not just whitespace, either. What about the digits? Most people expect \d to match only the characters in the set (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). Try it:
use charnames qw(:full);

binmode STDOUT, ':utf8';

foreach ( 0 .. 0x10FFFF ) {
    next unless chr =~ /\d/;
    printf qq(0x%04X %s --> %s\n), $_, chr, charnames::viacode($_);
}
You get hundreds of lines of output:
0x0030 0 --> DIGIT ZERO
0x0031 1 --> DIGIT ONE
0x0032 2 --> DIGIT TWO
0x0033 3 --> DIGIT THREE
0x0034 4 --> DIGIT FOUR
0x0035 5 --> DIGIT FIVE
0x0036 6 --> DIGIT SIX
0x0037 7 --> DIGIT SEVEN
0x0038 8 --> DIGIT EIGHT
0x0039 9 --> DIGIT NINE
0x0660 ٠ --> ARABIC-INDIC DIGIT ZERO
0x0661 ١ --> ARABIC-INDIC DIGIT ONE
0x0662 ٢ --> ARABIC-INDIC DIGIT TWO
0x0663 ٣ --> ARABIC-INDIC DIGIT THREE
...
If you only wanted the Arabic numerals (which aren’t the ones with “ARABIC” in their name), you can’t rely on \d.
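If you really do want only the ten ASCII digits, one option is to spell out the class as [0-9]. Another, assuming you can require Perl 5.14 or later, is the /a modifier, which restricts \d, \s, and \w to their ASCII meanings. A short sketch comparing the three on one ASCII digit and one Arabic-Indic digit (the variable names are just for illustration):

```perl
use v5.14;    # the /a regex modifier arrived in Perl 5.14
use warnings;

my $arabic_indic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE
my $ascii_three        = "3";

for my $char ( $arabic_indic_three, $ascii_three ) {
    printf "U+%04X: \\d %s, [0-9] %s, \\d with /a %s\n",
        ord($char),
        ( $char =~ /\d/    ? 'yes' : 'no' ),    # Unicode digits
        ( $char =~ /[0-9]/ ? 'yes' : 'no' ),    # explicit ASCII class
        ( $char =~ /\d/a   ? 'yes' : 'no' );    # \d restricted to ASCII
}
```

The explicit class works on any perl; the /a modifier keeps the shortcut but won’t compile on anything older than 5.14.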
Word characters
From the early days of Perl, you’ve been told that \w is the set of characters that you can legally use to name Perl variables (that is, “identifier characters”). Before Perl’s Unicode awareness, that was the rather limited set of [A-Za-z0-9_]. The perlre documentation describes it as “alphanumerics plus underscore”, but it doesn’t define the set of alphanumerics. In Item 72. Use Unicode in your source code, you saw how to use many of the Unicode characters as variable names:
my $π = 3.14159265;
Those are legal identifier characters, so \w is going to match them. If you were expecting something else, you might be in for a surprise.
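You can check that surprise directly. This sketch writes the π as an escape so the source file stays ASCII; the two flag variables are names I made up for the demonstration:

```perl
use strict;
use warnings;

my $name = "\x{03C0}";    # GREEK SMALL LETTER PI

# \w uses the Unicode definition of a word character for this string.
my $is_word = $name =~ /\A\w+\z/ ? 1 : 0;

# The old ASCII-only set, spelled out explicitly.
my $is_ascii_word = $name =~ /\A[A-Za-z0-9_]+\z/ ? 1 : 0;

printf "\\w: %d, [A-Za-z0-9_]: %d\n", $is_word, $is_ascii_word;
# prints: \w: 1, [A-Za-z0-9_]: 0
```

If you’re validating something that must be ASCII, such as a username or a hostname label, the explicit class says what you mean.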
A possible fix
Besides making more specific character classes without using the character class shortcuts, how can you avoid this? The problem is all the Unicode nonsense and that Perl is handling the strings as Unicode strings. Another way to say that is that Perl normally uses character semantics. If Perl treats your string as octets instead, you’re back to ASCII semantics for \d, \w, and \s. The bytes pragma is lexically scoped, so it only affects the code inside its enclosing block:
foreach ( 0 .. 0xFF ) {
    use bytes;    # byte semantics, but only inside this block
    next unless chr =~ /\w/;
    printf qq(0x%04X %s\n), $_, chr;
}
If you’re playing with binary data, tell Perl that you’re playing with binary data.
I enjoy your writings. A mention of perlrecharclass might be good to add to this one.
About word characters:
$ perl -Mutf8 -E'use constant π => atan2( 0,-1 );print π,"\n";'
Wide character in print at -e line 1.
π
$ perl -Mutf8 -E'sub π (){ atan2( 0,-1 )};print π,"\n";'
3.14159265358979
$
I wonder why.
I ran into this problem a couple of weeks ago, but I forget what I was doing. Where I thought I had a bareword filehandle, it assumed the bareword as a string of some sort. I don’t know why that happens.