Perl v5.18 added experimental character class set operations, a requirement for full Unicode support according to Unicode Technical Standard #18, which specifies what a compliant language must support and divides those requirements into three levels.
The perlunicode documentation lists each requirement and its status in Perl. Besides some regular expression anchors handling all forms of line boundaries (which might break older programs), set subtraction and intersection in character classes were the last features Perl needed to be Level 1 compliant.
Perl calls this experimental feature “Extended Bracketed Character Classes” in perlrecharclass. Inside the (?[ ]), a regular expression does character class set operations. Inside the brackets, whitespace is insignificant (as if /x is on). Here’s a simple example to find the character z:
    use v5.18;
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ [z] ])/;

    while( <DATA> ) {
        chomp;
        say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
    }

    __DATA__
    This is a line
    This is the next line
    And here's another line
None of the input lines have a letter z, so nothing matches:
    [This is a line] Missed
    [This is the next line] Missed
    [And here's another line] Missed
To add more characters to the set, you would traditionally (and still can) put them inside the same set of brackets. To also find an x, you add it next to the z:
    use v5.18;
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ [xz] ])/;

    while( <DATA> ) {
        chomp;
        say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
    }

    __DATA__
    This is a line
    This is the next line
    And here's another line
And now the middle input line matches:
    [This is a line] Missed
    [This is the next line] Matched
    [And here's another line] Missed
You can also do this with set math. Since you want either of those characters to match, you take a union. Inside the (?[ ]), a + is the union operator (the | is also the union operator). Almost everything inside (?[ ]) is a metacharacter, which is why you had to have another set of brackets around the literal characters in the previous example:
    use v5.18;
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ [x] + [z] ])/;

    while( <DATA> ) {
        chomp;
        say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
    }

    __DATA__
    This is a line
    This is the next line
    And here's another line
The output is the same as before because it’s the same character class:
    [This is a line] Missed
    [This is the next line] Matched
    [And here's another line] Missed
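The claim that + and | mean the same thing can be checked directly. This is a minimal sketch, not part of the original program: the variable names ($plus, $pipe, and the two *_set strings) are made up for illustration, and the scan is limited to the ASCII range to keep it short.

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

# Two spellings of the same union: + and | are interchangeable inside (?[ ]).
my $plus = qr/(?[ [x] + [z] ])/;
my $pipe = qr/(?[ [x] | [z] ])/;

# Collect the ASCII characters that each pattern accepts.
my $plus_set = join '', grep { /$plus/ } map { chr($_) } 0 .. 127;
my $pipe_set = join '', grep { /$pipe/ } map { chr($_) } 0 .. 127;

say "plus: $plus_set";   # plus: xz
say "pipe: $pipe_set";   # pipe: xz
```

Both patterns accept exactly x and z, so which operator you use is a matter of taste.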
You can also do intersections with the &. In this example, each of the two character classes has a character that appears in every input line (s in the first class, e in the second), but the classes have only one character in common:
    use v5.18;
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ [sxy] & [exw] ])/;

    while( <DATA> ) {
        chomp;
        say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
    }

    __DATA__
    This is a line
    This is the next line
    And here's another line
Their intersection is only x, so only that character matches and you get the same output, again:
    [This is a line] Missed
    [This is the next line] Matched
    [And here's another line] Missed
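To see what the & operator saves you from, here is a sketch of one way to emulate the same intersection without the extended syntax: a lookahead that constrains the character before a regular bracketed class consumes it. The names $set_op, $lookahead, @lines, and @agree are assumptions for this illustration only.

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

# Intersection written with the set operator.
my $set_op = qr/(?[ [sxy] & [exw] ])/;

# The same idea as a lookahead: the next character must be in [sxy],
# and the character actually consumed must be in [exw].
my $lookahead = qr/(?=[sxy])[exw]/;

my @lines = (
    'This is a line',
    'This is the next line',
    "And here's another line",
);

# Record whether the two patterns give the same answer for each line.
my @agree = map { ( /$set_op/ ? 1 : 0 ) == ( /$lookahead/ ? 1 : 0 ) } @lines;

say "$lines[$_]: ", ( $agree[$_] ? 'same result' : 'different result' )
    for 0 .. $#lines;
```

The lookahead version works, but the set operator states the intent in one place and does not depend on the order of the two pieces.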
The - is the set subtraction operator. In this example, the first character class is the Perl word characters. You subtract from that the ASCII alphabetical characters, leaving the digits, the underscore, and any word characters outside ASCII:
    use v5.18;
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ [\w] - [a-zA-Z] ])/;

    while( <DATA> ) {
        chomp;
        say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
    }

    __DATA__
    This is 1 line
    This is the next line
    And here's another line
Only the first line has a digit, so only it matches:
    [This is 1 line] Matched
    [This is the next line] Missed
    [And here's another line] Missed
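If you want to see exactly what the subtraction leaves behind, you can enumerate the survivors. This sketch scans only the ASCII range and then probes a single non-ASCII letter; the variable names and the choice of U+00E9 as the sample character are assumptions made for illustration.

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ [\w] - [a-zA-Z] ])/;

# Within ASCII, the remainder is the ten digits and the underscore.
my $ascii_left = join '', grep { /$regex/ } map { chr($_) } 0 .. 127;
say "ASCII remainder: $ascii_left";   # ASCII remainder: 0123456789_

# Outside ASCII, word characters are not subtracted. U+00E9 (e with acute)
# is used here only as a sample non-ASCII letter.
my $sample = chr(0xE9);
say 'U+00E9 ', ( $sample =~ $regex ? 'matches' : 'does not match' );
```

The second check is a reminder that [a-zA-Z] removes only the ASCII letters, so a word character from any other script is still in the set.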
This gets more interesting with named properties, the only Level 2 feature Perl supports so far (see perluniprops). Some character classes may be easier to construct, read, and maintain without losing their literal characters. Suppose you want to get just the Eastern Arabic digits, perhaps because you’re in a country that uses Arabic, as I am as I write this. You can take the intersection of the Arabic property and the Digit property. The Universal Character Set has the wonderful feature of assigning many labels to its characters, so you can identify subsets of a particular script:
    use v5.18;
    use utf8;
    use open qw(:std :utf8);
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ \p{Arabic} & \p{Digit} ])/;

    foreach my $ord ( 0 .. 0x10fffd ) {
        my $char = chr( $ord );
        say $char if $char =~ m/$regex/;
    }
Now you see just the digits from that script:
۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
You can get more complicated. Suppose you want the Western Arabic digits too (what we normally call just “arabic numerals”). There are easier ways to solve part of this problem, but they don’t show off the operations. In this example, you have two separate intersections that are joined in a union:
    use v5.18;
    use utf8;
    use open qw(:std :utf8);
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ ( \p{Arabic} & \p{Digit} ) + ( \p{ASCII} & \p{Digit} ) ])/;

    foreach my $ord ( 0 .. 0x10fffd ) {
        my $char = chr( $ord );
        say $char if $char =~ m/$regex/;
    }
Now you see two sets of numerals:
0 1 2 3 4 5 6 7 8 9 ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
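Since both intersections share \p{Digit}, ordinary set algebra suggests the class could also be factored into one union followed by one intersection. The sketch below compares the two forms rather than assuming they agree. The names $long and $short are hypothetical, and the scan stops at U+07FF, a range picked for this sketch because it covers ASCII and the main Arabic block while staying quick to run.

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

# The form shown above: a union of two intersections.
my $long  = qr/(?[ ( \p{Arabic} & \p{Digit} ) + ( \p{ASCII} & \p{Digit} ) ])/;

# The factored form: one union, then one intersection.
my $short = qr/(?[ ( \p{Arabic} + \p{ASCII} ) & \p{Digit} ])/;

# Collect the matching code points for each pattern over U+0000 .. U+07FF.
my @long_hits  = grep { chr($_) =~ $long  } 0 .. 0x07FF;
my @short_hits = grep { chr($_) =~ $short } 0 .. 0x07FF;

say "@long_hits" eq "@short_hits" ? 'Same characters' : 'Different characters';
```

The parentheses matter in the factored form: & binds more tightly than +, so without them the intersection would apply only to \p{ASCII}.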
There is one more character class set operator, the ^, which acts like an exclusive-or (the xor bit operator uses the same character). This operator takes the union of the two character classes then subtracts their intersection. That is, the resulting set has all the characters that are in either class except for the ones they both have.
In this example, you have two intersections to extract the hex digits and digits from ASCII. That’s important since other scripts in the UCS have characters with these properties. From those intersections, you use the ^ to get the set that only contains the characters that show up in exactly one set.
    use v5.18;
    use utf8;
    use open qw(:std :utf8);
    no warnings qw(experimental::regex_sets);

    my $regex = qr/(?[ ( \p{ASCII} & \p{HexDigit} ) ^ ( \p{ASCII} & \p{Digit} ) ])/;

    foreach my $ord ( 0 .. 0x10fffd ) {
        my $char = chr( $ord );
        say $char if $char =~ m/$regex/;
    }
In this case, it’s the uppercase and lowercase letters A through F:
A B C D E F a b c d e f
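The description of ^ as "the union minus the intersection" can be spelled out with the other operators and compared against the real thing. In this sketch the ASCII hex digits and digits are written as literal ranges to keep the patterns short, which is a simplification of the property-based version above, and all variable names are illustrative.

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

# Symmetric difference with the ^ operator.
my $xor = qr/(?[ [0-9A-Fa-f] ^ [0-9] ])/;

# The same set built by hand: the union minus the intersection.
my $by_hand = qr/(?[ ( [0-9A-Fa-f] + [0-9] ) - ( [0-9A-Fa-f] & [0-9] ) ])/;

# Collect the ASCII characters that each pattern accepts.
my $xor_set  = join '', grep { /$xor/ }     map { chr($_) } 0 .. 127;
my $hand_set = join '', grep { /$by_hand/ } map { chr($_) } 0 .. 127;

say "xor:     $xor_set";    # xor:     ABCDEFabcdef
say "by hand: $hand_set";   # by hand: ABCDEFabcdef
```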
Here’s another example. In “Match Unicode characters by property value”, you could find characters by their numerical value. But not all of those are “digits”. You can construct a character class for the digits (already the shortcut \d) and the character class with numerical values. Only when both of those character classes match in the same position does the overall character class set match:
    use v5.18;
    no warnings qw(experimental::regex_sets);
    use open qw(:std :utf8);

    foreach ( 1 .. 0x10fffd ) {
        next unless chr =~ m/ (?[ \d & [\p{nv=1}\p{nv=3}\p{nv=7}] ]) /x;
        printf "%s (U+%04X)\n", chr, $_;
    }
Things to remember
- Regular expression character class set operations satisfy UTS #18 Level 1 requirements.
- You can compose character classes from other classes with unions, intersections, and subtractions.
- Inside the (?[ ]), whitespace is insignificant.
- regex_sets is an experimental feature.