Perl’s regular expressions have a simple rule for capturing groups: Perl counts left parentheses, in order, to assign capture variables. Not all capture groups must actually match parts of the string, and Perl doesn’t care if they do. Perl assigns capture groups inside an alternation consecutively, even though it knows that only one branch of the alternation will match. Perl 5.10 adds the branch reset, (?|alternation), which mitigates that.
How many captures will a particular pattern produce? Can you tell just by looking at the pattern? How much does the particular string matter? How many capture groups are in this pattern:
(Buster)|(Mimi)|(Ella)
There are three capture groups. Only one of them is going to capture because each group is in a different branch of the alternation. What capture variables will that pattern set?
String | Triggered groups | $1 | $2 | $3 |
---|---|---|---|---|
Buster | (Buster) | Buster | undef | undef |
Mimi | (Mimi) | undef | Mimi | undef |
Ella | (Ella) | undef | undef | Ella |
Buster Mimi Ella | (Buster) | Buster | undef | undef |
No matter which string you match against this pattern, a successful match involves all three capture variables, and two of them will be undefined.
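To check the table yourself, here is a minimal sketch that runs each string through the pattern and reports the three capture variables. The helper name plain_captures is made up for this illustration, and the // operator assumes Perl 5.10 or later:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: returns the three capture variables as printable
# strings, or an empty list when the pattern does not match.
sub plain_captures {
    my ($string) = @_;
    return unless $string =~ /(Buster)|(Mimi)|(Ella)/;
    return map { $_ // 'undef' } ($1, $2, $3);
}

for my $string ('Buster', 'Mimi', 'Ella', 'Buster Mimi Ella') {
    say "$string => ", join ' | ', plain_captures($string);
}

# Buster => Buster | undef | undef
# Mimi => undef | Mimi | undef
# Ella => undef | undef | Ella
# Buster Mimi Ella => Buster | undef | undef
```

The last string shows that the leftmost match wins: once Buster matches, Perl never looks at the other branches.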
Perl 5.10 introduces the branch reset pattern, (?|alternation). You use that so that Perl numbers the capture buffers from the same starting point for each branch in the alternation. Instead of creating three capture buffers in your alternation, you can create just one buffer for this pattern:
(?|(Buster)|(Mimi)|(Ella))
The three capture groups in this pattern populate the same buffer:
String | Triggered groups | $1 |
---|---|---|
Buster | (Buster) | Buster |
Mimi | (Mimi) | Mimi |
Ella | (Ella) | Ella |
Buster Mimi Ella | (Buster) | Buster |
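As a quick sketch of the same table in code, the following runs each string against the branch reset version and looks only at $1. The helper name reset_capture is an assumption made for this example:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: returns the single shared capture, or undef when
# the pattern does not match.
sub reset_capture {
    my ($string) = @_;
    return undef unless $string =~ /(?|(Buster)|(Mimi)|(Ella))/;
    return $1;
}

for my $string ('Buster', 'Mimi', 'Ella', 'Buster Mimi Ella') {
    say "$string => ", reset_capture($string);
}

# Buster => Buster
# Mimi => Mimi
# Ella => Ella
# Buster Mimi Ella => Buster
```

Whichever branch matches, you read the result from the same place, so the calling code doesn’t have to test three variables to find the defined one.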
This is more important when the alternation is in the middle of a larger pattern and there are additional capture groups after the alternation:
(?|(Buster)|(Mimi)|(Ella))(Ginger)
That’s a bit easier to read with extended patterns (Item 37: Make regular expressions readable):
    (?|          # $1
        (Buster) |
        (Mimi)   |
        (Ella)
    )
    (            # $2
        Ginger
    )
No matter how many branches you add to the alternation, the group for Ginger is always $2:
    (?|          # $1
        (Buster) |
        (Mimi)   |
        (Ella)   |
        (Roscoe)
    )
    (            # $2
        Ginger
    )
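Here is a small sketch that exercises the four-branch pattern, assuming a made-up helper named cat_pair. The comments inside the pattern say "group 1" and "group 2" rather than naming the variables, just to keep the pattern text free of anything that looks like interpolation:

```perl
use strict;
use warnings;
use 5.010;

my $pattern = qr/
    (?|                     # group 1, shared by every branch
        (Buster) | (Mimi) | (Ella) | (Roscoe)
    )
    (Ginger)                # group 2, however many branches come before it
/x;

# Hypothetical helper: returns the two captures, or an empty list.
sub cat_pair {
    my ($string) = @_;
    return unless $string =~ $pattern;
    return ($1, $2);
}

for my $string ('BusterGinger', 'RoscoeGinger') {
    say join ' + ', cat_pair($string);
}

# Buster + Ginger
# Roscoe + Ginger
```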
That doesn’t mean that the numbering after the alternation is always the same, though. Not every branch must have the same number of captures, but the branch reset grouping always takes up as many buffers as the branch with the most capture groups, even if that’s not the branch that matches. Consider this pattern where one of the branches has two capture groups:
    (?|
        (Buster)       | # $1, $2 is undef
        (Mimi)(Roscoe) | # $1, $2
        (Ella)           # $1, $2 is undef
    )
    (                    # $3
        Ginger
    )
The $1 variable is always the first capture group of whichever branch matched:
String | Triggered groups | $1 |
---|---|---|
BusterGinger | (Buster) | Buster |
MimiRoscoeGinger | (Mimi) | Mimi |
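To see all three buffers at once, this sketch prints $1, $2, and $3 for a string from each branch. The helper name uneven_captures is invented for the example:

```perl
use strict;
use warnings;
use 5.010;

my $pattern = qr/
    (?|
        (Buster)        |   # group 1 only, group 2 stays undef
        (Mimi)(Roscoe)  |   # groups 1 and 2
        (Ella)              # group 1 only, group 2 stays undef
    )
    (Ginger)                # group 3, after the widest branch
/x;

# Hypothetical helper: returns all three captures as printable strings.
sub uneven_captures {
    my ($string) = @_;
    return unless $string =~ $pattern;
    return map { $_ // 'undef' } ($1, $2, $3);
}

for my $string ('BusterGinger', 'MimiRoscoeGinger', 'EllaGinger') {
    say "$string => ", join ' | ', uneven_captures($string);
}

# BusterGinger => Buster | undef | Ginger
# MimiRoscoeGinger => Mimi | Roscoe | Ginger
# EllaGinger => Ella | undef | Ginger
```

Ginger lands in $3 every time because the branch reset reserves two buffers, the most any single branch uses.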
The branch reset can cause problems with named captures (Item 31: Use named captures to label matches), which are really just aliases to the numbered capture variables. Labeling each capture group doesn’t do what you might expect:
    (?|
        (?<cat1>Buster)              | # $1, $2 is undef
        (?<cat2>Mimi)(?<cat3>Roscoe) | # $1, $2
        (?<cat4>Ella)                  # $1, $2 is undef
    )
    (?<cat5>
        Ginger
    )
Each label is just an alias to its numbered capture variable:
Label | Aliased to |
---|---|
cat1 | $1 |
cat2 | $1 |
cat3 | $2 |
cat4 | $1 |
cat5 | $3 |
The labels don’t apply to the groups you think they do:
String | $1 | $2 | cat1 | cat2 | cat3 | cat4 |
---|---|---|---|---|---|---|
BusterGinger | Buster | undef | Buster | Buster | undef | Buster |
EllaGinger | Ella | undef | Ella | Ella | undef | Ella |
MimiRoscoeGinger | Mimi | Roscoe | Mimi | Mimi | Roscoe | Mimi |
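If you want to watch the aliasing happen, this sketch reads each label back through the %+ hash. The helper name labeled_captures is hypothetical, and the results should line up with the table above on a perl that follows the behavior documented in perlre:

```perl
use strict;
use warnings;
use 5.010;

my $pattern = qr/
    (?|
        (?<cat1>Buster)              |
        (?<cat2>Mimi)(?<cat3>Roscoe) |
        (?<cat4>Ella)
    )
    (?<cat5>Ginger)
/x;

# Hypothetical helper: returns "label=value" strings for every label,
# reading the values through the %+ named-capture hash.
sub labeled_captures {
    my ($string) = @_;
    return unless $string =~ $pattern;
    return map { sprintf '%s=%s', $_, $+{$_} // 'undef' }
        qw(cat1 cat2 cat3 cat4 cat5);
}

for my $string ('BusterGinger', 'EllaGinger', 'MimiRoscoeGinger') {
    say "$string => ", join ' ', labeled_captures($string);
}
```

Even though only one branch can match, cat1, cat2, and cat4 all report the same value because they all point at the first buffer.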
You should probably use the same labels in each branch and order them the same so you get the results that you expect:
    (?|
        (?<cat1>Buster)              | # $1, $2 is undef
        (?<cat1>Mimi)(?<cat2>Roscoe) | # $1, $2
        (?<cat1>Ella)                  # $1, $2 is undef
    )
    (?<cat3>
        Ginger
    )
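With consistent labels, each name maps to exactly one buffer. A short sketch, again with an invented helper name (consistent_captures), shows the labels reporting what you would expect:

```perl
use strict;
use warnings;
use 5.010;

my $pattern = qr/
    (?|
        (?<cat1>Buster)              |
        (?<cat1>Mimi)(?<cat2>Roscoe) |
        (?<cat1>Ella)
    )
    (?<cat3>Ginger)
/x;

# Hypothetical helper: returns the values of cat1, cat2, and cat3.
sub consistent_captures {
    my ($string) = @_;
    return unless $string =~ $pattern;
    return map { $+{$_} // 'undef' } qw(cat1 cat2 cat3);
}

for my $string ('EllaGinger', 'MimiRoscoeGinger') {
    say "$string => ", join ' | ', consistent_captures($string);
}

# EllaGinger => Ella | undef | Ginger
# MimiRoscoeGinger => Mimi | Roscoe | Ginger
```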
Things to remember
- Perl numbers capture groups by counting the literal order of left parentheses
- Every capture group in an alternation creates a capture buffer
- The branch reset grouping, (?|...), restarts the buffer numbering for each branch of the alternation
- Label captures in alternations with the same labels in the same order
I was playing with branch reset today, and there’s one thing I can’t understand. Please take a look at this code:
I expected the result would be
But instead it’s
Could you tell me why I didn’t meet my expectation? Why is $1 undef (although if the string was Wilma and Barney it would be OK)?
The problem is your second alternation and precedence. If you group (?:(Barney)|(Fred)) so that | only applies to them, you get the output you expected.
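The code from the question isn’t shown here, so the following is only a guess at the kind of pattern involved, not the commenter’s actual code. Both patterns and the helper name captures_with are assumptions. The sketch shows how a top-level | splits the entire pattern in two, and how the non-capturing group confines it:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical patterns. In the first, the final | is at the top level,
# so the whole pattern is "everything before it" OR "(Fred)" alone.
my $ungrouped = qr/(?|(Wilma)|(Betty)) and (Barney)|(Fred)/;
my $grouped   = qr/(?|(Wilma)|(Betty)) and (?:(Barney)|(Fred))/;

# Hypothetical helper: returns the three captures as printable strings.
sub captures_with {
    my ($pattern, $string) = @_;
    return unless $string =~ $pattern;
    return map { $_ // 'undef' } ($1, $2, $3);
}

say join ' | ', captures_with($ungrouped, 'Wilma and Fred');
say join ' | ', captures_with($grouped,   'Wilma and Fred');

# undef | undef | Fred
# Wilma | undef | Fred
```

In the ungrouped version only the lone (Fred) alternative can match this string, so nothing ever lands in $1.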