This is a chapter in Perl New Features, a book from Perl School that you can buy on LeanPub or Amazon. Your support helps me to produce more content.
Perl v5.36 adds the -g switch as a shortcut for -0777, which undefines the input record separator so you can read an entire file as a single string. This is often called “slurping”, and is useful when you need to process text that spans several lines.
The input record separator
The input record separator is the character (or characters) that Perl’s line-input operator uses to determine when a line has ended. By default, that’s a newline (U+000A), but you can use any string you like by setting $/ ($INPUT_RECORD_SEPARATOR). Sometimes the form feed is a useful separator for multiline records:
$/ = "\f";
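In a program, you probably want that change to be temporary. Here’s a minimal sketch, assuming made-up sample data read through an in-memory filehandle, that uses local to limit the new separator to one block:

```perl
use strict;
use warnings;

# Made-up sample data: three multiline records, each ended by a form feed.
my $data = "one\ntwo\n\fA\nB\n\f9\n10\n\f";

# Read from the string as if it were a file, so the example is self-contained.
open my $fh, '<', \$data or die "Could not open in-memory file: $!";

my @records;
{
    local $/ = "\f";    # the change lasts only until the end of this block
    while ( defined( my $record = <$fh> ) ) {
        chomp $record;  # chomp removes the current $/, the form feed here
        push @records, $record;
    }
}
close $fh;

printf "Read %d records\n", scalar @records;
print "Second record:\n$records[1]";
```

Outside the block, $/ goes back to whatever it was before, so other code that reads lines is unaffected.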
On the command line, the -0 switch is a quick way to set the value for $/. Without a value, it uses the null byte, which is sometimes useful as a separator:
% perl -MO=Deparse -0 -e 1 BEGIN { $/ = "\000"; $\ = undef; } '???'; -e syntax OK
A number in octal or hexadecimal sets $/
to some other single character:
% perl -MO=Deparse -0014 -e 1 BEGIN { $/ = "\f"; $\ = undef; } '???'; -e syntax OK
% perl -MO=Deparse -0xC -e 1 BEGIN { $/ = "\f"; $\ = undef; } '???'; -e syntax OK
Any octal number above 0377 (more than 255 decimal) sets $/ to undef:
% perl -MO=Deparse -0377 -e 1 BEGIN { $/ = "\377"; $\ = undef; } '???'; -e syntax OK
% perl -MO=Deparse -0400 -e 1 BEGIN { $/ = undef; $\ = undef; } '???'; -e syntax OK
Conventionally, though, the Perl documentation has used 777 as the value to get undef, probably since it’s easier to remember:
% perl -MO=Deparse -0777 -e 1 BEGIN { $/ = undef; $\ = undef; } '???'; -e syntax OK
The new -g is a short synonym for -0777, so it does the same thing. Compare the output without the switch and with it:
% perl -MO=Deparse -e 1 '???'; -e syntax OK
% perl -MO=Deparse -g -e 1 BEGIN { $/ = undef; $\ = undef; } '???'; -e syntax OK
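Inside a program, the usual way to get the same effect is to localize $/ so that it’s undefined for a single read. This is a small sketch that assumes made-up text in an in-memory filehandle rather than a real file:

```perl
use strict;
use warnings;

# Made-up sample text standing in for a file on disk.
my $text = "alpha\nbeta\ngamma\n";
open my $fh, '<', \$text or die "Could not open in-memory file: $!";

# local with no value leaves $/ undefined, but only inside the do block.
my $whole = do { local $/; <$fh> };
close $fh;

# With the whole file in one string, a pattern can match across line endings.
my $spans_lines = $whole =~ /alpha\nbeta/ ? 'yes' : 'no';

printf "Read %d characters in one go\n", length $whole;
print "Matched across a newline: $spans_lines\n";
```

The do block keeps the undefined $/ from leaking into the rest of the program, which is the part that -g and -0777 can’t do for you on the command line.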
Far more than you ever wanted to know
The -0 switch has a few other interesting behaviors. Since I’m already writing about this feature, I might as well keep going.
Single-character line ending
You can use an octal or hexadecimal number after the -0
to choose the single character that you want to use as the line ending. I’ve often used the form feed (U+000C) to separate multi-line records. The particular character doesn’t matter as long as it doesn’t appear in the data (so the null byte might be useful too):
% perl -le "print qq(one\ntwo\nthree\n\fA\nB\nC\n\f9\n10\n11\n)" > formfeed.txt
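When you read formfeed.txt with the default line ending, you get the lines you’d expect, and you don’t see the form feeds: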
% perl -ne 'print' formfeed.txt one two three A B C 9 10 11
You can see that more easily when you replace the invisible characters with their ordinal values, shown in octal here:
% perl -pe 's/(\P{Print})/sprintf(q(%03o),ord($1)) . "\n"/eg' formfeed.txt one012 two012 three012 014 A012 B012 C012 014 9012 10012 11012 012
When you use the octal value of the form feed for the number after the -0 switch and output lines surrounded by angle brackets, you get three lines (with the newlines and line-ending form feed intact):
% perl -014 -ne 'print qq(<$_>)' formfeed.txt <one two three ><A B C ><9 10 11 >
You could have also specified this with three digits, -0014, or as hexadecimal with a leading x, like -0xC. The hexadecimal version is valuable when you need to specify a character past the largest single octet value you can get out of three octal digits, which is 0377.
There’s a catch though. If you want to set the input record separator to a wide character, you need to ensure that you read the input correctly. For the ☃ (U+2603 SNOWMAN) to be the separator, which takes up three octets in UTF-8, you need to read the input as UTF-8 too. The -C
is one way to do that:
% perl -0x2603 -C -ne 'print qq(<$_>)' snowmen.txt >>
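The same idea works inside a program as long as the filehandle decodes its input. This sketch assumes made-up sample text and an in-memory filehandle with a UTF-8 layer; the names are mine, not anything from perl itself:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up sample: two records, each ending with U+2603 SNOWMAN.
# The in-memory file holds UTF-8 octets, just like a file on disk would.
my $octets = encode( 'UTF-8', "first\x{2603}second\x{2603}" );

# The encoding layer turns the octets back into characters as perl reads.
open my $fh, '<:encoding(UTF-8)', \$octets
    or die "Could not open in-memory file: $!";

# $/ is the character itself, not its three-octet UTF-8 form.
my @records = do { local $/ = "\x{2603}"; <$fh> };
close $fh;

printf "Read %d records\n", scalar @records;
printf "First record is %d characters long\n", length $records[0];
```

Each record keeps its trailing snowman until you chomp it, the same as a newline would stay on an ordinary line.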
You aren’t able to specify multiple characters as a line separator since perl reads only one number after the -0:
% perl -MO=Deparse -0x0100x2603 -e No Perl script found in input
Slurping an entire file
If you specify an octal value 400 or higher, which is more than 8 bits, Perl sets the input record separator to undef. With no defined value for $/
, Perl slurps the entire input. But, this is different than setting the empty string (a defined value), which I write about in the next section.
You’ve probably seen -0777
, perhaps the most common use of -0
:
% perl -0777 -ne 'print qq(<$_>)' dog.txt <Newfoundland Golden Retreiver Boxer >
The -n switch reads from the ARGV filehandle, which does some trickery to make it look like all the input is coming from one source. However, the line input operator can’t read across the command-line files; perl slurps each file separately:
% perl -0777 -ne 'print qq(<$_>)'⏎dog.txt cat.txt lizard.txt <Newfoundland Golden Retreiver Boxer ><Tabby Marmalade Tiger ><Monitor Iguana Godzilla >
If you wanted all the files to be one line, route them through standard input before they get to perl:
% cat dog.txt cat.txt lizard.txt |⏎perl -0 -ne 'print qq(<$_>)'
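You can see the same per-file behavior from inside a program. This is a sketch under a few assumptions: it writes two made-up files into a temporary directory, points @ARGV at them, and slurps through the diamond operator:

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Made-up file names and contents, written to a throwaway directory.
my @files = (
    [ 'dog.txt', "Newfoundland\nBoxer\n" ],
    [ 'cat.txt', "Tabby\nTiger\n" ],
);

my $dir = tempdir( CLEANUP => 1 );
my @paths;
for my $file (@files) {
    my ( $name, $contents ) = @$file;
    my $path = File::Spec->catfile( $dir, $name );
    open my $out, '>', $path or die "Could not write $path: $!";
    print {$out} $contents;
    close $out or die "Could not close $path: $!";
    push @paths, $path;
}

# With $/ undefined, the diamond operator returns one string per file.
my @chunks = do {
    local @ARGV = @paths;
    local $/;
    <>;
};

# Joining the chunks gives the single string that the cat pipeline produces.
my $everything = join '', @chunks;

printf "Slurped %d chunks from %d files\n", scalar @chunks, scalar @paths;
```

If you want one string in a program, joining the per-file chunks is probably simpler than arranging for the data to arrive on standard input.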
Paragraph mode
“Paragraph mode” is a special case. The -00
sets the input record separator to the empty string. That’s different than the undefined value even though both are false:
% perl -MO=Deparse -00 -e 1 BEGIN { $/ = ""; $\ = undef; } '???'; -e syntax OK
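When the input record separator is the empty string, perl ends the record at a run of one or more blank lines, as if the separator were a newline followed by \n+. Not only that, but it collapses the multiple newlines to exactly two newlines: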
% perl -00 -ne 'print qq(<$_>)' paras.txt <First line first para Second line first para Third line first para ><After first blank line Second line after first blank line Third line after first blank line ><After 2nd blank line 2nd line after 2nd blank line 3rd line after 2nd blank line >
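Paragraph mode works the same way when you set $/ yourself. Here’s a short sketch, assuming made-up text where the paragraphs are separated by different numbers of blank lines:

```perl
use strict;
use warnings;

# Made-up text: three blank lines after the first paragraph, two after the second.
my $text = "para one, line one\npara one, line two\n\n\n\npara two\n\n\npara three\n";
open my $fh, '<', \$text or die "Could not open in-memory file: $!";

# The empty string (not undef) turns on paragraph mode for this read.
my @paragraphs = do { local $/ = ""; <$fh> };
close $fh;

for my $i ( 0 .. $#paragraphs ) {
    # Count the newlines at the end of each record.
    my ($tail) = $paragraphs[$i] =~ /(\n*)\z/;
    printf "Paragraph %d ends with %d newline(s)\n", $i + 1, length $tail;
}
```

The extra blank lines between paragraphs never show up in the records, and the last paragraph keeps only the single newline that the data actually ends with.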
Summary
Here’s a quick summary of the various incantations of the -0
switch:
Switch | Input Record Separator | Note |
---|---|---|
-0 | \000 | null byte |
-00 | empty string, but “\n+” | paragraph mode |
-0014 | 8-bit character, in octal | form feed |
-0xC | 8-bit character, in hex | form feed |
-0400 | undef, above 8-bit | slurp |
-0777 | undef, idiomatic | slurp |
-g | undef | slurp, new in v5.36 |
-0x1FF | \777 character, include -C | actual \777 |
-0x2603 | wide character, include -C | snowman |
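If you’d rather check some of these rows yourself without B::Deparse, you can ask a child perl what it sees in $/. This is only a sketch: separator_for is a hypothetical helper name of my own, and it assumes a platform where the list form of a pipe open works:

```perl
use strict;
use warnings;

# separator_for() is a hypothetical helper: it runs a child perl with one
# switch and reports the $/ that the child sees.
sub separator_for {
    my ($switch) = @_;

    # Code for the child: describe $/ as undef, empty, or octal ordinals.
    my $code = q{
        if    ( !defined $/ ) { print 'undef' }
        elsif ( $/ eq '' )    { print 'empty' }
        else  { print join ' ', map { sprintf '%03o', ord $_ } split //, $/ }
    };

    # The list form of a pipe open runs $^X (the current perl) without a shell.
    open my $child, '-|', $^X, $switch, '-e', $code
        or die "Could not run $^X: $!";
    my $report = do { local $/; <$child> };
    close $child or die "Child perl failed for $switch";

    return $report;
}

my @switches = qw(-0 -00 -014 -0xC -0400 -0777);
push @switches, '-g' if $] >= 5.036;    # -g exists only in v5.36 and later

for my $switch (@switches) {
    printf "%-6s %s\n", $switch, separator_for($switch);
}
```

The wide-character rows are left out because they need -C and decoded input to be meaningful.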