Slurp a file from the command line with -g

This is a chapter in Perl New Features, a book from Perl School that you can buy on LeanPub or Amazon. Your support helps me to produce more content.


Perl v5.36 adds the -g switch as a shortcut for -0777, which undefines the input record separator so you can read an entire file as a single string. This is often called “slurping”, and is useful when you need to process text that spans several lines.

The input record separator

The input record separator is the character (or characters) that Perl’s line-input operator uses to determine when a line has ended. By default, that’s a newline (U+000A), but you can use any string you like by setting $/ ($INPUT_RECORD_SEPARATOR if you use the English module). Sometimes the form feed is a useful separator for multiline records:

$/ = "\f";
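Inside a program, it is usually safer to change $/ only within a small scope. Here is a minimal sketch of that idea; the read_records name and the in-memory sample data are made up for illustration and are not part of the original example:

```perl
use strict;
use warnings;

# Hypothetical helper: read form-feed-separated records from a filehandle.
sub read_records {
    my ($fh) = @_;
    local $/ = "\f";      # the change is undone when the sub returns
    my @records = <$fh>;
    chomp @records;       # chomp removes the current $/, here the form feed
    return @records;
}

# Sample data in an in-memory filehandle, so the sketch needs no files
my $data = "one\ntwo\n\fA\nB\n\f9\n10\n";
open my $fh, '<', \$data or die "Could not open in-memory handle: $!";
my @records = read_records($fh);
close $fh;

print scalar(@records), " records\n";
```

Because the assignment is wrapped in local, code elsewhere in the program still sees the default newline separator.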

On the command line, the -0 switch is a quick way to set the value for $/. Without a value, it uses the null byte, which is sometimes useful as a separator:

% perl -MO=Deparse -0 -e 1
BEGIN { $/ = "\000"; $\ = undef; }
'???';
-e syntax OK

A number in octal or hexadecimal sets $/ to some other single character:

% perl -MO=Deparse -0014 -e 1
BEGIN { $/ = "\f"; $\ = undef; }
'???';
-e syntax OK
% perl -MO=Deparse -0xC -e 1
BEGIN { $/ = "\f"; $\ = undef; }
'???';
-e syntax OK

Any number above 0377 octal (more than 255 decimal) sets $/ to undef:

% perl -MO=Deparse -0377 -e 1
BEGIN { $/ = "\377"; $\ = undef; }
'???';
-e syntax OK
% perl -MO=Deparse -0400 -e 1
BEGIN { $/ = undef; $\ = undef; }
'???';
-e syntax OK

Conventionally, though, the Perl documentation has used 777 as the value to get undef, probably because it’s easier to remember:

% perl -MO=Deparse -0777 -e 1
BEGIN { $/ = undef; $\ = undef; }
'???';
-e syntax OK

The new -g is a short synonym for -0777, so it does the same thing. Compare the output without the switch and with it:

% perl -MO=Deparse -e 1
'???';
-e syntax OK
% perl -MO=Deparse -g -e 1
BEGIN { $/ = undef; $\ = undef; }
'???';
-e syntax OK
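The same effect is available inside a program by leaving $/ undefined for the duration of a read. This is only a sketch: the slurp name and the sample data are my own, chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything left on the filehandle as one string.
sub slurp {
    my ($fh) = @_;
    local $/;             # local without a value leaves $/ undefined
    return scalar <$fh>;  # scalar forces a single read, whatever the caller's context
}

# Sample data in an in-memory filehandle, so the sketch needs no files
my $data = "Newfoundland\nGolden Retriever\nBoxer\n";
open my $fh, '<', \$data or die "Could not open in-memory handle: $!";
my $content = slurp($fh);
close $fh;

# Count the newlines to show that several lines arrived in one read
my $line_count = () = $content =~ /\n/g;
print "One read returned $line_count lines\n";
```

Limiting the undefined separator to the helper keeps every other read in the program line-oriented.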

Far more than you ever wanted to know

The -0 switch has a few other interesting behaviors. Since I’m already writing about this feature, I might as well keep going.

Single-character line ending

You can use an octal or hexadecimal number after the -0 to choose the single character that you want to use as the line ending. I’ve often used the form feed (U+000C) to separate multi-line records. The particular character doesn’t matter as long as it doesn’t appear in the data (so the null byte might be useful too):

% perl -le "print qq(one\ntwo\nthree\n\fA\nB\nC\n\f9\n10\n11\n)" > formfeed.txt

When you read formfeed.txt line by line with no change to the input record separator, you see the three records separated by “blank” lines, which are really the form feeds:

% perl -ne 'print' formfeed.txt
one
two
three

A
B
C

9
10
11

You can see that more easily when you replace the nonprinting characters with their ordinal values, shown in octal here:

% perl -pe 's/(\P{Print})/sprintf(q(%03o),ord($1)) . "\n"/eg' formfeed.txt
one012
two012
three012
014
A012
B012
C012
014
9012
10012
11012
012

When you use the octal value of the form feed for the number after the -0 switch and output each record surrounded by angle brackets, you get three records (with the newlines and the record-ending form feeds intact):

% perl -014 -ne 'print qq(<$_>)' formfeed.txt
<one
two
three

><A
B
C

><9
10
11

>

You could have also specified this with three digits, -0014, or as hexadecimal with a leading x, like -0xC. The hexadecimal version is valuable when you need to specify a character past the largest single octet value you can get out of three octal digits, which is 0377.

There’s a catch though. If you want to set the input record separator to a wide character, you need to ensure that you read the input correctly. For the ☃ (U+2603 SNOWMAN) to be the separator, which takes up three octets in UTF-8, you need to read the input as UTF-8 too. The -C is one way to do that:

% perl -0x2603 -C -ne 'print qq(<$_>)' snowmen.txt
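In a program, an encoding layer on the filehandle does the decoding that -C does on the command line. The following sketch assumes sample data that I made up, and it reads from an in-memory filehandle so it needs no snowmen.txt:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample input: three records separated by U+2603 SNOWMAN, stored as UTF-8 octets
my $octets = encode('UTF-8', "first\x{2603}second\x{2603}third");

# The :encoding layer decodes the octets into characters as perl reads them
open my $fh, '<:encoding(UTF-8)', \$octets
    or die "Could not open in-memory handle: $!";

my @records;
{
    local $/ = "\x{2603}";   # a single wide character as the separator
    @records = <$fh>;
    chomp @records;          # removes the trailing snowman from each record
}
close $fh;

print scalar(@records), " records\n";
```

Without the decoding layer, perl would see three separate octets in the input and the single-character separator would not match them.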

You can’t specify multiple characters as a line separator since perl thinks the extra characters are a file for input:

% perl -MO=Deparse -0x0100x2603 -e
No Perl script found in input

Slurping an entire file

If you specify an octal value of 400 or higher, which needs more than 8 bits, Perl sets the input record separator to undef. With no defined value for $/, Perl slurps the entire input. This is different from setting $/ to the empty string (a defined value), which I write about in the next section.

You’ve probably seen -0777, perhaps the most common use of -0:

% perl -0777 -ne 'print qq(<$_>)' dog.txt
<Newfoundland
Golden Retreiver
Boxer
>

The dog.txt file is actually read through the ARGV filehandle, which does some trickery to make it look like all the input is coming from one source. However, the line input operator can’t read across the command-line files; perl figures out when one file is exhausted, closes it, then opens the next file. So, each file appears to be its own line:

% perl -0777 -ne 'print qq(<$_>)' dog.txt cat.txt lizard.txt
<Newfoundland
Golden Retreiver
Boxer
><Tabby
Marmalade
Tiger
><Monitor
Iguana
Godzilla
>

If you want all the files to be one line, route them through standard input before they get to perl. This only looks like a useless use of cat:

% cat dog.txt cat.txt lizard.txt | perl -0777 -ne 'print qq(<$_>)'
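You can see the per-file behavior from inside a program too. This sketch writes two throwaway files (their names and contents are made up), reads them through ARGV with $/ undefined, and then joins the results to get the single string that the pipeline would have produced:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create two small throwaway files in a directory that is removed at exit
my $dir = tempdir(CLEANUP => 1);
my %contents = ('dog.txt' => "Boxer\n", 'cat.txt' => "Tabby\n");
my @files;
for my $name ('dog.txt', 'cat.txt') {
    my $path = "$dir/$name";
    open my $out, '>', $path or die "Could not write $path: $!";
    print {$out} $contents{$name};
    close $out or die "Could not close $path: $!";
    push @files, $path;
}

# With $/ undefined, <> in list context returns one record per file
my @per_file = do {
    local @ARGV = @files;
    local $/;
    <>;
};

# Joining the per-file records gives one string for all of the input
my $everything = join '', @per_file;

print scalar(@per_file), " records from ARGV\n";
```

Joining after the read is one alternative to the cat pipeline when the program already has the list of file names.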

Paragraph mode

“Paragraph mode” is a special case. The -00 sets the input record separator to the empty string. That’s different than the undefined value even though both are false:

% perl -MO=Deparse -00 -e 1
BEGIN { $/ = ""; $\ = undef; }
'???';
-e syntax OK

When the input record separator is the empty string, perl treats it as if it is multiple consecutive newlines. This has the same effect as if the input record separator were the pattern \n{2,}, since a single newline does not end a record. Not only that, but it collapses the multiple newlines to exactly two newlines:

% perl -00 -ne 'print qq(<$_>)' paras.txt
<First line first para
Second line first para
Third line first para

><After first blank line
Second line after first blank line
Third line after first blank line

><After 2nd blank line
2nd line after 2nd blank line
3rd line after 2nd blank line
>
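The same mode works inside a program when you set $/ to the empty string. This is a small sketch with sample data of my own; note that the second gap in the input has four newlines but the record that comes back ends with only two:

```perl
use strict;
use warnings;

# Sample text: paragraphs separated by one blank line, then by three blank lines
my $data = "a1\na2\n\nb1\nb2\n\n\n\nc1\n";
open my $fh, '<', \$data or die "Could not open in-memory handle: $!";

my @paragraphs;
{
    local $/ = "";        # empty string, not undef: paragraph mode
    @paragraphs = <$fh>;
}
close $fh;

# Show each record with its newlines made visible
for my $paragraph (@paragraphs) {
    (my $visible = $paragraph) =~ s/\n/\\n/g;
    print "<$visible>\n";
}
```

The last record keeps whatever the input ended with, which is a single newline here.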

Summary

Here’s a quick summary of the various incantations of the -0 switch:

Switch     Input Record Separator         Note
-0         \000                           null byte
-00        empty string, but “\n{2,}”     paragraph mode
-0014      8-bit character, in octal      form feed
-0xC       8-bit character, in hex        form feed
-0400      undef, above 8-bit             slurp
-0777      undef, idiomatic               slurp
-g         undef                          slurp, new in v5.36
-0x1FF     \777 character, include -C     actual \777
-0x2603    wide character, include -C     snowman

From the Perl documentation