The conventional wisdom for getting a file’s contents into a Perl program is to slurp it, actually loading the entire file into memory. We showed some ways to do this in Item 53: Consider different ways of reading from a stream.
There are several idioms for doing it, from doing it yourself:
my $text = do { local( @ARGV, $/ ) = $file; <> };
or using an optimized module such as File::Slurp:

use File::Slurp qw(read_file);

my $text = read_file( $file );
Given a large file, say, something that is 2 GB, you end up with a memory footprint that is at least the file size. This program took 11 seconds to load a 2 GB file on my Mac Pro, and the memory footprint rose to 2.25 GB and stayed there even after $text went out of scope:
#!/usr/bin/perl
use strict;
use warnings;

print "I am $$\n";

use File::Slurp;

{
    my $start = time;
    my $text = read_file( $ARGV[0] );
    my $loadtime = time - $start;
    print "Loaded file in $loadtime seconds\n";

    # the /g flag counts every match, not just the first
    my $count = () = $text =~ /abc/g;
    print "Found $count occurrences\n";
}

print "Press enter to continue...";
<STDIN>;
The problem is the idea that you have to capture the data and retain control of it to make use of it.
To solve this, avoid the painful part: don’t load the file at all. That I/O is really slow! Instead, you can memory-map, or mmap, the file; the name comes from the system call that makes it possible. Rather than loading the file, you use mmap to make a connection between your address space and the file on the disk. You don’t have to worry about how this happens; you basically use part of a disk file as if it were actually in memory. The advantage is that you don’t have the I/O overhead, so there is no load time, and since you don’t have to make space to hold the file in memory, you don’t pay a memory footprint.
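If you want to kick the tires without any modules, perl itself ships a :mmap PerlIO layer that does the same trick behind an ordinary filehandle (a minimal sketch, assuming your perl and platform support the layer):

#!/usr/bin/perl
use strict;
use warnings;

# the :mmap layer reads through a memory map instead of normal buffered I/O
open my $fh, '<:mmap', $ARGV[0] or die "Could not mmap $ARGV[0]: $!";

while( my $line = <$fh> ) {
    # process $line as usual
}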
This program uses File::Map. You “load” the file instantly, and its actual memory footprint was under 3 MB (three orders of magnitude less!):
#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

print "I am $$\n";

{
    my $start = time;
    map_file my $map, $ARGV[0];
    my $loadtime = time - $start;
    print "Loaded file in $loadtime seconds\n";

    my $count = () = $map =~ /abc/g;
    print "Found $count occurrences\n";
}

<STDIN>;
The $map acts just like a normal Perl string, and you don’t have to worry about any of the mmap details. When the variable goes out of scope, the map is broken, and your program doesn’t suffer from a large chunk of unused memory.
In Tim Bray’s Wide Finder contest to find the fastest way to process log files with “wider” rather than “faster” processors, the winning solution was a Perl implementation using mmap (although it used the older Sys::Mmap module). Perl had nothing special in that regard, because most of the top solutions used mmap to avoid the I/O penalty.
mmap is especially handy when you have to do this with several files at the same time (or even sequentially, if Perl needs to find a chunk of contiguous memory for each). Since you don’t have the data in real memory, you can mmap as many files as you like and work with them simultaneously.
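For instance, here’s a short sketch (the file names are hypothetical; it assumes two paths on the command line) that maps two files at once and searches both without loading either:

#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

my( $file_one, $file_two ) = @ARGV;

# map both files; each variable behaves like an ordinary string
map_file my $log_a, $file_one;
map_file my $log_b, $file_two;

# work with both "in memory" at the same time
my $count_a = () = $log_a =~ /abc/g;
my $count_b = () = $log_b =~ /abc/g;

print "$file_one: $count_a\n$file_two: $count_b\n";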
Also, since the data actually live on the disk, different programs running at the same time can share the data, including seeing the changes each program makes (although you have to work out the normal concurrency issues yourself). That is, mmap is a way to share memory.
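As a sketch of what that sharing might look like, File::Map’s map_file takes an open mode, so you can map a file read-write and write through the map in place; another process that maps the same file sees the change. One caveat: a map has a fixed length, so any replacement must be exactly the same size:

#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

# '+<' maps the file read-write (this assumes the file
# already has at least five bytes in it)
map_file my $map, $ARGV[0], '+<';

# overwrite the first five bytes in place; the replacement must
# be the same length because the map cannot grow or shrink
substr $map, 0, 5, 'HELLO';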
The File::Map module can do much more, too. It allows you to lock maps, and you can also synchronize access from threads in the same process.
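Here’s a minimal sketch of that, assuming a threaded perl; as I understand File::Map’s interface, lock_map holds the lock until the end of the enclosing scope:

use File::Map qw(map_file lock_map);

map_file my $map, $ARGV[0], '+<';

{
    # only one thread at a time gets past this point
    lock_map $map;
    substr $map, 0, 2, 'OK';
}   # the lock is released when the scope ends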
If you don’t actually need the data in your program, don’t ever load it: mmap it instead.
Thanks for the tips. Switching from File::Slurp to File::Map effectively resolves an out-of-memory problem when multipart-uploading some big files to AWS S3.
Leo
Thank you for the tips, Brian.
Is it possible to read a file loaded with File::Map line by line? I tried to read it into an array, but I felt there must be some more memory-efficient way, as my file has 340 million lines.
You can open a filehandle on a string, so I’d go with that. You don’t want to create an array, though! With an array, you’ve read the entire file into your program.
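That might look something like this sketch: open a read-only filehandle on the mapped string, and the readline operator copies only one line at a time into real memory:

#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

map_file my $map, $ARGV[0];

# open a filehandle on the memory-mapped string
open my $fh, '<', \$map or die "Could not open handle on map: $!";

while( my $line = <$fh> ) {
    # process one line at a time
}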