The Perl 5 Porters have released Perl 5.14. By reading the perl51[34]*delta documentation, you can see what’s new in this major version of Perl. David Golden, one of the Perl 5.13 release managers, put together what he’s calling the perldeltadelta, which distills the longish perl*delta files for Perl 5.13, the experimental track that led up to Perl 5.14, down to just the most interesting bits. Continue reading “Perl 5.14 new features”
Use formats to create paginated, plaintext reports
Perl’s format feature allows you to easily create line-oriented text reports with pagination, and if that’s what you want, Perl is for you. This item is just an introduction. You can find the full details in perlform, and in future items. Continue reading “Use formats to create paginated, plaintext reports”
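As a taste, here’s a minimal sketch of a paginated report; the format layout and the data are invented for illustration:

use strict;
use warnings;

our ( $name, $quantity );    # format picture lines refer to package variables

# STDOUT_TOP prints at the top of every page
format STDOUT_TOP =
Item              Quantity
----------------  --------
.

# STDOUT prints one record per call to write
format STDOUT =
@<<<<<<<<<<<<<<<  @>>>>>>>
$name,            $quantity
.

while (<DATA>) {
    chomp;
    ( $name, $quantity ) = split /,/;
    write;    # Perl formats and paginates automatically
}

__DATA__
apples,12
pears,7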
Temporarily remove hash keys or array elements with `delete local`
Perl 5.12 adds a feature that lets you locally delete a hash key or array element (refresh your memory of local with Item 43: Know the difference between my and local). This new feature allows you to temporarily prune a hash or an array: Continue reading “Temporarily remove hash keys or array elements with `delete local`”
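Here’s a minimal sketch of how that looks; the hash and key names are invented for illustration:

use 5.012;    # delete local (and say) need Perl 5.12
use warnings;

my %settings = ( verbose => 1, debug => 1 );

{
    delete local $settings{debug};    # the key disappears for this scope only
    say exists $settings{debug} ? 'debug present' : 'debug pruned';
}

say exists $settings{debug} ? 'debug restored' : 'debug still pruned';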
Don’t make Perl do more work than it needs to
My choice of algorithms and data organization can lead to orders-of-magnitude performance differences between functionally equivalent implementations of my programs. Choosing the right way to do something can save me enormous amounts of processing.
I wrote some code to loop over the lines in a file and modify a couple of elements in each line. The code looks innocent enough, but the actual field modifications were a little more complex; I’ve replaced them with lc and uc to keep this example simple:
while (<>) {
    chomp;
    my @x = split "\t";
    $x[3] = lc $x[3];
    $x[5] = uc $x[5];
    print join("\t", @x), "\n";
}
I have a loop that splits each line of a file on the tab character, then modifies the fourth and sixth fields.
Look closely at the loop: I’m making Perl do a lot more work than it needs to.
First, I call chomp on each line, but then I add the newline right back when I print. There is no reason to do that. If I leave the newline in place, I get the same result:
while (<>) {
    my @x = split "\t";
    $x[3] = lc $x[3];
    $x[5] = uc $x[5];
    print join("\t", @x);
}
Calling chomp when I don’t need to is a minor issue, though; it’s not likely to destroy the performance of any program.
Looking a little closer, I see a much bigger inefficiency if I know the format of the data. Assume each line of data contains many fields, where “many” is some number greater than six in this example. Since I’m only acting on fields four and six, why split the entire line just to put it back together?
With a couple more arguments, I can tell split to limit the number of fields it creates. I can limit my results to seven items, since I don’t need to modify anything beyond that:
while (<>) {
    my @x = split "\t", $_, 7;
    $x[3] = lc $x[3];
    $x[5] = uc $x[5];
    print join("\t", @x);
}
If each line contains only seven, or even ten, fields, I won’t see much, if any, improvement. However, if each line contains dozens or hundreds of fields, my potential speed-up is huge.
There is even more I can do to milk performance out of my loop if I control the data format. If I move the columns that I need to modify to the front of each row, I don’t need to split into so many fields:
while (<>) {
    my @x = split "\t", $_, 3;
    $x[0] = lc $x[0];
    $x[1] = uc $x[1];
    print join("\t", @x);
}
Measure the improvement
Just to be sure that these changes are really making my code faster, I do a little benchmarking to get a feel for the relative performance differences:
use warnings;
use strict;

use Benchmark qw(timethese);

my @data;
for (0..10_000) {
    $data[$_] = join "\t", map { chr( 65 + int rand(52) ) } (0..100);
}

timethese( 500, {
    'standard' => sub {
        for (@data) {
            chomp;
            my @x = split "\t";
            $x[3] = lc $x[3];
            $x[5] = uc $x[5];
            $_ = join("\t", @x) . "\n";
        }
    },
    'no_chomp' => sub {
        for (@data) {
            my @x = split "\t";
            $x[3] = lc $x[3];
            $x[5] = uc $x[5];
            $_ = join("\t", @x);
        }
    },
    'smaller' => sub {
        for (@data) {
            my @x = split "\t", $_, 7;
            $x[3] = lc $x[3];
            $x[5] = uc $x[5];
            $_ = join("\t", @x);
        }
    },
    'smallest' => sub {
        for (@data) {
            my @x = split "\t", $_, 3;
            $x[0] = lc $x[0];
            $x[1] = uc $x[1];
            $_ = join("\t", @x);
        }
    },
});
In this benchmark I experimented on ten thousand records, each with a hundred fields. The benchmarks measure:
- the initial, or “standard” case.
- the case where I just removed a chomp, “no_chomp”.
- the case where I limit split, “smaller”.
- the case where I reorder the inbound data, “smallest”.
The results, reordered here for easier comparison, tell me quite a bit:
Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard...
standard: 451 wallclock secs (449.66 usr + 0.34 sys = 450.00 CPU) @ 1.11/s (n=500)
no_chomp: 451 wallclock secs (446.18 usr + 0.41 sys = 446.59 CPU) @ 1.12/s (n=500)
smaller: 39 wallclock secs (39.15 usr + 0.03 sys = 39.18 CPU) @ 12.76/s (n=500)
smallest: 19 wallclock secs (18.98 usr + 0.01 sys = 18.99 CPU) @ 26.33/s (n=500)
Removing chomp had an almost unnoticeable effect. However, I reduced my processing time tenfold by limiting split. Reordering the inbound data made my code even faster.
Just to see what effect the number of fields really has, I reduced the size of the data so that each record had only ten fields. The results were less dramatic, though the differences are still visible:
Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard...
standard: 58 wallclock secs (57.50 usr + 0.05 sys = 57.55 CPU) @ 8.69/s (n=500)
no_chomp: 55 wallclock secs (55.13 usr + 0.04 sys = 55.17 CPU) @ 9.06/s (n=500)
smaller: 38 wallclock secs (37.67 usr + 0.03 sys = 37.70 CPU) @ 13.26/s (n=500)
smallest: 18 wallclock secs (18.46 usr + 0.01 sys = 18.47 CPU) @ 27.07/s (n=500)
So what does this tell me? Well, as far as a specific optimization goes, limiting split is probably a good idea if I’m trying to optimize my processing and really don’t need every field.
However, there is a larger moral to this story: when performance is a concern, don’t make your code do work that it doesn’t have to. The most important factor is my choice of algorithms and related data structures for solving my problem; that can make or break my performance. After that, specific optimizations, such as limiting split’s fields, help me get just a little more performance out of my code.
Do you want extra Effective Perl in Safari Books?
We’re looking for some reader feedback. You can comment here, send us mail, reply on Twitter, skywrite your message in Chicago (please send us a link to that video in case we miss it), or come up to us at a conference.
We’re thinking about writing an extra, bonus chapter of Effective Perl Programming for Safari Books Online. We’re having fun providing lots of free, extra content on this website, but we thought we might do something nice for those of you who also paid for online access to our book. Continue reading “Do you want extra Effective Perl in Safari Books?”
Implicitly turn on strictures with Perl 5.12
Perl 5.12 can turn on strict for you automatically, stealing a feature from Modern::Perl that takes away one line of boilerplate in your Perl programs and modules. We talk about strict in Item 3: Enable strictures to promote better coding. Similar to what we show in Item 2: Enable new Perl features when you need them, to turn strictures on automatically, you have to use use with a version of Perl 5.11.0 or later: Continue reading “Implicitly turn on strictures with Perl 5.12”
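A minimal sketch of what that looks like:

use v5.12;            # asking for 5.11.0 or later implies use strict

our $answer = 42;     # declared, so strict is happy
say $answer;          # use v5.12 also enables say

# $undeclared = 1;    # would die: Global symbol "$undeclared" requires
                      # explicit package name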
Turn off Perl 5.12 deprecation warnings, if you dare!
Perl 5.12 deprecates several features, for various reasons. Some of the features were always stupid, some need to make way for future development, and some are just too ornery to maintain. All of these are listed in the perl5120delta documentation. The new thing, however, is that Perl 5.12 will warn you about these even if you don’t have warnings turned on. Consider this script full of Perl whoppers: Continue reading “Turn off Perl 5.12 deprecation warnings, if you dare!”
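As a minimal sketch, on a 5.12-era perl you can silence just that class of warnings lexically; defined @items here uses one of the constructs that 5.12 deprecates:

use v5.12;    # strict is implied; warnings are not

my @items = ( 1, 2, 3 );

# Without the 'no warnings', the defined @items line would warn
# "defined(@array) is deprecated" even though warnings aren't on:
{
    no warnings 'deprecated';                      # silence only this class...
    say 'the array has items' if defined @items;   # ...in this lexical scope
}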
Match Unicode characters by property value
A Unicode character has properties; it knows things about itself. Perl v5.10 introduced a way to match a character that has certain properties, and in some cases you could match a particular property value. Now v5.12 lets you match any Unicode property by its value. The newly supported ones include Numeric_Value and Age, for example:
\p{Numeric_Value: 1}
\p{Nv=7}
\p{Age: 3.0}
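For instance, here’s a minimal sketch that matches characters by their numeric value; the sample characters are chosen just for illustration:

use v5.12;
use utf8;                    # the source itself contains non-ASCII characters
binmode STDOUT, ':utf8';

# U+0037 DIGIT SEVEN, U+2166 ROMAN NUMERAL SEVEN, U+00BD VULGAR FRACTION ONE HALF
for my $char ( '7', 'Ⅶ', '½' ) {
    say "$char has a numeric value of 7" if $char =~ /\p{Numeric_Value: 7}/;
}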
Continue reading “Match Unicode characters by property value”
Locate bugs with source control bisection
As you work in Perl you store each step in source control. When you finish a little bit of work, you commit your work. Ideally, every commit deals with one thing so you’re only introducing one logical change in each revision. Somewhere along the process, you might discover that something is not working correctly. You think that it used to work but you’re not sure where things went pear-shaped, perhaps because the bug seemingly deals with something that you weren’t working on. Continue reading “Locate bugs with source control bisection”
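With git, for example, a bisect session might look like this; the tag and test-script names are invented for illustration:

git bisect start
git bisect bad                        # the revision you're on is broken
git bisect good v1.2                  # a revision you believe worked
git bisect run prove t/regression.t   # git tests each midpoint for you
git bisect reset                      # return to where you started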
Perl Authors Night at Powell’s Technical Books
During OSCON, Joshua and I are taking part in the Perl Authors Night at Powell’s Technical Books on Tuesday, July 20 at 7 pm. Bring your copy of Effective Perl Programming for us to sign.
Other authors confirmed so far include:
- chromatic (Modern Perl, Perl Testing: A Developer’s Notebook, Perl Hacks, Extreme Programming Pocket Guide)
- brian d foy (Effective Perl Programming, Learning Perl, Intermediate Perl, Mastering Perl)
- Joshua McAdams (Effective Perl Programming)
- Curtis “Ovid” Poe (Perl Hacks)
- Randal Schwartz (Programming perl (1st edition), Learning Perl, Intermediate Perl, Perls of Wisdom)
- Peter Scott (Perl Medic, Perl Debugged, Perl Fundamentals (DVD))
Make sure that you go to Powell’s Technical Books, at 33 Northwest Park Avenue in Portland, and not one of their other Portland stores. You can take the MAX Green Line (for free) from the Convention Center to NW 5th St and NW Couch St, then walk four blocks west to the store.