book – Page 16 – The Effective Perler

In Perl v5.12, length(undef) returns undef.

Starting with v5.12, the length of an undefined value is undefined itself. Instead of converting the undef to the empty string then taking the length of that string, v5.12 returns undef: Continue reading “In Perl v5.12, length(undef) returns undef.”

Use the /r substitution flag to work on a copy

How many times has this happened to you? You want to modify each element of an array so you send it through a map (Item 20. Use foreach, map, and grep as appropriate.). However, instead of your expected output, you only to get a bunch of numbers or empty strings back? For example, in this case, some digits got into the names of the cats and you want to remove them with a substitution: Continue reading “Use the /r substitution flag to work on a copy”

Understand “global” variables.

Perl has two sorts of variables: the lexical variables that are limited to a particular scope, and package variables that you define in a namespace. The package variables are sometimes also called global variables because they are visible from anywhere in the program as long as you know their name. As you might suspect, Perl makes it a bit more interesting: there are many sorts of global variables. Continue reading “Understand “global” variables.”

Use branch reset grouping to number captures in alternations

Perl’s regular expressions have a simple rule for capturing groups. It counts the order of left parentheses to assign capture variables. Not all capture groups must actually match parts of the string, and Perl doesn’t care if they do. Perl assigns capture groups inside an alternation consecutively, even though it knows that only one branch of the alternation will match. Perl 5.10 adds the branch reset, (?|alternation) which mitigates that, though. Continue reading “Use branch reset grouping to number captures in alternations”

Use formats to create paginated, plaintext reports

Perl’s format feature allows you to easily create line-oriented text reports with pagination, and if that’s what you want, Perl is for you. This item is just an introduction. You can find the full details in perlform, and in future items. Continue reading “Use formats to create paginated, plaintext reports”

Temporarily remove hash keys or array elements with `delete local`

Perl 5.12 adds a feature that lets you locally delete a hash key or array element (refresh your memory of local with Item 43: Know the difference between my and local. This new feature allows you to temporarily prune a hash or an array: Continue reading “Temporarily remove hash keys or array elements with `delete local`”

Don’t make Perl do more work than it needs to

My choice of algorithms and data organization can lead to orders of magnitude of performance differences between functionally equivalent implementations of my programs. Choosing the right way to do something can save me orders of magnitude in processing.

I wrote some code to loop over lines in a file and modify a couple of elements in each line. My code seems very innocent, but the actual field modifications were a little more complex; I replaced them with lc and uc to make things simple for this example:

while (<>) {
  chomp;
  my @x = split "\t";
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x), "\n";
}

I have a loop that splits each line of a file on the tab character then modifies the fourth and sixth field.
Look closely at the loop: I’m making Perl do a lot more work than it needs to.

First, I cal chomp on each line, but I’m adding the newline right back. There is no reason to do that. If I leave it there I get the same result:

while (<>) {
  my @x = split "\t";
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x);
}

chomping when I don’t need to is a minor issue, but it is not likely going to destroy the performance of any program.

Looking a little closer, I see a much bigger inefficiency if I know the format of the data. Assume each line of data contains many fields; many being some number greater than six in this specific example. Since I’m are only acting on fields four and six, why do I split the entire line just to put it back together?

With a couple of more arguments, I can tell split to limit the number of fields it creates. I can limit my results to seven items since I don’t care to modify any beyond that:

while (<>) {
  my @x = split "\t", $_, 7;
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x);
}

If each line only contains seven, or even ten, elements, I won’t see much if any improvement. However, if each line contains dozens or hundreds of fields, my potential speed-up is huge.

There is even more I can do to milk performance out of my loop if I control the data format. If I move the columns that I need to modify to the front of each row, I don’t need to split into so many fields:

while (<>) {
  my @x = split "\t", $_, 3;
  $x[0] = lc $x[0];
  $x[1] = uc $x[1];
  print join("\t", @x);
}

Measure the improvement

Just to be sure that these changes are really making my code faster, I do a little benchmarking to get a feel for the relative performance differences:

use warnings;
use strict;
use Benchmark qw(timethese);

my @data;

for (0..10_000) {
  $data[$_] = join "\t", map { chr(65 + (int rand(52))) } (0..100);
}

timethese(500, {
  'standard' => sub {
    for (@data) {
      chomp;
      my @x = split "\t";
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x) . "\n";
    }
  },
  'no_chomp' => sub {
    for (@data) {
      my @x = split "\t";
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x);
    }
  },
  'smaller' => sub {
    for (@data) {
      my @x = split "\t", $_, 7;
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x);
    }
  },
  'smallest' => sub {
    for (@data) {
      my @x = split "\t", $_, 3;
      $x[0] = lc $x[0];
      $x[1] = uc $x[1];
      $_ = join("\t", @x);
    }
  },
});

In this benchmark I experimented on ten thousand records, each with a hundred fields. The benchmarks measure:

the initial, or “standard” case.
the case where I just removed a chomp, “no_chomp”.
the case where I limit split, “smaller”.
the case where I reorder the inbound data, “smallest”.

The [reordered] results tell me quite a bit:

Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard…
standard: 451 wallclock secs (449.66 usr + 0.34 sys = 450.00 CPU) @ 1.11/s (n=500)
no_chomp: 451 wallclock secs (446.18 usr + 0.41 sys = 446.59 CPU) @ 1.12/s (n=500)
smaller: 39 wallclock secs (39.15 usr + 0.03 sys = 39.18 CPU) @ 12.76/s (n=500)
smallest: 19 wallclock secs (18.98 usr + 0.01 sys = 18.99 CPU) @ 26.33/s (n=500)

Removing chomp had an almost unnoticeable effect. However, I recuded my processing tenfold by limiting split. I made my code even faster by reordering the inbound data.

Just to see what effect number of fields really has, I reduced the size of the data so that each record had ten fields. The results were obviously less impressive, though still visible:

Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard…
standard: 58 wallclock secs (57.50 usr + 0.05 sys = 57.55 CPU) @ 8.69/s (n=500)
no_chomp: 55 wallclock secs (55.13 usr + 0.04 sys = 55.17 CPU) @ 9.06/s (n=500)
smaller: 38 wallclock secs (37.67 usr + 0.03 sys = 37.70 CPU) @ 13.26/s (n=500)
smallest: 18 wallclock secs (18.46 usr + 0.01 sys = 18.47 CPU) @ 27.07/s (n=500)

So what does this tell me? Well, as far as a specific optimization goes, limiting split is probably a good idea if I’m trying to optimize my processing and really don’t need every field.

However, there is a larger moral to this story and that is: When performance is a concern, don’t make your code do work that it doesn’t have too. The most important is my choice of algorithms and related data structures for solving my problem. This can make or break my performance. After that, knowing specific optimizations such as limiting split fields, help me get just a little more performance out of my code.

Implicitly turn on strictures with Perl 5.12

Perl 5.12 can turn on strict for you automatically, stealing a feature from Modern::Perl that takes away one line of boilerplate in your Perl programs and modules. We talk about strict in Item 3: Enable strictures to promote better coding. Similar to what we show in Item 2: Enable new Perl features when you need them, to turn strictures on automatically, you have to use use with a version of Perl 5.11.0 or later: Continue reading “Implicitly turn on strictures with Perl 5.12”

Turn off Perl 5.12 deprecation warnings, if you dare!

Perl 5.12 deprecates several features, for various reasons. Some of the features were always stupid, some need to make way for future development, and some are just too ornery to maintain. All of these are listed in the perldelta5120 documentation. The new thing, however, is that Perl 5.12 will warn you about these even if you don’t have warnings turned on. Consider this script full of Perl whoppers: Continue reading “Turn off Perl 5.12 deprecation warnings, if you dare!”

Match Unicode characters by property value

A Unicode character has properties; it knows things about itself. Perl v5.10 introduced a way to match a character that has certain properties that v5.10 supports. In some cases you can match a particular property value. Now v5.12 allows you can match any Unicode property by its value. The newly-supported ones include Numeric_Value and Age, for example:

\p{Numeric_Value: 1}
\p{Nv=7}
\p{Age: 3.0}

Continue reading “Match Unicode characters by property value”