Use formats to create paginated, plaintext reports

Perl’s format feature allows you to easily create line-oriented text reports with pagination, and if that’s what you want, Perl is for you. This item is just an introduction. You can find the full details in perlform, and in future items. Continue reading “Use formats to create paginated, plaintext reports”
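As a taste before the full item, here’s a minimal sketch of my own (the column layout and names are invented for illustration): a header format plus a record format for STDOUT, where each write call emits one formatted record and perl handles the page headers:

our ( $name, $score );

format STDOUT_TOP =
Name                 Score
--------------------------
.

format STDOUT =
@<<<<<<<<<<<<<<<<<<  @>>>>
$name,               $score
.

for ( [ Alice => 90 ], [ Bob => 85 ] ) {
  ( $name, $score ) = @$_;
  write;   # one formatted record per call; headers repaginate automatically
}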

Temporarily remove hash keys or array elements with `delete local`

Perl 5.12 adds a feature that lets you locally delete a hash key or array element (refresh your memory of local with Item 43: Know the difference between my and local). This new feature allows you to temporarily prune a hash or an array: Continue reading “Temporarily remove hash keys or array elements with `delete local`”
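Here’s a minimal sketch of the idea (my own example, not the one from the full item):

use v5.12;

my %h = ( admin => 1, user => 2 );

{
  delete local $h{admin};   # the key is gone only inside this block
  say exists $h{admin} ? 'still there' : 'gone';   # gone
}

say $h{admin};   # 1 -- the key comes back when the block exits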

Don’t make Perl do more work than it needs to

My choice of algorithms and data organization can lead to orders-of-magnitude performance differences between functionally equivalent implementations of my programs. Choosing the right way to do something can save me a huge amount of processing.

I wrote some code to loop over lines in a file and modify a couple of elements in each line. The code seems innocent enough; the actual field modifications were a little more complex, but I replaced them with lc and uc to keep this example simple:

while (<>) {
  chomp;
  my @x = split "\t";
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x), "\n";
}

I have a loop that splits each line of a file on the tab character, then modifies the fourth and sixth fields. Look closely at the loop: I’m making Perl do a lot more work than it needs to.

First, I call chomp on each line, but then I add the newline right back. There’s no reason to do that. If I leave the newline in place, I get the same result:

while (<>) {
  my @x = split "\t";
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x);
}

Chomping when I don’t need to is a minor issue, though, and it’s unlikely to destroy the performance of any program.

Looking a little closer, I see a much bigger inefficiency if I know the format of the data. Assume each line contains many fields, where “many” is some number greater than six in this example. Since I’m only acting on fields four and six, why split the entire line just to put it back together?

With a couple more arguments, I can tell split to limit the number of fields it creates. I can limit my results to seven items, since I don’t care to modify anything beyond that:

while (<>) {
  my @x = split "\t", $_, 7;
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x);
}

If each line only contains seven, or even ten, fields, I won’t see much improvement, if any. However, if each line contains dozens or hundreds of fields, my potential speed-up is huge.

There is even more I can do to milk performance out of my loop if I control the data format. If I move the columns that I need to modify to the front of each row, I don’t need to split into so many fields:

while (<>) {
  my @x = split "\t", $_, 3;
  $x[0] = lc $x[0];
  $x[1] = uc $x[1];
  print join("\t", @x);
}

Measure the improvement

Just to be sure that these changes are really making my code faster, I do a little benchmarking to get a feel for the relative performance differences:

use warnings;
use strict;
use Benchmark qw(timethese);

my @data;

# build ten thousand records, each with a hundred pseudo-random one-character fields
for (0 .. 9_999) {
  $data[$_] = join "\t", map { chr( 65 + int rand(52) ) } (1 .. 100);
}

timethese(500, {
  'standard' => sub {
    for (@data) {
      chomp;
      my @x = split "\t";
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x) . "\n";
    }
  },
  'no_chomp' => sub {
    for (@data) {
      my @x = split "\t";
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x);
    }
  },
  'smaller' => sub {
    for (@data) {
      my @x = split "\t", $_, 7;
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x);
    }
  },
  'smallest' => sub {
    for (@data) {
      my @x = split "\t", $_, 3;
      $x[0] = lc $x[0];
      $x[1] = uc $x[1];
      $_ = join("\t", @x);
    }
  },
});

In this benchmark I experimented on ten thousand records, each with a hundred fields. The benchmarks measure:

  • the initial, or “standard” case.
  • the case where I just removed a chomp, “no_chomp”.
  • the case where I limit split, “smaller”.
  • the case where I reorder the inbound data, “smallest”.

The results, reordered here to match that list, tell me quite a bit:

Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard...
standard: 451 wallclock secs (449.66 usr + 0.34 sys = 450.00 CPU) @ 1.11/s (n=500)
no_chomp: 451 wallclock secs (446.18 usr + 0.41 sys = 446.59 CPU) @ 1.12/s (n=500)
smaller: 39 wallclock secs (39.15 usr + 0.03 sys = 39.18 CPU) @ 12.76/s (n=500)
smallest: 19 wallclock secs (18.98 usr + 0.01 sys = 18.99 CPU) @ 26.33/s (n=500)

Removing chomp had an almost unnoticeable effect. However, I reduced my processing time roughly tenfold by limiting split. I made my code faster still by reordering the inbound data.

Just to see what effect the number of fields really has, I reduced the size of the data so that each record had ten fields. The results were less impressive, though the differences are still visible:

Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard...
standard: 58 wallclock secs (57.50 usr + 0.05 sys = 57.55 CPU) @ 8.69/s (n=500)
no_chomp: 55 wallclock secs (55.13 usr + 0.04 sys = 55.17 CPU) @ 9.06/s (n=500)
smaller: 38 wallclock secs (37.67 usr + 0.03 sys = 37.70 CPU) @ 13.26/s (n=500)
smallest: 18 wallclock secs (18.46 usr + 0.01 sys = 18.47 CPU) @ 27.07/s (n=500)

So what does this tell me? Well, as far as a specific optimization goes, limiting split is probably a good idea if I’m trying to optimize my processing and really don’t need every field.

However, there is a larger moral to this story: when performance is a concern, don’t make your code do work that it doesn’t have to. The most important factor is my choice of algorithms and related data structures for solving my problem; that choice can make or break my performance. After that, knowing specific optimizations, such as limiting the fields split creates, helps me get just a little more performance out of my code.

Implicitly turn on strictures with Perl 5.12

Perl 5.12 can turn on strict for you automatically, stealing a feature from Modern::Perl that takes away one line of boilerplate in your Perl programs and modules. We talk about strict in Item 3: Enable strictures to promote better coding. Similar to what we show in Item 2: Enable new Perl features when you need them, to turn strictures on automatically, you have to use use with a version of Perl 5.11.0 or later: Continue reading “Implicitly turn on strictures with Perl 5.12”
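The full item has the details, but a minimal sketch of the effect (my own example) looks like this:

use v5.11.0;          # strictures are now on implicitly

our $declared = 1;    # fine
$undeclared   = 2;    # compile-time error under strict 'vars'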

Turn off Perl 5.12 deprecation warnings, if you dare!

Perl 5.12 deprecates several features, for various reasons. Some of the features were always stupid, some need to make way for future development, and some are just too ornery to maintain. All of these are listed in the perl5120delta documentation. The new thing, however, is that Perl 5.12 will warn you about these even if you don’t have warnings turned on. Consider this script full of Perl whoppers: Continue reading “Turn off Perl 5.12 deprecation warnings, if you dare!”
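As a one-line taste of my own (not the script from the full item): assigning to $[ is one of the deprecated features, and a 5.12 perl complains even though this script never turns warnings on:

#!/usr/bin/perl
# note: no 'use warnings' anywhere in this script

$[ = 1;   # warns: "Use of assignment to $[ is deprecated"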

Locate bugs with source control bisection

As you work in Perl you store each step in source control. When you finish a little bit of work, you commit your work. Ideally, every commit deals with one thing so you’re only introducing one logical change in each revision. Somewhere along the process, you might discover that something is not working correctly. You think that it used to work but you’re not sure where things went pear-shaped, perhaps because the bug seemingly deals with something that you weren’t working on. Continue reading “Locate bugs with source control bisection”

Keep your programmatic configuration DRY

A common mantra among programmers today is to keep your code DRY. This little acronym stands for “Don’t Repeat Yourself” and serves as a reminder that when you see a repetitive pattern in your code or are tempted to copy/paste some statements, you should think twice and consider extracting the common logic into a chunk of code that can be reused.

For many programmers, this practice begins to break down when “configuration” code is involved. When I talk about configuration code here, I’m not talking about the XML, YAML, INI, etc. bits of your project. I’m talking about the Perl code in your program that simply serves as data to feed some active portion of your code.
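Here’s a hedged sketch of the kind of thing I mean (the field names are invented): instead of writing a nearly identical accessor sub for every configuration field, one table drives them all:

use strict;
use warnings;

# one table of fields and defaults drives all of the accessors
my %defaults = (
  host => 'localhost',
  port => 8080,
  user => 'guest',
);

my %config = ( port => 9090 );   # pretend this came from a config file

{
  no strict 'refs';   # symbolic references, just to install the subs
  for my $field ( keys %defaults ) {
    *{"main::$field"} = sub {
      exists $config{$field} ? $config{$field} : $defaults{$field};
    };
  }
}

print port(), "\n";   # 9090, from %config
print host(), "\n";   # localhost, from the defaults table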

Continue reading “Keep your programmatic configuration DRY”

Watch out for side effects with `use VERSION`

Item 83: Limit your distributions to the right platforms mentioned that use might invoke side effects. We didn’t get into the details in that Item, though. As of Perl 5.10, use can import some features that you might not want.

Merely specifying a Perl version prior to 5.10 does nothing other than check the version you specify against the interpreter version. If the interpreter version is equal to or greater than the version you specify, your program continues. If not, it dies:

use 5.008;  # needs perl5.008000 or later

This works with require too:

require 5.008;  # needs perl5.008000 or later

However, use is a compile-time function and require is a run-time function. By the time you hit that require, perl has already compiled your program up to that point or died trying as it ran into unknown features. Code may have already run, despite using an inappropriate version of perl. You want to impose your version restriction as soon as possible, so use is more appropriate since it happens earlier.
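A contrived sketch of my own shows the difference. On a too-old perl, the print below still runs before the version check gets its chance to die:

print "already did some work\n";   # runs even on an old perl

require v5.12;                     # the version check happens only now

With use v5.12; on the first line instead, compilation aborts before any of that output appears.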

You might think that you can fix this with a BEGIN block, which compiles and immediately runs its code so that you get the ordering right. That gives you the version check at compile time even though require is a run-time statement:

BEGIN { require v5.10; }

In early versions of Perl 5.10, this still imported the new features, but that bug has been fixed. See “BEGIN {require 5.011} imports features”.

You should use at least v5.10.1 because it fixes various issues with smart match. That version doesn’t automatically import the new features if you use require. Either of these specifies that version:

use v5.10.1;
BEGIN { require v5.10.1; }

use 5.010

With Perl 5.10, you get three side effects from use v5.10. Starting with that version, use-ing the version also pulls in the new features for that version. Ostensibly, that keeps programs designed for earlier versions from breaking as newer perls add keywords, but it also tries to enforce the current philosophy of good programming on you.

Perl 5.10 introduces say, state, and given-when, which you import implicitly when you say use v5.10.1:

use v5.10.1;
say 'I can use Switch!';  # imported say()

given ($ARGV[0]) {        # imported given()
	when( defined ) { some_sub() }
	};

sub some_sub {
	state $n = 0;         # imported state(); $n keeps its value across calls
	say $n++, ": got a defined argument";
	}

If you want to insist on v5.10 without its new features, perhaps because your code uses some of the same keywords already, you can unimport the side effects immediately with the new feature pragma:

use v5.10.1;     # implicit imports
no feature;    # take it right back again

# your own version of say()
sub say {
	# something that you want to do
	}

If you only want some of the new features, you can unimport the ones that you don’t want:

use v5.10.1;
no feature qw(say);   # leaves state() and given()

sub say {
	# something that you want to do
	}

use 5.012

Perl 5.12 includes two more side effects for use VERSION. The unicode_strings feature treats all strings outside of bytes and locale scopes as Unicode strings. Additionally, use v5.12 automatically turns on strict:

use v5.12;
# now strictures are on

$foo = 1;   # compile-time error!
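The unicode_strings side effect is easier to see with a character outside ASCII. Here’s a small sketch of my own, assuming a perl recent enough to honor the feature fully:

use v5.12;   # unicode_strings (and strict, and say) are now on

my $e_acute = "\xE9";         # e-acute as a native 8-bit string
say uc($e_acute) eq "\xC9"    # with unicode_strings, uc() knows its case mapping
  ? 'Unicode semantics'
  : 'byte semantics';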

If, for some odd and dangerous reason, you don’t want strict on by default, you can turn it off yourself, even though unimporting it doesn’t warn you that you’ve left the paved roads, violated your rental car contract, and have a chainsaw murderer waiting for you:

use v5.12;
no feature;
no strict;

my $foo = 1;   

$fo0++;  # sure, go ahead and make that error

A workaround to restrict perl versions

You can restrict the version more tightly by checking the value of the $] variable, just like the various examples you saw in Item 83:

BEGIN {
	die "Unsupported version" 
		unless $] >= 5.010 and $] < 5.011
	}

This has the added benefit of restricting the upper acceptable perl version. It works on older Perls too.

Things to remember

  • use VERSION imports new features since Perl 5.9.5.
  • BEGIN { require VERSION } imported new features in early versions of Perl 5.10; the bug is fixed in later versions of v5.10 and in v5.12.
  • Use no feature or no strict to unimport unwanted features.
  • Restrict the perl version with $].

Respect the global state of the flip flop operator

Perl’s flip-flop operator, .., (otherwise known as the range operator in scalar context) is a simple way to choose a window on some data. It returns false until its lefthand side is true. Once the lefthand side is true, the flip-flop operator returns true until its righthand side is true. Once the righthand side is true, the flip flop operator returns false. That is, the lefthand side turns it on and the righthand side turns it off. Continue reading “Respect the global state of the flip flop operator”