My choice of algorithms and data organization can lead to orders of magnitude of performance differences between functionally equivalent implementations of my programs. Choosing the right way to do something can save me orders of magnitude in processing.
I wrote some code to loop over lines in a file and modify a couple of elements in each line. My code seems very innocent, but the actual field modifications were a little more complex; I replaced them with lc and uc to make things simple for this example:
while (<>) {
chomp;
my @x = split "\t";
$x[3] = lc $x[3];
$x[5] = uc $x[5];
print join("\t", @x), "\n";
}
I have a loop that splits each line of a file on the tab character then modifies the fourth and sixth field.
Look closely at the loop: I’m making Perl do a lot more work than it needs to.
First, I cal chomp on each line, but I’m adding the newline right back. There is no reason to do that. If I leave it there I get the same result:
while (<>) {
my @x = split "\t";
$x[3] = lc $x[3];
$x[5] = uc $x[5];
print join("\t", @x);
}
chomping when I don’t need to is a minor issue, but it is not likely going to destroy the performance of any program.
Looking a little closer, I see a much bigger inefficiency if I know the format of the data. Assume each line of data contains many fields; many being some number greater than six in this specific example. Since I’m are only acting on fields four and six, why do I split the entire line just to put it back together?
With a couple of more arguments, I can tell split to limit the number of fields it creates. I can limit my results to seven items since I don’t care to modify any beyond that:
while (<>) {
my @x = split "\t", $_, 7;
$x[3] = lc $x[3];
$x[5] = uc $x[5];
print join("\t", @x);
}
If each line only contains seven, or even ten, elements, I won’t see much if any improvement. However, if each line contains dozens or hundreds of fields, my potential speed-up is huge.
There is even more I can do to milk performance out of my loop if I control the data format. If I move the columns that I need to modify to the front of each row, I don’t need to split into so many fields:
while (<>) {
my @x = split "\t", $_, 3;
$x[0] = lc $x[0];
$x[1] = uc $x[1];
print join("\t", @x);
}
Measure the improvement
Just to be sure that these changes are really making my code faster, I do a little benchmarking to get a feel for the relative performance differences:
use warnings;
use strict;
use Benchmark qw(timethese);
my @data;
for (0..10_000) {
$data[$_] = join "\t", map { chr(65 + (int rand(52))) } (0..100);
}
timethese(500, {
'standard' => sub {
for (@data) {
chomp;
my @x = split "\t";
$x[3] = lc $x[3];
$x[5] = uc $x[5];
$_ = join("\t", @x) . "\n";
}
},
'no_chomp' => sub {
for (@data) {
my @x = split "\t";
$x[3] = lc $x[3];
$x[5] = uc $x[5];
$_ = join("\t", @x);
}
},
'smaller' => sub {
for (@data) {
my @x = split "\t", $_, 7;
$x[3] = lc $x[3];
$x[5] = uc $x[5];
$_ = join("\t", @x);
}
},
'smallest' => sub {
for (@data) {
my @x = split "\t", $_, 3;
$x[0] = lc $x[0];
$x[1] = uc $x[1];
$_ = join("\t", @x);
}
},
});
In this benchmark I experimented on ten thousand records, each with a hundred fields. The benchmarks measure:
- the initial, or “standard” case.
- the case where I just removed a chomp, “no_chomp”.
- the case where I limit split, “smaller”.
- the case where I reorder the inbound data, “smallest”.
The [reordered] results tell me quite a bit:
Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard…
standard: 451 wallclock secs (449.66 usr + 0.34 sys = 450.00 CPU) @ 1.11/s (n=500)
no_chomp: 451 wallclock secs (446.18 usr + 0.41 sys = 446.59 CPU) @ 1.12/s (n=500)
smaller: 39 wallclock secs (39.15 usr + 0.03 sys = 39.18 CPU) @ 12.76/s (n=500)
smallest: 19 wallclock secs (18.98 usr + 0.01 sys = 18.99 CPU) @ 26.33/s (n=500)
Removing chomp had an almost unnoticeable effect. However, I recuded my processing tenfold by limiting split. I made my code even faster by reordering the inbound data.
Just to see what effect number of fields really has, I reduced the size of the data so that each record had ten fields. The results were obviously less impressive, though still visible:
Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard…
standard: 58 wallclock secs (57.50 usr + 0.05 sys = 57.55 CPU) @ 8.69/s (n=500)
no_chomp: 55 wallclock secs (55.13 usr + 0.04 sys = 55.17 CPU) @ 9.06/s (n=500)
smaller: 38 wallclock secs (37.67 usr + 0.03 sys = 37.70 CPU) @ 13.26/s (n=500)
smallest: 18 wallclock secs (18.46 usr + 0.01 sys = 18.47 CPU) @ 27.07/s (n=500)
So what does this tell me? Well, as far as a specific optimization goes, limiting split is probably a good idea if I’m trying to optimize my processing and really don’t need every field.
However, there is a larger moral to this story and that is: When performance is a concern, don’t make your code do work that it doesn’t have too. The most important is my choice of algorithms and related data structures for solving my problem. This can make or break my performance. After that, knowing specific optimizations such as limiting split fields, help me get just a little more performance out of my code.