Byte-order modifiers are one of the Perl 5.10 features farther along in perl5100delta, after the really big features. To any pack format, you can append a < or a > to specify that the format is little-endian or big-endian, respectively. This allows you to handle endianness in the formats that don’t already have specific versions for each architecture, as well as apply endianness to groups.
Before you think about the < and > modifiers, consider the formats that already specify the endianness. The n and N formats specify an unsigned short or long in “network order”, which is big-endian. The v and V formats specify the same things, but in “VAX order”, which is little-endian.
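Going the other direction, here is a minimal sketch of what these four formats do when you pack the same number. The hex_bytes helper is a made-up name for illustration, not part of Perl:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: show a byte string as space-separated hex pairs
sub hex_bytes {
    my( $bytes ) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# n and N put the most significant byte first, v and V put it last
foreach my $format ( qw(n v N V) ) {
    say sprintf '%s packs 0x0102 as %s',
        $format, hex_bytes( pack $format, 0x0102 );
}
```

Since these formats fix the byte order themselves, the bytes should be the same on any machine: 01 02 for n, 02 01 for v, 00 00 01 02 for N, and 02 01 00 00 for V.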
Here’s a test program that starts with some bytes, which you specify in a string using the hex representation of each character (just like pack would). Once you have the string, you use both N and V to unpack it and find out which one matches your system. The L format always uses the local architecture:

    use 5.010;

    my $string = "\xAA\xBB\xCC\xDD";

    foreach my $format ( qw(N V) ) {
        my $number = unpack $format, $string;
        say sprintf "%s is 0x%X", $format, $number;

        # L reads the same bytes in the native order
        say "Your native format is $format"
            if $number == unpack 'L', $string;
    }

The output shows that the little-endian order switches the bytes around, and that this program ran on a little-endian machine (in this case, a MacBook Air, which uses Intel processors):

    N is 0xAABBCCDD
    V is 0xDDCCBBAA
    Your native format is V

For the N and V formats, you need to know which order you have, either by knowing the architecture or getting the producer of the data to tell you the format. For instance, UTF-16 text files can start with a byte order mark, 0xFEFF, which is a short integer (two bytes). If you read that short in the same byte order the file was written in, you get 0xFEFF. If you read it in the opposite order, you get 0xFFFE because the bytes switch around as you saw before.
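As a rough sketch of that byte order mark check, you could always read the first two bytes with n and see which value comes back. The bom_byte_order name and its return strings are invented for this example:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: guess the byte order of UTF-16 data from its
# first two bytes. Returns undef if there is no byte order mark.
sub bom_byte_order {
    my( $bytes ) = @_;
    return unless length( $bytes ) >= 2;

    # n always reads a big-endian short, whatever the local machine is
    my $bom = unpack 'n', $bytes;

    return 'big-endian'    if $bom == 0xFEFF;
    return 'little-endian' if $bom == 0xFFFE;
    return;
}

say bom_byte_order( "\xFE\xFF\x00\x41" ) // 'no BOM';  # UTF-16BE "A"
say bom_byte_order( "\xFF\xFE\x41\x00" ) // 'no BOM';  # UTF-16LE "A"
say bom_byte_order( "AB" )               // 'no BOM';
```

Using a fixed-order format here means the answer describes the data rather than the machine that happens to be reading it.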
The other pack formats use the native byte order, so you haven’t had a way to specify the order in which to interpret the bytes. These formats have always used the native architecture (meaning they get it wrong when the data come from the other architecture):
Format | Description |
---|---|
s, S | signed and unsigned shorts (two bytes) |
i, I | signed and unsigned integers (at least four bytes) |
l, L | signed and unsigned longs (exactly four bytes) |
q, Q | signed and unsigned quads (if you have a 64-bit perl) |
j, J | signed and unsigned Perl internal integers |
f | single-precision floating-point value |
d | double-precision floating-point value |
F | Perl internal floating-point value |
D | long-double-precision floating-point value |
p, P | pointers to a null-terminated string and a structure |
Perl 5.10 lets you specify the architecture these formats should use. You can use big-endian values even if you are using a little-endian machine. Suppose you have π encoded as a single-precision floating-point value in big-endian order even though you have a little-endian machine. The native format reads the bytes in the wrong order, so this program tries f, f<, and f> on the same string:

    use 5.010;

    my $pi_string = "\x40\x49\x0F\xDA"; # 3.14159250259399 in big-endian

    foreach my $format ( qw(f f< f>) ) {
        my $number = unpack $format, $pi_string;
        say sprintf "%s is %f", $format, $number;
    }

The f and f< give the wrong, non-π results. The f assumes the native, little-endian format while the f< makes it explicit. The f> specifies big-endian format despite the native architecture, and it gets the right value (with normal floating-point rounding error):

    f is -10082865224089600.000000
    f< is -10082865224089600.000000
    f> is 3.141593

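The same modifier works when you pack. This is a small sketch, assuming a perl whose floats are ordinary IEEE 754 single-precision values; hex_bytes is again a made-up helper name:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: show a byte string as space-separated hex pairs
sub hex_bytes {
    my( $bytes ) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# Pack π as a big-endian single-precision float, whatever the machine
my $bytes = pack 'f>', 3.14159265358979;
say hex_bytes( $bytes );

# Reading it back with the same modifier recovers the value
my $round_trip = unpack 'f>', $bytes;
say sprintf '%.6f', $round_trip;
```

Under that assumption the bytes come out as 40 49 0F DB, one step above the 40 49 0F DA string in the program above, because DB is the single-precision value nearest to π and DA is its lower neighbor.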
You can also apply these modifiers to groups so that they affect all of the modifiable formats in that group. This example tries combinations of unsigned shorts in either byte order:

    use 5.010;

    my $string = "\xAA\xBB\xCC\xDD";

    foreach my $format ( qw| SS S<S> S>S< (SS)> (SS)< | ) {
        my( $first, $second ) = unpack $format, $string;
        say sprintf "%5s is 0x%X 0x%X", $format, $first, $second;
    }

The output shows you how the S format changes based on which architecture you tell unpack to use:

       SS is 0xBBAA 0xDDCC
     S<S> is 0xBBAA 0xCCDD
     S>S< is 0xAABB 0xDDCC
    (SS)> is 0xAABB 0xCCDD
    (SS)< is 0xBBAA 0xDDCC

You still have to know which architecture your data are in, but at least you can tell Perl which format you want.
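One place the modifiers help is with signed values, since n, N, v, and V are all unsigned. Here is a minimal sketch that packs a signed short and a signed long as a big-endian group and reads them back both ways; the values and the hex_bytes helper are only for illustration:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: show a byte string as space-separated hex pairs
sub hex_bytes {
    my( $bytes ) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The > on the group applies to both the s and the l inside it
my $bytes = pack '(s l)>', -2, -3;
say hex_bytes( $bytes );

# The same template recovers the signed values
my( $short, $long ) = unpack '(s l)>', $bytes;
say "signed: $short $long";

# n and N read the same bytes, but only as unsigned numbers
my( $ushort, $ulong ) = unpack 'n N', $bytes;
say "unsigned: $ushort $ulong";
```

The bytes are FF FE FF FF FF FD on either kind of machine. The signed template gives back -2 and -3, while n and N report 65534 and 4294967293 for the same bytes.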
Things to remember
- Most pack formats rely on the native architecture
- Perl 5.10 introduces the < and > modifiers so you can specify the architecture
- The < specifies little-endian because the little side touches the specifier
- The > specifies big-endian because the big side touches the specifier