In the Effective Perl class I gave at Frozen Perl last week, I got a question I didn’t have a quick answer to. What happens to the strings when Encode's decode function only partially decodes the string?
The default behavior for decode always decodes the entire string, although it uses the substitution character (U+FFFD, which may look like ? on the screen) anywhere that it finds an error in the encoding.
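Here's a minimal sketch of that default behavior, using a short malformed string. The code_points helper is a name I made up for this illustration; it isn't part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: show each character's code point in hex
sub code_points {
    my ($string) = @_;
    return join ':', map { sprintf '%X', ord } split //, $string;
}

my $octets  = "\x61\xCC\x61";              # "a", a stray lead octet, "a"
my $decoded = decode( 'utf8', $octets );   # no third argument: default handling

print code_points($decoded), "\n";
print "Found a substitution character\n" if $decoded =~ /\x{FFFD}/;
```

Checking the result for U+FFFD is a cheap way to tell whether the default handling had to paper over a problem, although it can't distinguish that from a substitution character that was already in the input.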
You can change how decode handles problems by supplying a third argument to it, using one of the constants FB_DEFAULT, FB_CROAK, FB_WARN, or FB_QUIET. FB_DEFAULT uses the substitution character and FB_CROAK just dies. It’s the other two that are interesting. They stop decoding, either with a warning or without one. Try it yourself:
use 5.010;
use strict;
use warnings;
use Encode qw(decode :fallbacks);
binmode STDOUT, ":utf8";

foreach my $fallback ( qw( FB_DEFAULT FB_CROAK FB_WARN FB_QUIET ) ) {
    # Turn the constant's name into its numeric value
    my $fallback_value = do { no strict 'refs'; &{"$fallback"} };
    my $octets  = do { use bytes; "\x41\x42\x43\x61\xCC\x61\x41\x42\x43" };
    my $decoded = eval { decode( 'utf8', $octets, $fallback_value ) };
    say "$fallback: ", show_chars( $decoded ), " [$octets]";
}

sub show_chars {
    use bytes;
    defined $_[0] ?
        join( ':', map { sprintf "%X", ord } split //, $_[0] )
        :
        'undefined';
}
The string you’re using is "\x41\x42\x43\x61\xCC\x61\x41\x42\x43". It’s “ABCa.aABC” where that "\xCC" in the middle is an error. It’s the start of a two-octet sequence for a combining character, but it doesn’t have a valid continuation octet following it. When you print it, it looks a bit odd (ABCaÌaABC) because Perl is treating it as bytes since you used use bytes; in the scope where you created it.
The output shows the fallback type, the characters (in hex separated by colons), and, in the square brackets, the value of $octets after the operation:
FB_DEFAULT: 41:42:43:61:FFFD:61:41:42:43 [ABCaÌaABC]
FB_CROAK: undefined [ABCaÌaABC]
FB_WARN: 41:42:43:61 [ÌaABC]
FB_QUIET: 41:42:43:61 [ÌaABC]
utf8 "\xCC" does not map to Unicode at ...
In the FB_DEFAULT case, the \xCC turned into the substitution character, \x{FFFD}. Notice that the split // worked on characters, so the substitution character shows up as a single four-digit code point in the hex representation rather than as its separate octets.
In the FB_CROAK case, the decode dies, the return value is undef, and $octets stays the same. decode doesn’t mess with the argument at all.
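If you'd rather handle the failure than let it kill your program, you can wrap the call in an eval and look at $@, just as the test script does. This is only a sketch, and try_decode is a hypothetical helper name, not something Encode provides.

```perl
use strict;
use warnings;
use Encode qw(decode :fallbacks);

# Hypothetical helper: returns ( $decoded, $error ); only one of them is defined
sub try_decode {
    my ($octets) = @_;    # a copy, so the caller's data is safe either way
    my $decoded = eval { decode( 'utf8', $octets, FB_CROAK ) };
    return defined $decoded ? ( $decoded, undef ) : ( undef, $@ );
}

my ( $text, $error ) = try_decode("\x61\xCC\x61");
print defined $text ? "Decoded: $text\n" : "Could not decode: $error";
```

The error string is the same "does not map to Unicode" message that FB_WARN emits as a warning.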
Both FB_WARN and FB_QUIET do the same thing, although FB_WARN whines about it. They each tell decode to handle as much of the string as it can. When it finds an error, it returns what it had so far (represented by 41:42:43:61, which is ABCa). However, it also removes that part from the input string, leaving only the part of the string from the error onward. This gives you a chance to examine the string where decode left off so you can decide what to do on your own. You might take off the offending bits and start the processing again.
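One way to "take off the offending bits and start again" is a loop like the one below. It's a sketch under my own assumptions: decode_skipping_errors is a made-up name, and dropping exactly one octet at each error is just one possible policy.

```perl
use strict;
use warnings;
use Encode qw(decode :fallbacks);

# Hypothetical helper: decode what it can, skipping one octet at each error.
# Returns the decoded text and a reference to the list of skipped octets.
sub decode_skipping_errors {
    my ($octets) = @_;    # a copy, because decode consumes its argument
    my $text = '';
    my @skipped;

    while ( length $octets ) {
        # decode returns the good prefix and strips it from $octets
        $text .= decode( 'utf8', $octets, FB_QUIET );
        last unless length $octets;

        # $octets now starts at the offending octet, so drop it and retry
        push @skipped, ord( substr( $octets, 0, 1, '' ) );
    }

    return ( $text, \@skipped );
}

my ( $text, $skipped ) =
    decode_skipping_errors("\x41\x42\x43\x61\xCC\x61\x41\x42\x43");
print "Text: $text\n";
printf "Skipped: %s\n", join ' ', map { sprintf '%02X', $_ } @$skipped;
```

Because the loop removes at least one octet every time decode stops early, it always makes progress. You could record the position of each bad octet or substitute your own marker instead of silently dropping it.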
It’s documented that decode changes its input, but not right next to the main documentation for that function. You have to read the “Handling Malformed Data” section later in the Encode docs.
You might notice the problem if you try to decode a string literal:
use Encode qw(decode :fallbacks);
my $decoded = decode( 'utf8', "\x61\xCC\x61", FB_WARN );
You get the error about modifying a read-only value:
Modification of a read-only value attempted ...
If you don’t want decode to mess with your argument, you can use a bitmask to adjust the fallback value. decode looks for the LEAVE_SRC bit to be set (and it only matters for FB_WARN and FB_QUIET), so just OR it in:
use Encode qw(decode :fallbacks LEAVE_SRC);
my $decoded = decode( 'utf8', "\x61\xCC\x61", FB_WARN | LEAVE_SRC );
If you want to keep the original octet sequence, save a copy before you pass it to decode.
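Here are the two approaches side by side, as a sketch. The variable names are mine, and I picked FB_QUIET only to keep the example free of warnings.

```perl
use strict;
use warnings;
use Encode qw(decode :fallbacks LEAVE_SRC);

my $original = "\x61\xCC\x61";

# Option 1: give decode a copy and let it consume that
my $working = $original;
my $prefix  = decode( 'utf8', $working, FB_QUIET );
# $working now holds only the undecoded tail; $original is untouched

# Option 2: ask decode to leave the source alone
my $source      = $original;
my $same_prefix = decode( 'utf8', $source, FB_QUIET | LEAVE_SRC );
# $source still holds all three octets

printf "Octets left in the copy: %d\n",   length $working;
printf "Octets left with LEAVE_SRC: %d\n", length $source;
```

The copy gives you both the original and the point where decode stopped. LEAVE_SRC gives you only the decoded prefix, so you'd have to work out where the error was yourself.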