Perl v5.24 adds a line break boundary, \b{lb}, to go along with the new Unicode boundaries (\b{gcb}, \b{sb}, and \b{wb}) added in v5.22. This is part of Perl’s increasing conformance with the regular expression requirements in Unicode Technical Standard #18. The Unicode::LineBreak module implements the same thing, although you have to do a lot more work.
Consider this example that takes a long string that contains no vertical whitespace (despite how it may appear on a screen). You want to wrap it so that each line is 50 to 70 columns wide, but you don’t want to break in the middle of something:
    use v5.24;

    my $string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

    $string =~ s/(\X{50,70}\b{lb})/$1\n/g;

    say "-" x 73;
    say $string;
The result is a wrapped paragraph:
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
    eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim
    ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut
    aliquip ex ea commodo consequat. Duis aute irure dolor in
    reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
    pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
    culpa qui officia deserunt mollit anim id est laborum.
The line break boundary knows where not to break a line. It’s not as simple as breaking at whitespace because some whitespace is non-breaking. Some hyphens allow a line to wrap after them, and some sorts don’t. It’s a bit too complicated to deal with here, but you can read about it in Unicode Standard Annex #14: “Unicode Line Breaking Algorithm”.
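To see some of that for yourself, here is a small sketch that lists the character offsets where \b{lb} matches in a few short strings. The break_offsets helper and the sample strings are my own illustration, not part of Perl or the Unicode documents. Going by UAX #14, you should see an offset right after the ordinary space and after the hyphen-minus, but none next to the no-break space (U+00A0) or the non-breaking hyphen (U+2011):

```perl
use v5.24;

# Hypothetical helper for illustration: return the character offsets
# at which \b{lb} matches in $text.
sub break_offsets {
    my( $text ) = @_;

    my @offsets;
    # \b{lb} is zero-width, so each successful match only moves pos()
    while( $text =~ m/\b{lb}/g ) {
        push @offsets, pos $text;
    }

    return @offsets;
}

# The sample strings are assumptions chosen to contrast breaking and
# non-breaking characters
my @samples = (
    [ 'ordinary space'      => "a b"                 ],
    [ 'no-break space'      => "a\N{U+00A0}b"        ],
    [ 'hyphen-minus'        => "well-known"          ],
    [ 'non-breaking hyphen' => "well\N{U+2011}known" ],
);

foreach my $sample ( @samples ) {
    my( $label, $text ) = @$sample;
    say "$label: ", join ' ', break_offsets( $text );
}
```

This only pokes at the rules from the outside. The annex is still the place to find out why a particular offset does or doesn’t show up.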
You might also consider the Text::Autoformat module for a complete string you already have.
But what about text you get piece by piece? You can add it to a buffer, then check if there’s a good place to break a line once the buffer gets too big:
    use v5.24;

    my @words = split /\s/, "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

    my $buffer;
    foreach my $word ( @words ) {
        $buffer .= $word . ' ';
        # keep collecting until the buffer is big enough to hold a line
        next if length $buffer <= 40;
        if( $buffer =~ m/(\X{30,45}\b{lb})(.*)/ ) {
            say $1;        # the finished line
            $buffer = $2;  # whatever is left starts the next line
        }
    }

    # output whatever is still in the buffer
    say $buffer;
Again, you get a wrapped paragraph:
    Lorem ipsum dolor sit amet, consectetur
    adipiscing elit, sed do eiusmod tempor
    incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud
    exercitation ullamco laboris nisi ut aliquip
    ex ea commodo consequat. Duis aute irure
    dolor in reprehenderit in voluptate velit
    esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non
    proident, sunt in culpa qui officia deserunt
    mollit anim id est laborum.
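If you need the buffering in more than one place, you could hide it in a closure. This is only a sketch under my own assumptions: make_wrapper is a made-up name, you call the returned code reference with each piece of text, and you call it with no argument to flush what’s left. It also differs a little from the loop above: it anchors the match at the start of the buffer and only emits a line once the buffer is longer than the maximum width, so the greedy match never grabs the entire buffer at the end-of-string boundary.

```perl
use v5.24;

# Hypothetical helper: returns a code ref that collects pieces of text
# and hands back any lines that are ready.
sub make_wrapper {
    my( $min, $max ) = @_;
    my $buffer = '';

    return sub {
        my( $piece ) = @_;

        # No argument means "flush": return whatever is left over
        unless( defined $piece ) {
            my $rest = $buffer;
            $buffer  = '';
            return length( $rest ) ? $rest : ();
        }

        $buffer .= $piece;

        my @lines;
        # Peel off lines while the buffer holds more than one full line.
        # If there is no break opportunity in range (say, one very long
        # word), the match fails and the text stays in the buffer.
        while( length( $buffer ) > $max
               and $buffer =~ m/\A(\X{$min,$max}\b{lb})(.*)\z/s ) {
            ( my $line, $buffer ) = ( $1, $2 );
            push @lines, $line;
        }

        return @lines;
    };
}

my $text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.";

my $add = make_wrapper( 30, 45 );
foreach my $word ( split /\s+/, $text ) {
    say for $add->( "$word " );
}
say for $add->();
```

As in the loop above, each line keeps its trailing space because the break opportunity comes after the space, not before it.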
Can you say more about what the new `\b{lb}` actually matches?
You’ll have to read the Unicode document I mentioned. It’s not simple and I don’t think I could do better than the original source.