Parsers vs Unicode

Keeping things simple

Boost.Parser is a new library currently under review for inclusion in Boost. In its introduction, the documentation touts Unicode awareness as one of its features.

At think-cell, we have standardized on Boost.Spirit for years for all custom parsing needs; it is similar in spirit (no pun intended) to Boost.Parser. Because maintenance of Boost.Spirit had become a bit slow, we recently forked it into our public library. Most of the grammars we use it for are small, but some are larger, such as a sheet of Excel formulas referencing each other.

Of course, our input is almost exclusively Unicode, either UTF-8 or UTF-16. Matching Unicode is complex: comparison by code point is usually not the right thing. Instead, we must normalize, and even then there are various choices of what to accept as equal. Case-insensitive matching is more complex still, slow, and even language-dependent.
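To make this concrete, consider the character "é": it can be encoded as the single code point U+00E9 (NFC) or as "e" followed by the combining acute accent U+0301 (NFD). The two forms are canonically equivalent, yet no naive comparison treats them as equal. A self-contained illustration (not code from any of the libraries mentioned):

    #include <cassert>
    #include <cstring>

    int main() {
        // "é" in NFC: the single code point U+00E9, UTF-8-encoded as C3 A9.
        char const nfc[] = "\xC3\xA9";
        // "é" in NFD: "e" (U+0065) followed by combining acute U+0301 (CC 81).
        char const nfd[] = "e\xCC\x81";

        // The two strings render identically and are canonically equivalent,
        // yet they differ both byte-for-byte and code-point-for-code-point.
        assert(std::strcmp(nfc, nfd) != 0);
    }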

Input is often not even guaranteed to be valid Unicode. For example, file names on Windows are sequences of 16-bit units that may contain unmatched surrogates; the same goes for input from Win32 edit boxes and file content.
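For example, the following wide string contains a lone high surrogate and is therefore not valid UTF-16, yet on Windows it is a perfectly acceptable file name (the name itself is hypothetical, for illustration):

    #include <cassert>

    int main() {
        // 0xD800 is a lone high surrogate: not valid UTF-16 on its own.
        // On Windows, wchar_t is 16 bits and file names are arbitrary
        // sequences of 16-bit units, so this is a legal file name.
        wchar_t const name[] = L"report\xD800.txt";

        // A parser that insists on decoding its input as Unicode up front
        // would have to reject or mangle such a name.
        assert(name[6] == 0xD800);
    }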

We realized that for almost all of our grammars, none of this complexity matters. The reserved symbols of most grammars (JSON, XML, C++, URLs, etc.) are pure ASCII, and so are the semantically relevant strings ("EXCEL.EXE"). ASCII can be matched correctly and quickly on a per-code-unit basis, and case-insensitive matching for ASCII is simple and fast. User-defined strings, such as JSON string values, may contain Unicode, but they usually do not affect parsing decisions. The user may want Unicode validation for these strings, but that can be done by the leaf parser for these strings rather than for the whole input.
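To illustrate why per-code-unit ASCII matching is safe and cheap, here is a minimal sketch (a hypothetical helper of our own, not the library's API): case-insensitive matching of an ASCII keyword, one code unit at a time. It works unchanged for UTF-8 and UTF-16 input, because ASCII code points are encoded as a single identical code unit in both and never occur inside a multi-unit sequence.

    #include <cstddef>
    #include <string_view>

    // Hypothetical helper, for illustration: case-insensitively match an
    // ASCII keyword at the start of the input, one code unit at a time.
    template <typename CodeUnit>
    constexpr bool istarts_with_ascii(std::basic_string_view<CodeUnit> input,
                                      std::string_view ascii) {
        if (input.size() < ascii.size()) return false;
        for (std::size_t i = 0; i < ascii.size(); ++i) {
            unsigned int c = static_cast<unsigned int>(input[i]);
            unsigned int a = static_cast<unsigned char>(ascii[i]);
            // ASCII upper- and lowercase letters differ only in bit 0x20.
            if ('A' <= c && c <= 'Z') c |= 0x20;
            if ('A' <= a && a <= 'Z') a |= 0x20;
            if (c != a) return false;
        }
        return true;
    }

    // Works on UTF-16 input without decoding it.
    static_assert(istarts_with_ascii(std::u16string_view(u"Excel.exe"),
                                     std::string_view("EXCEL.EXE")));

No decoding and no Unicode tables are involved; a non-ASCII code unit simply never compares equal to an ASCII character.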

Since so much matching is against ASCII, we found it useful to have support for compile-time-known ASCII literals (tc::char_ascii) in the parser library. With them, the same grammar can be used for any input encoding. Parsed user-defined strings will carry the encoding of the input, but that's fine: any encoding conversion can be handled separately from the parser.
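The real tc::char_ascii is more elaborate, but the core idea can be sketched in a few lines (a simplified sketch, not the actual implementation): a character type whose constructor rejects non-ASCII values in constant expressions, and which therefore compares safely against code units of any encoding.

    #include <stdexcept>

    // Much simplified sketch of the idea behind tc::char_ascii.
    struct char_ascii {
        char value;
        constexpr char_ascii(char c) : value(c) {
            // An executed throw is not a constant expression, so non-ASCII
            // values are rejected at compile time in constexpr contexts.
            if (static_cast<unsigned char>(c) >= 0x80)
                throw std::logic_error("not ASCII");
        }
        // An ASCII character compares meaningfully against code units of
        // any encoding, so one grammar serves UTF-8 and UTF-16 input.
        template <typename CodeUnit>
        friend constexpr bool operator==(CodeUnit lhs, char_ascii rhs) {
            return lhs == static_cast<CodeUnit>(rhs.value);
        }
    };

    constexpr char_ascii colon{':'};       // OK
    // constexpr char_ascii bad{'\xC3'};   // compile-time error: not ASCII

    static_assert(u':' == colon);  // matches UTF-16 code units ...
    static_assert(U':' == colon);  // ... and UTF-32, with the same literal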

Finally, we may want to parse more than just strings. Parsing binary files or sequences of DNA should be possible and efficient.

Overall, I recommend separating Unicode processing from the parser library. The parser library operates on an abstract stream of symbols; for Unicode text, these would be code units. It provides the structural parsers, such as sequences with and without backtracking, alternatives, the Kleene star, etc., and leaves the interpretation of the symbols entirely to the leaf parsers, which may or may not care about Unicode.
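A minimal sketch of this separation (illustrative only, not the actual Boost.Spirit or think-cell interface): the structural combinator knows nothing about the symbol type, the leaf parser carries all the interpretation, and the same machinery handles UTF-8 code units and DNA bases alike.

    #include <optional>
    #include <span>
    #include <string_view>

    // A parser maps a span of abstract symbols to the unconsumed rest,
    // or to nullopt on failure.

    // Leaf parser: matches one symbol satisfying a predicate. Only here
    // are symbols interpreted at all.
    template <typename Symbol, typename Pred>
    auto one_if(Pred pred) {
        return [pred](std::span<Symbol const> in)
            -> std::optional<std::span<Symbol const>> {
            if (!in.empty() && pred(in.front())) return in.subspan(1);
            return std::nullopt;
        };
    }

    // Structural combinator: Kleene star. Symbol-agnostic; it only
    // chains whatever parser it is given.
    auto star(auto parser) {
        return [parser](auto in) -> std::optional<decltype(in)> {
            while (auto rest = parser(in)) in = *rest;
            return in;
        };
    }

    int main() {
        // The same `star` drives an ASCII whitespace skipper over
        // UTF-8 code units ...
        auto ws = star(one_if<char>([](char c) { return c == ' '; }));
        std::string_view text = "  hello";
        auto after_ws = ws(std::span<char const>(text.data(), text.size()));

        // ... and a parser over the four-letter DNA alphabet.
        auto bases = star(one_if<char>([](char c) {
            return c == 'A' || c == 'C' || c == 'G' || c == 'T';
        }));
        std::string_view dna = "GATTACA!";
        auto rest = bases(std::span<char const>(dna.data(), dna.size()));

        // "hello" remains after the spaces; "!" remains after the bases.
        return (after_ws->size() == 5 && rest->size() == 1) ? 0 : 1;
    }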

We have modified the Boost.Spirit fork in our library in this direction, and it serves us well.

— by Arno Schödl

Do you have feedback? Send us a message at devblog@think-cell.com!
