Get the Original Mapping of Unicode String Normalization

I am having the same problem as the one described in “Get changed offsets of unicode normalization?” (Core Development, Discussions on Python.org), but in a JavaScript library.

My use case is a “parser” that needs to be fast. I put the word in quotes because it is not a tokenizer for a language (the input has no formal syntax); the goal is to parse as much of the content as possible. To achieve that, I would like to normalize the string to NFKD beforehand, but the problem is that the parsing result has to include the original, un-normalized strings.

The downside of using a library is that it would be slow, because “the Unicode Normalization Algorithm is fairly complex” (UAX #15: Unicode Normalization Forms), and it would duplicate logic that is already implemented in JavaScript engines.
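
For illustration, here is a minimal sketch of one way to keep a mapping back to the original string while still relying on the engine's built-in `String.prototype.normalize`. The helper name `normalizeWithOffsets` and the chunking rule (one non-mark code point plus the combining marks that follow it) are assumptions made for this example, not an established API. Normalizing chunk by chunk is only an approximation of normalizing the whole string, so the sketch also reports whether the two results agree.

```javascript
// Hypothetical helper: normalize `input` chunk by chunk and record, for every
// UTF-16 code unit of the result, the index in `input` where its chunk starts.
function normalizeWithOffsets(input, form = "NFKD") {
  // Assumed chunking rule: a non-mark code point followed by its combining
  // marks, or a run of marks with no base character in front of them.
  const chunkPattern = /\P{M}\p{M}*|\p{M}+/gu;
  let text = "";
  const offsets = [];
  for (const match of input.matchAll(chunkPattern)) {
    const normalized = match[0].normalize(form);
    // Every code unit produced by this chunk points at the chunk's start.
    for (let i = 0; i < normalized.length; i++) {
      offsets.push(match.index);
    }
    text += normalized;
  }
  // Compare against the engine's whole-string result so that a caller can
  // detect inputs for which the chunk-wise approximation diverges.
  const exact = text === input.normalize(form);
  return { text, offsets, exact };
}

const result = normalizeWithOffsets("\uFB01anc\u00E9"); // "ﬁancé"
console.log(result.text.length); // 7 ("fiance" plus U+0301)
console.log(result.offsets);     // [0, 0, 1, 2, 3, 4, 4]
console.log(result.exact);       // true
```

Here `offsets[i]` is the position in the original string of the chunk that produced code unit `i` of the normalized text, so a match found in the normalized text can be mapped back to a slice of the original. The extra whole-string `normalize` call exists only for the sanity check and could be dropped once the chunking rule is trusted for the inputs at hand.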

Most parsers I write operate on the raw input rather than normalizing, which has worked well for me. How much additional work is required to parse the raw input, rather than normalizing first?