Get the Original Mapping of Unicode String Normalization

graphemecluster · June 26, 2022, 9:04pm

I have the same problem as the one described in “Get changed offsets of unicode normalization?” (Core Development, Discussions on Python.org), but in a JavaScript library.

My use case is a “parser” that needs to be fast. I put “parser” in quotes because it is not a tokenizer for a language (the input has no formal syntax); the goal is to recognize as much of the content as possible. To do that, I would like to normalize the string to NFKD beforehand, but the parsing result has to include the original, unnormalized substrings.

The downside of using a library is performance: “the Unicode Normalization Algorithm is fairly complex” (UAX #15: Unicode Normalization Forms), so a library implementation will be slow, and it duplicates logic that is already implemented in JavaScript engines.
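
To make the request concrete, here is a minimal sketch of the kind of mapping I mean, built only on the engine's own `String.prototype.normalize`. The name `normalizeWithOffsets` and the shape of its return value are hypothetical, not an existing API. It normalizes one code point at a time, which I believe matches NFKD except that combining marks are not canonically reordered across code points, so it also reports whether the result equals the engine's whole-string answer.

```javascript
// Hypothetical helper, not a built-in: NFKD-normalize `input` one code point
// at a time and remember which original index produced each output code unit.
function normalizeWithOffsets(input) {
  let normalized = "";
  // offsets[i] = UTF-16 index in `input` of the code point that produced normalized[i]
  const offsets = [];
  let index = 0;
  for (const char of input) { // iterates by code point, so surrogate pairs stay together
    const piece = char.normalize("NFKD");
    normalized += piece;
    for (let i = 0; i < piece.length; i++) offsets.push(index);
    index += char.length;
  }
  offsets.push(input.length); // sentinel so an exclusive end offset can be mapped too

  // Assumption to verify: per-code-point decomposition skips the canonical
  // reordering of combining marks, so flag inputs where the results differ.
  const exact = normalized === input.normalize("NFKD");
  return { normalized, offsets, exact };
}

const input = "\uFB01x \u00BD"; // "ﬁx ½"
const { normalized, offsets, exact } = normalizeWithOffsets(input);

// Map the match "fi" at [0, 2) in the normalized string back to the original.
const original = input.slice(offsets[0], offsets[2]); // the single ligature U+FB01
```

A match that ends in the middle of an expansion (for example only the `f` of the decomposed ligature) maps to an empty slice with this scheme, so a real parser would need a rule for rounding such ends. When `exact` is false the offsets no longer line up with the engine's output, and calling `normalize` once per code point is unlikely to meet the performance goal, which is why a built-in mapping would be preferable.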

bakkot · June 27, 2022, 4:19am

Most parsers I write operate on the raw input rather than normalizing, which has worked well for me. How much additional work is required to parse the raw input, rather than normalizing first?
