Character code literals

Something that got old years ago is having to constantly look up character codes for various string- and byte-parsing needs. Comparing strings works in a pinch for scripting, and V8 even optimizes str[i] === "ch" specially, but that doesn't help when dealing with binary data like raw UTF-8 buffers. One could save "a".charCodeAt(0) to a variable, but a sufficiently complicated parsing algorithm could be saving dozens of those, and it gets unwieldy fast. (As a concrete data point, parsing JSON, a relatively simple language, requires specific awareness of 29 code points: 0, 1, 9, ., +, -, E, e, {, }, [, ], :, \, ", t, r, u, f, a, l, s, n, ,, b, space, newline, carriage return, and horizontal tab.)
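To illustrate the "dozens of saved variables" problem, here's a sketch of the status quo (constant names are mine, just for illustration):

```javascript
// Hoisting char codes into named constants works, but a real parser
// accumulates dozens of these before it can read a single token.
const QUOTE = '"'.charCodeAt(0);      // 34
const BACKSLASH = "\\".charCodeAt(0); // 92
const LBRACE = "{".charCodeAt(0);     // 123
const DIGIT_0 = "0".charCodeAt(0);    // 48
const DIGIT_9 = "9".charCodeAt(0);    // 57
// ...and so on for every delimiter and keyword character.

function isDigit(byte) {
  return byte >= DIGIT_0 && byte <= DIGIT_9;
}

console.log(isDigit("7".charCodeAt(0))); // true
```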

I don't have any concrete syntax ideas - my first thought was 0s'a' or 0sa, but those obviously look a bit strange, and the second doesn't look delimited enough. Either way, this would make for a very good quality-of-life improvement around text parsing in JS.

One option is to create a template tag. This could be provided built-in, letting engines optimize it away if they choose, but it also isn't too difficult to provide in user land.

const c = ([char]) => char.charCodeAt(0)
console.log(c`a`) // 97

You'd want codePointAt() to avoid messing up astral characters, but otherwise yeah, that works. ^_^
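A quick sketch of the difference (the tag is the same one-liner from above, switched to codePointAt):

```javascript
// codePointAt decodes a full surrogate pair; charCodeAt(0) only
// sees the first UTF-16 code unit.
const c = ([char]) => char.codePointAt(0);

console.log(c`a`);               // 97
console.log(c`😀`);              // 128512 (U+1F600)
console.log("😀".charCodeAt(0)); // 55357 (just the high surrogate)
```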


Exactly what I've been using ;) As for optimizing it away, hand-written parsers are such a niche that a babel macro is enough:

babel-charcode.macro
//  sorry never bothered publishing it
const { createMacro, MacroError } = require("babel-plugin-macros");

module.exports = createMacro(charCodeMacro);

function charCodeMacro({ references, babel }) {
  if ("default" in references) {
    for (const ref of references.default) {
      transform(ref, babel.types);
    }
  }
}

function transform(ref, t) {
  const { parentPath } = ref;
  // The macro only makes sense as the tag of a tagged template
  if (!parentPath || !parentPath.isTaggedTemplateExpression() || ref.key !== "tag") {
    throw new MacroError(`\`${ref}\` can only be used as a template expression tag`);
  }
  const { quasis } = parentPath.node.quasi;
  if (quasis.length !== 1) {
    throw new MacroError(`\`${ref}\` can only be used on template strings without substitutions`);
  }
  const { cooked } = quasis[0].value;
  const cp = cooked.codePointAt(0);
  // Reject empty templates and anything longer than one code point
  if (typeof cp !== "number" || cooked !== String.fromCodePoint(cp)) {
    throw new MacroError(`\`${ref}\` can only be used on single characters, got ${JSON.stringify(cooked)}`);
  }
  // Replace the whole tagged template with a plain numeric literal
  parentPath.replaceWith(t.numericLiteral(cp));
}
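In case it helps to see the macro's semantics without a build step, here's a runtime sketch (mine, not part of the macro) that performs the same validation and produces the same value the macro would inline:

```javascript
// Runtime equivalent of the macro's checks: a single, substitution-free
// template whose cooked text is exactly one code point.
function c(strings, ...subs) {
  if (subs.length !== 0 || strings.length !== 1) {
    throw new Error("c can only be used on template strings without substitutions");
  }
  const cooked = strings[0];
  const cp = cooked.codePointAt(0);
  if (typeof cp !== "number" || cooked !== String.fromCodePoint(cp)) {
    throw new Error(`c can only be used on single characters, got ${JSON.stringify(cooked)}`);
  }
  return cp;
}

console.log(c`{`);  // 123
console.log(c`😀`); // 128512
```

The only difference is that the macro pays this cost at build time and leaves a bare numeric literal in the output.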

Worth mentioning that there is considerable language precedent (though some of these use separate types for characters and numbers). Just to name a few:

  • C/C++
  • Java
  • C#
  • Visual Basic
  • Swift
  • Rust
  • Erlang
  • Elixir
  • Haskell

I expected TC39 people to have enough familiarity with some of these languages to be aware of this, which is why I didn't make it explicit.

I think this "gotcha" is pretty important. I assume this is simply a list of languages that have some form of character literal, not necessarily a list of languages whose character literals produce numbers? The difference, IMO, is important. After all, many of these languages wouldn't help with your main use case either: you would still have to transform the character to a number using some sort of function or casting mechanism, just like you have to do today in JavaScript (though I guess casting can be done for free, so you do get a minor performance benefit in those languages).

Put another way, what you're asking for isn't simply a character-literal syntax - we've already got that: put 'x' into your source code and you'll get a character in this language's native representation, a length-one string. That puts us on par with the feature set of all of these other languages when it comes to character-literal syntax. What you're asking for instead is a special character-literal syntax that produces numbers instead of the language's native character type. I don't think any of these languages have that. (Sure, C++ character literals produce numbers, but numbers are also its native way to represent characters.)