Character code literals

Something that got old years ago is having to constantly look up character codes for various string- and byte-parsing needs. Comparing strings works in a pinch for scripting, and V8 even optimizes str[i] === "ch" specially, but that doesn't help when dealing with binary data like raw UTF-8 buffers. One could save "a".charCodeAt(0) to a variable, but a sufficiently complicated parsing algorithm could be saving dozens of those, and it gets unwieldy fast. (As a concrete data point, parsing JSON, a relatively simple language, requires specific awareness of 29 code points: 0, 1, 9, ., +, -, E, e, {, }, [, ], :, \, ", t, r, u, f, a, l, s, n, ,, b, space, newline, carriage return, and horizontal tab.)
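To illustrate the "dozens of saved variables" problem, here's a sketch of the status quo (constant names are mine, just for illustration):

```javascript
// Hoisting char codes into named constants works, but a real parser
// accumulates dozens of these before it can read a single token.
const QUOTE = '"'.charCodeAt(0);      // 34
const BACKSLASH = "\\".charCodeAt(0); // 92
const LBRACE = "{".charCodeAt(0);     // 123
const DIGIT_0 = "0".charCodeAt(0);    // 48
const DIGIT_9 = "9".charCodeAt(0);    // 57
// ...and so on for every delimiter and keyword character.

function isDigit(byte) {
  return byte >= DIGIT_0 && byte <= DIGIT_9;
}

console.log(isDigit("7".charCodeAt(0))); // true
```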

I don't have any concrete syntax ideas - my first thought was 0s'a' or 0sa, but those obviously look a bit strange, and the second doesn't look delimited enough. Either way, this would make for a very good quality-of-life improvement around text parsing in JS.

One option is to create a template tag. This could be provided built-in, letting engines optimize it away if they choose, but it also isn't too difficult to provide in user land.

const c = ([char]) => char.charCodeAt(0)
console.log(c`a`) // 97

You'd want codePointAt() to avoid messing up astral characters, but otherwise yeah, that works. ^_^
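A quick sketch of the difference (the tag is the same one-liner from above, switched to codePointAt):

```javascript
// codePointAt decodes a full surrogate pair; charCodeAt(0) only
// sees the first UTF-16 code unit.
const c = ([char]) => char.codePointAt(0);

console.log(c`a`);               // 97
console.log(c`😀`);              // 128512 (U+1F600)
console.log("😀".charCodeAt(0)); // 55357 (just the high surrogate)
```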


Exactly what I've been using ;) As for optimizing it away, hand-written parsers are such a niche that a babel macro is enough:

babel-charcode.macro
//  sorry never bothered publishing it
const { createMacro, MacroError } = require("babel-plugin-macros");

module.exports = createMacro(charCodeMacro);

function charCodeMacro({ references, babel }) {
  if ("default" in references) {
    for (const ref of references.default) {
      transform(ref, babel.types);
    }
  }
}

function transform(ref, t) {
  const { parentPath } = ref;
  // The macro only makes sense as the tag of a tagged template
  if (!parentPath || !parentPath.isTaggedTemplateExpression() || ref.key !== "tag") {
    throw new MacroError(`\`${ref}\` can only be used as a template expression tag`);
  }
  const { quasis } = parentPath.node.quasi;
  if (quasis.length !== 1) {
    throw new MacroError(`\`${ref}\` can only be used on template strings without substitutions`);
  }
  const { cooked } = quasis[0].value;
  const cp = cooked.codePointAt(0);
  // Reject empty templates and anything longer than one code point
  if (typeof cp !== "number" || cooked !== String.fromCodePoint(cp)) {
    throw new MacroError(`\`${ref}\` can only be used on single characters, got ${JSON.stringify(cooked)}`);
  }
  // Replace the whole tagged template with a plain numeric literal
  parentPath.replaceWith(t.numericLiteral(cp));
}
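In case it helps to see the macro's semantics without a build step, here's a runtime sketch (mine, not part of the macro) that performs the same validation and produces the same value the macro would inline:

```javascript
// Runtime equivalent of the macro's checks: a single, substitution-free
// template whose cooked text is exactly one code point.
function c(strings, ...subs) {
  if (subs.length !== 0 || strings.length !== 1) {
    throw new Error("c can only be used on template strings without substitutions");
  }
  const cooked = strings[0];
  const cp = cooked.codePointAt(0);
  if (typeof cp !== "number" || cooked !== String.fromCodePoint(cp)) {
    throw new Error(`c can only be used on single characters, got ${JSON.stringify(cooked)}`);
  }
  return cp;
}

console.log(c`{`);  // 123
console.log(c`😀`); // 128512
```

The only difference is that the macro pays this cost at build time and leaves a bare numeric literal in the output.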

Worth mentioning that there is considerable language precedent (though some of these use separate types for characters and numbers). Just to name a few:

  • C/C++
  • Java
  • C#
  • Visual Basic
  • Swift
  • Rust
  • Erlang
  • Elixir
  • Haskell

I expected TC39 people to have enough familiarity with some of these languages to be aware of this, which is why I didn't make it explicit.

I think this "gotcha" is pretty important. I assume this is simply a list of languages that have some form of character literal, not necessarily a list of languages whose character literals produce numbers? The difference, IMO, is important. After all, many of these languages wouldn't help with your main use case either: you would still have to transform the character to a number using some sort of function or casting mechanism, just like you have to do today in JavaScript (though I guess casting can be done for free, so you do get a minor performance benefit in those languages).

Put another way, what you're asking for isn't simply a character-literal syntax - we've already got that: put 'x' into your source code and you'll get a character in this language's native representation, a length-one string. That puts us on par with the feature set of all of these other languages when it comes to character-literal syntax. What you're asking for instead is a special character-literal syntax that produces numbers instead of the language's native character type. I don't think any of these languages have that. (Sure, C++ character literals produce numbers, but numbers are also its native way to represent characters.)