I believe there was a discussion but the search didn't help ... basically my question is about exposing the size in bytes of a JS string: not .length, and not TextEncoder.encode(str).length, because the latter duplicates the amount of RAM needed just to obtain a size, as opposed to allowing an ArrayBuffer to grow or be resized and then used with encodeInto.
As a summary: why is there no easy way to obtain what the size of a string would be, without converting strings all over the place, encoding, creating a view, or some other workaround?
Thank you!
edit: I know that for some reason a resizable ArrayBuffer cannot work with encodeInto, but I have a MagicView module that transfers buffers behind the scenes while growing, and I cannot transfer and create a new static ArrayBuffer with a meaningful length if I cannot know upfront how much data is needed. See: TextEncoder & TextDecoder shenanigans · GitHub
Also, on StackOverflow the question "how would I get the length in bytes of a JS string?" is an evergreen one, full of misleading, slow, often unacceptable answers.
Is there any concrete reason that detail cannot be exposed in JS, besides historical ones?
There is no notion of the "length in bytes" of a string independent of encoding. Formally, strings in JS are UTF-16, which means that the length in bytes of a string is exactly twice its .length, which is sufficiently trivial that there's no reason to expose it.
TextEncoder uses UTF-8. Is your question specifically what the length of the encoding of a string would be if encoded as UTF-8? In principle there could be a function which computes that, although because TextEncoder lives in WHATWG it would probably be most appropriate there.
In the meantime, here's a userland function you can use if you want (I wrote it and dedicate it to the public domain). I haven't actually tested it, so you probably want to do that before relying on it, but I think it's right. In any case, something like this should be pretty straightforward to do.
function utf8Length(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    let code = str.charCodeAt(i);
    if (code < 0x80) {
      count += 1;
    } else if (code < 0x800) {
      count += 2;
    } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      let nextCode = str.charCodeAt(i + 1);
      if (nextCode >= 0xDC00 && nextCode <= 0xDFFF) {
        count += 4;
        i++; // Skip trailing surrogate
      } else {
        // Unpaired surrogates get represented as U+FFFD, which takes 3 bytes
        count += 3;
      }
    } else {
      count += 3;
    }
  }
  return count;
}
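A couple of sample results from the function above, which match what TextEncoder.encode(str).length reports:

const encoder = new TextEncoder();
console.log(utf8Length("hello"), encoder.encode("hello").length); // 5 5 (ASCII only)
console.log(utf8Length("héllo"), encoder.encode("héllo").length); // 6 6 ("é" takes 2 bytes)
console.log(utf8Length("🙂"), encoder.encode("🙂").length);       // 4 4 (surrogate pair, 4 bytes)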
I am probably all over the place, and your solution proves it's not just str.length * 2 ... imagine a string where no char requires UTF-16 surrogates: that's double the amount of RAM needed to work on serialization.
In my attempts, I had an ASCII.encode variant for things I know don't require that * 2 guessing (see the sketch after this list):
dates to ISO strings
RegExp flags
possibly RegExp in general
other use cases where I can be sure the str does not need, or contain, non-ASCII UTF-16 chars
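A minimal sketch of what such an ASCII-only fast path could look like (illustrative only, not the actual MagicView code): when every code unit is below 0x80, the UTF-8 byte length is simply str.length and the bytes can be written directly.

// Hypothetical ASCII-only encoder: valid only when every code unit is < 0x80.
function asciiEncodeInto(str, view, offset = 0) {
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code > 0x7F) throw new RangeError("non-ASCII char at index " + i);
    view[offset + i] = code; // 1 byte per char, byte length === str.length
  }
  return str.length; // bytes written
}

// e.g. ISO dates or RegExp flags are known to be ASCII:
const iso = new Date(0).toISOString();    // "1970-01-01T00:00:00.000Z"
const bytes = new Uint8Array(iso.length); // exact size, no * 2 guessing
asciiEncodeInto(iso, bytes);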
On top of MessagePack, the CBOR specification was born, also with the ability to disambiguate between byte strings and UTF-8 text strings, so this problem is not new; it's just overhead nobody needs, as in: nobody should need to hunt for a bulletproof solution out there when every PL knows the amount of buffer data needed to represent any string, and buffers are binary.
Here JS has binary types and text encoders/decoders, and yet no way to know how many bytes a string would produce once encoded by those primitives.
So yes, we can solve that in userland, but please explain to me why that is needed at all when buffers are all the engine understands behind the scenes, thanks.
Last, but not least, when raw performance matters, any of these workarounds is slower than what a native implementation could provide.
Bakkot proved no such thing. It is exactly str.length * 2 if you're counting the bytes of the WTF-16 serialization that JS uses by default. Yes, it's clearly not that when you're serializing using any other method; bakkot is providing a method for counting the bytes when serializing to UTF-8. (And from a quick read, I think that the function is exactly right.) Their entire point is that the "length" depends completely on the serialization you're using.
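To make that concrete (an illustrative example using the utf8Length function above): the same string has a different byte count depending on the target serialization.

const s = "héllo🙂";             // 7 UTF-16 code units
const utf16Bytes = s.length * 2; // 14 bytes when counting the WTF-16 code units
const utf8Bytes = utf8Length(s); // 10 bytes as UTF-8 (1 + 2 + 1 + 1 + 1 + 4)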
Engines mostly do not use UTF-8 behind the scenes, because JS strings are UTF-16, not UTF-8. So the engines have to do exactly the same calculation as you'd be doing in userland to provide this information. Here's V8's implementation, for example, which is in fact ultimately called by Chromium's implementation of TextEncoder.encode. You'll note it's pretty much exactly the same thing I wrote (well, V8's version operates a character at a time, but when called in a loop it's the same thing).
And sure, engines can do it a little faster if it's native. But that's true of any function you can write in JS. It's still fundamentally the same operation.
I'm not opposed to exposing this in the web platform; as you say, it's not trivial for JS programmers to get right. But it would be done in WHATWG, as part of TextEncoder, not in TC39.
not really ... when official APIs to encode and decode text assume UTF-8 ... yes, doubling bytes on strings to have Uint16 instead of Uint8 all over might be convenient, but it's extremely inconvenient when the Web speaks UTF-8:
meta to enforce UTF-8 on HTML pages
fetch and arrayBuffer() that return UTF-8 buffers
Blobs, which have been used as a (wrong/slow) hack to retrieve what I am after here (sketched after this list)
decode(encodeURIComponent(str)).length as a similarly error-prone or slow approach to the previous point
TextEncoder.encode producing UTF-8 views and TextDecoder.decode working only with those
and so on and so forth ... the amount of questions without answers from the PL that fuels the Web, and that lives on UTF-8 everywhere, never gets any answer other than: "use this userland solution or that one"
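For reference, a rough sketch of the workarounds mentioned above (Blob is available in browsers and modern runtimes); every one of them allocates a full encoded copy just to learn a number:

const str = "some arbitrary text 🙂";
const viaEncode = new TextEncoder().encode(str).length; // encodes, reads length, throws the copy away
const viaBlob = new Blob([str]).size;                   // wraps the encoded bytes in a Blob
console.log(viaEncode, viaBlob);                        // same UTF-8 byte count twice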
Meanwhile, JSON, TextEncoder.encode, and TextDecoder.decode fully speak UTF-8 and translate UTF-16 to it, making the surrogate dance an internal affair, not "yet another library needed to do that right" for something trivially done internally; that's my whole point.
I have never proposed to have JS use UTF-8 strings; I am just saying this long-standing, in-demand question gets so much pushback for something that looks like "2 LOC" for engines to expose as a detail ... so, once again, why do we need error-prone userland solutions to do what every engine is internally capable of?
Maybe my quest for a String.prototype.byteLength was wrong, in retrospect, but how should we ask to get a String.prototype.utf8Length out there?
you can see that conversation summarized here, with Mathias needing the exact same thing, and instead of iterating twice natively, the proposed solution is to iterate on the JS side, which is worse than just iterating twice natively: Fast byteLength() · Issue #333 · whatwg/encoding · GitHub ... but most importantly, the whole point of having that detail is to preserve the needed RAM ...
Imagine I have a 2GB-length string (allowed) and I want to serialize it, send it, or create a view to send it ... if I use any of the current primitives to do so I will need at least 4GB of free RAM, even if temporarily, to create that view and trash it somewhere else, plus all the extra RAM needed for the program ... so I need to TextEncoder.encode(str), already duplicating memory (plus the view and buffer needed to handle it), then set (append) it into a buffer I am meaning to send (we are already at 6GB of memory here), and then hope the GC will be good enough not to crash (the NodeJS case and other IoT or more constrained environments) ... enter utf8Length() (or an accessor), as sketched after this list:
I loop natively and retrieve the length, no extra RAM needed except for that unsigned int 32
I allocate, if possible, enough buffer (even fixed length) to store and send that data ...
the GC now deals with max 4GB of RAM instead of 6GB+
I haven't created, by accident or logic, any extra GC operation to perform in the meantime
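A sketch of that flow, assuming a hypothetical utf8Length(str) (or the userland version above) were available, with str being the large string from the scenario; encodeInto and Uint8Array are existing primitives:

// 1. compute the byte length, no intermediate encoded copy of the string
const byteLength = utf8Length(str);        // ideally native, O(1) extra RAM

// 2. allocate exactly what is needed, even as a fixed-length buffer
const target = new Uint8Array(byteLength);

// 3. encode straight into it, nothing temporary left behind for the GC
const { read, written } = new TextEncoder().encodeInto(str, target);
// written === byteLength, read === str.length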
I agree that asking for the length and then encoding is twice the work behind the scenes, but how is it acceptable that such work needs to be done twice from the JS side instead, and that's actually OK?
DisallowGarbageCollection no_gc; ... if it were possible in the JS world, I'd welcome a JS-only solution ... this is instead what I am after: fewer GC operations to retrieve something the engine can retrieve, plus less RAM involved
FlatContent content = string->GetFlatContent(no_gc); another one ... impossible to have the same on the JS side without using primitives that expose everything but that detail (the reason I think this should really be a JS thing, not a WHATWG one)
content.IsOneByte() another primitive missing in the JS world ... if we could have that, maybe all the userland checks around surrogates could fade away? (see the rough sketch after this list)
unibrow::Utf8::Length(c, last_character) another primitive we miss on the JS side ... a combination of this one and isOneByte would help everyone out there solve this long-standing problem. I have actually asked for just that String::Utf8Length shortcut to be exposed as an API, and that's an internal JS detail, not a WHATWG one, so I think this should be considered part of the JS PL, as these things should work even outside the WHATWG scope, imho
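Without access to those internals, the closest userland approximations I can think of look roughly like this (my sketch; note V8's "one byte" means Latin-1, and Latin-1 chars above 0x7F still take 2 bytes in UTF-8):

// rough stand-ins for content.IsOneByte() and a per-string Utf8Length
const isLatin1 = (str) => !/[\u0100-\uffff]/.test(str); // fits a one-byte (Latin-1) representation
const isAscii = (str) => !/[\u0080-\uffff]/.test(str);  // exactly 1 UTF-8 byte per char

const utf8ByteLength = (str) =>
  isAscii(str) ? str.length : utf8Length(str);          // fast path + full scan fallback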
I hope it's clear what I am asking for and/or why; happy to expand further if needed, but please attach reasons for pushing back that have not already been mentioned in a thread where people working on the Chromium project ended up writing their own solution too, as that proves it's a constant and real demand, way more important than a forEach or many other JS shortcuts introduced recently, thank you!
DisallowGarbageCollection does not mean "less GC operations", and GetFlatContent only makes sense in the context of V8's extremely complex internal string representation, not in JS. If you're not familiar with reading V8's source it's probably best not to try to draw conclusions from guessing at the behavior of its internal APIs from their names. I'm sorry for linking it.
It's really not. The length of a string when encoded as UTF-8 is something which is only relevant if you're doing encoding to UTF-8. It is not information which exists internally in the specification or engines. It has to be computed, just as the length in any other encoding except UTF-16 would have to be computed, because JS strings are defined to be UTF-16.
Again, I'm not saying that this is a bad API. It's true that lots of things want UTF-8 encoding. But it will not happen in JS, because TextEncoder is in WHATWG, and that's the appropriate place for such an API. So if you want to make a case for it, the WHATWG thread linked above is the correct place to do so, not this forum.
fair enough, guilty as charged, but those specialized internals are the reason I would love to have this utf8Length exposed.
but the argument there is that "it's going to be used for encodeInto, which is bad", while that is exactly the primary use case for asking, among all the other use cases users keep needing to deal with daily when it comes to ArrayBuffer, which is a JS thing.
There is no way today in JS to avoid duplicating RAM between a string and its resulting encoded buffer, while most other dynamic PLs don't have such an issue: they have binary strings and ref.write(bytes) available (see Python, as an example) ... and we don't have the same ease in JS due to resizable/growing paranoia (the reason people use those is to preserve references and avoid trashing RAM on demand).
I deal with this stuff daily, and if I could remove that JS workaround everyone else needs to write or provide, I'd be way happier with JS's dynamic capabilities, especially around its UTF-16-based string approach, which doesn't play super well with the rest of the world, while the primitives proposed to solve that can't cope with memory at all: it's either there, already pre-allocated, or "not happening" for most operations.
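For what it's worth, Node's Buffer already exposes roughly the shape being described here (Node-only, not part of JS or WHATWG; str is assumed to be the string to serialize):

// byte length without handing an encoded copy back to the caller
const byteLength = Buffer.byteLength(str, "utf8");
// allocate exactly that much, then write the string straight into it
const buf = Buffer.allocUnsafe(byteLength);
buf.write(str, 0, "utf8");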
The WHATWG thread linked above is still the correct place to argue for this. I'm sorry they have not yet found your arguments convincing but that doesn't mean we're going to add it here instead.
I agree there's several ways that encoding can be improved. That work is going to have to be done in WHATWG.
It's really annoying that such a fundamental thing is delegated to WHATWG while NodeJS, Bun, and all other JS-based runtimes could just benefit if JS had the ability to convert strings to UTF-8 compatible buffers and vice versa ... very sad that everyone has to reinvent that same wheel with other libraries, or even via JS itself, polluting code for a very obscure reason.
Node, Bun, and most other JS based runtimes follow the WinterTC (formerly WinterCG) Minimum Common API, which includes TextEncoder and TextDecoder, so none of them have to reinvent anything. It makes no difference whatsoever to Node and friends whether this is defined in ECMA262 or WHATWG. It happens to be defined in WHATWG, so that's the appropriate place to pursue changes to it, but this is not relevant to users.
but why does every runtime use its own encoder/decoder instead? I would make no argument if Buffer in NodeJS weren't preferred and overall way faster than those 2 APIs, but I don't understand why, or how, those 2 APIs are slower, as opposed to being as fast as their WinterTC counterpart ... can anyone please enlighten me on this?
In WHATWG the current status is that "it's not true these APIs are slow", while the evidence and the abundance of posts online suggest, and provide numbers for, the opposite (especially for small strings, which are the norm, not the exception/edge case).
Node's Buffer predates the existence of TextEncoder (and for that matter TypedArrays in general) by several years.
The speed of these APIs is determined by how they're implemented, not how they're specified. (At least for basic use cases; if you're trying to write to a specific offset the fact that TextEncoder makes you use subarray does inherently make it slower than APIs which don't do that.)
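To illustrate the subarray point (a minimal sketch; str, offset, an existing Uint8Array target, and an existing Node Buffer buf are assumed):

// WHATWG way: writing at an offset requires creating a subarray view first
const { written } = new TextEncoder().encodeInto(str, target.subarray(offset));

// Node's Buffer takes the offset directly, no temporary view object
const bytesWritten = buf.write(str, offset, "utf8");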
All the major browsers are open source; if you want to optimize TextEncoder in your favorite browser, go for it. It's an implementation issue, not a specification issue.