I believe there was a discussion but the search didn't help ... basically my question is about exposing the size in bytes of a JS string: not .length, and not TextEncoder.encode(str).length, because the latter duplicates the amount of RAM needed just to obtain a size, as opposed to allowing an ArrayBuffer to grow or be resized and then used with encodeInto.
As a summary: why is there no easy way to obtain what the size of a string would be, without converting strings all over the place, encoding, creating a view, or some other workaround?
Thank you!
edit: I know that for some reason a resizable ArrayBuffer cannot work with encodeInto, but I have a MagicView module that transfers buffers behind the scenes while growing, and I cannot transfer and create a new static ArrayBuffer with a meaningful length if I cannot know upfront how much data is needed. See: TextEncoder & TextDecoder shenanigans · GitHub
Also, on StackOverflow the question "how would I get the length in bytes of a JS string?" is an evergreen one, full of misleading, slow, often unacceptable answers.
Is there any concrete reason that detail cannot be exposed in JS, besides historical ones?
There is no notion of the "length in bytes" of a string independent of encoding. Formally, strings in JS are UTF-16, which means that the length in bytes of a string is exactly twice its .length, which is sufficiently trivial that there's no reason to expose it.
TextEncoder uses UTF-8. Is your question specifically what the length of the encoding of a string would be if encoded as UTF-8? In principle there could be a function which computes that, although because TextEncoder lives in WHATWG it would probably be most appropriate there.
In the meantime, here's a userland function you can use if you want (I wrote it and dedicate it to the public domain). I haven't actually tested it, so you probably want to do that before relying on it, but I think it's right. In any case, something like this should be pretty straightforward to do.
function utf8Length(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    let code = str.charCodeAt(i);
    if (code < 0x80) {
      count += 1;
    } else if (code < 0x800) {
      count += 2;
    } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      let nextCode = str.charCodeAt(i + 1);
      if (nextCode >= 0xDC00 && nextCode <= 0xDFFF) {
        count += 4;
        i++; // Skip trailing surrogate
      } else {
        // Unpaired surrogates get represented as U+FFFD, which takes 3 bytes
        count += 3;
      }
    } else {
      count += 3;
    }
  }
  return count;
}
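A couple of sample results from the function above, which match what TextEncoder.encode(str).length reports:

const encoder = new TextEncoder();
console.log(utf8Length("hello"), encoder.encode("hello").length); // 5 5 (ASCII only)
console.log(utf8Length("héllo"), encoder.encode("héllo").length); // 6 6 ("é" takes 2 bytes)
console.log(utf8Length("🙂"), encoder.encode("🙂").length);       // 4 4 (surrogate pair, 4 bytes)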
I am probably all over the place, and your solution proves it's not just str.length * 2 ... imagine a string where no char requires UTF-16 surrogates: that's double the amount of RAM needed to work on serialization.
In my attempts, I had an ASCII.encode variant for things I know don't require that * 2 guessing (see the sketch after this list):
dates to ISO strings
RegExp flags
possibly RegExp in general
other use cases where I can be sure the str does not need, or contain, non-ASCII UTF-16 chars
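A minimal sketch of what such an ASCII-only fast path could look like (illustrative only, not the actual MagicView code): when every code unit is below 0x80, the UTF-8 byte length is simply str.length and the bytes can be written directly.

// Hypothetical ASCII-only encoder: valid only when every code unit is < 0x80.
function asciiEncodeInto(str, view, offset = 0) {
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code > 0x7F) throw new RangeError("non-ASCII char at index " + i);
    view[offset + i] = code; // 1 byte per char, byte length === str.length
  }
  return str.length; // bytes written
}

// e.g. ISO dates or RegExp flags are known to be ASCII:
const iso = new Date(0).toISOString();    // "1970-01-01T00:00:00.000Z"
const bytes = new Uint8Array(iso.length); // exact size, no * 2 guessing
asciiEncodeInto(iso, bytes);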
On top of MessagePack, the CBOR specification was born, also with the ability to disambiguate between byte strings and UTF-8 text strings, so this problem is not new; it's just overhead nobody needs, as in: nobody should need to hunt for a bulletproof solution out there when every PL knows the amount of buffer data needed to represent any string, and buffers are binary.
Here JS has binary types and text encoders/decoders, and yet no way to know how many bytes a string would produce once encoded by those primitives.
So yes, we can solve that in userland, but please explain to me why that is needed at all when buffers are all the engine understands behind the scenes, thanks.
Last, but not least, when raw performance matters, any of these workarounds is slower than what a native implementation could provide.
Bakkot proved no such thing. It is exactly str.length * 2 if you're counting the bytes of the WTF-16 serialization that JS uses by default. Yes, it's clearly not that when you're serializing using any other method; bakkot is providing a method for counting the bytes when serializing to UTF-8. (And from a quick read, I think that the function is exactly right.) Their entire point is that the "length" depends completely on the serialization you're using.
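To make that concrete (an illustrative example using the utf8Length function above): the same string has a different byte count depending on the target serialization.

const s = "héllo🙂";             // 7 UTF-16 code units
const utf16Bytes = s.length * 2; // 14 bytes when counting the WTF-16 code units
const utf8Bytes = utf8Length(s); // 10 bytes as UTF-8 (1 + 2 + 1 + 1 + 1 + 4)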
Engines mostly do not use UTF-8 behind the scenes, because JS strings are UTF-16, not UTF-8. So the engines have to do exactly the same calculation as you'd be doing in userland to provide this information. Here's V8's implementation, for example, which is in fact ultimately called by Chromium's implementation of TextEncoder.encode. You'll note it's pretty much exactly the same thing I wrote (well, V8's version operates a character at a time, but when called in a loop it's the same thing).
And sure, engines can do it a little faster if it's native. But that's true of any function you can write in JS. It's still fundamentally the same operation.
I'm not opposed to exposing this in the web platform; as you say, it's not trivial for JS programmers to get right. But it would be done in WHATWG, as part of TextEncoder, not in TC39.
not really ... when official APIs to encode and decode text assume UTF-8 ... yes, doubling bytes on strings to have Uint16 instead of Uint8 all over might be convenient, but it's extremely inconvenient when the Web speaks UTF-8:
meta to enforce UTF-8 on HTML pages
fetch and arrayBuffer() that return UTF-8 buffers
Blobs, which have been used as a (wrong/slow) hack to retrieve what I am after here (sketched after this list)
decode(encodeURIComponent(str)).length as a similarly error-prone or slow approach to the previous point
TextEncoder.encode producing UTF-8 views and TextDecoder.decode working only with those
and so on and so forth ... the amount of questions without answers from the PL that fuels the Web, and that lives on UTF-8 everywhere, never gets any answer other than: "use this userland solution or that one"
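For reference, a rough sketch of the workarounds mentioned above (Blob is available in browsers and modern runtimes); every one of them allocates a full encoded copy just to learn a number:

const str = "some arbitrary text 🙂";
const viaEncode = new TextEncoder().encode(str).length; // encodes, reads length, throws the copy away
const viaBlob = new Blob([str]).size;                   // wraps the encoded bytes in a Blob
console.log(viaEncode, viaBlob);                        // same UTF-8 byte count twice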
Meanwhile, JSON, TextEncoder.encode, and TextDecoder.decode fully speak UTF-8 and translate UTF-16 to it, making the surrogate dance an internal affair, not "yet another library needed to do that right" for something trivially done internally; that's my whole point.
I have never proposed to have JS use UTF-8 strings; I am just saying this long-standing, in-demand question gets so much pushback for something that looks like "2 LOC" for engines to expose as a detail ... so, once again, why do we need error-prone userland solutions to do what every engine is internally capable of?
Maybe my quest for a String.prototype.byteLength was wrong, in retrospect, but how should we ask to get a String.prototype.utf8Length out there?
you can see that conversation summarized here, with Mathias needing the exact same thing, and instead of iterating twice natively, the proposed solution is to iterate on the JS side, which is worse than just iterating twice natively: Fast byteLength() · Issue #333 · whatwg/encoding · GitHub ... but most importantly, the whole point of having that detail is to preserve the needed RAM ...
Imagine I have a 2GB-length string (allowed) and I want to serialize it, send it, or create a view to send it ... if I use any of the current primitives to do so I will need at least 4GB of free RAM, even if temporarily, to create that view and trash it somewhere else, plus all the extra RAM needed for the program ... so I need to TextEncoder.encode(str), already duplicating memory (plus the view and buffer needed to handle it), then set (append) it into a buffer I am meaning to send (we are already at 6GB of memory here), and then hope the GC will be good enough not to crash (the NodeJS case and other IoT or more constrained environments) ... enter utf8Length() (or an accessor), as sketched after this list:
I loop natively and retrieve the length, no extra RAM needed except for that unsigned int 32
I allocate, if possible, enough buffer (even fixed length) to store and send that data ...
the GC now deals with max 4GB of RAM instead of 6GB+
I haven't created, by accident or logic, any extra GC operation to perform in the meantime
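A sketch of that flow, assuming a hypothetical utf8Length(str) (or the userland version above) were available, with str being the large string from the scenario; encodeInto and Uint8Array are existing primitives:

// 1. compute the byte length, no intermediate encoded copy of the string
const byteLength = utf8Length(str);        // ideally native, O(1) extra RAM

// 2. allocate exactly what is needed, even as a fixed-length buffer
const target = new Uint8Array(byteLength);

// 3. encode straight into it, nothing temporary left behind for the GC
const { read, written } = new TextEncoder().encodeInto(str, target);
// written === byteLength, read === str.length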
I agree that asking for the length and then encoding is twice the work behind the scenes, but how is it acceptable that such work needs to be done twice from the JS side instead, and that's actually OK?
DisallowGarbageCollection no_gc; ... if it were possible in the JS world, I'd welcome a JS-only solution ... this is instead what I am after: fewer GC operations to retrieve something the engine can retrieve, plus less RAM involved
FlatContent content = string->GetFlatContent(no_gc); another one ... impossible to have the same on the JS side without using primitives that expose everything but that detail (the reason I think this should really be a JS thing, not a WHATWG one)
content.IsOneByte() another primitive missing in the JS world ... if we could have that, maybe all the userland checks around surrogates could fade away? (see the rough sketch after this list)
unibrow::Utf8::Length(c, last_character) another primitive we miss on the JS side ... a combination of this one and isOneByte would help everyone out there solve this long-standing problem. I have actually asked for just that String::Utf8Length shortcut to be exposed as an API, and that's an internal JS detail, not a WHATWG one, so I think this should be considered part of the JS PL, as these things should work even outside the WHATWG scope, imho
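Without access to those internals, the closest userland approximations I can think of look roughly like this (my sketch; note V8's "one byte" means Latin-1, and Latin-1 chars above 0x7F still take 2 bytes in UTF-8):

// rough stand-ins for content.IsOneByte() and a per-string Utf8Length
const isLatin1 = (str) => !/[\u0100-\uffff]/.test(str); // fits a one-byte (Latin-1) representation
const isAscii = (str) => !/[\u0080-\uffff]/.test(str);  // exactly 1 UTF-8 byte per char

const utf8ByteLength = (str) =>
  isAscii(str) ? str.length : utf8Length(str);          // fast path + full scan fallback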
I hope it's clear what I am asking for and/or why; happy to expand further if needed, but please attach reasons for pushing back that have not already been mentioned in a thread where people working on the Chromium project ended up writing their own solution too, as that proves it's a constant and real demand, way more important than a forEach or many other JS shortcuts introduced recently, thank you!
DisallowGarbageCollection does not mean "less GC operations", and GetFlatContent only makes sense in the context of V8's extremely complex internal string representation, not in JS. If you're not familiar with reading V8's source it's probably best not to try to draw conclusions from guessing at the behavior of its internal APIs from their names. I'm sorry for linking it.
It's really not. The length of a string when encoded as UTF-8 is something which is only relevant if you're doing encoding to UTF-8. It is not information which exists internally in the specification or engines. It has to be computed, just as the length in any other encoding except UTF-16 would have to be computed, because JS strings are defined to be UTF-16.
Again, I'm not saying that this is a bad API. It's true that lots of things want UTF-8 encoding. But it will not happen in JS, because TextEncoder is in WHATWG, and that's the appropriate place for such an API. So if you want to make a case for it, the WHATWG thread linked above is the correct place to do so, not this forum.
fair enough, guilty as charged, but those specialized internals are the reason I would love to have this utf8Length exposed.
but the argument there is that "it's going to be used for encodeInto, which is bad", while that is exactly the primary use case for asking, among all the other use cases users keep needing to deal with daily when it comes to ArrayBuffer, which is a JS thing.
There is no way today in JS to avoid duplicating RAM between a string and its resulting encoded buffer, while most other dynamic PLs don't have such an issue: they have binary strings and ref.write(bytes) available (see Python, as an example) ... and we don't have the same ease in JS due to resizable/growing paranoia (the reason people use those is to preserve references and avoid trashing RAM on demand).
I deal with this stuff daily, and if I could remove that JS workaround everyone else needs to write or provide, I'd be way happier with JS's dynamic capabilities, especially around its UTF-16-based string approach, which doesn't play super well with the rest of the world, while the primitives proposed to solve that can't cope with memory at all: it's either there, already pre-allocated, or "not happening" for most operations.
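For what it's worth, Node's Buffer already exposes roughly the shape being described here (Node-only, not part of JS or WHATWG; str is assumed to be the string to serialize):

// byte length without handing an encoded copy back to the caller
const byteLength = Buffer.byteLength(str, "utf8");
// allocate exactly that much, then write the string straight into it
const buf = Buffer.allocUnsafe(byteLength);
buf.write(str, 0, "utf8");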
The WHATWG thread linked above is still the correct place to argue for this. I'm sorry they have not yet found your arguments convincing but that doesn't mean we're going to add it here instead.
I agree there's several ways that encoding can be improved. That work is going to have to be done in WHATWG.
It's really annoying that such a fundamental thing is delegated to WHATWG while NodeJS, Bun, and all other JS-based runtimes could just benefit if JS had the ability to convert strings to UTF-8 compatible buffers and vice versa ... very sad that everyone has to reinvent that same wheel with other libraries, or even via JS itself, polluting code for a very obscure reason.
Node, Bun, and most other JS based runtimes follow the WinterTC (formerly WinterCG) Minimum Common API, which includes TextEncoder and TextDecoder, so none of them have to reinvent anything. It makes no difference whatsoever to Node and friends whether this is defined in ECMA262 or WHATWG. It happens to be defined in WHATWG, so that's the appropriate place to pursue changes to it, but this is not relevant to users.
but why does every runtime use its own encoder/decoder instead? I would make no argument if Buffer in NodeJS weren't preferred and overall way faster than those 2 APIs, but I don't understand why, or how, those 2 APIs are slower, as opposed to being as fast as their WinterTC counterpart ... can anyone please enlighten me on this?
In WHATWG the current status is that "it's not true these APIs are slow", while the evidence and the abundance of posts online suggest, and provide numbers for, the opposite (especially for small strings, which are the norm, not the exception/edge case).
Node's Buffer predates the existence of TextEncoder (and for that matter TypedArrays in general) by several years.
The speed of these APIs is determined by how they're implemented, not how they're specified. (At least for basic use cases; if you're trying to write to a specific offset the fact that TextEncoder makes you use subarray does inherently make it slower than APIs which don't do that.)
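To illustrate the subarray point (a minimal sketch; str, offset, an existing Uint8Array target, and an existing Node Buffer buf are assumed):

// WHATWG way: writing at an offset requires creating a subarray view first
const { written } = new TextEncoder().encodeInto(str, target.subarray(offset));

// Node's Buffer takes the offset directly, no temporary view object
const bytesWritten = buf.write(str, offset, "utf8");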
All the major browsers are open source; if you want to optimize TextEncoder in your favorite browser, go for it. It's an implementation issue, not a specification issue.