JavaScript, WebAssembly and WASI - what is the future and do we have a voice?

Hi Folks,

I've been following the AssemblyScript project for some time now, and it's progressing well. However, I recently happened upon this objection from their group with regard to WASI, in which they suggest that decisions being made in WASI will not play well with JavaScript. I cannot describe it as well as has already been done here.

I wouldn't normally be concerned about a W3C working group's decisions; however, I'm highly invested in and concerned about the future of JavaScript, and I hope it won't be relegated to the back seat, or possibly made irrelevant or inoperable with regard to WASI. Perhaps there is nothing to worry about, but the charges made by the AssemblyScript group are concerning, and I think I'm just looking for opinions from others "in the know" who can assuage my fears :)

JavaScript will always be a first-class citizen on the web, and integration between JS and wasm a major priority of both projects. But as far as I can tell, the main thing that has upset the AssemblyScript team is the decision for the wasm component model to disallow unpaired surrogates in strings going across component boundaries, after much consideration and discussion. (See the repeated references in the linked document to "strings" and "preferred choice of semantics does not match Java-like languages" which links to an issue about strings and so on). And on that particular detail, most people (including myself) don't share AssemblyScript's concerns. In particular, I don't think it's a problem that JS strings which contain unpaired surrogates won't be easy to send to wasm components - that's already true of a bunch of web APIs, like TextEncoder, and in my experience this is completely fine.
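For instance, here is a minimal sketch of the TextEncoder behavior referred to above; the lone high surrogate \uD834 is used purely as an illustration:

const lone = "\uD834"; // an unpaired (lone) high surrogate
console.log(new TextEncoder().encode(lone)); // Uint8Array [ 239, 191, 189 ] - the UTF-8 bytes of U+FFFD (�)
console.log(new TextEncoder().encode("\uD834\uDD1E")); // Uint8Array [ 240, 157, 132, 158 ] - a complete pair (𝄞) encodes fine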

As to whether the community "has a voice", well, you can see for yourself the huge amount of effort many people spent engaging with AssemblyScript's team on this question, despite their (at times) rather impolite comments. See primarily here, especially this summary comment, though there's plenty of others (x, x, x, x, x) (including on largely unrelated repos). That the outcome eventually settled on was not the one AssemblyScript wanted doesn't mean they didn't have a voice. I want to call your attention in particular to the vote in the meeting notes linked in the previous sentence, where the current semantics were reaffirmed after literally years of discussion:

Poll for maintaining single list-of-USV string type

SF  F  N  A  SA
31  8  6  0  2

(SF = strongly favor, F = favor, N = neutral, A = against, SA = strongly against)

There are a bunch of questionable points here that I'd like to address fairly quickly:

But as far as I can tell, the main thing that has upset the AssemblyScript team is the decision for the wasm component model to disallow unpaired surrogates in strings going across component boundaries, after much consideration and discussion. (See the repeated references in the linked document to "strings" and "preferred choice of semantics does not match Java-like languages" which links to an issue about strings and so on)

This statement is incorrect. Strings are one example, albeit a critically important one - i.e., if not even the absolute basics like strings work properly, what will? I suggest rereading the objections, as there are many more concerns, no less pressing ones I'd say, that so far nobody has spent the effort to address.

In particular, I don't think it's a problem that JS strings which contain unpaired surrogates won't be easy to send to wasm components - that's already true of a bunch of web APIs, like TextEncoder, and in my experience this is completely fine.

Luckily, standardization is not about mere opinion, and certainly unsubstantiated claims cannot be justified by pointing at the exact exceptions to the rule, i.e. the few APIs that mutate strings by design, whereas literally everything else within the language does not. I suggest looking up the ECMAScript standard, which is very clear about what a string is, WebIDL, which gives recommendations on when to use which concept, and JSON, which actually preserves strings, so that they still compare equal after a JSON.parse(JSON.stringify(...)) round trip. Other than that, basing what could well throw JS on the trash pile of programming language history on hand-waving is frankly worrying.

Regarding the rest of your response: not only are the accusations irrelevant to the question at hand - i.e. they serve no purpose other than discrediting AssemblyScript so the concerns can yet again remain unaddressed - it is also absurd to one-sidedly vilify AssemblyScript after all the hostility it had to endure, as if nobody else could be seen as slightly impolite in years of heated discourse. It's truly astonishing that this is how an open source project with a spine is treated, not only by the Wasm CG and a bunch of tech giants, but now by TC39 as well. To emphasize: the link "plenty" above in fact goes to the very first thread in this context, one that was ignored for one and a half years and eventually closed due to targeted personal attacks against me - ironically by the author of the "this summary comment" you linked.

The sheer hypocrisy is extraordinarily unfair, in that AssemblyScript has been provoked and categorically treated in the most frustrating ways for so long (go look it up, it's all public), only to then be accused of the exact bad behavior others committed against it, repeatedly, with zero consequences for those who openly and quite obviously attacked it in bad faith. Over and over, again and again. Hint: if the technical issue were truly trivial, none of that would have been necessary. As such, I recommend at least checking the last couple of posts on that one thread again. I'll happily leave it to the interested reader whether or not the other links back up your accusations, since the more people see what was actually going on over all these years, and how incredibly unfair this all was, the better.

The linked summary, btw, basically claims that two mutually exclusive things (IT, i.e. Interface Types, supports "all" encodings, but then not all encodings) are true at the same time, while prominently confusing Unicode concepts. I'd call that a low bar for an appropriate summary of many years of discussion that many experts still do not seem to fully grasp.

So, no, I would not agree with

JavaScript will always be a first-class citizen on the web, and integration between JS and wasm a major priority of both projects.

given that the issue was only settled with a sledgehammer, in a vote with a surprisingly unsurprising outcome. Hint: that's exactly why the W3C is not a majority organization, but a consensus organization - at least in theory.

I hope that my strong disagreement does not come across as impolite, as the two seem to have been conflated quite a bit so far.

I am familiar with what various standards have to say about strings, including the ECMAScript standard (of which I am an editor). Nevertheless I stand by my position that "strings which contain unpaired surrogates won't be easy to send to wasm components" is not a huge deal. I do not think this will throw JS on the trash pile of programming language history.

I don't want to get into this at any great length, and I agree that readers who are interested in the topic certainly should go through the linked threads and judge for themselves (though some of your comments have been deleted, so some of that context is now lost). I only brought it up because I consider it directly relevant to the OP's question of whether the community has a voice.

But for readers who don't have time to go through the hundreds of comments in the linked threads, I do want to make it extremely clear that my gloss of "(at times) rather impolite comments" was not about merely expressing strong technical disagreement. It was about this kind of thing.

I don't want to reopen these discussions, so I won't comment further on the matter.

I'm flattered that we are once again talking about me and the many ways in which I am, in your view, imperfect, given that there are many more interesting, often technical, matters to address - for example, those mentioned within the objections. (Btw, does a horrible experience like mine, where it's all about silencing me, really indicate to you that the community has a voice?)

Perhaps, to quickly introduce the string aspect for those interested:

According to clause 6.1.4 of the ECMAScript specification, the String type is defined as the "set of all ordered sequences of zero or more 16-bit unsigned integer values".

Only if the string is in fact interpreted as text, say when logged to the console, "each element in the String is treated as a UTF-16 code unit value". Most folks will have spotted the "�" character from time to time, indicating that part of a string does not represent displayable text. In fact, since strings are often constructed via APIs like fromCharCode, substring, concat and so forth, or their contents similarly inspected, it is common for the individual parts that eventually make up the text to not be well-formed UTF-16.

A simplified example is a string builder or buffer containing an array of fixed-length strings, say 1024 code units each, that will later be concatenated to form the final text. Here, whenever a string is sliced at a fixed length, it might split a so-called surrogate pair (see the presentation linked at the end for more info on the concepts) in half, like so:

let part1 = "𝄞".substring(0, 1); // the high (leading) surrogate only
let part2 = "𝄞".substring(1);    // the low (trailing) surrogate only

console.log(part1, part2); // � �

Neither of these strings is well-formed, in that the first contains the first half of a surrogate pair and the second contains the second half. This is not a problem when both strings are re-assembled again:

...
let text = part1 + part2;
console.log(text); // 𝄞

where now the surrogate pair has been fused and the string represents text again.

Now, what does this mean for our string builder or buffer? Since the Component Model does not allow such strings across boundaries, a string builder or buffer packaged as its own component will be broken when written in JavaScript, Java, C#, Dart, Kotlin, or any other language behaving the same way, as it can neither be provided with contents nor have its contents serialized as a whole without silently mutating the string data. In essence, the same thing that happened when logging part1 and part2 above will happen, except that it is no longer limited to display: the information is lost during processing itself, through replacement with "�". That is, when the string is reassembled beyond the boundary, the result will not be "𝄞" but "��" - two individually replaced surrogate code points, because each was transferred over a WebAssembly boundary on its own.
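To make the boundary effect concrete, here is a small sketch that merely simulates the lossy coercion in plain JavaScript; String.prototype.toWellFormed() (ES2024) is used as a stand-in for the Component Model's list-of-USV coercion and is not how components are actually lowered:

// Stand-in for crossing a boundary that only accepts well-formed (USV) strings.
const crossBoundary = (s) => s.toWellFormed(); // lone surrogates become U+FFFD

let part1 = "𝄞".substring(0, 1); // the high surrogate only
let part2 = "𝄞".substring(1);    // the low surrogate only

// Concatenated before crossing: the pair is intact, nothing is lost.
console.log(crossBoundary(part1 + part2)); // 𝄞

// Crossed piece by piece: each half has already been replaced with U+FFFD,
// so the pair can no longer be re-fused on the other side.
console.log(crossBoundary(part1) + crossBoundary(part2)); // ��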

The argument for nonetheless breaking this is basically that, while I might be technically correct, several experts "don't think this is a problem". No credible evidence for this claim has been provided in five years of discussion; instead, the discussions have degenerated into harassment and defamation of my person for insisting on proper arguments. Typically, seeing this happen over and over again should raise red flags, and one would instead fall back on what is already well known and on applicable precedents.

One of these places is WebIDL, which features a prominent warning on exactly this point.

USVString semantics are what the Component Model exclusively proposes by fixing its char type, whereas DOMString is a normal JavaScript string. Notably, WebIDL is very clear that "When in doubt, use DOMString." I would add: because otherwise stuff will break.
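To see the difference with a real API: the URL constructor takes a USVString per WebIDL, so it coerces lone surrogates, whereas language-level string operations (DOMString-like handling) leave the code units untouched. A small sketch:

const half = "𝄞".substring(0, 1); // a lone high surrogate

// USVString coercion: the lone surrogate becomes U+FFFD, which the URL parser
// then percent-encodes as %EF%BF%BD in the query.
console.log(new URL("https://example.com/?q=" + half).search); // "?q=%EF%BF%BD"

// Handling within the language itself preserves the code units,
// so the pair can still be re-assembled later.
console.log((half + "𝄞".substring(1)) === "𝄞"); // true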

Another such place is the Web Platform Design Principles, which are also very clear on this point.

There, too, the recommendation is to use DOMString; USVString is a special case that should only be used under very specific circumstances. Given that WebAssembly wants to support many languages - and as such a "WebAssembly Component Model" is also a "Java Component Model", an "Interact-with-JavaScript Component Model", or a "JavaScript-compiled-to-WebAssembly Component Model" - it becomes clear that the special case does not apply. Applying it would make it merely an "[insert a few languages here] Component Model", whereas composing an application written in anything else becomes a hazard, as in the string builder example.

An applicable precedent is JSON, where JSON.stringify produces a DOMString. Originally, if the input to JSON.stringify contained, say, the contents of the string builder, these contents were preserved in the resulting DOMString. This indirectly led to a problem: if the result was saved to disk, it was typically saved in UTF-8 encoding, breaking the contents of the string builder and again producing mojibake as shown above. This has since been addressed by the Well-formed JSON.stringify proposal, where JSON.stringify now uses escape sequences so that no information can be lost in subsequent steps, including when the result is saved to disk, preserving string integrity. Practically speaking, JSON is a sophisticated manifestation of the simplified string builder example above.
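For those curious, a quick sketch of the post-proposal behavior, assuming an engine that implements Well-formed JSON.stringify:

const chunk = "𝄞".substring(0, 1); // a lone high surrogate, e.g. from the string builder
console.log(JSON.stringify(chunk)); // prints "\ud834" (with surrounding quotes) - escaped rather than emitted as a raw code unit
console.log(JSON.parse(JSON.stringify(chunk)) === chunk); // true - nothing is lost in the round trip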

As one can see, a lot of thought has gone into this over the last few decades already, with recommendations and principles being formulated, whereas the Component Model simply proposes to break all of this with unsubstantiated claims, in conflict with all the evidence, which has, for some reason, been systematically ignored. "I don't think this is a problem" and "this dude is impolite" summarize the foregoing discussions very well.

More such places are the language specifications and conventions of Java, C#, Dart, Kotlin, TypeScript, you name it, which all behave the same as JavaScript. All of these will be affected, in that composing an overall application out of such components implies lossy strings. A presentation about the implications, which I wanted to give in a Wasm CG meeting but was not (really) allowed to, can be found here. Note that this link has already been posted by the ECMAScript editor above, in a grossly different frame of reference.

What might be much more interesting, however, is that WASI - which the Component Model is a spin-off of, some say a trojan horse of - is eagerly establishing an entire set of JS-incompatible platform APIs. Nothing is reused. Nothing is bridged. This is what 99% of AssemblyScript's objections are about, yet the bulk has been cleverly moved out of view here with manipulative rhetoric and irrelevant accusations, which on its own, I'd say, violates about half of TC39's CoC, even though nobody will be penalized for this "political finesse". So I hope nobody is surprised that I am not amused by a mere continuation of the WebAssembly CG's practices.

I really wish these concerns could be adequately discussed for once (nobody has responded to AssemblyScript's objections so far), as I think this is critical so that, to phrase it in the OP's words, JavaScript is not "relegated to the back-seat, or possibly made irrelevant or inoperable".

P.S.: Some companies are already experimenting with disabling what makes JS fast, the JIT, for "super duper security". I really hope this is not somehow connected.

I've learned much about unpaired surrogates, UTF-16 and WTF-16 after much scouring of the internet. I've followed all the threads listed here. As far as I can tell, this schism was introduced some years back when Unicode was redefined to exclude unpaired surrogates.

I'm weighing in now because I feel the question I asked has been well answered. I will try to avoid any speculation and simply say that IMO, with regard to WASI and WebAssembly, web standards have been ignored in a way that hurts JavaScript in order to favor other languages.

The arguments in favor of the current decision have, in quite a few instances, resorted to ad hominems and, effectively, the "it's not a big deal" line, while the arguments in favor of preserving web standards were thorough, clear, and, for me, convincing.

I would prefer that WASI be allowed to fork, or be set free from the W3C, so that it may use Unicode as it prefers, thus allowing it to potentially usher in some very interesting edge-computing solutions. Further, WebAssembly should continue to support WTF-16, as it is the web standard. WebAssembly should not bend over backwards to support other languages at the expense of JavaScript and backwards compatibility. It should be remembered that Wasm was made for the browser and to interoperate with JavaScript.

some years back when Unicode was redefined to exclude unpaired surrogates

Btw, this is a myth, as even the latest Unicode standard is very clear about what a string is.

Notably, "all ordered sequences of zero or more 16-bit unsigned integer values" as per ECMAScript (chapter 6.1.4, The String Type) is, in fact, a valid Unicode string (chapter 2.7, Unicode Strings). Unicode's definition explicitly excludes the requirements imposed by Unicode encoding forms, where things like UTF-xy and surrogate pairs become relevant. ECMAScript strings are Unicode strings. The Component Model's string definition is not identical to the definition of Unicode strings.