String.hash

edited to suggest String.hash instead of Math.hashCode

Developers often need a fast, non-cryptographic hash for small strings, but currently need to copy paste from stack overflow or install a dependency.

I searched for this but didn't find it (exactly). I think it'd be useful if the spec provided a String.hash('some string') function.

I found this discussion which asks for a method to be added Object - which I believe is much more controversial.

To make this alternative more likely to succeed, I'd suggest:

  1. Put it on the String global as a static method
  2. Only accept a string input
  3. Make no particular guarantees about collision-avoidance - just use a long-established algorithm like Java's .hashCode() default implementation which has perfectly valid use-cases.

Here's an implementation the likes of which I've found myself copy-pasting from stack overflow many times:

function hashString(str){
	let hash = 0;
	for (let i = 0; i < str.length; i++) {
		hash += Math.pow(str.charCodeAt(i) * 31, str.length - i);
		hash = hash & hash; // Convert to 32bit integer
	}
	return hash;
}

Forcing people to copy-paste code that they don't fully understand, even when it's pretty innocuous like this, seems bad. On projects I work on that use eslint or similar tools (many configs have rules like no-bitwise which make it even more cumbersome to copy this implementation into a project)

This, I think, would be a very easy addition to the spec, easy to implement, and easy to document (compared to other proposals at least).

Note:

  • namespacing could of course be bikeshedded - I thought Math was appropriate since this is just a simple mathematical formula, but String was suggested as a better alternative
  • naming too - maybe it could be String.insecureHashCode(...) instead to make sure nobody thinks it's cryptographic and only uses it for cases where real cryptography isn't needed
  • equivalents to the dead-simple above snippet include crypto.subtle, but require async/await and creations of buffers/TextEncoders etc.
  • this would be very fast for use cases that don't involve security and value speed and simplicity instead

It's an interesting thought. Certainly I've found myself needing to hash strings pretty often, and like you I just copy a snippet around, which is both annoying and slow. I'd call it String.hash personally; I think the case for having a built-in hash for strings is strong and the design more straightforward (especially given that we can't feasibly do arbitrary objects).

2 Likes

I like that, I've edited the original to push for that instead (with a note to avoid confusion)

Joining the club, I've also had to add a userland sync hashing function for the Composites proposal polyfill: proposal-composites/polyfill/internal/murmur.ts at ae5ea98e7c966581f46af37e80f954335ad78948 ยท tc39/proposal-composites ยท GitHub

I think the main design question would be if the hash function (and starting state) would be exactly defined (portable) or if the values can change across runs (reduce collision attacks). Or if the proposal offers both forms.

If the hash is different between runs but is guaranteed to be consistent between connected agents (e.g. web workers) this would be a great uplift over current userland implementations where it may be challenging to sync the seed all workers use.

The next biggest win, IMO, if this could be added to the language would be if the function could access the existing cached hash string value (that the most popular JS engines have) if it's already been computed for that particular string instance. As strings are not valid WeakMap keys there is no safe way for userland to cache hashes in the same way engines can. While the specification proposal couldn't mandate caching, as it wouldn't be observable, it should be discussed with implementors to validate what would happen in reality.

wouldn't a "random prefix" already cover the non portable case? with fixed algo you get both caching and portability out of the box so that if anyone wants to make that less portable it can just add some random prefix to the string that's being hashed keeping the cache guards and reducing collisions?

To my mind the main advantage of non-portability is that it frees engines to experiment with implementations - the use of the internal hash that @aclaymore mentions is unlikely to be possible with a fully specified hash function, for example. But I think there's likely to be a lot of pushback on that; many people (myself included) really don't like adding more non-explicit nondeterminism to the language.

Yes, though my primary point was could all the agents agree on that random prefix so that two workers that hash the same string both get the same result without first having to message each other to agree on a prefix. Because this is something that would be a large value add that can't (easily) be done today.

plus consider client/server/db stored hash that are suddenly unusable anymore in the future if anything changes in there

via broadcast channel or message channel or even post message that's inevitable already for a variety of things, reason I don't think it's super hard ... the usefulness of a random hash remains to me though, we went from a clear background/example (the Java one) to full "let's make it more complex" without presenting use cases ... anything obvious I could think about? :thinking:

If these are non-portable, they would be explicitly required to include some randomization at agent init so that they would never be portable across sessions, so that you would immediately run into that rather than it changing at some point in the future. That's how Go handles map iteration order, for example. But yes, doing that does mean you can't meaningfully persist hashes, which would be annoying for some applications.

There's two main things, as mentioned:

  • built-in defensiveness against a class of denial of service attacks which are enabled by attackers crafting strings which will have the same hash, which is only practical if hashes are fixed, and
  • allowing engines to choose their own implementations, which is going to be faster - often much faster - than any particular algorithm we might mandate.

Whether these outweigh the downsides of non-portability depends on your values.