BigInt enhancements

Motivation

BigInts were a great addition to ES, but they often need to be converted to/from other data types for transport or integration with legacy systems. At the moment, every user has to re-implement those conversions manually, and some of them are easy to get wrong (like two's-complement encoding).

Proposal

  • Convert a bigint to a Uint8Array of the given number of bits.

    BigInt.prototype.toUint8Array(bits: number, signed: boolean = false)
    
    • Big-endian byte order.
    • If the value cannot fit into the specified number of bits, the function MUST throw an Error.
    • If the value is negative, it should be encoded using two's complement.
    • If the value is negative and signed == false, the function MUST throw an Error.
  • Convert a Uint8Array into a bigint.

    BigInt(bytes: Uint8Array, signed: boolean = false)
    
    • Interpret the bytes as big endian.
    • If signed is true then the bytes should be interpreted as two's complement.
  • Convert a bigint into a base-radix string, using two's complement for negative numbers and sign-extending out to bits.

    BigInt.prototype.toString(radix: number, bits: number, signed: boolean = false)
    
    • If the value is negative and signed == false, then the function MUST throw an Error.
  • Convert a string into a bigint, interpreting it as a two's complement value.

    BigInt(value: string, bits?: number, signed: boolean = false)
    
    • If the string is more than bits / 4 characters long after processing the leading 0x radix indicator (each hex digit carries four bits), then the function MUST throw an Error.
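For illustration, the encoding rules above can be sketched as a standalone helper (the name bigIntToUint8Array is mine; the proposal would put this on BigInt.prototype, and this sketch assumes bits is a multiple of 8, as in all the test cases below):

```javascript
// Illustrative sketch of the proposed toUint8Array semantics.
function bigIntToUint8Array(value, bits, signed = false) {
  if (value < 0n && !signed) throw new RangeError('negative value requires signed = true')
  const w = BigInt(bits)
  // Range check: signed values must fit in [-2^(bits-1), 2^(bits-1) - 1],
  // unsigned values in [0, 2^bits - 1].
  const min = signed ? -(1n << (w - 1n)) : 0n
  const max = (1n << (signed ? w - 1n : w)) - 1n
  if (value < min || value > max) throw new RangeError(`value does not fit in ${bits} bits`)
  // BigInt.asUintN gives us the two's-complement reinterpretation for free.
  let n = BigInt.asUintN(bits, value)
  const result = new Uint8Array(bits / 8)
  for (let i = result.length - 1; i >= 0; i--) { // big-endian: LSB last
    result[i] = Number(n & 0xffn)
    n >>= 8n
  }
  return result
}
```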

Test Cases

(0n).toUint8Array(8, true) // [0x00]
(0n).toUint8Array(8, false) // [0x00]
(0n).toUint8Array(8) // [0x00]
(0n).toUint8Array(16) // [0x00, 0x00]
(1n).toUint8Array(8, true) // [0x01]
(127n).toUint8Array(8, true) // [0x7f]
(128n).toUint8Array(8, true) // error
(128n).toUint8Array(16, true) // [0x00, 0x80]
(128n).toUint8Array(8, false) // [0x80]
(128n).toUint8Array(8) // [0x80]
(255n).toUint8Array(8, false) // [0xff]
(256n).toUint8Array(8, false) // error
(256n).toUint8Array(16, false) // [0x01, 0x00]
(-129n).toUint8Array(8, true) // error
(-129n).toUint8Array(16, true) // [0xff, 0x7f]
(-128n).toUint8Array(8, true) // [0x80]
(-1n).toUint8Array(8, true) // [0xff]
(-1n).toUint8Array(8, false) // error

BigInt(new Uint8Array([0x80])) // 128n
BigInt(new Uint8Array([0x80]), false) // 128n
BigInt(new Uint8Array([0x80]), true) // -128n
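Decoding is the mirror image; an illustrative helper (again, the name is mine) that matches these cases:

```javascript
// Sketch of the proposed BigInt(bytes, signed) overload.
function uint8ArrayToBigInt(bytes, signed = false) {
  let n = 0n
  for (const b of bytes) n = (n << 8n) | BigInt(b) // big-endian accumulate
  // With signed = true, reinterpret the raw bits as two's complement.
  return signed ? BigInt.asIntN(bytes.length * 8, n) : n
}
```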

(0n).toString(16, 0) // ''
(0n).toString(16, 8) // '00'
(0n).toString(16, 16) // '0000'
(128n).toString(16, 8, true) // error
(128n).toString(16, 8, false) // '80'
(-128n).toString(16, 8, false) // error
(-128n).toString(16, 8, true) // '80'
(-128n).toString(16, 16, true) // 'ff80'
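These cases can be reproduced with BigInt.asUintN, which performs both the two's-complement reinterpretation and the sign extension out to bits in one step; a sketch (the helper name is mine):

```javascript
// Sketch of the proposed toString(radix, bits, signed) semantics.
function bigIntToRadixString(value, radix, bits, signed = false) {
  if (value < 0n && !signed) throw new RangeError('negative value requires signed = true')
  const w = BigInt(bits)
  const min = signed ? -(1n << (w > 0n ? w - 1n : 0n)) : 0n
  const max = (1n << (signed && w > 0n ? w - 1n : w)) - 1n
  if (value < min || value > max) throw new RangeError(`value does not fit in ${bits} bits`)
  if (bits === 0) return ''
  // Pad to the digit count implied by `bits` (e.g. 4 bits per hex digit).
  const width = Math.ceil(bits / Math.log2(radix))
  return BigInt.asUintN(bits, value).toString(radix).padStart(width, '0')
}
```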

BigInt('0x80') // 128n
BigInt('0x80', 8) // 128n
BigInt('0x80', 8, false) // 128n
BigInt('0x80', 8, true) // -128n
BigInt('0x80', 16, true) // 128n
BigInt('0xff80', 16, true) // -128n
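Decoding from a string likewise falls out of BigInt.asIntN; a sketch limited to hex input (the only radix the cases above exercise), where each digit carries four bits:

```javascript
// Sketch of the proposed BigInt(value, bits, signed) string overload,
// restricted to 0x-prefixed hex strings.
function bigIntFromHexString(value, bits, signed = false) {
  const digits = value.replace(/^0x/i, '')
  const n = BigInt('0x' + digits)
  if (bits === undefined) return n
  if (digits.length > bits / 4) throw new RangeError(`input wider than ${bits} bits`)
  // With signed = true, reinterpret the raw bits as two's complement.
  return signed ? BigInt.asIntN(bits, n) : n
}
```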

Considerations

Array instead of Uint8Array:

Array<number> may be preferable to Uint8Array as it is a "simpler" type, and thus more likely to be usable. Plus, it is trivial to convert an Array<number> into a Uint8Array if necessary. Uint8Array is more commonly used for data buffers and transport, hence why I went with that, but if others feel strongly about Array<number> I don't see any problem there.

Endianness

Big Endian is often called "Network Byte Order" due to how common it is for network payloads. Since JavaScript is used largely for "internet" applications, it makes sense to choose that byte order, even though it may be slightly less performant on little endian systems.

Two's Complement

This seems to have won out over ones' complement almost across the board. I don't really see any reason to pick ones' complement over the de facto standard in this case.

FromString length for bits?

We could infer the bits parameter for BigInt from string. This way we would only need a single boolean parameter and it would be up to the caller to provide a properly padded unsigned value, to ensure that the leading bit is not set. This means 0xff would always decode as -1n, and if you wanted 255n then you would need to pass 0x00ff. In general, I'm not a fan of inferring things like this because it is a vector for unexpected runtime errors, but I am open to being convinced it is a better API.


I acknowledge that there is a pretty common need to serialise BigInts in order to store them, communicate them between systems, etc., but I do not think we need a standard transcoder built into the language, at least not now. There are many, many different "VLQ"/"varint" encodings of arbitrary-magnitude integers used in a variety of systems: MIDI popularised one, Protobufs popularised another, and git uses yet another. Each of them would be a valid way to encode BigInts, but none is so overwhelmingly popular among JavaScript users (yet) or hard enough to implement via libraries to justify inclusion in the language. For example, the one I've built (https://github.com/michaelficarra/bigint-serialiser) uses yet another "VLQ"/"varint" variant that makes the tradeoffs that I felt were appropriate for general usage, and its implementation is not particularly large or difficult to get correct, as you can see from the source and tests.

The most likely part to get wrong is the two's complement. Currently, ES can encode to an arbitrary-base string, which is pretty easy to convert to most other formats, with the exception of encoding negative numbers "reasonably" (I don't know of any other system that encodes a number by prefixing it with a leading - sign aside from when it is presenting a base 10 number). Perhaps this proposal could be simplified/reduced just to making the toString method correctly handle negative numbers using the de facto standard of two's complement?

Do any of the protocols you mentioned above use anything other than two's complement for encoding negative numbers (before compression)? If not, then it seems that two's complement may be the common piece where we would get the most value out of an ES standard. We could have it only be part of the toString and from-string methods, and scrap all of the Uint8Array stuff.

If the BigInt fits into 64 bits, you can just use new BigInt64Array([someBigInt]).buffer

I think most people don't care which format, as long as encoding and decoding works with the same binary.
For those who care, they are mostly using crypto-related algorithms, so we just need the same format used by WebCrypto when encoding private/public keys into binary.

Overall, I think this is fine. I wouldn't want to have the bits parameter, though; it should just work and give me a result.

I think most people don't care which format, as long as encoding and decoding works with the same binary.

+1.

In order for the system to be able to decode something that was previously encoded by the same system, either you need the bits parameter OR unsigned values may sometimes need a leading 0 byte. For example, (128n).toString(16, true) would result in '0080'. This is because BigInt('0x80', true) would decode as -128n.

Also, if we leave out the bits parameter then I believe the signed boolean becomes required since (128n).toString(16) currently will produce '80', and I don't think we can do a backward incompatible change like that.

I have been seeing a trend with LEB128 (signed and unsigned both) taking off as the predominant variable-length integer encoding, including in WebAssembly, LLVM, DWARF (debugging format), Protobufs, among others, so I wonder if it'd be worth exposing that. Also, some formats specify fixed-width integers larger than 64 bits, and so it'd be useful to be able to read that.

Concretely, I feel these would be best to add:

  • BigInt.prototype.bitLength - Basically, Math.floor(Math.log2(Math.abs(this))) + BigInt(this < 0), but would be a cheap O(1) operation in practice due to pervasive hardware and compiler support for "find first set", "count leading zeroes", "log base 2", etc. For standard JS numbers, it's 31 - Math.clz32(x) and thus trivial, but bigints could be any size and so this is necessary for them. It's also necessary to readily calculate how much was written for variadic integers. It returns a BigInt.
  • DataView.prototype.getBigIntN(width, index, littleEndian), DataView.prototype.getBigUintN(width, index, littleEndian) - Read bytes bytes starting at offset index into a bigint n and return BigInt.asIntN(width, n) or BigInt.asUintN(width, n) as appropriate. Note that result.bitLength won't necessarily equal width - it just can't be greater than it.
  • DataView.prototype.setBigIntN(width, index, value, littleEndian), DataView.prototype.setBigUintN(width, index, value, littleEndian) - Convert value to a bigint n via BigInt.asIntN(width, value) or BigInt.asUintN(width, value) as appropriate, and write it to the first bytes bytes starting at offset index, sign- or zero-extended as appropriate.
  • DataView.prototype.getBigIntVL(index), DataView.prototype.getBigUintVL(index) - Read a LEB128-encoded integer at index. Naturally, the number of bytes read is Math.floor(result.bitLength / 8).
  • DataView.prototype.setBigIntVL(index, value), DataView.prototype.setBigUintVL(index, value) - Write a LEB128-encoded integer at index. Naturally, the number of bytes written is Math.floor(value.bitLength / 8).

Note: bytes here is max(1, div_round_up(width, 8)) + 1 for signed integers and max(1, div_round_up(width, 8)) for unsigned integers, where width is the number of bits required to represent the integer as a two's complement signed integer. It's the minimum number of bytes required to represent it as either a signed or unsigned integer.
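For concreteness, the unsigned LEB128 format the proposed getBigUintVL/setBigUintVL methods refer to can be sketched in plain JS (these helpers are illustrative, not part of the proposal):

```javascript
// Unsigned LEB128: 7 bits per byte, low-order groups first, with the high
// bit of each byte set when more bytes follow.
function encodeULEB128(n) {
  if (n < 0n) throw new RangeError('unsigned LEB128 requires n >= 0n')
  const bytes = []
  do {
    let b = Number(n & 0x7fn)
    n >>= 7n
    if (n !== 0n) b |= 0x80 // continuation bit
    bytes.push(b)
  } while (n !== 0n)
  return Uint8Array.from(bytes)
}

function decodeULEB128(bytes) {
  let n = 0n
  let shift = 0n
  for (const b of bytes) {
    n |= BigInt(b & 0x7f) << shift
    shift += 7n
    if ((b & 0x80) === 0) break // no continuation bit: done
  }
  return n
}
```

(624485n encodes as [0xe5, 0x8e, 0x26], the classic LEB128 example.)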

As for the need to convert into a Uint8Array, here's how this addresses that:

  1. If you really do want to convert to/from a Uint8Array, it's not too hard, just a little involved and naturally math-y. (You're probably already doing something math-y, so it's not going to be that much simpler.) Do note that you do have to take into account whether you want to treat it as signed.
function signedToArray(bigint, littleEndian) {
    const bits = bigint.bitLength
    // Math.max rejects BigInts; 0n is falsy, so `|| 1n` guarantees at least one byte.
    const bytes = bits / 8n + BigInt(bits % 8n !== 0n) || 1n
    const result = new Uint8Array(Number(bytes))
    new DataView(result.buffer).setBigIntN(Number(bytes) * 8, 0, bigint, littleEndian)
    return result
}

function unsignedToArray(bigint, littleEndian) {
    const bits = bigint.bitLength - 1n
    const bytes = bits / 8n + BigInt(bits % 8n !== 0n) || 1n
    const result = new Uint8Array(Number(bytes))
    new DataView(result.buffer).setBigUintN(Number(bytes) * 8, 0, bigint, littleEndian)
    return result
}

function arrayToSigned(array, littleEndian) {
    return new DataView(array.buffer).getBigIntN(
        BigInt(array.length) * 8n, 0, littleEndian
    )
}

function arrayToUnsigned(array, littleEndian) {
    return new DataView(array.buffer).getBigUintN(
        BigInt(array.length) * 8n, 0, littleEndian
    )
}
  2. For bit vectors, you could do bits & 1n << BigInt(offset). It's a trivial thing for parsers to detect, and any reasonable implementation could optimize this to avoid the ceremony of a bit shift + a full bitwise AND across the entire bigint, even for large bit vectors. It's similar to C++'s vector<bool> specialization, but obviously immutable. Not sure any engines implement this optimization, though.
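That bit-vector access works today, as a one-liner:

```javascript
// Test bit `offset` (counted from the least significant end) of a bigint.
const bitSet = (bits, offset) => (bits & (1n << BigInt(offset))) !== 0n

const vec = 0b101001n // bits 0, 3, and 5 are set
```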

I think moving the to/from Uint8Array into DataView is probably a better solution than having those functions on BigInt. It would be nice if you could utilize that in one line, rather than two (minimum), though. For example, if the DataView set functions returned this, you could do:

new Uint8Array(new DataView(new ArrayBuffer(byteLength)).setBigUintN(byteLength * 8, 0, value, false).buffer)

It would be nice if we could get that down a bit, but I think that is more an issue of not being able to construct a DataView with an implicit ArrayBuffer by just providing a length. It isn't too bad, though, and if you are pulling more than a single number out of a byte array it gets better.

VLQs encode this as the least or most significant bit of the message. It should just be handled by the format. If we consider this an opaque data format (composed of a variable number of ints), then this shouldn't be an issue.

If we use some specific variable length encoding format then yes. However, I would argue that is getting too opinionated unless we provide multiple encoding formats (or at least design for the addition of future encoding formats) since there isn't a clear winner at this time beyond "Big Endian" and "Two's Complement" (even Big Endian has only won within the context of network traffic). Any encoding beyond that is still a fairly competitive space in my experience, and thus it is probably not a good idea to bake one into ES as the "official encoding format".

Another option would be to have these functions take an encoding parameter, like TextEncoder used to have in some browsers prior to being well specified. In that case, I think choosing UTF-8 was reasonable because it has "won" the character encoding competition for most of the internet (where JS is predominately used).

If we use some specific variable length encoding format then yes. However, I would argue that is getting too opinionated unless we provide multiple encoding formats (or at least design for the addition of future encoding formats) since there isn't a clear winner at this time beyond "Big Endian" and "Two's Complement" (even Big Endian has only won within the context of network traffic).

Going to have to disagree that big endian "won". It's used in most network traffic, but little endian won in the hardware space due to better read/write efficiency. (ARM is mostly little-endian, x86-64 is exclusively little-endian, and the remaining popular big-endian architectures are almost all older.) Little-endian also largely won the file space. Yes, most networking protocols use big-endian representation, but this is not universal.

Any encoding beyond that is still a fairly competitive space in my experience, and thus it is probably not a good idea to bake one into ES as the "official encoding format".

For variable-length quantities, I do see a winner here, and it aligns with my proposal:

  • The unsigned LEB128 format has pretty much won for unsigned integers, more than UTF-8 won the encoding wars - people were using this very format longer than the term "LEB128" was even a thing. DWARF, MIDI (page 2), ASN.1 BER (page 13), Google's Protocol Buffers, W3C's EXI spec, and .NET's BinaryReader and BinaryWriter classes. Also, several other specs use the format explicitly. LLVM bitcode itself generally follows that format without specifying it, but supports segment widths of fewer than 8 bits and adjusts accordingly. (This is rarely dealt with directly in JS, BTW.) Source maps use a similar system, but with 6-bit segments to fit in a base 64 encoding instead of 8-bit segments to fit in 8-bit bytes. (This is almost exclusive to build tooling and doesn't account for much of the time spent.) It all stems from using the high bit of a given segment as the "continuation bit". There are a couple more significant exceptions here:
    • QUIC sets the high 2 bits of the first byte to 00 for 6-bit integers, 01 for 14-bit integers, 10 for 30-bit integers, and 11 for 62-bit integers. They don't support arbitrary-length integers, though.
    • Git makes sure that the smallest (N+1)-byte encoding represents one more than the largest N-byte encoding, removing redundancy. Otherwise, it follows the same pattern as LEB128 for unsigned integers.
  • LEB128's form of truncating it to the smallest possible width that can represent it as two's complement and then sign-extending that to the nearest multiple of 7 also largely won. Signed integers in Protocol Buffers are an unusual exception that stores it as the unsigned representation of (n << 1) ^ (n >> 31) for 32-bit integers, (n << 1) ^ (n >> 63) for 64-bit integers, etc., but that transform is relatively trivial and easily converted to something in terms of dataView.setBigUintN(...). And the two main exceptions listed above lack variable-length signed integers entirely.
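That ZigZag transform is small enough to show directly; a sketch of the 64-bit case with bigints (note that >> on bigints is an arithmetic shift, so n >> 63n smears the sign bit):

```javascript
// Protocol Buffers' ZigZag mapping: interleaves negatives and non-negatives
// (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...) so small magnitudes encode small.
const zigzagEncode = n => BigInt.asUintN(64, (n << 1n) ^ (n >> 63n))
const zigzagDecode = z => (z >> 1n) ^ -(z & 1n)
```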