Getting the UTF-16 bytes of JavaScript strings


This post assumes you understand UTF-16.

Recently, I wanted to get the UTF-16 bytes of a JavaScript string for a demo I was working on. I couldn’t find anyone else who had done this, so I thought I’d write this post.

I just want the length

If you need the number of UTF-16 code units, just use the length of the string! The length property returns the number of UTF-16 code units, so you can do this:

function numberOfUtf16CodeUnits(str) {
  return str.length;
}

numberOfUtf16CodeUnits("hello world");
// => 11
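One thing worth keeping in mind is that code units aren’t the same as characters. Here’s a quick sketch using an emoji outside the Basic Multilingual Plane, which takes two code units (a surrogate pair). The variable names are just for illustration.

```javascript
// "🌍" (U+1F30D) is encoded as two UTF-16 code units.
const globe = "🌍";

// length counts code units...
const codeUnitCount = globe.length;

// ...while string iteration walks code points.
const codePointCount = [...globe].length;

console.log(codeUnitCount, codePointCount);
// => 2 1
```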

Because UTF-16 code units are always two bytes each, you can just multiply the length by two to get the number of UTF-16 bytes.

function numberOfUtf16Bytes(str) {
  return str.length * 2;
}

numberOfUtf16Bytes("hello world");
// => 22

I want the bytes

However, I didn’t just want the number of UTF-16 bytes: I wanted the actual bytes!

My goal was to write a generator function that yielded each UTF-16 byte.

First, I started by generating the string’s UTF-16 code units.

function* utf16CodeUnits(str) {
  for (let i = 0; i < str.length; i++) {
    const charCode = str.charCodeAt(i);
    yield charCode;
  }
}

[...utf16CodeUnits("hi 🌍")];
// => [104, 105, 32, 55356, 57101]

Now I needed to turn these into bytes. I did a little bit of masking and shifting to turn each two-byte number into two one-byte numbers:

function* utf16Bytes(str) {
  for (let i = 0; i < str.length; i++) {
    const charCode = str.charCodeAt(i);

    // Get the most significant byte.
    // For example, given 0x1234, yield 0x12.
    yield (charCode & 0xff00) >> 8;

    // Get the least significant byte.
    // For example, given 0x1234, yield 0x34.
    yield charCode & 0x00ff;
  }
}

[...utf16Bytes("hi 🌍")];
// => [0, 104, 0, 105, 0, 32, 216, 60, 223, 13]
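To connect those decimal values back to the code units, here’s a small sketch that splits a single code unit by hand. It uses the first code unit of the emoji (the high surrogate); the variable names are my own.

```javascript
// 55356 (0xd83c) is the first code unit of "🌍".
const codeUnit = "🌍".charCodeAt(0);

// Mask off the top byte and shift it down: 0xd83c becomes 0xd8.
const high = (codeUnit & 0xff00) >> 8;

// Mask off the bottom byte: 0xd83c becomes 0x3c.
const low = codeUnit & 0x00ff;

console.log(codeUnit.toString(16), high, low);
// => d83c 216 60
```

Those are the 216 and 60 that show up in the output above.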

And that’s it! I could now get the UTF-16 bytes of a JavaScript string.

I want the results as a buffer

My solution uses a generator. If you want the results as a Uint8Array, simply pass the result to the Uint8Array constructor:

new Uint8Array(utf16Bytes("hi 🌍"));
// => Uint8Array(10) [0, 104, 0, 105, ...]
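If you’d rather skip the generator entirely, one alternative is to write each code unit into an ArrayBuffer with a DataView. This is a sketch of a different technique, not the approach above, and utf16BytesViaDataView is a name I made up for illustration. The third argument to setUint16 picks the byte order.

```javascript
// Hypothetical helper: encodes a string as UTF-16 bytes using a DataView.
function utf16BytesViaDataView(str, littleEndian = false) {
  // Every UTF-16 code unit takes two bytes.
  const buffer = new ArrayBuffer(str.length * 2);
  const view = new DataView(buffer);

  for (let i = 0; i < str.length; i++) {
    // true writes little endian, false writes big endian.
    view.setUint16(i * 2, str.charCodeAt(i), littleEndian);
  }

  return new Uint8Array(buffer);
}

utf16BytesViaDataView("hi 🌍");
// => Uint8Array(10) [0, 104, 0, 105, 0, 32, 216, 60, 223, 13]
```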

Alternatively, if you’re using Node, you can get the results as a Buffer (see below).

I want the little endian bytes

My solution yields big endian results (UTF-16BE), not little endian (UTF-16LE). If you want little endian results, you can just switch the order of the yields.

function* utf16LeBytes(str) {
  for (let i = 0; i < str.length; i++) {
    const charCode = str.charCodeAt(i);

    // Get the least significant byte.
    // For example, given 0x1234, yield 0x34.
    yield charCode & 0x00ff;

    // Get the most significant byte.
    // For example, given 0x1234, yield 0x12.
    yield (charCode & 0xff00) >> 8;
  }
}

[...utf16LeBytes("hi 🌍")];
// => [104, 0, 105, 0, 32, 0, 60, 216, 13, 223]

You can also attach the byte order mark to your result by adding a couple of yields at the beginning.
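Here’s a minimal sketch of what that could look like for the little endian version. The byte order mark is U+FEFF, so in little endian order it’s 0xff followed by 0xfe (for big endian, you’d flip those). The function name is just one I picked.

```javascript
function* utf16LeBytesWithBom(str) {
  // The byte order mark, U+FEFF, in little endian order.
  yield 0xff;
  yield 0xfe;

  for (let i = 0; i < str.length; i++) {
    const charCode = str.charCodeAt(i);

    // Least significant byte first, then most significant.
    yield charCode & 0x00ff;
    yield (charCode & 0xff00) >> 8;
  }
}

[...utf16LeBytesWithBom("hi")];
// => [255, 254, 104, 0, 105, 0]
```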

Alternatively, if you’re using Node, keep reading.

I’m using Node

Node’s built-in Buffer.from() can do this much quicker.

To get the UTF-16LE bytes of a string in Node:

import { Buffer } from "node:buffer";

Buffer.from("hi 🌍", "utf16le");
// => <Buffer 68 00 69 00 20 00 3c d8 0d df>

Node only supports UTF-16LE natively, but there’s an easy way to get the big endian bytes: just call swap16() on the result.

Buffer.from("hi 🌍", "utf16le").swap16();
// => <Buffer 00 68 00 69 00 20 d8 3c df 0d>
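Two things are worth checking here. First, you can decode the little endian bytes back into the original string with toString(). Second, swap16() swaps the bytes in place and returns the same buffer rather than a copy, so don’t call it on a buffer you still need in its original order. A small sketch (Buffer is also a global in Node, so I’ve left out the import here):

```javascript
const original = "hi 🌍";

// Encode to UTF-16LE, then decode back to a string.
const le = Buffer.from(original, "utf16le");
const roundTripped = le.toString("utf16le");
console.log(roundTripped === original);
// => true

// swap16() mutates the buffer and returns a reference to it.
const be = Buffer.from(original, "utf16le");
const swapped = be.swap16();
console.log(swapped === be);
// => true
```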

I want something else

I also needed to get the UTF-8 and UTF-32 bytes for my little demo, so I wrote up how to do those too: