From UTF-16 to UTF-8… in JavaScript

A JavaScript String is a “finite ordered sequence of zero or more 16-bit unsigned integer values.” Usually these integer values are UTF-16 code units. The UTF-16 encoding uses one 16-bit unit for Unicode characters from U+0000 to U+FFFF, and two units for characters from U+10000 to U+10FFFF. Unfortunately all the usual String operations (length, charAt, charCodeAt, …) are defined in terms of these code units, so a character such as 𝄞 (U+1D11E MUSICAL SYMBOL G CLEF) appears as a pair of surrogate code units. This little detail makes it complicated to operate on Strings.
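For example, here is how the clef looks to the usual String API (the variable name clef is just for illustration):

    var clef = "\uD834\uDD1E";           // 𝄞, U+1D11E MUSICAL SYMBOL G CLEF
    clef.length;                         // 2: two UTF-16 code units, not one character
    clef.charCodeAt(0).toString(16);     // "d834" (high surrogate)
    clef.charCodeAt(1).toString(16);     // "dd1e" (low surrogate)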

This little JavaScript function encodes a string as an array of integers using UTF-8 encoding while taking surrogate pairs into account:
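A minimal sketch of such an encoder is shown below; the function name toUTF8Array is illustrative, but the surrogate-pair arithmetic matches the corrected expression discussed in the comments that follow.

    function toUTF8Array(str) {
        var utf8 = [];
        for (var i = 0; i < str.length; i++) {
            var charcode = str.charCodeAt(i);
            if (charcode < 0x80) {
                // one byte: 0xxxxxxx
                utf8.push(charcode);
            } else if (charcode < 0x800) {
                // two bytes: 110xxxxx 10xxxxxx
                utf8.push(0xc0 | (charcode >> 6),
                          0x80 | (charcode & 0x3f));
            } else if (charcode < 0xd800 || charcode >= 0xe000) {
                // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                utf8.push(0xe0 | (charcode >> 12),
                          0x80 | ((charcode >> 6) & 0x3f),
                          0x80 | (charcode & 0x3f));
            } else {
                // surrogate pair: combine the two code units into one code point.
                // The outer parentheses matter because + binds tighter than |.
                i++;
                charcode = (((charcode & 0x3ff) << 10)
                            | (str.charCodeAt(i) & 0x3ff))
                            + 0x10000;
                // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                utf8.push(0xf0 | (charcode >> 18),
                          0x80 | ((charcode >> 12) & 0x3f),
                          0x80 | ((charcode >> 6) & 0x3f),
                          0x80 | (charcode & 0x3f));
            }
        }
        return utf8;
    }

With this sketch, toUTF8Array("\uD834\uDD1E") returns [0xf0, 0x9d, 0x84, 0x9e], the UTF-8 encoding of U+1D11E.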

2 Comments

  1. Anonymous
    Posted 2013/08/03 at 18:18

    Hello,

    I just want to inform you that the following part of your code is wrong due to operator precedence:

    charcode = ((charcode & 0x3ff)<<10)
                          | (str.charCodeAt(i) & 0x3ff)
                          + 0x10000;
    

    has to be

    charcode = (((charcode & 0x3ff)<<10)
                          | (str.charCodeAt(i) & 0x3ff))
                          + 0x10000;
    

    Greetz

  2. Joni
    Posted 2013/08/06 at 14:09

    Hey you’re right, thanks, fixed.