Behind Teamemo: Using a C-library in JavaScript

We recently integrated spellchecking into Teamemo, so we went looking for a good JavaScript spellchecking library. Several looked promising, for example nspell. It wasn't very popular, but it was well documented and based on hunspell dictionaries. Hunspell is the spellchecking library used in Apache OpenOffice, LibreOffice and major browsers such as Firefox and Chrome.

After some tests it turned out that nspell works very well for English text but not so well for German. While LibreOffice spellchecked German texts correctly, nspell incorrectly marked a lot of words as misspelled. The problem was that nspell did not support all of the word combination rules in the .aff files supplied with the dictionaries, and so had trouble with compound words. Adding support for the missing rules to nspell would have been too hard, so we dropped it as a solution.

We couldn't find a pure JavaScript library with better support for affix rules. There are server-side solutions such as nodehun, which provides Node.js bindings for the hunspell C library. However, server-side spellchecking was not an option for us because of the higher latency, the lack of offline support and the higher server load. So we decided to try something new: compiling the real hunspell C library to JavaScript.

There is an awesome compiler called emscripten which compiles C/C++ code to JavaScript. Strictly speaking, emscripten works on LLVM bitcode, which can be generated from many programming languages. It claims that the generated JS code runs at near-native speed, so we wanted to give it a try.

Teamemo is modern wiki software featuring collaborative editing in an innovative WYSIWYG editor.

We cloned the hunspell repository and installed the emscripten SDK. Or rather, for convenience we used an emscripten Docker image, so the emscripten compiler was just one command away:

docker run --rm -v $(pwd):/src -ti apiaryio/emcc bash

Emscripten provides its own C/C++ compiler for the build toolchain, plus some helper scripts that set the correct environment variables for building projects with GNU make:

emconfigure ./configure
emmake make

This compiles the hunspell C/C++ code to LLVM bitcode, which can then be compiled to JavaScript. The final build step to get a JS file is to invoke em++ directly:

em++ \
  -O3 \
  -Oz \
  --llvm-lto 1 \
  -s NO_EXIT_RUNTIME=1 \
  -s EXPORTED_FUNCTIONS="['_Hunspell_create', '_Hunspell_destroy', '_Hunspell_spell']" \
  ../src/hunspell/.libs/libhunspell-1.6.a \
  -o hunspell.js

The first four parameters are optimization flags. EXPORTED_FUNCTIONS defines which functions are exported to the JavaScript world. If you change the output file from hunspell.js to hunspell.html, emscripten generates an HTML page that contains the required setup and includes the compiled JS, so you can test the compiled code directly by firing up a simple web server that serves the files:

python3 -m http.server 8080

However, calling the C functions from JS requires some wrapper code. The simplest way is to use Module.cwrap. You only have to provide the function name, the return type and the parameter types:

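// Pointers (such as the hunspell handle) are passed as 'number'.
var Hunspell_create = Module.cwrap('Hunspell_create', 'number', ['string', 'string']);
var Hunspell_spell = Module.cwrap('Hunspell_spell', 'number', ['number', 'string']);
// Hunspell_destroy returns void, so the return type is null.
var Hunspell_destroy = Module.cwrap('Hunspell_destroy', null, ['number']);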

You can now directly call the compiled C-Code from JS:

var handle = Hunspell_create('dictionary.aff', 'dictionary.dic');
console.log(Hunspell_spell(handle, 'mispelled'));
Hunspell_destroy(handle);
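Because the handle is a raw pointer into the compiled module's memory, it is easy to leak it when an error occurs between create and destroy. The following is a minimal sketch of a helper that always frees the handle. The helper name withHunspell and the shape of the api object are assumptions made for illustration; they are not part of hunspell or emscripten.

```javascript
// Hypothetical helper: runs `fn` with a live hunspell handle and always frees it.
// `api` is assumed to hold the three cwrap'ed functions as create/spell/destroy.
function withHunspell(api, affPath, dicPath, fn) {
  var handle = api.create(affPath, dicPath);
  try {
    // Hand the callback a simple (word) => boolean checker.
    return fn(function (word) {
      // Hunspell_spell returns 0 for a misspelled word, non-zero otherwise.
      return api.spell(handle, word) !== 0;
    });
  } finally {
    // Runs even if `fn` throws, so the handle is never leaked.
    api.destroy(handle);
  }
}

// Possible usage with the wrappers defined above:
//   var api = { create: Hunspell_create, spell: Hunspell_spell, destroy: Hunspell_destroy };
//   withHunspell(api, 'dictionary.aff', 'dictionary.dic', function (isCorrect) {
//     console.log(isCorrect('mispelled'));
//   });
```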

There is just one thing missing: the dictionary files are not present in emscripten's virtual file system, so hunspell can't open them. You can either specify the files in the compilation step using the --embed-file option, or download them manually using XHR and write the content to a file with FS.writeFile.
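The download variant could look roughly like the sketch below. It uses fetch in place of XHR, and the function names, the baseUrl parameter and the file naming scheme are assumptions for illustration. The fs argument is assumed to be the FS object of the loaded emscripten module.

```javascript
// Copy downloaded dictionary bytes into emscripten's in-memory file system.
function writeDictionary(fs, path, arrayBuffer) {
  // Assumption: older emscripten releases need the 'binary' encoding option
  // for typed arrays; check the behaviour of your version.
  fs.writeFile(path, new Uint8Array(arrayBuffer), { encoding: 'binary' });
}

// Download <name>.aff and <name>.dic and write both into the virtual file system.
// Returns a promise that resolves once both files are in place.
function loadDictionary(fs, baseUrl, name) {
  var files = [name + '.aff', name + '.dic'];
  return Promise.all(files.map(function (file) {
    return fetch(baseUrl + '/' + file)
      .then(function (res) {
        if (!res.ok) throw new Error('Download failed: ' + file);
        return res.arrayBuffer();
      })
      .then(function (buf) {
        writeDictionary(fs, file, buf);
      });
  }));
}
```

Hunspell_create should only be called after the returned promise has resolved, since the files do not exist in the virtual file system before that.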

Results

The compiled JS file has a size of 850 kB. In addition, a file containing the initial memory must be loaded, which is around 150 kB. So the working example including the wrapper code comes to around 1 MB, which is OK for desktop browsers but could be problematic for some mobile devices.

The performance is very good: we could not find any noticeable performance problem, even for long texts. You can check the performance yourself in Teamemo.

Bonus

Instead of using Module.cwrap, you can write the wrapper for your C function yourself. Here is an example for the spell function. It demonstrates allocating a string on the stack, calling the function and restoring the stack.

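// Look up the raw compiled function, bypassing cwrap's argument conversion.
var _Hunspell_spell = getCFunc('Hunspell_spell');

// Copy a JS string onto the emscripten stack as NUL-terminated UTF-8.
function allocStr(str) {
  // Worst case: 4 bytes per UTF-16 code unit, plus the terminating NUL.
  var len = (str.length << 2) + 1;
  var ret = Runtime.stackAlloc(len);
  stringToUTF8(str, ret, len);
  return ret;
}

// `handle` is the pointer returned earlier by Hunspell_create.
function Hunspell_spell(word) {
  var stack = Runtime.stackSave();
  var wordPtr = allocStr(word);
  var ret = _Hunspell_spell(handle, wordPtr);
  // Release everything allocated on the stack since stackSave().
  Runtime.stackRestore(stack);
  // Hunspell returns non-zero for a correctly spelled word.
  return !!ret;
}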
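The buffer size in allocStr is a deliberate over-estimate: (str.length << 2) + 1 reserves four bytes per UTF-16 code unit plus one byte for the terminator. The small sketch below compares that bound with the exact encoded size. It runs without emscripten, relies on the standard TextEncoder API, and the two function names are made up for this illustration.

```javascript
// Upper bound used by allocStr: 4 bytes per UTF-16 code unit plus a NUL terminator.
function maxUtf8Bytes(str) {
  return (str.length << 2) + 1;
}

// Exact size: the encoded UTF-8 bytes plus the NUL terminator.
function exactUtf8Bytes(str) {
  return new TextEncoder().encode(str).length + 1;
}

// Print exact size and upper bound for a few sample words.
['word', 'Straße', '日本語'].forEach(function (s) {
  console.log(s, exactUtf8Bytes(s), '<=', maxUtf8Bytes(s));
});
```

Since the stack is restored right after the call, the extra bytes cost nothing beyond temporary stack space, which is why the simple bound is a reasonable choice here.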

Since spellchecking is quite expensive and emscripten pollutes the global scope extensively, we run our spellchecker in a Web Worker. You can find the full example code in the emscripten-hunspell git repository. Visit teamemo.com to see the spellchecking in action.
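To give an idea of what the worker side might look like, here is a minimal sketch of a message handler. It is not the code from the repository: the message format ({ id, words } in, { id, misspelled } out) and the function name createSpellHandler are assumptions for illustration.

```javascript
// Builds the worker's message handler. `isCorrect` is assumed to be a
// (word) => boolean function backed by the compiled hunspell module.
function createSpellHandler(isCorrect) {
  return function handleMessage(data) {
    var misspelled = data.words.filter(function (word) {
      return !isCorrect(word);
    });
    // Echo the request id so the page can match replies to requests.
    return { id: data.id, misspelled: misspelled };
  };
}

// Hypothetical wiring inside the worker script:
//   importScripts('hunspell.js');
//   var handleMessage = createSpellHandler(Hunspell_spell);
//   self.onmessage = function (e) { self.postMessage(handleMessage(e.data)); };
```

Keeping the handler free of worker globals means it can be tested outside a worker, and the page only ever exchanges plain messages with the compiled code.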

Since WebAssembly is on its way into all major browsers, the next step would be to use wasm as an alternative in browsers that support it.
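Choosing between the two builds at runtime could be done with a simple feature check, sketched below. The file names hunspell.wasm.js and hunspell.asm.js are hypothetical, as is the idea of shipping two separate loader scripts.

```javascript
// Returns true if the given global object exposes a usable WebAssembly API.
function supportsWebAssembly(globalObj) {
  var wasm = globalObj.WebAssembly;
  return typeof wasm === 'object' && wasm !== null &&
    typeof wasm.instantiate === 'function';
}

// Pick which build to load. Both file names are hypothetical.
function pickBuild(globalObj) {
  return supportsWebAssembly(globalObj) ? 'hunspell.wasm.js' : 'hunspell.asm.js';
}

// Possible usage inside a worker:
//   importScripts(pickBuild(self));
```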