Character codes

Snowball (since version 2.0) supports specifying non-ASCII characters using the standard Unicode notation U+XXXX where XXXX is a string of hex digits. However, this doesn't make for very readable source code, so the Snowball scripts on this site define more mnemonic representations of the non-ASCII characters which they use - for example, the German stemmer includes the lines

    /* special characters */

    stringdef a"   '{U+00E4}'
    stringdef o"   '{U+00F6}'
    stringdef u"   '{U+00FC}'
    stringdef ss   '{U+00DF}'

(In Unicode, hex values E4, F6, FC and DF are the numeric values of characters ä, ö, ü and ß respectively.)

Then the code which follows uses '{a"}' when it wants ä, etc.
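
As a rough illustration of that usage, here is a minimal sketch. It is not taken verbatim from the German stemmer, and the routine name fold_special is hypothetical. Note that the brace notation is only recognised once the escape delimiters have been declared with stringescapes {}, which the excerpt above leaves out.

    stringescapes {}

    /* special characters */

    stringdef a"   '{U+00E4}'
    stringdef ss   '{U+00DF}'

    routines ( fold_special )      // hypothetical routine name
    externals ( stem )

    define fold_special as repeat (
        ( ['{ss}'] <- 'ss' ) or    // rewrite ß as ss
        ( ['{a"}'] <- 'a' ) or     // rewrite ä as a
        next                       // otherwise move on one character
    )

    define stem as (
        do fold_special
    )

The source stays pure ASCII, and each U+XXXX value appears only once, in its stringdef line.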

Using literal Unicode characters in strings in the source file may work in some cases, but isn't really supported: the Snowball compiler doesn't (currently at least) have the concept of a "source character set", so at best you'll limit which programming languages your stemmer can be used with.

If you wish to describe other Latin-alphabet based codesets for use in stemmers we recommend using the following conventions:

    accent          ASCII form      example
    acute           single quote    e' for é
    grave           grave           a` for à
    umlaut          double quote    u" for ü
    circumflex      circumflex      i^ for î
    cedilla         comma           c, for ç
    tilde           tilde           n~ for ñ
    ring            letter o        ao for å
    line through    solidus         o/ for ø
    breve           plus            a+ for ă
    double acute    letter q        oq for ő
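
Written out as stringdef lines, the example column above might look like the following sketch. The code points are the standard Unicode values for each character, and the stringescapes {} declaration is assumed so that the brace notation is recognised.

    stringescapes {}

    stringdef e'   '{U+00E9}'  // e-acute
    stringdef a`   '{U+00E0}'  // a-grave
    stringdef u"   '{U+00FC}'  // u-umlaut
    stringdef i^   '{U+00EE}'  // i-circumflex
    stringdef c,   '{U+00E7}'  // c-cedilla
    stringdef n~   '{U+00F1}'  // n-tilde
    stringdef ao   '{U+00E5}'  // a-ring
    stringdef o/   '{U+00F8}'  // o-slash
    stringdef a+   '{U+0103}'  // a-breve
    stringdef oq   '{U+0151}'  // o-double acute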

And, should they ever arise, use  r  for left and right hook (as in Polish), and  v  for hacek (as in Czech).

The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian  o/, Icelandic  d/  and Polish  l/.

Use  ae  and  ss  for æ ligature and the German ß, with upper case forms  AE  and  SS. Use  th  for Icelandic thorn.

If you're writing a new stemmer, see below for a file of suitable stringdef lines you can cut and paste into your code.