Character codes

Snowball (since version 2.0) supports specifying non-ASCII characters using the standard Unicode notation U+XXXX where XXXX is a string of hex digits. For example, a suffix removal rule from the Spanish stemmer could be written like so:

            'a' 'o' '{U+00E1}' '{U+00ED}' '{U+00F3}'
                ( RV delete )

However, this doesn't make for very readable source code, so the Snowball scripts on this site define more mnemonic representations of the non-ASCII characters they use. For example, the Spanish stemmer includes the lines:

stringescapes {}

/* special characters */

stringdef a'   '{U+00E1}'  // a-acute
stringdef e'   '{U+00E9}'  // e-acute
stringdef i'   '{U+00ED}'  // i-acute
stringdef o'   '{U+00F3}'  // o-acute
stringdef u'   '{U+00FA}'  // u-acute
stringdef u"   '{U+00FC}'  // u-diaeresis
stringdef n~   '{U+00F1}'  // n-tilde

Here U+00E1 is the Unicode notation for the code point with hex value E1, which is á, and so on. The code that follows then uses '{a'}' wherever it wants á, and similarly for the other accented characters, so the example shown above is actually written:

            'a' 'o' '{a'}' '{i'}' '{o'}'
                ( RV delete )

Using literal UTF-8-encoded Unicode characters in strings in the source file may work in some cases, but isn't really supported: the Snowball compiler doesn't (currently at least) have the concept of a "source character set", so at best you'll limit which programming languages your stemmer can be used with.

If you wish to describe other Latin-alphabet-based codesets for use in stemmers, we recommend using the following conventions:

accent          ASCII form                example
acute           single quote (')          '{e'}' for é
grave           back quote (`)            '{a`}' for à
umlaut          double quote (")          '{u"}' for ü
circumflex      circumflex (^)            '{i^}' for î
cedilla         letter c                  '{cc}' for ç
tilde           tilde (~)                 '{n~}' for ñ
ring            letter o                  '{ao}' for å
line through    solidus (/)               '{o/}' for ø
breve           plus (+)                  '{a+}' for ă
double acute    letter q                  '{oq}' for ő
comma below     comma (,)                 '{t,}' for ț
caron/hacek     letter v                  '{cv}' for č
dot above       full stop/period (.)      '{e.}' for ė
macron          minus (-)                 '{u-}' for ū
ogonek          letter k                  '{uk}' for ų
without dot     no suffix                 '{i}' for ı

The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian '{o/}', Icelandic '{d/}' and Polish '{l/}'.

Use '{ae}' and '{ss}' for æ ligature and the German ß, with upper case forms '{AE}' and '{SS}'. Use '{th}' for Icelandic thorn.
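
As a rough sketch, the line-through and ligature conventions could be declared as follows. This assumes stringescapes {} as in the Spanish example. The code points are the standard Unicode values for the lower case letters, the reading of '{d/}' as eth is an assumption, and the selection is illustrative rather than copied from any particular stemmer.

```snowball
stringescapes {}

/* special characters (illustrative selection) */

stringdef o/   '{U+00F8}'  // o-slash (Scandinavian)
stringdef d/   '{U+00F0}'  // eth (assumed meaning of Icelandic d/)
stringdef l/   '{U+0142}'  // l-stroke (Polish)
stringdef ae   '{U+00E6}'  // ae ligature
stringdef ss   '{U+00DF}'  // German sharp s
stringdef th   '{U+00FE}'  // thorn (Icelandic)
```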

We used to recommend a comma (,) for cedilla, but we need a way to represent comma-below for Romanian, so we've repurposed the comma for that and now recommend the letter c for cedilla instead.
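
For instance, a stemmer that needs both comma-below and cedilla might declare them like this. It is a minimal sketch assuming stringescapes {} as in the Spanish example; the choice of characters is illustrative only.

```snowball
stringescapes {}

/* special characters (illustrative selection) */

stringdef s,   '{U+0219}'  // s-comma-below
stringdef t,   '{U+021B}'  // t-comma-below
stringdef cc   '{U+00E7}'  // c-cedilla
stringdef a+   '{U+0103}'  // a-breve
```

Rules would then refer to these as '{s,}', '{t,}', '{cc}' and '{a+}', in the same way that the Spanish stemmer uses '{a'}'.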

If you're writing a new stemmer, see below for a file of suitable stringdef lines you can cut and paste into your code.

Links