Character codes

Snowball (since version 2.0) supports specifying non-ASCII characters using the standard Unicode notation U+XXXX where XXXX is a string of hex digits. For example, a suffix removal rule from the Spanish stemmer could be written like so:

            'a' 'o' '{U+00E1}' '{U+00ED}' '{U+00F3}'
                ( RV delete )

However, this doesn't make for very readable source code, so the Snowball scripts on this site define more mnemonic representations of the non-ASCII characters they use. For example, the Spanish stemmer includes the lines:

stringescapes {}

/* special characters */

stringdef a'   '{U+00E1}'  // a-acute
stringdef e'   '{U+00E9}'  // e-acute
stringdef i'   '{U+00ED}'  // i-acute
stringdef o'   '{U+00F3}'  // o-acute
stringdef u'   '{U+00FA}'  // u-acute
stringdef u"   '{U+00FC}'  // u-diaeresis
stringdef n~   '{U+00F1}'  // n-tilde

U+00E1 is the Unicode notation for the code point with hex value E1, which is á, and so on. The code which follows then uses '{a'}' when it wants á, and similarly for the other accented characters, so the example shown above is actually written:

            'a' 'o' '{a'}' '{i'}' '{o'}'
                ( RV delete )

Using literal UTF-8-encoded Unicode characters in strings in the source file may work in some cases, but isn't really supported. The Snowball compiler doesn't (currently at least) have the concept of a "source character set", so at best you'll limit which programming languages your stemmer can be used with.

If you wish to describe other Latin-alphabet based codesets for use in stemmers we recommend using the following conventions:

accent	ASCII form	example
acute	single quote (')	`'{e'}'` for é
grave	back quote (`)	``'{a`}'`` for à
umlaut	double quote (")	`'{u"}'` for ü
circumflex	circumflex (^)	`'{i^}'` for î
cedilla	letter c	`'{cc}'` for ç
tilde	tilde (~)	`'{n~}'` for ñ
ring	letter o	`'{ao}'` for å
line through	solidus (/)	`'{o/}'` for ø
breve	plus (+)	`'{a+}'` for ă
double acute	letter q	`'{oq}'` for ő
comma below	comma (,)	`'{t,}'` for ț
caron/hacek	letter v	`'{cv}'` for č
dot above	full stop/period (.)	`'{e.}'` for ė
macron	minus (-)	`'{u-}'` for ū
ogonek	letter k	`'{uk}'` for ų
without dot	no suffix	`'{i}'` for ı

The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian '{o/}', Icelandic '{d/}' and Polish '{l/}'.

Use '{ae}' and '{ss}' for the æ ligature and the German ß, with upper case forms '{AE}' and '{SS}'. Use '{th}' for Icelandic thorn.

We used to recommend a comma (,) for cedilla, but we need a way to represent comma-below for Romanian, so we've repurposed the comma for that and now recommend the letter c for cedilla instead.
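As a rough sketch of how these conventions translate into definitions, the lines below follow the same pattern as the Spanish stemmer excerpt shown earlier. The selection of characters is only illustrative and is not taken from any particular stemmer; a real script would define just the characters it needs.

```snowball
stringescapes {}

/* special characters (illustrative selection only) */

stringdef a`   '{U+00E0}'  // a-grave
stringdef i^   '{U+00EE}'  // i-circumflex
stringdef cc   '{U+00E7}'  // c-cedilla
stringdef ao   '{U+00E5}'  // a-ring
stringdef o/   '{U+00F8}'  // o-line-through
stringdef a+   '{U+0103}'  // a-breve
stringdef oq   '{U+0151}'  // o-double-acute
stringdef t,   '{U+021B}'  // t-comma-below
stringdef cv   '{U+010D}'  // c-caron
stringdef e.   '{U+0117}'  // e-dot-above
stringdef u-   '{U+016B}'  // u-macron
stringdef uk   '{U+0173}'  // u-ogonek
stringdef i    '{U+0131}'  // i-without-dot
stringdef ae   '{U+00E6}'  // ae-ligature
stringdef ss   '{U+00DF}'  // German sharp s
stringdef th   '{U+00FE}'  // Icelandic thorn
```

With definitions like these in place, a string such as '{cv}' in the rest of the script stands for č, in the same way that '{a'}' stands for á in the Spanish example.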

If you're writing a new stemmer, see below for a file of suitable stringdef lines you can cut and paste into your code.

Character codes
