Snowball (since version 2.0) supports specifying non-ASCII characters using the standard Unicode notation U+XXXX, where XXXX is a string of hex digits. For example, a suffix removal rule from the Spanish stemmer could be written like so:
'a' 'o' '{U+00E1}' '{U+00ED}' '{U+00F3}'
( RV delete )
However, this doesn't make for very readable source code, so the Snowball scripts on this site define more mnemonic representations of the non-ASCII characters they use. For example, the Spanish stemmer includes the lines:
stringescapes {}
/* special characters */
stringdef a' '{U+00E1}' // a-acute
stringdef e' '{U+00E9}' // e-acute
stringdef i' '{U+00ED}' // i-acute
stringdef o' '{U+00F3}' // o-acute
stringdef u' '{U+00FA}' // u-acute
stringdef u" '{U+00FC}' // u-diaeresis
stringdef n~ '{U+00F1}' // n-tilde
Here U+00E1 is the Unicode notation for the code point with hexadecimal value E1, which is á, and so on. The code which follows then uses '{a'}' when it wants á, and similarly for the other accented characters, so the example shown above is actually written:
'a' 'o' '{a'}' '{i'}' '{o'}'
( RV delete )
Using literal UTF-8-encoded Unicode characters in strings in the source file may work in some cases, but isn't really supported - the Snowball compiler doesn't (currently at least) have the concept of "source character set", so at best you'll limit which programming languages your stemmer can be used with.
If you wish to describe other Latin-alphabet based codesets for use in stemmers we recommend using the following conventions:
accent | ASCII form | example
acute | single quote (') | '{e'}' for é
grave | back quote (`) | '{a`}' for à
umlaut | double quote (") | '{u"}' for ü
circumflex | circumflex (^) | '{i^}' for î
cedilla | letter c | '{cc}' for ç
tilde | tilde (~) | '{n~}' for ñ
ring | letter o | '{ao}' for å
line through | solidus (/) | '{o/}' for ø
breve | plus (+) | '{a+}' for ă
double acute | letter q | '{oq}' for ő
comma below | comma (,) | '{t,}' for ț
caron/hacek | letter v | '{cv}' for č
dot above | full stop/period (.) | '{e.}' for ė
macron | minus (-) | '{u-}' for ū
ogonek | letter k | '{uk}' for ų
without dot | no suffix | '{i}' for ı
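As an illustration of how these conventions translate into declarations, a stemmer for Romanian might define its non-ASCII letters roughly as follows. This is only a sketch: the particular set of letters is an assumption made for the example, and the code points are the standard Unicode ones for those letters.

```snowball
stringescapes {}

/* special characters (illustrative set, chosen for this example) */

stringdef a^   '{U+00E2}'  // a-circumflex
stringdef i^   '{U+00EE}'  // i-circumflex
stringdef a+   '{U+0103}'  // a-breve
stringdef s,   '{U+0219}'  // s-comma-below
stringdef t,   '{U+021B}'  // t-comma-below
```

With these in place, a string later in the script can be written as, say, '{a+}' rather than '{U+0103}'.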
The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian '{o/}', Icelandic '{d/}' and Polish '{l/}'.
Use '{ae}' and '{ss}' for the æ ligature and the German ß, with upper case forms '{AE}' and '{SS}'. Use '{th}' for Icelandic thorn.
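A minimal sketch of the corresponding declarations might look like the following. It assumes that the Icelandic '{d/}' refers to eth (ð) and that the lower case forms are the ones wanted; the upper case forms are left out here.

```snowball
stringescapes {}  // only needed once per source file

stringdef o/   '{U+00F8}'  // o with line through (Scandinavian)
stringdef d/   '{U+00F0}'  // eth (Icelandic), assumed meaning of d/
stringdef l/   '{U+0142}'  // l with line through (Polish)
stringdef ae   '{U+00E6}'  // ae ligature
stringdef ss   '{U+00DF}'  // sharp s (German)
stringdef th   '{U+00FE}'  // thorn (Icelandic)
```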
We used to recommend a comma (,) for cedilla, but we need a way to represent comma-below for Romanian, so we've repurposed the comma for that and now recommend the letter c for cedilla instead.
If you're writing a new stemmer, see below for a file of suitable stringdef lines you can cut and paste into your code.