Snowball (since version 2.0) supports specifying non-ASCII characters using
the standard Unicode notation U+XXXX, where XXXX is a string of
hex digits. However, this doesn't make for very readable source code, so the
Snowball scripts on this site define more mnemonic representations of the
non-ASCII characters which they use - for example, the German stemmer includes
the lines
/* special characters */
stringdef a" '{U+00E4}'
stringdef o" '{U+00F6}'
stringdef u" '{U+00FC}'
stringdef ss '{U+00DF}'
(In Unicode, hex values E4, F6, FC and DF are the numeric values of characters ä, ö, ü and ß respectively.)
The code which follows then writes '{a"}' wherever it wants ä, and likewise for the other characters.
Using literal Unicode characters in strings in the source file may work in some cases, but isn't really supported - the Snowball compiler doesn't (currently at least) have the concept of "source character set", so at best you'll limit which programming languages your stemmer can be used with.
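One way to catch literal non-ASCII characters before they cause trouble is to scan the source for them. The following is a minimal Python sketch, not part of the Snowball distribution; the function name `find_non_ascii` and the sample text are made up for illustration.

```python
def find_non_ascii(source):
    """Return (line number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                hits.append((line_no, col, ch))
    return hits


# Hypothetical two-line sample: the second line contains a literal U+00E4.
sample = "stringdef a\" '{U+00E4}'\ndefine v 'aeiouy\u00E4'\n"

for line_no, col, ch in find_non_ascii(sample):
    print("line {}, column {}: U+{:04X}; consider a stringdef instead".format(
        line_no, col, ord(ch)))
```

Each reported character is a candidate for a stringdef line of the kind shown above.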
If you wish to describe other Latin-alphabet based codesets for use in stemmers we recommend using the following conventions:
accent | ASCII form | example
acute | single quote (') | e' for é
grave | back quote (`) | a` for à
umlaut | double quote (") | u" for ü
circumflex | circumflex (^) | i^ for î
cedilla | letter c | cc for ç
tilde | tilde (~) | n~ for ñ
ring | letter o | ao for å
line through | solidus (/) | o/ for ø
breve | plus (+) | a+ for ă
double acute | letter q | oq for ő
comma below | comma (,) | t, for ț
caron/hacek | letter v | cv for č
dot above | full stop/period (.) | e. for ė
macron | minus (-) | u- for ū
ogonek | letter k | uk for ų
without dot | no suffix | i for ı
The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian o/, Icelandic d/ and Polish l/.
Use ae and ss for the æ ligature and the German ß, with upper case forms AE and SS. Use th for Icelandic thorn.
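The conventions above are regular enough to apply mechanically. As a rough sketch (in Python, and not part of the Snowball distribution), the code below derives a mnemonic name from a character's Unicode canonical decomposition and prints a matching stringdef line. The names `SUFFIX`, `SPECIAL`, `mnemonic` and `stringdef_line` are hypothetical. The sketch handles lower-case single-accent letters only, and it assumes that d/ and l/ mean ð and ł.

```python
import unicodedata

# Combining mark -> ASCII suffix, transcribed from the table above.
SUFFIX = {
    "\u0301": "'",   # acute
    "\u0300": "`",   # grave
    "\u0308": '"',   # umlaut
    "\u0302": "^",   # circumflex
    "\u0327": "c",   # cedilla
    "\u0303": "~",   # tilde
    "\u030A": "o",   # ring
    "\u0306": "+",   # breve
    "\u030B": "q",   # double acute
    "\u0326": ",",   # comma below
    "\u030C": "v",   # caron/hacek
    "\u0307": ".",   # dot above
    "\u0304": "-",   # macron
    "\u0328": "k",   # ogonek
}

# Characters with no canonical decomposition, named as in the text.
SPECIAL = {
    "\u00F8": "o/",  # ø
    "\u00F0": "d/",  # ð (assumed to be the Icelandic d/)
    "\u0142": "l/",  # ł (assumed to be the Polish l/)
    "\u00E6": "ae",  # æ
    "\u00DF": "ss",  # ß
    "\u00FE": "th",  # þ
    "\u0131": "i",   # ı (dotless i, no suffix)
}


def mnemonic(ch):
    """Return the ASCII mnemonic for a single accented character."""
    if ch in SPECIAL:
        return SPECIAL[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) == 2 and decomposed[1] in SUFFIX:
        return decomposed[0] + SUFFIX[decomposed[1]]
    raise ValueError("no convention known for U+{:04X}".format(ord(ch)))


def stringdef_line(ch):
    """Format a stringdef line in the style used by the German stemmer."""
    return "stringdef {} '{{U+{:04X}}}'".format(mnemonic(ch), ord(ch))


# The four German characters: a-umlaut, o-umlaut, u-umlaut, sharp s.
for ch in "\u00E4\u00F6\u00FC\u00DF":
    print(stringdef_line(ch))
```

For the four German characters, this prints the same stringdef lines as the German stemmer excerpt shown earlier. Anything outside the table raises a ValueError rather than guessing a name.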
We used to recommend a comma (,) for cedilla, but we need a way to represent comma-below for Romanian, so we've repurposed the comma for that and now recommend c for cedilla instead.
If you're writing a new stemmer, see below for a file of suitable stringdef lines you can cut and paste into your code.