The Snowball scripts on this site define the codings of accented letters and other non-ASCII forms in a series of explicit declarations. For example, the German stemmer includes the lines
/* special characters (in ISO Latin I) */
stringdef a" hex 'E4'
stringdef o" hex 'F6'
stringdef u" hex 'FC'
stringdef ss hex 'DF'
In the ISO Latin I code set, hex E4, F6, FC and DF are the numeric values of characters ä, ö, ü and ß respectively. These codings in the stemmer scripts then correspond to the codings used in the sample data.
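As a quick check of the codings above, the following sketch decodes each hex value as a single ISO Latin I byte. It uses Python only as a convenient way to inspect the code set; the dictionary `codes` and the helper `decode_latin1` are illustrative names and not part of Snowball.

```python
# Byte values copied from the stringdef lines of the German stemmer.
codes = {'a"': 0xE4, 'o"': 0xF6, 'u"': 0xFC, 'ss': 0xDF}


def decode_latin1(value: int) -> str:
    """Interpret one byte value as an ISO Latin I (ISO 8859-1) character."""
    return bytes([value]).decode("latin-1")


# Map each Snowball string name to the character its byte value stands for.
decoded = {name: decode_latin1(value) for name, value in codes.items()}

for name, char in decoded.items():
    print(f"{name} -> {char}")
```

If the sample data were held in a different code set, the same check with another codec name would show which byte values the stringdefs need.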
For a more general approach, you may wish to replace the set of stringdef declarations with a get directive that reads them from a separate file, possibly compiling with an -include option that declares the directory where this and other files are held,
snowball gstem.sbl -o gstem ... -include /home/shazzer/snowball/codesets
Appropriate code sets for ISO Latin I are provided via the links above, and others will be added on demand or if submitted to us.
For Russian, two sets of stringdefs are given in the script: KOI8-R and (commented out) Unicode. For the other stemmers currently on offer, the Unicode placings correspond to the ISO Latin I placings, so no extra headers for Unicode need be given at present.
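The difference between the two cases can be illustrated with a short sketch, again in Python and purely as an aid to inspection. It checks that every ISO Latin I byte value equals the Unicode code point of the character it encodes, and then shows that a Cyrillic letter has a KOI8-R byte value unrelated to its Unicode code point. The names `latin1_matches_unicode`, `koi8_byte` and `unicode_point` are illustrative only.

```python
def latin1_matches_unicode() -> bool:
    """True if each ISO Latin I byte value equals its Unicode code point."""
    return all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))


# Cyrillic small letter a, written as an escape to keep the source ASCII.
cyrillic_a = "\u0430"

# Its single-byte value in KOI8-R, and its Unicode code point.
koi8_byte = cyrillic_a.encode("koi8_r")[0]
unicode_point = ord(cyrillic_a)

print(latin1_matches_unicode())
print(hex(koi8_byte), hex(unicode_point))
```

This is why one set of stringdefs can serve both ISO Latin I and Unicode for the Latin-alphabet stemmers, while Russian needs a separate set for each coding.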
If you wish to describe other Latin-alphabet based codesets for use in Snowball headers, you should adhere to the following conventions:
|accent||symbol|
|acute||single quote|
|umlaut||double quote|
|ring||letter o|
|line through||solidus|
|double acute||letter q|
And, should they ever arise, use l and r for left and right hook (as in Polish), and v for hacek (as in Czech).
The ‘line-through’ accent covers a number of miscellaneous cases, among them d/ and the Polish l/. Use ae and ss for the æ ligature and the German ß, with upper case forms for the capitals, and th for Icelandic thorn.
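The naming conventions above can be applied mechanically when drafting a header for a new code set. The sketch below is one possible helper, written in Python and not part of the Snowball distribution: `ACCENT_SYMBOLS` and `stringdef_line` are hypothetical names, and only the accents that Unicode can decompose into a base letter plus a combining mark (acute, umlaut, ring, double acute) are handled. Line-through letters, ligatures and thorn would need explicit entries.

```python
import unicodedata

# Combining mark -> symbol used in the Snowball string name (per the conventions above).
ACCENT_SYMBOLS = {
    "\u0301": "'",  # acute        -> single quote
    "\u0308": '"',  # umlaut       -> double quote
    "\u030A": "o",  # ring         -> letter o
    "\u030B": "q",  # double acute -> letter q
}


def stringdef_line(char: str, codec: str) -> str:
    """Build a stringdef declaration for one accented letter in a single-byte codec."""
    decomposed = unicodedata.normalize("NFD", char)
    if len(decomposed) != 2 or decomposed[1] not in ACCENT_SYMBOLS:
        raise ValueError(f"no naming convention handled here for {char!r}")
    base, mark = decomposed
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {codec}")
    return f"stringdef {base}{ACCENT_SYMBOLS[mark]} hex '{encoded[0]:02X}'"


# Example: a few letters from ISO Latin I, and one from ISO Latin 2.
for ch, codec in [("ä", "latin-1"), ("é", "latin-1"), ("å", "latin-1"), ("ő", "iso8859_2")]:
    print(stringdef_line(ch, codec))
```

The generated lines are only a starting point; they should be checked against the code set's published table before being used in a header.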