Character codes

Links

The Snowball scripts on this site define the codings of accented letters and other non-ASCII forms in a series of explicit declarations. For example, the German stemmer includes the lines

    /* special characters (in ISO Latin I) */

    stringdef a"   hex 'E4'
    stringdef o"   hex 'F6'
    stringdef u"   hex 'FC'
    stringdef ss   hex 'DF'

In the ISO Latin I code set, hex E4, F6, FC and DF are the numeric values of characters ä, ö, ü and ß respectively. These codings in the stemmer scripts then correspond to the codings used in the sample data.
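As a quick check of these values, the following Python sketch decodes each byte with Python's `latin-1` codec and prints the Unicode name of the resulting character. It is an illustration only and not part of Snowball; the names `GERMAN_STRINGDEFS` and `decode_latin1` are ours.

```python
import unicodedata

# Hex values copied from the German stemmer's stringdef lines.
GERMAN_STRINGDEFS = {
    'a"': "E4",
    'o"': "F6",
    'u"': "FC",
    "ss": "DF",
}


def decode_latin1(hex_value):
    """Return the character that a one-byte hex code denotes in ISO Latin I."""
    return bytes.fromhex(hex_value).decode("latin-1")


decoded = {name: decode_latin1(h) for name, h in GERMAN_STRINGDEFS.items()}

for name, char in decoded.items():
    # Print the Unicode character name so the output stays plain ASCII.
    print(name, "->", unicodedata.name(char))
```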

For a more general approach, you may wish to replace the set of stringdefs with a get directive of the form

    get 'ISO-Latin-1'

possibly compiling with an -include option that declares the directory where this and other files are held,

    snowball gstem.sbl -o gstem ... -include /home/shazzer/snowball/codesets

Appropriate code sets for ISO Latin I are provided via the links above, and others will be added on demand or if submitted to us.

For Russian, two sets of stringdefs are given in the script: KOI8-R and (commented out) Unicode. For the other stemmers currently on offer, the Unicode placings correspond to the ISO Latin I placings, so no extra headers for Unicode need be given at present.
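The difference between the two cases can be checked with a short Python sketch, using Python's built-in `latin-1` and `koi8-r` codecs. The helper name `same_as_unicode` is hypothetical, and the sketch says nothing about how Snowball itself handles these code sets.

```python
def same_as_unicode(codec):
    """True if every byte 0x00..0xFF decodes to the code point of the same value."""
    for value in range(256):
        try:
            char = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            return False  # byte is undefined in this code set
        if ord(char) != value:
            return False
    return True


# ISO Latin I placings coincide with Unicode, so one set of values serves both.
latin1_matches = same_as_unicode("latin-1")

# KOI8-R places Cyrillic letters at byte values unrelated to their code points.
koi8r_matches = same_as_unicode("koi8-r")

# Cyrillic small letter a: its KOI8-R byte versus its Unicode code point.
koi8r_byte = "\u0430".encode("koi8-r")[0]
unicode_point = ord("\u0430")

print("latin-1 matches Unicode:", latin1_matches)
print("koi8-r matches Unicode:", koi8r_matches)
print("Cyrillic a: KOI8-R %02X, Unicode %03X" % (koi8r_byte, unicode_point))
```

This is why Russian needs a separate set of declarations for each encoding, while the Latin-alphabet stemmers do not.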

If you wish to describe other Latin-alphabet based codesets for use in Snowball headers, you should adhere to the following conventions:

accent          ASCII form      example
acute           single quote    e' for é
grave           grave           a` for à
umlaut          double quote    u" for ü
circumflex      circumflex      i^ for î
cedilla         comma           c, for ç
tilde           tilde           n~ for ñ
ring            letter o        ao for å
line through    solidus         o/ for ø
breve           plus            a+ for ă
double acute    letter q        oq for ő

And, should they ever arise, use r for left and right hook (as in Polish), and v for hacek (as in Czech).

The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian o/, Icelandic d/ and Polish l/.

Use ae and ss for the æ ligature and the German ß, with upper case forms AE and SS. Use th for Icelandic thorn.
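As a rough illustration of these conventions, the Python sketch below produces stringdef lines in the style of the German example for any single-byte code set that Python can encode. The names `CONVENTION_NAMES`, `stringdef_line` and `header_lines` are hypothetical, the selection of characters is ours, and the output has not been run through the Snowball compiler.

```python
# Convention names from the table above, paired with the characters they denote.
CONVENTION_NAMES = {
    "e'": "\u00e9",  # e acute
    "a`": "\u00e0",  # a grave
    'u"': "\u00fc",  # u umlaut
    "i^": "\u00ee",  # i circumflex
    "c,": "\u00e7",  # c cedilla
    "n~": "\u00f1",  # n tilde
    "ao": "\u00e5",  # a ring
    "o/": "\u00f8",  # o with line through
    "a+": "\u0103",  # a breve
    "oq": "\u0151",  # o double acute
}


def stringdef_line(name, char, codec):
    """Format one stringdef for `char` in a single-byte code set."""
    encoded = char.encode(codec)  # raises UnicodeEncodeError if not in the set
    if len(encoded) != 1:
        raise ValueError("%r is not a single byte in %s" % (char, codec))
    return "stringdef %-4s hex '%02X'" % (name, encoded[0])


def header_lines(codec):
    """Collect stringdef lines for every listed character the code set contains."""
    lines = []
    for name, char in CONVENTION_NAMES.items():
        try:
            lines.append(stringdef_line(name, char, codec))
        except UnicodeEncodeError:
            continue  # character absent from this code set
    return lines


latin1_header = header_lines("latin-1")
latin2_header = header_lines("iso8859-2")  # ISO Latin 2, chosen as an example
print("\n".join(latin2_header))
```

Characters missing from a code set are skipped, so the breve and double acute forms appear in the ISO Latin 2 output but not in the ISO Latin I output.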