Snowball Manual

Snowball definition

Snowball is a small string-handling language, and its name was chosen as a tribute to SNOBOL (Farber 1964, Griswold 1968 — see the references at the end of the introduction), with which it shares the concept of string patterns delivering signals that are used to control the flow of the program.

1 Data types

The basic data types handled by Snowball are strings of characters, signed integers, and boolean truth values, or more simply strings, integers and booleans. Snowball supports Unicode characters, which may be represented as UTF-8, 8-bit characters, or 16-bit wide characters (which of these are available depends on the programming language that code is being generated for; for C, all of these options are supported).

2 Names

A name in Snowball starts with an ASCII letter, followed by zero or more ASCII letters, digits and underscores. A name can be of type string, integer, boolean, routine, external or grouping. All names must be declared. A declaration has the form

    Ts ( ... )

where symbol  T  is one of  string,  integer  etc, and the region in brackets contains a list of names separated by whitespace. For example,

    integers ( p1 p2 )
    booleans ( Y_found )

    routines (
       shortv
       R1 R2
       Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5a Step_5b
    )

    externals ( stem )

    groupings ( v v_WXY v_LSZ )

p1  and  p2  are integers,  Y_found  is boolean, and so on. Snowball is quite strict about the declarations, so all the names go in the same name space, no name may be declared twice, all used names must be declared, no two routine definitions can have the same name, etc. Names declared and subsequently not used are merely reported in a warning message.

A name may not be one of the reserved words of Snowball. Additionally, names for externals must be valid function/method names in the language being generated in most cases, which generally means they can't be reserved words in that language (e.g. externals (null) will generate invalid Java code containing a method public boolean null().) For internal symbols we add a prefix to avoid this issue, but an external has to provide an external interface. When generating C code, the -eprefix option provides a potential solution to this problem.

Names in Snowball are case-sensitive, but external names which differ only in case will cause a problem for languages with case-insensitive identifiers (such as Pascal). This issue is avoided for internal symbols in such languages by encoding case difference via an added prefix.

So for portability a little care is needed when choosing names for externals. The convention when using Snowball to implement stemming algorithms is to have a single external named stem, which should be safe.

3 Literals

3.1 Integer Literals

A literal integer is an ASCII digit sequence, and is always interpreted as decimal.

3.2 String Literals

A literal string is written between single quotes, for example,

    'aeiouy'

Two special insert characters for use in literal strings are defined by the directive stringescapes AB , for example,

    stringescapes {}

Conventionally { and } are used as the insert characters, and we would recommend following this convention unless you want to use these as literal characters in your strings a lot. However,  A  and  B  can be any printing characters, except that  A  can't be a single quote. (If A  and  B are the same then  A  itself can never be escaped.)

A subsequent occurrence of the stringescapes directive redefines the insert characters (but any string macros already defined with stringdef remain defined).

Within insert characters, the following sequences are understood:

  • User-defined string macros which can be specified using stringdef. Macro  m  is defined in the form  stringdef m 'S', where  'S'  is a string, and  m  a sequence of one or more printing characters. Thereafter,  {m}  inside a string causes  S  to be substituted in place of  m.

  • New in Snowball 2.0: Unicode codepoints can be specified using the syntax U+ followed by one or more hex digits - for example, '{U+FFFD}' . These are automatically handled appropriately in all cases except if you want to generate C code to handle a single byte character set other than ISO-8859-1. Such cases are handled by defining string macros for the U+ codes in the character set, after which the same Snowball source can be used. You can't mix U+ codes defined as string macros with U+ codes used with their default meanings in the same compilation. When U+ codes are defined as string macros, snowball will upper case the characters after the + if there's no macro defined with the case as given.

  • By default  {'}  will substitute  '  and {{}  will substitute  {, although macros  '  and  {  may subsequently be redefined.

  • A further feature is that  {W}  inside a string, where  W  is a sequence of whitespace characters including one or more newlines, is ignored. This enables long strings to be written over a number of lines.

For example,

    stringescapes {}

    /* Spanish diacritics */

    stringdef a'   '{U+00E1}'  // a-acute
    stringdef e'   '{U+00E9}'  // e-acute
    stringdef i'   '{U+00ED}'  // i-acute
    stringdef o'   '{U+00F3}'  // o-acute
    stringdef u'   '{U+00FA}'  // u-acute
    stringdef u"   '{U+00FC}'  // u-diaeresis
    stringdef n~   '{U+00F1}'  // n-tilde

    /* All the characters in Spanish used to represent vowels */

    define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'
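The insert rules can also be modelled outside Snowball. The following is a minimal Python sketch, not the compiler's own code: the function name expand and the macros dictionary are hypothetical, and { and } are assumed to be the insert characters. It covers string macros, U+ codepoints, the default {'} and {{} macros, and the rule that whitespace containing a newline is ignored.

```python
import re

def expand(body, macros):
    """Expand {...} inserts in the body of a Snowball literal string.

    `macros` maps macro names to replacement text, as stringdef would.
    """
    # {'} and {{} are predefined, but an entry in `macros` overrides them.
    table = {"'": "'", "{": "{"}
    table.update(macros)

    def replace(match):
        name = match.group(1)
        if name in table:
            return table[name]
        if re.fullmatch(r"U\+[0-9A-Fa-f]+", name):
            return chr(int(name[2:], 16))
        if name.strip() == "" and "\n" in name:
            return ""  # {W}: whitespace including a newline is ignored
        raise ValueError("undefined string macro: " + name)

    return re.sub(r"\{([^}]*)\}", replace, body)

# The Spanish example: define a-acute as a macro, then use it.
macros = {"a'": expand("{U+00E1}", {})}
vowels = expand("aeiou{a'}", macros)
```

Here vowels holds the five plain vowels followed by a-acute. A string macro is looked up before the U+ rule is applied, which mirrors the way a stringdef for a U+ code replaces its default meaning.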

4 Routines

A routine definition has the form

    define R as C

where  R  is the routine name and  C  is a command, or bracketed group of commands. So a routine is defined as a sequence of zero or more commands. Snowball routines do not (at present) take parameters. For example,

    define Step_5b as (      // this defines Step_5b
        ['l']                // three commands here: [, 'l' and ]
        R2 'l'               // two commands, R2 and 'l'
        delete               // delete is one command
    )

    define R1 as $p1 <= cursor
        /* R1 is defined as the single command "$p1 <= cursor" */

A routine is called simply by using its name,  R, as a command.

5 Commands and signals

The flow of control in Snowball is arranged by the implicit use of signals, rather than the explicit use of constructs like the  if, else,  break  of C. The scheme is designed for handling strings, but is perhaps easier to introduce using integers. Suppose  x,  y,  z  ... are integers. The command

    $x = 1

sets  x  to 1. The command

    $x > 0

tests if  x  is greater than zero. Both commands give a signal, t or f (true or false), but while the second command gives t if  x  is greater than zero and f otherwise, the first command always gives t. In Snowball, every command gives a t or f signal. A sequence of commands can be turned into a single command by putting them in a list surrounded by round brackets:

    ( C1 C2 C3 ... Ci Ci+1 ... )

When this is obeyed,  Ci+1  will be obeyed if each of the preceding  C1  ... Ci  give t, but as soon as a  Ci  gives f, the subsequent  Ci+1 Ci+2  ... are ignored, and the whole sequence gives signal f. If all the  Ci  give t, however, the bracketed command sequence also gives t. So,

    $x > 0  $y = 1

sets  y  to 1 if  x  is greater than zero. If  x  is less than or equal to zero, the first command gives f, so the second is ignored and the sequence as a whole gives f.

If  C1  and  C2  are commands, we can build up the larger commands,

C1 or C2
— Do  C1. If it gives t ignore  C2, otherwise do  C2. The resulting signal is t if and only if  C1  or  C2  gave t.
C1 and C2
— Do  C1. If it gives f ignore  C2, otherwise do  C2. The resulting signal is t if and only if  C1  and  C2  gave t.
not C
— Do  C. The resulting signal is t if  C  gave f, otherwise f.
try C
— Do  C. The resulting signal is t whatever the signal of  C.
fail C
— Do  C. The resulting signal is f whatever the signal of  C.

So for example,

($x > 0 $y = 1) or ($y = 0)
— sets  y  to 1 if  x  is greater than zero, otherwise to zero.
try( ($x > 0) and ($z > 0) $y = 1)
— sets  y  to 1 if both  x  and  z  are greater than 0, and gives t.

This last example is the same as

    try($x > 0  $z > 0  $y = 1)

so that  and  seems unnecessary here. But we will see that  and  has a particular significance in string commands.
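The signal scheme for integer commands can be sketched in Python by treating each command as a function that takes the variables and returns its signal. This is only an illustrative model: the helper names seq, or_, try_, assign and greater are hypothetical and are not part of Snowball.

```python
def seq(*cmds):
    """( C1 C2 ... ): stop at the first f; t only if every command gives t."""
    return lambda env: all(cmd(env) for cmd in cmds)

def or_(c1, c2):
    """C1 or C2: C2 is obeyed only if C1 gives f."""
    return lambda env: c1(env) or c2(env)

def try_(cmd):
    """try C: obey C, then give t whatever its signal."""
    def run(env):
        cmd(env)
        return True
    return run

def assign(name, value):
    def run(env):
        env[name] = value
        return True  # integer assignments always give t
    return run

def greater(name, value):
    return lambda env: env[name] > value

# ($x > 0 $y = 1) or ($y = 0)
first = or_(seq(greater("x", 0), assign("y", 1)), assign("y", 0))

# try($x > 0  $z > 0  $y = 1)
second = try_(seq(greater("x", 0), greater("z", 0), assign("y", 1)))

env_a = {"x": 5, "y": -1}
signal_a = first(env_a)      # x > 0, so y becomes 1

env_b = {"x": -3, "y": -1}
signal_b = first(env_b)      # x <= 0, so y becomes 0

env_c = {"x": 5, "z": 0, "y": -1}
signal_c = second(env_c)     # z > 0 gives f: y is untouched, signal still t
```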

When a ‘monadic’ construct like  not,  try  or  fail  is not followed by a round bracket, the construct applies to the shortest following valid command. So for example

    try not $x < 1 $z > 0

would mean

    try ( not ( $x < 1 ) ) $z > 0

because $x < 1 is the shortest valid command following  not, and then not $x < 1  is the shortest valid command following  try.

The ‘dyadic’ constructs like  and  and  or  must sit in a bracketed list of commands anyway, for example,

    ( C1 C2 and C3 C4 or C5 )

And then in this case  C2  and  C3  are connected by the  and;  C4  and  C5  are connected by the  or. So

    $x > 0  not $y > 0 or not $z > 0  $t > 0

means

    $x > 0  ((not ($y > 0)) or (not ($z > 0)))  $t > 0

The constructs  and  and  or  are equally binding, and bind from left to right, so  C1 or C2 and C3  means  (C1 or C2) and C3  etc.
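This differs from C and Python, where the logical "and" binds more tightly than "or". A small Python sketch makes the difference visible. The function left_to_right is a hypothetical helper that works on signals which have already been computed, so it shows only the grouping and not the skipping of commands.

```python
def left_to_right(first, *rest):
    """Combine 'first op1 s1 op2 s2 ...' strictly from left to right.

    `rest` alternates the words "and"/"or" with boolean signals.
    """
    signal = first
    for op, value in zip(rest[0::2], rest[1::2]):
        signal = (signal and value) if op == "and" else (signal or value)
    return signal

# C1 or C2 and C3, where C1 and C2 give t and C3 gives f
snowball_signal = left_to_right(True, "or", True, "and", False)  # (t or t) and f
python_signal = True or True and False                           # t or (t and f)
```

The Snowball grouping gives f here, while Python's own precedence gives true, so a direct transcription into C or Python needs explicit brackets.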

6 Integer commands

There are two sorts of integer commands - assignments and comparisons. Both are built from Arithmetic Expressions (AEs).

Arithmetic Expressions (AEs)

An AE consists of integer names, literal numbers and a few other things connected by dyadic  +,  -,  *  and  /, and monadic  -, with the same binding powers and semantics as C. As well as integer names and literal numbers, the following may be used in AEs:

minint  — the minimum negative number
maxint  — the maximum positive number
cursor  — the current value of the string cursor
limit  — the current value of the string limit
size  — the size of the string, in "slots"
sizeof s  — the number of "slots" in  s, where  s  is the name of a string or (since Snowball 2.1) a literal string
New in Snowball 2.0:
len  — the length of the string, in Unicode characters
lenof s  — the number of Unicode characters in  s, where  s  is the name of a string or (since Snowball 2.1) a literal string

size and sizeof count in "slots" - see the "Character representation" section below for details.

The cursor and limit concepts are explained below.

Integer assignments

An integer assignment has the form

    $X assign_op AE

where  X  is an integer name and assign_op is one of the five assignments  =,  +=,  -=,  *=, or  /=. The meanings are the same as in C.

For example,

    $p1 = limit    // set p1 to the string limit

Integer assignments always give the signal t.

Integer comparisons

An integer comparison has the form

    $X rel_op AE

or (since Snowball 2.0):

    $(AE1 rel_op AE2)

where  X  is an integer name and rel_op is one of the six tests  ==,  !=,  >=,  >, <=, or  <. Again, the meanings are the same as in C.

Examples of integer comparisons are,

    $p1 <= cursor  // signal is f if the cursor is before position p1
    $(len >= 3)    // signal is f unless the string is at least 3 characters long

The second form is more general since an integer name is a valid AE, but it also allows comparisons which don't involve integer variables. Before support for this was added the second example could only be achieved by assigning len to a variable and then testing that variable instead.

7 String commands

If  s  is a string name, a string command has the form

    $s C

where  C  is a command that operates on the string. Strings can be processed left-to-right or right-to-left, but we will describe only the left-to-right case for now. The string has a cursor, which we will denote by c, and a limit point, or limit, which we will denote by l. c advances towards l in the course of a string command, but the various constructs  and,  or,  not  etc have side-effects which keep moving it backwards. Initially c is at the start and l the end of the string. For example,

        'a|n|i|m|a|d|v|e|r|s|i|o|n'
        |                         |
        c                         l

c, and l, mark the boundaries between characters, and not characters themselves. The characters between c and l will be denoted by c:l.

If  C  gives t, the cursor c will have a new, well-defined value. But if  C gives f, c is undefined. Its later value will in fact be determined by the outer context of commands in which  C  came to be obeyed, not by  C  itself.

Here is a list of the commands that can be used to operate on strings.

a) Setting a value

= S
where  S  is the name of a string or a literal string. c:l is set equal to  S, and l is adjusted to point to the end of the copied string. The signal is t. For example,
        $x  = 'animadversion'    /* literal string */
        $y = x                  /* string name */

b) Basic tests

S
here and below,  S  is the name of a string or a literal string. If c:l begins with the substring  S, c is repositioned to the end of this substring, and the signal is t. Otherwise the signal is f. For example,
        $x 'anim'   /* gives t, assuming the string is 'animadversion' */
        $x ('anim' 'ad' 'vers')
                    /* ditto */

        $t = 'anim'
        $x t        /* ditto */
true,  false
true  is a dummy command that generates signal t.  false  generates signal f. They are sometimes useful for emphasis,
        define start_off as true       // nothing to do
        define exception_list as false // put in among(...) list later
 true  is equivalent to  ()
C1 or C2
This is like the case for integers described above, but the extra touch is that if  C1  gives f, c is set back to its old position after  C1  has given f and before  C2  is tried, so that the test takes place on the same point in the string. So we have
        $x ('anim'  /* signal t */
            'ation' /* signal f */
           ) or
           ( 'an'   /* signal t - from the beginning */
           )
C1 and C2
And similarly c is set back to its old position after  C1  has given t and before  C2  is tried. So,
        $x 'anim' and 'an'   /* signal t */
        $x ('anim'  'an')    /* signal f, since 'an' and 'ad' mis-match */
not C
try C
These are like the integer tests, with the added feature that c is set back to its old position after an f signal is turned into t. So,
        $x (not 'animation' not 'immersion')
            /* both tests are done at the start of the string */

        $x (try 'animus' try 'an'
            'imad')
            /* - gives t */
 try C  is equivalent to  C or true
test C
This does command  C  but without advancing c. Its signal is the same as the signal of  C, but following signal t, c is set back to its old value.
 test C  is equivalent to  not not C
 test C1 C2  is equivalent to  C1 and C2
fail C
This does  C  and gives signal f. It is equivalent to  C false. Like  false  it is useful, but only rarely.
do C
This does  C, puts c back to its old value and gives signal t. It is very useful as a way of suppressing the side effect of f signals and cursor movement.
 do C  is equivalent to  try test C
or  test try C
goto C
c is moved right until obeying  C  gives t. But if c cannot be moved right because it is at l the signal is f. c is set back to the position it had before the last obeying of  C, so the effect is to leave c before the pattern which matched against  C.
        $x goto 'ad'         /* positions c after 'anim' */
        $x goto 'ax'         /* signal f */
gopast C
Like goto, but c is not set back, so the effect is to leave c after the pattern which matched against  C.
        $x gopast 'ad'       /* positions c after 'animad' */
repeat C
C  is repeated until it gives f. When this happens c is set back to the position it had before the last repetition of  C, and  repeat C  gives signal t. For example,
        $x repeat gopast 'a' /* position c after the last 'a' */
loop AE C
This is like  C C ... C  written out AE times, where AE is an arithmetic expression. For example,
        $x loop 2 gopast ('a' or 'e' or 'i' or 'o' or 'u')
            /* position c after the second vowel */
The equivalent expression in C has the shape,
        {    int i;
             int limit = AE;
             for (i = 0; i < limit; i++) C;
        }
atleast AE C
This is equivalent to  loop AE C repeat C.
hop AE
moves c AE character positions towards l, but if AE is negative, or if there are fewer than AE characters between c and l the signal is f. For example,
        test hop 3
tests that c:l contains more than 2 characters.
next
is equivalent to  hop 1.
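The way these tests move and restore c can be modelled with a few Python functions. This is a sketch of the semantics described above for forward processing only, not the code Snowball generates. The class State and the helpers lit, seq, or_, and_, goto, repeat and obey are hypothetical names.

```python
class State:
    """A string with a cursor c and a limit l (forward mode only)."""
    def __init__(self, text):
        self.text, self.c, self.l = text, 0, len(text)

def lit(s):
    def run(st):
        if st.text.startswith(s, st.c, st.l):
            st.c += len(s)
            return True
        return False
    return run

def seq(*cmds):
    return lambda st: all(cmd(st) for cmd in cmds)

def or_(c1, c2):
    def run(st):
        start = st.c
        if c1(st):
            return True
        st.c = start              # C2 is tried from the same point
        return c2(st)
    return run

def and_(c1, c2):
    def run(st):
        start = st.c
        if not c1(st):
            return False
        st.c = start              # C2 is tested at the same point as C1
        return c2(st)
    return run

def goto(cmd, past=False):
    def run(st):
        while True:
            start = st.c
            if cmd(st):
                if not past:
                    st.c = start  # goto leaves c before the match
                return True
            st.c = start
            if st.c >= st.l:
                return False      # c cannot move right: signal f
            st.c += 1
    return run

def repeat(cmd):
    def run(st):
        while True:
            start = st.c
            if not cmd(st):
                st.c = start      # undo the failed final repetition
                return True
    return run

def obey(cmd, text="animadversion"):
    """Obey cmd on text and return (signal, final value of c)."""
    st = State(text)
    signal = cmd(st)
    return signal, st.c

anim_and_an = obey(and_(lit("anim"), lit("an")))          # t, c after 'an'
goto_ad = obey(goto(lit("ad")))                           # t, c after 'anim'
gopast_ad = obey(goto(lit("ad"), past=True))              # t, c after 'animad'
last_a = obey(repeat(goto(lit("a"), past=True)))          # t, c after last 'a'
```

When a command gives f the model leaves c wherever it happens to be, which matches the rule that c is undefined after an f signal until an enclosing construct resets it.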

c) Moving text about

We have seen in (a) that  $x = y, when  x  and  y  are strings, sets c:l of  x to the value of  y. Conversely

        $x => y

sets the value of  y  to the c:l region of  x.

A more delicate mechanism for pushing text around is to define a substring, or slice of the string being tested. Then

[
sets the left-end of the slice to c,
]
sets the right-end of the slice to c,
-> s
copies the slice to variable  s,
<- S
replaces the slice with variable (or literal)  S.

For example

        /* assume x holds 'animadversion' */
        $x ( [          // '[animadversion' - [ set as indicated
             loop 2 gopast 'a'
                       // '[anima|dversion' - c is marked by '|'
             ]         // '[anima]dversion' - ] set as indicated
             -> y      // y is 'anima'
           )

For any string, the slice ends should be assumed to be unset until they are set with the two commands  [,  ]. Thereafter the slice ends will retain the same values until altered.

delete
is equivalent to <- ''

This next example deletes all vowels in x,

        define vowel as ('a' or 'e' or 'i' or 'o' or 'u')
        /* ... */
        $x repeat ( gopast([vowel]) delete )

As this example shows, the slice markers  [  and  ]  often appear as pairs in a bracketed style, which makes for easy reading of the Snowball scripts. But it must be remembered that, unusually in a computer programming language, they are not true brackets.
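The two slice examples above can be traced step by step in Python. This is an illustrative model with hypothetical function names. It assumes forward mode, and it assumes that when the slice just before c is deleted, c moves back to the left end of the slice.

```python
VOWELS = "aeiou"

def copy_to_second_a(text):
    """Model of: $x ( [ loop 2 gopast 'a' ] -> y ). Returns y, or None for f."""
    bra = c = 0                       # [ sets the left end of the slice at c
    for _ in range(2):                # loop 2 gopast 'a'
        found = text.find("a", c)
        if found < 0:
            return None               # gopast gave f, so y is not assigned
        c = found + 1
    ket = c                           # ] sets the right end of the slice at c
    return text[bra:ket]              # -> y

def delete_vowels(text):
    """Model of: $x repeat ( gopast([vowel]) delete ). Returns the new x."""
    c, l = 0, len(text)
    while True:
        # gopast([vowel]): scan right for a vowel, bracketing it with [ and ]
        pos = c
        while pos < l and text[pos] not in VOWELS:
            pos += 1
        if pos == l:
            return text               # gopast gave f, so repeat stops with t
        bra, ket = pos, pos + 1       # [ before the vowel, ] after it
        # delete (<- ''): remove the slice; c was at ket, so it moves to bra
        text = text[:bra] + text[ket:]
        l -= ket - bra
        c = bra
```

Note that the slice ends are ordinary positions that are set one at a time: in copy_to_second_a the left end is fixed before the loop runs and the right end only afterwards.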

More simply, text can be inserted at c.

insert S
insert variable or literal  S  before c, moving c to the right of the insert.  <+  is a synonym for  insert.
attach S
the same, but leave c at the left of the insert.

d) Marks

The cursor, c, (and the limit, l) can be thought of as having a numeric value, from zero upwards:

         | a | n | i | m | a | d | v | e | r | s | i | o | n |
         0   1   2   3   4   5   6   7   8   9  10  11  12  13

It is these numeric values of c and l which are accessible through cursor  and  limit  in arithmetic expressions.

setmark X
sets  X  to the current value of c, where  X  is an integer variable. It's equivalent to: $X = cursor
tomark AE
moves c forward to the position given by AE,
atmark AE
tests if c is at position AE (t or f signal). It's equivalent to: $(cursor == AE)

In the case of tomark AE , a similar fail condition occurs as with hop AE . If c is already beyond AE, or if position l is before position AE, the signal is f.

In the stemming algorithms, certain regions of the word are defined by setting marks, and later the failure condition of tomark is used to see if c is inside a particular region.

Two other commands put c at l, and test if c is at l,

tolimit
moves c forward to l (signal t always),
atlimit
tests if c is at l (t or f signal).

e) Changing l

In this account of string commands we see c moving right towards l, while l stays fixed at the end. In fact l can be reset to a new position between c and its old position, to act as a shorter barrier for the movement of c.

setlimit C1 for C2
C1  is obeyed, and if it gives f the signal from  setlimit is f with no further action.

Otherwise, the final value of c becomes the new position of l. c is then set back to its old value before  C1  was obeyed, and  C2  is obeyed. Finally l is set back to its old position, and the signal of  C2  becomes the signal of  setlimit.

So the signal is f if either  C1  or  C2  gives f, otherwise t. For example,

    $x ( setlimit goto 's'  // 'animadver}sion' new l as marked '}'
         for                // below, '|' marks c after each goto
         ( goto 'a' and     // '|animadver}sion'
           goto 'e' and     // 'animadv|er}sion'
           goto 'i'         // 'an|imadver}sion'
         )
       )

This checks that x has characters ‘a’, ‘e’ and ‘i’ before the first ‘s’.
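The same check can be written out in Python to show the order of events in setlimit. This is a sketch with hypothetical names (goto_char, vowels_before_s), covering forward mode and single-character patterns only.

```python
def goto_char(text, c, l, ch):
    """goto 'ch': the position just before ch within text[c:l], or None for f."""
    pos = text.find(ch, c, l)
    return None if pos < 0 else pos

def vowels_before_s(text):
    """Model of: setlimit goto 's' for (goto 'a' and goto 'e' and goto 'i')."""
    c, l = 0, len(text)
    new_l = goto_char(text, c, l, "s")   # C1: the point it reaches becomes l
    if new_l is None:
        return False                     # C1 gave f, so setlimit gives f
    # c goes back to its old value and C2 runs against the tighter limit.
    # Each operand of 'and' starts from the same c.
    signal = all(goto_char(text, c, new_l, ch) is not None for ch in "aei")
    # l would be restored at this point; the model kept it in a local,
    # so there is nothing to undo.
    return signal
```

For 'animadversion' the new limit falls after 'animadver', and all three vowels are found inside that region.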

f) Backward processing

String commands have been described with c to the left of l and moving right. But the process can be reversed.

backwards C
c and l are swapped over, and c moves left towards l.  C  is obeyed, the signal given by  C  becomes the signal of  backwards C, and c and l are swapped back to their old values (except that l may have been adjusted because of deletions and insertions).  C  cannot contain another backwards command.
reverse C
A similar idea, but here c simply moves left instead of moving right, with the beginning of the string as the limit, l.  C  can contain other reverse commands, but it cannot contain commands to do deletions or insertions — it must be used for testing only. (Without this restriction Snowball's semantics would become very untidy.)

Forward and backward processing are entirely symmetric, except that forward processing is the default direction, and literal strings are always written out forwards, even when they are being tested backwards. So the following are equivalent,

    $x (
        'ani' 'mad' 'version' atlimit
    )

    $x backwards (
        'version' 'mad' 'ani' atlimit
    )

If a routine is defined for backwards mode processing, it must be included inside a  backwardmode(...)  declaration.
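The symmetry between the two directions can be sketched in Python. The function names are hypothetical. In the backward version c starts at the end of the string and each literal, still written forwards, has to end at c.

```python
def match_forwards(text, parts):
    """Model of: $x ( 'p1' 'p2' ... atlimit )"""
    c, l = 0, len(text)
    for part in parts:
        if not text.startswith(part, c, l):
            return False
        c += len(part)
    return c == l                      # atlimit

def match_backwards(text, parts):
    """Model of: $x backwards ( 'p1' 'p2' ... atlimit )"""
    c, l = len(text), 0                # c and l are swapped over
    for part in parts:
        if not text.endswith(part, l, c):
            return False
        c -= len(part)                 # c moves left towards l
    return c == l                      # atlimit
```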

g) substring and among

The use of substring and among is central to the implementation of the stemming algorithms. It is like a case switch on strings. In its simpler form,

        substring among('S1' 'S2' 'S3' ...)

searches for the longest matching substring  'S1'  or  'S2'  or  'S3'  ... from position c. (The  'Si'  must all be different.) So this has the same semantics as

        ('S1' or 'S2' or 'S3' ...)

— so long as the  'Si'  are written out in decreasing order of length.
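The difference between the longest-match search and a chain of or tests shows up as soon as the strings are not ordered by length. A small Python sketch, with hypothetical function names and forward mode assumed:

```python
def among(text, c, l, strings):
    """Longest match at c: returns (string, new c), or None for signal f."""
    best = None
    for s in strings:
        if text.startswith(s, c, l) and (best is None or len(s) > len(best)):
            best = s
    return None if best is None else (best, c + len(best))

def or_chain(text, c, l, strings):
    """('S1' or 'S2' or ...): the first string that matches wins."""
    for s in strings:
        if text.startswith(s, c, l):
            return (s, c + len(s))
    return None

word = "animadversion"
candidates = ["a", "an", "anim"]

by_among = among(word, 0, len(word), candidates)
by_or = or_chain(word, 0, len(word), candidates)
by_or_sorted = or_chain(word, 0, len(word),
                        sorted(candidates, key=len, reverse=True))
```

With the strings in increasing order of length the or chain stops at 'a', whereas among, and the or chain over the reordered list, both select 'anim'.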

substring  may be omitted, in which case it is attached to its following among, so

    among(/*...*/)

without a preceding substring is equivalent to

    (substring among(/*...*/))

substring may also be detached from its among , although it must precede it textually in the same routine in which the among appears. The more general form of substring /* ... */ among is,

    substring
    C
    among( 'S11' 'S12' ... (C1)
           'S21' 'S22' ... (C2)
           ...

           'Sn1' 'Sn2' ... (Cn)
         )

Obeying  substring  searches for a longest match among the  'Sij'. The signal from  substring  is t if a match is found, otherwise f. Any commands C between the substring and among will be run after this search and only if the search finds a match (it would be equivalent to remove C and replace each Ci with C Ci). When the among  comes to be obeyed, the  Ci  corresponding to the matched  'Sij'  is obeyed, and its signal becomes the signal of the  among  command.

substring/among  pairs must match up textually inside each routine definition. But there is no problem with an  among  containing other substring/among  pairs, and  substring  is optional before  among  anyway. The essential constraint is that two  substrings must be separated by an among, and each  substring  must be followed by an  among.

The effect of obeying  among  when the preceding  substring  is not obeyed is undefined. This would happen for example here,

    try($x != 617 substring)
    among(...) // 'substring' is bypassed in the exceptional case where x == 617

The significance of separating the  substring  from the  among  is to allow them to work in different contexts. For example,

    setlimit tomark L for substring

    among( 'S11' 'S12' ... (C1)
           ...

           'Sn1' 'Sn2' ... (Cn)
         )

Here the test for the longest  'Sij'  is constrained to the region between c and the mark point given by integer  L. But the commands  Ci  operate outside this limit. Another example is

    reverse substring

    among( 'S11' 'S12' ... (C1)
           ...

           'Sn1' 'Sn2' ... (Cn)
         )

The substring test is in the opposite direction in the string to the direction of the commands  Ci.

The last  (Cn)  may be omitted, in which case  (true)  is assumed.

Each string  'Sij'  may be optionally followed by a routine name,

    among(
           'S11' R11 'S12' R12 ... (C1)
           'S21' R21 'S22' R22 ... (C2)
           ...
           'Sn1' Rn1 'Sn2' Rn2 ... (Cn)
         )

If a routine name is not specified, it is equivalent to a routine which simply returns signal t,

    define null as true

— so we can imagine each  'Sij'  having its associated routine Rij. Then obeying the  among  causes a search for the longest 'Sij'  whose corresponding routine Rij  gives t.

The routines Rij  should be written without any side-effects, other than the inevitable cursor movement. (c is in any case set back to its old value following a call of Rij.)
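The search with routines attached can be sketched as follows. This is a model with hypothetical names rather than Snowball's actual search code. It assumes that a routine is called with c just after its matched string, and it returns the action of the selected entry instead of obeying it.

```python
def among_with_routines(text, c, l, entries):
    """Select the action for the longest matching string whose routine gives t.

    entries: list of (string, routine or None, action). A routine takes
    (text, c_after_match) and returns its signal. None stands for the
    implied routine that always gives t.
    Returns the selected action, or None if nothing is selected (signal f).
    """
    # Longest candidates first, so that a routine giving f falls through
    # to the next shorter match.
    for s, routine, action in sorted(entries, key=lambda e: len(e[0]),
                                     reverse=True):
        if not text.startswith(s, c, l):
            continue
        if routine is not None and not routine(text, c + len(s)):
            continue
        return action
    return None

word = "animadversion"
reject = lambda text, c: False
accept = lambda text, c: True

picked_short = among_with_routines(
    word, 0, len(word), [("an", None, "C1"), ("anim", reject, "C2")])
picked_long = among_with_routines(
    word, 0, len(word), [("an", None, "C1"), ("anim", accept, "C2")])
```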

8 Booleans

set B and unset B set  B  to true and false respectively, where  B  is a boolean name. B as a command gives a signal t if it is set true, f otherwise. For example,

    booleans ( Y_found )   // declare the boolean

    /* ... */

    unset Y_found          // unset it
    do ( ['y'] <-'Y' set Y_found )
       /* if c:l begins 'y' replace it by 'Y' and set Y_found */

    do repeat(goto (v ['y']) <-'Y' set Y_found)
       /* repeatedly move down the string looking for v 'y' and
          replacing 'y' with 'Y'. Whenever the replacement takes
          place set Y_found. v is a test for a vowel, defined as
          a grouping (see below). */


    /* Y_found means there are some letters Y in the string.
       Later we can use this to trigger a conversion back to
       lower case y. */

    /* ... */

    do (Y_found repeat(goto (['Y']) <- 'y'))

9 Groupings

A grouping brings characters together and enables them to be looked for with a single test.

If  G  is declared as a grouping, it can be defined by

    define G G1 op G2 op G3 ...

where op is  +  or  -, and  G1,  G2,  G3  are literal strings, or groupings that have already been defined. (There can be zero or more of these additional op components). For example,

    define capital_letter  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    define small_letter    'abcdefghijklmnopqrstuvwxyz'
    define letter          capital_letter + small_letter
    define vowel           'aeiou' + 'AEIOU'
    define consonant       letter - vowel
    define digit           '0123456789'
    define alphanumeric    letter + digit

Once  G  is defined, it can be used as a command, and is equivalent to a test

    'ch1' or 'ch2' or ...

where  ch1,  ch2  ... list all the characters in the grouping.

non G is the converse test, and matches any character except the characters of  G. Note that non G is not the same as not G , in fact

non G is equivalent to (not G next)
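Groupings behave like sets of characters, with + as union and - as difference. The Python sketch below uses hypothetical function names and assumes forward mode. It also shows how non G and not G differ, most visibly when c is at l.

```python
import string

capital_letter = set(string.ascii_uppercase)
small_letter = set(string.ascii_lowercase)
letter = capital_letter | small_letter          # capital_letter + small_letter
vowel = set("aeiou") | set("AEIOU")             # 'aeiou' + 'AEIOU'
consonant = letter - vowel                      # letter - vowel

def grouping(g, text, c, l):
    """G as a command: the new c if the character at c is in g, else None (f)."""
    return c + 1 if c < l and text[c] in g else None

def non(g, text, c, l):
    """non G: the new c if there is a character at c and it is not in g."""
    return c + 1 if c < l and text[c] not in g else None

def not_(g, text, c, l):
    """not G: a signal only; c does not move."""
    return grouping(g, text, c, l) is None
```

At the limit there is no character to consume, so non vowel gives f while not vowel gives t. Away from the limit both can succeed on a non-vowel, but only non vowel advances c.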

non may be optionally followed by hyphen, for example:

    non-vowel
    non-digit

Bear in mind that non-vowel doesn't only match a consonant - it'll match any character which isn't in the vowel grouping. Failing to consider this has led to bugs in stemming algorithms - for example, here we intended to undouble a consonant:

    [non-vowel] -> ch
    ch
    delete

The problem with this code is it will also mangle numbers with repeated digits, for example 1900 would become 190. A good rule of thumb here seems to be to use an inclusive grouping check if the code goes on to delete the character matched:

    [consonant] -> ch
    ch
    delete
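The difference between the two versions can be reproduced in Python. This sketch is only a model: the function undouble, the two predicates and the character sets are hypothetical, and it runs in forward mode at a given position c.

```python
VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

def undouble(text, c, accept):
    """Model of  [G] -> ch  ch  delete  starting at position c.

    `accept` decides whether a character passes the grouping test G.
    Returns the text afterwards (unchanged if any step gives f).
    """
    if c >= len(text) or not accept(text[c]):
        return text                      # [G] gave f
    ch = text[c]                         # the slice text[c:c+1] is copied to ch
    if not text.startswith(ch, c + 1):
        return text                      # the test on ch gave f
    return text[:c] + text[c + 1:]       # delete removes the slice

def non_vowel(ch):
    return ch not in VOWELS

def is_consonant(ch):
    return ch in CONSONANTS

mangled = undouble("1900", 2, non_vowel)        # the digit 0 is a non-vowel
preserved = undouble("1900", 2, is_consonant)   # the digit 0 is no consonant
stemmed = undouble("hopp", 2, is_consonant)     # a real double consonant
```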

10 A Snowball program

A complete program consists of a sequence of declarations followed by a sequence of definitions of groupings and routines. Routines which are implicitly defined as operating on c:l from right to left must be included in a  backwardmode(...)  declaration.

A Snowball program is called up via a simple API through its defined externals. For example,

    externals ( stem1 stem2 )
    /* ... */
    define stem1 as ( /* stem1 commands */ )
    define stem2 as ( /* stem2 commands */ )

The API also allows a current string to be defined, and this becomes the c:l string for the external routine to work on. Its final value is the result handed back through the API.

The strings, integers and booleans are accessible from any point in the program, and exist throughout the running of the Snowball program. They are therefore like static declarations in C.

11 Comments, and other whitespace fillers

At a deeper level, a program is a sequence of tokens, interspersed with whitespace. Names, reserved words, literal numbers and strings are all tokens. Various symbols, made up of non-alphanumerics, are also tokens.

A name, reserved word or number is terminated by the first character that cannot form part of it. A symbol is recognised as the longest sequence of characters that forms a valid symbol. So  +=-  is two symbols,  +=  and -, because  +=  is a valid symbol in the language while  +=-  is not. Whitespace separates tokens but is otherwise ignored.
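The longest-symbol rule amounts to a greedy match against the set of valid symbols. A minimal Python sketch follows. The function name split_symbols is hypothetical, and SYMBOLS is only a partial list assembled from the symbols mentioned in this manual.

```python
SYMBOLS = {"+=", "-=", "*=", "/=", "==", "!=", ">=", "<=", "=>", "->", "<-",
           "<+", "=", "+", "-", "*", "/", ">", "<", "(", ")", "[", "]", "$"}

def split_symbols(text):
    """Split a run of symbol characters into the longest valid symbols."""
    longest = max(len(s) for s in SYMBOLS)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible symbol first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in SYMBOLS:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            raise ValueError("no valid symbol at offset %d" % i)
    return tokens
```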

Occasionally a newer version of Snowball may add a new token. So as not to break existing programs, any such tokens declared as a name (via integers , routines , etc) will lose their token status for the rest of the program. This applies to the tokens len and lenof .

Anywhere that whitespace can occur, there may also occur:

(a) Comments, in the usual multi-line /* .... */ or single line // ... format.

(b) Get directives. These are like  #include  commands in C, and have the form get 'S' , where  'S'  is a literal string. For example,

    get '/home/martin/snowball/main-hdr' // include the file contents

(c) stringescapes XY where  X  and  Y  are any two printing characters.

(d) stringdef m 'S' where  m  is a sequence of characters not including whitespace and terminated with whitespace, and  'S'  is a literal string.

12 Character representation

In this description of Snowball, it is assumed that strings are composed of characters, and that characters can be defined numerically, but the numeric range of these characters is not defined. As implemented, three different schemes are supported. Characters can be (a) bytes in the range 0 to 255, as in traditional C strings, (b) byte pairs in the range 0 to 65535, as in Java strings, or (c) UTF-8 encoded byte sequences in the range 0 to 65535, so that a character may occupy 1, 2 or 3 bytes.

For case (c), we need to make a slight separation of the concept of characters into symbols, the units of text being represented, and slots, the units of space into which they map. (So in case (a), all slots are one byte; in case (b) all slots are two bytes.) c and l have numeric values that can be used in AEs (arithmetic expressions). These values count the number of slots. Similarly setmark,  tomark  and  atmark  are remembering and then using slot counts.  size  and  sizeof  measure string size in slots, not symbols. However,  hop N  moves c over  N  symbols, not  N  slots, and  next  is equivalent to  hop 1.

Snowball 2.0 adds len and lenof, which measure string length in symbols (so they're the same as size and sizeof in cases (a) and (b), but different in case (c)).

So long as these simple distinctions are recognised, the same Snowball script can be compiled to work with any of the three encoding schemes.
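The distinction between slots and symbols can be checked with Python's own encoders. This is an illustration only: the function names and scheme labels are hypothetical, and scheme (a) is assumed to be ISO-8859-1.

```python
def size_in_slots(s, scheme):
    """Model of  size : the slots needed for s under each encoding scheme."""
    if scheme == "bytes":                        # case (a), assumed ISO-8859-1
        return len(s.encode("latin-1"))
    if scheme == "wide":                         # case (b), 16-bit units
        return len(s.encode("utf-16-le")) // 2
    if scheme == "utf8":                         # case (c)
        return len(s.encode("utf-8"))
    raise ValueError(scheme)

def length_in_symbols(s):
    """Model of  len : the number of characters."""
    return len(s)

word = "ni\u00f1o"                               # n, i, n-tilde, o

# hop 3 moves c over three symbols, but the resulting value of cursor
# counts slots, so under UTF-8 it is 4 rather than 3.
cursor_after_hop_3 = size_in_slots(word[:3], "utf8")
```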

13 Legacy Features

This section documents features of Snowball for which there's a strongly preferred alternative. They're still supported for compatibility with existing code which uses them, but you shouldn't use them in new code. We document them here so that their meaning in existing code can be understood, and especially to aid updating to the preferred alternatives.

13.1 hex and decimal

In a  stringdef , string may be preceded by the word  hex, or the word  decimal. This was how non-ASCII characters were specified before support for specifying Unicode codepoints using the U+ notation was added.

hex and decimal mean that the contents of the string are interpreted as character values written out in hexadecimal or decimal notation. The values should be separated by spaces. For example,

    hex 'DA'        /* is character hex DA */
    hex 'D A'       /* is the two characters, hex D and A (carriage
                       return, and line feed) */
    decimal '10'    /* character 10 (line feed) */
    decimal '13 10' /* characters 13 and 10 (carriage return, and
                       line feed) */

The following forms are all equivalent to hex 'D A':

    hex 'd a'      /* lower case also allowed */
    hex '0D 000A'  /* leading zeroes ignored */
    hex ' D  A  '  /* extra spacing is harmless */
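
The interpretation rules above can be modelled in a few lines of Python. This is only a sketch of the rules as described here, not the Snowball compiler's actual code, and the function name decode_stringdef is hypothetical.

```python
def decode_stringdef(body: str, base: int) -> str:
    """Interpret the body of a hex (base 16) or decimal (base 10) string.

    The body is split on whitespace, so extra spacing is harmless, and
    each token is read as a number, so letter case and leading zeroes
    make no difference.
    """
    return "".join(chr(int(token, base)) for token in body.split())


print(repr(decode_stringdef("DA", 16)))
print(repr(decode_stringdef("D A", 16)))
print(repr(decode_stringdef("13 10", 10)))
```

With this model, 'd a', '0D 000A' and ' D  A  ' all decode to the same two characters as 'D A'.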

The interpretation of the values is as Unicode codepoints if command line option -utf8 or -widechars is specified, and as character values in an unspecified single byte character set otherwise. For ASCII and ISO-8859-1 the character values match Unicode codepoints, but to handle other single byte character sets (e.g. ISO-8859-2 or KOI8-R) you would need a special version of a Snowball source with different character values specified via stringdef. The U+ notation allows you to use a single Snowball source in this situation.
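
To see why the same numeric value cannot serve every single byte character set, the following Python sketch decodes one value under several interpretations. It uses Python's standard codecs only to illustrate the mismatch; the sample values 0xB1 and 0xC1 are chosen for illustration and do not come from any Snowball source.

```python
value = 0xB1

# Interpreted as a Unicode codepoint (the -utf8 or -widechars behaviour).
as_unicode = chr(value)

# Interpreted as a byte in particular single byte character sets.
as_latin1 = bytes([value]).decode("iso-8859-1")
as_latin2 = bytes([value]).decode("iso-8859-2")
as_koi8r = bytes([0xC1]).decode("koi8-r")

for label, ch in [("Unicode", as_unicode), ("ISO-8859-1", as_latin1),
                  ("ISO-8859-2", as_latin2), ("KOI8-R 0xC1", as_koi8r)]:
    # Show the Unicode codepoint each interpretation actually denotes.
    print(f"{label}: U+{ord(ch):04X}")
```

The ISO-8859-1 reading agrees with the Unicode codepoint, while the ISO-8859-2 and KOI8-R readings denote quite different codepoints. This is the situation in which a hex or decimal stringdef would need a separate source per character set.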

13.2 among starter command

The among command supports a "starter" command, C in this example:

    among( (C)
           'S11' 'S12' ... (C1)
           'S21' 'S22' ... (C2)
           ...
           'Sn1' 'Sn2' ... (Cn)
         )

This is equivalent to adding C at the start of each Ci:

    among( 'S11' 'S12' ... (C C1)
           'S21' 'S22' ... (C C2)
           ...
           'Sn1' 'Sn2' ... (C Cn)
         )

However, both are equivalent to:

    substring C
    among( 'S11' 'S12' ... (C1)
           'S21' 'S22' ... (C2)
           ...
           'Sn1' 'Sn2' ... (Cn)
         )

This form requires an explicit substring, but it is clearer, so we recommend it for new code and have designated the use of a starter as a legacy feature.

A starter is also allowed with an explicit substring, for example:

    substring
    Cs
    among( (Ca)
           'S11' 'S12' ... (C1)
           'S21' 'S22' ... (C2)
           ...
           'Sn1' 'Sn2' ... (Cn)
         )

is equivalent to:

    substring
    Cs
    Ca
    among( 'S11' 'S12' ... (C1)
           'S21' 'S22' ... (C2)
           ...
           'Sn1' 'Sn2' ... (Cn)
         )

Snowball syntax

In the grammar which follows, || is used for alternatives, [X] means that X is optional, and [X]* means that X is repeated zero or more times. Meta-symbols are defined on the left. <char> means any character.

The definition of  literal string  does not allow for the escaping conventions established by the  stringescapes  directive. The command ?  is a debugging aid.

<letter>        ::= a || b || ... || z || A || B || ... || Z
<digit>         ::= 0 || 1 || ... || 9
<name>          ::= <letter> [ <letter> || <digit> || _ ]*
<s_name>        ::= <name>
<i_name>        ::= <name>
<b_name>        ::= <name>
<r_name>        ::= <name>
<g_name>        ::= <name>
<literal string>::= '[<char>]*'
<number>        ::= <digit> [ <digit> ]*

S               ::= <s_name> || <literal string>
G               ::= <g_name> || <literal string>

<declaration>   ::= strings ( [<s_name>]* ) ||
                    integers ( [<i_name>]* ) ||
                    booleans ( [<b_name>]* ) ||
                    routines ( [<r_name>]* ) ||
                    externals ( [<r_name>]* ) ||
                    groupings ( [<g_name>]* )

<r_definition>  ::= define <r_name> as C
<plus_or_minus> ::= + || -
<g_definition>  ::= define <g_name> G [ <plus_or_minus> G ]*

AE              ::= (AE) ||
                    AE + AE || AE - AE || AE * AE || AE / AE || - AE ||
                    maxint || minint || cursor || limit ||
                    size || sizeof S ||
                    len || lenof S ||
                    <i_name> || <number>

<i_assign>      ::= $ <i_name> = AE ||
                    $ <i_name> += AE || $ <i_name> -= AE ||
                    $ <i_name> *= AE || $ <i_name> /= AE

<i_test_op>     ::= == || != || > || >= || < || <=

<i_test>        ::= $ ( AE <i_test_op> AE ) ||
                    $ <i_name> <i_test_op> AE

<s_command>     ::= $ <s_name> C

C               ::= ( [C]* ) ||
                    <i_assign> || <i_test> || <s_command> || C or C || C and C ||
                    not C || test C || try C || do C || fail C ||
                    goto C || gopast C || repeat C || loop AE C ||
                    atleast AE C || S || = S || insert S || attach S ||
                    <- S || delete ||  hop AE || next ||
                    => <s_name> || [ || ] || -> <s_name> ||
                    setmark <i_name> || tomark AE || atmark AE ||
                    tolimit || atlimit || setlimit C for C ||
                    backwards C || reverse C || substring ||
                    among ( [<literal string> [<r_name>] || (C)]* ) ||
                    set <b_name> || unset <b_name> || <b_name> ||
                    <r_name> || <g_name> || non [-] <g_name> ||
                    true || false || ?

P              ::=  [P]* || <declaration> ||
                    <r_definition> || <g_definition> ||
                    backwardmode ( P )

<program>      ::=  P



synonyms:      <+ for insert