Snowball Manual

Snowball definition

Snowball is a small string-handling language, and its name was chosen as a tribute to SNOBOL (Farber 1964, Griswold 1968 — see the references at the end of the introduction), with which it shares the concept of string patterns delivering signals that are used to control the flow of the program.

1 Data types

The basic data types handled by Snowball are strings of characters, signed integers, and boolean truth values, or more simply strings, integers and booleans. Snowball supports Unicode characters, which may be represented as UTF-8, 8-bit characters, or 16-bit wide characters (depending on the programming language for which code is being generated; for C, all these options are supported).

2 Names

A name in Snowball starts with an ASCII letter, followed by zero or more ASCII letters, digits and underscores. A name can be of type string, integer, boolean, routine, external or grouping. All names must be declared. A declaration has the form

    Ts ( ... )

where symbol T is one of string, integer etc, and the region in brackets contains a list of names separated by whitespace. For example,

    integers ( p1 p2 )
    booleans ( Y_found )

    routines (
       shortv
       R1 R2
       Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5a Step_5b
    )

    externals ( stem )

    groupings ( v v_WXY v_LSZ )

p1 and p2 are integers, Y_found is boolean, and so on. Snowball is quite strict about the declarations, so all the names go in the same name space, no name may be declared twice, all used names must be declared, no two routine definitions can have the same name, etc. Names declared and subsequently not used are merely reported in a warning message.

A name may not be one of the reserved words of Snowball. Additionally, names for externals must be valid function/method names in the language being generated in most cases, which generally means they can't be reserved words in that language (e.g. externals (null) will generate invalid Java code containing a method public boolean null().) For internal symbols we add a prefix to avoid this issue, but an external has to provide an external interface. When generating C code, the -eprefix option provides a potential solution to this problem.

Names in Snowball are case-sensitive, but external names which differ only in case will cause a problem for languages with case-insensitive identifiers (such as Pascal). This issue is avoided for internal symbols in such languages by encoding case difference via an added prefix.

So for portability a little care is needed when choosing names for externals. The convention when using Snowball to implement stemming algorithms is to have a single external named stem, which should be safe.

3 Literals

3.1 Integer Literals

A literal integer is an ASCII digit sequence, and is always interpreted as decimal.

3.2 String Literals

A literal string is written between single quotes, for example,

    'aeiouy'

Two special insert characters for use in literal strings are defined by the directive stringescapes AB, for example,

    stringescapes {}

Conventionally { and } are used as the insert characters, and we would recommend following this convention unless you want to use these as literal characters in your strings a lot. However, A and B can be any printing characters, except that A can't be a single quote. (If A and B are the same then A itself can never be escaped.)

A subsequent occurrence of the stringescapes directive redefines the insert characters (but any string macros already defined with stringdef remain defined).

Within insert characters, the following sequences are understood:

User-defined string macros which can be specified using stringdef. Macro m is defined in the form stringdef m 'S', where 'S' is a string, and m a sequence of one or more printing characters. Thereafter, {m} inside a string causes S to be substituted in place of m.
New in Snowball 2.0: Unicode codepoints can be specified using the syntax U+ followed by one or more hex digits - for example, '{U+FFFD}' . These are automatically handled appropriately in all cases except if you want to generate C code to handle a single byte character set other than ISO-8859-1. Such cases are handled by defining string macros for the U+ codes in the character set, after which the same Snowball source can be used. You can't mix use of U+ codes defined as string macros and with their default meanings in the same compilation. When U+ codes are defined as string macros, snowball will upper case the characters after the + if there's no macro defined with the case as given.
By default {'} will substitute ' and {{} will substitute {, although macros ' and { may subsequently be redefined.
A further feature is that {W} inside a string, where W is a sequence of whitespace characters including one or more newlines, is ignored. This enables long strings to be written over a number of lines.

For example,

    stringescapes {}

    /* Spanish diacritics */

    stringdef a'   '{U+00E1}'  // a-acute
    stringdef e'   '{U+00E9}'  // e-acute
    stringdef i'   '{U+00ED}'  // i-acute
    stringdef o'   '{U+00F3}'  // o-acute
    stringdef u'   '{U+00FA}'  // u-acute
    stringdef u"   '{U+00FC}'  // u-diaeresis
    stringdef n~   '{U+00F1}'  // n-tilde

    /* All the characters in Spanish used to represent vowels */

    define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'
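The insert rules above can be modelled in a few lines of Python. This is only a sketch of the expansion as described in this section, not the Snowball compiler's own code: the function name expand and the macros dictionary are illustrative assumptions, and { and } are assumed to be the insert characters.

```python
import re

def expand(literal, macros):
    """Expand {...} inserts in the body of a Snowball-style string literal.

    `macros` maps stringdef names to their replacement text.
    Assumes the insert characters are { and }.
    """
    defaults = {"'": "'", "{": "{"}      # the built-in {'} and {{}

    def replace(match):
        name = match.group(1)
        if name in macros:               # user macros win, so ' and { can be redefined
            return macros[name]
        if name in defaults:
            return defaults[name]
        if name.strip() == "" and "\n" in name:
            return ""                    # {W}: whitespace containing a newline is ignored
        if re.fullmatch(r"U\+[0-9A-Fa-f]+", name):
            return chr(int(name[2:], 16))
        raise ValueError("undefined string macro: " + name)

    # An insert is {, then one or more characters other than }, then }.
    return re.sub(r"\{([^}]+)\}", replace, literal)

# The Spanish vowels example, with two of the macros defined.
spanish = {"a'": "\u00e1", "e'": "\u00e9"}
vowels = expand("aeiou{a'}{e'}", spanish)
```

Looking up user macros before the defaults mirrors the statement that macros ' and { may be redefined.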

4 Routines

A routine definition has the form

    define R as C

where R is the routine name and C is a command, or bracketed group of commands. So a routine is defined as a sequence of zero or more commands. Snowball routines do not (at present) take parameters. For example,

    define Step_5b as (      // this defines Step_5b
        ['l']                // three commands here: [, 'l' and ]
        R2 'l'               // two commands, R2 and 'l'
        delete               // delete is one command
    )

    define R1 as $p1 <= cursor
        /* R1 is defined as the single command "$p1 <= cursor" */

A routine is called simply by using its name, R, as a command.

5 Commands and signals

The flow of control in Snowball is arranged by the implicit use of signals, rather than the explicit use of constructs like the if, else, break of C. The scheme is designed for handling strings, but is perhaps easier to introduce using integers. Suppose x, y, z ... are integers. The command

    $x = 1

sets x to 1. The command

    $x > 0

tests if x is greater than zero. Both commands give a signal t or f, (true or false), but while the second command gives t if x is greater than zero and f otherwise, the first command always gives t. In Snowball, every command gives a t or f signal. A sequence of commands can be turned into a single command by putting them in a list surrounded by round brackets:

    ( C₁ C₂ C₃ ... C_i C_i+1 ... )

When this is obeyed, C_i+1 will be obeyed if each of the preceding C₁ ... C_i give t, but as soon as a C_i gives f, the subsequent C_i+1 C_i+2 ... are ignored, and the whole sequence gives signal f. If all the C_i give t, however, the bracketed command sequence also gives t. So,

    $x > 0  $y = 1

sets y to 1 if x is greater than zero. If x is less than or equal to zero the two commands give f.

If C₁ and C₂ are commands, we can build up the larger commands,

C₁ or C₂: Do C₁. If it gives t ignore C₂, otherwise do C₂. The resulting signal is t if and only if C₁ or C₂ gave t.
C₁ and C₂: Do C₁. If it gives f ignore C₂, otherwise do C₂. The resulting signal is t if and only if C₁ and C₂ gave t.
not C: Do C. The resulting signal is t if C gave f, otherwise f.
try C: Do C. The resulting signal is t whatever the signal of C.
fail C: Do C. The resulting signal is f whatever the signal of C.

So for example,

($x > 0 $y = 1) or ($y = 0): sets y to 1 if x is greater than zero, otherwise to zero.
try( ($x > 0) and ($z > 0) $y = 1): sets y to 1 if both x and z are greater than 0, and gives t whether or not y was set.

This last example is the same as

    try($x > 0  $z > 0  $y = 1)

so that and seems unnecessary here. But we will see that and has a particular significance in string commands.
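For readers more used to conventional languages, the signal scheme for integer commands can be sketched in Python. This is an illustrative model only: a command is represented as a function that takes a dictionary of integer variables and returns True for t or False for f, and the helper names (assign, greater, seq, or_, try_, run) are assumptions made for this sketch.

```python
def assign(name, value):
    def command(env):
        env[name] = value
        return True                      # an assignment always gives t
    return command

def greater(name, value):
    return lambda env: env[name] > value

def seq(*commands):
    # all() stops at the first f, so the remaining commands are not obeyed
    return lambda env: all(cmd(env) for cmd in commands)

def or_(c1, c2):
    return lambda env: c1(env) or c2(env)

def try_(cmd):
    def command(env):
        cmd(env)                         # obey C, then give t whatever its signal
        return True
    return command

# ($x > 0 $y = 1) or ($y = 0)
set_y = or_(seq(greater("x", 0), assign("y", 1)), assign("y", 0))

def run(x):
    env = {"x": x, "y": None}
    return set_y(env), env["y"]          # (signal, final value of y)
```

Note that seq() with no commands gives t, which matches an empty bracketed list.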

When a ‘monadic’ construct like not, try or fail is not followed by a round bracket, the construct applies to the shortest following valid command. So for example

    try not $x < 1 $z > 0

would mean

    try ( not ( $x < 1 ) ) $z > 0

because $x < 1 is the shortest valid command following not, and then not $x < 1 is the shortest valid command following try.

The ‘dyadic’ constructs like and and or must sit in a bracketed list of commands anyway, for example,

    ( C₁ C₂ and C₃ C₄ or C₅ )

And then in this case C₂ and C₃ are connected by the and; C₄ and C₅ are connected by the or. So

    $x > 0  not $y > 0 or not $z > 0  $t > 0

means

    $x > 0  ((not ($y > 0)) or (not ($z > 0)))  $t > 0

The constructs and and or are equally binding, and bind from left to right, so C₁ or C₂ and C₃ means (C₁ or C₂) and C₃, etc.

6 Integer commands

There are two sorts of integer commands - assignments and comparisons. Both are built from Arithmetic Expressions (AEs).

Arithmetic Expressions (AEs)

An AE consists of integer names, literal numbers and a few other things connected by dyadic +, -, * and /, and monadic -, with the same binding powers and semantics as C. As well as integer names and literal numbers, the following may be used in AEs:

minint: the minimum negative number
maxint: the maximum positive number
cursor: the current value of the string cursor
limit: the current value of the string limit
size: the size of the string, in "slots"
sizeof s: the number of "slots" in s, where s is the name of a string or (since Snowball 2.1) a literal string

New in Snowball 2.0:

len: the length of the string, in Unicode characters
lenof s: the number of Unicode characters in s, where s is the name of a string or (since Snowball 2.1) a literal string

size and sizeof count in "slots" - see the "Character representation" section below for details.

The cursor and limit concepts are explained below.

Integer assignments

An integer assignment has the form

    $X assign_op AE

where X is an integer name and assign_op is one of the five assignments =, +=, -=, *=, or /=. The meanings are the same as in C.

For example,

    $p1 = limit    // set p1 to the string limit

Integer assignments always give the signal t.

Integer comparisons

An integer comparison has the form

    $X rel_op AE

or (since Snowball 2.0):

    $(AE₁ rel_op AE₂)

where X is an integer name and rel_op is one of the six tests ==, !=, >=, >, <=, or <. Again, the meanings are the same as in C.

Examples of integer comparisons are,

    $p1 <= cursor  // signal is f if the cursor is before position p1
    $(len >= 3)    // signal is f unless the string is at least 3 characters long

The second form is more general since an integer name is a valid AE, but it also allows comparisons which don't involve integer variables. Before support for this was added the second example could only be achieved by assigning len to a variable and then testing that variable instead.

7 String commands

If s is a string name, a string command has the form

    $s C

where C is a command that operates on the string. Strings can be processed left-to-right or right-to-left, but we will describe only the left-to-right case for now. The string has a cursor, which we will denote by c, and a limit point, or limit, which we will denote by l. c advances towards l in the course of a string command, but the various constructs and, or, not etc have side-effects which keep moving it backwards. Initially c is at the start and l the end of the string. For example,

        'a|n|i|m|a|d|v|e|r|s|i|o|n'
        |                         |
        c                         l

c, and l, mark the boundaries between characters, and not characters themselves. The characters between c and l will be denoted by c:l.

If C gives t, the cursor c will have a new, well-defined value. But if C gives f, c is undefined. Its later value will in fact be determined by the outer context of commands in which C came to be obeyed, not by C itself.

Here is a list of the commands that can be used to operate on strings.

a) Setting a value

= S

where S is the name of a string or a literal string. c:l is set equal to S, and l is adjusted to point to the end of the copied string. The signal is t. The slice should be considered unset afterwards, because this operation can change part of the string which overlaps the current slice. For example,

        $x  = 'animadversion'    /* literal string */
        $y = x                  /* string name */

b) Basic tests

S

here and below, S is the name of a string or a literal string. If c:l begins with the substring S, c is repositioned to the end of this substring, and the signal is t. Otherwise the signal is f. For example,

        $x 'anim'   /* gives t, assuming the string is 'animadversion' */
        $x ('anim' 'ad' 'vers')
                    /* ditto */

        $t = 'anim'
        $x t        /* ditto */

true, false

true is a dummy command that generates signal t. false generates signal f. They are sometimes useful for emphasis,

        define start_off as true       // nothing to do
        define exception_list as false // put in among(...) list later

true is equivalent to ()

C₁ or C₂

This is like the case for integers described above, but the extra touch is that if C₁ gives f, c is set back to its old position after C₁ has given f and before C₂ is tried, so that the test takes place on the same point in the string. So we have

        $x ('anim'  /* signal t */
            'ation' /* signal f */
           ) or
           ( 'an'   /* signal t - from the beginning */
           )

C₁ and C₂

And similarly c is set back to its old position after C₁ has given t and before C₂ is tried. So,

        $x 'anim' and 'an'   /* signal t */
        $x ('anim'  'an')    /* signal f, since 'an' and 'ad' mis-match */

not C

try C

These are like the integer tests, with the added feature that c is set back to its old position after an f signal is turned into t. So,

        $x (not 'animation' not 'immersion')
            /* both tests are done at the start of the string */

        $x (try 'animus' try 'an'
            'imad')
            /* - gives t */

try C is equivalent to C or true
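The cursor handling of these tests can be sketched in Python. In this illustrative model (not how Snowball is implemented) a command is a function taking the text, the cursor c and the limit l, and returning the new cursor for signal t or None for signal f. Because c is passed by value, "setting c back" simply means reusing the old value. All helper names are assumptions, and one character is assumed to occupy one position.

```python
def lit(s):
    def command(text, c, l):
        end = c + len(s)
        return end if end <= l and text[c:end] == s else None
    return command

def seq(*commands):
    def command(text, c, l):
        for cmd in commands:
            c = cmd(text, c, l)
            if c is None:
                return None              # first f ends the sequence
        return c
    return command

def or_(c1, c2):
    def command(text, c, l):
        r = c1(text, c, l)
        # if C1 gave f, C2 is tried from the old cursor position
        return r if r is not None else c2(text, c, l)
    return command

def and_(c1, c2):
    def command(text, c, l):
        if c1(text, c, l) is None:
            return None
        return c2(text, c, l)            # C2 also starts from the old cursor
    return command

def not_(cmd):
    def command(text, c, l):
        return c if cmd(text, c, l) is None else None
    return command

def try_(cmd):
    return or_(cmd, seq())               # try C is C or true

def run(cmd, text="animadversion"):
    return cmd(text, 0, len(text))
```

For instance, run(and_(lit("anim"), lit("an"))) leaves the cursor at 2, while run(seq(lit("anim"), lit("an"))) gives None, matching the two and examples above.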

test C

This does command C but without advancing c. Its signal is the same as the signal of C, but following signal t, c is set back to its old value.

test C is equivalent to not not C
test C₁ C₂ is equivalent to C₁ and C₂

fail C

This does C and gives signal f. It is equivalent to C false. Like false it is useful, but only rarely.

do C

This does C, puts c back to its old value and gives signal t. It is very useful as a way of suppressing the side effect of f signals and cursor movement.

do C is equivalent to try test C, or to test try C

goto C

c is moved right until obeying C gives t. But if c cannot be moved right because it is at l the signal is f. c is set back to the position it had before the last obeying of C, so the effect is to leave c before the pattern which matched against C.

        $x goto 'ad'         /* positions c after 'anim' */
        $x goto 'ax'         /* signal f */

gopast C

Like goto, but c is not set back, so the effect is to leave c after the pattern which matched against C.

        $x gopast 'ad'       /* positions c after 'animad' */
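The difference between goto and gopast can be sketched in Python. As an illustrative convention, a command is a function of (text, c, l) that returns the new cursor for signal t or None for signal f. The helper names are assumptions, and one character is assumed to occupy one position.

```python
def lit(s):
    def command(text, c, l):
        end = c + len(s)
        return end if end <= l and text[c:end] == s else None
    return command

def goto(cmd):
    def command(text, c, l):
        while True:
            if cmd(text, c, l) is not None:
                return c                 # leave c before the match
            if c >= l:
                return None              # c cannot move right: signal f
            c += 1
    return command

def gopast(cmd):
    def command(text, c, l):
        while True:
            r = cmd(text, c, l)
            if r is not None:
                return r                 # leave c after the match
            if c >= l:
                return None
            c += 1
    return command

x = "animadversion"
before_ad = goto(lit("ad"))(x, 0, len(x))    # cursor after 'anim'
after_ad = gopast(lit("ad"))(x, 0, len(x))   # cursor after 'animad'
```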

repeat C

C is repeated until it gives f. When this happens c is set back to the position it had before the last repetition of C, and repeat C gives signal t. For example,

        $x repeat gopast 'a' /* position c after the last 'a' */

loop AE C

This is like C C ... C written out AE times, where AE is an arithmetic expression. For example,

        $x loop 2 gopast ('a' or 'e' or 'i' or 'o' or 'u')
            /* position c after the second vowel */

The equivalent expression in C has the shape,

        int n = AE;
        for (int i = 0; i < n; i++) C;

atleast AE C

This is equivalent to loop AE C repeat C.
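The three repetition commands can be compared side by side in a small Python sketch. As an illustrative convention, a command is a function of (text, c, l) returning the new cursor for t or None for f. The helper names (char_in, gopast, repeat, loop, atleast) are assumptions of this sketch, and the loop count is taken as a plain integer rather than a full arithmetic expression.

```python
def char_in(chars):
    def command(text, c, l):
        return c + 1 if c < l and text[c] in chars else None
    return command

def gopast(cmd):
    def command(text, c, l):
        while True:
            r = cmd(text, c, l)
            if r is not None:
                return r
            if c >= l:
                return None
            c += 1
    return command

def repeat(cmd):
    def command(text, c, l):
        while True:
            r = cmd(text, c, l)
            if r is None:
                return c                 # keep c from before the failed repetition; signal t
            c = r
    return command

def loop(n, cmd):
    def command(text, c, l):
        for _ in range(n):
            c = cmd(text, c, l)
            if c is None:
                return None              # any f makes the whole loop give f
        return c
    return command

def atleast(n, cmd):
    def command(text, c, l):
        c = loop(n, cmd)(text, c, l)     # loop AE C ...
        return None if c is None else repeat(cmd)(text, c, l)   # ... then repeat C
    return command

x = "animadversion"
after_last_a = repeat(gopast(char_in("a")))(x, 0, len(x))
after_second_vowel = loop(2, gopast(char_in("aeiou")))(x, 0, len(x))
```

As in this model, a repeated command that gives t without moving the cursor would never terminate, so the body of repeat should always consume something.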

hop AE

moves c AE character positions towards l, but if AE is negative, or if there are fewer than AE characters between c and l, the signal is f. For example,

        test hop 3

tests that c:l contains more than 2 characters.

next

is equivalent to hop 1.

c) Moving text about

We have seen in (a) that $x = y, when x and y are strings, sets c:l of x to the value of y.

A more delicate mechanism for pushing text around is to define a substring, or slice of the string being tested. Then

[: sets the left-end of the slice to c,
]: sets the right-end of the slice to c,
-> s: copies the slice to variable s,
<- S: replaces the slice with variable (or literal) S.

For example

        /* assume x holds 'animadversion' */
        $x ( [          // '[animadversion' - [ set as indicated
             loop 2 gopast 'a'
                       // '[anima|dversion' - c is marked by '|'
             ]         // '[anima]dversion' - ] set as indicated
             -> y      // y is 'anima'
           )

For any string, the slice ends should be assumed to be unset until they are set with the two commands [ and ]. Thereafter the slice ends will continue to mark the same substring until altered.

delete: is equivalent to <- ''

This next example deletes all vowels in x,

        define vowel as ('a' or 'e' or 'i' or 'o' or 'u')
        /* ... */
        $x repeat ( gopast([vowel]) delete )

As this example shows, the slice markers [ and ] often appear as pairs in a bracketed style, which makes for easy reading of the Snowball scripts. But it must be remembered that, unusually in a computer programming language, they are not true brackets.

More simply, text can be inserted at c.

insert S: insert variable or literal S before c, moving c to the right of the insert.
attach S: the same, but leave c at the left of the insert.
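The slice and insertion commands can be pictured with a small Python class. This is a sketch under stated assumptions rather than a description of the generated code: the class and method names are invented for illustration, and the way the cursor is adjusted after a replacement is an assumption, since the text above does not spell it out. How the slice ends behave after an edit is not modelled.

```python
class Str:
    def __init__(self, text):
        self.text, self.c, self.l = text, 0, len(text)
        self.bra = self.ket = None       # slice ends are unset at first

    def set_left(self):                  # [
        self.bra = self.c

    def set_right(self):                 # ]
        self.ket = self.c

    def copy_slice(self):                # -> s
        return self.text[self.bra:self.ket]

    def replace_slice(self, s):          # <- S
        shift = len(s) - (self.ket - self.bra)
        self.text = self.text[:self.bra] + s + self.text[self.ket:]
        self.l += shift
        # Assumption: a cursor at or after the slice moves with the text,
        # and a cursor inside the slice falls back to its left end.
        if self.c >= self.ket:
            self.c += shift
        elif self.c > self.bra:
            self.c = self.bra

    def delete(self):                    # delete is <- ''
        self.replace_slice("")

    def insert(self, s):                 # insert S: c ends to the right of S
        self.text = self.text[:self.c] + s + self.text[self.c:]
        self.c += len(s)
        self.l += len(s)

    def attach(self, s):                 # attach S: c stays to the left of S
        self.text = self.text[:self.c] + s + self.text[self.c:]
        self.l += len(s)

# The worked example: [ at the start, ] after the second 'a'.
x = Str("animadversion")
x.set_left()
x.c = 5                                  # where loop 2 gopast 'a' leaves c
x.set_right()
y = x.copy_slice()                       # 'anima'
```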

d) Marks

The cursor, c, (and the limit, l) can be thought of as having a numeric value, from zero upwards:

         | a | n | i | m | a | d | v | e | r | s | i | o | n |
         0   1   2   3   4   5   6   7   8   9  10  11  12  13

It is these numeric values of c and l which are accessible through cursor and limit in arithmetic expressions.

setmark X: sets X to the current value of c, where X is an integer variable. It's equivalent to: $X = cursor
tomark AE: moves c forward to the position given by AE,
atmark AE: tests if c is at position AE (t or f signal). It's equivalent to: $(cursor == AE)

In the case of tomark AE, a similar fail condition occurs as with hop AE. If c is already beyond AE, or if position l is before position AE, the signal is f.

In the stemming algorithms, certain regions of the word are defined by setting marks, and later the failure condition of tomark is used to see if c is inside a particular region.

Two other commands put c at l, and test if c is at l,

tolimit: moves c forward to l (signal t always),
atlimit: tests if c is at l (t or f signal).

e) Changing l

In this account of string commands we see c moving right towards l, while l stays fixed at the end. In fact l can be reset to a new position between c and its old position, to act as a shorter barrier for the movement of c.

setlimit C₁ for C₂

C₁ is obeyed, and if it gives f the signal from setlimit is f with no further action.

Otherwise, the final value of c becomes the new position of l. c is then set back to its old value before C₁ was obeyed, and C₂ is obeyed. Finally l is set back to its old position, and the signal of C₂ becomes the signal of setlimit.

So the signal is f if either C₁ or C₂ gives f, otherwise t. For example,

    $x ( setlimit goto 's'  // 'animadver}sion' new l as marked '}'
         for                // below, '|' marks c after each goto
         ( goto 'a' and     // '|animadver}sion'
           goto 'e' and     // 'animadv|er}sion'
           goto 'i'         // 'an|imadver}sion'
         )
       )

This checks that x has characters ‘a’, ‘e’ and ‘i’ before the first ‘s’.
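The way setlimit narrows the region can be sketched in Python. As an illustrative convention, a command is a function of (text, c, l) returning the new cursor for t or None for f, so restoring the old limit needs no extra work. The helper names are assumptions, and one character is assumed to occupy one position.

```python
def lit(s):
    def command(text, c, l):
        end = c + len(s)
        return end if end <= l and text[c:end] == s else None
    return command

def goto(cmd):
    def command(text, c, l):
        while True:
            if cmd(text, c, l) is not None:
                return c
            if c >= l:
                return None
            c += 1
    return command

def and_(c1, c2):
    def command(text, c, l):
        if c1(text, c, l) is None:
            return None
        return c2(text, c, l)            # C2 starts from the old cursor
    return command

def setlimit(c1, c2):
    def command(text, c, l):
        new_l = c1(text, c, l)           # where C1 leaves c becomes the new limit
        if new_l is None:
            return None
        return c2(text, c, new_l)        # C2 runs from the old c, up to new_l
    return command

x = "animadversion"
has_a_e_i = setlimit(
    goto(lit("s")),
    and_(and_(goto(lit("a")), goto(lit("e"))), goto(lit("i"))),
)
result = has_a_e_i(x, 0, len(x))         # not None: a, e and i all occur before the first s
```

By contrast, setlimit(goto(lit("s")), goto(lit("o"))) gives None for this string, because the only o lies beyond the first s.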

f) Backward processing

String commands have been described with c to the left of l and moving right. But the process can be reversed.

backwards C: c and l are swapped over, and c moves left towards l. C is obeyed, the signal given by C becomes the signal of backwards C, and c and l are swapped back to their old values (except that l may have been adjusted because of deletions and insertions). C cannot contain another backwards command.
reverse C: A similar idea, but here c simply moves left instead of moving right, with the beginning of the string as the limit, l. C can contain other reverse commands, but it cannot contain commands to do deletions or insertions — it must be used for testing only. (Without this restriction Snowball's semantics would become very untidy.)

Forward and backward processing are entirely symmetric, except that forward processing is the default direction, and literal strings are always written out forwards, even when they are being tested backwards. So the following are equivalent,

    $x (
        'ani' 'mad' 'version' atlimit
    )

    $x backwards (
        'version' 'mad' 'ani' atlimit
    )
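The symmetry between the two directions can be checked with a short Python sketch. As an illustrative convention, a command is a function of (text, c, l) returning the new cursor for t or None for f; lit_b is a hypothetical backward-matching counterpart of lit, and backwards swaps c and l for the duration of its command. All names are assumptions of this sketch.

```python
def lit(s):                              # forward test: match s to the right of c
    def command(text, c, l):
        end = c + len(s)
        return end if end <= l and text[c:end] == s else None
    return command

def lit_b(s):                            # backward test: match s to the left of c
    def command(text, c, l):
        start = c - len(s)
        return start if start >= l and text[start:c] == s else None
    return command

def seq(*commands):
    def command(text, c, l):
        for cmd in commands:
            c = cmd(text, c, l)
            if c is None:
                return None
        return c
    return command

def atlimit(text, c, l):
    return c if c == l else None

def backwards(cmd):
    def command(text, c, l):
        # c and l are swapped for cmd, then the old c is restored
        return c if cmd(text, l, c) is not None else None
    return command

x = "animadversion"
forward = seq(lit("ani"), lit("mad"), lit("version"), atlimit)
backward = backwards(seq(lit_b("version"), lit_b("mad"), lit_b("ani"), atlimit))
both_match = (forward(x, 0, len(x)) is not None
              and backward(x, 0, len(x)) is not None)
```

Note that the literals are written forwards in both cases; only their order in the sequence is reversed.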

If a routine is defined for backwards mode processing, it must be included inside a backwardmode(...) declaration.

g) substring and among

The use of substring and among is central to the implementation of the stemming algorithms. It is like a case switch on strings. In its simpler form,

        substring among('S₁' 'S₂' 'S₃' ...)

searches for the longest matching substring 'S₁' or 'S₂' or 'S₃' ... from position c. (The 'S_i' must all be different.) So this has the same semantics as

        ('S₁' or 'S₂' or 'S₃' ...)

— so long as the 'S_i' are written out in decreasing order of length.
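The longest-match rule, and why the order matters in the or form, can be sketched in Python. This models only the simple form of among with plain strings; the function names among and first_of are illustrative assumptions, a command returns the new cursor for t or None for f, and real implementations search far more efficiently than this linear scan.

```python
def among(*strings):
    def command(text, c, l):
        best = None
        for s in strings:
            end = c + len(s)
            if end <= l and text[c:end] == s and (best is None or end > best):
                best = end               # remember the longest match so far
        return best
    return command

def first_of(*strings):                  # 'S1' or 'S2' or ... in the order given
    def command(text, c, l):
        for s in strings:
            end = c + len(s)
            if end <= l and text[c:end] == s:
                return end
        return None
    return command

x = "animadversion"
longest = among("a", "an", "anim")(x, 0, len(x))            # matches 'anim'
ordered = first_of("anim", "an", "a")(x, 0, len(x))         # same result
unordered = first_of("a", "an", "anim")(x, 0, len(x))       # stops at 'a'
```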

substring may be omitted, in which case it is attached to its following among, so

    among(/*...*/)

without a preceding substring is equivalent to

    (substring among(/*...*/))

substring may also be detached from its among, although it must precede it textually in the same routine in which the among appears. The more general form of substring /* ... */ among is,

    substring
    C
    among( 'S₁₁' 'S₁₂' ... (C₁)
           'S₂₁' 'S₂₂' ... (C₂)
           ...

           'S_n1' 'S_n2' ... (C_n)
         )

Obeying substring searches for a longest match among the 'S_ij'. The signal from substring is t if a match is found, otherwise f. Any commands C between the substring and among will be run after this search and only if the search finds a match (it would be equivalent to remove C and replace each C_i with C C_i). When the among comes to be obeyed, the C_i corresponding to the matched 'S_ij' is obeyed, and its signal becomes the signal of the among command.

substring/among pairs must match up textually inside each routine definition. But there is no problem with an among containing other substring/among pairs, and substring is optional before among anyway. The essential constraint is that two substrings must be separated by an among, and each substring must be followed by an among.

The effect of obeying among when the preceding substring is not obeyed is undefined. This would happen for example here,

    try($x != 617 substring)
    among(...) // 'substring' is bypassed in the exceptional case where x == 617

The significance of separating the substring from the among is to allow them to work in different contexts. For example,

    setlimit tomark L for substring

    among( 'S₁₁' 'S₁₂' ... (C₁)
           ...

           'S_n1' 'S_n2' ... (C_n)
         )

Here the test for the longest 'S_ij' is constrained to the region between c and the mark point given by integer L. But the commands C_i operate outside this limit. Another example is

    reverse substring

    among( 'S₁₁' 'S₁₂' ... (C₁)
           ...

           'S_n1' 'S_n2' ... (C_n)
         )

The substring test is in the opposite direction in the string to the direction of the commands C_i.

The last (C_n) may be omitted, in which case (true) is assumed.

Each string 'S_ij' may be optionally followed by a routine name,

    among(
           'S₁₁' R₁₁ 'S₁₂' R₁₂ ... (C₁)
           'S₂₁' R₂₁ 'S₂₂' R₂₂ ... (C₂)
           ...
           'S_n1' R_n1 'S_n2' R_n2 ... (C_n)
         )

If a routine name is not specified, it is equivalent to a routine which simply returns signal t,

    define null as true

— so we can imagine each 'S_ij' having its associated routine R_ij. Then obeying the among causes a search for the longest 'S_ij' whose corresponding routine R_ij gives t.

The routines R_ij should be written without any side-effects, other than the inevitable cursor movement. (c is in any case set back to its old value following a call of R_ij.)

8 Booleans

set B and unset B set B to true and false respectively, where B is a boolean name. B as a command gives a signal t if it is set true, f otherwise. For example,

    booleans ( Y_found )   // declare the boolean

    /* ... */

    unset Y_found          // unset it
    do ( ['y'] <-'Y' set Y_found )
       /* if c:l begins 'y' replace it by 'Y' and set Y_found */

    do repeat(goto (v ['y']) <-'Y' set Y_found)
       /* repeatedly move down the string looking for v 'y' and
          replacing 'y' with 'Y'. Whenever the replacement takes
          place set Y_found. v is a test for a vowel, defined as
          a grouping (see below). */


    /* Y_found means there are some letters Y in the string.
       Later we can use this to trigger a conversion back to
       lower case y. */

    /* ... */

    do (Y_found repeat(goto (['Y']) <- 'y'))

9 Groupings

A grouping brings characters together and enables them to be looked for with a single test.

If G is declared as a grouping, it can be defined by

    define G G₁ op G₂ op G₃ ...

where op is + or -, and G₁, G₂, G₃ are literal strings, or groupings that have already been defined. (There can be zero or more of these additional op components). For example,

    define capital_letter  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    define small_letter    'abcdefghijklmnopqrstuvwxyz'
    define letter          capital_letter + small_letter
    define vowel           'aeiou' + 'AEIOU'
    define consonant       letter - vowel
    define digit           '0123456789'
    define alphanumeric    letter + digit

Once G is defined, it can be used as a command, and is equivalent to a test

    'ch1' or 'ch2' or ...

where ch1, ch2 ... list all the characters in the grouping.

non G is the converse test, and matches any character except the characters of G. Note that non G is not the same as not G; in fact

non G is equivalent to (not G next)

non may be optionally followed by hyphen, for example:

    non-vowel
    non-digit

Bear in mind that non-vowel doesn't only match a consonant: it will match any character which isn't in the vowel grouping. Failing to consider this has led to bugs in stemming algorithms. For example, here we intended to undouble a consonant:

    [non-vowel] -> ch
    ch
    delete

The problem with this code is it will also mangle numbers with repeated digits, for example 1900 would become 190. A good rule of thumb here seems to be to use an inclusive grouping check if the code goes on to delete the character matched:

    [consonant] -> ch
    ch
    delete
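Groupings behave much like sets of characters, so the pitfall can be reproduced with Python sets. This sketch assumes the undoubling snippet runs at the end of the word in backward mode, as is usual in the stemmers; the text above does not state this. The function undouble and the two test functions are illustrative names.

```python
capital_letter = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
small_letter = set("abcdefghijklmnopqrstuvwxyz")
letter = capital_letter | small_letter   # define letter capital_letter + small_letter
vowel = set("aeiou") | set("AEIOU")
consonant = letter - vowel               # define consonant letter - vowel

def non_vowel(ch):                       # non-vowel: anything outside the grouping
    return ch not in vowel

def is_consonant(ch):                    # consonant: an inclusive check
    return ch in consonant

def undouble(word, matches):
    """Drop the last character if `matches` accepts it and it is doubled."""
    if len(word) >= 2 and matches(word[-1]) and word[-2] == word[-1]:
        return word[:-1]
    return word

mangled = undouble("1900", non_vowel)    # the digit 0 counts as a non-vowel
kept = undouble("1900", is_consonant)    # digits are not consonants
```

Both tests undouble 'hopp' to 'hop'; only the exclusive test also shortens '1900'.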

10 A Snowball program

A complete program consists of a sequence of declarations followed by a sequence of definitions of groupings and routines. Routines which are implicitly defined as operating on c:l from right to left must be included in a backwardmode(...) declaration.

A Snowball program is called up via a simple API through its defined externals. For example,

    externals ( stem1 stem2 )
    /* ... */
    define stem1 as ( /* stem1 commands */ )
    define stem2 as ( /* stem2 commands */ )

The API also allows a current string to be defined, and this becomes the c:l string for the external routine to work on. Its final value is the result handed back through the API.

The strings, integers and booleans are accessible from any point in the program, and exist throughout the running of the Snowball program. They are therefore like static declarations in C.

11 Comments, and other whitespace fillers

At a deeper level, a program is a sequence of tokens, interspersed with whitespace. Names, reserved words, literal numbers and strings are all tokens. Various symbols, made up of non-alphanumerics, are also tokens.

A name, reserved word or number is terminated by the first character that cannot form part of it. A symbol is recognised as the longest sequence of characters that forms a valid symbol. So +=- is two symbols, += and -, because += is a valid symbol in the language while +=- is not. Whitespace separates tokens but is otherwise ignored.

Occasionally a newer version of Snowball may add a new token. So as not to break existing programs, any such tokens declared as a name (via integers, routines, etc) will lose their token status for the rest of the program. This applies to the tokens len and lenof.

Anywhere that whitespace can occur, there may also occur:

(a) Comments, in the usual multi-line /* .... */ or single line // ... format.

(b) Get directives. These are like #include commands in C, and have the form get 'S', where 'S' is a literal string. For example,

    get '/home/martin/snowball/main-hdr' // include the file contents

(d) stringdef m 'S' where m is a sequence of characters not including whitespace and terminated with whitespace, and 'S' is a literal string.

12 Character representation

In this description of Snowball, it is assumed that strings are composed of characters, and that characters can be defined numerically, but the numeric range of these characters is not defined. As implemented, three different schemes are supported. Characters can either be (a) bytes in the range 0 to 255, as in traditional C strings, or (b) byte pairs in the range 0 to 65535, as in Java strings, or (c) UTF-8 encoded byte sequences in the range 0 to 65535, so that a character may occupy 1, 2 or 3 bytes.

For case (c), we need to make a slight separation of the concept of characters into symbols, the units of text being represented, and slots, the units of space into which they map. (So in case (a), all slots are one byte; in case (b) all slots are two bytes.) c and l have numeric values that can be used in AEs (arithmetic expressions). These values count the number of slots. Similarly setmark, tomark and atmark are remembering and then using slot counts. size and sizeof measure string size in slots, not symbols. However, hop N moves c over N symbols, not N slots, and next is equivalent to hop 1.

Snowball 2.0 adds len and lenof, which measure string length in symbols (so they're the same as size and sizeof in cases (a) and (b), but different in case (c)).

So long as these simple distinctions are recognised, the same Snowball script can be compiled to work with any of the three encoding schemes.
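The distinction between slots and symbols is easy to see in Python, where a str counts symbols and its UTF-8 encoding counts bytes. The sketch below is illustrative only: hop here is a hypothetical model of the behaviour described for case (c), not the runtime's own routine, and None stands for signal f.

```python
word = "se\u00f1or"                      # 'senor' with n-tilde (U+00F1)
data = word.encode("utf-8")

size_utf8 = len(data)                             # slots in case (c)
size_wide = len(word.encode("utf-16-le")) // 2    # slots in case (b)
length = len(word)                                # symbols, as len would report

def hop(data, c, l, n):
    """Advance slot offset c over n UTF-8 symbols; None models signal f."""
    if n < 0:
        return None
    for _ in range(n):
        if c >= l:
            return None                  # fewer than n symbols remain
        c += 1
        while c < l and (data[c] & 0xC0) == 0x80:
            c += 1                       # skip UTF-8 continuation bytes
    return c

after_three = hop(data, 0, len(data), 3) # three symbols occupy four slots here
```

So for this word size is 6 under UTF-8 but 5 under wide characters, while len is 5 in both cases.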

13 Legacy Features

This section documents features of Snowball for which there's a strongly preferred alternative. They're still supported for compatibility with existing code which uses them, but you shouldn't use them in new code. We document them here so that their meaning in existing code can be understood, and especially to aid updating to the preferred alternatives.

13.1 hex and decimal

In a stringdef, the string may be preceded by the word hex, or the word decimal. This was how non-ASCII characters were specified before support for specifying Unicode codepoints using the U+ notation was added.

hex and decimal mean that the contents of the string are interpreted as character values written out in hexadecimal, or decimal, notation. The characters should be separated by spaces. For example,

    hex 'DA'        /* is character hex DA */
    hex 'D A'       /* is the two characters, hex D and A (carriage
                       return, and line feed) */
    decimal '10'    /* character 10 (line feed) */
    decimal '13 10' /* characters 13 and 10 (carriage return, and
                       line feed) */

The following forms are equivalent,

    hex 'd a'      /* lower case also allowed */
    hex '0D 000A'  /* leading zeroes ignored */
    hex ' D  A  '  /* extra spacing is harmless */

The interpretation of the values is as Unicode codepoints if command line option -utf8 or -widechars is specified, and as character values in an unspecified single byte character set otherwise. For ASCII and ISO-8859-1 the character values match Unicode codepoints, but to handle other single byte character sets (e.g. ISO-8859-2 or KOI8-R) you would need a special version of a Snowball source with different character values specified via stringdef. The U+ notation allows you to use a single Snowball source in this situation.
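
The way hex and decimal strings are read can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than the compiler's actual code: the function name legacy_stringdef is hypothetical, values are treated as Unicode codepoints (as with -utf8 or -widechars), and the handling of malformed input is not modelled.

```python
# Rough model of how a hex or decimal string is read; not the compiler's code.

def legacy_stringdef(body: str, base: int) -> str:
    """Interpret body as whitespace-separated character values.

    base is 16 for hex and 10 for decimal. Each value is assumed to be
    a Unicode codepoint.
    """
    # split() with no arguments ignores leading, trailing and repeated
    # spaces; int() accepts either letter case and leading zeroes.
    return "".join(chr(int(token, base)) for token in body.split())


cr_lf = legacy_stringdef("D A", 16)
```

Under these assumptions the three equivalent forms shown earlier all produce the same two characters, carriage return followed by line feed.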

13.2 among starter command

The among command supports a "starter" command, C in this example:

    among( (C)
           'S₁₁' 'S₁₂' ... (C₁)
           'S₂₁' 'S₂₂' ... (C₂)
           ...
           'S_n1' 'S_n2' ... (C_n)
         )

This is equivalent to adding C at the start of each C_i:

    among( 'S₁₁' 'S₁₂' ... (C C₁)
           'S₂₁' 'S₂₂' ... (C C₂)
           ...
           'S_n1' 'S_n2' ... (C C_n)
         )

However, both are equivalent to:

    substring C
    among( 'S₁₁' 'S₁₂' ... (C₁)
           'S₂₁' 'S₂₂' ... (C₂)
           ...
           'S_n1' 'S_n2' ... (C_n)
         )

This requires an explicit substring but seems clearer so we recommend using this in new code and have designated the use of a starter as a legacy feature.

A starter is also allowed with an explicit substring, for example:

    substring
    C_s
    among( (C_a)
           'S₁₁' 'S₁₂' ... (C₁)
           'S₂₁' 'S₂₂' ... (C₂)
           ...
           'S_n1' 'S_n2' ... (C_n)
         )

is equivalent to:

    substring
    C_s
    C_a
    among( 'S₁₁' 'S₁₂' ... (C₁)
           'S₂₁' 'S₂₂' ... (C₂)
           ...
           'S_n1' 'S_n2' ... (C_n)
         )

13.3 => command

        $x => y

was documented as setting the value of y to the c:l region of x.

However, this was not the implemented behaviour: the region copied begins at the start of x rather than at the cursor. This difference can be seen with code such as

        $x (next => y)

which, as documented, should set y to x without the first character, but actually does the same as the first example.
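
The gap between the documented and the implemented behaviour can be modelled as two slices. The Python sketch below is illustrative only: the function names are hypothetical, c and l are passed in as plain integers, and every character is assumed to occupy one slot.

```python
# Rough model of the two behaviours of =>; not generated Snowball code.

def copy_documented(x: str, c: int, l: int) -> str:
    """Documented behaviour: copy the c:l region of x."""
    return x[c:l]


def copy_implemented(x: str, c: int, l: int) -> str:
    """Implemented behaviour: the copied region starts at the start of x."""
    return x[0:l]


x = "abcde"
# With the cursor at the start of x, both behaviours agree.
at_start = (copy_documented(x, 0, len(x)), copy_implemented(x, 0, len(x)))
# After next has moved the cursor on by one character, they differ.
after_next = (copy_documented(x, 1, len(x)), copy_implemented(x, 1, len(x)))
```

In this model at_start holds the same string twice, while after_next holds x without its first character alongside x in full.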

The generated Java code in some cases is invalid and fails to compile. This may also be true for some other target languages.

There was only a single known use, in Martin Porter's implementation of the Schinke Latin stemmer, where the code requires the implemented behaviour.

You should avoid using => in new code. We've replaced the uses in the Schinke stemmer and if there are other existing uses we recommend replacing them too. For example, the first example above can be rewritten as

        $y = x

If you have a use which you can't see how to replace, please get in touch and we can advise.

13.4 <+ synonym for insert

Snowball supports <+ as a synonym for insert. We recommend always using insert instead as it's clearer and not unnecessarily verbose.

Snowball syntax

In the grammar which follows, || is used for alternatives, [X] means that X is optional, and [X]* means that X is repeated zero or more times. Meta-symbols are defined on the left. <char> means any character.

The definition of literal string does not allow for the escaping conventions established by the stringescapes directive. The command ? is a debugging aid.

<letter>        ::= a || b || ... || z || A || B || ... || Z
<digit>         ::= 0 || 1 || ... || 9
<name>          ::= <letter> [ <letter> || <digit> || _ ]*
<s_name>        ::= <name>
<i_name>        ::= <name>
<b_name>        ::= <name>
<r_name>        ::= <name>
<g_name>        ::= <name>
<literal string>::= '[<char>]*'
<number>        ::= <digit> [ <digit> ]*

S               ::= <s_name> || <literal string>
G               ::= <g_name> || <literal string>

<declaration>   ::= strings ( [<s_name>]* ) ||
                    integers ( [<i_name>]* ) ||
                    booleans ( [<b_name>]* ) ||
                    routines ( [<r_name>]* ) ||
                    externals ( [<r_name>]* ) ||
                    groupings ( [<g_name>]* )

<r_definition>  ::= define <r_name> as C
<plus_or_minus> ::= + || -
<g_definition>  ::= define <g_name> G [ <plus_or_minus> G ]*

AE              ::= (AE) ||
                    AE + AE || AE - AE || AE * AE || AE / AE || - AE ||
                    maxint || minint || cursor || limit ||
                    size || sizeof S ||
                    len || lenof S ||
                    <i_name> || <number>

<i_assign>      ::= $ <i_name> = AE ||
                    $ <i_name> += AE || $ <i_name> -= AE ||
                    $ <i_name> *= AE || $ <i_name> /= AE

<i_test_op>     ::= == || != || > || >= || < || <=

<i_test>        ::= $ ( AE <i_test_op> AE ) ||
                    $ <i_name> <i_test_op> AE

<s_command>     ::= $ <s_name> C

C               ::= ( [C]* ) ||
                    <i_assign> || <i_test> || <s_command> || C or C || C and C ||
                    not C || test C || try C || do C || fail C ||
                    goto C || gopast C || repeat C || loop AE C ||
                    atleast AE C || S || = S || insert S || attach S ||
                    <- S || delete ||  hop AE || next ||
                    => <s_name> || [ || ] || -> <s_name> ||
                    setmark <i_name> || tomark AE || atmark AE ||
                    tolimit || atlimit || setlimit C for C ||
                    backwards C || reverse C || substring ||
                    among ( [<literal string> [<r_name>] || (C)]* ) ||
                    set <b_name> || unset <b_name> || <b_name> ||
                    <r_name> || <g_name> || non [-] <g_name> ||
                    true || false || ?

P              ::=  [P]* || <declaration> ||
                    <r_definition> || <g_definition> ||
                    backwardmode ( P )

<program>      ::=  P



synonyms:      <+ for insert
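
As an aid to reading the notation, the <name> rule in the grammar above corresponds to a simple regular expression: an alternation inside [ ]* becomes a repeated character class. The Python sketch below uses hypothetical names (NAME, is_name) and checks only the shape of a name. It does not model the separate rule that a name may not be a reserved word.

```python
import re

# <name> ::= <letter> [ <letter> || <digit> || _ ]*
NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_name(text: str) -> bool:
    """True if text has the shape required by the <name> rule."""
    return NAME.fullmatch(text) is not None
```

For example, Step_1a and p1 have the required shape, whereas 1a and _x do not, because a name must start with an ASCII letter.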