Using Snowball

Compiling and running Snowball

When you download Snowball, it already contains a make file to allow you to build and install it, like so:

    make
    sudo make install

The Snowball compiler can then be invoked with the following syntax,

    snowball F1 [-o[utput] F2]
                [-s[yntax]]
                [-w[idechars]]  [-u[tf8]]
                [-j[ava]]  [-n[ame] C]
                [-ep[refix] S1]  [-vp[refix] S2]
                [-i[nclude] D]
                [-r[untime] P]

For example,

    snowball danish/stem.sbl -o q/danish
    snowball danish/stem.sbl -syntax
    snowball danish/stem.sbl -output q/danish -ep danish_

The first argument,  F1, is the name of the Snowball file to be compiled. If the  -java  option is absent, the compiler produces two outputs, an ANSI C module in  F2.c  and a corresponding header file in  F2.h. If the  -java  option is present, Java output is produced in  F2.java.

The  -widechars,  -utf8,  -eprefix  and  -vprefix  options belong with ANSI C generation; the  -name  option with Java generation.

ANSI C generation

In the absence of the  -eprefix  and  -vprefix  options, the list of declared externals in the Snowball program, for example,

    externals ( stem_1 stem_2 moderate )

gives rise to a header file containing,

    extern struct SN_env * create_env(void);
    extern void close_env(struct SN_env * z);

    extern int moderate(struct SN_env * z);
    extern int stem_2(struct SN_env * z);
    extern int stem_1(struct SN_env * z);

If  -eprefix  is used, its string,  S1, is prefixed to each external name, for example

    -eprefix Khotanese_

would give rise to the header file,

    extern struct SN_env * Khotanese_create_env(void);
    extern void Khotanese_close_env(struct SN_env * z);

    extern int Khotanese_moderate(struct SN_env * z);
    extern int Khotanese_stem_2(struct SN_env * z);
    extern int Khotanese_stem_1(struct SN_env * z);

If  -vprefix  is used, all Snowball strings, integers and booleans give rise to a  #define  line in the header file. For example

    -eprefix Khotanese_ -vprefix Khotanese_variable

would give rise to the header file,

    extern struct SN_env * Khotanese_create_env(void);
    extern void Khotanese_close_env(struct SN_env * z);

    #define Khotanese_variable_ch (S[0])
    #define Khotanese_variable_Y_found (B[0])
    #define Khotanese_variable_p2 (I[1])
    #define Khotanese_variable_p1 (I[0])
    extern int Khotanese_stem(struct SN_env * z);

The  -widechars  option affects interpretation of Snowball hex and decimal strings, as in

    stringdef m hex 'H1 H2 ...'
    stringdef m decimal 'D1 D2 ...'

where  H1,  H2  ... are hex numbers and  D1,  D2  ... are decimal numbers. Without the  -widechars  option it is an error for these numbers to exceed 255. With the  -widechars  option it is only an error if they exceed 65535. So by default one-byte characters are assumed, but  -widechars  makes the assumption that characters are two bytes. Note that (a) the output from Snowball is the same in both cases, and (b) the  -java  option automatically sets the  -widechars  option.

Within the API header file  api.h,  symbol  is given a typedef of unsigned char,

        typedef unsigned char symbol;

and a sequence of characters representing a word to be stemmed is then held in a  symbol  array. To switch to a 16-bit representation of characters, just replace  char  by  short  here:

        typedef unsigned short symbol;

The  -utf8  option is an alternative to  -widechars. Again, it allows characters in the range 0 to 65535 in  stringdefs, but these characters are then encoded as 2 or 3 byte characters in the UTF-8 encoding scheme. The ANSI C program output by Snowball is similarly adjusted to handle characters that can occupy multiple bytes. (See section 12 of the Snowball manual.)

The  -runtime  option is used to prepend a path to any  #include lines in the generated code, and is useful when the runtime header files (i.e. those files in the runtime directory in the standard distribution) are not in the same location as the generated source files. It is used when building the libstemmer library, and may be useful for other projects.

Java generation

The  -java  option automatically sets the  -widechars  option.

Other options

If  -syntax  is used the other options are ignored, and the syntax tree of the Snowball program is directed to  stdout. This can be a handy way of checking that you have got the bracketing right in the program you have written.

Any number of  -include  options may be present, for example,

    snowball testfile -output test -ep danish_  \
             -include /home/martin/Snowball/codesets  \
             -include extras

Each  -include  is followed by a directory name. With a chain of directories  D1,  D2  ...  Dn, a Snowball  get  directive,

    get 'F'

causes  F  to be searched for in the successive locations,

    F
    D1/F
    D2/F
    ...
    Dn/F

that is, the current directory, followed in turn by directories  D1  to  Dn.

The Snowball API

To access Snowball from C, include the header  api.h, and any headers generated from the Snowball scripts you wish to use.  api.h  declares

    struct SN_env { /* ... */ };
    extern void SN_set_current(struct SN_env * z, int size, char * s);

Continuing the previous example, you set up an environment to call the resources of the Khotanese module with

    struct SN_env * z;
    z = Khotanese_create_env();

Snowball has the concept of a ‘current string’. This can be set up by,

    SN_set_current(z, i, b);

This defines the current string as the  i  bytes of data starting at address  b. The externals can then be called,

    Khotanese_moderate(z);
    /* ... */
    Khotanese_stem_1(z);

They give a 1 or 0 result, corresponding to the t or f result of the Snowball routine.

And later,

    Khotanese_close_env(z);

to release the space held by  z  back to the system. You can do this for a number of Snowball modules at the same time: you will need a separate  struct SN_env * z;  for each module.

The current string is given by the  z->l  bytes of data starting at  z->p. The string is not zero-terminated, but you can zero terminate it yourself with

    z->p[z->l] = 0;

(There is always room for this last zero byte.) For example,

    SN_set_current(z, strlen(s), s);
    Khotanese_stem_1(z);
    z->p[z->l] = 0;
    printf("Khotanese-1 stems '%s' to '%s'\n", s, z->p);

The values of the other variables can be accessed via the  #define settings that result from the  -vprefix  option, although this should not usually be necessary:

    printf("p1 is %d\n", z->Khotanese_variable_p1);

The stemming scripts on this Web site use Snowball very simply. -vprefix  is left unset, and  -eprefix  is set to the name of the script (usually the language the script is for). All the programs are tested through a common driver program.

Getting started

The complete apparatus of the libstemmer download plus the make files can obscure the essential simplicity of the Snowball system. Just to get a bit of confidence in using it, here is something you can try safely at home.

First, install the compiler, as documented above.

Then copy the files  stem_ISO_8859_1.sbl,  voc.txt  and  output.txt  from the page for the Hungarian stemmer,

    http://snowballstem.org/algorithms/hungarian/stemmer.html

to the same directory, renaming the snowball script as  hungarian.sbl. Take the  runtime/  directory out of the snowball code download, and put it in the current directory, renamed as  q/, say.

q/  contains four files,  api.c,  api.h,  header.h,  utilities.c. Now compile the snowball script,

    snowball hungarian.sbl -o q/hungarian -ep H_ -utf8

(Note the  -utf8  option.) This puts two more files into  q/,  hungarian.c  and  hungarian.h. Next put into  q/  a driver program. You can download it from the link at the top of this page, but here it is,

    #include <stdio.h>
    #include <stdlib.h> /* for malloc, free */
    #include <string.h> /* for memmove, memcmp, strlen */

    #include "api.h"
    #include "hungarian.h"


    /* This derives from the source file driver.template */

    /* A simple driver for a single ANSI C generated Hungarian stemmer.

       Following compilation with

           gcc -o H_prog q/*.c

       The command line syntax is

           ./H_prog file [-o[utput] file] [-h[elp]]

       The first argument gives the input file, which consists of a list of words
       to be stemmed, one per line. (Words must be in lower case.) If omitted, stdin
       is used.

       The output is sent to stdout by default, otherwise to the -output file.

    */

    static void stem_file(struct SN_env * z, FILE * f_in, FILE * f_out) {
    #define INC 10
        int lim = INC;
        symbol * b = (symbol *) malloc(lim * sizeof(symbol));

        while(1) {
            int ch = getc(f_in);
            if (ch == EOF) {
                free(b); return;
            }
            {
                int i = 0;
                while(1) {
                    if (ch == '\n' || ch == EOF) break;
                    if (i == lim) {  /* make b bigger */
                        symbol * q = (symbol *) malloc((lim + INC) * sizeof(symbol));
                        memmove(q, b, lim * sizeof(symbol));
                        free(b); b = q;
                        lim = lim + INC;
                    }
                    b[i] = ch; i++;
                    ch = getc(f_in);
                }

                SN_set_current(z, i, b);
                H_stem(z);
                {
                    int j;
                    for (j = 0; j < z->l; j++) fprintf(f_out, "%c", z->p[j]);
                    fprintf(f_out, "\n");
                }
            }
        }
    }

    static int eq(char * s1, char * s2) {
        int s1_len = strlen(s1);
        int s2_len = strlen(s2);
        return s1_len == s2_len && memcmp(s1, s2, s1_len) == 0;
    }

    static void show_options(int n) {
        printf("options are: file [-o[utput] file] [-h[elp]]\n");
        exit(n);
    }

    int main(int argc, char * argv[])
    {   char * in = 0;
        char * out = 0;
        {   char * s;
            int i = 1;
            while(1) {
                if (i >= argc) break;
                s = argv[i++];
                if (s[0] == '-') {

                    if (eq(s, "-output") || eq(s, "-o")) {
                        if (i >= argc) {
                            fprintf(stderr, "%s requires an argument\n", s);
                            exit(1);
                        }
                        out = argv[i++];
                    } else if (eq(s, "-help") || eq(s, "-h")) {
                        show_options(0);
                    } else {
                        fprintf(stderr, "%s unknown\n", s);
                        show_options(1);
                    }
                }
                else in = s;
            }
        }

        /* initialise the stemming process: */

        {
            struct SN_env * z = H_create_env();
            FILE * f_in;
            FILE * f_out;
            f_in = in == 0 ? stdin : fopen(in, "r");
            if (f_in == 0) {
                fprintf(stderr, "file %s not found\n", in); exit(1);
            }
            f_out = out == 0 ? stdout : fopen(out, "w");
            if (f_out == 0) {
                fprintf(stderr, "file %s cannot be opened\n", out); exit(1);
            }
            stem_file(z, f_in, f_out);
            H_close_env(z);
        }

        return 0;
    }

Now compile the  q/  sources to the Hungarian stemmer program  H_prog,

    gcc -o H_prog q/*.c

And  H_prog  will turn the vocabulary into the stemmed output, as you can check by doing,

    ./H_prog voc.txt -o TEMP.txt
    diff output.txt TEMP.txt

In summary therefore,

    snowball hungarian.sbl -o q/hungarian -ep H_ -utf8
    gcc -o H_prog q/*.c
    ./H_prog voc.txt -o TEMP.txt
    diff output.txt TEMP.txt

(Not so hard.)

Debugging snowball scripts

In the rare event that your Snowball script does not run perfectly the first time:

Remember that the option  -syntax  prints out the syntax tree. A question mark can be included in Snowball as a command, and it will generate a call to  debug(...). The  debug  function defined in  runtime/utilities.c  (usually commented out) can then be used. It causes the current string to be sent to  stdout, with square brackets marking the slice and a vertical bar marking the position of c. Curly brackets mark the end-limits of the string, which may be less than the whole string because of the action of  setlimit.

At present there is no way of reporting the value of an integer or boolean.

If desperate, you can put debugging lines into the generated C program. This is not so hard, since running comments show the correspondence with the Snowball source.

Compiler bugs

If you hit a snowball compiler bug, try to capture it in a small script before notifying us.

Known problems in Snowball

The main one is that it is possible to ‘pull the rug from under your own feet’ in constructions like this:

    [ do something ]
    do something_else
    ( C1 delete C2 ) or ( C3 )

Suppose  C1  gives t, the delete removes the slice established on the first line, and  C2  gives f. Then  C3  is done with c set back to the value it had before  C1  was obeyed, but this old value does not take account of the byte shift caused by the delete. This problem was foreseen from the beginning when designing Snowball, and recognised as a minor issue because it is an unnatural thing to want to do. (C3  should not be an alternative to something which has deletion as an occasional side-effect.) It may be addressed in the future.