When you download Snowball, it already contains a makefile to allow you to build it, like so:
make
You can confirm it's working with a simple test like so:
echo "running" | ./stemwords -l en
which should output: run
There's no built-in way to install Snowball currently - you can either copy the snowball binary to somewhere that's on your PATH (e.g. on a typical Linux machine: sudo cp snowball /usr/local/bin) or just run it from the source tree with ./snowball.
The Snowball compiler has the following command-line syntax:
Usage: snowball SOURCE_FILE... [OPTIONS]

Supported options:
  -o, -output OUTPUT_BASE
  -s, -syntax                       show syntax tree and stop
      -comments                     generate comments
      -coverage                     generate coverage report
      -ada                          generate Ada
      -c++                          generate C++
  -cs, -csharp                      generate C#
      -dart                         generate Dart
      -go                           generate Go
  -j, -java                         generate Java
      -js                           generate Javascript
      -pascal                       generate Pascal
      -php                          generate PHP
  -py, -python                      generate Python
      -rust                         generate Rust
      -zig                          generate Zig
  -w, -widechars
  -u, -utf8
  -n, -name CLASS_NAME
  -ep, -eprefix EXTERNAL_PREFIX
  -vp, -vprefix VARIABLE_PREFIX
  -i, -include DIRECTORY
  -r, -runtime DIRECTORY
      -cheader                      header name to include from C/C++ file
      -hheader                      header name to include from C/C++ header
  -p, -parentclassname CLASS_NAME   fully qualified parent class name
  -P, -Package PACKAGE_NAME         package name for stemmers
  -S, -Stringclass STRING_CLASS     StringBuffer-compatible class
  -a, -amongclass AMONG_CLASS       fully qualified name of the Among class
  -gor, -goruntime PACKAGE_NAME     Go snowball runtime package
  --help                            display this help and exit
  --version                         output version information and exit
For example,
snowball danish.sbl -o q/danish
snowball danish.sbl -syntax
snowball danish.sbl -output q/danish -ep danish_
The first argument, SOURCE_FILE, is the name of the Snowball file to be compiled. Unless you specify a different programming language to
generate code for, the default is to generate ISO C which results in two output
files, a C source in OUTPUT_BASE.c and a corresponding header file in OUTPUT_BASE.h. This is similar for other
programming languages, e.g. if option -java is
present, Java output is produced in OUTPUT_BASE.java.
Some options are only valid when generating code for particular programming
languages. For example, the -widechars,
-utf8, -eprefix and
-vprefix options are specific to C and C++.
In the absence of the -eprefix and -vprefix options, the list of
declared externals in the Snowball program, for example,
externals ( stem_1 stem_2 moderate )
gives rise to a header file containing,
extern struct SN_env * create_env(void);
extern void close_env(struct SN_env * z);
extern int moderate(struct SN_env * z);
extern int stem_2(struct SN_env * z);
extern int stem_1(struct SN_env * z);
If -eprefix is used, its string, S1, is prefixed to each external
name, for example
-eprefix Khotanese_
would give rise to the header file,
extern struct SN_env * Khotanese_create_env(void);
extern void Khotanese_close_env(struct SN_env * z);
extern int Khotanese_moderate(struct SN_env * z);
extern int Khotanese_stem_2(struct SN_env * z);
extern int Khotanese_stem_1(struct SN_env * z);
If -vprefix is used, then functions are generated to
provide access to Snowball strings, integers and booleans. (In Snowball 3.1.x
and earlier, Snowball variables were stored in a different way in the generated
C code and macros were generated if -vprefix was used).
For example
-eprefix Khotanese_ -vprefix Khotanese_variable_
would give rise to the header file,
extern struct SN_env * Khotanese_create_env(void);
extern void Khotanese_close_env(struct SN_env * z);
extern const symbol * Khotanese_variable_ch(struct SN_env * z);
extern int Khotanese_variable_Y_found(struct SN_env * z);
extern int Khotanese_variable_p2(struct SN_env * z);
extern int Khotanese_variable_p1(struct SN_env * z);
extern int Khotanese_stem(struct SN_env * z);
The Snowball compiler will attempt to "localise" integer and boolean variables
used only in one routine, in which case they become local variables in the C
code and -vprefix won't generate code for them. Let us know if
this is a problem for you (the Snowball language may need to gain a way to mark
variables as external, which could suppress this optimisation and generate an
access function).
The -utf8 and -widechars options affect how
the generated C/C++ code expects strings to be represented - UTF-8 or
wide-character Unicode (stored using 2 bytes per codepoint), or if neither is
specified, one byte per codepoint using either ISO-8859-1 or another encoding.
For other programming languages, one of these three representations is effectively hard-coded (except that wide characters may be wider) - e.g. C#, Java, Javascript and Python use wide characters; Ada, Go and Rust use UTF-8; Pascal uses ISO-8859-1. Since Snowball 2.0 it's possible with a little care to write Snowball code that works regardless of how characters are represented. See section 12 of the Snowball manual for more details.
The -runtime option is used to prepend a path to any #include
lines in the generated code, and is useful when the runtime header files (i.e.
those files in the runtime directory in the standard distribution) are not
in the same location as the generated source files. It is used when
building the libstemmer library, and may be useful for other projects.
Any number of -include options may be present, for example,
snowball testfile -output test -ep danish_ \
-include /home/martin/Snowball/codesets \
-include extras
Each -include is followed by a directory name. With a chain of
directories D1, D2 ... Dn, a Snowball get directive,
get 'F'
causes F to be searched for in the successive locations,
F
D1/F
D2/F
...
Dn/F
— that is, the current directory, followed in turn by directories D1 to
Dn.
To access Snowball from C, include the header api.h, and any headers
generated from the Snowball scripts you wish to use. api.h declares
struct SN_env { /* ... */ };
extern void SN_set_current(struct SN_env * z, int size, char * s);
Continuing the previous example, you set up an environment to call the resources of the Khotanese module with
struct SN_env * z;
z = Khotanese_create_env();
Snowball has the concept of a ‘current string’. This can be set up by,
SN_set_current(z, i, b);
This defines the current string as the i bytes of data starting at
address b. The externals can then be called,
Khotanese_moderate(z);
/* ... */
Khotanese_stem_1(z);
They give a 1 or 0 result, corresponding to the t or f result of the Snowball routine.
And later,
Khotanese_close_env(z);
to release the space used by z back to the system. You can do this for a
number of Snowball modules at the same time: you will need a separate
struct SN_env * z; for each module.
The current string is given by the z->l bytes of data starting at z->p.
The string is not zero-terminated, but you can zero terminate it yourself with
z->p[z->l] = 0;
(There is always room for this last zero byte.) For example,
SN_set_current(z, strlen(s), s);
Khotanese_stem_1(z);
z->p[z->l] = 0;
printf("Khotanese-1 stems '%s' to '%s'\n", s, z->p);
The values of the other variables can be accessed via the access functions
generated when the -vprefix option is used, although this should not
usually be necessary:
printf("p1 is %d\n", Khotanese_variable_p1(z));
The stemming scripts on this Web site use Snowball very simply.
-vprefix is left unset, and -eprefix is set to the name of the
script (usually the language the script is for).
The Snowball compiler provides some options to support developing and debugging Snowball programs.
We aim to have the Snowball compiler issue a warning for code that's likely to be wrong. For example, it will warn about many cases where code can't be reached or a command has no effect. If you encounter a situation where the compiler could have usefully warned but didn't, please report it.
Snowball has a debug command, which is a question mark (?).
This generates a call to a debug(...) helper function in the
target language. Currently this is implemented for C/C++ and (since Snowball
3.1.0) for Ada. It writes the current string to stdout, annotated
to show the positions of the cursor, limits, and slice ends (|
marks the cursor c, curly brackets ({ and
}) mark the limits, and square brackets ([ and
]) mark the slice ends. The limits may be less than the whole
string because of the action of setlimit.
At present there is no way of reporting the value of integer, boolean or string
variables. If desperate, you can put debugging lines into the generated code.
Passing -comments makes it easier to find where to add such debugging
code (see the next section).
You can pass -comments to the Snowball compiler to get it to
generate comments showing the correspondence with the Snowball source. This
is mainly useful when developing or debugging code generators.
The comments report commands as they exist in the syntax tree after some
optimisations. Reported line numbers should be correct, but sometimes the
command reported can be different to that in the source code (for example,
atleast 0 C is rewritten as repeat C, while
$x /= -1 is rewritten as $x *= -1 - the generated
comments for these cases will say "repeat" and "*=" respectively).
If command-line option -syntax is used then instead of generating
code the compiler writes out the syntax tree to stdout. This can
be a handy way of checking that you have got the bracketing right in the
program you have written.
The Snowball compiler performs some optimisations on the program as
it builds the syntax tree (for example, constant sub-expressions are evaluated,
no-op commands are warned about and replaced with true, etc); the
syntax tree is shown after such transformations.
Snowball 3.1.0 added a simple code coverage feature (currently only supported
for C/C++). It generates extra code which allows the runtime to report
which among cases and which grouping characters are
actually exercised. This can help find situations where a case is impossible
to trigger because the words it's meant to handle get dealt with by an earlier
step. It can also find gaps in the test vocabulary - often adding a handful
of words can complete the among coverage.
To use this feature, you need to enable it when generating the code and when building it:
make SNOWBALL_FLAGS=-coverage CPPFLAGS=-DSNOWBALL_COVERAGE
Then stemming a word will log coverage data to stderr, and you
can turn that into a coverage report using sort and
uniq (some of these options may be specific to GNU sort):
make -s check_utf8_swedish 2> coverage.log
sort -k2,2 -k3,3n -k5,5g coverage.log | uniq -c > coverage.report
The file coverage.report looks like this for an among:
24402 algorithms/danish.sbl:59: among 1 no match
32 algorithms/danish.sbl:60: among 1 : 0 of 4 string 'gd'
316 algorithms/danish.sbl:61: among 1 : 1 of 4 string 'dt'
536 algorithms/danish.sbl:61: among 1 : 2 of 4 string 'gt'
48 algorithms/danish.sbl:61: among 1 : 3 of 4 string 'kt'
The columns are: the count prepended by uniq -c; the source file and line; which among the entry relates to; the index of the case within the among and the total number of cases; and the string for that case.
Note that the indices are 0-based, so in the example above are "0 of 4" to "3 of 4" (there's no "4 of 4").
Cases/characters which never match are not in the report - check for gaps in the numbering to find them. Among strings are reported in the same order as they appear in the source code, so you can use the strings either side of a gap to locate the untriggered string in the source code.
The coverage feature could definitely be slicker to use, but it's already at a state where it seemed useful to include in releases.
It would also be useful to report coverage for every command, not just among cases and grouping characters.
If you hit a Snowball compiler bug, try to capture it in a small script before notifying us.
The main known problem is that it is possible to ‘pull the rug from under your own feet’ in constructions like this:
[ do something ]
do something_else
( C1 delete C2 ) or ( C3 )
Suppose C1 gives t, the delete removes the slice established on the first
line, and C2 gives f, so C3 is done with c set back to the value it had
before C1 was obeyed — but this old value does not take account of the byte shift
caused by the delete. This problem was foreseen from the beginning when designing
Snowball, and recognised as a minor issue because it is an unnatural thing to want to
do. (C3 should not be an alternative to something which has deletion as an
occasional side-effect.) It may be addressed in the future.