Snowball is a small string-handling language, and its name was chosen as a tribute to SNOBOL (Farber 1964, Griswold 1968 — see the references at the end of the introduction), with which it shares the concept of string patterns delivering signals that are used to control the flow of the program.
The basic data types handled by Snowball are strings of characters, signed integers, and boolean truth values, or more simply strings, integers and booleans. Snowball supports Unicode characters, which may be represented as UTF-8, 8-bit characters, or 16-bit wide characters (depending on the programming language for which code is being generated; for C, all these options are supported).
A name in Snowball starts with an ASCII letter, followed by zero or more ASCII letters, digits and underscores. A name can be of type string, integer, boolean, routine, external or grouping. All names must be declared. A declaration has the form
Ts ( ... )
where symbol T is one of string, integer etc, and the region in brackets contains a list of names separated by whitespace. For example,
integers ( p1 p2 )
booleans ( Y_found )
routines (
shortv
R1 R2
Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5a Step_5b
)
externals ( stem )
groupings ( v v_WXY v_LSZ )
p1 and p2 are integers, Y_found is boolean, and so on. Snowball is quite strict about the declarations, so all the names go in the same name space, no name may be declared twice, all used names must be declared, no two routine definitions can have the same name, etc. Names declared and subsequently not used are merely reported in a warning message.
A name may not be one of the reserved words of Snowball. Additionally, names for externals must in most cases be valid function/method names in the language being generated, which generally means they can't be reserved words in that language (e.g. externals (null) will generate invalid Java code containing a method public boolean null()). For internal symbols we add a prefix to avoid this issue, but an external has to provide an external interface. When generating C code, the -eprefix option provides a potential solution to this problem.
Names in Snowball are case-sensitive, but external names which differ only in case will cause a problem for languages with case-insensitive identifiers (such as Pascal). This issue is avoided for internal symbols in such languages by encoding case difference via an added prefix.
So for portability a little care is needed when choosing names for externals.
The convention when using Snowball to implement stemming algorithms is to have a single external named stem, which should be safe.
A literal integer is an ASCII digit sequence, and is always interpreted as decimal.
A literal string is written between single quotes, for example,
'aeiouy'
Two special insert characters for use in literal strings are defined by the directive stringescapes AB, for example,
stringescapes {}
Conventionally { and } are used as the insert characters, and we would recommend following this convention unless you want to use these as literal characters in your strings a lot. However, A and B can be any printing characters, except that A can't be a single quote. (If A and B are the same then A itself can never be escaped.)
A subsequent occurrence of the stringescapes directive redefines the insert characters (but any string macros already defined with stringdef remain defined).
Within insert characters, the following sequences are understood:
User-defined string macros, which can be specified using stringdef. Macro m is defined in the form stringdef m 'S', where 'S' is a string, and m a sequence of one or more printing characters. Thereafter, {m} inside a string causes S to be substituted in place of {m}.
New in Snowball 2.0: Unicode codepoints can be specified using the syntax U+ followed by one or more hex digits, for example '{U+FFFD}'. These are automatically handled appropriately in all cases except if you want to generate C code to handle a single byte character set other than ISO-8859-1. Such cases are handled by defining string macros for the U+ codes in the character set, after which the same Snowball source can be used. You can't mix U+ codes defined as string macros with U+ codes that keep their default meanings in the same compilation. When U+ codes are defined as string macros, snowball will upper case the characters after the + if there's no macro defined with the case as given.
By default {'} will substitute ' and {{} will substitute {, although macros ' and { may subsequently be redefined.
A further feature is that {W} inside a string, where W is a sequence of whitespace characters including one or more newlines, is ignored. This enables long strings to be written over a number of lines.
For example,
stringescapes {}
/* Spanish diacritics */
stringdef a' '{U+00E1}' // a-acute
stringdef e' '{U+00E9}' // e-acute
stringdef i' '{U+00ED}' // i-acute
stringdef o' '{U+00F3}' // o-acute
stringdef u' '{U+00FA}' // u-acute
stringdef u" '{U+00FC}' // u-diaeresis
stringdef n~ '{U+00F1}' // n-tilde
/* All the characters in Spanish used to represent vowels */
define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'
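The substitution rules above can be sketched in Python. This is only an illustrative model under stated assumptions, not the Snowball compiler's implementation: the function name expand, the DEFAULT_MACROS table and the dictionary of user macros are all invented for this example, the insert characters are assumed to be { and }, and a macro defined for a U+ code is simply looked up before the default codepoint meaning.

```python
import re

# Hypothetical model of insert-sequence expansion in a literal string,
# assuming the insert characters are { and }.
DEFAULT_MACROS = {"'": "'", "{": "{"}  # the two macros defined by default

def expand(body, macros=None):
    table = dict(DEFAULT_MACROS)
    table.update(macros or {})

    def substitute(match):
        name = match.group(1)
        if name in table:
            return table[name]             # {m}: user-defined or default macro
        if "\n" in name and name.strip() == "":
            return ""                      # {W}: whitespace with a newline is ignored
        if name.startswith("U+"):
            return chr(int(name[2:], 16))  # {U+XXXX}: a Unicode codepoint
        raise KeyError("undefined string macro: " + name)

    return re.sub(r"\{([^}]*)\}", substitute, body)

# Two of the Spanish macros from the example above
spanish = {"a'": expand("{U+00E1}"), "n~": expand("{U+00F1}")}
vowels = expand("aeiou{a'}", spanish)
```

Here vowels ends up holding the five plain vowels followed by a-acute, which mirrors how the grouping v is built from the stringdef macros.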
A routine definition has the form
define R as C
where R is the routine name and C is a command, or bracketed group of commands. So a routine is defined as a sequence of zero or more commands. Snowball routines do not (at present) take parameters. For example,
define Step_5b as ( // this defines Step_5b
['l'] // three commands here: [, 'l' and ]
R2 'l' // two commands, R2 and 'l'
delete // delete is one command
)
define R1 as $p1 <= cursor
/* R1 is defined as the single command "$p1 <= cursor" */
A routine is called simply by using its name, R, as a command.
The flow of control in Snowball is arranged by the implicit use of signals, rather than the explicit use of constructs like the if, else, break of C. The scheme is designed for handling strings, but is perhaps easier to introduce using integers. Suppose x, y, z ... are integers. The command
$x = 1
sets x to 1. The command
$x > 0
tests if x is greater than zero. Both commands give a signal t or f (true or false), but while the second command gives t if x is greater than zero and f otherwise, the first command always gives t. In Snowball, every command gives a t or f signal. A sequence of commands can be turned into a single command by putting them in a list surrounded by round brackets:
( C1 C2 C3 ... Ci Ci+1 ... )
When this is obeyed, Ci+1 will be obeyed if each of the preceding C1 ... Ci give t, but as soon as a Ci gives f, the subsequent Ci+1 Ci+2 ... are ignored, and the whole sequence gives signal f. If all the Ci give t, however, the bracketed command sequence also gives t. So,
$x > 0 $y = 1
sets y to 1 if x is greater than zero. If x is less than or equal to zero the sequence of two commands gives f.
If C1 and C2 are commands, we can build up the larger commands,
C1 or C2
Do C1. If it gives t ignore C2, otherwise do C2. The resulting signal is t if and only if C1 or C2 gave t.
C1 and C2
Do C1. If it gives f ignore C2, otherwise do C2. The resulting signal is t if and only if C1 and C2 both gave t.
not C
Do C. The resulting signal is t if C gave f, otherwise f.
try C
Do C. The resulting signal is t whatever the signal of C.
fail C
Do C. The resulting signal is f whatever the signal of C.
So for example,
($x > 0 $y = 1) or ($y = 0)
sets y to 1 if x is greater than zero, otherwise to zero.
try( ($x > 0) and ($z > 0) $y = 1)
sets y to 1 if both x and z are greater than 0, and gives t.
This last example is the same as
try($x > 0 $z > 0 $y = 1)
so that and seems unnecessary here. But we will see that and has a particular significance in string commands.
When a ‘monadic’ construct like not, try or fail is not followed by a round bracket, the construct applies to the shortest following valid command.
So for example
try not $x < 1 $z > 0
would mean
try ( not ( $x < 1 ) ) $z > 0
because $x < 1 is the shortest valid command following not, and then not $x < 1 is the shortest valid command following try.
The ‘dyadic’ constructs like and and or must sit in a bracketed list of commands anyway, for example,
( C1 C2 and C3 C4 or C5 )
And then in this case C2 and C3 are connected by the and; C4 and C5 are connected by the or. So
$x > 0 not $y > 0 or not $z > 0 $t > 0
means
$x > 0 ((not ($y > 0)) or (not ($z > 0))) $t > 0
The constructs and and or are equally binding, and bind from left to right, so C1 or C2 and C3 means (C1 or C2) and C3 etc.
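The signal scheme for integer commands can be modelled in a few lines of Python. This is a toy sketch, not how Snowball is implemented: every helper name here (assign, greater, seq, or_, and_, not_, try_, run) is invented for the illustration, and a command is represented as a function that takes a dictionary of integer variables and returns True for signal t or False for signal f.

```python
# Toy model: a command is a function from an environment of integer
# variables to True (signal t) or False (signal f).

def assign(name, value):
    def command(env):
        env[name] = value
        return True              # an assignment always gives t
    return command

def greater(name, value):
    return lambda env: env[name] > value

def seq(*commands):
    # ( C1 C2 ... ): all() stops at the first command that gives f
    return lambda env: all(c(env) for c in commands)

def or_(c1, c2):
    return lambda env: c1(env) or c2(env)    # C2 only runs if C1 gave f

def and_(c1, c2):
    return lambda env: c1(env) and c2(env)   # C2 only runs if C1 gave t

def not_(c):
    return lambda env: not c(env)

def try_(c):
    def command(env):
        c(env)
        return True              # t whatever the signal of C
    return command

def run(command, **variables):
    env = dict(variables)
    signal = command(env)
    return signal, env

# ($x > 0 $y = 1) or ($y = 0)
set_y = or_(seq(greater("x", 0), assign("y", 1)), assign("y", 0))

# try( ($x > 0) and ($z > 0) $y = 1)
both = try_(seq(and_(greater("x", 0), greater("z", 0)), assign("y", 1)))
```

Calling run(set_y, x=5, y=9) leaves y at 1, while run(set_y, x=-2, y=9) leaves it at 0; in both cases the overall signal is t because the second alternative is an assignment.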
There are two sorts of integer commands - assignments and comparisons. Both are built from Arithmetic Expressions (AEs).
An AE consists of integer names, literal numbers and a few other things connected by dyadic +, -, * and /, and monadic -, with the same binding powers and semantics as C. As well as integer names and literal numbers, the following may be used in AEs:
Name | Meaning
---|---
minint | the minimum negative number
maxint | the maximum positive number
cursor | the current value of the string cursor
limit | the current value of the string limit
size | the size of the string, in "slots"
sizeof s | the number of "slots" in s, where s is the name of a string or (since Snowball 2.1) a literal string
len | (new in Snowball 2.0) the length of the string, in Unicode characters
lenof s | (new in Snowball 2.0) the number of Unicode characters in s, where s is the name of a string or (since Snowball 2.1) a literal string
size and sizeof count in "slots"; see the "Character representation" section below for details.
The cursor and limit concepts are explained below.
An integer assignment has the form
$X assign_op AE
where X is an integer name and assign_op is one of the five assignments =, +=, -=, *=, or /=. The meanings are the same as in C.
For example,
$p1 = limit // set p1 to the string limit
Integer assignments always give the signal t.
An integer comparison has the form
$X rel_op AE
or (since Snowball 2.0):
$(AE1 rel_op AE2)
where X is an integer name and rel_op is one of the six tests ==, !=, >=, >, <=, or <. Again, the meanings are the same as in C.
Examples of integer comparisons are,
$p1 <= cursor // signal is f if the cursor is before position p1
$(len >= 3) // signal is f unless the string is at least 3 characters long
The second form is more general since an integer name is a valid AE, but it also allows comparisons which don't involve integer variables. Before support for this was added the second example could only be achieved by assigning len to a variable and then testing that variable instead.
If s is a string name, a string command has the form
$s C
where C is a command that operates on the string. Strings can be processed left-to-right or right-to-left, but we will describe only the left-to-right case for now. The string has a cursor, which we will denote by c, and a limit point, or limit, which we will denote by l. c advances towards l in the course of a string command, but the various constructs and, or, not etc have side-effects which keep moving it backwards. Initially c is at the start and l the end of the string. For example,
'a|n|i|m|a|d|v|e|r|s|i|o|n'
|                         |
c                         l
c, and l, mark the boundaries between characters, and not characters themselves. The characters between c and l will be denoted by c:l.
If C gives t, the cursor c will have a new, well-defined value. But if C gives f, c is undefined. Its later value will in fact be determined by the outer context of commands in which C came to be obeyed, not by C itself.
Here is a list of the commands that can be used to operate on strings.
= S
S is the name of a string or a literal string. c:l is set equal to S, and l is adjusted to point to the end of the copied string. The signal is t. For example,
$x = 'animadversion' /* literal string */
$y = x /* string name */
S
S is the name of a string or a literal string. If c:l begins with the substring S, c is repositioned to the end of this substring, and the signal is t. Otherwise the signal is f. For example,
$x 'anim' /* gives t, assuming the string is 'animadversion' */
$x ('anim' 'ad' 'vers') /* ditto */
$t = 'anim'
$x t /* ditto */
true, false
true is a dummy command that generates signal t. false generates signal f. They are sometimes useful for emphasis,
define start_off as true // nothing to do
define exception_list as false // put in among(...) list later
true is equivalent to ()
C1 or C2
This works as described for integers above, with the extra touch that if C1 gives f, c is set back to its old position after C1 has given f and before C2 is tried, so that the test takes place on the same point in the string. So we have
$x ('anim' /* signal t */
'ation' /* signal f */
) or
( 'an' /* signal t - from the beginning */
)
C1 and C2
Similarly, c is set back to its old position after C1 has given t and before C2 is tried. So,
$x 'anim' and 'an' /* signal t */
$x ('anim' 'an') /* signal f, since 'an' and 'ad' mis-match */
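The cursor handling of or and and on strings can be illustrated with a small Python model. It is a sketch only: the Text class and the helpers lit, seq, or_, and_ and run are names made up for this example and are not part of Snowball or its runtime, and only literal-string tests are modelled.

```python
class Text:
    """Hypothetical stand-in for a Snowball string with cursor c and limit l."""
    def __init__(self, value):
        self.value = value
        self.c = 0
        self.l = len(value)

def lit(s):
    def command(t):
        if t.value.startswith(s, t.c, t.l):
            t.c += len(s)        # c moves to the end of the matched substring
            return True
        return False
    return command

def seq(*commands):
    return lambda t: all(cmd(t) for cmd in commands)

def or_(c1, c2):
    def command(t):
        start = t.c
        if c1(t):
            return True
        t.c = start              # C1 gave f: C2 is tried from the same point
        return c2(t)
    return command

def and_(c1, c2):
    def command(t):
        start = t.c
        if not c1(t):
            return False
        t.c = start              # C1 gave t: C2 is also tested from the same point
        return c2(t)
    return command

def run(command, value="animadversion"):
    t = Text(value)
    return command(t), t.c
```

With this model, run(or_(seq(lit("anim"), lit("ation")), lit("an"))) gives signal t with c after 'an', and run(and_(lit("anim"), lit("an"))) does the same, whereas the plain sequence seq(lit("anim"), lit("an")) gives f.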
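not C
try C
These are like the integer versions, with the added feature that c is set back to its old position whenever an f signal from C is turned into t. So,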
$x (not 'animation' not 'immersion')
/* both tests are done at the start of the string */
$x (try 'animus' try 'an'
'imad')
/* - gives t */
try C | is equivalent to | C or true
test C
This does C but without advancing c. Its signal is the same as the signal of C, but following signal t, c is set back to its old value.
test C | is equivalent to | not not C
test C1 C2 | is equivalent to | C1 and C2
fail C
This does C and gives signal f. It is equivalent to C false. Like false it is useful, but only rarely.
do C
This does C, puts c back to its old value and gives signal t. It is very useful as a way of suppressing the side effect of f signals and cursor movement.
do C | is equivalent to | try test C
 | or | test try C
goto C
c is moved right until obeying C gives t. But if c cannot be moved right because it is at l the signal is f. c is set back to the position it had before the last obeying of C, so the effect is to leave c before the pattern which matched against C.
$x goto 'ad' /* positions c after 'anim' */
$x goto 'ax' /* signal f */
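gopast C
Like goto, but c is not set back, so the effect is to leave c after the pattern which matched against C.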
$x gopast 'ad' /* positions c after 'animad' */
repeat C
C is repeated until it gives f. When this happens c is set back to the position it had before the last repetition of C, and repeat C gives signal t. For example,
$x repeat gopast 'a' /* position c after the last 'a' */
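The difference between goto, gopast and repeat can be sketched in Python for the simple case where C is a literal string. The function names below are invented for this illustration, positions are plain integer offsets, None stands for signal f, and the pattern is assumed to be non-empty (an empty pattern would make the repeat loop below run forever).

```python
def goto(text, c, l, pattern):
    """Position just before the first match at or after c, or None for signal f."""
    while True:
        if text.startswith(pattern, c, l):
            return c             # c is left before the match
        if c >= l:
            return None          # c cannot move right: signal f
        c += 1

def gopast(text, c, l, pattern):
    """Like goto, but c is left after the match."""
    start = goto(text, c, l, pattern)
    return None if start is None else start + len(pattern)

def repeat_gopast(text, c, l, pattern):
    """repeat gopast: keep the last position reached before gopast gave f."""
    while True:
        following = gopast(text, c, l, pattern)
        if following is None:
            return c             # repeat itself always gives t
        c = following

x = "animadversion"
```

For x as above, goto(x, 0, len(x), "ad") returns 4 (c after 'anim'), gopast returns 6 (c after 'animad'), and repeat_gopast(x, 0, len(x), "a") returns 5, the position after the last 'a'.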
loop AE C
This is like C C ... C written out AE times, where AE is an arithmetic expression. For example,
$x loop 2 gopast ('a' or 'e' or 'i' or 'o' or 'u')
/* position c after the second vowel */
In C terms, loop AE C has the shape
{ int i;
  int limit = AE;
  for (i = 0; i < limit; i++) C;
}
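atleast AE C
This is equivalent to loop AE C repeat C.
hop AE
Moves c AE characters towards l. If AE is negative, or if there are fewer than AE characters between c and l, the signal is f. For example, test hop 3 checks that c:l contains at least 3 characters.
next
This is equivalent to hop 1.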
We have seen above that $x = y, when x and y are strings, sets c:l of x to the value of y. Conversely
$x => y
sets the value of y to the c:l region of x.
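A more delicate mechanism for pushing text around is to define a substring, or slice of the string being tested. Then
[
sets the left end of the slice to c,
]
sets the right end of the slice to c,
-> s
moves the slice to variable s,
<- S
replaces the slice with variable (or literal) S.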
For example
/* assume x holds 'animadversion' */
$x ( [ // '[animadversion' - [ set as indicated
loop 2 gopast 'a'
// '[anima|dversion' - c is marked by '|'
] // '[anima]dversion' - ] set as indicated
-> y // y is 'anima'
)
For any string, the slice ends should be assumed to be unset until they are set with the two commands [, ]. Thereafter the slice ends will retain the same values until altered.
delete
is equivalent to <- ''
This next example deletes all vowels in x,
define vowel as ('a' or 'e' or 'i' or 'o' or 'u')
/* ... */
$x repeat ( gopast([vowel]) delete )
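To see how the slice markers, gopast and delete interact in the vowel-deleting example, here is a rough Python equivalent. It is an assumption-laden sketch: delete_vowels and the variable names bra and ket are chosen for this illustration, and the model assumes that after the slice is replaced by the empty string the cursor sits where the slice used to be.

```python
VOWELS = "aeiou"

def delete_vowels(text):
    c = 0                                    # cursor
    while True:                              # repeat ( ... )
        # gopast([vowel]): scan right for a vowel; [ and ] bracket it
        bra = c
        while bra < len(text) and text[bra] not in VOWELS:
            bra += 1
        if bra == len(text):
            return text                      # gopast gave f, so repeat stops with t
        ket = bra + 1
        text = text[:bra] + text[ket:]       # delete, i.e. <- ''
        c = bra                              # c now sits where the slice was
```

For example, delete_vowels("animadversion") returns "nmdvrsn".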
As this example shows, the slice markers [ and ] often appear as pairs in a bracketed style, which makes for easy reading of the Snowball scripts. But it must be remembered that, unusually in a computer programming language, they are not true brackets.
More simply, text can be inserted at c.
insert S
Inserts variable or literal S before c, moving c to the right of the insert. <+ is a synonym for insert.
attach S
The same, but leaves c at the left of the insert.
The cursor, c, (and the limit, l) can be thought of as having a numeric value, from zero upwards:
| a | n | i | m | a | d | v | e | r | s | i | o | n |
0   1   2   3   4   5   6   7   8   9   10  11  12  13
It is these numeric values of c and l which are accessible through cursor and limit in arithmetic expressions.
setmark X
Sets X to the current value of c, where X is an integer variable. It's equivalent to: $X = cursor
tomark AE
Moves c forward to the position given by AE.
atmark AE
Tests if c is at position AE (t or f signal). It's equivalent to: $(cursor == AE)
In the case of tomark AE, a similar fail condition occurs as with hop AE. If c is already beyond AE, or if position l is before position AE, the signal is f.
In the stemming algorithms, certain regions of the word are defined by setting marks, and later the failure condition of tomark is used to see if c is inside a particular region.
Two other commands put c at l, and test if c is at l,
tolimit
Moves c forward to l.
atlimit
Tests if c is at l (t or f signal).
In this account of string commands we see c moving right towards l, while l stays fixed at the end. In fact l can be reset to a new position between c and its old position, to act as a shorter barrier for the movement of c.
setlimit C1 for C2
C1 is obeyed, and if it gives f the signal from setlimit is f with no further action. Otherwise, the final value of c becomes the new position of l. c is then set back to its old value before C1 was obeyed, and C2 is obeyed. Finally l is set back to its old position, and the signal of C2 becomes the signal of setlimit. So the signal is f if either C1 or C2 gives f, otherwise t.
For example,
For example,
$x ( setlimit goto 's'  // 'animadver}sion' new l as marked '}'
     for                // below, '|' marks c after each goto
     ( goto 'a' and     // '|animadver}sion'
       goto 'e' and     // 'animadv|er}sion'
       goto 'i'         // 'an|imadver}sion'
     )
   )
This checks that x has characters ‘a’, ‘e’ and ‘i’ before the first ‘s’.
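The way setlimit narrows the region that C2 can see may be easier to follow in a small Python sketch of this particular example. The helper names goto and has_a_e_i_before_s are hypothetical, positions are integer offsets, None stands for signal f, and only literal patterns are modelled.

```python
def goto(text, c, l, pattern):
    """Position just before the first match in c:l, or None for signal f."""
    while True:
        if text.startswith(pattern, c, l):
            return c
        if c >= l:
            return None
        c += 1

def has_a_e_i_before_s(text):
    c, l = 0, len(text)
    new_l = goto(text, c, l, "s")    # setlimit goto 's': C1 is obeyed first
    if new_l is None:
        return False                 # C1 gave f, so setlimit gives f
    # c is back at its old value; C2 is obeyed with the shorter limit
    for ch in "aei":
        if goto(text, c, new_l, ch) is None:   # 'and' restarts each goto from c
            return False
    return True                      # at this point l would be restored
```

has_a_e_i_before_s("animadversion") is true, since the first 's' is at position 9 and 'a', 'e' and 'i' all occur before it.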
String commands have been described with c to the left of l and moving right. But the process can be reversed.
backwards C
c and l are swapped over, and c moves left towards l. C is obeyed, the signal given by C becomes the signal of backwards C, and c and l are swapped back to their old values (except that l may have been adjusted because of deletions and insertions). C cannot contain another backwards command.
reverse C
A similar idea, but here c simply moves left instead of moving right, with the beginning of the string as the limit, l. C can contain other reverse commands, but it cannot contain commands to do deletions or insertions; it must be used for testing only. (Without this restriction Snowball's semantics would become very untidy.)
Forward and backward processing are entirely symmetric, except that forward processing is the default direction, and literal strings are always written out forwards, even when they are being tested backwards. So the following are equivalent,
$x (
'ani' 'mad' 'version' atlimit
)
$x backwards (
'version' 'mad' 'ani' atlimit
)
If a routine is defined for backwards mode processing, it must be included inside a backwardmode(...) declaration.
The use of substring and among is central to the implementation of the stemming algorithms. It is like a case switch on strings. In its simpler form,
substring among('S1' 'S2' 'S3' ...)
searches for the longest matching substring 'S1' or 'S2' or 'S3' ... from position c. (The 'Si' must all be different.) So this has the same semantics as
('S1' or 'S2' or 'S3' ...)
so long as the 'Si' are written out in decreasing order of length.
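The claim that among behaves like a chain of or tests only when the strings are in decreasing order of length can be checked with a short Python model. The functions among and or_chain below are illustrative stand-ins written for this comparison, not the code Snowball generates; each returns the matched string, or None for signal f.

```python
def among(text, c, l, strings):
    """Return the longest of strings that matches at c, or None for signal f."""
    matches = [s for s in strings if text.startswith(s, c, l)]
    return max(matches, key=len) if matches else None

def or_chain(text, c, l, strings):
    """('S1' or 'S2' or ...): the first alternative that matches wins."""
    for s in strings:
        if text.startswith(s, c, l):
            return s
    return None

x = "animadversion"
unordered = ["an", "anim", "a"]
by_length = sorted(unordered, key=len, reverse=True)
```

With the unordered list, among picks 'anim' but the or chain stops at 'an'; once the list is sorted by decreasing length the two agree.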
substring may be omitted, in which case it is attached to its following among, so
among(/*...*/)
without a preceding substring is equivalent to
(substring among(/*...*/))
substring may also be detached from its among, although it must precede it textually in the same routine in which the among appears.
The more general form of substring /* ... */ among is,
substring
C
among( 'S11' 'S12' ... (C1)
       'S21' 'S22' ... (C2)
       ...
       'Sn1' 'Sn2' ... (Cn)
     )
Obeying substring searches for a longest match among the 'Sij'. The signal from substring is t if a match is found, otherwise f. Any commands C between the substring and among will be run after this search and only if the search finds a match (it would be equivalent to remove C and replace each Ci with C Ci). When the among comes to be obeyed, the Ci corresponding to the matched 'Sij' is obeyed, and its signal becomes the signal of the among command.
substring/among pairs must match up textually inside each routine definition. But there is no problem with an among containing other substring/among pairs, and substring is optional before among anyway. The essential constraint is that two substrings must be separated by an among, and each substring must be followed by an among.
The effect of obeying among when the preceding substring is not obeyed is undefined. This would happen for example here,
try($x != 617 substring)
among(...) // 'substring' is bypassed in the exceptional case where x == 617
The significance of separating the substring from the among is to allow them to work in different contexts. For example,
setlimit tomark L for substring
among( 'S11' 'S12' ... (C1)
       ...
       'Sn1' 'Sn2' ... (Cn)
     )
Here the test for the longest 'Sij' is constrained to the region between c and the mark point given by integer L. But the commands Ci operate outside this limit. Another example is
reverse substring
among( 'S11' 'S12' ... (C1)
       ...
       'Sn1' 'Sn2' ... (Cn)
     )
The substring test is in the opposite direction in the string to the direction of the commands Ci.
The last (Cn) may be omitted, in which case (true) is assumed.
Each string 'Sij' may be optionally followed by a routine name,
among( 'S11' R11 'S12' R12 ... (C1)
       'S21' R21 'S22' R22 ... (C2)
       ...
       'Sn1' Rn1 'Sn2' Rn2 ... (Cn)
     )
If a routine name is not specified, it is equivalent to a routine which simply returns signal t,
define null as true
so we can imagine each 'Sij' having its associated routine Rij. Then obeying the among causes a search for the longest 'Sij' whose corresponding routine Rij gives t.
The routines Rij should be written without any side-effects, other than the inevitable cursor movement. (c is in any case set back to its old value following a call of Rij.)
set B and unset B set B to true and false respectively, where B is a boolean name. B as a command gives a signal t if it is set true, f otherwise. For example,
booleans ( Y_found ) // declare the boolean
/* ... */
unset Y_found // unset it
do ( ['y'] <-'Y' set Y_found )
/* if c:l begins 'y' replace it by 'Y' and set Y_found */
do repeat(goto (v ['y']) <-'Y' set Y_found)
/* repeatedly move down the string looking for v 'y' and
replacing 'y' with 'Y'. Whenever the replacement takes
place set Y_found. v is a test for a vowel, defined as
a grouping (see below). */
/* Y_found means there are some letters Y in the string.
Later we can use this to trigger a conversion back to
lower case y. */
/* ... */
do (Y_found repeat(goto (['Y']) <- 'y'))
A grouping brings characters together and enables them to be looked for with a single test.
If G is declared as a grouping, it can be defined by
define G G1 op G2 op G3 ...
where op is + or -, and G1, G2, G3 are literal strings, or groupings that have already been defined. (There can be zero or more of these additional op components). For example,
define capital_letter 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
define small_letter 'abcdefghijklmnopqrstuvwxyz'
define letter capital_letter + small_letter
define vowel 'aeiou' + 'AEIOU'
define consonant letter - vowel
define digit '0123456789'
define alphanumeric letter + digit
Once G is defined, it can be used as a command, and is equivalent to a test
'ch1' or 'ch2' or ...
where ch1, ch2 ... list all the characters in the grouping.
non G is the converse test, and matches any character except the characters of G. Note that non G is not the same as not G; in fact non G is equivalent to (not G next).
non may optionally be followed by a hyphen, for example:
non-vowel
non-digit
Bear in mind that non-vowel doesn't only match a consonant: it will match any character which isn't in the vowel grouping. Failing to consider this has led to bugs in stemming algorithms. For example, here we intended to undouble a consonant:
[non-vowel] -> ch
ch
delete
The problem with this code is it will also mangle numbers with repeated digits, for example 1900 would become 190. A good rule of thumb here seems to be to use an inclusive grouping check if the code goes on to delete the character matched:
[consonant] -> ch
ch
delete
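Groupings behave much like sets of characters, which makes the non-vowel pitfall easy to reproduce in Python. The sketch below is illustrative only: the names letter, vowel, consonant, is_non_vowel, is_consonant and undouble are made up for this example, and it assumes the undoubling code runs in backwards mode at the end of the word, which is what the 1900 to 190 example implies.

```python
import string

letter = set(string.ascii_letters)
vowel = set("aeiou") | set("AEIOU")      # '+' on groupings acts as set union
consonant = letter - vowel               # '-' acts as set difference

def is_non_vowel(ch):
    return ch not in vowel               # non-vowel: any character outside the grouping

def is_consonant(ch):
    return ch in consonant

def undouble(word, matches):
    # Assumed setting: backwards processing at the end of the word, i.e.
    #   [X] -> ch   ch   delete
    # where X is the test passed in as 'matches'.
    if len(word) >= 2 and matches(word[-1]) and word[-2] == word[-1]:
        return word[:-1]
    return word
```

Both tests undouble "hopp" to "hop", but only the non-vowel version turns "1900" into "190"; the inclusive consonant test leaves the number alone.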
A complete program consists of a sequence of declarations followed by a sequence of definitions of groupings and routines. Routines which are implicitly defined as operating on c:l from right to left must be included in a backwardmode(...) declaration.
A Snowball program is called up via a simple API through its defined externals. For example,
externals ( stem1 stem2 )
/* ... */
define stem1 as ( /* stem1 commands */ )
define stem2 as ( /* stem2 commands */ )
The API also allows a current string to be defined, and this becomes the c:l string for the external routine to work on. Its final value is the result handed back through the API.
The strings, integers and booleans are accessible from any point in the program, and exist throughout the running of the Snowball program. They are therefore like static declarations in C.
At a deeper level, a program is a sequence of tokens, interspersed with whitespace. Names, reserved words, literal numbers and strings are all tokens. Various symbols, made up of non-alphanumerics, are also tokens.
A name, reserved word or number is terminated by the first character that cannot form part of it. A symbol is recognised as the longest sequence of characters that forms a valid symbol. So +=- is two symbols, += and -, because += is a valid symbol in the language while +=- is not. Whitespace separates tokens but is otherwise ignored.
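The longest-match rule for symbols can be demonstrated with a few lines of Python. This is a simplified sketch: SYMBOLS is only an illustrative subset of the symbols that appear in this manual, and split_symbols is a hypothetical helper, not the Snowball tokeniser.

```python
# Illustrative subset of Snowball's symbols; the real compiler's table is larger.
SYMBOLS = {"+", "-", "=", "+=", "-=", "==", "=>", "->", "<-", "<+", "<", "<="}

def split_symbols(text):
    tokens = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then progressively shorter ones
        for n in range(len(text) - i, 0, -1):
            if text[i:i + n] in SYMBOLS:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError("no symbol starts at position %d" % i)
    return tokens
```

So split_symbols("+=-") yields the two symbols += and -, as described above.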
Occasionally a newer version of Snowball may add a new token. So as not to break existing programs, any such tokens declared as a name (via integers, routines, etc) will lose their token status for the rest of the program. This applies to the tokens len and lenof.
Anywhere that whitespace can occur, there may also occur:
(a) Comments, in the usual multi-line /* .... */ or single line // ... format.
(b) Get directives. These are like #include commands in C, and have the form get 'S', where 'S' is a literal string. For example,
get '/home/martin/snowball/main-hdr' // include the file contents
(c) stringescapes XY where X and Y are any two printing characters.
(d) stringdef m 'S' where m is a sequence of characters not including whitespace and terminated with whitespace, and 'S' is a literal string.
In this description of Snowball, it is assumed that strings are composed of characters, and that characters can be defined numerically, but the numeric range of these characters is not defined. As implemented, three different schemes are supported. Characters can either be (a) bytes in the range 0 to 255, as in traditional C strings, or (b) byte pairs in the range 0 to 65535, as in Java strings, or (c) UTF-8 encoded byte sequences in the range 0 to 65535, so that a character may occupy 1, 2 or 3 bytes.
For case (c), we need to make a slight separation of the concept of characters into symbols, the units of text being represented, and slots, the units of space into which they map. (So in case (a), all slots are one byte; in case (b) all slots are two bytes.)
c and l have numeric values that can be used in AEs (arithmetic expressions). These values count the number of slots. Similarly setmark, tomark and atmark are remembering and then using slot counts. size and sizeof measure string size in slots, not symbols. However, hop N moves c over N symbols, not N slots, and next is equivalent to hop 1.
Snowball 2.0 adds len and lenof, which measure string length in symbols (so they're the same as size and sizeof in cases (a) and (b), but different in case (c)).
So long as these simple distinctions are recognised, the same Snowball script can be compiled to work with any of the three encoding schemes.
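The distinction between symbols and slots can be made concrete with Python's own encoders. The sketch below is an illustration under assumptions, not Snowball code: sizes and hop are hypothetical helper names, latin-1 stands in for "an 8-bit character set", UTF-16 code units stand in for 16-bit wide characters, and None stands for signal f.

```python
def sizes(word):
    """Return (len, size) per encoding scheme: symbols versus slots."""
    symbols = len(word)                                   # len: Unicode characters
    return {
        "8-bit": (symbols, len(word.encode("latin-1"))),
        "16-bit": (symbols, len(word.encode("utf-16-le")) // 2),
        "utf-8": (symbols, len(word.encode("utf-8"))),
    }

def hop(buf, c, n):
    """Move c over n UTF-8 symbols in buf; None stands for signal f."""
    for _ in range(n):
        if c >= len(buf):
            return None
        c += 1
        while c < len(buf) and (buf[c] & 0xC0) == 0x80:   # skip continuation bytes
            c += 1
    return c

word = "ni\u00f1o"                  # four symbols; n-tilde needs two UTF-8 bytes
utf8 = word.encode("utf-8")
```

For this word the 8-bit and 16-bit schemes have size equal to len (4 and 4), while in UTF-8 len is 4 but size is 5, and hop(utf8, 0, 3) lands on slot 4 rather than slot 3.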
This section documents features of Snowball for which there's a strongly preferred alternative. They're still supported for compatibility with existing code which uses them, but you shouldn't use them in new code. We document them here so that their meaning in existing code can be understood, and especially to aid updating to the preferred alternatives.
In a stringdef, the string may be preceded by the word hex, or the word decimal. This was how non-ASCII characters were specified before support for specifying Unicode codepoints using the U+ notation was added.
hex and decimal mean that the contents of the string are interpreted as character values written out in hexadecimal, or decimal, notation. The characters should be separated by spaces. For example,
hex 'DA'        /* is character hex DA */
hex 'D A'       /* is the two characters, hex D and A (carriage return, and line feed) */
decimal '10'    /* character 10 (line feed) */
decimal '13 10' /* characters 13 and 10 (carriage return, and line feed) */
The following forms are equivalent,
hex 'd a' /* lower case also allowed */
hex '0D 000A' /* leading zeroes ignored */
hex ' D A ' /* extra spacing is harmless */
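The equivalences above follow from how the values are parsed, which can be imitated in one line of Python. The function legacy_string is a hypothetical helper for this illustration; it assumes the values are interpreted as Unicode codepoints, which the text says is the case when -utf8 or -widechars is specified.

```python
def legacy_string(body, base):
    """Decode the body of a legacy hex (base 16) or decimal (base 10) string."""
    # Assumption: values are treated as Unicode codepoints, as with -utf8.
    # split() ignores extra spacing; int() accepts either case and leading zeroes.
    return "".join(chr(int(token, base)) for token in body.split())
```

All of legacy_string("D A", 16), legacy_string("d a", 16), legacy_string("0D 000A", 16) and legacy_string("13 10", 10) produce carriage return followed by line feed.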
The interpretation of the values is as Unicode codepoints if command line option -utf8 or -widechars is specified, and as character values in an unspecified single byte character set otherwise. For ASCII and ISO-8859-1 the character values match Unicode codepoints, but to handle other single byte character sets (e.g. ISO-8859-2 or KOI8-R) you would need a special version of a Snowball source with different character values specified via stringdef. The U+ notation allows you to use a single Snowball source in this situation.
The among command supports a "starter" command, C in this example:
among( (C)
       'S11' 'S12' ... (C1)
       'S21' 'S22' ... (C2)
       ...
       'Sn1' 'Sn2' ... (Cn)
     )
This is equivalent to adding C at the start of each Ci:
among( 'S11' 'S12' ... (C C1)
       'S21' 'S22' ... (C C2)
       ...
       'Sn1' 'Sn2' ... (C Cn)
     )
However, both are equivalent to:
substring
C
among( 'S11' 'S12' ... (C1)
       'S21' 'S22' ... (C2)
       ...
       'Sn1' 'Sn2' ... (Cn)
     )
This requires an explicit substring but seems clearer, so we recommend using this in new code and have designated the use of a starter as a legacy feature.
A starter is also allowed with an explicit substring, for example:
substring
Cs
among( (Ca)
       'S11' 'S12' ... (C1)
       'S21' 'S22' ... (C2)
       ...
       'Sn1' 'Sn2' ... (Cn)
     )
is equivalent to:
substring
Cs Ca
among( 'S11' 'S12' ... (C1)
       'S21' 'S22' ... (C2)
       ...
       'Sn1' 'Sn2' ... (Cn)
     )
In the grammar which follows, || is used for alternatives, [X] means that X is optional, and [X]* means that X is repeated zero or more times. Meta-symbols are defined on the left. <char> means any character.
The definition of literal string does not allow for the escaping conventions established by the stringescapes directive. The command ? is a debugging aid.
<letter>        ::= a || b || ... || z || A || B || ... || Z
<digit>         ::= 0 || 1 || ... || 9
<name>          ::= <letter> [ <letter> || <digit> || _ ]*
<s_name>        ::= <name>
<i_name>        ::= <name>
<b_name>        ::= <name>
<r_name>        ::= <name>
<g_name>        ::= <name>
<literal string>::= '[<char>]*'
<number>        ::= <digit> [ <digit> ]*

S               ::= <s_name> || <literal string>
G               ::= <g_name> || <literal string>

<declaration>   ::= strings ( [<s_name>]* ) ||
                    integers ( [<i_name>]* ) ||
                    booleans ( [<b_name>]* ) ||
                    routines ( [<r_name>]* ) ||
                    externals ( [<r_name>]* ) ||
                    groupings ( [<g_name>]* )

<r_definition>  ::= define <r_name> as C
<plus_or_minus> ::= + || -
<g_definition>  ::= define <g_name> G [ <plus_or_minus> G ]*

AE              ::= (AE) ||
                    AE + AE || AE - AE || AE * AE || AE / AE || - AE ||
                    maxint || minint || cursor || limit ||
                    size || sizeof S ||
                    len || lenof S ||
                    <i_name> || <number>

<i_assign>      ::= $ <i_name> = AE ||
                    $ <i_name> += AE || $ <i_name> -= AE ||
                    $ <i_name> *= AE || $ <i_name> /= AE

<i_test_op>     ::= == || != || > || >= || < || <=

<i_test>        ::= $ ( AE <i_test_op> AE ) ||
                    $ <i_name> <i_test_op> AE

<s_command>     ::= $ <s_name> C

C               ::= ( [C]* ) ||
                    <i_assign> || <i_test> || <s_command> || C or C || C and C ||
                    not C || test C || try C || do C || fail C ||
                    goto C || gopast C || repeat C || loop AE C ||
                    atleast AE C || S || = S || insert S || attach S ||
                    <- S || delete || hop AE || next ||
                    => <s_name> || [ || ] || -> <s_name> ||
                    setmark <i_name> || tomark AE || atmark AE ||
                    tolimit || atlimit || setlimit C for C ||
                    backwards C || reverse C || substring ||
                    among ( [<literal string> [<r_name>] || (C)]* ) ||
                    set <b_name> || unset <b_name> || <b_name> ||
                    <r_name> || <g_name> || non [-] <g_name> ||
                    true || false || ?

P               ::= [P]* || <declaration> ||
                    <r_definition> || <g_definition> ||
                    backwardmode ( P )

<program>       ::= P

synonyms: <+ for insert