Snowball: A language for stemming algorithms

Summary

Algorithmic stemmers continue to have great utility in IR, despite the promise of out-performance by dictionary-based stemmers. Nevertheless, there are few algorithmic descriptions of stemmers, and even when they exist they are liable to misinterpretation. Here we look at the ideas underlying stemming, and on this website define a language, Snowball, in which stemmers can be exactly defined, and from which fast stemmer programs in ANSI C or Java can be generated. A range of stemmers is presented in parallel algorithmic and Snowball form, including the original Porter stemmer for English.

1 Introduction

There are two main reasons for creating Snowball. One is the lack of readily available stemming algorithms for languages other than English. The other is the consciousness of a certain failure on my part in promoting exact implementations of the stemming algorithm described in (Porter 1980), which has come to be called the Porter stemming algorithm. The first point needs some qualification: a great deal of work has been done on stemmers in a wide range of natural languages, both in their development and evaluation, (a complete bibliography cannot be attempted here). But it is rare to see a stemmer laid out in an unambiguous algorithmic form from which encodings in C, Java, Perl etc might easily be made. When exact descriptions are attempted, it is often with approaches to stemming that are relatively simple, for example the Latin stemmer of Schinke (Schinke 1996), or the Slovene stemmer of Popovic (Popovic 1990). A more complex, and therefore more characteristic stemmer is the Kraaij-Pohlmann stemmer for Dutch (Kraaij 1994), which is presented as open source code in ANSI C. To extract an algorithmic description of their stemmer from the source code proves to be quite hard.

The disparity between the Porter stemmer definition and many of its purported implementations is much wider than is generally realised in the IR community. Three problems seem to compound: one is a misunderstanding of the meaning of the original algorithm, another is bugs in the encodings, and a third is the almost irresistible urge of programmers to add improvements.

For example, a Perl script advertised on the Web as an implementation of the Porter algorithm was tested in October 2001, and it was found that 14 percent of words were stemmed incorrectly when given a large sample vocabulary. Most words of English have very simple endings, so this means that it was effectively getting everything wrong. At certain points on the Web are demonstrations of the Porter stemmer. You type some English into a box and the stemmed words are displayed. These are frequently faulty. (A good test is to type in agreement. It should stem to agreement — the same word. If it stems to agreem there is an error.) Researchers frequently pick up faulty versions of the stemmer and report that they have applied ‘Porter stemming’, with the result that their experiments are not quite repeatable. Researchers who work on stemming will sometimes give incorrect examples of the behaviour of the Porter stemmer in their published works.

To address all these problems I have tried to develop a rigorous system for defining stemming algorithms. A language, Snowball, has been invented, in which the rules of stemming algorithms can be expressed in a natural way. Snowball is quite small, and can be learned by an experienced programmer in an hour or so. On this website a number of foreign language stemmers is presented (a) in Snowball, and (b) in a less formal English-language description. (b) can be thought of as the program comments for (a). A Snowball compiler translates each Snowball definition into (c) an equivalent program in ANSI C or Java. Finally (d) standard vocabularies of words and their stemmed equivalents are provided for each stemmer. The combination of (a), (b), (c) and (d) can be used to pin down the definition of a stemmer exactly, and it is hoped that Snowball itself will be a useful resource in creating stemmers in the future.

2 Some ideas underlying stemming

Work in stemming has produced a number of different approaches, albeit tied together by a number of common assumptions. It is worthwhile looking at some of them to see exactly where Snowball fits into the whole picture.

A point tacitly assumed in almost all of the stemming literature is that stemmers are based upon the written, and not the spoken, form of the language. This is also the assumption here. Historically, grammarians often regarded the written language as the real language and the spoken as a mere derivative form. Almost in reaction, many modern linguists have taken a precisely opposite view (Palmer, 1965 pp 2-3). A more balanced position is that the two languages are distinct though connected, and require separate treatment. One can in fact imagine parallel stemming algorithms for the spoken language, or rather for the phoneme sequence into which the spoken language is transformed. Stress and intonation could be used as clues for an indexing process in the same way that punctuation and capitalisation are used as clues in the written language. But currently stemmers work on the written language for the good reason that there is so much of it available in machine readable form from which to build our IR systems. Inevitably therefore the stemmers get caught up in accidental details of orthography. In English, removing the ing from rotting should be followed by undoubling the tt, whereas in rolling we do not undouble the ll. In French, removing the er from ennuyer should be followed by changing the y to i, so that the resulting word conflates with ennui, and so on.

The idea of stemming is to improve IR performance generally by bringing under one heading variant forms of a word which share a common meaning. Harman (1991) was first to present compelling evidence that it may not do so, when her experiments discovered no significant improvement with the use of stemming. Similarly Lennon (1981) discovered no appreciable difference between different stemmers running on a constant collection. Later work has modified this position however. Krovetz (1995) found significant, although sometimes small, improvements across a range of test collections. What he did discover is that the degree of improvement varies considerably between different collections. These tests were however done on collections in English, and the reasonable assumption of IR researchers has always been that for languages that are more highly inflected than English (and nearly all are), greater improvements will be observed when stemming is applied. My own view is that stemming helps regularise the vocabulary of an IR system, and this leads to advantages that are not easily quantifiable through standard IR experiments. For example, it helps in presenting lists of terms associated with the query back to the IR user in a relevance feedback cycle, which is one of the underlying ideas of the probabilistic model. More will be said on the use of a stemmed vocabulary in section 5.

Stemming is not a concept applicable to all languages. It is not, for example, applicable in Chinese. But to languages of the Indo-European (*) group (and most of the stemmers on this site are for Indo-European languages), a common pattern of word structure does emerge. Assuming words are written left to right, the stem, or root of a word is on the left, and zero or more suffixes may be added on the right. If the root is modified by this process it will normally be at its right hand end. And also prefixes may be added on the left. So unhappiness has a prefix un, a suffix ness, and the y of happy has become i with the addition of the suffix. Usually, prefixes alter meaning radically, so they are best left in place (German and Dutch ge is an exception here). But suffixes can, in certain circumstances, be removed. So for example happy and happiness have closely related meanings, and we may wish to stem both forms to happy, or happi. Infixes can occur, although rarely: ge in German and Dutch, and zu in German.

One can make some distinction between root and stem. Lovins (1968) sees the root as the stem minus any prefixes. But here we will think of the stem as the residue of the stemming process, and the root as the inner word from which the stemmed word derives, so we think of root to some extent in an etymological way. It must be admitted that when you start thinking hard about these concepts root, stem, suffix, prefix ... they turn out to be very difficult indeed to define. Nor do definitions, even if we arrive at them, help us much. After all, suffix stripping is a practical aid in IR, not an exercise in linguistics or etymology. This is especially true of the central concept of root. We think of the etymological root of a word as something we can discover with certainty from a dictionary, forgetting that etymology itself is a subject with its own doubts and controversies (Jesperson 1922, Chapter XVI). Indeed, Jesperson goes so far as to say that

‘It is of course impossible to say how great a proportion of the etymologies given in dictionaries should strictly be classed under each of the following heads: (1) certain, (2) probable, (3) possible, (4) improbable, (5) impossible — but I am afraid the first two classes would be the least numerous.’

Here we will simply assume a common sense understanding of the basic idea of stem and suffix, and hope that this proves sufficient for designing and discussing stemming algorithms.

We can separate suffixes out into three basic classes, which will be called d-, i- and a-suffixes.

An a-suffix, or attached suffix, is a particle word attached to another word. (In the stemming literature they sometimes get referred to as ‘enclitics’.) In Italian, for example, personal pronouns attach to certain verb forms:

mandargli =		mandare + *gli*		=		to send + to him
mandarglielo =		mandare + *gli* + lo		=		to send + it + to him

a-suffixes appear in Italian and Spanish, and also in Portuguese, although in Portuguese they are separated by hyphen from the preceding word, which makes them easy to eliminate.

An i-suffix, or inflectional suffix, forms part of the basic grammar of a language, and is applicable to all words of a certain grammatical type, with perhaps a small number of exceptions. In English for example, the past of a verb is formed by adding ed. Certain modifications may be required in the stem:

fit + ed		→		fitted (double t)
love + ed		→		loved (drop the final e of love)

but otherwise the rule applies in a regular way to all verbs in contemporary English, with about 150 (Palmer, 1965) exceptional forms,

bear		beat		become		begin		bend		....
bore		beat		became		began		bent

A d-suffix, or derivational suffix, enables a new word, often with a different grammatical category, or with a different sense, to be built from another word. Whether a d-suffix can be attached is discovered not from the rules of grammar, but by referring to a dictionary. So in English, ness can be added to certain adjectives to form corresponding nouns (littleness, kindness, foolishness ...) but not to all adjectives (not for example, to big, cruel, wise ...) d-suffixes can be used to change meaning, often in rather exotic ways. So in Italian astro means a sham form of something else:

medico + *astro*		=		medicastro		=		quack doctor
poeta + *astro*		=		poetastro		=		poetaster

Generally i-suffixes follow d-suffixes. i-suffixes can precede d-suffixes, for example lovingly, devotedness, but such cases are exceptional. To be a little more precise, d-suffixes can sometimes be added to participles. devoted, used adjectivally, is a participle derived from the verb devote, and ly can be added to turn the adjective into an adverb, or ness to turn it into a noun. The same feature occurs in other Indo-European languages.

Sometimes it is hard to say whether a suffix is a d-suffix or i-suffix, the comparative and superlative endings er, est of English for example.

A d-suffix can serve more than one function. In English, for example, ly standardly turns an adjective into an adverb (greatly), but it can also turn a noun into an adjective (kingly). In French, ement also standardly turns an adjective into an adverb (grandement), but it can also turn a verb into a noun (rapprochement). (Referring to the French stemmer, this double use is ultimately why ement is tested for being in the RV rather than the R2 region of the word being stemmed.)

It is quite common for an i-suffix to serve more than one function. In English, s can either be (1) a verb ending attached to third person singular forms (runs, sings), (2) a noun ending indicating the plural (dogs, cats) or (3) a noun ending indicating the possessive (boy’s, girls’). By an orthographic convention now several hundred years old, the possessive is written with an apostrophe, but nowadays this is frequently omitted in familiar phrases (a girls school). (Usage (3) is relatively rare compared with (1) and (2): there are only nine uses of ’s in this document.)

Since the normal order of suffixes is d, i and a, we can expect them to be removed from the right in the order a, i and d. Usually we want to remove all a- and i-suffixes, and some of the d-suffixes.

If the stemming process reduces two words to the same stem, they are said to be conflated.

3 Stemming errors, and the use of dictionaries

One way of thinking of the relation between terms and documents in an IR system is to see the documents as being about concepts, and the terms as words that describe the concepts. Then, of course, one word can cover many concepts, so pound can mean a unit of currency, a weight, an enclosure, or a beating. Pound is a homonym. And one concept can be described by many words, as with money, capital, cash, currency. These words are synonyms. There is a many-many mapping therefore between the set of terms and the set of concepts. Stemming is a process that transforms this mapping to advantage, on the whole reducing the number of synonyms, but occasionally creating new homonyms. It is worth remembering that what are called stemming errors are usually just the introduction of new homonyms into vocabularies that already contain very large numbers of homonyms.

Words which have no place in this term-concept mapping are those which describe no concepts. The particle words of grammar, the, of, and ..., known in IR as stopwords, fall into this category. Stopwords can be useful for retrieval but only in searching for phrases, ‘to be or not to be’, ‘do as you would be done by’ etc. This suggests that stemming stopwords is not useful. More will be said on stopwords in section 7.

In the literature, a distinction is often made between under-stemming, which is the error of taking off too small a suffix, and over-stemming, which is the error of taking off too much. In French, for example, croûtons is the plural of croûton, ‘a crust’, so to remove ons would be over-stemming, while croulons is a verb form of crouler, ‘to totter’, so to remove s would be under-stemming. We would like to introduce a further distinction between mis-stemming and over-stemming. Mis-stemming is taking off what looks like an ending, but is really part of the stem. Over-stemming is taking off a true ending which results in the conflation of words of different meanings.

So for example ly can be removed from cheaply, but not from reply, because in reply ly is not a suffix. If it was removed, reply would conflate with rep, (the commonly used short form of representative). Here we have a case of mis-stemming.

To illustrate over-stemming, look at these four words,

	verb	adjective
First pair:	prove	provable
Second pair:	probe	probable

Morphologically, the two pairs are exactly parallel (in the written, if not the spoken language). They also have a common etymology. All four words derive from the Latin probare, ‘to prove or to test’, and the idea of testing connects the meanings of the words. But the meanings are not parallel. provable means ‘able to be proved’; probable does not mean ‘able to be probed’. Most people would judge conflation of the first pair as correct, and of the second pair, incorrect. In other words, to remove able from probable is a case of over-stemming.

We can try to avoid mis-stemming and over-stemming by using a dictionary. The dictionary can tell us that reply does not derive from rep, and that the meanings of probe and probable are well separated in modern English. It is important to realise however that a dictionary does not give a complete solution here, but can be a tool to improve the conflation process.

In Krovetz’s dictionary experiments (Krovetz 1995), he noted that in looking up a past participle like suited, one is led either to suit or to suite as plausible infinitive forms. suite can be rejected, however, because the dictionary tells us that although it is a word of English it is not a verb form. Cases like this (and Krovetz found about 60) had to be treated as exceptions. But the form routed could either derive from the verb rout or the verb route:

At Waterloo Napoleon’s forces were routed
The cars were routed off the motorway

Such cases in English are extremely rare, but they are commoner in more highly inflected languages. In French for example, affiliez can either be the verb affiler, to sharpen, with imperfect ending iez, or the verb affilier, to affiliate, with present indicative ending ez:

vous affiliez		=		vous affil-iez		=		you sharpened
vous affiliez		=		vous affili-ez		=		you affiliate

If the second is intended, removal of iez is mis-stemming.

With over-stemming we must rely upon the dictionary to separate meanings. There are different ways of doing this, but all involve some degree of reliance upon the lexicographers. Krovetz’s methods are no doubt best, because the most objective: he uses several measures, but they are based on the idea of measuring the similarity in meaning of two words by the degree of overlap among the words used to define them, and this is at a good remove from a lexicographer’s subjective judgement about semantic similarity.

There is an interesting difference between mis-stemming and over-stemming to do with language history. The morphology of a language changes less rapidly than the meanings of the words in it. When extended to include a few archaic endings, such as ick as an alternative to ic, a stemmer for contemporary English can be applied to the English of 300 years ago. Mis-stemmings will be roughly the same, but the pattern of over-stemming will be different because of the changing meaning of words in the language. For example, relativity in the 19th century merely meant ‘the condition of being relative to’. With that meaning, it is acceptable to conflate it with relative. But with the 20th century meaning brought to it by Einstein, stemming to relativ is over-stemming. Here we see the word with the suffix changing its meaning, but it can happen the other way round. transpire has come to mean ‘happen’, and its old meaning of ’exhalation’ or ‘breathing out’ is now effectively lost. (That is the bitter reality, although dictionaries still try to persuade us otherwise). But transpiration still carries the earlier meaning. So what was formerly an acceptable stemming may be judged now as an over-stemming, not because the word being stemmed has changed its meaning, but because some cognate word has changed its meaning.

In these examples we are presenting words as if they had single meanings, but the true picture is more complicated. Krovetz uses a model of word meanings which is extremely helpful here. He makes a distinction between homonyms and polysemes. The meaning of homonyms are quite unrelated. For example, ground in the sense of ‘earth’, and ‘ground’ as the past participle of ‘grind’ are homonyms. Etymologically homonyms have different stories, and they usually have separate entries in a dictionary. But each homonym form can have a range of polysemic forms, corresponding to different shades of meaning. So ground can mean the earth’s surface, or the bottom of the sea, or soil, or any base, and so the basis of an argument, and so on. Over time new polysemes appear and old ones die. At any moment, the use of a word will be common in some polysemic forms and rare in others. If a suffix is attached to a word the new word will get a different set of polysemes. For example, grounds = ground + s acquires the sense of ‘dregs’ and ‘estate lands’, loses the sense of ‘earth’, and shares the sense of ‘basis’.

Consider the conflation of mobility with mobile. mobile has acquired two new polysemes not shared with mobility. One is the ‘mobile art object’, common in the nursery. This arrived in the 1960s, and is still in use. The other is the ‘mobile phone’ which is now very dominant, although it may decline in the future when it has been replaced by some new gadget with a different name. We might draw a graph of the degree of separation of the meanings of mobility and mobile against time, which would depend upon the number of polysemes and the intensity of their use. What seemed like a valid conflation of the two words in 1940 may seem to be invalid today.

In general therefore one can say that judgements about whether words are over-stemmed change with time as the meanings of words in the language change.

The use of a dictionary should reduce errors of mis-stemming and errors of over-stemming. And, for English at least, the mis-stemming errors should reduce well, even if there are problems with over-stemming errors. Of course, it depends on the quality of the dictionary. A dictionary will need to be very comprehensive, fully up-to-date, and with good word definitions to achieve the best results.

Historically, stemmers have often been thought of as either dictionary-based or algorithmic. The presentation of studies of stemming in the literature has perhaps helped to create this division. In the Lovins’ stemmer the algorithmic description is central. In accounts of dictionary-based stemmers the emphasis tends to be on dictionary content and structure, and IR effectiveness. Savoy’s French stemmer (Savoy, 1993) is a good example of this. But the two approaches are not really distinct. An algorithmic stemmer can include long exception lists that are effectively mini-dictionaries, and a dictionary-based stemmer usually needs a process for removing at least i-suffixes to make the look-up in the dictionary possible. In fact in a language in which proper names are inflected (Latin, Finnish, Russian ...), a dictionary-based stemmer will need to remove i-suffixes independently of dictionary look-up, because the proper names will not of course be in the dictionary.

The stemmers available on the Snowball website are all purely algorithmic. They can be extended to include built-in exception lists, they could be used in combination with a full dictionary, but they are still presented here in their simplest possible form. Being purely algorithmic, they are, or ought to be, inferior to the performance of well-constructed dictionary-based stemmers. But they are still very useful, for the following reasons:

Algorithmic stemmers are (or can be made) very lean and very fast. The stemmers presented here generate code that will process about a million words in six seconds on a conventional 500MHz PC. Nowadays we can generate very large IR systems with quite modest resources, and tools that assist in this have value.
Despite the errors they can be seen to make, algorithmic stemmers still give good practical results. As Krovetz (1995) says in surprise of the algorithmic stemmer, ‘Why does it do so well?’ (page 89).
Dictionary-based stemmers require dictionary maintenance, to keep up with an ever-changing language, and this is actually quite a problem. It is not just that a dictionary created to assist stemming today will probably require major updating in a few years time, but that a dictionary in use for this purpose today may already be several years out of date.

We can hazard an answer to Krovetz’s question, as to why algorithmic stemmers perform as well as they do, when they reveal so many cases of under-, over- and mis-stemming. Under-stemming is a fault, but by itself it will not degrade the performance of an IR system. Because of under-stemming words may fail to conflate that ought to have conflated, but you are, in a sense, no worse off than you were before. Mis-stemming is more serious, but again mis-stemming does not really matter unless it leads to false conflations, and that frequently does not happen. For example, removing the ate ending in English, can result in useful conflations (luxury, luxuriate; affection, affectionate), but very often produces stems that are not English words (enerv-ate, accommod-ate, deliber-ate etc). In the literature, these are normally classed as stemming errors — overstemming — although in our nomenclature they are examples of mis-stemming. However these residual stems, enerv, accommod, deliber ... do not conflate with other word forms, and so behave in an IR system in the same way as if they still retained their ate ending. No false conflations arise, and so there is no over-stemming here.

To summarise, one can say that just as a word can be over-stemmed but not mis-stemmed (relativity → relative), so it can be mis-stemmed but not over-stemmed (enervate → enerv). And, of course, even over-stemming does not matter, if the over-stemmed word falsely conflates with other words that exist in the language, but are not encountered in the IR system which is being used.

Of the three types of error, over-stemming is the most important, and using a dictionary does not eliminate all over-stemmings, but does reduce their incidence.

4 Stemming as part of an indexing process

Stemming is part of a composite process of extracting words from text and turning them into index terms in an IR system. Because stemming is somewhat complex and specialised, it is usually studied in isolation. Even so, it cannot really be separated from other aspect of the indexing process:

What is a word? For indexing purposes, a word in a European language is approximately a sequence of letters bounded by non-letters, but some punctuation needs special consideration. For example, an internal apostrophe typically does not split a word.
What is a letter? Clearly letters define words, but different languages use different alphabets. Even only considering languages using the Latin alphabet there are different accented letters.

English speakers, perhaps influenced by the ASCII character set, typically regard their alphabet of a to z as the norm, and other forms (for example, Danish å and ø, or German ß) as somewhat abnormal. But this is an insular point of view. In Italian, for example, the letters j, k, w, x and y are not part of the alphabet, and are only seen in foreign words. We also tend to regard other alphabets as only used for isolated languages, and that is not strictly true. Cyrillic is used for a range of languages other than Russian, among which additional letters and accented forms abound.

In English, a broad definition of letter would be anything that could be accepted as a pronounceable element of a word. This would include accented Roman letters (naïve, Fauré), and certain ligature forms (encyclopædia). It would exclude letters of foreign alphabets, such as Greek and Cyrillic. The a to z alphabet is one of those where letters come in two styles, upper and lower case, which historically correspond (very roughly) to the shapes you get if you use a chisel or a pen. Across all languages, the exact relation of upper to lower case is not so easy to define. In Italian, for example, an accented lower case letter is sometimes represented in upper case by the unaccented letter followed by an apostrophe (this seems to be because Italian keywords have keys for accented lower case letters but typing accented upper case letters is more complicated). This is even sometimes seen in published books.
The Snowball stemmers expect that the input words have been converted to lower case.

(Incidentally, because the stemmers work on lower case words, some of the algorithms sometimes turn letters to upper case internally for flagging purposes.)
The algorithms for languages which use accented characters are written assuming that no additional accent stripping is done before or after stemming. If you strip accents you may cause conflation of unrelated words - for example, in French châtiment (punishment) and chat (cat) would produce the same stem if accents are stripped.
Identifying stopwords. Invariant stopwords are more easily found before stemming is applied, but inflecting stopwords (for example, German kein, keine, keinem, keinen ... ) may be easier to find after — because there are fewer forms. There is a case for building stopword identification into the stemming process. See section 7.
Conflating irregular forms. More will be said on this in section 6.

5 The use of stemmed words

The idea of how stemmed words might be employed in an IR system has evolved slightly over the years. The Lovins stemmer (Lovins 1968) was developed not for indexing document texts, but the subject terms attached to them. With queries stemmed in the same way, the user needed no special knowledge of the form of the subject terms. Rijsbergen (1979, Chapter 2) assumes document text analysis: stopwords are removed, the remaining words are stemmed, and the resulting set of stemmed word constitute the IR index (and this style of use is widespread today). More flexibility however is obtained by indexing all words in a text in an unstemmed form, and keeping a separate two-column relation which connects the words to their stemmed equivalents. The relation can be denoted by R(s, w), which means that s is the stemmed form of word w. From the relation we can get, for any word w, its unique stemmed form, stem(w), and for any stem s, the set of words, words(s), that stem to s.

The user should not have to see the stemmed form of a word. If a list of stems is to be presented back for query expansion, in place of a stem, s, the user should be shown a single representative from the set words(s), the one of highest frequency perhaps. The user should also be able to choose for the whole query, or at a lower level for each word in a query, whether or not it should be stemmed. In the absence of such choices, the system can make its own decisions. Perhaps single word queries would not undergo stemming; long queries would; stopwords would be removed except in phrases. In query expansion, the system would work with stemmed forms, ignoring stopwords.

Query expansion with stemming results in a much cleaner vocabulary list than without, and this is a main strength of using a stemming process.

A question arises: if the user never sees the stemmed form, does its appearance matter? The answer must be no, although the Porter stemmer tries to make the unstemmed forms guessable from the stemmed forms. For example, from appropri you can guess appropriate. At least, trying to achieve this effect acts as a useful control. Similarly with the other stemmers presented here, an attempt has been made to keep the appearance of the stemmed forms as familiar as possible.

6 Irregular grammatical forms

All languages contain irregularities, but to what extent should they be accommodated in a stemming algorithm? An English stemmer, for example, can convert regular plurals to singular form without difficulty (boys, girls, hands ...). Should it do the same with irregular plurals (men, children, feet, ...)? Here we have irregular cases with i-suffixes, but there are irregularities with d-suffixes, which Lovins calls ‘spelling exceptions’. absorb/absorption and conceive/conception are examples of this. Etymologically, the explanation of the first is that the Latin root, sorbere, is an irregular verb, and of the second that the word conceive comes to us from the French rather than straight from the Latin. It is interesting that, even with no knowledge of the etymology, we do recognise the connection between the words.

Lovins tries to solve spelling exceptions by formulating general respelling rules (turn rpt into rb for example), but it might be easier to have simply a list of exceptional stems.

The Porter stemmer does not handle irregularities at all, but from the author’s own experience, this has never been an area of complaint. Complaints in fact are always about false conflations, for example new and news.

Possibly Lovins was right in wanting to resolve d-suffix irregularities, and not being concerned about i-suffix irregularities. i-suffix irregularities in English go with short, old words, that are either in very common use (man/men, woman/women, see/saw ...) or are used only rarely (ox/oxen, louse/lice, forsake/forsook ...). The latter class can be ignored, and the former has its own problems which are not always solved by stemming. For example man is a verb, and saw can mean a cutting instrument, or, as a verb, can mean to use such an instrument. Conflation of these forms frequently leads to an error like mis-stemming therefore.

An algorithmic stemmer really needs holes where the irregular forms can be plugged in as necessary. This is more serviceable than attempting to embed special lists of these irregular forms into software.

7 Stopwords

We have suggested that stemming stopwords is not useful. There is a grammatical connection between being and be, but conflation of the two forms has little use in IR because they have no shared meaning that would entitle us to think of them as synonyms. being and be have a morphological connection as well, but that is not true of am and was, although they have a grammatical connection. Generally speaking, inflectional stopwords exhibit many irregularities, which means that stemming is not only not useful, but not possible, unless one builds into the stemmer tables of exceptions.

Switching from English to French, consider être, the equivalent form of be. It has about 40 different forms, including,

suis es sommes serez étaient fus furent sois été

Notice that several of these words have other meanings unrelated to "to be": suis is a form of the verb suivre ("to follow"), sommes is both a form of the verb sommer ("to sum up" or "to summon") and the plural of the noun somme ("sum"), été also means "summer". These are called homonyms, or more strictly speaking "homographs" as we are only considering the written form and two words which are spelled the same but pronounced differently are also a concern here. This is not unique to French - some stopword candidates have homographs in general, and is something that should be given consideration when deciding which words to treat as stopwords.

Passing all forms through a rule-based stemmer creates something of a mess. An alternative approach is to recognise this group of words, and other groups, and take special action. The recognition could take place inside the stemmer, or be done before the stemmer is called. One special action would be to stem (perhaps one should say ‘map’) all the forms to a standard form, ETRE, to indicate that they are parts of the verb être. Deciding what to do with the term ETRE, and it would probably be to discard it, would be done outside the stemming process. Another special action would be to recognize a whole class of stopwords and simply discard them.

The strategy adopted will depend upon the underlying IR model, so what one needs is the flexibility to create modified forms of a standard stemmer. Usually we present Snowball stemmers in their unadorned form. Thereafter, the addition of stopword tables is quite easy.

8 Rare forms

Stemmers do not need to handle linguistic forms that turn up only very rarely, but in practice it is hard to design a stemmer with all rare forms eliminated without there appearing to be some gaps in the thinking. For this reason one should not worry too much about their occasional presence. For example, in contemporary Portuguese, use of the second person plural form of verbs has almost completely disappeared. Even so, endings for those forms are included in the Portuguese stemmer. They appear in all the grammar books, and will in any case be found in older texts. The habit of putting in rare forms to ‘complete the picture’ is well established, and usually passes unnoticed. An example is the list of English stopwords in van Rijsbergen (1979). This includes yourselves, by analogy with himself, herself etc., although yourselves is actually quite a rare word in English.

References

Farber DJ, Griswold RE and Polonsky IP (1964) SNOBOL, a string manipulation language. Journal of the Association for Computing Machinery, 11: 21-30.

Griswold RE, Poage JF and Polonsky IP (1968) The SNOBOL4 programming language. Prentice-Hall, New Jersey.

Harman D (1991) How effective is suffixing? Journal of the American Society for Information Science, 42: 7-15.

Jesperson O (1921) Language, its nature, origin and development. George Allen & Unwin, London.

Kraaij W and Pohlmann R. (1994) Porter’s stemming algorithm for Dutch. In Noordman LGM and de Vroomen WAM, eds. Informatiewetenschap 1994: Wetenschappelijke bijdragen aan de derde STINFON Conferentie, Tilburg, 1994. pp. 167-180.

Kraaij W and Pohlmann R (1995) Evaluation of a Dutch stemming algorithm. Rowley J, ed. The New Review of Document and Text Management, volume 1, Taylor Graham, London, 1995. pp. 25-43,

Krovetz B (1995) Word sense disambiguation for large text databases. PhD Thesis. Department of Computer Science, University of Massachusetts Amherst.

Lennon M, Pierce DS, Tarry BD and Willett P (1981) An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3: 177-183.

Lovins JB (1968) Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22-31.

Palmer FR (1965) A linguistic study of the English verb. Longmans, London.

Popovic M and Willett P (1990) Processing of documents and queries in a Slovene language free text retrieval system. Literary and Linguistic Computing, 5: 182-190.

Porter MF (1980) An algorithm for suffix stripping. Program, 14: 130-137.

Rijsbergen CJ (1979) Information retrieval. Second edition. Butterworths, London.

Savoy J (1993) Stemming of French words based on grammatical categories. Journal of the American Society for Information Science, 44: 1-9.

Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52: 172-187.