Stemming algorithms

Stemming for various European languages

We present stemming algorithms (with implementations in Snowball) for the following languages:

There are two English stemmers: the original Porter stemmer, and an improved stemmer known as Porter2. Read the accounts of them to learn a bit more about using Snowball.

Each formal algorithm should be compared with the corresponding Snowball program.

Surprisingly, among the Indo-European languages (*), the French stemmer turns out to be the most complicated, whereas the Russian stemmer, despite its large number of suffixes, is very simple. In fact it is interesting that English, with its minimal use of i-suffixes, has such a complex stemmer. This is partly due to the delicate nature of i-suffix removal (for example, undoubling the p after removing ing from hopping), and partly to the wealth of forms of d-suffixes, deriving as they do from the mixed Romance and Germanic ancestry of the language.

Note that by i-suffix we mean inflexional suffix, and by d-suffix, derivational suffix (*).
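
The undoubling step mentioned above can be illustrated with a small sketch. This is a toy written for this page, not the Porter or Porter2 algorithm and not Snowball code: the function name strip_ing, the list of doubled consonants, and the "stem must contain a vowel" guard are all assumptions made for the example. It handles only the single i-suffix ing.

```python
# Toy illustration only: NOT the Porter or Porter2 algorithm.
# The names and rules below are assumptions made for this sketch.

VOWELS = "aeiouy"

# Doubled consonants that are reduced to a single letter after "ing"
# is removed (an assumed list; "ll", "ss" and "zz" are left alone).
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def strip_ing(word: str) -> str:
    """Remove a trailing 'ing' and undouble a final consonant pair."""
    if not word.endswith("ing"):
        return word
    stem = word[:-3]
    # Keep words such as "sing" or "bring" intact: what would remain
    # after removal contains no vowel, so "ing" is not a suffix here.
    if not any(ch in VOWELS for ch in stem):
        return word
    # "hopp" -> "hop": drop one letter of a doubled final consonant.
    if stem.endswith(DOUBLES):
        stem = stem[:-1]
    return stem


if __name__ == "__main__":
    for w in ("hopping", "running", "falling", "sing"):
        print(w, "->", strip_ing(w))
```

Even this one suffix needs two extra rules beyond plain removal, which gives a sense of why i-suffix removal in English is delicate. A real stemmer has to deal with many further cases that this sketch ignores.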

Other Stemming Algorithms

We also provide Snowball implementations of some algorithms developed by other parties: