The work on this site can develop along two lines: Snowball itself (the language and compiler), and the stemmers written in Snowball. At the moment it is the latter that is the real area of interest.
Suggestions for improving the existing stemmers are useful, especially for the non-English ones. However, piecemeal improvement can be taken too far, and anyone making such suggestions should recognise the inevitable limits on the accuracy of algorithmic stemmers. But more importantly:
Stemming algorithms have a well-understood place in IR (Information Retrieval), and as language-specific tools in an IR system they have an extremely useful part to play. It is therefore something of a scandal that so few stemming algorithms are readily available. If you want to make a contribution to Snowball, the best thing you can do is to create a good quality stemmer for a new language. This must include an algorithmic description of the stemmer, an implementation in Snowball, and a representative language vocabulary of about 30,000 words that can be used as part of a standard test.
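To illustrate how a vocabulary might serve as a standard test, here is a minimal Python sketch that runs a stemmer over each vocabulary word and reports every word whose stem differs from the expected one. The names `find_mismatches` and `toy_stem`, and the four-word sample data, are hypothetical and chosen only for illustration. A real test would call the Snowball-generated stemmer and use the full vocabulary.

```python
from typing import Callable, List, Sequence, Tuple


def find_mismatches(
    stem: Callable[[str], str],
    vocabulary: Sequence[str],
    expected: Sequence[str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for each word that stems wrongly."""
    if len(vocabulary) != len(expected):
        raise ValueError("vocabulary and expected stems must have the same length")
    mismatches = []
    for word, want in zip(vocabulary, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def toy_stem(word: str) -> str:
    """Placeholder stemmer (hypothetical): strips one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Tiny stand-in for a real vocabulary and its expected stems.
vocabulary = ["cat", "cats", "glass", "horses"]
expected = ["cat", "cat", "glass", "horse"]

for word, want, got in find_mismatches(toy_stem, vocabulary, expected):
    print(f"{word}: expected {want!r}, got {got!r}")
```

With this sample data the toy stemmer wrongly reduces "glass" to "glas", which is the kind of regression such a test is meant to catch.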
Alternatively, you might come up with the algorithm and be able to provide representative texts from which to derive the vocabulary, but be hesitant about the Snowball implementation. If so, get in touch, and we might be able to complete the work collaboratively.
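One way a vocabulary might be derived from representative texts is sketched below in Python. The function name `build_vocabulary`, the tokenising regular expression, and the decision to keep the most frequent words are all assumptions made for illustration, not a procedure prescribed by the Snowball project. Languages with different word-boundary conventions would need a different tokeniser.

```python
import re
from collections import Counter
from typing import Iterable, List

# Runs of Unicode letters, optionally joined by internal apostrophes (an assumption).
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def build_vocabulary(texts: Iterable[str], limit: int = 30000) -> List[str]:
    """Return up to `limit` of the most frequent lower-cased words, sorted alphabetically."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    most_frequent = [word for word, _ in counts.most_common(limit)]
    return sorted(most_frequent)


sample_texts = ["The cat sat. The cats sat!", "A cat's whiskers."]
print(build_vocabulary(sample_texts))
```

Sorting the result alphabetically keeps related word forms next to each other, which makes it easier to inspect by eye how a stemmer groups them.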
We are also interested in:
It may seem like stating the obvious, but if you do hit a technical problem, please send in a full description of the system being used, the activity you were engaged in, and the errors that you encountered.
Finally, if you want to contribute to this site, you must be prepared to release your work under the BSD license (i.e. to make it free).
Martin Porter
Richard Boulton