An a-suffix, or attached suffix, is a particle word attached to another word. (In the stemming literature they are sometimes referred to as ‘enclitics’.) In Italian, for example, personal pronouns attach to certain verb forms:
mandargli | = | mandare + gli | = | to send + to him
mandarglielo | = | mandare + gli + lo | = | to send + it + to him
a-suffixes appear in Italian and Spanish, and also in Portuguese, although in Portuguese they are separated from the preceding word by a hyphen, which makes them easy to eliminate.
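The Portuguese case can be sketched in a few lines of Python. This is a minimal illustration, not a real stemmer: the pronoun list `ENCLITICS` and the function name `strip_hyphenated_enclitics` are assumptions made for this example, and the list is deliberately incomplete.

```python
# Hypothetical, incomplete list of Portuguese clitic pronouns (illustration only).
ENCLITICS = {
    "me", "te", "se", "o", "a", "os", "as",
    "lo", "la", "los", "las", "lhe", "lhes", "nos", "vos",
}

def strip_hyphenated_enclitics(word):
    """Drop trailing hyphen-separated parts that are known pronouns."""
    parts = word.split("-")
    # Always keep the first part; remove pronouns from the right-hand end.
    while len(parts) > 1 and parts[-1].lower() in ENCLITICS:
        parts.pop()
    return "-".join(parts)

print(strip_hyphenated_enclitics("mandar-lhe"))
print(strip_hyphenated_enclitics("guarda-chuva"))
```

Because the hyphen marks the boundary, the sketch needs only a pronoun list. A hyphenated compound such as guarda-chuva is left alone because its last part is not in the list.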
An i-suffix, or inflectional suffix, forms part of the basic grammar of a language, and is applicable to all words of a certain grammatical type, with perhaps a small number of exceptions. In English, for example, the past tense of a verb is formed by adding ed. Certain modifications may be required in the stem:
fit + ed | → | fitted (double t)
love + ed | → | loved (drop the final e of love)
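The two stem modifications shown above can be sketched as a small Python function. This is a simplified illustration under stated assumptions: the function name `add_ed` is hypothetical, and the doubling test (consonant, vowel, consonant at the end of the word) ignores stress, so it would wrongly double the t of a verb such as visit.

```python
VOWELS = set("aeiou")

def add_ed(verb):
    """Attach 'ed' to a verb using two simplified spelling rules."""
    # Rule 1: a final e is dropped, so only 'd' is added (love -> loved).
    if verb.endswith("e"):
        return verb + "d"
    # Rule 2 (simplified assumption): double a final consonant that follows
    # a single vowel, as in fit -> fitted. The letters w, x and y are excluded.
    if (len(verb) >= 3
            and verb[-1] not in VOWELS and verb[-1] not in "wxy"
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ed"
    # Otherwise the suffix attaches without any change to the stem.
    return verb + "ed"

for verb in ("fit", "love", "walk"):
    print(verb, "->", add_ed(verb))
```

The point of the sketch is that an i-suffix is rule-governed: the same short procedure applies to any regular verb, with no word-by-word lookup.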
A d-suffix, or derivational suffix, enables a new word, often with a different grammatical category, or with a different sense, to be built from another word. Whether a d-suffix can be attached is discovered not from the rules of grammar, but by referring to a dictionary. So in English, ness can be added to certain adjectives to form corresponding nouns (littleness, kindness, foolishness ...) but not to all adjectives (not, for example, to big, cruel, wise ...). d-suffixes can be used to change meaning, often in rather exotic ways. So in Italian, astro means a sham form of something else:
medico + astro | = | medicastro | = | quack doctor
poeta + astro | = | poetastro | = | poetaster
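The contrast with i-suffixes can be made concrete with a short Python sketch. Whether ness attaches is decided by lookup, not by rule. The set `ATTESTED` is a hypothetical miniature dictionary built from the examples in the text above, and `derive_with_ness` is a name invented for this illustration. Spelling changes in the stem are not handled.

```python
# Hypothetical miniature dictionary of attested derived nouns (illustration only).
ATTESTED = {"littleness", "kindness", "foolishness"}

def derive_with_ness(adjective):
    """Return adjective + 'ness' if the dictionary attests it, else None."""
    candidate = adjective + "ness"
    # No grammatical rule decides this; only the dictionary lookup does.
    return candidate if candidate in ATTESTED else None

for adjective in ("kind", "big"):
    print(adjective, "->", derive_with_ness(adjective))
```

With this toy dictionary, kind yields kindness while big yields nothing, mirroring the examples given for English.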
Most European and many Asian languages belong to the Indo-European language group. Historically, it includes the Latin, Greek, Persian and Sanskrit of the ancient world, and with the rise of the European empires, languages of this group are now dominant in the Americas, Australia and large parts of Africa. Indo-European languages are therefore the main languages of modern Western culture, and they are all similarly amenable to stemming.
The Indo-European group has many recognisable sub-groups, for example Romance (Italian, French, Spanish ...), Slavonic (Russian, Polish, Czech ...), Celtic (Irish Gaelic, Scottish Gaelic, Welsh ...). The Germanic sub-group includes German and Dutch, and the Scandinavian languages are also usually classed as Germanic, although for convenience we have made a separate grouping of them on the Snowball site. English is also classed as Germanic, although it has been classed separately by us. This is not for reasons of narrow chauvinism, but because the suffix structure of English clearly lies mid-way between the Germanic and Romance groups, and it therefore requires separate treatment.
The Uralic languages are spoken mainly in northern Russia and Europe. They are divided into Samoyed, spoken mainly in the Siberian region, and Finno-Ugric, spoken mainly in Europe. Although the number of languages in the group is substantial, the total number of speakers is relatively small. The best known Uralic languages are perhaps Hungarian, Finnish and Estonian. Finnish and Estonian are in fact fairly similar. On the other hand, Hungarian and Finnish are as different as, say, French and Persian in the Indo-European group.
Like the Indo-European languages, the Uralic languages are amenable to stemming.