The question occasionally arises of how far the English (or earlier Porter) stemming algorithm can be adapted to handle older forms of the English language.
Historically, English is usually divided into three periods of development: Old English, Middle English and Modern English.
Old English is so different from Modern English that it may be regarded as a distinct language.
Middle English is problematical for a number of reasons. There is no standard spelling in the original texts, and the grammatical differences between Middle and Modern English prevent the spelling from being simply ‘modernised’. It is however possible to normalise the spelling according to some modern scheme, but again there is no standard modern scheme. Middle English itself had great regional variations, so that for example the English of Chaucer and his contemporary the Gawain poet (both late 14th century) are strikingly different. Finally, grammar was fluid even for one writer, so Chaucer might use they love or they loven, he sitteth or he sit.
We may take Modern English to mean English which can be cast into a modern spelling form without too much damage being done to the original. From this point of view Shakespeare and the Authorised Version of the Bible are in Modern English. The ending structure of words in early Modern English differs from that of contemporary English in the est and eth endings of verbs in the present indicative (thou lovest, he loveth).
Both of these endings underwent rapid decline. The eth form occurs in Shakespeare, but is much rarer than the modern s form. The language of the Authorised Version, in which both forms abound, seemed archaic even on its first publication. Consequently the eth form survives now only in the language of the traditional Bible and Book of Common Prayer. The est form disappeared more slowly, as the use of thou became displaced by you in conversation.
To put the endings into the Porter stemmer, the rules
    (m>0) EED → EE
    (*v*) ED  →
    (*v*) ING →
should be extended to
    (m>0) EED → EE
    (*v*) ED  →
    (*v*) ING →
    (*v*) EST →
    (*v*) ETH →
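The extended rule set can be sketched in Python as follows. This is a minimal illustration and not the Porter stemmer itself: the names `step_1b_extended`, `measure` and `contains_vowel` are hypothetical, and the clean-up that Porter's algorithm applies after removing ED or ING (restoring a final e, undoubling consonants) is deliberately left out, so sitteth comes out as sitt rather than sit.

```python
VOWELS = "aeiou"


def is_vowel(word, i):
    """Porter's definition: a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)


def contains_vowel(stem):
    """The (*v*) condition: the stem contains at least one vowel."""
    return any(is_vowel(stem, i) for i in range(len(stem)))


def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        v = is_vowel(stem, i)
        if prev_vowel and not v:
            m += 1
        prev_vowel = v
    return m


# (suffix, replacement, condition on the remaining stem)
RULES = [
    ("eed", "ee", lambda stem: measure(stem) > 0),
    ("ed", "", contains_vowel),
    ("ing", "", contains_vowel),
    ("est", "", contains_vowel),  # added for early Modern English
    ("eth", "", contains_vowel),  # added for early Modern English
]


def step_1b_extended(word):
    """Apply the rule with the longest matching suffix, if its condition holds."""
    for suffix, replacement, condition in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if condition(stem) else word
    return word


print(step_1b_extended("loveth"))  # lov
print(step_1b_extended("lovest"))  # lov
print(step_1b_extended("feed"))    # feed (m = 0, so EED is left alone)
```

Because only the rule with the longest matching suffix is tried, a word such as feed is tested against EED and never falls through to ED.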
And to put the endings into the English stemmer, the list of endings that contains ing should be extended with est and eth.
As far as the Snowball scripts are concerned, the endings 'est' 'eth' must be added against ending 'ing'.
The inclusion of these endings does produce certain ‘side effects’. est is the ending of adjectival superlatives (greatest, unkindest), where it will also be removed. Words like brandreth, deforest will be mis-stemmed. Nevertheless, for the vocabulary of the Bible, the inclusion of these extra endings is not harmful (see this demonstration — for example, search for the text love in 1000 verses).
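One rough way to see these side effects on a given vocabulary is to strip endings with and without est and eth and list the words whose result changes. The sketch below is an assumption-laden simplification, not either stemmer: `strip_endings` and `side_effects` are hypothetical names, the vowel test ignores Porter's special treatment of y, and the EED rule and all later stemming steps are omitted.

```python
import re

# Simplified "contains a vowel" test (y is never treated as a vowel here).
VOWEL = re.compile(r"[aeiou]")

MODERN = ("ing", "ed")
ARCHAIC = MODERN + ("est", "eth")


def strip_endings(word, endings):
    """Remove the first matching ending if the remaining stem has a vowel."""
    for ending in endings:
        stem = word[: -len(ending)]
        if word.endswith(ending) and VOWEL.search(stem):
            return stem
    return word


def side_effects(words):
    """Map each word whose result changes once est/eth are added to its new form."""
    changed = {}
    for word in words:
        with_archaic = strip_endings(word, ARCHAIC)
        if with_archaic != strip_endings(word, MODERN):
            changed[word] = with_archaic
    return changed


print(side_effects(["greatest", "deforest", "brandreth", "best", "walking"]))
# {'greatest': 'great', 'deforest': 'defor', 'brandreth': 'brandr'}
```

Intended removals such as lovest and loveth are reported by the same check, so the output still has to be read by eye to separate wanted conflations from mis-stemmings.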