Here is a sample of Early Modern English vocabulary, with the stemmed forms that will be generated by this algorithm:
| word | stem | word | stem | |||||||||
|
abhor abhord abhore abhorr abhorr'd abhorre abhorred abhorrest abhorreth abhorring abhors abhorson abia abiah abiathar abib abidan abide abides abideth abiding abiel abiezer abiezrites abig abigail abihail abihu abijah abijam abilities ability ability's abimael abimelech abimelech's abinadab abinoam abiram abishag |
⇒ |
abhor abhord abhor abhorr abhor abhorr abhor abhor abhor abhor abhor abhorson abia abiah abiathar abib abidan abid abid abid abid abiel abiez abiezrit abig abigail abihail abihu abijah abijam abil abil abil abimael abimelech abimelech abinadab abinoam abiram abishag |
o'ercame o'ercast o'ercharg'd o'ercharged o'ercome o'ercount o'erflow o'erflowing o'erflows o'ergone o'ergrown o'erhanging o'erheard o'erleap o'erlook o'erlook'd o'erlooked o'erpast o'erpowered o'erpressed o'erreach o'errule o'errun o'ershades o'ershot o'erspread o'ersway o'ersways o'erswell o'erta'en o'ertake o'erthrow o'erthrown o'ertook o'ertop o'erturn o'erweening o'erweigh o'erwhelm o'erwhelmed |
⇒ |
overcam overcast overcharg overcharg overcom overcount overflow overflow overflow overgon overgrown overhang overheard overleap overlook overlook overlook overpast overpow overpress overreach overrul overrun overshad overshot overspread oversway oversway overswel overta'en overtak overthrow overthrown overtook overtop overturn overween overweigh overwhelm overwhelm |
This algorithm is an adapted version of the Snowball English stemmer which aims to handle older forms of the English language.
We can divide the history of English into the following periods of development,
Late Modern English is best handled by the English (Porter2) stemmer.
Early Modern English has enough differences from Late Modern English that a seperate, specialised stemmer makes sense. It's implemented by adding some rules to the English (porter2) stemmer, but some of the extra rules have side-effects which make it hard to produce a single stemmer targetting both.
Middle English presents a number of challenges for a search implementer, particularly because there's much variation in spelling in the original texts. Middle English underwent significant change over time, and also had great regional variations, so that for example the English of Chaucer and his contemporary the Gawain poet (both late 14th century) are strikingly different. The grammatical differences between Middle and Modern English prevent the spelling from being simply ‘modernised’. It is however possible to normalise the spelling according to some modern scheme, but again there is no standard modern scheme. Finally, grammar was fluid even for one writer, so Chaucer might use they love or they loven, he sitteth or he sit.
The stemmer presented here primarily targets Early Modern English, but it may be of use with Middle English too. It will probably given better retrieval results than not using a stemmer (e.g. sitteth and sit both stem to sit) but many of the challenges remain.
Old English is so different from Modern English that it may be regarded as a distinct language - for example, Beowulf opens "Hƿæt ƿē Gārde/na ingēar dagum þēod cyninga / þrym ge frunon...".
We may take Modern English to mean English which can be cast into a modern spelling form without too much damage being done to the original. From this point of view Shakespeare and the Authorised Version of the Bible are in Modern English. The ending structure of words in early Modern English differ from contemporary English, the most common examples being the est and eth endings of verbs in the present indicative,
Both of these endings underwent rapid decline. The eth form occurs in Shakespeare, but is much rarer than the modern s form. The language of the Authorised Version, in which both forms abound, seemed archaic even on its first publication. Consequently the eth form survives now only in the language of the traditional Bible and Book of Common Prayer. The est form disappeared more slowly, as the use of thou became displaced by you in conversation.
The inclusion of eth and est does produce certain ‘side effects’. est is the ending of adjectival superlatives (greatest, unkindest), where it will also be removed. Words like lambeth and forest will be mis-stemmed. This seems an acceptable trade-off for Early Modern English where these suffixes are common, but for contemporary English we don't remove them (since stemming superlatives doesn't seem compelling enough to justify the problematic cases).
The letters æ, œ and ſ are normalised. The last two are not in iso-8859-1 so this stemmer is only supported for UTF-8 and wide character Unicode. It would be possible to offer a version without them if there's still demand for iso-8859-1 support.
The following are not (currently) handled:
This stemmer does not attempt to remove the suffix -en. This can occur in more situations in Early Modern English, but is difficult to remove without overstemming words which end -en where it is not a suffix.
Various prefixes and suffixes which include an apostrophe are removed, rewritten, or otherwise handled, but some are not because there's not a simple replacement which reliably undoes the elision (e.g. court'sy is courtesy but phant'sy is phantasy). We also don't currently attempt to eliminate elisions within the stem.
The printed letters u and v were somewhat interchangeable - often the latter was used as the start of a word (e.g. vpon for "upon") and the former otherwise (e.g. euer for "ever"). Similarly for i and j. While the stemmer could conceptually rewrite these letters using a heuristic based on the letters before and after, it seems in practice this isn't a problem we need to solve since Electronic versions of Early Modern English texts generally seem to have already been "corrected" for this.
We don't handle þ (thorn) specially. We're usually dealing with printed texts, and in Early Modern Englsh printing thorn was represented by the Latin Y.
We don't have special handling for the silent e often appended to words, but in many cases the existing rule for removing -e do the right thing (e.g. ſpeake -> speak; cowarde -> coward). The last consonant was sometimes doubled when an e was appended (e.g. manne (for man) and runne). These don't currently get undoubled.
There's more variation in spelling than in contemporary English, which the stemmer mostly doesn't attempt to solve. Some instances happen to get normalised by new or existing rules, and we've added a rule to replace -lie with -ly (for e.g. assemblie).
The algorithm is described in terms of additions to the English stemmer. The additions are marked in the Snowball source by comments such as:
// Additional step for Early Modern English.
Before doing anything else, replace any instances of the following characters:
Before establishing the regions R1 and R2, if the string starts with:
In step 1a, these additional suffixes are added to this list:
In step 1b, these additional suffixes are added to this list alongside ed and handle them in the same way:
In step 5, these additional suffixes are added to this list:
integers ( p1 p2 )
booleans ( Y_found )
routines (
prelude postlude
mark_regions
shortv
R1 R2
Step_0
Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5
exception1
)
externals ( stem )
groupings ( aeo v v_WXY valid_LI )
stringescapes {}
stringdef ae '{U+00E6}'
stringdef oe '{U+0153}'
stringdef long_s '{U+017F}'
define aeo 'aeo'
define v 'aeiouy'
define v_WXY v + 'wxY'
define valid_LI 'cdeghkmnrt'
define prelude as (
// Additional step for Early Modern English.
do repeat (
[substring] among(
'{ae}' (<- 'ae')
'{oe}' (<- 'oe')
'{long_s}' (<- 's')
'' (next)
)
)
unset Y_found
do ( ['{'}'] delete)
do ( ['y'] <-'Y' set Y_found)
do repeat(goto (v ['y']) <-'Y' set Y_found)
)
// Additional step for Early Modern English.
define Step_0 as (
[substring] among (
'o{'}er' // e.g. o'erwhelm -> overwhelm
(<- 'over')
'th{'}' // e.g. th'earth -> earth
't{'}' // e.g. t'assume -> assume
(not atlimit delete)
)
)
define mark_regions as (
$p1 = limit
$p2 = limit
do(
among (
'gener' // generate/general/generic/generous
'commun' // communication/communism/community
'arsen' // arsenic/arsenal
'past' // past/paste
'univers' // universe/universal/university
'later' // lateral/later
'emerg' // emerge/emergency
'organ' // organ/organic/organize
'inter' // intern/internal/international/internment; interfere; interval
// ... extensions possible here ...
) or (gopast v gopast non-v)
setmark p1
gopast v gopast non-v setmark p2
)
)
backwardmode (
define shortv as (
( non-v_WXY v non-v )
or
( non-v v atlimit )
or
( 'past' ) // pasted/pasting
)
define R1 as $p1 <= cursor
define R2 as $p2 <= cursor
define Step_1a as (
try (
[substring] among (
'{'}' '{'}s' '{'}s{'}'
(delete)
)
)
[substring] among (
'sses' (<-'ss')
'ied' 'ies'
((hop 2 <-'i') or <-'ie')
's' (next gopast v delete)
'us' 'ss'
()
// Additions for Early Modern English.
'e{'}er' // e.g. whoe'er -> whoever
(<- 'ever')
'lie' // e.g. assemblie -> assembly
(R1 <- 'ly')
)
)
define Step_1b as (
[substring] among (
'eed' 'eedly'
(
do (
R1
among (
'proc' 'exc' 'succ'
(atlimit)
) or (
<-'ee'
)
)
)
'ed' 'edly' 'ingly'
(false) // Handled below.
// Additions for Early Modern English.
'{'}d' // e.g. lov'd -> lov (-> love)
'{'}dly' // e.g. favour'dly -> favour
'{'}dst' // e.g. call'dst -> call
'{'}st' // e.g. know'st -> know
'{'}t' // e.g. advanc't -> advanc
'edst' // e.g. commandedst -> command
'est' // e.g. knowest -> know (but e.g. forest -> for (-> fore))
'eth' // e.g. knoweth -> know (but e.g. lambeth -> lamb)
(false) // Handled below.
'ing'
( // Handle exceptional cases here, rest handled below.
among (
// dying->die, lying->die, tying->tie, vying->vie
'y'
(test(non-v atlimit) ] <-'ie')
// Leave inning, outing, etc alone.
'inn' 'out' 'cann' 'herr' 'earr' 'even'
(atlimit)
)
)
'' ()
) or (
// Handle 'ed' 'edly' 'ing' 'ingly'
test gopast v delete
[] test (
substring among(
'at' 'bl' 'iz'
(fail(<- 'e'))
'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
// ignoring double c, h, j, k, q, v, w, and x
(not (aeo atlimit))
'' (fail(atmark p1 test shortv <- 'e'))
)
)
[next] delete
)
)
define Step_1c as (
['y' or 'Y']
non-v not atlimit
<-'i'
)
define Step_2 as (
[substring] R1 among (
'tional' (<-'tion')
'enci' (<-'ence')
'anci' (<-'ance')
'abli' (<-'able')
'entli' (<-'ent')
'izer' 'ization'
(<-'ize')
'ational' 'ation' 'ator'
(<-'ate')
'alism' 'aliti' 'alli'
(<-'al')
'fulness' (<-'ful')
'ousli' 'ousness'
(<-'ous')
'iveness' 'iviti'
(<-'ive')
'biliti' 'bli'
(<-'ble')
'ogist' (<-'og')
'ogi' ('l' <-'og')
'fulli' (<-'ful')
'lessli' (<-'less')
'li' (valid_LI delete)
)
)
define Step_3 as (
[substring] R1 among (
'tional' (<-'tion')
'ational' (<-'ate')
'alize' (<-'al')
'icate' 'iciti' 'ical'
(<-'ic')
'ful' 'ness'
(delete)
'ative'
(R2 delete)
)
)
define Step_4 as (
[substring] R2 among (
'al' 'ance' 'ence' 'er' 'ic' 'able' 'ible' 'ant' 'ement'
'ment' 'ent' 'ism' 'ate' 'iti' 'ous' 'ive' 'ize'
(delete)
'ion' ('s' or 't' delete)
)
)
define Step_5 as (
[substring] among (
'e' (R2 or (R1 not shortv) delete)
'l' (R2 'l' delete)
// Additions for Early Modern English.
'{'}n' // e.g. heav'n -> heaven
'{'}nd' // e.g. quick'nd -> quicken
(<- 'en')
'{'}r' // e.g. (rememb'red -> rememb'r) -> remember
'{'}rous' // e.g. murd'rous -> murder
(<- 'er')
'{'}ri' // e.g. (wat'ry -> wat'ri) -> wateri
(<- 'eri')
'{'}li' // e.g. (happ'ly -> happ'li) -> happili
(<- 'ili')
)
)
)
define exception1 as (
[substring] atlimit among(
/* special changes: */
'skis' (<-'ski')
'skies' (<-'sky')
/* special -LY cases */
'idly' (<-'idl')
'gently' (<-'gentl')
'ugly' (<-'ugli')
'early' (<-'earli')
'only' (<-'onli')
'singly' (<-'singl')
// ... extensions possible here ...
/* invariant forms: */
'sky'
'news'
'howe'
'atlas' 'cosmos' 'bias' 'andes' // not plural forms
// ... extensions possible here ...
)
)
define postlude as (Y_found repeat(goto (['Y']) <-'y'))
define stem as (
exception1 or
not hop 3 or (
do prelude
do Step_0
do mark_regions
backwards (
do Step_1a
do Step_1b
do Step_1c
do Step_2
do Step_3
do Step_4
do Step_5
)
do postlude
)
)