Early Modern English stemming algorithm

Links to resources

Here is a sample of Early Modern English vocabulary, with the stemmed forms that will be generated by this algorithm:

word stem          word stem
abhor
abhord
abhore
abhorr
abhorr'd
abhorre
abhorred
abhorrest
abhorreth
abhorring
abhors
abhorson
abia
abiah
abiathar
abib
abidan
abide
abides
abideth
abiding
abiel
abiezer
abiezrites
abig
abigail
abihail
abihu
abijah
abijam
abilities
ability
ability's
abimael
abimelech
abimelech's
abinadab
abinoam
abiram
abishag
abhor
abhord
abhor
abhorr
abhor
abhorr
abhor
abhor
abhor
abhor
abhor
abhorson
abia
abiah
abiathar
abib
abidan
abid
abid
abid
abid
abiel
abiez
abiezrit
abig
abigail
abihail
abihu
abijah
abijam
abil
abil
abil
abimael
abimelech
abimelech
abinadab
abinoam
abiram
abishag
o'ercame
o'ercast
o'ercharg'd
o'ercharged
o'ercome
o'ercount
o'erflow
o'erflowing
o'erflows
o'ergone
o'ergrown
o'erhanging
o'erheard
o'erleap
o'erlook
o'erlook'd
o'erlooked
o'erpast
o'erpowered
o'erpressed
o'erreach
o'errule
o'errun
o'ershades
o'ershot
o'erspread
o'ersway
o'ersways
o'erswell
o'erta'en
o'ertake
o'erthrow
o'erthrown
o'ertook
o'ertop
o'erturn
o'erweening
o'erweigh
o'erwhelm
o'erwhelmed
overcam
overcast
overcharg
overcharg
overcom
overcount
overflow
overflow
overflow
overgon
overgrown
overhang
overheard
overleap
overlook
overlook
overlook
overpast
overpow
overpress
overreach
overrul
overrun
overshad
overshot
overspread
oversway
oversway
overswel
overta'en
overtak
overthrow
overthrown
overtook
overtop
overturn
overween
overweigh
overwhelm
overwhelm

This algorithm is an adapted version of the Snowball English stemmer which aims to handle older forms of the English language.

We can divide the history of English into the following periods of development,

  1. ca. 1660-today: Late Modern English, the language of Dickens and the present day,
  2. ca. 1485-ca. 1660: Early Modern English, the language of Shakespeare,
  3. ca. 1066-ca. 1485: Middle English, the language of Chaucer,
  4. ca. 450-ca. 1250: Old English (or Anglo-Saxon), the language of Beowulf.

Late Modern English is best handled by the English (Porter2) stemmer.

Early Modern English has enough differences from Late Modern English that a seperate, specialised stemmer makes sense. It's implemented by adding some rules to the English (porter2) stemmer, but some of the extra rules have side-effects which make it hard to produce a single stemmer targetting both.

Middle English presents a number of challenges for a search implementer, particularly because there's much variation in spelling in the original texts. Middle English underwent significant change over time, and also had great regional variations, so that for example the English of Chaucer and his contemporary the Gawain poet (both late 14th century) are strikingly different. The grammatical differences between Middle and Modern English prevent the spelling from being simply ‘modernised’. It is however possible to normalise the spelling according to some modern scheme, but again there is no standard modern scheme. Finally, grammar was fluid even for one writer, so Chaucer might use they love or they loven, he sitteth or he sit.

The stemmer presented here primarily targets Early Modern English, but it may be of use with Middle English too. It will probably given better retrieval results than not using a stemmer (e.g. sitteth and sit both stem to sit) but many of the challenges remain.

Old English is so different from Modern English that it may be regarded as a distinct language - for example, Beowulf opens "Hƿæt ƿē Gārde/na ingēar dagum þēod cyninga / þrym ge frunon...".

Design Notes

We may take Modern English to mean English which can be cast into a modern spelling form without too much damage being done to the original. From this point of view Shakespeare and the Authorised Version of the Bible are in Modern English. The ending structure of words in early Modern English differ from contemporary English, the most common examples being the est and eth endings of verbs in the present indicative,

I bring
thou bringest
he bringeth
we bring
you bring
they bring

Both of these endings underwent rapid decline. The eth form occurs in Shakespeare, but is much rarer than the modern s form. The language of the Authorised Version, in which both forms abound, seemed archaic even on its first publication. Consequently the eth form survives now only in the language of the traditional Bible and Book of Common Prayer. The est form disappeared more slowly, as the use of thou became displaced by you in conversation.

The inclusion of eth and est does produce certain ‘side effects’. est is the ending of adjectival superlatives (greatest, unkindest), where it will also be removed. Words like lambeth and forest will be mis-stemmed. This seems an acceptable trade-off for Early Modern English where these suffixes are common, but for contemporary English we don't remove them (since stemming superlatives doesn't seem compelling enough to justify the problematic cases).

The letters æ, œ and ſ are normalised. The last two are not in iso-8859-1 so this stemmer is only supported for UTF-8 and wide character Unicode. It would be possible to offer a version without them if there's still demand for iso-8859-1 support.

The following are not (currently) handled:

  • This stemmer does not attempt to remove the suffix -en. This can occur in more situations in Early Modern English, but is difficult to remove without overstemming words which end -en where it is not a suffix.

  • Various prefixes and suffixes which include an apostrophe are removed, rewritten, or otherwise handled, but some are not because there's not a simple replacement which reliably undoes the elision (e.g. court'sy is courtesy but phant'sy is phantasy). We also don't currently attempt to eliminate elisions within the stem.

  • The printed letters u and v were somewhat interchangeable - often the latter was used as the start of a word (e.g. vpon for "upon") and the former otherwise (e.g. euer for "ever"). Similarly for i and j. While the stemmer could conceptually rewrite these letters using a heuristic based on the letters before and after, it seems in practice this isn't a problem we need to solve since Electronic versions of Early Modern English texts generally seem to have already been "corrected" for this.

  • We don't handle þ (thorn) specially. We're usually dealing with printed texts, and in Early Modern Englsh printing thorn was represented by the Latin Y.

  • We don't have special handling for the silent e often appended to words, but in many cases the existing rule for removing -e do the right thing (e.g. ſpeake -> speak; cowarde -> coward). The last consonant was sometimes doubled when an e was appended (e.g. manne (for man) and runne). These don't currently get undoubled.

  • There's more variation in spelling than in contemporary English, which the stemmer mostly doesn't attempt to solve. Some instances happen to get normalised by new or existing rules, and we've added a rule to replace -lie with -ly (for e.g. assemblie).

History of functional changes to the algorithm

  • (2006-04-24): Martin Porter added a page describing variants of the Porter and English stemmers which also removed -eth and -est.
  • Snowball 3.2.0: Algorithm improved and added to the distribution (based on the English stemmer).

The stemming algorithm

The algorithm is described in terms of additions to the English stemmer. The additions are marked in the Snowball source by comments such as:

// Additional step for Early Modern English.

Before doing anything else, replace any instances of the following characters:

æ
replace with ae
œ
replace with oe
ſ
replace with s

Before establishing the regions R1 and R2, if the string starts with:

o'er
replace with over
th'
remove if any characters follow
t'
remove if any characters follow

In step 1a, these additional suffixes are added to this list:

e'er
replace with ever
lie
replace with ly if in R1

In step 1b, these additional suffixes are added to this list alongside ed and handle them in the same way:

'd
'dly
'dst
'st
't
edst
est
eth

In step 5, these additional suffixes are added to this list:

'n
'nd
replace with en
'r
'rous
replace with er
'ri
replace with eri
'li
replace with ili

The full algorithm in Snowball

integers ( p1 p2 )
booleans ( Y_found )

routines (
    prelude postlude
    mark_regions
    shortv
    R1 R2
    Step_0
    Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5
    exception1
)

externals ( stem )

groupings ( aeo v v_WXY valid_LI )

stringescapes {}

stringdef ae     '{U+00E6}'
stringdef oe     '{U+0153}'
stringdef long_s '{U+017F}'

define aeo      'aeo'
define v        'aeiouy'
define v_WXY    v + 'wxY'

define valid_LI 'cdeghkmnrt'

define prelude as (
    // Additional step for Early Modern English.
    do repeat (
        [substring] among(
            '{ae}'     (<- 'ae')
            '{oe}'     (<- 'oe')
            '{long_s}' (<- 's')
            ''         (next)
        )
    )

    unset Y_found
    do ( ['{'}'] delete)
    do ( ['y'] <-'Y' set Y_found)
    do repeat(goto (v ['y']) <-'Y' set Y_found)
)

// Additional step for Early Modern English.
define Step_0 as (
    [substring] among (
        'o{'}er' // e.g. o'erwhelm -> overwhelm
            (<- 'over')
        'th{'}'  // e.g. th'earth -> earth
        't{'}'   // e.g. t'assume -> assume
            (not atlimit delete)
    )
)

define mark_regions as (
    $p1 = limit
    $p2 = limit
    do(
        among (
            'gener'   // generate/general/generic/generous
            'commun'  // communication/communism/community
            'arsen'   // arsenic/arsenal
            'past'    // past/paste
            'univers' // universe/universal/university
            'later'   // lateral/later
            'emerg'   // emerge/emergency
            'organ'   // organ/organic/organize
            'inter'   // intern/internal/international/internment; interfere; interval
            // ... extensions possible here ...
        ) or (gopast v  gopast non-v)
        setmark p1
        gopast v  gopast non-v  setmark p2
    )
)

backwardmode (

    define shortv as (
        ( non-v_WXY v non-v )
        or
        ( non-v v atlimit )
        or
        ( 'past' ) // pasted/pasting
    )

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define Step_1a as (
        try (
            [substring] among (
                '{'}' '{'}s' '{'}s{'}'
                       (delete)
            )
        )
        [substring] among (
            'sses' (<-'ss')
            'ied' 'ies'
                   ((hop 2 <-'i') or <-'ie')
            's'    (next gopast v delete)
            'us' 'ss'
                   ()
             // Additions for Early Modern English.
            'e{'}er' // e.g. whoe'er -> whoever
                   (<- 'ever')
            'lie'    // e.g. assemblie -> assembly
                   (R1  <- 'ly')
        )
    )

    define Step_1b as (
        [substring] among (
            'eed' 'eedly'
                (
                do (
                    R1
                    among (
                        'proc' 'exc' 'succ'
                            (atlimit)
                    ) or (
                        <-'ee'
                    )
                )
            )
            'ed' 'edly' 'ingly'
                (false) // Handled below.
            // Additions for Early Modern English.
            '{'}d'   // e.g. lov'd -> lov (-> love)
            '{'}dly' // e.g. favour'dly -> favour
            '{'}dst' // e.g. call'dst -> call
            '{'}st'  // e.g. know'st -> know
            '{'}t'   // e.g. advanc't -> advanc
            'edst'   // e.g. commandedst -> command
            'est'    // e.g. knowest -> know (but e.g. forest -> for (-> fore))
            'eth'    // e.g. knoweth -> know (but e.g. lambeth -> lamb)
                (false) // Handled below.
            'ing'
                ( // Handle exceptional cases here, rest handled below.
                among (
                    // dying->die, lying->die, tying->tie, vying->vie
                    'y'
                        (test(non-v atlimit) ] <-'ie')
                    // Leave inning, outing, etc alone.
                    'inn' 'out' 'cann' 'herr' 'earr' 'even'
                        (atlimit)
                )
            )
            ''  ()
        ) or (
            // Handle 'ed' 'edly' 'ing' 'ingly'
            test gopast v  delete
            [] test (
                substring among(
                    'at' 'bl' 'iz'
                         (fail(<- 'e'))
                    'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
                    // ignoring double c, h, j, k, q, v, w, and x
                         (not (aeo atlimit))
                    ''   (fail(atmark p1  test shortv  <- 'e'))
                )
            )
            [next]  delete
        )
    )

    define Step_1c as (
        ['y' or 'Y']
        non-v not atlimit
        <-'i'
    )

    define Step_2 as (
        [substring] R1 among (
            'tional'  (<-'tion')
            'enci'    (<-'ence')
            'anci'    (<-'ance')
            'abli'    (<-'able')
            'entli'   (<-'ent')
            'izer' 'ization'
                      (<-'ize')
            'ational' 'ation' 'ator'
                      (<-'ate')
            'alism' 'aliti' 'alli'
                      (<-'al')
            'fulness' (<-'ful')
            'ousli' 'ousness'
                      (<-'ous')
            'iveness' 'iviti'
                      (<-'ive')
            'biliti' 'bli'
                      (<-'ble')
            'ogist'   (<-'og')
            'ogi'     ('l' <-'og')
            'fulli'   (<-'ful')
            'lessli'  (<-'less')
            'li'      (valid_LI delete)
        )
    )

    define Step_3 as (
        [substring] R1 among (
            'tional'  (<-'tion')
            'ational' (<-'ate')
            'alize'   (<-'al')
            'icate' 'iciti' 'ical'
                      (<-'ic')
            'ful' 'ness'
                      (delete)
            'ative'
                      (R2 delete)
        )
    )

    define Step_4 as (
        [substring] R2 among (
            'al' 'ance' 'ence' 'er' 'ic' 'able' 'ible' 'ant' 'ement'
            'ment' 'ent' 'ism' 'ate' 'iti' 'ous' 'ive' 'ize'
                      (delete)
            'ion'     ('s' or 't' delete)
        )
    )

    define Step_5 as (
        [substring] among (
            'e' (R2 or (R1 not shortv) delete)
            'l' (R2 'l' delete)
            // Additions for Early Modern English.
            '{'}n'    // e.g. heav'n -> heaven
            '{'}nd'   // e.g. quick'nd -> quicken
                (<- 'en')
            '{'}r'    // e.g. (rememb'red -> rememb'r) -> remember
            '{'}rous' // e.g. murd'rous -> murder
                (<- 'er')
            '{'}ri'   // e.g. (wat'ry -> wat'ri) -> wateri
                (<- 'eri')
            '{'}li'   // e.g. (happ'ly -> happ'li) -> happili
                (<- 'ili')
        )
    )
)

define exception1 as (

    [substring] atlimit among(

        /* special changes: */

        'skis'      (<-'ski')
        'skies'     (<-'sky')

        /* special -LY cases */

        'idly'      (<-'idl')
        'gently'    (<-'gentl')
        'ugly'      (<-'ugli')
        'early'     (<-'earli')
        'only'      (<-'onli')
        'singly'    (<-'singl')

        // ... extensions possible here ...

        /* invariant forms: */

        'sky'
        'news'
        'howe'

        'atlas' 'cosmos' 'bias' 'andes' // not plural forms

        // ... extensions possible here ...
    )
)

define postlude as (Y_found  repeat(goto (['Y']) <-'y'))

define stem as (

    exception1 or
    not hop 3 or (
        do prelude
        do Step_0
        do mark_regions
        backwards (

            do Step_1a

            do Step_1b
            do Step_1c

            do Step_2
            do Step_3
            do Step_4

            do Step_5
        )
        do postlude
    )
)