The English (Porter2) stemming algorithm

Here is a sample of English vocabulary, with the stemmed forms that will be generated by this algorithm:

word		stem

consign	⇒	consign
consigned	⇒	consign
consigning	⇒	consign
consignment	⇒	consign
consist	⇒	consist
consisted	⇒	consist
consistency	⇒	consist
consistent	⇒	consist
consistently	⇒	consist
consisteth	⇒	consisteth
consisting	⇒	consist
consistory	⇒	consistori
consists	⇒	consist
consolate	⇒	consol
consolation	⇒	consol
consolations	⇒	consol
consolatory	⇒	consolatori
console	⇒	consol
consoled	⇒	consol
consoler	⇒	consol
consoles	⇒	consol
consolidate	⇒	consolid
consolidated	⇒	consolid
consolidating	⇒	consolid
consolidation	⇒	consolid
consoling	⇒	consol
consolingly	⇒	consol
consols	⇒	consol
consonancy	⇒	conson
consonant	⇒	conson
consort	⇒	consort
consorted	⇒	consort
consortest	⇒	consortest
consorting	⇒	consort
conspectuities	⇒	conspectu
conspicuous	⇒	conspicu
conspicuously	⇒	conspicu
conspir	⇒	conspir
conspiracy	⇒	conspiraci
conspirant	⇒	conspir

knack	⇒	knack
knackeries	⇒	knackeri
knacks	⇒	knack
knag	⇒	knag
knapp	⇒	knapp
knapsack	⇒	knapsack
knav	⇒	knav
knave	⇒	knave
knaveries	⇒	knaveri
knavery	⇒	knaveri
knaves	⇒	knave
knavish	⇒	knavish
knead	⇒	knead
kneaded	⇒	knead
kneading	⇒	knead
knee	⇒	knee
kneel	⇒	kneel
kneeled	⇒	kneel
kneeling	⇒	kneel
kneels	⇒	kneel
knees	⇒	knee
knell	⇒	knell
knelled	⇒	knell
knelt	⇒	knelt
knew	⇒	knew
knewest	⇒	knewest
knick	⇒	knick
knicknacks	⇒	knicknack
knif	⇒	knif
knife	⇒	knife
knight	⇒	knight
knighted	⇒	knight
knighthood	⇒	knighthood
knighthoods	⇒	knighthood
knightly	⇒	knight
knights	⇒	knight
knightsbridge	⇒	knightsbridg
knit	⇒	knit
knits	⇒	knit
knitted	⇒	knit

Developing the English stemmer

(Revised slightly, December 2001)
(Further revised, September 2002)

Martin Porter made more than one attempt to improve the structure of the Porter algorithm by making it follow the pattern of ending removal of the Romance language stemmers. It is not hard to see why one should want to do this: step 1b of the Porter stemmer removes ed and ing, which are i-suffixes (*) attached to verbs. If these suffixes are removed, there should be no need to remove d-suffixes which are not verbal, although it will try to do so. This seems to be a deficiency in the Porter stemmer, not shared by the Romance stemmers. Again, the divisions between steps 2, 3 and 4 seem rather arbitrary, and are not found in the Romance stemmers.

Nevertheless, these attempts at improvement were abandoned. They seem to lead to a more complicated algorithm with no very obvious improvements. A reason for not taking note of the outcome of step 1b may be that English endings do not determine word categories quite as strongly as endings in the Romance languages. For example, condition and position in French have to be nouns, but in English they can be verbs as well as nouns:

We are all conditioned by advertising
They are positioning themselves differently today

A possible reason for having separate steps 2, 3 and 4 is that d-suffix combinations in English are quite complex, a point which has been made elsewhere.

But it is hardly surprising that after twenty years of use of the Porter stemmer, certain improvements did suggest themselves, and a new algorithm for English is therefore offered here. (It could be called the ‘Porter2’ stemmer to distinguish it from the Porter stemmer, from which it derives.) The changes are not so very extensive:

[In C Porter stemmer but not in paper] Extra rule in Step 2: logi -> log
[In C Porter stemmer but not in paper] Step 2 rule: abli -> able replaced by bli -> ble
[In C Porter stemmer but not in paper] The algorithm leaves alone strings of length 2 (so as and is no longer lose s).
Terminating y is changed to i rather less often
Suffix us does not lose its s
A few additional suffixes are included for removal, including suffix ly
A small list of exceptional forms is included
[December 2001] Steps 5a and 5b of the old Porter stemmer were combined into a single step. This means that undoubling final ll is not done with removal of final e
[December 2001] In Step 3 ative is removed only when in region R2.
[September 2002] Exception added to prevent herring from stemming to her.
[May 2005] commun added to exceptional forms
[July 2005] A small adjustment was made (including a new step 0) to handle apostrophe.
[January 2006] "Words" ied and ies now stem to ie rather than i.
[January 2006] The implementation was fixed to follow the algorithm as documented here and now always treats an initial y as a consonant.
[November 2006] arsen added to exceptional forms
Snowball 3.0.0 Don't undouble if preceded by exactly a, e or o
Snowball 3.0.0 Exception added to prevent evening from stemming to even.
Snowball 3.0.0 Removed exception for skis as the algorithm gives the same stem without it!
Snowball 3.0.0 Avoid conflating past with paste/pastes/pasted/pasting.
Snowball 3.0.0 Avoid conflating universe/universes with universal/universally and university/universities.
Snowball 3.0.0 Avoid conflating lateral/laterally with later.
Snowball 3.0.0 Replace -ogist with -og to conflate geologist with geology, etc.
Snowball 3.0.0 Handle -eed and -ing exceptions in respective rules.
Snowball 3.1.0 Restored exception for skis which is needed.

Design Notes

Most comparatives (-er) are not handled. There is a rule to remove -er in R2 (intended mostly for other forms with that ending rather than comparatives - e.g. flatterer, observer, publisher) which means longer comparatives are handled (e.g. cleverer, yellower), but most longer adjectives form the comparative with more (e.g. more attractive rather than attractiver). Extending this rule to check R1 as well would affect too many words where removal would be problematic - for example charter, master, meter, number, mother, offer, proper, sober, solder, temper, wither. The problem is worse in US English where many words which end -re in British English are instead spelled -er - e.g. center, fiber, liter, luster.

Superlatives are not handled at all due to too many cases where it would be problematic - for example attest, behest, request, tempest. There is not a rule to remove -est only in R2 because it would help in very few cases but be harmful for e.g. deforest, disinterest, interest, manifest, redigest. These could be avoided by additional conditions on the removal, but the added complexity does not seem justified by the small number of words whose stemming would be improved by removing -est in R2.

Suffix -ian is not removed. A significant problem with removal is that sometimes we want to remove the whole three letters (e.g. orwellian, politician), sometimes -an (e.g. comedian, historian, italian) and sometimes just -n (e.g. indian, bolivian, californian).

Definition of the English stemmer

To begin with, here is the basic algorithm without reference to the exceptional forms. An exact comparison with the Porter algorithm needs to be done quite carefully if done at all. Here we indicate by * points of departure, and by + additional features. In the sample vocabulary, Porter and Porter2 stem slightly over 5% of words to different forms.

Define a vowel as one of

a e i o u y

Define a double as one of

bb dd ff gg mm nn pp rr tt

Define a valid li-ending as one of

c d e g h k m n r t

R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. (This definition may be modified for certain exceptional words — see below.)

R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel. (See note on R1 and R2.)
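As a rough illustration of the two definitions above, the following Python sketch computes R1 and R2 for a lower-case word. It assumes that vowel marking (y to Y) has already been applied and it ignores the exceptional words; the names region_start and regions are illustrative and are not part of the Snowball implementation.

```python
VOWELS = set("aeiouy")  # Y (a y marked as a consonant) is deliberately absent


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel, scanning from `start`.

    Returns len(word) if there is no such non-vowel.
    """
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def regions(word):
    """Return (R1, R2) as strings; R2 is found by applying the R1 rule inside R1."""
    p1 = region_start(word)
    p2 = region_start(word, p1)
    return word[p1:], word[p2:]


print(regions("beautiful"))  # ('iful', 'ul')
print(regions("eucharist"))  # ('harist', 'ist')
```

Working with the start indices (p1 and p2 in the Snowball code) rather than the substrings makes the later "in R1" and "in R2" tests a simple comparison of positions.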

Define a short syllable in a word as either (a) a vowel followed by a non-vowel other than w, x or Y and preceded by a non-vowel, or * (b) a vowel at the beginning of the word followed by a non-vowel, or (c) past.

So rap, trap, entrap end with a short syllable, and ow, on, at, past are classed as short syllables. But uproot, bestow, disturb do not end with a short syllable.

A word is called short if it ends in a short syllable, and if R1 is null.

So bed, shed and shred are short words; bead, embed and beds are not.
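The short-syllable and short-word tests can be sketched in Python as follows. This is only an illustration under stated assumptions: R1 is computed by the plain rule (the exceptional words are ignored, so past itself is not reported as a short word here), and the function names are hypothetical.

```python
VOWELS = set("aeiouy")  # Y (consonant y) counts as a non-vowel


def r1_start(word):
    """Index just past the first non-vowel following a vowel, else len(word)."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def ends_in_short_syllable(word):
    if word.endswith("past"):                      # case (c)
        return True
    if len(word) == 2:                             # case (b): vowel + non-vowel
        return word[0] in VOWELS and word[1] not in VOWELS
    return (                                       # case (a)
        len(word) >= 3
        and word[-3] not in VOWELS
        and word[-2] in VOWELS
        and word[-1] not in VOWELS
        and word[-1] not in "wxY"
    )


def is_short_word(word):
    """Short word: ends in a short syllable and R1 is null."""
    return ends_in_short_syllable(word) and r1_start(word) == len(word)


for w in ("bed", "shed", "shred", "bead", "embed", "beds"):
    print(w, is_short_word(w))
```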

An apostrophe (') may be regarded as a letter. (See note on apostrophes in English.)

If the word has two letters or less, leave it as it is.

Otherwise, do each of the following operations,

Remove initial ', if present. + Then,

Set initial y, or y after a vowel, to Y, and then establish the regions R1 and R2. (See note on vowel marking.)
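A minimal Python sketch of this vowel-marking pass is given below. The function name is illustrative; the sketch assumes the input is already lower case, and it mirrors the left-to-right behaviour of the prelude in that a y which has just been turned into Y no longer counts as a vowel for the next letter.

```python
VOWELS = set("aeiouy")


def mark_consonant_y(word):
    """Turn an initial y, or a y after a vowel, into Y."""
    chars = list(word)
    for i, ch in enumerate(chars):
        # chars[i - 1] may already have been rewritten to Y, which is not a vowel
        if ch == "y" and (i == 0 or chars[i - 1] in VOWELS):
            chars[i] = "Y"
    return "".join(chars)


for w in ("youth", "boy", "boyish", "fly", "syzygy", "sayyid"):
    print(w, "->", mark_consonant_y(w))
```

The final step of the algorithm undoes the marking, which in Python would amount to replacing Y by y in the result.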

Step 0: +

Search for the longest among the suffixes,

'
's
's': and remove if found.
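Step 0 is small enough to sketch directly. The Python below is an illustration only (step_0 is a hypothetical name); it tries the suffixes from longest to shortest so that the longest match wins.

```python
def step_0(word):
    """Remove the longest of the suffixes 's', 's and ' if one is present."""
    for suffix in ("'s'", "'s", "'"):  # longest first
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


print(step_0("dog's"))   # dog
print(step_0("dogs'"))   # dogs
```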

Step 1a:

Search for the longest among the following suffixes, and perform the action indicated.

sses: replace by ss
ied+ ies*: replace by i if preceded by more than one letter, otherwise by ie (so ties → tie, cries → cri)
s: delete if the preceding word part contains a vowel not immediately before the s (so gas and this retain the s, gaps and kiwis lose it)
us+ ss: do nothing
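The rules of Step 1a might be sketched in Python as below. This is a reading of the rules as listed, not the reference implementation: step_1a is a hypothetical name, the input is assumed to have at least three letters, and the suffixes are tested in an order that guarantees the longest match is chosen.

```python
VOWELS = set("aeiouy")


def step_1a(word):
    if word.endswith("sses"):
        return word[:-4] + "ss"
    if word.endswith(("ied", "ies")):
        stem = word[:-3]
        # i if preceded by more than one letter, otherwise ie
        return stem + ("i" if len(stem) > 1 else "ie")
    if word.endswith(("us", "ss")):
        return word  # do nothing
    # s: there must be a vowel somewhere before the letter that precedes the s
    if word.endswith("s") and any(ch in VOWELS for ch in word[:-2]):
        return word[:-1]
    return word


for w in ("caresses", "ties", "cries", "gas", "this", "gaps", "kiwis"):
    print(w, "->", step_1a(w))
```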

Step 1b:

Search for the longest among the following suffixes, and perform the action indicated.

eed eedly+

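replace by ee if in R1, unless the word part before the suffix is exactly proc, exc or succ, in which case do nothing (so proceed, exceed and succeed are left unchanged by this step)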

ed edly+ ing ingly+

for ing, check if the word before the suffix is exactly one of the following exceptional cases:

if it's a non-vowel followed by y, replace y and ing with ie (so dying → die), then go to step 1c.
if it's exactly one of inn, out, cann, herr, earr or even then go to step 1c.

delete if the preceding word part contains a vowel, and after the deletion:

if the word ends at, bl or iz add e (so luxuriat → luxuriate), or

if the word ends with a double preceded by something other than exactly a, e or o then remove the last letter (so hopp → hop but add, egg and off are not changed), or

if the word does not end with a double and is short, add e (so hop → hope)
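Because Step 1b has several interacting conditions, a Python sketch may help to show the order in which they apply. It is an illustration under assumptions: the function names are hypothetical, R1 is computed by the plain rule without the exceptional words, and the proc, exc, succ check is taken from the Snowball listing further down.

```python
VOWELS = set("aeiouy")
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")
SUFFIXES = ("eedly", "ingly", "edly", "eed", "ing", "ed")  # longest first
ING_KEEP = ("inn", "out", "cann", "herr", "earr", "even")


def r1_start(word):
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def ends_in_short_syllable(word):
    if word.endswith("past"):
        return True
    if len(word) == 2:
        return word[0] in VOWELS and word[1] not in VOWELS
    return (len(word) >= 3 and word[-3] not in VOWELS and word[-2] in VOWELS
            and word[-1] not in VOWELS and word[-1] not in "wxY")


def step_1b(word):
    p1 = r1_start(word)  # R1 is fixed before any deletion
    suffix = next((s for s in SUFFIXES if word.endswith(s)), None)
    if suffix is None:
        return word
    stem = word[: -len(suffix)]
    if suffix in ("eed", "eedly"):
        if stem in ("proc", "exc", "succ"):
            return word
        return stem + "ee" if len(stem) >= p1 else word  # only if in R1
    if suffix == "ing":
        if len(stem) == 2 and stem[0] not in VOWELS and stem[1] == "y":
            return stem[0] + "ie"          # dying -> die
        if stem in ING_KEEP:
            return word                    # herring stays herring
    if not any(ch in VOWELS for ch in stem):
        return word                        # no vowel before the suffix
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if stem.endswith(DOUBLES):
        return stem if stem[:-2] in ("a", "e", "o") else stem[:-1]
    if len(stem) <= p1 and ends_in_short_syllable(stem):
        return stem + "e"                  # stem is a short word
    return stem


for w in ("hopping", "hoping", "luxuriated", "adding", "dying", "herring"):
    print(w, "->", step_1b(w))
```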

Step 1c: *

replace suffix y or Y by i if preceded by a non-vowel which is not the first letter of the word (so cry → cri, by → by, say → say)
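In Python, Step 1c could be sketched as a single test. The name step_1c is illustrative; requiring more than two letters is how this sketch expresses the condition that the preceding non-vowel is not the first letter of the word.

```python
VOWELS = set("aeiouy")


def step_1c(word):
    """Replace a final y or Y by i when preceded by a non-initial non-vowel."""
    if len(word) > 2 and word[-1] in "yY" and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word


for w in ("cry", "by", "say"):
    print(w, "->", step_1c(w))
```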

Step 2:

Search for the longest among the following suffixes, and, if found and in R1, perform the action indicated.

tional: replace by tion
enci: replace by ence
anci: replace by ance
abli: replace by able
entli: replace by ent
izer ization: replace by ize
ational ation ator: replace by ate
alism aliti alli: replace by al
fulness: replace by ful
ousli ousness: replace by ous
iveness iviti: replace by ive
biliti bli+: replace by ble
ogist+: replace by og
ogi+: replace by og if preceded by l
fulli+: replace by ful
lessli+: replace by less
li+: delete if preceded by a valid li-ending
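Step 2 lends itself to a table-driven sketch. The Python below is illustrative only (the names are hypothetical and R1 is computed by the plain rule). Note that the longest matching suffix is selected first and only then tested against R1; if that test fails, no shorter suffix is tried.

```python
VOWELS = set("aeiouy")
VALID_LI = set("cdeghkmnrt")
STEP_2 = {
    "tional": "tion", "enci": "ence", "anci": "ance", "abli": "able",
    "entli": "ent", "izer": "ize", "ization": "ize", "ational": "ate",
    "ation": "ate", "ator": "ate", "alism": "al", "aliti": "al",
    "alli": "al", "fulness": "ful", "ousli": "ous", "ousness": "ous",
    "iveness": "ive", "iviti": "ive", "biliti": "ble", "bli": "ble",
    "ogist": "og", "ogi": "og", "fulli": "ful", "lessli": "less", "li": "",
}


def r1_start(word):
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def step_2(word):
    matches = [s for s in STEP_2 if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)          # longest suffix wins
    stem = word[: -len(suffix)]
    if len(stem) < r1_start(word):          # suffix must lie in R1
        return word
    if suffix == "ogi" and not stem.endswith("l"):
        return word
    if suffix == "li" and (not stem or stem[-1] not in VALID_LI):
        return word
    return stem + STEP_2[suffix]


for w in ("relational", "conditional", "geologi", "quickli", "valli"):
    print(w, "->", step_2(w))
```

The last example shows the effect of the longest-match rule: valli matches alli, which is not in R1, so the word is left alone even though li on its own would have qualified.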

Step 3:

Search for the longest among the following suffixes, and, if found and in R1, perform the action indicated.

tional+: replace by tion
ational+: replace by ate
alize: replace by al
icate iciti ical: replace by ic
ful ness: delete
ative*: delete if in R2

Step 4:

Search for the longest among the following suffixes, and, if found and in R2, perform the action indicated.

al ance ence er ic able ible ant ement ment ent ism ate iti ous ive ize: delete
ion: delete if preceded by s or t
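Step 4 follows the same longest-match pattern but tests against R2. The following Python sketch is an illustration with hypothetical names, again using the plain region rule; the examples flatterer and charter are the ones discussed in the design notes.

```python
VOWELS = set("aeiouy")
STEP_4 = ("al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
          "ment", "ent", "ism", "ate", "iti", "ous", "ive", "ize", "ion")


def region_start(word, start=0):
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def step_4(word):
    matches = [s for s in STEP_4 if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)               # longest suffix wins
    stem = word[: -len(suffix)]
    p2 = region_start(word, region_start(word))  # start of R2
    if len(stem) < p2:                           # suffix must lie in R2
        return word
    if suffix == "ion" and not stem.endswith(("s", "t")):
        return word
    return stem


for w in ("replacement", "adoption", "opinion", "flatterer", "charter"):
    print(w, "->", step_4(w))
```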

Step 5: *

Search for the following suffixes, and, if found, perform the action indicated.

e: delete if in R2, or in R1 and not preceded by a short syllable
l: delete if in R2 and preceded by l
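The two rules of Step 5 can be sketched in Python as follows. This is a hedged illustration: the helper names are hypothetical and the regions are computed by the plain rule without the exceptional words.

```python
VOWELS = set("aeiouy")


def region_start(word, start=0):
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def ends_in_short_syllable(word):
    if word.endswith("past"):
        return True
    if len(word) == 2:
        return word[0] in VOWELS and word[1] not in VOWELS
    return (len(word) >= 3 and word[-3] not in VOWELS and word[-2] in VOWELS
            and word[-1] not in VOWELS and word[-1] not in "wxY")


def step_5(word):
    p1 = region_start(word)
    p2 = region_start(word, p1)
    stem = word[:-1]
    if word.endswith("e"):
        in_r2 = len(stem) >= p2
        in_r1 = len(stem) >= p1
        if in_r2 or (in_r1 and not ends_in_short_syllable(stem)):
            return stem
    elif word.endswith("l"):
        if len(stem) >= p2 and stem.endswith("l"):
            return stem
    return word


for w in ("probate", "rate", "cease", "controll", "roll"):
    print(w, "->", step_5(w))
```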

Finally, turn any remaining Y letters in the word back into lower case.

Exceptional forms in general

It is quite easy to expand a Snowball script so that certain exceptional word forms get special treatment. The standard case is that certain words W₁, W₂ ..., instead of passing through the stemming process, are mapped to the forms X₁, X₂ ... respectively. If the script does the stemming by means of the call

    define stem as C

where C is a command, the exceptional cases can be dealt with by extending this to

    define stem as ( exception or C )

and putting in a routine exception:

    define exception as (
        [substring] atlimit among(
            'W₁'  ( <- 'X₁' )
            'W₂'  ( <- 'X₂' )
            ...
        )
    )

atlimit causes the whole string to be tested for equality with one of the W_i, and if a match is found, the string is replaced with X_i.

More precisely we might have a group of words W₁₁, W₁₂ ... that need to be mapped to X₁, another group W₂₁, W₂₂ ... that need to be mapped to X₂, and so on, and a list of words V₁, V₂ ... V_k that are to remain invariant. The exception routine may then be written as follows:

    among( 'W₁₁' 'W₁₂' ... (<- 'X₁')
           'W₂₁' 'W₂₂' ... (<- 'X₂')
           ...
           'W_n1' 'W_n2' ... (<- 'X_n')
           'V₁' 'V₂' ... 'V_k'
         )

And indeed the exception1 routine for the English stemmer has just that shape:

    define exception1 as (

        [substring] atlimit among(

            /* special changes: */

            'skis'      (<-'ski')
            'skies'     (<-'sky')

            /* special -LY cases */

            'idly'      (<-'idl')
            'gently'    (<-'gentl')
            'ugly'      (<-'ugli')
            'early'     (<-'earli')
            'only'      (<-'onli')
            'singly'    (<-'singl')

            // ... extensions possible here ...

            /* invariant forms: */

            'sky'
            'news'
            'howe'

            'atlas' 'cosmos' 'bias' 'andes' // not plural forms

            // ... extensions possible here ...
        )
    )

(More will be said about the words that appear here shortly.)

Here we see words being treated exceptionally before stemming is done, but equally we could treat stems exceptionally after stemming is done, and so, if we wish, map absorpt to absorb, reduct to reduc etc., as in the Lovins stemmer. But more generally, throughout the algorithm, each significant step may have recognised exceptions, and a suitably placed among will take care of them. For example, a point made at least twice in the literature is that words beginning gener are overstemmed by the Porter stemmer:

generate
generates
generated
generating
general
generally
generic
generically
generous
generously

→

gener

To fix cases of over-stemming like this which involve multiple words and suffixes, it's best to add an exception to the definition of the start of R1, which in the Snowball implementation means adjusting the code which sets p1, so we replace

    gopast v  gopast non-v  setmark p1

with

    among (
        'gener'
        // ... and other stems may be included here ...
    ) or (gopast v  gopast non-v)
    setmark p1

after which the words beginning gener stem as follows:

generate generates generated generating	→	generat
general generally	→	general
generic generically	→	generic
generous generously	→	generous
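The effect of the adjusted R1 can be seen in a small Python sketch. It is an illustration only: r1_start and R1_PREFIXES are hypothetical names, and the tuple holds just the single stem used in the passage above.

```python
VOWELS = set("aeiouy")
R1_PREFIXES = ("gener",)  # other stems could be added here


def r1_start(word, prefixes=R1_PREFIXES):
    """Start of R1: just after an exceptional prefix, else by the usual rule."""
    for prefix in prefixes:
        if word.startswith(prefix):
            return len(prefix)
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


for w in ("general", "generous", "gentle"):
    usual = w[r1_start(w, prefixes=()):]
    adjusted = w[r1_start(w):]
    print(w, "R1 by the usual rule:", usual, "| adjusted R1:", adjusted)
```

With the usual rule R1 of generous is erous, so the suffix ous lies in R2 and is removed; with the adjusted rule R1 is ous and R2 is empty, so the suffix survives.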

Exceptions specific to a single suffix are typically better handled where that suffix is handled rather than with a global exception, since the overhead of checking a global exception is incurred by all words.

For example, we want to make exceptions to -ing handling for e.g. dying → die (done for any word consisting of a consonant plus ying), and to leave words such as herring alone (rather than stemming to her):

                among (
                    // dying->die, lying->lie, tying->tie, vying->vie
                    'y'
                        (test(non-v atlimit) ] <-'ie')
                    // Leave inning, outing, etc alone.
                    'inn' 'out' 'cann' 'herr' 'earr' 'even'
                        (atlimit)
                )

This happens in step 1b, so step 1a may have removed terminal s, so we also stem dyings → die, herrings → herring, etc.

Snowball makes it easy therefore to add in lists of exceptions. But deciding what the lists of exceptions should be is far from easy. Essentially there are two lines of attack, the systematic and the piecemeal. One might systematically treat as exceptions the stem changes of irregular verbs, for example. The piecemeal approach is to add in exceptions as people notice them — like gener above. The problem with the systematic approach is that it should be done by investigating the entire language vocabulary, and that is more than most people are prepared to do. The problem with the piecemeal approach is that it is arbitrary, and usually yields little.

The exception lists in the English stemmer are meant to be illustrative (‘this is how it is done if you want to do it’), and were derived piecemeal.

a) The new stemmer improves on the Porter stemmer in handling short words ending e and y. There is however a mishandling of the four forms sky, skies, ski, skis, which is easily corrected by treating three of these words as special cases.

b) Similarly there is a problem with the ing form of three letter verbs ending ie. There are five such verbs: die, hie, lie, tie and vie, which are handled by a special case when -ing is preceded by exactly a non-vowel and y.

c) One has to be a little careful of certain ing forms: inning, outing, canning, herring, earring and evening, which one does not wish to be stemmed to in, out, can, her, ear and even.

d) The removal of suffix ly, which is not in the Porter stemmer, has a number of exceptions. Certain short-word exceptions are idly, gently, ugly, early, only, singly. Rarer words (bristly, burly, curly, surly ...) are not included.

e) The remaining words were included following complaints from users of the Porter algorithm. news is not the plural of new (noticed when IR systems were being set up for Reuters). Howe is a surname, and needs to be separated from how (noticed when doing a search for ‘Sir Geoffrey Howe’ in a demonstration at the House of Commons). succeed etc are not past participles, so the ed should not be removed (pointed out to me in an email from India). herring should not stem to her (another email from Russia).

f) Finally, a few non-plural words ending s have been added.

Incidentally, this illustrates how much feedback to expect from the real users of a stemming algorithm: seven or eight words in twenty years!

The definition of the English stemmer above is therefore supplemented by the following:

Exceptional forms in the English stemmer

If the word begins gener, commun, arsen, past, univers, later, emerg or organ, set R1 to be the remainder of the word.

Stem certain special words as follows,

skis	→	ski
skies	→	sky
idly gently ugly early only singly	→	idl gentl ugli earli onli singl

If one of the following is found, leave it invariant,

sky news howe
atlas		cosmos		bias		andes
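Outside Snowball, the same whole-word exceptions could be sketched as a lookup performed before the general stemmer runs. In the Python below the two tables are copied from the lists above, while stem_with_exceptions and the general_stem argument are hypothetical names; the stand-in stemmer used in the example is a placeholder and not the algorithm defined here.

```python
SPECIAL_STEMS = {
    "skis": "ski", "skies": "sky",
    "idly": "idl", "gently": "gentl", "ugly": "ugli",
    "early": "earli", "only": "onli", "singly": "singl",
}
INVARIANT = {"sky", "news", "howe", "atlas", "cosmos", "bias", "andes"}


def stem_with_exceptions(word, general_stem):
    """Apply the exception tables to the whole word, else call general_stem."""
    if word in SPECIAL_STEMS:
        return SPECIAL_STEMS[word]
    if word in INVARIANT:
        return word
    return general_stem(word)


def strip_final_s(word):
    """Placeholder for the real stemmer: removes one trailing s (assumption)."""
    return word[:-1] if word.endswith("s") else word


for w in ("skies", "news", "only", "cats"):
    print(w, "->", stem_with_exceptions(w, strip_final_s))
```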

The full algorithm in Snowball

integers ( p1 p2 )
booleans ( Y_found )

routines (
    prelude postlude
    mark_regions
    shortv
    R1 R2
    Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5
    exception1
)

externals ( stem )

groupings ( aeo v v_WXY valid_LI )

stringescapes {}

define aeo      'aeo'
define v        'aeiouy'
define v_WXY    v + 'wxY'

define valid_LI 'cdeghkmnrt'

define prelude as (
    unset Y_found
    do ( ['{'}'] delete)
    do ( ['y'] <-'Y' set Y_found)
    do repeat(goto (v ['y']) <-'Y' set Y_found)
)

define mark_regions as (
    $p1 = limit
    $p2 = limit
    do(
        among (
            'gener'   // generate/general/generic/generous
            'commun'  // communication/communism/community
            'arsen'   // arsenic/arsenal
            'past'    // past/paste
            'univers' // universe/universal/university
            'later'   // lateral/later
            'emerg'   // emerge/emergency
            'organ'   // organ/organic/organize
            // ... extensions possible here ...
        ) or (gopast v  gopast non-v)
        setmark p1
        gopast v  gopast non-v  setmark p2
    )
)

backwardmode (

    define shortv as (
        ( non-v_WXY v non-v )
        or
        ( non-v v atlimit )
        or
        ( 'past' ) // pasted/pasting
    )

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define Step_1a as (
        try (
            [substring] among (
                '{'}' '{'}s' '{'}s{'}'
                       (delete)
            )
        )
        [substring] among (
            'sses' (<-'ss')
            'ied' 'ies'
                   ((hop 2 <-'i') or <-'ie')
            's'    (next gopast v delete)
            'us' 'ss'
        )
    )

    define Step_1b as (
        [substring] among (
            'eed' 'eedly'
                (
                do (
                    among (
                        'proc' 'exc' 'succ'
                            (atlimit)
                    ) or (
                        R1 <-'ee'
                    )
                )
            )
            'ed' 'edly' 'ingly'
                (false) // Handled below.
            'ing'
                ( // Handle exceptional cases here, rest handled below.
                among (
                    // dying->die, lying->lie, tying->tie, vying->vie
                    'y'
                        (test(non-v atlimit) ] <-'ie')
                    // Leave inning, outing, etc alone.
                    'inn' 'out' 'cann' 'herr' 'earr' 'even'
                        (atlimit)
                )
            )
            ''  ()
        ) or (
            // Handle 'ed' 'edly' 'ing' 'ingly'
            test gopast v  delete
            [] test (
                substring among(
                    'at' 'bl' 'iz'
                         (fail(<- 'e'))
                    'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
                    // ignoring double c, h, j, k, q, v, w, and x
                         (not (aeo atlimit))
                    ''   (fail(atmark p1  test shortv  <- 'e'))
                )
            )
            [next]  delete
        )
    )

    define Step_1c as (
        ['y' or 'Y']
        non-v not atlimit
        <-'i'
    )

    define Step_2 as (
        [substring] R1 among (
            'tional'  (<-'tion')
            'enci'    (<-'ence')
            'anci'    (<-'ance')
            'abli'    (<-'able')
            'entli'   (<-'ent')
            'izer' 'ization'
                      (<-'ize')
            'ational' 'ation' 'ator'
                      (<-'ate')
            'alism' 'aliti' 'alli'
                      (<-'al')
            'fulness' (<-'ful')
            'ousli' 'ousness'
                      (<-'ous')
            'iveness' 'iviti'
                      (<-'ive')
            'biliti' 'bli'
                      (<-'ble')
            'ogist'   (<-'og')
            'ogi'     ('l' <-'og')
            'fulli'   (<-'ful')
            'lessli'  (<-'less')
            'li'      (valid_LI delete)
        )
    )

    define Step_3 as (
        [substring] R1 among (
            'tional'  (<-'tion')
            'ational' (<-'ate')
            'alize'   (<-'al')
            'icate' 'iciti' 'ical'
                      (<-'ic')
            'ful' 'ness'
                      (delete)
            'ative'
                      (R2 delete)
        )
    )

    define Step_4 as (
        [substring] R2 among (
            'al' 'ance' 'ence' 'er' 'ic' 'able' 'ible' 'ant' 'ement'
            'ment' 'ent' 'ism' 'ate' 'iti' 'ous' 'ive' 'ize'
                      (delete)
            'ion'     ('s' or 't' delete)
        )
    )

    define Step_5 as (
        [substring] among (
            'e' (R2 or (R1 not shortv) delete)
            'l' (R2 'l' delete)
        )
    )
)

define exception1 as (

    [substring] atlimit among(

        /* special changes: */

        'skis'      (<-'ski')
        'skies'     (<-'sky')

        /* special -LY cases */

        'idly'      (<-'idl')
        'gently'    (<-'gentl')
        'ugly'      (<-'ugli')
        'early'     (<-'earli')
        'only'      (<-'onli')
        'singly'    (<-'singl')

        // ... extensions possible here ...

        /* invariant forms: */

        'sky'
        'news'
        'howe'

        'atlas' 'cosmos' 'bias' 'andes' // not plural forms

        // ... extensions possible here ...
    )
)

define postlude as (Y_found  repeat(goto (['Y']) <-'y'))

define stem as (

    exception1 or
    not hop 3 or (
        do prelude
        do mark_regions
        backwards (

            do Step_1a

            do Step_1b
            do Step_1c

            do Step_2
            do Step_3
            do Step_4

            do Step_5
        )
        do postlude
    )
)