Here is a sample of English vocabulary, with the stemmed forms that will be generated by this algorithm:
word | stem | word | stem | |||||||||
consign consigned consigning consignment consist consisted consistency consistent consistently consisteth consisting consistory consists consolate consolation consolations consolatory console consoled consoler consoles consolidate consolidated consolidating consolidation consoling consolingly consols consonancy consonant consort consorted consortest consorting conspectuities conspicuous conspicuously conspir conspiracy conspirant |
⇒ |
consign consign consign consign consist consist consist consist consist consisteth consist consistori consist consol consol consol consolatori consol consol consol consol consolid consolid consolid consolid consol consol consol conson conson consort consort consortest consort conspectu conspicu conspicu conspir conspiraci conspir |
knack knackeries knacks knag knapp knapsack knav knave knaveries knavery knaves knavish knead kneaded kneading knee kneel kneeled kneeling kneels knees knell knelled knelt knew knewest knick knicknacks knif knife knight knighted knighthood knighthoods knightly knights knightsbridge knit knits knitted |
⇒ |
knack knackeri knack knag knapp knapsack knav knave knaveri knaveri knave knavish knead knead knead knee kneel kneel kneel kneel knee knell knell knelt knew knewest knick knicknack knif knife knight knight knighthood knighthood knight knight knightsbridg knit knit knit |
(Revised slightly, December 2001)
(Further revised, September 2002)
I have made more than one attempt to improve the structure of the Porter algorithm by making it follow the pattern of ending removal of the Romance language stemmers. It is not hard to see why one should want to do this: step 1b of the Porter stemmer removes ed and ing, which are i-suffixes (*) attached to verbs. If these suffixes are removed, there should be no need to remove d-suffixes which are not verbal, although it will try to do so. This seems to be a deficiency in the Porter stemmer, not shared by the Romance stemmers. Again, the divisions between steps 2, 3 and 4 seem rather arbitrary, and are not found in the Romance stemmers.
Nevertheless, these attempts at improvement have been abandoned. They seem to lead to a more complicated algorithm with no very obvious improvements. A reason for not taking note of the outcome of step 1b may be that English endings do not determine word categories quite as strongly as endings in the Romance languages. For example, condition and position in French have to be nouns, but in English they can be verbs as well as nouns,
But it is hardly surprising that after twenty years of use of the Porter stemmer, certain improvements did suggest themselves, and a new algorithm for English is therefore offered here. (It could be called the ‘Porter2’ stemmer to distinguish it from the Porter stemmer, from which it derives.) The changes are not so very extensive:
To begin with, here is the basic algorithm without reference to the
exceptional forms. An exact comparison with the Porter algorithm needs to
be done quite carefully if done at all. Here we indicate by *
points
of departure, and by +
additional features. In the sample vocabulary,
Porter and Porter2 stem slightly under 5% of words to different forms.
Define a vowel as one of
R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel. (See note on R1 and R2.)
Define a short syllable in a word as either (a) a vowel followed by a
non-vowel other than w, x or Y and preceded by a non-vowel, or
*
(b) a vowel at the beginning of the word followed by a non-vowel.
So rap, trap, entrap end with a short syllable, and ow, on, at are classed as short syllables. But uproot, bestow, disturb do not end with a short syllable.
A word is called short if it ends in a short syllable, and if R1 is null.
So bed, shed and shred are short words, bead, embed, beds are not short words.
An apostrophe (') may be regarded as a letter. (See note on apostrophes in English.)
If the word has two letters or less, leave it as it is.
Otherwise, do each of the following operations,
Remove initial ', if present. +
Then,
Set initial y, or y after a vowel, to Y, and then establish the regions R1 and R2. (See note on vowel marking.)
Step 0: +
Search for the longest among the suffixes,
Search for the longest among the following suffixes, and perform the action indicated.
+
ies*
+
ss
Search for the longest among the following suffixes, and perform the action indicated.
+
+
ing ingly+
*
Search for the longest among the following suffixes, and, if found and in R1, perform the action indicated.
+
: replace by ble
+
: replace by og if preceded by l
+
: replace by ful
+
: replace by less
+
: delete if preceded by a valid li-ending
Search for the longest among the following suffixes, and, if found and in R1, perform the action indicated.
+
: replace by tion
+
: replace by ate
*
: delete if in R2
Search for the longest among the following suffixes, and, if found and in R2, perform the action indicated.
*
Search for the following suffixes, and, if found, perform the action indicated.
Finally, turn any remaining Y letters in the word back into lower case.
It is quite easy to expand a Snowball script so that certain exceptional
word forms get special treatment. The standard case is that certain words
W1
, W2
..., instead of passing through the stemming process, are
mapped to the forms X1
, X2
... respectively. If the script does
the stemming by means of the call
define stem as C
where C
is a command, the exceptional cases can be dealt with by extending this to
define stem as ( exception or C )
and putting in a routine exception
:
define exception as ( [substring] atlimit among( 'W1' ( <- 'X1' ) 'W2' ( <- 'X2' ) ... ) )
atlimit
causes the whole string to be tested for equality with one of
the Wi
, and if a match is found, the string is replaced with
Xi
.
More precisely we might have a group of words W11
, W12
...
that need to be mapped to X1
, another group W21
, W22
... that need to be mapped to X2
, and so on, and a list of words
V1
, V2
... Vk
that are to remain invariant. The
exception
routine may then be written as follows:
among( 'W11' 'W12' ... (<- 'X1') 'W21' 'W22' ... (<- 'X2') ... 'Wn1' 'Wn2' ... (<- 'Xn') 'V1' 'V2' ... 'Vk' )
And indeed the exception1
routine for the English stemmer has just that
shape:
define exception1 as (
[substring] atlimit among(
/* special changes: */
'skis' (<-'ski')
'skies' (<-'sky')
'dying' (<-'die')
'lying' (<-'lie')
'tying' (<-'tie')
/* special -LY cases */
'idly' (<-'idl')
'gently' (<-'gentl')
'ugly' (<-'ugli')
'early' (<-'earli')
'only' (<-'onli')
'singly' (<-'singl')
// ... extensions possible here ...
/* invariant forms: */
'sky'
'news'
'howe'
'atlas' 'cosmos' 'bias' 'andes' // not plural forms
// ... extensions possible here ...
)
)
(More will be said about the words that appear here shortly.)
Here we see words being treated exceptionally before stemming is done, but equally we could
treat stems exceptionally after stemming is done, and so, if we wish, map absorpt to
absorb, reduct to reduc etc., as in the
Lovins stemmer.
But more generally, throughout the algorithm, each significant step may have recognised
exceptions, and a suitably placed among
will take care of them. For example, a point made
at least twice in the literature is that words beginning gener are overstemmed by the
Porter stemmer:
generate generates generated generating general generally generic generically generous generously | → | gener |
To fix this over-stemming, we make an exception to the usual setting of p1, the left point of R1, and therefore replace
gopast v gopast non-v setmark p1
with
among (
'gener'
// ... and other stems may be included here ...
) or (gopast v gopast non-v)
setmark p
after which the words beginning gener stem as follows:
generate generates generated generating | → | generat | ||
general generally | → | general | ||
generic generically | → | generic | ||
generous generously | → | generous |
Another example is given by the exception2
routine, which is similar to exception1
,
but placed after the call of Step_1a
, which may have removed terminal s,
define exception2 as (
[substring] atlimit among(
'inning' 'outing' 'canning' 'herring'
'proceed' 'exceed' 'succeed'
// ... extensions possible here ...
)
)
Snowball makes it easy therefore to add in lists of exceptions. But deciding what the lists of exceptions should be is far from easy. Essentially there are two lines of attack, the systematic and the piecemeal. One might systematically treat as exceptions the stem changes of irregular verbs, for example. The piecemeal approach is to add in exceptions as people notice them — like gener above. The problem with the systematic approach is that it should be done by investigating the entire language vocabulary, and that is more than most people are prepared to do. The problem with the piecemeal approach is that it is arbitrary, and usually yields little.
The exception lists in the English stemmer are meant to be illustrative (‘this is how it is done if you want to do it’), and were derived piecemeal.
a) The new stemmer improves on the Porter stemmer in handling short words ending e and y. There is however a mishandling of the four forms sky, skies, ski, skis, which is easily corrected by treating three of these words as special cases.
b) Similarly there is a problem with the ing form of three letter verbs ending ie. There are only three such verbs: die, lie and tie, so a special case is made for dying, lying and tying.
c) One has to be a little careful of certain ing forms. inning, outing, canning, which one does not wish to be stemmed to in, out, can.
d) The removal of suffix ly, which is not in the Porter stemmer, has a number of exceptions. Certain short-word exceptions are idly, gently, ugly, early, only, singly. Rarer words (bristly, burly, curly, surly ...) are not included.
e) The remaining words were included following complaints from users of the Porter algorithm. news is not the plural of new (noticed when IR systems were being set up for Reuters). Howe is a surname, and needs to be separated from how (noticed when doing a search for ‘Sir Geoffrey Howe’ in a demonstration at the House of Commons). succeed etc are not past participles, so the ed should not be removed (pointed out to me in an email from India). herring should not stem to her (another email from Russia).
f) Finally, a few non-plural words ending s have been added.
Incidentally, this illustrates how much feedback to expect from the real users of a stemming algorithm: seven or eight words in twenty years!
The definition of the English stemmer above is therefore supplemented by the following:
If the word begins gener, commun or arsen, set R1 to be the remainder of the word.
Stem certain special words as follows,
skis | → | ski | ||
skies | → | sky | ||
dying lying tying | → | die lie tie | ||
idly gently ugly early only singly | → | idl gentl ugli earli onli singl |
If one of the following is found, leave it invariant,
sky news howe | ||||||
atlas | cosmos | bias | andes |
Following step 1a, leave the following invariant,
inning | outing | canning | herring | earring | ||||
proceed | exceed | succeed |
integers ( p1 p2 )
booleans ( Y_found )
routines (
prelude postlude
mark_regions
shortv
R1 R2
Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5
exception1
exception2
)
externals ( stem )
groupings ( aeo v v_WXY valid_LI )
stringescapes {}
define aeo 'aeo'
define v 'aeiouy'
define v_WXY v + 'wxY'
define valid_LI 'cdeghkmnrt'
define prelude as (
unset Y_found
do ( ['{'}'] delete)
do ( ['y'] <-'Y' set Y_found)
do repeat(goto (v ['y']) <-'Y' set Y_found)
)
define mark_regions as (
$p1 = limit
$p2 = limit
do(
among (
'gener'
'commun' // added May 2005
'arsen' // added Nov 2006 (arsenic/arsenal)
// ... extensions possible here ...
) or (gopast v gopast non-v)
setmark p1
gopast v gopast non-v setmark p2
)
)
backwardmode (
define shortv as (
( non-v_WXY v non-v )
or
( non-v v atlimit )
)
define R1 as $p1 <= cursor
define R2 as $p2 <= cursor
define Step_1a as (
try (
[substring] among (
'{'}' '{'}s' '{'}s{'}'
(delete)
)
)
[substring] among (
'sses' (<-'ss')
'ied' 'ies'
((hop 2 <-'i') or <-'ie')
's' (next gopast v delete)
'us' 'ss'
)
)
define Step_1b as (
[substring] among (
'eed' 'eedly'
(R1 <-'ee')
'ed' 'edly' 'ing' 'ingly'
(
test gopast v delete
[] test (
substring among(
'at' 'bl' 'iz'
(fail(<- 'e'))
'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
// ignoring double c, h, j, k, q, v, w, and x
(not (aeo atlimit))
'' (fail(atmark p1 test shortv <- 'e'))
)
)
[next] delete
)
)
)
define Step_1c as (
['y' or 'Y']
non-v not atlimit
<-'i'
)
define Step_2 as (
[substring] R1 among (
'tional' (<-'tion')
'enci' (<-'ence')
'anci' (<-'ance')
'abli' (<-'able')
'entli' (<-'ent')
'izer' 'ization'
(<-'ize')
'ational' 'ation' 'ator'
(<-'ate')
'alism' 'aliti' 'alli'
(<-'al')
'fulness' (<-'ful')
'ousli' 'ousness'
(<-'ous')
'iveness' 'iviti'
(<-'ive')
'biliti' 'bli'
(<-'ble')
'ogi' ('l' <-'og')
'fulli' (<-'ful')
'lessli' (<-'less')
'li' (valid_LI delete)
)
)
define Step_3 as (
[substring] R1 among (
'tional' (<- 'tion')
'ational' (<- 'ate')
'alize' (<-'al')
'icate' 'iciti' 'ical'
(<-'ic')
'ful' 'ness'
(delete)
'ative'
(R2 delete) // 'R2' added Dec 2001
)
)
define Step_4 as (
[substring] R2 among (
'al' 'ance' 'ence' 'er' 'ic' 'able' 'ible' 'ant' 'ement'
'ment' 'ent' 'ism' 'ate' 'iti' 'ous' 'ive' 'ize'
(delete)
'ion' ('s' or 't' delete)
)
)
define Step_5 as (
[substring] among (
'e' (R2 or (R1 not shortv) delete)
'l' (R2 'l' delete)
)
)
define exception2 as (
[substring] atlimit among(
'inning' 'outing' 'canning' 'herring' 'earring'
'proceed' 'exceed' 'succeed'
// ... extensions possible here ...
)
)
)
define exception1 as (
[substring] atlimit among(
/* special changes: */
'skis' (<-'ski')
'skies' (<-'sky')
'dying' (<-'die')
'lying' (<-'lie')
'tying' (<-'tie')
/* special -LY cases */
'idly' (<-'idl')
'gently' (<-'gentl')
'ugly' (<-'ugli')
'early' (<-'earli')
'only' (<-'onli')
'singly' (<-'singl')
// ... extensions possible here ...
/* invariant forms: */
'sky'
'news'
'howe'
'atlas' 'cosmos' 'bias' 'andes' // not plural forms
// ... extensions possible here ...
)
)
define postlude as (Y_found repeat(goto (['Y']) <-'y'))
define stem as (
exception1 or
not hop 3 or (
do prelude
do mark_regions
backwards (
do Step_1a
exception2 or (
do Step_1b
do Step_1c
do Step_2
do Step_3
do Step_4
do Step_5
)
)
do postlude
)
)