Finnish stemming algorithm

Here is a sample of vocabulary, with the stemmed forms that will be generated with the algorithm.

word            stem        word              stem
edeltäjien      edeltäj     innostu           innostu
edeltäjiensä    edeltäjie   innostua          innostu
edeltäjiinsä    edeltäj     innostuessaan     innostue
edeltäjistään   edeltäj     innostui          innostui
edeltäjiä       edeltäj     innostuimme       innostui
edeltäjiään     edeltäjiä   innostuin         innostu
edeltäjä        edeltäj     innostuisi        innostui
edeltäjälleen   edeltäj     innostuisivat     innostuisiv
edeltäjän       edeltäj     innostuivat       innostuiv
edeltäjäni      edeltäj     innostukseen      innostuks
edeltäjänsä     edeltäj     innostuksella     innostuks
edeltäjänä      edeltäj     innostuksen       innostuks
edeltäjässä     edeltäj     innostuksensa     innostuks
edeltäjästä     edeltäj     innostuksessa     innostuks
edeltäjästään   edeltäj     innostuksessaan   innostuks
edeltäjät       edeltäj     innostuksesta     innostuks
edeltäjää       edeltäj     innostuksissaan   innostuks
edeltäjään      edeltäj     innostumaan       innostum
edeltäjäänsä    edeltäj     innostuminen      innostumin
edeltäneelle    edeltän     innostun          innostu
edeltäneellä    edeltän     innostuneelle     innostun
edeltäneeltä    edeltän     innostuneempia    innostun
edeltäneen      edeltän     innostuneen       innostun
edeltäneenä     edeltän     innostuneena      innostun
edeltäneeseen   edeltän     innostuneesta     innostun
edeltäneessä    edeltän     innostuneesti     innostun
edeltäneestä    edeltän     innostuneet       innostun
edeltäneet      edeltän     innostuneiden     innostun
edeltäneiden    edeltän     innostuneiksi     innostun
edeltäneissä    edeltän     innostunein       innostun
edeltäneitä     edeltän     innostuneina      innostun
edeltänyt       edeltäny    innostuneissa     innostun
edeltänyttä     edeltänyt   innostuneisuus    innostuneisuus
edeltävien      edeltäv     innostuneita      innostun
edeltäviin      edeltäv     innostunut        innostunu
edeltävinä      edeltäv     innostunutta      innostunut
edeltävissä     edeltäv     innostus          innostus
edeltävä        edeltäv     innostusta        innostu
edeltävälle     edeltäv     innostustaan      innostu
edeltävällä     edeltäv     innostutaan       innostu

Finnish is not an Indo-European language, but belongs to the Finno-Ugric group, which again belongs to the Uralic group. Distinctions between a-, i- and d-suffixes can be made in Finnish, but they are much less sharply separated than in an Indo-European language. The system of endings is extremely elaborate, but strictly defined, and applies equally to all nominals, that is, to nouns, adjectives and pronouns. Verb endings have a close similarity to nominal endings, which again makes Finnish very different from any Indo-European language.

More problematical than the endings themselves is the change that can be effected in a stem as a result of taking a particular ending. A stem typically has two forms, strong and weak, where one class of ending follows the strong form and the complementary class the weak. Normalising strong and weak forms after ending removal is not generally possible, although the common case where strong and weak forms only differ in the single or double form of a final consonant can be dealt with.

Finnish includes the following accented forms:

ä   ö

The following letters are vowels:

a   e   i   o   u   y   ä   ö

R1 and R2 are then defined in the usual way (see the note on R1 and R2).
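
As a rough illustration, the region boundaries can be computed in Python as sketched below. This is not part of the Snowball distribution: the names VOWELS, next_region and regions are hypothetical, and the sketch assumes the usual definition, in which R1 starts after the first non-vowel that follows a vowel and R2 is found the same way inside R1 (either region is empty if no such non-vowel exists).

```python
VOWELS = set("aeiouyäö")


def next_region(word, start):
    """Index just past the first non-vowel that follows a vowel,
    scanning from `start`; len(word) if there is none."""
    i, n = start, len(word)
    while i < n and word[i] not in VOWELS:   # skip to the first vowel
        i += 1
    while i < n and word[i] in VOWELS:       # skip the run of vowels
        i += 1
    return i + 1 if i < n else n


def regions(word):
    """Return (p1, p2): R1 is word[p1:], R2 is word[p2:]."""
    p1 = next_region(word, 0)
    p2 = next_region(word, p1)
    return p1, p2


p1, p2 = regions("edeltäjien")
print(p1, p2)  # R1 = "eltäjien", R2 = "täjien"
```

The later steps only need the two offsets, so helper sketches for those steps can take p1 and p2 as plain integer arguments.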

Do each of steps 1, 2, 3, 4, 5 and 6.

Step 1: particles etc

Search for the longest among the following suffixes in R1, and perform the action indicated

(a) kin   kaan   kään   ko   kö   han   hän   pa   pä
delete if preceded by n, t or a vowel
(b) sti
delete if in R2

(Of course, the n, t or vowel of 1(a) need not be in R1: only the suffix removed must be in R1. And similarly below.)
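
Step 1 can be sketched in Python as follows. The function name step1 and its arguments are hypothetical; p1 and p2 are assumed to be the start offsets of R1 and R2, computed beforehand. No suffix in the list is a tail of another, so at most one can match.

```python
VOWELS = set("aeiouyäö")
PARTICLES = ("kin", "kaan", "kään", "ko", "kö", "han", "hän", "pa", "pä")


def step1(word, p1, p2):
    """Sketch of step 1; p1 and p2 are the start offsets of R1 and R2."""
    r1 = word[p1:]
    suffix = next((s for s in PARTICLES + ("sti",) if r1.endswith(s)), None)
    if suffix is None:
        return word
    stem = word[:-len(suffix)]
    if suffix == "sti":
        # (b) delete only if the whole suffix lies in R2
        return stem if len(stem) >= p2 else word
    # (a) delete if preceded by n, t or a vowel (which need not be in R1)
    if stem and (stem[-1] in VOWELS or stem[-1] in "nt"):
        return stem
    return word


print(step1("talossakin", 3, 5))     # particle kin after a vowel
print(step1("innostuneesti", 2, 5))  # adverb sti inside R2
print(step1("pankko", 3, 6))         # ko after k: left alone
```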

Step 2: possessives

Search for the longest among the following suffixes in R1, and perform the action indicated

si
    delete if not preceded by k
ni
    delete
    if preceded by kse, replace with ksi
nsa   nsä   mme   nne
    delete
an
    delete if preceded by one of   ta   ssa   sta   lla   lta   na
än
    delete if preceded by one of   tä   ssä   stä   llä   ltä   nä
en
    delete if preceded by one of   lle   ine
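
A minimal Python sketch of step 2 follows. The names step2, POSSESSIVES and CASE_BEFORE are hypothetical, and p1 is assumed to be the start offset of R1. For ni, the suffix is always deleted, and a remaining kse is then rewritten to ksi.

```python
POSSESSIVES = ("nsa", "nsä", "mme", "nne", "si", "ni", "an", "än", "en")
# Case endings that must precede the Vn possessives
CASE_BEFORE = {
    "an": ("ta", "ssa", "sta", "lla", "lta", "na"),
    "än": ("tä", "ssä", "stä", "llä", "ltä", "nä"),
    "en": ("lle", "ine"),
}


def step2(word, p1):
    """Sketch of step 2; p1 is the start offset of R1."""
    r1 = word[p1:]
    suffix = next((s for s in POSSESSIVES if r1.endswith(s)), None)
    if suffix is None:
        return word
    stem = word[:-len(suffix)]
    if suffix == "si":
        # ksi is left for step 3
        return word if stem.endswith("k") else stem
    if suffix == "ni":
        # kseni = ksi + ni
        return stem[:-3] + "ksi" if stem.endswith("kse") else stem
    if suffix in CASE_BEFORE:
        return stem if stem.endswith(CASE_BEFORE[suffix]) else word
    return stem  # nsa, nsä, mme, nne


for w in ("edeltäjäni", "innostukseni", "edeltäjästään", "edeltäjän"):
    print(w, "->", step2(w, 2))
```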

The remaining steps require a few definitions.

Define a v (vowel) as one of   a   e   i   o   u   y   ä   ö.
Define a V (restricted vowel) as one of   a   e   i   o   u   ä   ö.
So Vi means a V followed by letter i.
Define LV (long vowel) as one of   aa   ee   ii   oo   uu   ää   öö.
Define a c (consonant) as a character other than a v.
So cv means a c followed by a v.

Step 3: cases

Search for the longest among the following suffixes in R1, and perform the action indicated

hXn   preceded by X, where X is a V other than u (a/han, e/hen etc)
siin   den   tten   preceded by Vi
seen   preceded by LV
a   ä   preceded by cv
tta   ttä   preceded by e
ta   tä   ssa   ssä   sta   stä   lla   llä   lta   ltä   lle   na   nä   ksi   ine
    delete
n
    delete, and if preceded by LV or ie, delete the last vowel

So aarteisiin → aartei, the longest matching suffix being siin, preceded as it is by Vi. But adressiin → adressi. The longest matching suffix is not siin, because there is no preceding Vi, but n, and then the last vowel of the preceding LV is removed.
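
The two examples can be reproduced with the Python sketch below. All names are hypothetical and p1 is assumed to be the start offset of R1. It rests on two assumptions about how the preconditions interact with "longest match": for siin, den, tten and seen the precondition is treated as part of the match, so a shorter suffix can still apply (as the adressiin example requires); for hXn, a/ä and tta/ttä a failed precondition is taken to mean that nothing is removed, which is one reading of the Snowball source.

```python
V1 = set("aeiouyäö")   # "v" in the text
V2 = set("aeiouäö")    # "V" in the text (no y)
LONG = ("aa", "ee", "ii", "oo", "uu", "ää", "öö")
HXN = ("han", "hen", "hin", "hon", "hän", "hön")
PLAIN = ("ta", "tä", "ssa", "ssä", "sta", "stä", "lla", "llä",
         "lta", "ltä", "lle", "na", "nä", "ksi", "ine")


def after_vi(stem):
    """True if stem ends in a V followed by i."""
    return len(stem) >= 2 and stem[-1] == "i" and stem[-2] in V2


def step3(word, p1):
    """Return (word after step 3, ending_removed)."""
    r1 = word[p1:]
    found = []
    # Assumption: precondition is part of the match for these four.
    for s in ("siin", "den", "tten"):
        if r1.endswith(s) and after_vi(word[:-len(s)]):
            found.append(s)
    if r1.endswith("seen") and word[:-4].endswith(LONG):
        found.append("seen")
    found += [s for s in HXN + PLAIN + ("tta", "ttä", "a", "ä", "n")
              if r1.endswith(s)]
    if not found:
        return word, False
    suffix = max(found, key=len)
    stem = word[:-len(suffix)]
    # Assumption: a failed precondition here removes nothing.
    if suffix in HXN and not stem.endswith(suffix[1]):
        return word, False
    if suffix in ("tta", "ttä") and not stem.endswith("e"):
        return word, False
    if suffix in ("a", "ä") and not (
            len(stem) >= 2 and stem[-1] in V1 and stem[-2] not in V1):
        return word, False
    if suffix == "n" and stem.endswith(LONG + ("ie",)):
        stem = stem[:-1]   # also delete the last vowel
    return stem, True


print(step3("aarteisiin", 3))  # siin preceded by Vi
print(step3("adressiin", 2))   # falls back to n, then drops one i
```

The second element of the result records whether an ending was removed, which step 5 needs.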

Step 4: other endings

Search for the longest among the following suffixes in R2, and perform the action indicated

mpi   mpa   mpä   mmi   mma   mmä
delete if not preceded by po
impi   impa   impä   immi   imma   immä   eja   ejä
delete
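
Step 4 is small enough to sketch directly in Python. The names step4, COMPARATIVE and OTHER are hypothetical, and p2 is assumed to be the start offset of R2. Taking the longest suffix that fits inside R2 means that mpi can still apply when a longer impi would start before R2.

```python
COMPARATIVE = ("mpi", "mpa", "mpä", "mmi", "mma", "mmä")
OTHER = ("impi", "impa", "impä", "immi", "imma", "immä", "eja", "ejä")


def step4(word, p2):
    """Sketch of step 4; p2 is the start offset of R2."""
    r2 = word[p2:]
    found = [s for s in COMPARATIVE + OTHER if r2.endswith(s)]
    if not found:
        return word
    suffix = max(found, key=len)
    stem = word[:-len(suffix)]
    if suffix in COMPARATIVE and stem.endswith("po"):
        return word   # keep forms such as ...pompi
    return stem


print(step4("innostuneempi", 5))  # comparative
print(step4("innostuneimpi", 5))  # superlative
```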

Step 5: plurals

If an ending was removed in step 3, delete a final i or j if in R1. Otherwise, delete a final t in R1 if it follows a vowel and, if a t is removed, delete a final mma or imma in R2, unless the mma is preceded by po.
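
The branching in step 5 can be sketched in Python as below. The name step5 and its arguments are hypothetical: p1 and p2 are assumed to be the start offsets of R1 and R2 in the original word, and ending_removed is the flag carried over from step 3.

```python
V1 = set("aeiouyäö")


def step5(word, p1, p2, ending_removed):
    """Sketch of step 5; offsets refer to the original word."""
    if ending_removed:
        # i-plural: drop a final i or j lying in R1
        return word[:-1] if word[p1:].endswith(("i", "j")) else word
    # t-plural: drop a final t in R1 that follows a vowel
    if word[p1:].endswith("t") and len(word) >= 2 and word[-2] in V1:
        word = word[:-1]
        r2 = word[p2:]
        if r2.endswith("imma"):
            return word[:-4]
        if r2.endswith("mma") and not word[:-3].endswith("po"):
            return word[:-3]
    return word


print(step5("edeltäji", 2, 4, True))         # i-plural
print(step5("edeltäjät", 2, 4, False))       # t-plural
print(step5("innostuneimmat", 2, 5, False))  # t, then imma
```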

Step 6: tidying up

Do in turn steps (a), (b), (c), (d), restricting all tests to the region R1.

a) If R1 ends LV delete the last letter
b) If R1 ends cX, c a consonant and X one of   a   ä   e   i, delete the last letter
c) If R1 ends oj or uj delete the last letter
d) If R1 ends jo delete the last letter

Do step (e), which is not restricted to R1.

e) If the word ends with a double consonant followed by zero or more vowels, remove the last consonant (so eläkk → eläk, aatonaatto → aatonaato)
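
A Python sketch of the tidying step is given below. The name step6 is hypothetical and p1 is assumed to be the start offset of R1. Steps (a) to (d) test only the text inside R1, while (e) looks at the whole word.

```python
V1 = set("aeiouyäö")
LONG = ("aa", "ee", "ii", "oo", "uu", "ää", "öö")


def step6(word, p1):
    """Sketch of step 6; p1 is the start offset of R1."""
    # (a) undouble a final long vowel
    if word[p1:].endswith(LONG):
        word = word[:-1]
    # (b) remove a final a, ä, e or i that follows a consonant
    r1 = word[p1:]
    if len(r1) >= 2 and r1[-1] in "aäei" and r1[-2] not in V1:
        word = word[:-1]
    # (c) oj, uj lose the j
    if word[p1:].endswith(("oj", "uj")):
        word = word[:-1]
    # (d) jo loses the o
    if word[p1:].endswith("jo"):
        word = word[:-1]
    # (e) whole word: skip trailing vowels, then undouble a consonant
    i = len(word)
    while i > 0 and word[i - 1] in V1:
        i -= 1
    if i >= 2 and word[i - 1] == word[i - 2]:
        word = word[:i - 1] + word[i:]
    return word


print(step6("eläkk", 2))       # double k at the end
print(step6("aatonaatto", 3))  # double t before a final vowel
```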

The full algorithm in Snowball

/* Finnish stemmer.

   Numbers in square brackets refer to the sections in
   Fred Karlsson, Finnish: An Essential Grammar. Routledge, 1999
   ISBN 0-415-20705-3

*/

routines (
           mark_regions
           R2
           particle_etc possessive
           LONG VI
           case_ending
           i_plural
           t_plural
           other_endings
           tidy
)

externals ( stem )

integers ( p1 p2 )
strings ( x )
booleans ( ending_removed )
groupings ( AEI V1 V2 particle_end )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a"   hex 'E4'
stringdef o"   hex 'F6'

define AEI 'a{a"}ei'
define V1 'aeiouy{a"}{o"}'
define V2 'aeiou{a"}{o"}'
define particle_end V1 + 'nt'

define mark_regions as (

    $p1 = limit
    $p2 = limit

    goto V1  gopast non-V1  setmark p1
    goto V1  gopast non-V1  setmark p2
)

backwardmode (

    define R2 as $p2 <= cursor

    define particle_etc as (
        setlimit tomark p1 for ([substring])
        among(
            'kin'
            'kaan' 'k{a"}{a"}n'
            'ko'   'k{o"}'
            'han'  'h{a"}n'
            'pa'   'p{a"}'    // Particles [91]
                (particle_end)
            'sti'             // Adverb [87]
                (R2)
        )
        delete
    )
    define possessive as (    // [36]
        setlimit tomark p1 for ([substring])
        among(
            'si'
                (not 'k' delete)  // take 'ksi' as the Comitative case
            'ni'
                (delete ['kse'] <- 'ksi') // kseni = ksi + ni
            'nsa' 'ns{a"}'
            'mme'
            'nne'
                (delete)
            /* Now for Vn possessives after case endings: [36] */
            'an'
                (among('ta' 'ssa' 'sta' 'lla' 'lta' 'na') delete)
            '{a"}n'
                (among('t{a"}' 'ss{a"}' 'st{a"}'
                       'll{a"}' 'lt{a"}' 'n{a"}') delete)
            'en'
                (among('lle' 'ine') delete)
        )
    )

    define LONG as
        among('aa' 'ee' 'ii' 'oo' 'uu' '{a"}{a"}' '{o"}{o"}')

    define VI as ('i' V2)

    define case_ending as (
        setlimit tomark p1 for ([substring])
        among(
            'han'    ('a')          //-.
            'hen'    ('e')          // |
            'hin'    ('i')          // |
            'hon'    ('o')          // |
            'h{a"}n' ('{a"}')       // Illative   [43]
            'h{o"}n' ('{o"}')       // |
            'siin'   VI             // |
            'seen'   LONG           //-'

            'den'    VI
            'tten'   VI             // Genitive plurals [34]
                     ()
            'n'                     // Genitive or Illative
                ( try ( LONG // Illative
                        or 'ie' // Genitive
                          and next ]
                      )
                  /* otherwise Genitive */
                )

            'a' '{a"}'              //-.
                     (V1 non-V1)    // |
            'tta' 'tt{a"}'          // Partitive  [32]
                     ('e')          // |
            'ta' 't{a"}'            //-'

            'ssa' 'ss{a"}'          // Inessive   [41]
            'sta' 'st{a"}'          // Elative    [42]

            'lla' 'll{a"}'          // Adessive   [44]
            'lta' 'lt{a"}'          // Ablative   [51]
            'lle'                   // Allative   [46]
            'na' 'n{a"}'            // Essive     [49]
            'ksi'                   // Translative[50]
            'ine'                   // Comitative [51]

            /* Abessive and Instructive are too rare for
               inclusion [51] */

        )
        delete
        set ending_removed
    )
    define other_endings as (
        setlimit tomark p2 for ([substring])
        among(
            'mpi' 'mpa' 'mp{a"}'
            'mmi' 'mma' 'mm{a"}'    // Comparative forms [85]
                (not 'po')          //-improves things
            'impi' 'impa' 'imp{a"}'
            'immi' 'imma' 'imm{a"}' // Superlative forms [86]
            'eja' 'ej{a"}'          // indicates agent [93.1B]
        )
        delete
    )
    define i_plural as (            // [26]
        setlimit tomark p1 for ([substring])
        among(
            'i'  'j'
        )
        delete
    )
    define t_plural as (            // [26]
        setlimit tomark p1 for (
            ['t'] test V1
            delete
        )
        setlimit tomark p2 for ([substring])
        among(
            'mma' (not 'po') //-mmat endings
            'imma'           //-immat endings
        )
        delete
    )
    define tidy as (
        setlimit tomark p1 for (
            do ( LONG and ([next] delete ) ) // undouble vowel
            do ( [AEI] non-V1 delete ) // remove trailing a, a", e, i
            do ( ['j'] 'o' or 'u' delete )
            do ( ['o'] 'j' delete )
        )
        goto non-V1 [next] -> x  x delete // undouble consonant
    )
)

define stem as (

    do mark_regions
    unset ending_removed
    backwards (
        do particle_etc
        do possessive
        do case_ending
        do other_endings
        (ending_removed do i_plural) or do t_plural
        do tidy
    )
)