Finnish stemming algorithm

Here is a sample of Finnish vocabulary, with the stemmed forms that will be generated by this algorithm:

word                 stem

edeltäjien       ⇒   edeltäj
edeltäjiensä     ⇒   edeltäjie
edeltäjiinsä     ⇒   edeltäj
edeltäjistään    ⇒   edeltäj
edeltäjiä        ⇒   edeltäj
edeltäjiään      ⇒   edeltäjiä
edeltäjä         ⇒   edeltäj
edeltäjälleen    ⇒   edeltäj
edeltäjän        ⇒   edeltäj
edeltäjäni       ⇒   edeltäj
edeltäjänsä      ⇒   edeltäj
edeltäjänä       ⇒   edeltäj
edeltäjässä      ⇒   edeltäj
edeltäjästä      ⇒   edeltäj
edeltäjästään    ⇒   edeltäj
edeltäjät        ⇒   edeltäj
edeltäjää        ⇒   edeltäj
edeltäjään       ⇒   edeltäj
edeltäjäänsä     ⇒   edeltäj
edeltäneelle     ⇒   edeltän
edeltäneellä     ⇒   edeltän
edeltäneeltä     ⇒   edeltän
edeltäneen       ⇒   edeltän
edeltäneenä      ⇒   edeltän
edeltäneeseen    ⇒   edeltän
edeltäneessä     ⇒   edeltän
edeltäneestä     ⇒   edeltän
edeltäneet       ⇒   edeltän
edeltäneiden     ⇒   edeltän
edeltäneissä     ⇒   edeltän
edeltäneitä      ⇒   edeltän
edeltänyt        ⇒   edeltäny
edeltänyttä      ⇒   edeltänyt
edeltävien       ⇒   edeltäv
edeltäviin       ⇒   edeltäv
edeltävinä       ⇒   edeltäv
edeltävissä      ⇒   edeltäv
edeltävä         ⇒   edeltäv
edeltävälle      ⇒   edeltäv
edeltävällä      ⇒   edeltäv

innostu          ⇒   innostu
innostua         ⇒   innostu
innostuessaan    ⇒   innostue
innostui         ⇒   innostui
innostuimme      ⇒   innostui
innostuin        ⇒   innostu
innostuisi       ⇒   innostui
innostuisivat    ⇒   innostuisiv
innostuivat      ⇒   innostuiv
innostukseen     ⇒   innostuks
innostuksella    ⇒   innostuks
innostuksen      ⇒   innostuks
innostuksensa    ⇒   innostuks
innostuksessa    ⇒   innostuks
innostuksessaan  ⇒   innostuks
innostuksesta    ⇒   innostuks
innostuksissaan  ⇒   innostuks
innostumaan      ⇒   innostum
innostuminen     ⇒   innostumin
innostun         ⇒   innostu
innostuneelle    ⇒   innostun
innostuneempia   ⇒   innostun
innostuneen      ⇒   innostun
innostuneena     ⇒   innostun
innostuneesta    ⇒   innostun
innostuneesti    ⇒   innostun
innostuneet      ⇒   innostun
innostuneiden    ⇒   innostun
innostuneiksi    ⇒   innostun
innostunein      ⇒   innostun
innostuneina     ⇒   innostun
innostuneissa    ⇒   innostun
innostuneisuus   ⇒   innostuneisuus
innostuneita     ⇒   innostun
innostunut       ⇒   innostunu
innostunutta     ⇒   innostunut
innostus         ⇒   innostus
innostusta       ⇒   innostu
innostustaan     ⇒   innostu
innostutaan      ⇒   innostu

Finnish is not an Indo-European language, but belongs to the Finno-Ugric group, which in turn belongs to the Uralic group. Distinctions between a-, i- and d-suffixes can be made in Finnish, but they are much less sharply separated than in an Indo-European language. The system of endings is extremely elaborate, but strictly defined, and applies equally to all nominals, that is, to nouns, adjectives and pronouns. Verb endings have a close similarity to nominal endings, which again makes Finnish very different from any Indo-European language.

More problematic than the endings themselves is the change that can be effected in a stem as a result of taking a particular ending. A stem typically has two forms, strong and weak, where one class of ending follows the strong form and the complementary class the weak. Normalising strong and weak forms after ending removal is not generally possible, although the common case where strong and weak forms differ only in the single or double form of a final consonant can be dealt with.

Finnish includes the following accented forms:

ä ö

The following letters are vowels:

a e i o u y ä ö

R1 and R2 are then defined in the usual way (see the note on R1 and R2).
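As an illustration, the region marking can be sketched in Python. This is a minimal sketch rather than the reference implementation: R1 is taken to start after the first non-vowel that follows a vowel, and R2 is found by applying the same rule again from the start of R1. The name mark_regions is borrowed from the Snowball source, the helper past_vowel_then_nonvowel is hypothetical, and the word is assumed to be lowercase with ä and ö stored as single precomposed characters.

```python
VOWELS = set("aeiouyäö")


def mark_regions(word):
    """Return (p1, p2), the start offsets of R1 and R2 in `word`."""

    def past_vowel_then_nonvowel(start):
        # Hypothetical helper: find the first vowel at or after `start`,
        # then return the offset just past the next non-vowel.
        # If either search runs off the end, the region is empty.
        i = start
        while i < len(word) and word[i] not in VOWELS:
            i += 1
        while i < len(word) and word[i] in VOWELS:
            i += 1
        return min(i + 1, len(word))

    p1 = past_vowel_then_nonvowel(0)
    p2 = past_vowel_then_nonvowel(p1)
    return p1, p2


if __name__ == "__main__":
    word = "edeltäjiensä"
    p1, p2 = mark_regions(word)
    # R1 is word[p1:], R2 is word[p2:]
    print(p1, p2, word[p1:], word[p2:])  # 2 4 eltäjiensä täjiensä
```

A suffix is "in R1" when it starts at or after offset p1, and "in R2" when it starts at or after p2.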

Do each of steps 1, 2, 3, 4, 5 and 6.

Step 1: particles etc

Search for the longest among the following suffixes in R1, and perform the action indicated

(a) kin kaan kään ko kö han hän pa pä: delete if preceded by n, t or a vowel
(b) sti: delete if in R2

(Of course, the n, t or vowel of 1(a) need not be in R1: only the suffix removed must be in R1. And similarly below.)
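Step 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name step1 is hypothetical, and p1 and p2 are the start offsets of R1 and R2, supplied by the caller. The longest suffix lying inside R1 is chosen first; if its condition fails, nothing is removed.

```python
VOWELS = set("aeiouyäö")
PARTICLES = ("kin", "kaan", "kään", "ko", "kö", "han", "hän", "pa", "pä")


def step1(word, p1, p2):
    """Step 1 sketch: strip a particle (a) or the adverb ending sti (b)."""
    # Only suffixes that start inside R1 are candidates.
    candidates = [s for s in PARTICLES + ("sti",)
                  if word.endswith(s) and len(word) - len(s) >= p1]
    if not candidates:
        return word
    suffix = max(candidates, key=len)
    cut = len(word) - len(suffix)
    if suffix == "sti":
        # (b) delete only if the suffix also lies inside R2
        return word[:cut] if cut >= p2 else word
    # (a) the preceding n, t or vowel may lie outside R1
    if cut > 0 and (word[cut - 1] in VOWELS or word[cut - 1] in "nt"):
        return word[:cut]
    return word


if __name__ == "__main__":
    print(step1("innostuneesti", 2, 5))  # innostunee
    print(step1("talokin", 3, 5))        # talo
```

Here talokin (talo + kin) is an illustrative input that does not come from the sample vocabulary.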

Step 2: possessives

Search for the longest among the following suffixes in R1, and perform the action indicated

si: delete if not preceded by k
ni: delete; if preceded by kse, replace with ksi
nsa nsä mme nne: delete
an: delete if preceded by one of ta ssa sta lla lta na
än: delete if preceded by one of tä ssä stä llä ltä nä
en: delete if preceded by one of lle ine
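The possessive rules can be sketched in Python in the same way. The function name step2 and the dictionary CASE_BEFORE are hypothetical, and p1 is the start offset of R1 supplied by the caller. As in step 1, only the suffix itself has to lie in R1; the letters tested before it may lie outside.

```python
SUFFIXES = ("si", "ni", "nsa", "nsä", "mme", "nne", "an", "än", "en")

# Case endings that must precede the Vn possessives (hypothetical name).
CASE_BEFORE = {
    "an": ("ta", "ssa", "sta", "lla", "lta", "na"),
    "än": ("tä", "ssä", "stä", "llä", "ltä", "nä"),
    "en": ("lle", "ine"),
}


def step2(word, p1):
    """Step 2 sketch: remove a possessive suffix lying inside R1."""
    found = [s for s in SUFFIXES
             if word.endswith(s) and len(word) - len(s) >= p1]
    if not found:
        return word
    suffix = max(found, key=len)
    stem = word[:len(word) - len(suffix)]
    if suffix == "si":
        # ksi is left alone so that step 3 can treat it as a case ending
        return word if stem.endswith("k") else stem
    if suffix == "ni":
        # kseni = ksi + ni, so restore ksi after removing ni
        return stem[:-3] + "ksi" if stem.endswith("kse") else stem
    if suffix in ("nsa", "nsä", "mme", "nne"):
        return stem
    # an, än, en: delete only after one of the listed case endings
    return stem if stem.endswith(CASE_BEFORE[suffix]) else word


if __name__ == "__main__":
    print(step2("edeltäjästään", 2))  # edeltäjästä
    print(step2("innostuimme", 2))    # innostui
    print(step2("innostukseni", 2))   # innostuksi
```

The input innostukseni is an illustrative form chosen to exercise the kse rule; it does not appear in the sample vocabulary.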

The remaining steps require a few definitions.

Define a v (vowel) as one of a e i o u y ä ö.
Define a V (restricted vowel) as one of a e i o u ä ö.
So Vi means a V followed by letter i.
Define LV (long vowel) as one of aa ee ii oo uu ää öö.
Define a c (consonant) as a character from ASCII a-z which isn't in v (originally this was "a character other than a v", but since 2018-04-11 we've changed this definition to prevent the stemmer from altering sequences of digits).
So cv means a c followed by a v.

Step 3: cases

Search for the longest among the following suffixes in R1, and perform the action indicated

hXn preceded by X, where X is a V other than u (a/han, e/hen etc): delete
siin den tten preceded by Vi: delete
seen preceded by LV: delete
a ä preceded by cv: delete
tta ttä preceded by e: delete
ta tä ssa ssä sta stä lla llä lta ltä lle na nä ksi ine: delete
n: delete, and if preceded by LV or ie, also delete the last vowel

So aarteisiin → aartei, the longest matching suffix being siin, preceded as it is by Vi. But adressiin → adressi. The longest matching suffix is not siin, because there is no preceding Vi, but n, and then the last vowel of the preceding LV is removed.
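The two examples can be reproduced with a short Python sketch of step 3. The function name step3 and the constant names are hypothetical, and p1 is the start offset of R1 supplied by the caller. One assumption is worth stating loudly: in the Snowball source the conditions on siin, den, tten and seen are attached as routine names (VI, LONG), so a failed condition falls back to a shorter suffix, as with adressiin. The sketch assumes the other conditions do not fall back, so that a failed test leaves the word unchanged.

```python
V1 = "aeiouyäö"            # v
V2 = "aeiouäö"             # V (restricted vowel)
CONSONANTS = "bcdfghjklmnpqrstvwxz"
LONG = ("aa", "ee", "ii", "oo", "uu", "ää", "öö")
ILLATIVE_H = ("han", "hen", "hin", "hon", "hän", "hön")
WITH_FALLBACK = ("siin", "den", "tten", "seen")
PLAIN = ("ta", "tä", "ssa", "ssä", "sta", "stä", "lla", "llä",
         "lta", "ltä", "lle", "na", "nä", "ksi", "ine")
ALL = ILLATIVE_H + WITH_FALLBACK + PLAIN + ("a", "ä", "tta", "ttä", "n")


def step3(word, p1):
    """Step 3 sketch: return (new_word, ending_removed)."""
    for suffix in sorted(ALL, key=len, reverse=True):
        cut = len(word) - len(suffix)
        if cut < p1 or not word.endswith(suffix):
            continue
        stem = word[:cut]
        if suffix in WITH_FALLBACK:
            if suffix == "seen":
                ok = stem.endswith(LONG)
            else:  # siin, den, tten need a preceding Vi
                ok = len(stem) >= 2 and stem[-1] == "i" and stem[-2] in V2
            if ok:
                return stem, True
            continue  # assumed behaviour: try the next longest suffix
        if suffix in ILLATIVE_H:
            ok = stem.endswith(suffix[1])      # hXn preceded by X
        elif suffix in ("a", "ä"):
            ok = (len(stem) >= 2 and stem[-1] in V1
                  and stem[-2] in CONSONANTS)  # preceded by cv
        elif suffix in ("tta", "ttä"):
            ok = stem.endswith("e")
        else:
            ok = True
            if suffix == "n" and stem.endswith(LONG + ("ie",)):
                stem = stem[:-1]               # also drop the last vowel
        return (stem, True) if ok else (word, False)
    return word, False


if __name__ == "__main__":
    print(step3("aarteisiin", 3))  # ('aartei', True)
    print(step3("adressiin", 2))   # ('adressi', True)
```

The second element of the result plays the role of the ending_removed flag that step 5 consults.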

Step 4: other endings

Search for the longest among the following suffixes in R2, and perform the action indicated

mpi mpa mpä mmi mma mmä: delete if not preceded by po
impi impa impä immi imma immä eja ejä: delete
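A Python sketch of step 4, with the hypothetical function name step4 and p2 as the caller-supplied start offset of R2. Only suffixes that lie wholly inside R2 are considered, so impi may be skipped in favour of mpi when only the shorter one fits.

```python
COMPARATIVE = ("mpi", "mpa", "mpä", "mmi", "mma", "mmä")
OTHER = ("impi", "impa", "impä", "immi", "imma", "immä", "eja", "ejä")


def step4(word, p2):
    """Step 4 sketch: remove comparative, superlative and agent endings."""
    found = [s for s in COMPARATIVE + OTHER
             if word.endswith(s) and len(word) - len(s) >= p2]
    if not found:
        return word
    suffix = max(found, key=len)
    stem = word[:len(word) - len(suffix)]
    if suffix in COMPARATIVE and stem.endswith("po"):
        return word  # "delete if not preceded by po"
    return stem


if __name__ == "__main__":
    # innostuneempia reaches step 4 as innostuneempi once step 3
    # has removed the final a
    print(step4("innostuneempi", 5))  # innostunee
    print(step4("atomipommi", 4))     # atomipommi
```

The word atomipommi is an illustrative input for the po exception and is not taken from the sample vocabulary.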

Step 5: plurals

If an ending was removed in step 3, delete a final i or j if in R1. Otherwise, delete a final t in R1 if it follows a vowel and, if a t is removed, then delete a final mma or imma in R2, unless the mma is preceded by po.
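The branching in step 5 can be sketched in Python as below. The function name step5 is hypothetical; p1 and p2 are the start offsets of R1 and R2, and ending_removed is the flag recording whether step 3 removed an ending, all supplied by the caller.

```python
V1 = "aeiouyäö"


def step5(word, p1, p2, ending_removed):
    """Step 5 sketch: remove plural markers."""
    if ending_removed:
        # i-plural: a final i or j inside R1
        if len(word) > p1 and word[-1] in "ij":
            return word[:-1]
        return word
    # t-plural: a final t inside R1 that follows a vowel
    if (len(word) > p1 and len(word) >= 2
            and word.endswith("t") and word[-2] in V1):
        word = word[:-1]
        # then a final imma or mma, if it lies inside R2
        for suffix in ("imma", "mma"):
            cut = len(word) - len(suffix)
            if word.endswith(suffix) and cut >= p2:
                if suffix == "mma" and word[:cut].endswith("po"):
                    return word
                return word[:cut]
    return word


if __name__ == "__main__":
    print(step5("edeltäji", 2, 4, True))          # edeltäj
    print(step5("edeltäjät", 2, 4, False))        # edeltäjä
    print(step5("innostuneimmat", 2, 5, False))   # innostune
```

The form innostuneimmat is an illustrative input for the imma rule; it is not in the sample vocabulary.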

Step 6: tidying up

Do in turn steps (a), (b), (c), (d), restricting all tests to the region R1.

a) If R1 ends LV delete the last letter
b) If R1 ends cX, c a consonant and X one of a ä e i, delete the last letter
c) If R1 ends oj or uj delete the last letter
d) If R1 ends jo delete the last letter

Do step (e), which is not restricted to R1.

e) If the word ends with a double consonant followed by zero or more vowels, remove the last consonant (so eläkk → eläk, aatonaatto → aatonaato)
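Step 6 can be sketched in Python as follows, with the hypothetical function name step6 and p1 as the caller-supplied start offset of R1. Each of (a) to (d) is applied to the result of the one before, and every letter tested in them is taken from word[p1:]; (e) looks at the whole word.

```python
V1 = "aeiouyäö"
CONSONANTS = "bcdfghjklmnpqrstvwxz"
LONG = ("aa", "ee", "ii", "oo", "uu", "ää", "öö")


def step6(word, p1):
    """Step 6 sketch: tidy up the end of the stem."""
    # (a) undouble a final long vowel
    if word[p1:].endswith(LONG):
        word = word[:-1]
    # (b) remove a trailing a, ä, e or i that follows a consonant
    r1 = word[p1:]
    if len(r1) >= 2 and r1[-1] in "aäei" and r1[-2] in CONSONANTS:
        word = word[:-1]
    # (c) oj or uj: remove the j
    if word[p1:].endswith(("oj", "uj")):
        word = word[:-1]
    # (d) jo: remove the o
    if word[p1:].endswith("jo"):
        word = word[:-1]
    # (e) skip any trailing vowels, then undouble a double consonant
    i = len(word)
    while i > 0 and word[i - 1] in V1:
        i -= 1
    if i >= 2 and word[i - 1] in CONSONANTS and word[i - 2] == word[i - 1]:
        word = word[:i - 1] + word[i:]
    return word


if __name__ == "__main__":
    print(step6("eläkk", 2))       # eläk
    print(step6("aatonaatto", 3))  # aatonaato
    print(step6("innostunee", 2))  # innostun
```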

The full algorithm in Snowball

/* Finnish stemmer.

   Numbers in square brackets refer to the sections in
   Fred Karlsson, Finnish: An Essential Grammar. Routledge, 1999
   ISBN 0-415-20705-3

*/

routines (
           mark_regions
           R2
           particle_etc possessive
           LONG VI
           case_ending
           i_plural
           t_plural
           other_endings
           tidy
)

externals ( stem )

integers ( p1 p2 )
strings ( x )
booleans ( ending_removed )
groupings ( AEI C V1 V2 particle_end )

stringescapes {}

/* special characters */

stringdef a"   '{U+00E4}'
stringdef o"   '{U+00F6}'

define AEI 'a{a"}ei'
define C 'bcdfghjklmnpqrstvwxz'
define V1 'aeiouy{a"}{o"}'
define V2 'aeiou{a"}{o"}'
define particle_end V1 + 'nt'

define mark_regions as (

    $p1 = limit
    $p2 = limit

    goto V1  gopast non-V1  setmark p1
    goto V1  gopast non-V1  setmark p2
)

backwardmode (

    define R2 as $p2 <= cursor

    define particle_etc as (
        setlimit tomark p1 for ([substring])
        among(
            'kin'
            'kaan' 'k{a"}{a"}n'
            'ko'   'k{o"}'
            'han'  'h{a"}n'
            'pa'   'p{a"}'    // Particles [91]
                (particle_end)
            'sti'             // Adverb [87]
                (R2)
        )
        delete
    )
    define possessive as (    // [36]
        setlimit tomark p1 for ([substring])
        among(
            'si'
                (not 'k' delete)  // take 'ksi' as the Translative case
            'ni'
                (delete ['kse'] <- 'ksi') // kseni = ksi + ni
            'nsa' 'ns{a"}'
            'mme'
            'nne'
                (delete)
            /* Now for Vn possessives after case endings: [36] */
            'an'
                (among('ta' 'ssa' 'sta' 'lla' 'lta' 'na') delete)
            '{a"}n'
                (among('t{a"}' 'ss{a"}' 'st{a"}'
                       'll{a"}' 'lt{a"}' 'n{a"}') delete)
            'en'
                (among('lle' 'ine') delete)
        )
    )

    define LONG as
        among('aa' 'ee' 'ii' 'oo' 'uu' '{a"}{a"}' '{o"}{o"}')

    define VI as ('i' V2)

    define case_ending as (
        setlimit tomark p1 for ([substring])
        among(
            'han'    ('a')          //-.
            'hen'    ('e')          // |
            'hin'    ('i')          // |
            'hon'    ('o')          // |
            'h{a"}n' ('{a"}')       // Illative   [43]
            'h{o"}n' ('{o"}')       // |
            'siin'   VI             // |
            'seen'   LONG           //-'

            'den'    VI
            'tten'   VI             // Genitive plurals [34]
                     ()
            'n'                     // Genitive or Illative
                ( try ( LONG // Illative
                        or 'ie' // Genitive
                          and next ]
                      )
                  /* otherwise Genitive */
                )

            'a' '{a"}'              //-.
                     (V1 C)         // |
            'tta' 'tt{a"}'          // Partitive  [32]
                     ('e')          // |
            'ta' 't{a"}'            //-'

            'ssa' 'ss{a"}'          // Inessive   [41]
            'sta' 'st{a"}'          // Elative    [42]

            'lla' 'll{a"}'          // Adessive   [44]
            'lta' 'lt{a"}'          // Ablative   [51]
            'lle'                   // Allative   [46]
            'na' 'n{a"}'            // Essive     [49]
            'ksi'                   // Translative[50]
            'ine'                   // Comitative [51]

            /* Abessive and Instructive are too rare for
               inclusion [51] */

        )
        delete
        set ending_removed
    )
    define other_endings as (
        setlimit tomark p2 for ([substring])
        among(
            'mpi' 'mpa' 'mp{a"}'
            'mmi' 'mma' 'mm{a"}'    // Comparative forms [85]
                (not 'po')          //-improves things
            'impi' 'impa' 'imp{a"}'
            'immi' 'imma' 'imm{a"}' // Superlative forms [86]
            'eja' 'ej{a"}'          // indicates agent [93.1B]
        )
        delete
    )
    define i_plural as (            // [26]
        setlimit tomark p1 for ([substring])
        among(
            'i'  'j'
        )
        delete
    )
    define t_plural as (            // [26]
        setlimit tomark p1 for (
            ['t'] test V1
            delete
        )
        setlimit tomark p2 for ([substring])
        among(
            'mma' (not 'po') //-mmat endings
            'imma'           //-immat endings
        )
        delete
    )
    define tidy as (
        setlimit tomark p1 for (
            do ( LONG and ([next] delete ) ) // undouble vowel
            do ( [AEI] C delete ) // remove trailing a, a", e, i
            do ( ['j'] 'o' or 'u' delete )
            do ( ['o'] 'j' delete )
        )
        goto non-V1 [C] -> x  x delete // undouble consonant
    )
)

define stem as (

    do mark_regions
    unset ending_removed
    backwards (
        do particle_etc
        do possessive
        do case_ending
        do other_endings
        (ending_removed do i_plural) or do t_plural
        do tidy
    )
)