Swedish stemming algorithm

Here is a sample of Swedish vocabulary, with the stemmed forms that will be generated by this algorithm:

word              stem              word              stem
jakt              jakt              klo               klo
jaktbössa         jaktböss          kloaken           kloak
jakten            jakt              klock             klock
jakthund          jakthund          klocka            klock
jaktkarl          jaktkarl          klockan           klockan
jaktkarlar        jaktkarl          klockans          klockan
jaktkarlarne      jaktkarl          klockare          klock
jaktkarlens       jaktkarl          klockaren         klock
jaktlöjtnant      jaktlöjtnant      klockarens        klock
jaktlöjtnanten    jaktlöjtnant      klockarfar        klockarf
jaktlöjtnantens   jaktlöjtnant      klockarn          klockarn
jalusi            jalusi            klockarsonen      klockarson
jalusien          jalusi            klockas           klock
jalusier          jalusi            klockkedjan       klockkedjan
jalusierna        jalusi            klocklikt         klocklik
jamaika           jamaik            klockor           klock
jamat             jam               klockorna         klock
jamrande          jamr              klockornas        klock
jamt              jamt              klockors          klockor
jande             jand              klockringning     klockringning
januari           januari           kloekornas        kloek
japanska          japansk           klok              klok
jaquette          jaquet            kloka             klok
jaquettekappa     jaquettekapp      klokare           klok
jargong           jargong           klokast           klok
jasmin            jasmin            klokaste          klok
jasminen          jasmin            kloke             klok
jasminer          jasmin            klokhet           klok
jasminhäck        jasminhäck        klokheten         klok
jaspis            jaspis            klokt             klokt
jaså              jaså              kloliknande       klolikn
javäl             javäl             klor              klor
jazzvindens       jazzvind          klorna            klorn
jcrn              jcrn              kloroform         kloroform
jcsus             jcsus             kloster           klost
je                je                klostergården     klostergård
jemföra           jemför            klosterlik        klosterlik
jemföras          jemför            klot              klot
jemförelse        jemför            klotb             klotb
jemförelser       jemför            klotrund          klotrund

The stemming algorithm

The Swedish alphabet includes the following additional letters:

ä   å   ö

The following letters are vowels:

a   e   i   o   u   y   ä   å   ö

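R2 is not used. R1 is defined in the same way as in the German stemmer: it is the region after the first non-vowel following a vowel, adjusted so that the region before it contains at least three letters. (See the note on R1 and R2.)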
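As an illustration, the R1 computation can be sketched in Python. This is a minimal sketch that mirrors the mark_regions routine in the Snowball source further down, not part of the reference implementation; the names VOWELS and r1_start are hypothetical, and the input is assumed to be lower-case with precomposed ä, å and ö.

```python
# Hypothetical helper names; the vowel set is the one listed above.
VOWELS = set("aeiouyäåö")


def r1_start(word: str) -> int:
    """Return the index at which R1 begins (len(word) when R1 is empty)."""
    n = len(word)
    # Words shorter than three letters have an empty R1.
    if n < 3:
        return n
    i = 0
    # Advance to the first vowel.
    while i < n and word[i] not in VOWELS:
        i += 1
    # Advance to the first non-vowel after that vowel.
    while i < n and word[i] in VOWELS:
        i += 1
    # No vowel, or no non-vowel after it: R1 is empty.
    if i >= n:
        return n
    # R1 starts after that non-vowel, but never before index 3.
    return max(i + 1, 3)


if __name__ == "__main__":
    for w in ("jakt", "klockan", "je", "klo"):
        start = r1_start(w)
        print(w, start, repr(w[start:]))
```

For klockan the function returns 4, so R1 is "kan"; for jakt it returns 3, so R1 is "t".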

Define a valid s-ending as one of

b   c   d   f   g   h   j   k   l   m   n   o   p   r   t   v   y

Define a valid öst-ending as one of

i   k   l   n   p   r   t   u   v

Do each of steps 1, 2 and 3.

Step 1:

Search for the longest among the following suffixes in R1, and perform the action indicated.
(a) a   arna   erna   heterna   orna   ad   e   ade   ande   arne   are   aste   en   anden   aren   heten   ern   ar   er   heter   or   as   arnas   ernas   ornas   es   ades   andes   ens   arens   hetens   erns   at   andet   het   ast
delete
(b) s
delete if preceded by a valid s-ending
(Note that only the suffix needs to be in R1, the letter of the valid s-ending is not required to be.)
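Step 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the reference implementation: the function name step1 and the constant names are hypothetical, and the caller is assumed to pass r1, the index at which R1 starts.

```python
# Letters that may precede a removable final "s" (the valid s-endings).
S_ENDING = set("bcdfghjklmnoprtvy")

# Group (a): suffixes that are deleted unconditionally.
DELETE_SUFFIXES = (
    "a", "arna", "erna", "heterna", "orna", "ad", "e", "ade", "ande", "arne",
    "are", "aste", "en", "anden", "aren", "heten", "ern", "ar", "er", "heter",
    "or", "as", "arnas", "ernas", "ornas", "es", "ades", "andes", "ens",
    "arens", "hetens", "erns", "at", "andet", "het", "ast",
)


def step1(word: str, r1: int) -> str:
    """Apply step 1 to word, where r1 is the index at which R1 starts."""
    region = word[r1:]
    # The suffix itself must lie entirely inside R1.
    matches = [s for s in DELETE_SUFFIXES + ("s",) if region.endswith(s)]
    if not matches:
        return word
    longest = max(matches, key=len)
    if longest != "s":
        return word[: -len(longest)]
    # Group (b): the letter before "s" may lie outside R1.
    if len(word) > 1 and word[-2] in S_ENDING:
        return word[:-1]
    return word


if __name__ == "__main__":
    print(step1("jaktkarlarne", 3))  # jaktkarl
    print(step1("klockans", 4))      # klockan
    print(step1("jaspis", 3))        # jaspis
```

Only the longest match is considered: if it is s and the preceding letter is not a valid s-ending, nothing is removed and no shorter suffix is tried.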

Step 2:

Search for one of the following suffixes in R1, and if found delete the last letter.
dd   gd   nn   dt   gt   kt   tt
(For example, friskt → frisk, fröknarnn → fröknarn)
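A short Python sketch of step 2, given for illustration only. The name step2 is hypothetical, and r1 is assumed to be the index at which R1 starts, supplied by the caller.

```python
# The consonant pairs listed in step 2.
CONSONANT_PAIRS = ("dd", "gd", "nn", "dt", "gt", "kt", "tt")


def step2(word: str, r1: int) -> str:
    """Drop the last letter if R1 ends in one of the listed pairs."""
    # Both letters of the pair must lie inside R1.
    if word[r1:].endswith(CONSONANT_PAIRS):
        return word[:-1]
    return word


if __name__ == "__main__":
    print(step2("friskt", 4))     # frisk
    print(step2("fröknarnn", 4))  # fröknarn
    print(step2("klokt", 4))      # klokt
```

The last call leaves klokt unchanged because R1 is only "t", so the pair kt is not wholly inside R1. This agrees with the vocabulary sample, where klokt stems to klokt.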

Step 3:

Search for the longest among the following suffixes in R1, and perform the action indicated.
lig   ig   els
delete
öst
replace with ös if preceded by a valid öst-ending
fullt
replace with full
(The letter of the valid öst-ending is not necessarily in R1. Prior to Snowball 2.3.0, öst-ending was effectively just l and was required to be in R1.)
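Step 3 can be sketched in Python in the same way. This is a hedged sketch following the behaviour described for Snowball 2.3.0 and later, not the reference implementation. The name step3 is hypothetical, r1 is assumed to be the index at which R1 starts, and the words hopplöst, sorgfullt and vänlig are illustrative inputs chosen here rather than taken from the vocabulary sample.

```python
# Letters that may precede "öst" (the valid öst-endings).
OST_ENDING = set("iklnprtuv")

STEP3_SUFFIXES = ("lig", "ig", "els", "öst", "fullt")


def step3(word: str, r1: int) -> str:
    """Apply step 3 to word, where r1 is the index at which R1 starts."""
    region = word[r1:]
    # The suffix itself must lie entirely inside R1.
    matches = [s for s in STEP3_SUFFIXES if region.endswith(s)]
    if not matches:
        return word
    longest = max(matches, key=len)
    if longest in ("lig", "ig", "els"):
        return word[: -len(longest)]
    if longest == "fullt":
        # fullt -> full
        return word[:-1]
    # öst -> ös; the preceding letter need not be inside R1.
    if len(word) > 3 and word[-4] in OST_ENDING:
        return word[:-1]
    return word


if __name__ == "__main__":
    print(step3("jemförels", 3))  # jemför
    print(step3("hopplöst", 3))   # hopplös
    print(step3("sorgfullt", 3))  # sorgfull
```

The first call shows how jemförelse reaches jemför in the vocabulary sample: step 1 removes the final e, and step 3 then removes els.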

The same algorithm in Snowball

routines (
           mark_regions
           main_suffix
           consonant_pair
           other_suffix
)

externals ( stem )

integers ( p1 x )

groupings ( v s_ending ost_ending )

stringescapes {}

/* special characters */

stringdef a"   '{U+00E4}'
stringdef ao   '{U+00E5}'
stringdef o"   '{U+00F6}'

define v 'aeiouy{a"}{ao}{o"}'

define s_ending  'bcdfghjklmnoprtvy'

define ost_ending 'iklnprtuv'

define mark_regions as (

    $p1 = limit
    test ( hop 3 setmark x )
    goto v gopast non-v  setmark p1
    try ( $p1 < x  $p1 = x )
)

backwardmode (

    define main_suffix as (
        setlimit tomark p1 for ([substring])
        among(

            'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
            'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
            'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
            'hetens' 'erns' 'at' 'andet' 'het' 'ast'
                (delete)
            's'
                (s_ending delete)
        )
    )

    define consonant_pair as setlimit tomark p1 for (
        among('dd' 'gd' 'nn' 'dt' 'gt' 'kt' 'tt')
        and ([next] delete)
    )

    define other_suffix as (
        setlimit tomark p1 for ([substring])
        among(
            'lig' 'ig' 'els' (delete)
            '{o"}st'         (ost_ending <-'{o"}s')
            'fullt'          (<-'full')
        )
    )
)

define stem as (

    do mark_regions
    backwards (
        do main_suffix
        do consonant_pair
        do other_suffix
    )
)