Swedish stemming algorithm

Here is a sample of Swedish vocabulary, with the stemmed forms that will be generated by this algorithm:

word             stem          word           stem
jakt             jakt          klo            klo
jaktbössa        jaktböss      kloaken        kloak
jakten           jakt          klock          klock
jakthund         jakthund      klocka         klock
jaktkarl         jaktkarl      klockan        klockan
jaktkarlar       jaktkarl      klockans       klockan
jaktkarlarne     jaktkarl      klockare       klock
jaktkarlens      jaktkarl      klockaren      klock
jaktlöjtnant     jaktlöjtnant  klockarens     klock
jaktlöjtnanten   jaktlöjtnant  klockarfar     klockarf
jaktlöjtnantens  jaktlöjtnant  klockarn       klockarn
jalusi           jalusi        klockarsonen   klockarson
jalusien         jalusi        klockas        klock
jalusier         jalusi        klockkedjan    klockkedjan
jalusierna       jalusi        klocklikt      klocklik
jamaika          jamaik        klockor        klock
jamat            jam           klockorna      klock
jamrande         jamr          klockornas     klock
jamt             jamt          klockors       klockor
jande            jand          klockringning  klockringning
januari          januari       kloekornas     kloek
japanska         japansk       klok           klok
jaquette         jaquet        kloka          klok
jaquettekappa    jaquettekapp  klokare        klok
jargong          jargong       klokast        klok
jasmin           jasmin        klokaste       klok
jasminen         jasmin        kloke          klok
jasminer         jasmin        klokhet        klok
jasminhäck       jasminhäck    klokheten      klok
jaspis           jaspis        klokt          klokt
jaså             jaså          kloliknande    klolikn
javäl            javäl         klor           klor
jazzvindens      jazzvind      klorna         klorn
jcrn             jcrn          kloroform      kloroform
jcsus            jcsus         kloster        klost
je               je            klostergården  klostergård
jemföra          jemför        klosterlik     klosterlik
jemföras         jemför        klot           klot
jemförelse       jemför        klotb          klotb
jemförelser      jemför        klotrund       klotrund

The stemming algorithm

The Swedish alphabet includes the following additional letters:

ä å ö

The following letters are vowels:

a e i o u y ä å ö

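R2 is not used. R1 is defined in the same way as in the German stemmer: it is the region after the first non-vowel following a vowel, adjusted where necessary so that the region before it contains at least 3 letters. (See the note on R1 and R2.)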
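
As a rough illustration of how R1 can be located, here is a minimal Python sketch. The name r1_start is hypothetical and not part of Snowball; the logic follows the mark_regions routine in the Snowball listing at the end of this page, and it assumes the input word is already lowercase.

```python
VOWELS = set("aeiouyäåö")


def r1_start(word: str) -> int:
    """Return the index where R1 begins; len(word) means R1 is empty."""
    if len(word) < 3:
        # Too short for the 3-letter adjustment, so R1 is empty.
        return len(word)
    p1 = len(word)
    for i in range(1, len(word)):
        # Look for the first non-vowel that directly follows a vowel.
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            p1 = i + 1
            break
    # The region before R1 must contain at least 3 letters.
    return max(p1, 3)


for w in ("klocka", "valet", "klo"):
    print(w, repr(w[r1_start(w):]))
```

Under these assumptions R1 is ka for klocka and et for valet, while klo has an empty R1.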

Define a valid s-ending as one of

b c d f g h j k l m n o p r t v y

Define a valid öst-ending as one of

i k l n p r t u v

Define a valid et-ending as at least one letter followed by a vowel followed by a non-vowel, which does not have one of the following as a suffix

h iet uit fab cit dit alit ilit mit nit pit rit sit tit ivit kvit xit kom rak pak stak
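
The et-ending test can be sketched in Python as follows. This is an illustrative reading of the definition above, not the reference implementation; the function name has_valid_et_ending is an assumption, and its argument is the part of the word that precedes the et suffix.

```python
VOWELS = set("aeiouyäåö")

# Endings that block removal of a following "et" (copied from the list above).
ET_EXCLUSIONS = (
    "h", "iet", "uit", "fab", "cit", "dit", "alit", "ilit", "mit", "nit",
    "pit", "rit", "sit", "tit", "ivit", "kvit", "xit", "kom", "rak", "pak",
    "stak",
)


def has_valid_et_ending(prefix: str) -> bool:
    """True if prefix is: 1+ letters, a vowel, a non-vowel, and not excluded."""
    if len(prefix) < 3:
        return False
    if prefix[-1] in VOWELS or prefix[-2] not in VOWELS:
        return False
    return not prefix.endswith(ET_EXCLUSIONS)


print(has_valid_et_ending("val"))   # the part of "valet" before "et"
print(has_valid_et_ending("äppl"))  # the part of "äpplet" before "et"
```

With this reading, val qualifies while äppl does not, because its final l follows a non-vowel.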

Do each of steps 1, 2 and 3.

Step 1:

Search for the longest among the following suffixes in R1, and perform the action indicated.

(a) a arna erna heterna orna ad e ade ande arne are aste en anden aren heten ern ar er heter or as arnas ernas ornas es ades andes ens arens hetens erns at andet het ast: delete
(b) s: if preceded by et and that is preceded by a valid et-ending remove both s and et, otherwise delete if preceded by a valid s-ending
(c) et: delete if preceded by a valid et-ending

(Note that only the suffix needs to be in R1, the letter(s) of the valid s-ending or et-ending are not required to be.)
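
Step 1 might be sketched in Python along these lines. The function names step1 and valid_et_ending are hypothetical, and p1 (the index at which R1 starts) is assumed to be supplied by the caller. Only the longest suffix that lies wholly inside R1 is considered; if its condition fails, the word is left unchanged.

```python
VOWELS = set("aeiouyäåö")
S_ENDING = set("bcdfghjklmnoprtvy")
ET_EXCLUSIONS = (
    "h", "iet", "uit", "fab", "cit", "dit", "alit", "ilit", "mit", "nit",
    "pit", "rit", "sit", "tit", "ivit", "kvit", "xit", "kom", "rak", "pak",
    "stak",
)
DELETE_SUFFIXES = (
    "a arna erna heterna orna ad e ade ande arne are aste en anden aren "
    "heten ern ar er heter or as arnas ernas ornas es ades andes ens arens "
    "hetens erns at andet het ast"
).split()


def valid_et_ending(prefix: str) -> bool:
    """At least one letter, then a vowel, then a non-vowel, and not excluded."""
    if len(prefix) < 3 or prefix[-1] in VOWELS or prefix[-2] not in VOWELS:
        return False
    return not prefix.endswith(ET_EXCLUSIONS)


def step1(word: str, p1: int) -> str:
    r1 = word[p1:]
    matches = [s for s in DELETE_SUFFIXES + ["s", "et"] if r1.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)  # longest suffix found in R1
    stem = word[: -len(suffix)]
    if suffix == "s":
        # (b) remove "ets" after a valid et-ending, else "s" after a valid s-ending
        if stem.endswith("et") and valid_et_ending(stem[:-2]):
            return stem[:-2]
        if stem and stem[-1] in S_ENDING:
            return stem
        return word
    if suffix == "et":
        # (c) remove "et" only after a valid et-ending
        return stem if valid_et_ending(stem) else word
    return stem  # (a) unconditional delete


for w, p1 in (("husen", 3), ("valet", 3), ("flickan", 4), ("äpplet", 3)):
    print(w, "->", step1(w, p1))
```

The preceding letters are checked against the whole word rather than R1, in line with the note above.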

Step 2:

Search for one of the following suffixes in R1, and if found delete the last letter.

dd gd nn dt gt kt tt

(For example, friskt → frisk, fröknarnn → fröknarn)
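
A minimal Python sketch of Step 2, assuming p1 is the index at which R1 starts (the name step2 is illustrative only):

```python
CONSONANT_PAIRS = ("dd", "gd", "nn", "dt", "gt", "kt", "tt")


def step2(word: str, p1: int) -> str:
    """Drop the last letter if R1 ends with one of the listed pairs."""
    if word[p1:].endswith(CONSONANT_PAIRS):
        return word[:-1]
    return word


print(step2("friskt", 4))     # R1 is "kt"
print(step2("fröknarnn", 4))  # R1 is "narnn"
print(step2("klokt", 4))      # R1 is "t", so the pair "kt" is not inside R1
```

The last call shows why klokt keeps its t in the vocabulary sample: the whole pair has to lie inside R1.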

Step 3:

Search for the longest among the following suffixes in R1, and perform the action indicated.

lig ig els: delete
öst: replace with ös if preceded by a valid öst-ending
fullt: replace with full

(The letter of the valid öst-ending is not necessarily in R1. Prior to Snowball 3.0.0, öst-ending was effectively just l and was required to be in R1.)
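
Step 3 can be sketched in the same style. Again this is a hedged illustration rather than the reference implementation: step3 is a hypothetical name, p1 is assumed to be the start of R1, and the example words in the test are chosen only to exercise each rule.

```python
OST_ENDING = set("iklnprtuv")
STEP3_SUFFIXES = ("lig", "ig", "els", "öst", "fullt")


def step3(word: str, p1: int) -> str:
    r1 = word[p1:]
    matches = [s for s in STEP3_SUFFIXES if r1.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)  # longest suffix found in R1
    stem = word[: -len(suffix)]
    if suffix == "öst":
        # Replace with "ös" only after a valid öst-ending letter,
        # which itself does not have to be inside R1.
        if stem and stem[-1] in OST_ENDING:
            return stem + "ös"
        return word
    if suffix == "fullt":
        return stem + "full"
    return stem  # lig, ig, els: delete


for w in ("vänlig", "nervöst", "hoppfullt"):
    print(w, "->", step3(w, 3))
```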

Design Notes

Swedish has a noun ending corresponding to the definite article ("the" in English). This occurs very commonly but cannot always be removed with certainty. Currently the algorithm removes the en form, and the et form in some cases, but not the t or n forms:

husen    →  hus
valet    →  val
flickan  →  flickan
äpplet   →  äpplet

History of functional changes to the algorithm

Snowball 3.0.0: Change öst suffix to ös in more situations.
Snowball 3.0.0: New rule to remove some et suffixes.

The same algorithm in Snowball

routines (
           et_condition
           mark_regions
           main_suffix
           consonant_pair
           other_suffix
)

externals ( stem )

integers ( p1 x )

groupings ( v s_ending ost_ending )

stringescapes {}

/* special characters */

stringdef a"   '{U+00E4}'
stringdef ao   '{U+00E5}'
stringdef o"   '{U+00F6}'

define v 'aeiouy{a"}{ao}{o"}'

define s_ending  'bcdfghjklmnoprtvy'

define ost_ending 'iklnprtuv'

define mark_regions as (

    $p1 = limit
    test ( hop 3 setmark x )
    gopast v  gopast non-v  setmark p1
    try ( $p1 < x  $p1 = x )
)

backwardmode (

    define et_condition as (
        (non-v v not atlimit)
        and not among (
            // frihet, nyhet, råhet, trohet
            'h'
            // societet
            'iet'
            // annuitet, kontinuitet
            'uit'
            // alfabet
            'fab'
            // autenticitet, elektricitet, kapacitet, metallicitet, publicitet
            'cit'
            // graviditet, likviditet, rigiditet
            'dit'
            // neutralitet, rivalitet, sexualitet
            'alit'
            // flexibilitet, instabilitet, kompatibilitet, mobilitet, variabilitet
            'ilit'
            // anonymitet, intimitet, legitimitet
            'mit'
            // kommunitet, maskulinitet, modernitet, spontanitet, suveränitet
            'nit'
            // epitet, serendipitet
            'pit'
            // auktoritet, integritet, majoritet, popularitet, prioritet
            'rit'
            // densitet, generositet, intensitet, luminositet, viskositet
            'sit'
            // identitet, kvantitet
            'tit'
            // aggressivitet, positivitet
            'ivit'
            // antikvitet, oblikvitet
            'kvit'
            // komplexitet
            'xit'
            // komet
            'kom'
            // raket
            'rak'
            // paket
            'pak'
            // staket
            'stak'
        )
    )

    define main_suffix as (
        setlimit tomark p1 for ([substring])
        among(

            'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
            'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
            'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
            'hetens' 'erns' 'at' 'andet' 'het' 'ast'
                (delete)
            's'
                ( ('et' et_condition ]) or s_ending  delete )
            'et'
                ( et_condition delete )
        )
    )

    define consonant_pair as setlimit tomark p1 for (
        among('dd' 'gd' 'nn' 'dt' 'gt' 'kt' 'tt')
        and ([next] delete)
    )

    define other_suffix as (
        setlimit tomark p1 for ([substring])
        among(
            'lig' 'ig' 'els' (delete)
            '{o"}st'         (ost_ending <-'{o"}s')
            'fullt'          (<-'full')
        )
    )
)

define stem as (

    do mark_regions
    backwards (
        do main_suffix
        do consonant_pair
        do other_suffix
    )
)