Norwegian stemming algorithm

Here is a sample of Norwegian vocabulary, with the stemmed forms that will be generated by this algorithm:

word                    stem

havnedistrikt         ⇒ havnedistrikt
havnedistriktene      ⇒ havnedistrikt
havnedistrikter       ⇒ havnedistrikt
havnedistriktet       ⇒ havnedistrikt
havnedistriktets      ⇒ havnedistrikt
havnedrift            ⇒ havnedrift
havnedriften          ⇒ havnedrift
havneeffektivitet     ⇒ havneeffektivit
havneeier             ⇒ havneei
havneeiere            ⇒ havneeier
havneenheter          ⇒ havneen
havneforbund          ⇒ havneforbund
havneforbundets       ⇒ havneforbund
havneformål           ⇒ havneformål
havneforvaltningen    ⇒ havneforvaltning
havnefunksjonene      ⇒ havnefunksjon
havnefunksjoner       ⇒ havnefunksjon
havnefylkene          ⇒ havnefylk
havnefylker           ⇒ havnefylk
havnehagen            ⇒ havnehag
havneinfrastrukturen  ⇒ havneinfrastruktur
havneinnretningene    ⇒ havneinnretning
havneinnretninger     ⇒ havneinnretning
havneinteresser       ⇒ havneinteress
havnekapasitet        ⇒ havnekapasit
havnekassa            ⇒ havnekass
havnekasse            ⇒ havnekass
havnekassemidler      ⇒ havnekassemidl
havnekassen           ⇒ havnekass
havnekassene          ⇒ havnekass
havnekassens          ⇒ havnekass
havnelokalisering     ⇒ havnelokalisering
havneloven            ⇒ havn
havnelovens           ⇒ havn
havneløsning          ⇒ havneløsning
havneløsningene       ⇒ havneløsning
havneløsninger        ⇒ havneløsning
havnemessig           ⇒ havnemess
havnemyndighetene     ⇒ havnemynd
havnemyndigheter      ⇒ havnemynd

opning                ⇒ opning
opninga               ⇒ opning
opningsbalanse        ⇒ opningsbalans
opningsbalansen       ⇒ opningsbalans
opp                   ⇒ opp
oppad                 ⇒ oppad
opparbeide            ⇒ opparbeid
opparbeidede          ⇒ opparbeid
opparbeidelse         ⇒ opparbeid
opparbeider           ⇒ opparbeid
opparbeides           ⇒ opparbeid
opparbeidet           ⇒ opparbeid
opparbeiding          ⇒ opparbeiding
oppattbygging         ⇒ oppattbygging
oppbevarer            ⇒ oppbevar
oppbevaring           ⇒ oppbevaring
oppblåst              ⇒ oppblåst
oppblåste             ⇒ oppblåst
oppbrente             ⇒ oppbrent
oppbygd               ⇒ oppbygd
oppbygde              ⇒ oppbygd
oppbygget             ⇒ oppbygg
oppbygging            ⇒ oppbygging
oppbygginga           ⇒ oppbygging
oppbyggingen          ⇒ oppbygging
oppdage               ⇒ oppdag
oppdager              ⇒ oppdag
oppdaterte            ⇒ oppdater
oppdeling             ⇒ oppdeling
oppdelingen           ⇒ oppdeling
oppdelt               ⇒ oppdelt
oppdrag               ⇒ oppdrag
oppdraget             ⇒ oppdrag
oppdragsavtale        ⇒ oppdragsavtal
oppdragsgivere        ⇒ oppdragsgiver
oppdragstakaren       ⇒ oppdragstakar
oppe                  ⇒ opp
oppebærer             ⇒ oppebær
oppfarende            ⇒ oppfar
oppfatning            ⇒ oppfatning

The stemming algorithm

The Norwegian alphabet includes the following additional letters:

æ å ø

The following letters are vowels:

a e ê i o ò ó ô u y æ å ø

R2 is not used. R1 is set up by the following steps:

If the word contains an apostrophe, R1 is set to start after the first apostrophe.
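Otherwise, R1 is set in the standard way (see note): it is the region after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.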
In either case, R1 is then adjusted so that the region before it contains at least 3 characters.
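
The three rules above can be sketched in Python as follows. This is only an illustration, not the Snowball implementation: the name r1_start is hypothetical, R1 is represented by the index at which it begins, and clamping that index to the word length for words shorter than 3 characters is an assumption of this sketch.

```python
VOWELS = set("aeêioòóôuyæåø")


def r1_start(word: str) -> int:
    """Return the index where R1 begins; len(word) means R1 is empty."""
    apostrophe = word.find("'")
    if apostrophe != -1:
        # R1 starts after the first apostrophe.
        start = apostrophe + 1
    else:
        # Standard R1: after the first non-vowel that follows a vowel.
        start = len(word)
        for i in range(1, len(word)):
            if word[i - 1] in VOWELS and word[i] not in VOWELS:
                start = i + 1
                break
    # Keep at least 3 characters before R1 (clamped for very short words).
    return min(len(word), max(start, 3))


for w in ("havnedistriktene", "oppe", "skriver", "cd'en"):
    print(w, "-> R1 =", repr(w[r1_start(w):]))
```

For oppe the standard rule alone would give R1 = "pe", and the 3-character adjustment shrinks it to "e". For cd'en, which contains no vowel before the apostrophe, the apostrophe rule gives R1 = "en".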

Define a valid s-ending as one of

b c d f g h j l m n o p t v y z,
or r not preceded by e,
or k not preceded by a vowel.
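
As a rough illustration, the valid s-ending test could be written as a Python predicate like the one below. The function name is_valid_s_ending is hypothetical, and the argument is assumed to be the word with its final s already removed. The sample inputs are chosen only to exercise each branch.

```python
VOWELS = set("aeêioòóôuyæåø")
S_ENDING = set("bcdfghjlmnoptvyz")


def is_valid_s_ending(stem: str) -> bool:
    """Check the letters that would precede a final s (stem excludes the s)."""
    if not stem:
        return False
    last, before = stem[-1], stem[-2:-1]  # before is "" for a 1-letter stem
    if last in S_ENDING:
        return True
    if last == "r":
        # r counts unless it is preceded by e.
        return before != "e"
    if last == "k":
        # k counts only when preceded by a non-vowel.
        return before != "" and before not in VOWELS
    return False


for stem in ("forbund", "motor", "lærer", "verk", "bruk"):
    print(stem + "s", is_valid_s_ending(stem))
```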

Do each of steps 1, 2, 3 and 4.

Step 1:

Search for the longest among the following suffixes in R1, and perform the action indicated.

(a) a e ede ande ende ane ene hetene en heten ar er heter as es edes endes enes hetenes ens hetens ets et het ast

delete

(b) ers

check the suffix preceding ers and perform the action indicated for the first match found (note that the suffixes in (i) serve as exceptions to some suffixes in (ii)):

(i) giv hav skap: delete ers suffix
(ii) amm ast ind kap kk lt nk omm pp v øst: do nothing
(iii) if none of these suffixes are present: delete ers suffix

(c) s

delete if preceded by a valid s-ending

(d) erte ert

replace with er

(Note that in (c) only the suffix s needs to be in R1; the letter of the valid s-ending is not required to be.)
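
Step 1 can be sketched in Python roughly as follows. This is a hedged illustration rather than the reference code: the function name step1 and the constant names are hypothetical, and r1 is assumed to be the start index of R1 as already computed by the caller. Only the suffix has to lie inside R1; the letters before it are checked against the whole word.

```python
VOWELS = set("aeêioòóôuyæåø")
S_ENDING = set("bcdfghjlmnoptvyz")

DELETE = ("a e ede ande ende ane ene hetene en heten ar er heter as es edes "
          "endes enes hetenes ens hetens ets et het ast").split()
ERS_DELETE = ("giv", "hav", "skap")            # list (i)
ERS_KEEP = tuple("amm ast ind kap kk lt nk omm pp v øst".split())  # list (ii)


def step1(word: str, r1: int) -> str:
    region = word[r1:]
    candidates = DELETE + ["ers", "s", "erte", "ert"]
    # The longest suffix that lies wholly inside R1 wins.
    suffix = max((s for s in candidates if region.endswith(s)),
                 key=len, default=None)
    if suffix is None:
        return word
    stem = word[:-len(suffix)]
    if suffix in ("erte", "ert"):              # (d) replace with er
        return stem + "er"
    if suffix == "ers":                        # (b) exceptions checked first
        if stem.endswith(ERS_DELETE):
            return stem
        if stem.endswith(ERS_KEEP):
            return word
        return stem
    if suffix == "s":                          # (c) needs a valid s-ending
        last, before = stem[-1:], stem[-2:-1]
        valid = (last in S_ENDING
                 or (last == "r" and before != "e")
                 or (last == "k" and before != "" and before not in VOWELS))
        return stem if valid else word
    return stem                                # (a) plain deletion


# The second element of each pair is the R1 start index, worked out by hand.
for w, r1 in (("opparbeidede", 3), ("oppdaterte", 3), ("arbeidsgivers", 3),
              ("slippers", 4), ("forbunds", 3), ("bruks", 4)):
    print(w, "->", step1(w, r1))
```

Checking list (i) before list (ii) gives the same result as a longest-match search here, because each exception in (i) is longer than the entry in (ii) that it overrides (skap over kap, giv and hav over v).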

Step 2:

If the word ends dt or vt in R1, delete the t.

(For example, meldt → meld, operativt → operativ)
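
A minimal Python sketch of this step, assuming r1 is the start index of R1 supplied by the caller (the name step2 is hypothetical):

```python
def step2(word: str, r1: int) -> str:
    """Drop the final t when dt or vt lies wholly inside R1."""
    if word[r1:].endswith(("dt", "vt")):
        return word[:-1]
    return word


print(step2("meldt", 3))      # R1 is "dt"
print(step2("operativt", 3))  # R1 is "rativt"
print(step2("godt", 3))       # R1 is only "t", so nothing is removed
```

The last call shows why the R1 condition matters: in godt the d falls outside R1, so under this reading of the rule the word is left alone.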

Step 3:

Search for the longest among the following suffixes in R1, and if found, delete.

leg eleg ig eig lig elig els lov elov slov hetslov
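
The longest-match deletion in this step might look like the following Python sketch. The names step3 and STEP3_SUFFIXES are hypothetical, r1 is again assumed to be the start index of R1, and the inputs are forms as they stand after steps 1 and 2.

```python
STEP3_SUFFIXES = "leg eleg ig eig lig elig els lov elov slov hetslov".split()


def step3(word: str, r1: int) -> str:
    """Delete the longest listed suffix that lies wholly inside R1."""
    region = word[r1:]
    suffix = max((s for s in STEP3_SUFFIXES if region.endswith(s)),
                 key=len, default="")
    return word[:len(word) - len(suffix)]


print(step3("havnemessig", 3))   # ig is removed
print(step3("havnelov", 3))      # elov beats lov
print(step3("opparbeidels", 3))  # els is removed
```

Here havnelov is what step 1 leaves of havneloven, and opparbeidels is what it leaves of opparbeidelse, which is how those words reach the stems havn and opparbeid in the sample vocabulary.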

Step 4: apostrophe

If the word ends with an apostrophe (') then remove it.

(For example, cd'en → cd' (step 1) → cd in this step.)

Design Notes

This algorithm aims to stem both Bokmål and Nynorsk, which are the two legally-recognised forms of written Norwegian.

Some other accented vowels are used in a small number of Norwegian words, but these are deliberately not included in the list of vowels for this algorithm. This is not because few words are affected, but because including them doesn't actually improve the results of stemming. In most cases it would make no difference, but for é the reasoning is more subtle. Including it would make one difference: it would conflate forms of léta (to paint). However, lét is both the imperative of léta and the past tense of la (to let/allow), so overall this conflation doesn't seem to be an improvement.

An apostrophe is sometimes used in Norwegian to separate Norwegian suffixes from words of foreign origin. This is actually incorrect usage (the correct alternative is a hyphen), but it's widespread enough that we should handle it. It is commonly seen with acronyms and initialisms, some of which are only two characters long and contain no vowels (e.g. cd'en), so the definition of R1 includes a special case for the apostrophe.

History of functional changes to the algorithm

Snowball 2.0.0: s-ending definition adjusted to only include k when preceded by a non-vowel.
Snowball 3.0.0: Improve handling of words ending ers.
Snowball 3.0.0: Include ê, ò, ó and ô in the list of vowels.
Snowball 3.1.0: Add special handling for apostrophe.

The same algorithm in Snowball

routines (
           mark_regions
           main_suffix
           consonant_pair
           other_suffix
)

externals ( stem )

integers ( p1 )

groupings ( v s_ending )

stringescapes {}

/* special characters */

stringdef ae   '{U+00E6}'
stringdef ao   '{U+00E5}'
stringdef e^   '{U+00EA}'  // e-circumflex
stringdef o`   '{U+00F2}'  // o-grave
stringdef o'   '{U+00F3}'  // o-acute
stringdef o^   '{U+00F4}'  // o-circumflex
stringdef o/   '{U+00F8}'

define v 'ae{e^}io{o`}{o'}{o^}uy{ae}{ao}{o/}'

define s_ending  'bcdfghjlmnoptvyz'

define mark_regions as (
    $p1 = limit

    do (
      (
        // If there's an apostrophe, start R1 after it to handle
        // acronym loanwords such as "pc'en" and "ep'en".
        gopast '{'}'
      ) or (
        gopast v  gopast non-v
      )
      setmark p1
    )

    // Ensure at least 3 characters before R1.
    test (hop 3  do ($p1 < cursor  $p1 = cursor))
)

backwardmode (

    define main_suffix as (
        setlimit tomark p1 for ([substring])
        among(

            'a' 'e' 'ede' 'ande' 'ende' 'ane' 'ene' 'hetene' 'en' 'heten' 'ar'
            'er' 'heter' 'as' 'es' 'edes' 'endes' 'enes' 'hetenes' 'ens'
            'hetens' 'ets' 'et' 'het' 'ast'
                (delete)
            'ers'
                (
                  among (
                    'amm' 'ast' 'ind' 'kap' 'kk' 'lt' 'nk' 'omm' 'pp' 'v'
                    '{o/}st'
                        ()
                    'giv' 'hav' 'skap' ''
                        (delete)
                  )
                )
            's'
                (s_ending or ('r' not 'e') or ('k' non-v) delete)
            'erte' 'ert'
                (<-'er')
        )
    )

    define consonant_pair as (
        test (
            setlimit tomark p1 for ([substring])
            among(
                'dt' 'vt'
            )
        )
        next] delete
    )

    define other_suffix as (
        setlimit tomark p1 for ([substring])
        among(
            'leg' 'eleg' 'ig' 'eig' 'lig' 'elig' 'els' 'lov' 'elov' 'slov'
            'hetslov'
                (delete)
        )
    )
)

define stem as (

    mark_regions
    backwards (
        do main_suffix
        do consonant_pair
        do other_suffix
        // Remove trailing apostrophe.
        ['{'}'] delete
    )
)