Dutch stemming algorithm

Here is a sample of Dutch vocabulary, with the stemmed forms that will be generated by this algorithm:

word                 stem                  word            stem
lichaamsziek         lichaamsziek          opgingen        opging
lichamelijk          licham                opglanzing      opglanz
lichamelijke         licham                opglanzingen    opglanz
lichamelijkheden     licham                opglimlachten   opglimlacht
lichamen             licham                opglimpen       opglimp
lichere              licher                opglimpende     opglimp
licht                licht                 opglimping      opglimp
lichtbeeld           lichtbeeld            opglimpingen    opglimp
lichtbruin           lichtbruin            opgraven        opgrav
lichtdoorlatende     lichtdoorlat          opgrijnzen      opgrijnz
lichte               licht                 opgrijzende     opgrijz
lichten              licht                 opgroeien       opgroei
lichtende            lichtend              opgroeiende     opgroei
lichtenvoorde        lichtenvoord          opgroeiplaats   opgroeiplat
lichter              lichter               ophaal          ophal
lichtere             lichter               ophaaldienst    ophaaldienst
lichters             lichter               ophaalkosten    ophaalkost
lichtgevoeligheid    lichtgevoel           ophaalsystemen  ophaalsystem
lichtgewicht         lichtgewicht          ophaalt         ophaalt
lichtgrijs           lichtgrijs            ophaaltruck     ophaaltruck
lichthoeveelheid     lichthoevel           ophalen         ophal
lichtintensiteit     lichtintensiteit      ophalend        ophal
lichtje              lichtj                ophalers        ophaler
lichtjes             lichtjes              ophef           ophef
lichtkranten         lichtkrant            opheffen        opheff
lichtkring           lichtkring            opheffende      opheff
lichtkringen         lichtkring            opheffing       opheff
lichtregelsystemen   lichtregelsystem      opheldering     ophelder
lichtste             lichtst               ophemelde       ophemeld
lichtstromende       lichtstrom            ophemelen       ophemel
lichtte              licht                 opheusden       opheusd
lichtten             licht                 ophief          ophief
lichttoetreding      lichttoetred          ophield         ophield
lichtverontreinigde  lichtverontreinigd    ophieven        ophiev
lichtzinnige         lichtzinn             ophoepelt       ophoepelt
lid                  lid                   ophoog          ophog
lidia                lidia                 ophoogzand      ophoogzand
lidmaatschap         lidmaatschap          ophopen         ophop
lidstaten            lidstat               ophoping        ophop
lidvereniging        lidveren              ophouden        ophoud

The stemming algorithm

Dutch includes the following accented forms:
ä   ë   ï   ö   ü   á   é   í   ó   ú   è
First, remove all umlaut and acute accents. A vowel is then one of:
a   e   i   o   u   y   è
Put initial y, y after a vowel, and i between vowels into upper case. R1 and R2 (see the note on R1 and R2) are then defined as in German.
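The preparation above can be sketched in Python. This is an illustrative sketch, not part of the reference implementation: the names VOWELS, ACCENTS, prelude, _past_vowel_then_non_vowel and mark_regions are chosen here for the example, and the left-to-right scan is written to follow the order in which the Snowball prelude routine visits the letters.

```python
VOWELS = "aeiouyè"  # vowels after accent removal; è is kept, I and Y are not vowels
ACCENTS = str.maketrans("äëïöüáéíóú", "aeiouaeiou")  # umlaut and acute forms only


def prelude(word):
    """Strip umlaut/acute accents, then mark consonant-like y and i as Y and I."""
    chars = list(word.translate(ACCENTS))
    if chars and chars[0] == "y":
        chars[0] = "Y"  # initial y
    i = 0
    while i + 1 < len(chars):
        if chars[i] in VOWELS:
            # i between two vowels
            if chars[i + 1] == "i" and i + 2 < len(chars) and chars[i + 2] in VOWELS:
                chars[i + 1] = "I"
                i += 3  # resume after the second vowel
                continue
            # y after a vowel
            if chars[i + 1] == "y":
                chars[i + 1] = "Y"
                i += 2
                continue
        i += 1
    return "".join(chars)


def _past_vowel_then_non_vowel(word, start):
    """Index just past the first non-vowel that follows a vowel, or None."""
    i = start
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    if i >= len(word):
        return None  # no vowel found
    i += 1
    while i < len(word) and word[i] in VOWELS:
        i += 1
    if i >= len(word):
        return None  # no non-vowel after the vowel
    return i + 1


def mark_regions(word):
    """Return (p1, p2), the start offsets of R1 and R2 in a prelude-processed word."""
    p1 = p2 = len(word)
    cut = _past_vowel_then_non_vowel(word, 0)
    if cut is None:
        return p1, p2
    p1 = max(cut, 3)  # at least 3 letters must precede R1
    cut2 = _past_vowel_then_non_vowel(word, cut)
    if cut2 is not None:
        p2 = cut2
    return p1, p2
```

For lichamelijkheden this gives p1 = 3 and p2 = 6, so R1 is hamelijkheden and R2 is elijkheden.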

Define a valid s-ending as a non-vowel other than j.

Define a valid en-ending as a non-vowel, and not gem.

Define undoubling the ending as removing the last letter if the word ends kk, dd or tt.
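The three definitions above translate into small predicates. The following Python sketch uses hypothetical helper names (valid_s_ending, valid_en_ending, undouble), and each helper takes the part of the word that precedes the suffix being considered.

```python
VOWELS = "aeiouyè"  # I and Y (upper case) count as non-vowels


def valid_s_ending(stem):
    """True if stem ends in a non-vowel other than j."""
    return bool(stem) and stem[-1] not in VOWELS and stem[-1] != "j"


def valid_en_ending(stem):
    """True if stem ends in a non-vowel and does not end in gem."""
    return bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")


def undouble(word):
    """Remove the last letter if the word ends kk, dd or tt."""
    if word.endswith(("kk", "dd", "tt")):
        return word[:-1]
    return word
```

For example, undouble("lichtt") returns "licht", and valid_s_ending("lichtje") is false because the letter before the s in lichtjes is a vowel.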

Do each of steps 1, 2, 3 and 4.

Step 1:
Search for the longest among the following suffixes, and perform the action indicated

(a) heden
replace with heid if in R1

(b) en   ene
delete if in R1 and preceded by a valid en-ending, and then undouble the ending

(c) s   se
delete if in R1 and preceded by a valid s-ending
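
Step 1 might be sketched in Python as follows. The function name step1 and its signature are assumptions made for this example: it takes a word that has already been through the preparation stage, together with p1, the offset at which R1 starts. As in the Snowball version, only the longest matching suffix is considered; if its condition fails, no shorter suffix is tried.

```python
VOWELS = "aeiouyè"
STEP1_SUFFIXES = ("heden", "ene", "en", "se", "s")  # longest first


def undouble(word):
    """Remove the last letter if the word ends kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def step1(word, p1):
    """Apply step 1 to word, where R1 starts at offset p1."""
    for suffix in STEP1_SUFFIXES:
        if word.endswith(suffix):
            break
    else:
        return word  # no step 1 suffix present
    stem = word[: -len(suffix)]
    if len(stem) < p1:
        return word  # suffix is not in R1
    if suffix == "heden":
        return stem + "heid"
    ends_in_non_vowel = bool(stem) and stem[-1] not in VOWELS
    if suffix in ("en", "ene"):
        # valid en-ending: a non-vowel, and not gem
        if ends_in_non_vowel and not stem.endswith("gem"):
            return undouble(stem)
        return word
    # suffix is s or se; valid s-ending: a non-vowel other than j
    if ends_in_non_vowel and stem[-1] != "j":
        return stem
    return word
```

With p1 = 3, step1 maps lichtten to licht and lichters to lichter, and leaves lichtjes unchanged because the s is preceded by a vowel.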
Step 2:
Delete suffix e if in R1 and preceded by a non-vowel, and then undouble the ending
Step 3a: heid
delete heid if in R2 and not preceded by c, and treat a preceding en as in step 1(b)
Step 3b: d-suffixes (*)
Search for the longest among the following suffixes, and perform the action indicated.

end   ing
delete if in R2; then, if the remaining word ends in ig, delete the ig if it is in R2 and not preceded by e; otherwise undouble the ending

ig
delete if in R2 and not preceded by e

lijk
delete if in R2, and then repeat step 2

baar
delete if in R2

bar
delete if in R2 and if step 2 actually removed an e
Step 4: undouble vowel
If the word ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, maan → man, brood → brod).
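
A minimal Python sketch of step 4, assuming the word has already passed through steps 1 to 3 and still carries the upper case I and Y markers. The name undouble_vowel is chosen for this example.

```python
VOWELS = "aeiouyè"


def undouble_vowel(word):
    """If word ends C V V D with V a doubled a, e, o or u, drop one V."""
    if len(word) < 4:
        return word
    c, v1, v2, d = word[-4:]
    if (
        c not in VOWELS              # C: a non-vowel
        and v1 == v2 and v1 in "aeou"  # V: double a, e, o or u
        and d not in VOWELS and d != "I"  # D: a non-vowel other than I
    ):
        return word[:-2] + d
    return word
```

This turns maan into man and ophoog into ophog, and leaves ophaalt unchanged because the doubled vowel is not followed by a single final non-vowel.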
Finally,
Turn I and Y back into lower case.

The same algorithm in Snowball

routines (
           prelude postlude
           e_ending
           en_ending
           mark_regions
           R1 R2
           undouble
           standard_suffix
)

externals ( stem )

booleans ( e_found )

integers ( p1 p2 )

groupings ( v v_I v_j )

stringescapes {}

/* special characters */

stringdef a"   '{U+00E4}'
stringdef e"   '{U+00EB}'
stringdef i"   '{U+00EF}'
stringdef o"   '{U+00F6}'
stringdef u"   '{U+00FC}'

stringdef a'   '{U+00E1}'
stringdef e'   '{U+00E9}'
stringdef i'   '{U+00ED}'
stringdef o'   '{U+00F3}'
stringdef u'   '{U+00FA}'

stringdef e`   '{U+00E8}'

define v       'aeiouy{e`}'
define v_I     v + 'I'
define v_j     v + 'j'

define prelude as (
    test repeat (
        [substring] among(
            '{a"}' '{a'}'
                (<- 'a')
            '{e"}' '{e'}'
                (<- 'e')
            '{i"}' '{i'}'
                (<- 'i')
            '{o"}' '{o'}'
                (<- 'o')
            '{u"}' '{u'}'
                (<- 'u')
            ''  (next)
        ) //or next
    )
    try(['y'] <- 'Y')
    repeat goto (
        v [('i'] v <- 'I') or
           ('y']   <- 'Y')
    )
)

define mark_regions as (

    $p1 = limit
    $p2 = limit

    gopast v  gopast non-v  setmark p1
    try($p1 < 3  $p1 = 3)  // at least 3
    gopast v  gopast non-v  setmark p2

)

define postlude as repeat (

    [substring] among(
        'Y'  (<- 'y')
        'I'  (<- 'i')
        ''   (next)
    ) //or next

)

backwardmode (

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define undouble as (
        test among('kk' 'dd' 'tt') [next] delete
    )

    define e_ending as (
        unset e_found
        ['e'] R1 test non-v delete
        set e_found
        undouble
    )

    define en_ending as (
        R1 non-v and not 'gem' delete
        undouble
    )

    define standard_suffix as (
        do (
            [substring] among(
                'heden'
                (   R1 <- 'heid'
                )
                'en' 'ene'
                (   en_ending
                )
                's' 'se'
                (   R1 non-v_j delete
                )
            )
        )
        do e_ending

        do ( ['heid'] R2 not 'c' delete
             ['en'] en_ending
           )

        do (
            [substring] among(
                'end' 'ing'
                (   R2 delete
                    (['ig'] R2 not 'e' delete) or undouble
                )
                'ig'
                (   R2 not 'e' delete
                )
                'lijk'
                (   R2 delete e_ending
                )
                'baar'
                (   R2 delete
                )
                'bar'
                (   R2 e_found delete
                )
            )
        )
        do (
            non-v_I
            test (
                among ('aa' 'ee' 'oo' 'uu')
                non-v
            )
            [next] delete
        )
    )
)

define stem as (

        do prelude
        do mark_regions
        backwards
            do standard_suffix
        do postlude
)