Dutch stemming algorithm


Here is a sample of Dutch vocabulary, with the stemmed forms that will be generated by this algorithm:

word ⇒ stem

lichaamsziek ⇒ lichaamsziek
lichamelijk ⇒ licham
lichamelijke ⇒ licham
lichamelijkheden ⇒ licham
lichamen ⇒ licham
lichere ⇒ licher
licht ⇒ licht
lichtbeeld ⇒ lichtbeeld
lichtbruin ⇒ lichtbruin
lichtdoorlatende ⇒ lichtdoorlat
lichte ⇒ licht
lichten ⇒ licht
lichtende ⇒ lichtend
lichtenvoorde ⇒ lichtenvoord
lichter ⇒ lichter
lichtere ⇒ lichter
lichters ⇒ lichter
lichtgevoeligheid ⇒ lichtgevoel
lichtgewicht ⇒ lichtgewicht
lichtgrijs ⇒ lichtgrijs
lichthoeveelheid ⇒ lichthoevel
lichtintensiteit ⇒ lichtintensiteit
lichtje ⇒ lichtj
lichtjes ⇒ lichtjes
lichtkranten ⇒ lichtkrant
lichtkring ⇒ lichtkring
lichtkringen ⇒ lichtkring
lichtregelsystemen ⇒ lichtregelsystem
lichtste ⇒ lichtst
lichtstromende ⇒ lichtstrom
lichtte ⇒ licht
lichtten ⇒ licht
lichttoetreding ⇒ lichttoetred
lichtverontreinigde ⇒ lichtverontreinigd
lichtzinnige ⇒ lichtzinn
lid ⇒ lid
lidia ⇒ lidia
lidmaatschap ⇒ lidmaatschap
lidstaten ⇒ lidstat
lidvereniging ⇒ lidveren

opgingen ⇒ opging
opglanzing ⇒ opglanz
opglanzingen ⇒ opglanz
opglimlachten ⇒ opglimlacht
opglimpen ⇒ opglimp
opglimpende ⇒ opglimp
opglimping ⇒ opglimp
opglimpingen ⇒ opglimp
opgraven ⇒ opgrav
opgrijnzen ⇒ opgrijnz
opgrijzende ⇒ opgrijz
opgroeien ⇒ opgroei
opgroeiende ⇒ opgroei
opgroeiplaats ⇒ opgroeiplat
ophaal ⇒ ophal
ophaaldienst ⇒ ophaaldienst
ophaalkosten ⇒ ophaalkost
ophaalsystemen ⇒ ophaalsystem
ophaalt ⇒ ophaalt
ophaaltruck ⇒ ophaaltruck
ophalen ⇒ ophal
ophalend ⇒ ophal
ophalers ⇒ ophaler
ophef ⇒ ophef
opheffen ⇒ opheff
opheffende ⇒ opheff
opheffing ⇒ opheff
opheldering ⇒ ophelder
ophemelde ⇒ ophemeld
ophemelen ⇒ ophemel
opheusden ⇒ opheusd
ophief ⇒ ophief
ophield ⇒ ophield
ophieven ⇒ ophiev
ophoepelt ⇒ ophoepelt
ophoog ⇒ ophog
ophoogzand ⇒ ophoogzand
ophopen ⇒ ophop
ophoping ⇒ ophop
ophouden ⇒ ophoud

The stemming algorithm

Dutch includes the following accented forms:

ä ë ï ö ü á é í ó ú è

First, replace each letter carrying an umlaut or acute accent with its unaccented form; è keeps its grave accent. A vowel is then one of:

a e i o u y è

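Put initial y, y after a vowel, and i between vowels into upper case, so that they are treated as non-vowels from here on. R1 and R2 (see the note on R1 and R2) are then defined as in German: the start of R1 is moved right, if necessary, so that the region before it contains at least 3 letters.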
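The preprocessing described above can be sketched in Python as follows. This is an illustrative sketch rather than the reference implementation: the names `prelude` and `mark_regions` are borrowed from the Snowball source further down, `_next_region` is a hypothetical helper, and the code assumes lower-case input and a left-to-right scan in which letters already turned into Y or I no longer count as vowels.

```python
VOWELS = set("aeiouyè")
ACCENTS = str.maketrans("äëïöüáéíóú", "aeiouaeiou")  # è is deliberately absent


def prelude(word):
    """Strip umlauts and acute accents, then mark consonant-like y and i."""
    chars = list(word.translate(ACCENTS))
    for i, ch in enumerate(chars):
        # Y and I are not in VOWELS, so a mark made earlier in the scan
        # influences the decision for the letters that follow it.
        after_vowel = i > 0 and chars[i - 1] in VOWELS
        before_vowel = i + 1 < len(chars) and chars[i + 1] in VOWELS
        if ch == "y" and (i == 0 or after_vowel):
            chars[i] = "Y"
        elif ch == "i" and after_vowel and before_vowel:
            chars[i] = "I"
    return "".join(chars)


def _next_region(word, start):
    """Offset just past the first non-vowel that follows a vowel, else len(word)."""
    i = start
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    while i < len(word) and word[i] in VOWELS:
        i += 1
    return min(i + 1, len(word))


def mark_regions(word):
    """Return (p1, p2), the start offsets of R1 and R2."""
    p1 = _next_region(word, 0)
    p2 = _next_region(word, p1)          # R2 is located from the unadjusted R1
    p1 = min(max(p1, 3), len(word))      # at least 3 letters before R1
    return p1, p2
```

For example, `prelude("ooievaar")` gives `ooIevaar`, and `mark_regions("lichamelijk")` gives `(3, 6)`, so that R1 is `hamelijk` and R2 is `elijk`.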

Define a valid s-ending as a non-vowel other than j.

Define a valid en-ending as a non-vowel, and not gem.

Define undoubling the ending as removing the last letter if the word ends kk, dd or tt.
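The three definitions above translate almost directly into small predicates. The sketch below is only an illustration: the function names are hypothetical, and `stem` stands for the part of the word to the left of the suffix under consideration, after the Y/I marking has been applied.

```python
VOWELS = set("aeiouyè")


def valid_s_ending(stem):
    """True if stem ends in a non-vowel other than j."""
    return bool(stem) and stem[-1] not in VOWELS and stem[-1] != "j"


def valid_en_ending(stem):
    """True if stem ends in a non-vowel and does not end in gem."""
    return bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")


def undouble(word):
    """Drop the last letter when the word ends in kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word
```

Because upper-case I counts as a non-vowel, `valid_en_ending("opgroeI")` is true, which is why opgroeien in the sample vocabulary loses its en.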

Do each of steps 1, 2, 3 and 4.

Step 1:

Search for the longest among the following suffixes, and perform the action indicated.

(a) heden: replace with heid if in R1
(b) en ene: delete if in R1 and preceded by a valid en-ending, and then undouble the ending
(c) s se: delete if in R1 and preceded by a valid s-ending
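Step 1 can be sketched in Python as below. The function name `step1` and the convention of passing `p1` (the offset at which R1 starts) as an argument are assumptions made for this illustration; a suffix is taken to be "in R1" when it starts at or after `p1`.

```python
VOWELS = set("aeiouyè")


def undouble(word):
    """Drop the last letter when the word ends in kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def ends_in_non_vowel(stem):
    return bool(stem) and stem[-1] not in VOWELS


def step1(word, p1):
    """Apply step 1 to a preprocessed word; p1 is the start offset of R1."""
    # Suffixes are tried longest first; only the first match is considered.
    for suffix in ("heden", "ene", "en", "se", "s"):
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if len(stem) < p1:                      # suffix is not in R1
            return word
        if suffix == "heden":                   # (a)
            return stem + "heid"
        if suffix in ("ene", "en"):             # (b)
            if ends_in_non_vowel(stem) and not stem.endswith("gem"):
                return undouble(stem)
            return word
        if ends_in_non_vowel(stem) and not stem.endswith("j"):  # (c)
            return stem
        return word
    return word
```

With `p1 = 3`, `step1("lichtten", 3)` gives `licht` (en removed, then tt undoubled), while `step1("lichtjes", 3)` leaves the word unchanged because the s is preceded by a vowel.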

Step 2:

Delete suffix e if in R1 and preceded by a non-vowel, and then undouble the ending
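A minimal Python sketch of step 2 follows. The name `step2` and the returned flag are assumptions of this illustration; the flag records whether an e was actually removed, which the bar rule in step 3b depends on.

```python
VOWELS = set("aeiouyè")


def undouble(word):
    """Drop the last letter when the word ends in kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def step2(word, p1):
    """Return (word, e_removed) after step 2; p1 is the start offset of R1."""
    stem = word[:-1]
    if (
        word.endswith("e")
        and len(stem) >= p1            # the e lies in R1
        and stem
        and stem[-1] not in VOWELS     # preceded by a non-vowel
    ):
        return undouble(stem), True
    return word, False
```

For instance, `step2("lichtte", 3)` returns `("licht", True)` and `step2("licht", 3)` returns `("licht", False)`.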

Step 3a: heid

Delete heid if in R2 and not preceded by c. If heid was deleted and the word now ends in en, treat that en as in step 1(b).
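Step 3a can be sketched as follows, again with hypothetical names and with `p1` and `p2` passed in as the start offsets of R1 and R2. The en handling mirrors step 1(b) and is only attempted once heid has actually been removed.

```python
VOWELS = set("aeiouyè")


def undouble(word):
    """Drop the last letter when the word ends in kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def step3a(word, p1, p2):
    """Apply step 3a; p1 and p2 are the start offsets of R1 and R2."""
    stem = word[:-4]
    if not word.endswith("heid") or len(stem) < p2 or stem.endswith("c"):
        return word
    if stem.endswith("en"):
        base = stem[:-2]
        valid_en = (
            bool(base)
            and base[-1] not in VOWELS
            and not base.endswith("gem")
        )
        if len(base) >= p1 and valid_en:   # en in R1 with a valid en-ending
            return undouble(base)
    return stem
```

For gelegenheid, where R1 starts at offset 3 and R2 at offset 5, `step3a("gelegenheid", 3, 5)` gives `geleg`; for onderscheid the preceding c blocks the deletion.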

Step 3b: d-suffixes (*)

Search for the longest among the following suffixes, and perform the action indicated.

end ing: delete if in R2; then, if the word now ends in ig, delete the ig if it is in R2 and not preceded by e, otherwise undouble the ending
ig: delete if in R2 and not preceded by e
lijk: delete if in R2, and then repeat step 2
baar: delete if in R2
bar: delete if in R2 and if step 2 actually removed an e
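The d-suffix rules above can be sketched in Python as below. The helper names (`drop_e`, `drop_ig`, `step3b`) are hypothetical, `p1` and `p2` are the start offsets of R1 and R2, and `e_removed` is assumed to be the flag produced by step 2. Since no word can end in two of these suffixes at once, the order of the tests does not matter.

```python
VOWELS = set("aeiouyè")


def undouble(word):
    """Drop the last letter when the word ends in kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def drop_e(word, p1):
    """Step 2 again: delete a final e in R1 after a non-vowel, then undouble."""
    stem = word[:-1]
    if word.endswith("e") and len(stem) >= p1 and stem and stem[-1] not in VOWELS:
        return undouble(stem)
    return word


def drop_ig(word, p2):
    """Word without a final ig that is in R2 and not preceded by e, else None."""
    if word.endswith("ig") and len(word) - 2 >= p2 and not word.endswith("eig"):
        return word[:-2]
    return None


def step3b(word, p1, p2, e_removed):
    def in_r2(suffix):
        return word.endswith(suffix) and len(word) - len(suffix) >= p2

    if in_r2("end") or in_r2("ing"):
        stem = word[:-3]
        shorter = drop_ig(stem, p2)
        return undouble(stem) if shorter is None else shorter
    if word.endswith("ig"):
        shorter = drop_ig(word, p2)
        return word if shorter is None else shorter
    if in_r2("lijk"):
        return drop_e(word[:-4], p1)
    if in_r2("baar"):
        return word[:-4]
    if in_r2("bar") and e_removed:
        return word[:-3]
    return word
```

As a check against the sample vocabulary, `step3b("lidvereniging", 3, 6, False)` gives `lidveren` and `step3b("lichamelijk", 3, 6, False)` gives `licham`.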

Step 4: undouble vowel

If the word ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, maan → man, brood → brod).
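The vowel undoubling rule is short enough to sketch directly. The name `step4` is an assumption of this illustration, and the word is expected to still carry the upper-case Y and I marks.

```python
VOWELS = set("aeiouyè")
DOUBLES = ("aa", "ee", "oo", "uu")


def step4(word):
    """Undouble the vowel in a final C + double vowel + D pattern."""
    if len(word) < 4:
        return word
    c, v, d = word[-4], word[-3:-1], word[-1]
    if c not in VOWELS and v in DOUBLES and d not in VOWELS and d != "I":
        return word[:-2] + d   # drop one of the two vowels
    return word
```

This reproduces the examples in the text (`step4("maan")` gives `man`, `step4("brood")` gives `brod`) and leaves a three-letter word such as aal untouched, because there is no C before the double vowel.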

Finally, turn I and Y back into lower case.

The same algorithm in Snowball

// Dutch stemming algorithm developed by Martin Porter

routines (
           prelude postlude
           e_ending
           en_ending
           mark_regions
           R1 R2
           undouble
           standard_suffix
)

externals ( stem )

booleans ( e_found )

integers ( p1 p2 x )

groupings ( v v_I v_j )

stringescapes {}

/* special characters */

stringdef a"   '{U+00E4}'
stringdef e"   '{U+00EB}'
stringdef i"   '{U+00EF}'
stringdef o"   '{U+00F6}'
stringdef u"   '{U+00FC}'

stringdef a'   '{U+00E1}'
stringdef e'   '{U+00E9}'
stringdef i'   '{U+00ED}'
stringdef o'   '{U+00F3}'
stringdef u'   '{U+00FA}'

stringdef e`   '{U+00E8}'

define v       'aeiouy{e`}'
define v_I     v + 'I'
define v_j     v + 'j'

define prelude as (
    test repeat (
        [substring] among(
            '{a"}' '{a'}'
                (<- 'a')
            '{e"}' '{e'}'
                (<- 'e')
            '{i"}' '{i'}'
                (<- 'i')
            '{o"}' '{o'}'
                (<- 'o')
            '{u"}' '{u'}'
                (<- 'u')
            ''  (next)
        )
    )
    try(['y'] <- 'Y')
    repeat (
        gopast v
        try (
            // If we see `i` not followed by a vowel then we know it couldn't
            // match on the next iteration so we can advance past it.
            //
            // However if we replace `i` with `I` we do need to check the vowel
            // after the `i` in the next iteration to match the documented
            // behaviour, e.g. consider input `iiiii`.  This may well not make
            // a difference for any actual Dutch words though.
            [('i'] do(v <- 'I')) or
             ('y']      <- 'Y')
        )
    )
)

define mark_regions as (

    $p1 = limit
    $p2 = limit

    test(hop 3 setmark x)

    gopast v  gopast non-v  setmark p1
    try($p1 < x  $p1 = x)  // at least 3
    gopast v  gopast non-v  setmark p2

)

define postlude as repeat (

    [substring] among(
        'Y'  (<- 'y')
        'I'  (<- 'i')
        ''   (next)
    )

)

backwardmode (

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define undouble as (
        test among('kk' 'dd' 'tt') [next] delete
    )

    define e_ending as (
        unset e_found
        ['e'] R1 test non-v delete
        set e_found
        undouble
    )

    define en_ending as (
        R1 non-v and not 'gem' delete
        undouble
    )

    define standard_suffix as (
        do (
            [substring] among(
                'heden'
                (   R1 <- 'heid'
                )
                'en' 'ene'
                (   en_ending
                )
                's' 'se'
                (   R1 non-v_j delete
                )
            )
        )
        do e_ending

        do ( ['heid'] R2 not 'c' delete
             ['en'] en_ending
           )

        do (
            [substring] among(
                'end' 'ing'
                (   R2 delete
                    (['ig'] R2 not 'e' delete) or undouble
                )
                'ig'
                (   R2 not 'e' delete
                )
                'lijk'
                (   R2 delete e_ending
                )
                'baar'
                (   R2 delete
                )
                'bar'
                (   R2 e_found delete
                )
            )
        )
        do (
            non-v_I
            test (
                among ('aa' 'ee' 'oo' 'uu')
                non-v
            )
            [next] delete
        )
    )
)

define stem as (

        do prelude
        do mark_regions
        backwards
            do standard_suffix
        do postlude
)