German stemming algorithm

Here is a sample of German vocabulary, with the stemmed forms that will be generated by this algorithm:

word ⇒ stem

aufeinander ⇒ aufeinand
aufeinanderbiss ⇒ aufeinanderbiss
aufeinanderfolge ⇒ aufeinanderfolg
aufeinanderfolgen ⇒ aufeinanderfolg
aufeinanderfolgend ⇒ aufeinanderfolg
aufeinanderfolgende ⇒ aufeinanderfolg
aufeinanderfolgenden ⇒ aufeinanderfolg
aufeinanderfolgender ⇒ aufeinanderfolg
aufeinanderfolgt ⇒ aufeinanderfolgt
aufeinanderfolgten ⇒ aufeinanderfolgt
aufeinanderschlügen ⇒ aufeinanderschlug
aufenthalt ⇒ aufenthalt
aufenthalten ⇒ aufenthalt
aufenthaltes ⇒ aufenthalt
auferlegen ⇒ auferleg
auferlegt ⇒ auferlegt
auferlegten ⇒ auferlegt
auferstand ⇒ auferstand
auferstanden ⇒ auferstand
auferstehen ⇒ aufersteh
aufersteht ⇒ aufersteht
auferstehung ⇒ aufersteh
auferstünde ⇒ auferstund
auferwecken ⇒ auferweck
auferweckt ⇒ auferweckt
auferzogen ⇒ auferzog
aufessen ⇒ aufess
auffa ⇒ auffa
auffallen ⇒ auffall
auffallend ⇒ auffall
auffallenden ⇒ auffall
auffallender ⇒ auffall
auffassen ⇒ auffass
auffasst ⇒ auffasst
auffassung ⇒ auffass
auffassungsvermögen ⇒ auffassungsvermog
auffaßt ⇒ auffasst
auffi ⇒ auffi
auffiel ⇒ auffiel
auffinden ⇒ auffind

kategorie ⇒ kategori
kategorien ⇒ kategori
kategorisch ⇒ kategor
kategorische ⇒ kategor
kategorischen ⇒ kategor
kategorischer ⇒ kategor
kater ⇒ kat
katerliede ⇒ katerlied
katern ⇒ kat
katers ⇒ kat
kathedrale ⇒ kathedral
kathinka ⇒ kathinka
katholik ⇒ kathol
katholische ⇒ kathol
katholischen ⇒ kathol
katholischer ⇒ kathol
kattun ⇒ kattun
kattunhalstücher ⇒ kattunhalstuch
katz ⇒ katz
katze ⇒ katz
katzen ⇒ katz
katzenschmer ⇒ katzenschm
katzensprung ⇒ katzenspr
katzenwürde ⇒ katzenwurd
katzmann ⇒ katzmann
kauen ⇒ kau
kauerte ⇒ kauert
kauf ⇒ kauf
kaufe ⇒ kauf
kaufen ⇒ kauf
kauffahrer ⇒ kauffahr
kaufherr ⇒ kaufherr
kaufleute ⇒ kaufleut
kaufmann ⇒ kaufmann
kaufmanns ⇒ kaufmann
kaufmannschaft ⇒ kaufmannschaft
kaufmannsnamen ⇒ kaufmannsnam
kaufpreis ⇒ kaufpreis
kaufpreises ⇒ kaufpreis
kaufst ⇒ kauf

Design Notes

Despite its inflexional complexities, German has quite a simple suffix structure, so that, if one ignores the almost intractable problems of compound words, separable verb prefixes, and prefixed and infixed ge, an algorithmic stemmer can be made quite short. (Infixed zu can be removed algorithmically, but this minor feature is not shown here.) The umlaut in German is a regular feature of plural formation, so its removal is a natural feature of stemming, but this leads to certain false conflations (for example, schön, beautiful; schon, already).

There are a few short suffixes (for example, -t) where removal would cause problems so the stemmer leaves these alone - for example holen, hole, hol and holest are stemmed to hol; ideally holt would be too, but a rule to remove -t would adversely affect too many other words.

For similar reasons, some suffixes are only removed when in R2: for example -end.

Suffixes -et, -s and -st are removed in some cases. The stemmer checks the end of the stem which would be left to determine when to remove these suffixes, erring on the side of non-removal when it might be problematic.

As with the other stemmers, words are assumed to be lower cased before stemming. This potentially loses information since nouns are always capitalised in German, which results in some possibly avoidable conflations (e.g. Planet means "planet" but planet is a form of the verb "to plan"). However words are also capitalised for other reasons (e.g. at the start of a sentence or in a title) so capitalisation is not a completely reliable indicator.

In German, ä, ö, ü and ß are sometimes transliterated as ae, oe, ue and ss respectively. There is now a capital version of ß, but it is a fairly recent invention (added to Unicode in 2008); before that, capitalised text such as street signs used SS instead. Swiss German always uses ss instead of ß, and also uses the transliterated forms for capital letters with umlauts. The German spelling reform of 1996 also reduced the use of ß, replacing it with ss in some words. Finally (but much less relevant nowadays), these transliterations are used when writing in character sets which lack these characters, or with a keyboard lacking an easy way to type them. The stemmer will conflate words spelled using these characters with those spelled using the transliterated versions.

Compound words

Famously, German allows for the formation of long compound words, written without spaces. For retrieval purposes, it is useful to be able to search on the parts of such words, as well as on the complete words themselves. This is not peculiar to German: Dutch, Danish, Norwegian, Swedish, Icelandic and Finnish have the same property. Compound words cannot be split up without a dictionary, and the purely algorithmic stemmers presented here do not attempt it.

We would suggest, however, that the need for compound word splitting in these languages has been somewhat overstated. In the case of German:

There are many English compounds one would see no advantage in splitting,

blackberry blackboard rainbow coastguard ....

Many German compounds are like this,

Bleistift (pencil) = Blei (lead) + Stift (stick)
Eisenbahn (railway) = Eisen (iron) + Bahn (road)
Unterseeboot (submarine) = under + sea + boat

Other compounds correspond to what in English one would want to do by phrase searching, so they are ready made for that purpose,

Gesundheitspflege = ‘health care’
Fachhochschule = ‘technical college’
Kunstmuseum = ‘museum of fine art’

In any case, longer compounds, especially involving personal names, are frequently hyphenated,

Heinrich-Heine-Universität

It is possible to construct participial adjectives of almost any length, but they are little used in contemporary German, and regarded now as poor style.

Apostrophe

The rules about correct use of the apostrophe in German have changed over time. The stemmer aims to also work with text conforming to older versions of the rules, and with common misuses of the apostrophe. Therefore the rules chosen for when to remove it take into account usage that was formerly correct but no longer is, or which is incorrect but common.

There are some uses of apostrophe in German which are relevant to stemming:

With the possessive of proper nouns to avoid ambiguity - for example, Andrea's Blumenecke to clarify it's Andrea rather than Andreas.
With adjectives derived from proper nouns, e.g. Einstein'sche Relativitätstheorie.
The possessive of proper nouns ending with an "s"-sound (written -s, -ß, -z, -x, -ce) is formed by appending just an apostrophe (e.g. Bordeaux’ Hafenanlagen).

Derived forms of adjectives formed from nouns will have already been reduced to having suffix -'sch by earlier rules, so we can handle these cases with a step which removes 's, 'sch or ', if that leaves at least two characters. We perform this step after all other suffix removal steps because it leaves a proper noun which we don't want to try to stem.

History of functional changes to the algorithm

Snowball 2.0.0 (2009-12-11): Extra rule for -nisse ending added
Snowball 3.0.0: Handle ASCII transliterations of umlauts (merging the "german2" variant into the standard algorithm).
Snowball 3.0.0: Special case for -system added.
Snowball 3.0.0: Replace -ln and -lns with l.
Snowball 3.0.0: Remove -erin and -erinnen.
Snowball 3.0.0: Remove -et when safe to do so.
Snowball 3.1.0: New step to handle apostrophe.

The stemming algorithm

German includes the following accented forms,

ä ö ü

and a special letter, ß, equivalent to double s.

The following letters are vowels:

a e i o u y ä ö ü

First put u and y between vowels into upper case, and then do the following mappings,

(a) replace ß with ss,
(b) replace ae with ä,
(c) replace oe with ö,
(d) replace ue with ü unless preceded by q.

(The rules here for ae, oe and ue were added in Snowball 3.0.0, but were previously present as a variant of the algorithm termed "german2". The condition on the replacement of ue prevents the unwanted changing of quelle. Also note that feuer is not modified because the first part of the rule changes it to feUer, so ue is not found.)
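The prelude described above can be sketched in Python as follows. This is only an illustration of the stated rules, not the Snowball implementation: the function name `prelude` and the constants are hypothetical, and the left-to-right scanning order is an assumption modelled on the Snowball program given at the end of this document.

```python
VOWELS = set("aeiouyäöü")

# "qu" is listed only so that its "u" is consumed and a following "e"
# cannot combine with it to form "ue".
DIGRAPHS = {"ae": "ä", "oe": "ö", "ue": "ü", "qu": "qu"}


def prelude(word: str) -> str:
    """Illustrative prelude: mark u/y between vowels, then apply mappings (a)-(d)."""
    chars = list(word)
    i = 0
    while i + 2 < len(chars):
        if chars[i] in VOWELS and chars[i + 1] in "uy" and chars[i + 2] in VOWELS:
            chars[i + 1] = chars[i + 1].upper()
            i += 3  # resume after the vowel that follows the marked letter
        else:
            i += 1

    out = []
    i = 0
    while i < len(chars):
        pair = "".join(chars[i:i + 2])
        if chars[i] == "ß":
            out.append("ss")
            i += 1
        elif pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


print(prelude("quelle"))   # quelle
print(prelude("feuer"))    # feUer
print(prelude("muessen"))  # müssen
print(prelude("auffaßt"))  # auffasst
```

The upper-case U and Y act as temporary markers: they are not in the vowel list, so later steps treat them as consonants until the final step lowers them again.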

R1 and R2 are first set up in the standard way (see the note on R1 and R2), but then R1 is adjusted so that the region before it contains at least 3 letters.
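As a rough illustration of the region setup, the sketch below computes the start offsets of R1 and R2 for a word that has already been through the prelude. The names `mark_regions` and `_past_vowel_then_non_vowel` are hypothetical. Treating words shorter than 3 letters as having empty regions is an assumption taken from the Snowball program at the end of this document, where the `hop 3` test fails for such words.

```python
VOWELS = set("aeiouyäöü")


def _past_vowel_then_non_vowel(word: str, start: int):
    """From `start`, skip to just after the first vowel, then to just after
    the next non-vowel. Return that index, or None if the word runs out."""
    i = start
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    i += 1  # step past the vowel (or past the end if none was found)
    while i < len(word) and word[i] in VOWELS:
        i += 1
    return i + 1 if i < len(word) else None


def mark_regions(word: str):
    """Return (p1, p2), the start offsets of R1 and R2 in `word`."""
    n = len(word)
    p1 = p2 = n  # default: both regions are empty
    if n < 3:
        return p1, p2
    r1 = _past_vowel_then_non_vowel(word, 0)
    if r1 is None:
        return p1, p2
    p1 = max(r1, 3)  # German adjustment: at least 3 letters before R1
    r2 = _past_vowel_then_non_vowel(word, r1)  # R2 is found from the unadjusted R1
    if r2 is not None:
        p2 = r2
    return p1, p2


print(mark_regions("kaufen"))      # (4, 6)
print(mark_regions("aufenthalt"))  # (3, 5)
print(mark_regions("arm"))         # (3, 3)
```

For arm the standard definition would start R1 at offset 2, so the adjustment moves it to 3 and leaves R1 empty.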

Define a valid s-ending as one of b, d, f, g, h, k, l, m, n, r or t.

Define a valid st-ending as the same list, excluding letter r.

Define a valid et-ending as one of d, f, g, k, l, m, n, r, s, t, U, z or ä.

Do each of steps 1, 2, 3 and 4.

Step 1:

Search for the longest among the following suffixes,

(a) em (not preceded by syst [condition added in Snowball 3.0.0])
(b) ern er
(c) e en es
(d) s (preceded by a valid s-ending)
(e) erin erinnen [added in Snowball 3.0.0]
(f) ln lns [added in Snowball 3.0.0]

and, if it is in R1, delete it (for (a) to (e)) or replace it with l (for (f)). (Note that only the suffix needs to be in R1; the letter of the valid s-ending is not required to be.)

If an ending of group (c) is deleted, and the ending is preceded by niss, delete the final s.

(For example, äckern → äck, ackers → acker, armes → arm, bedürfnissen → bedürfnis)
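A minimal Python sketch of Step 1 follows, assuming the caller has already computed `p1`, the start offset of R1. The function name `step1` and the constants are hypothetical. As in the Snowball program at the end of this document, only the longest matching suffix is considered; if it is not in R1, nothing is removed.

```python
S_ENDING = set("bdfghklmnrt")

# Ordered longest first, so the first match is the longest suffix.
STEP1_SUFFIXES = ("erinnen", "erin", "ern", "lns",
                  "em", "er", "en", "es", "ln", "e", "s")


def step1(word: str, p1: int) -> str:
    """Illustrative Step 1; `p1` is the start offset of R1."""
    suffix = next((s for s in STEP1_SUFFIXES if word.endswith(s)), None)
    if suffix is None or len(word) - len(suffix) < p1:
        return word  # no suffix found, or the suffix is not in R1
    stem = word[:-len(suffix)]
    if suffix == "em":
        return word if stem.endswith("syst") else stem
    if suffix in ("ln", "lns"):
        return stem + "l"
    if suffix == "s":
        return stem if stem[-1:] in S_ENDING else word
    if suffix in ("e", "en", "es") and stem.endswith("niss"):
        return stem[:-1]  # bedürfniss -> bedürfnis
    return stem


# p1 is 3 for each of these words.
print(step1("äckern", 3))        # äck
print(step1("ackers", 3))        # acker
print(step1("armes", 3))         # arm
print(step1("bedürfnissen", 3))  # bedürfnis
print(step1("system", 3))        # system
```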

Step 2:

Search for the longest among the following suffixes,

(a) en er est
(b) st (preceded by a valid st-ending, itself preceded by at least 3 letters)
(c) et (preceded by a valid et-ending, and not preceded by any of geordn, intern, plan, tick or tr)

and delete the suffix if in R1.

(For example, derbsten → derbst by step 1, and derbst → derb by step 2, since b is a valid st-ending, and is preceded by just 3 letters)
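Step 2 can be sketched in the same style, again assuming `p1` is supplied by the caller. The function name `step2` and the constants are hypothetical. The excluded strings are tested against the end of the stem left after removing et, which is how the Snowball program at the end of this document checks them.

```python
ST_ENDING = set("bdfghklmnt")       # the s-ending list without r
ET_ENDING = set("dfgklmnrstUzä")
ET_EXCLUSIONS = ("geordn", "intern", "plan", "tick", "tr")


def step2(word: str, p1: int) -> str:
    """Illustrative Step 2; `p1` is the start offset of R1."""
    # "est" is listed before "st" so that the longest suffix wins.
    suffix = next((s for s in ("est", "en", "er", "st", "et")
                   if word.endswith(s)), None)
    if suffix is None or len(word) - len(suffix) < p1:
        return word  # no suffix found, or the suffix is not in R1
    stem = word[:-len(suffix)]
    if suffix == "st":
        # valid st-ending, itself preceded by at least 3 letters
        removable = len(stem) >= 4 and stem[-1] in ST_ENDING
    elif suffix == "et":
        removable = stem[-1:] in ET_ENDING and not stem.endswith(ET_EXCLUSIONS)
    else:
        removable = True
    return stem if removable else word


print(step2("derbst", 3))  # derb
print(step2("holest", 3))  # hol
print(step2("planet", 4))  # planet
```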

Step 3: d-suffixes (*)

Search for the longest among the following suffixes, and perform the action indicated.

end ung: delete if in R2; if preceded by ig, delete if in R2 and not preceded by e
ig ik isch: delete if in R2 and not preceded by e
lich heit: delete if in R2; if preceded by er or en, delete if in R1
keit: delete if in R2; if preceded by lich or ig, delete if in R2
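The Step 3 rules can be illustrated as below, assuming both region offsets `p1` and `p2` are supplied by the caller. The function name `step3` is hypothetical, and the sketch follows the rules as listed above rather than any particular implementation.

```python
def step3(word: str, p1: int, p2: int) -> str:
    """Illustrative Step 3; `p1` and `p2` are the start offsets of R1 and R2."""
    suffix = next((s for s in ("isch", "lich", "heit", "keit",
                               "end", "ung", "ig", "ik")
                   if word.endswith(s)), None)
    if suffix is None or len(word) - len(suffix) < p2:
        return word  # no suffix found, or the suffix is not in R2
    stem = word[:-len(suffix)]
    if suffix in ("end", "ung"):
        # a preceding "ig" also goes if it is in R2 and not preceded by "e"
        if (stem.endswith("ig") and not stem.endswith("eig")
                and len(stem) - 2 >= p2):
            stem = stem[:-2]
        return stem
    if suffix in ("ig", "ik", "isch"):
        return word if stem.endswith("e") else stem
    if suffix in ("lich", "heit"):
        # a preceding "er" or "en" also goes if it is in R1
        if stem.endswith(("er", "en")) and len(stem) - 2 >= p1:
            stem = stem[:-2]
        return stem
    # suffix == "keit": a preceding "lich" or "ig" also goes if it is in R2
    for inner in ("lich", "ig"):
        if stem.endswith(inner) and len(stem) - len(inner) >= p2:
            return stem[:-len(inner)]
    return stem


print(step3("kategorisch", 3, 5))      # kategor
print(step3("katholik", 3, 6))         # kathol
print(step3("auferstehung", 3, 5))     # aufersteh
print(step3("geschwindigkeit", 3, 8))  # geschwind
```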

Step 4: Apostrophe

Remove one of the following suffixes, if that leaves at least two characters: 's 'sch '
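A small sketch of the apostrophe step, under stated assumptions: the function name `step4` is hypothetical, and only the straight ASCII apostrophe is handled. Whether the stemmer also treats the typographic apostrophe (’) this way is not stated here.

```python
def step4(word: str) -> str:
    """Illustrative Step 4: strip 'sch, 's or a bare trailing apostrophe."""
    for suffix in ("'sch", "'s", "'"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # only remove if at least two characters remain
            return stem if len(stem) >= 2 else word
    return word


print(step4("andrea's"))      # andrea
print(step4("einstein'sch"))  # einstein
print(step4("bordeaux'"))     # bordeaux
print(step4("a's"))           # a's
```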

Finally,

turn U and Y back into lower case, and remove the umlaut accent from a, o and u.
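The final clean-up is a plain character mapping. A minimal sketch, with `postlude` as a hypothetical name:

```python
# Lower the U/Y markers and drop the umlaut accents.
_POSTLUDE_TABLE = str.maketrans("UYäöü", "uyaou")


def postlude(word: str) -> str:
    """Illustrative final step."""
    return word.translate(_POSTLUDE_TABLE)


print(postlude("feUer"))  # feuer
print(postlude("schön"))  # schon
```

The second call shows the false conflation mentioned in the design notes: schön and schon end up with the same stem.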

The same algorithm in Snowball

routines (
           prelude postlude
           mark_regions
           R1 R2
           standard_suffix
)

externals ( stem )

integers ( p1 p2 x )

groupings ( v et_ending s_ending st_ending )

stringescapes {}

/* special characters */

stringdef a"   '{U+00E4}'
stringdef o"   '{U+00F6}'
stringdef u"   '{U+00FC}'
stringdef ss   '{U+00DF}'

define v 'aeiouy{a"}{o"}{u"}'

define et_ending 'dfgklmnrstUz{a"}'
define s_ending  'bdfghklmnrt'
define st_ending s_ending - 'r'

define prelude as (

    test repeat goto (
        v [('u'] v <- 'U') or
           ('y'] v <- 'Y')
    )

    repeat (
        [substring] among(
            '{ss}' (<- 'ss')
            'ae'   (<- '{a"}')
            'oe'   (<- '{o"}')
            'ue'   (<- '{u"}')
            'qu'   ()
            ''     (next)
        )
    )

)

define mark_regions as (

    $p1 = limit
    $p2 = limit

    test(hop 3 setmark x)

    gopast v  gopast non-v  setmark p1
    try($p1 < x  $p1 = x)  // at least 3
    gopast v  gopast non-v  setmark p2

)

define postlude as repeat (

    [substring] among(
        'Y'    (<- 'y')
        'U'    (<- 'u')
        '{a"}' (<- 'a')
        '{o"}' (<- 'o')
        '{u"}' (<- 'u')
        ''     (next)
    )

)

backwardmode (

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define standard_suffix as (
        do (
            [substring] R1 among(
                'em'
                (   not 'syst' // don't remove -em from words ending -system
                    delete
                )
                'ern' 'er'
                'erin' 'erinnen' // conflate female versions of nouns
                (   delete
                )
                'e' 'en' 'es'
                (   delete
                    try (['s'] 'nis' delete)
                )
                's'
                (   s_ending delete
                )
                'ln' 'lns'
                (   <- 'l'
                )
            )
        )
        do (
            [substring] R1 among(
                'en' 'er' 'est'
                (   delete
                )
                'st'
                (   st_ending hop 3 delete
                )
                'et'
                (   test et_ending
                    not among (
                        'geordn' // Still conflate untergeordnet/untergeordnetere, etc.
                        'intern' // Don't conflate Internet and internes.
                        'plan' // Don't conflate Plan and Planet.
                        'tick' // Don't conflate Tick and Ticket.
                        'tr'   // Still conflate Vertreter/Vertretung, etc.
                    )
                    delete
                )
            )
        )
        do (
            [substring] R2 among(
                'end' 'ung'
                (   delete
                    try (['ig'] not 'e' R2 delete)
                )
                'ig' 'ik' 'isch'
                (   not 'e' delete
                )
                'lich' 'heit'
                (   delete
                    try (
                        ['er' or 'en'] R1 delete
                    )
                )
                'keit'
                (   delete
                    try (
                        [substring] R2 among(
                            'lich' 'ig'
                            (   delete
                            )
                        )
                    )
                )
            )
        )
    )
)

define stem as (
    do prelude
    do mark_regions
    backwards
        do standard_suffix
    do postlude
)
