Hungarian stemming algorithm

Links to resources

Here is a sample of Hungarian vocabulary, with the stemmed forms that will be generated by this algorithm:

word ⇒ stem

babaháznak ⇒ babaház
babakocsi ⇒ babakocs
babakocsijáért ⇒ babakocs
babakocsit ⇒ babakocs
babakocsiért ⇒ babakocs
babból ⇒ bab
bab ⇒ bab
babgulyás ⇒ babgulyás
babgulyást ⇒ babgulyás
babona ⇒ babon
babonákkal ⇒ babona
babonás ⇒ babonás
babrálgatta ⇒ babrálgatt
babrálni ⇒ babráln
babrál ⇒ babrál
babrált ⇒ babrál
babrálva ⇒ babrálv
babusgatnak ⇒ babusgat
baba ⇒ ba
babái ⇒ baba
babák ⇒ baba
babákkal ⇒ baba
babázni ⇒ babázn
babérfa ⇒ babérf
babérokat ⇒ babér
babért ⇒ bab
bacchánsnők ⇒ bacchánsnő
badacsonyi ⇒ badacsony
badarság ⇒ badarság
badarságok ⇒ badarság
baedeker ⇒ baedeker
baglyokat ⇒ bagly
bagolyszemüveges ⇒ bagolyszemüveges
bagót ⇒ bagó
bajbajutott ⇒ bajbajutot
bajbajutottak ⇒ bajbajutott
bajbajutottakat ⇒ bajbajutott
bajbajutottakon ⇒ bajbajutott
bajlódjanak ⇒ bajlód
bajlódni ⇒ bajlódn

muattta ⇒ muattt
mukkot ⇒ muk
mulandóság ⇒ mulandóság
mulandóságot ⇒ mulandóság
mulasszátok ⇒ mulasszát
mulasztanak ⇒ mulaszt
mulasztotta ⇒ mulasztott
mulasztottam ⇒ mulasztott
mulasztották ⇒ mulasztotta
mulaszt ⇒ mulasz
mulaszthatom ⇒ mulaszthat
mulasztás ⇒ mulasztás
mulasztásban ⇒ mulasztás
mulasztásból ⇒ mulasztás
mulasztásnál ⇒ mulasztás
mulasztással ⇒ mulasztás
mulasztásának ⇒ mulasztás
mulasztásánál ⇒ mulasztás
mulasztásáért ⇒ mulasztás
mulasztási ⇒ mulasztás
mulasztásos ⇒ mulasztásos
mulasztó ⇒ mulasztó
mulathatnánk ⇒ mulathatna
mulathattunk ⇒ mulathatt
mulatna ⇒ mulatn
mulat ⇒ mul
mulatnak ⇒ mulat
mulatni ⇒ mulatn
mulattak ⇒ mulatt
mulattat ⇒ mulatt
mulattatta ⇒ mulattatt
mulatott ⇒ mulatot
mulatozott ⇒ mulatozot
mulatozáshoz ⇒ mulatozás
mulatozást ⇒ mulatozás
mulatság ⇒ mulatság
mulatságnak ⇒ mulatság
mulatságot ⇒ mulatság
mulatságos ⇒ mulatságos
mulatt ⇒ mulat

Design Notes

The algorithm is described in the paper as "Light". It primarily aims to remove noun inflections ("all noun cases, plural and frequent owners"). This means it also stems adjectives ("the two are linked because of similar morphology"). Some verb forms are stemmed, but really only as a side-effect when a rule to remove a noun suffix matches a verb form as well. The paper presents the case that stemming verbs matters less for retrieval.

The Hungarian language has these digraphs:

cs dz dzs gy ly ny sz ty zs

However, treating these specially makes no difference to the results of the algorithm on valid Hungarian words, so (since Snowball 3.0.0) the algorithm doesn't treat digraphs specially, except that some of the double consonants include digraphs, e.g. ccs.

Algorithm Description

This stemming algorithm removes the inflectional suffixes of nouns. Nouns are inflected for case, person/possession and number.

Letters in Hungarian include the following accented forms:

á é í ó ö ő ú ü ű

The following letters are vowels:

a á e é i í o ó ö ő u ú ü ű

For the purposes of this algorithm we define a consonant as a character which is not a vowel.

A double consonant is defined as:

bb cc ccs dd ff gg ggy jj kk ll lly mm nn nny pp rr ss ssz tt tty vv zz zzs

If the word begins with a vowel, R1 is defined as the region after the first consonant in the word. If the word begins with a consonant, it is defined as the region after the first vowel in the word. If the word does not contain both a vowel and consonant, R1 is the null region at the end of the word.

For example:

    t ó b a n           consonant-vowel
       |.....|          R1 is 'b a n'

    a b l a k a n       vowel-consonant
       |.........|      R1 is 'l a k a n'

    a c s o n y         vowel-consonant (cs is not treated as a digraph)
       |.......|        R1 is 's o n y'

    c v s
     --->|<---          null R1 region
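
The R1 definition above can be sketched in a few lines of Python. This is an illustrative sketch, not the Snowball implementation: the names `VOWELS` and `r1_start` are chosen here, the input is assumed to be lowercase with precomposed (NFC) accented letters, and the function returns the index at which R1 begins (the word length when R1 is the null region).

```python
VOWELS = set("aáeéiíoóöőuúüű")

def r1_start(word):
    """Return the index where R1 begins, or len(word) if R1 is null."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        # The first character of the opposite class closes the prefix;
        # R1 is everything after that character.
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

for w in ("tóban", "ablakan", "cvs"):
    print(w, repr(w[r1_start(w):]))
```

A suffix is then "in R1" when its starting index is at least `r1_start(word)`.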

‘Delete if in R1’ means that the suffix should be removed if it is in region R1 but not if it is outside.

Do steps 1 to 9 in turn.

Step 1: Remove instrumental case

Search for one of the following suffixes and perform the action indicated.

al el: delete if in R1 and preceded by a double consonant, and remove one of the double consonants. (In the case of consonant plus digraph, such as ccs, remove a c).
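
As a rough illustration of step 1 in isolation, the Python sketch below deletes al or el when it lies in R1 and follows a double consonant, then drops the second-to-last remaining character (so kk becomes k and ccs becomes cs). The helper names `r1_start` and `remove_instrumental` are assumptions made for this example; input is assumed lowercase and NFC-normalised.

```python
VOWELS = set("aáeéiíoóöőuúüű")
DOUBLES = ("bb", "cc", "ccs", "dd", "ff", "gg", "ggy", "jj", "kk", "ll",
           "lly", "mm", "nn", "nny", "pp", "rr", "ss", "ssz", "tt", "tty",
           "vv", "zz", "zzs")

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def remove_instrumental(word):
    """Step 1 only: strip 'al'/'el' after a double consonant and undouble."""
    p1 = r1_start(word)
    for suffix in ("al", "el"):
        rest = word[:-len(suffix)]
        if (word.endswith(suffix) and len(rest) >= p1
                and rest.endswith(DOUBLES)):
            # Remove the second-to-last character of what remains.
            return rest[:-2] + rest[-1]
    return word

print(remove_instrumental("babonákkal"))
print(remove_instrumental("kulccsal"))
```

The word asztal is left unchanged by this step, because the consonants before al do not form one of the listed double consonants.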

Step 2: Remove frequent cases

Search for the longest among the following suffixes and perform the action indicated.

ban ben ba be ra re nak nek val vel tól től ról ről ból ből hoz hez höz nál nél ig at et ot öt ért képp képpen kor ul ül vá vé onként enként anként ként en on an ön n t: delete if in R1; if the remaining word ends á replace by a; if the remaining word ends é replace by e
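
Step 2 can be sketched in Python as follows. The names `CASE_SUFFIXES`, `r1_start` and `remove_case` are illustrative, and the function recomputes R1 on each call for convenience (the Snowball source marks R1 once before step 1). Two details follow the Snowball source given later in this document rather than the prose: only the longest matching suffix is considered, with no fallback to a shorter one if it lies outside R1, and the final á or é must itself be in R1 to be rewritten.

```python
VOWELS = set("aáeéiíoóöőuúüű")
CASE_SUFFIXES = (
    "ban ben ba be ra re nak nek val vel tól től ról ről ból ből "
    "hoz hez höz nál nél ig at et ot öt ért képp képpen kor ul ül "
    "vá vé onként enként anként ként en on an ön n t"
).split()

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def remove_case(word):
    """Step 2 only: delete the longest case suffix if it is in R1."""
    p1 = r1_start(word)
    matches = [s for s in CASE_SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    start = len(word) - len(max(matches, key=len))
    if start < p1:
        return word  # longest suffix is outside R1; shorter ones are not tried
    word = word[:start]
    # Shorten a final long vowel, provided it lies in R1.
    if len(word) > p1 and word[-1] in "áé":
        word = word[:-1] + ("a" if word[-1] == "á" else "e")
    return word

print(remove_case("babaháznak"))
print(remove_case("almát"))
```

For the word ben the longest match is the whole word, which is outside R1, so nothing is removed even though the shorter suffix n would be in R1.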

Step 3: Remove special cases

Search for the longest among the following suffixes and perform the action indicated.

án ánként: replace by a if in R1
én: replace by e if in R1
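
A table-driven Python sketch of step 3, applied on its own. `CASE_SPECIAL`, `r1_start` and `apply_longest` are hypothetical names for this example; the table maps each suffix to its replacement.

```python
VOWELS = set("aáeéiíoóöőuúüű")
CASE_SPECIAL = {"án": "a", "ánként": "a", "én": "e"}

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def apply_longest(word, table):
    """Replace the longest matching suffix from table, if it is in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < r1_start(word):
        return word
    return word[:start] + table[suffix]

print(apply_longest("estén", CASE_SPECIAL))
print(apply_longest("óránként", CASE_SPECIAL))
```

The word én itself is left alone, since the suffix would start before R1.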

Step 4: Remove other cases

Search for the longest among the following suffixes and perform the action indicated.

astul estül stul stül: delete if in R1
ástul: replace with a if in R1
éstül: replace with e if in R1
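
Step 4 in isolation might look like this in Python. The names `CASE_OTHER`, `r1_start` and `apply_longest` are assumptions of the sketch; an empty replacement string stands for "delete".

```python
VOWELS = set("aáeéiíoóöőuúüű")
CASE_OTHER = {
    "astul": "", "estül": "", "stul": "", "stül": "",
    "ástul": "a",
    "éstül": "e",
}

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def apply_longest(word, table):
    """Replace the longest matching suffix from table, if it is in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < r1_start(word):
        return word
    return word[:start] + table[suffix]

print(apply_longest("ruhástul", CASE_OTHER))
print(apply_longest("gyerekestül", CASE_OTHER))
```

In ruhástul both stul and ástul match; the longer one wins, so the result ends in a rather than á.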

Step 5: Remove factive case

Search for one of the following suffixes and perform the action indicated.

á é: delete if in R1 and preceded by a double consonant, and remove one of the double consonants (as in step 1).
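
Step 5 mirrors step 1 with single-letter suffixes. A minimal Python sketch, with `remove_factive` and `r1_start` as illustrative names:

```python
VOWELS = set("aáeéiíoóöőuúüű")
DOUBLES = ("bb", "cc", "ccs", "dd", "ff", "gg", "ggy", "jj", "kk", "ll",
           "lly", "mm", "nn", "nny", "pp", "rr", "ss", "ssz", "tt", "tty",
           "vv", "zz", "zzs")

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def remove_factive(word):
    """Step 5 only: strip final á/é after a double consonant and undouble."""
    rest = word[:-1]
    if (word.endswith(("á", "é")) and len(rest) >= r1_start(word)
            and rest.endswith(DOUBLES)):
        # Remove the second-to-last character of what remains.
        return rest[:-2] + rest[-1]
    return word

print(remove_factive("széppé"))
```

A word such as kávé is untouched, because the é is not preceded by a double consonant.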

Step 6: Remove owned

Search for the longest among the following suffixes and perform the action indicated.

oké öké aké eké ké éi é: delete if in R1
áké áéi: replace with a if in R1
éké ééi éé: replace with e if in R1
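
The three rules of step 6 fit into one lookup table. The Python below is a sketch under the same assumptions as the step descriptions (longest match, then the R1 test); `OWNED`, `r1_start` and `apply_longest` are names invented for the example.

```python
VOWELS = set("aáeéiíoóöőuúüű")
OWNED = {s: "" for s in "oké öké aké eké ké éi é".split()}
OWNED.update({s: "a" for s in "áké áéi".split()})
OWNED.update({s: "e" for s in "éké ééi éé".split()})

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def apply_longest(word, table):
    """Replace the longest matching suffix from table, if it is in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < r1_start(word):
        return word
    return word[:start] + table[suffix]

print(apply_longest("gyerekeké", OWNED))
print(apply_longest("apáké", OWNED))
```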

Step 7: Remove singular owner suffixes

Search for the longest among the following suffixes and perform the action indicated.

ünk unk nk juk jük uk ük em om am m od ed ad öd d ja je a e o: delete if in R1
ánk ájuk ám ád á: replace with a if in R1
énk éjük ém éd é: replace with e if in R1
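
Step 7 has the largest mix of deletions and replacements, so a table keeps it readable. This Python sketch treats the step on its own; `SING_OWNER`, `r1_start` and `apply_longest` are illustrative names only.

```python
VOWELS = set("aáeéiíoóöőuúüű")
SING_OWNER = {s: "" for s in (
    "ünk unk nk juk jük uk ük em om am m od ed ad öd d ja je a e o").split()}
SING_OWNER.update({s: "a" for s in "ánk ájuk ám ád á".split()})
SING_OWNER.update({s: "e" for s in "énk éjük ém éd é".split()})

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def apply_longest(word, table):
    """Replace the longest matching suffix from table, if it is in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < r1_start(word):
        return word
    return word[:start] + table[suffix]

for w in ("házunk", "almám", "kertje"):
    print(w, apply_longest(w, SING_OWNER))
```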

Step 8: Remove plural owner suffixes

Search for the longest among the following suffixes and perform the action indicated.

jaim jeim aim eim im jaid jeid aid eid id jai jei ai ei i jaink jeink eink aink ink jaitok jeitek aitok eitek itek jeik jaik aik eik ik: delete if in R1
áim áid ái áink áitok áik: replace with a if in R1
éim éid éi éink éitek éik: replace with e if in R1
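
Step 8 follows the same pattern with a longer suffix list. The Python sketch below is illustrative (`PLUR_OWNER`, `r1_start` and `apply_longest` are not names from the Snowball source) and handles only this step.

```python
VOWELS = set("aáeéiíoóöőuúüű")
PLUR_OWNER = {s: "" for s in (
    "jaim jeim aim eim im jaid jeid aid eid id jai jei ai ei i "
    "jaink jeink eink aink ink jaitok jeitek aitok eitek itek "
    "jeik jaik aik eik ik").split()}
PLUR_OWNER.update({s: "a" for s in "áim áid ái áink áitok áik".split()})
PLUR_OWNER.update({s: "e" for s in "éim éid éi éink éitek éik".split()})

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def apply_longest(word, table):
    """Replace the longest matching suffix from table, if it is in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < r1_start(word):
        return word
    return word[:start] + table[suffix]

for w in ("házaim", "almáink", "kertjeitek"):
    print(w, apply_longest(w, PLUR_OWNER))
```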

Step 9: Remove plural suffixes

Search for the longest among the following suffixes and perform the action indicated.

ák: replace with a if in R1
ék: replace with e if in R1
ök ok ek ak k: delete if in R1
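
A short Python sketch of step 9 on its own, which also shows why the R1 test is applied to the longest match rather than to any match. `PLURAL`, `r1_start` and `apply_longest` are names assumed for the example.

```python
VOWELS = set("aáeéiíoóöőuúüű")
PLURAL = {"ák": "a", "ék": "e", "ök": "", "ok": "", "ek": "", "ak": "", "k": ""}

def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    if not word:
        return 0
    first_is_vowel = word[0] in VOWELS
    for i, ch in enumerate(word):
        if (ch in VOWELS) != first_is_vowel:
            return i + 1
    return len(word)

def apply_longest(word, table):
    """Replace the longest matching suffix from table, if it is in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < r1_start(word):
        return word  # no fallback to a shorter suffix
    return word[:start] + table[suffix]

for w in ("babák", "badarságok", "ok"):
    print(w, apply_longest(w, PLURAL))
```

The two-letter word ok matches both ok and k; the longest match ok lies outside R1, so the word is returned unchanged.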

History of functional changes to the algorithm

Sep 2006: Contributed by Anna Tordai, University of Amsterdam. The paper linked above describes evaluation of four variants of the algorithm, but does not describe the details of the algorithm itself. It seems the contributed algorithm corresponds to Light2 in the paper.
Sep 2014: Fixed encoding problem: õ was being used instead of ő and û instead of ű.
Snowball 3.0.0: Removed special handling of digraphs. We were ensuring R1 didn't start in the middle of a digraph (except that "dz" was missing from the Snowball implementation although included in the algorithm description). However having R1 start in the middle of a digraph would only make a difference to the stemming if we removed a suffix that started with the last character of the digraph (or with "zs" in the case of "dzs").

No suffixes we remove start with y or z.

Two suffixes start with s (stul and stül) so removing special handling of cs and dzs makes a difference to some inputs but not to any inputs which are valid Hungarian words.

Removing the digraph handling speeds up stemming (by ~2% on the current sample vocabulary list).

The full algorithm in Snowball

/*
Hungarian Stemmer
Removes noun inflections
*/

routines (
    mark_regions
    R1
    v_ending
    case
    case_special
    case_other
    plural
    owned
    sing_owner
    plur_owner
    instrum
    factive
    undouble
    double
)

externals ( stem )

integers ( p1 )
groupings ( v )

stringescapes {}

/* special characters */

stringdef a'  '{U+00E1}'  //a-acute
stringdef e'  '{U+00E9}'  //e-acute
stringdef i'  '{U+00ED}'  //i-acute
stringdef o'  '{U+00F3}'  //o-acute
stringdef o"  '{U+00F6}'  //o-umlaut
stringdef oq  '{U+0151}'  //o-double acute
stringdef u'  '{U+00FA}'  //u-acute
stringdef u"  '{U+00FC}'  //u-umlaut
stringdef uq  '{U+0171}'  //u-double acute

define v 'aeiou{a'}{e'}{i'}{o'}{o"}{oq}{u'}{u"}{uq}'

define mark_regions as (

    $p1 = limit

    (
        // Word starts with a vowel, start R1 after: V...C
        v
        do (gopast non-v setmark p1)
    ) or (
        // Word starts with a non-vowel, start R1 after: C...V
        gopast v setmark p1
    )
)

backwardmode (

    define R1 as $p1 <= cursor

    define v_ending as (
        [substring] R1 among(
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define double as (
        test among('bb' 'cc' 'ccs' 'dd' 'ff' 'gg' 'ggy' 'jj' 'kk' 'll' 'lly' 'mm'
        'nn' 'nny' 'pp' 'rr' 'ss' 'ssz' 'tt' 'tty' 'vv' 'zz' 'zzs')
    )

    define undouble as (
        next [hop 1] delete
    )

    define instrum as(
        [substring] R1 among(
            'al' (double)
            'el' (double)
        )
        delete
        undouble
    )


    define case as (
        [substring] R1 among(
            'ban' 'ben'
            'ba' 'be'
            'ra' 're'
            'nak' 'nek'
            'val' 'vel'
            't{o'}l' 't{oq}l'
            'r{o'}l' 'r{oq}l'
            'b{o'}l' 'b{oq}l'
            'hoz' 'hez' 'h{o"}z'
            'n{a'}l' 'n{e'}l'
            'ig'
            'at' 'et' 'ot' '{o"}t'
            '{e'}rt'
            'k{e'}pp' 'k{e'}ppen'
            'kor'
            'ul' '{u"}l'
            'v{a'}' 'v{e'}'
            'onk{e'}nt' 'enk{e'}nt' 'ank{e'}nt'
            'k{e'}nt'
            'en' 'on' 'an' '{o"}n'
            'n'
            't'
        )
        delete
        v_ending
    )

    define case_special as(
        [substring] R1 among(
            '{e'}n' (<- 'e')
            '{a'}n' (<- 'a')
            '{a'}nk{e'}nt' (<- 'a')
        )
    )

    define case_other as(
        [substring] R1 among(
            'astul' 'est{u"}l' (delete)
            'stul' 'st{u"}l' (delete)
            '{a'}stul' (<- 'a')
            '{e'}st{u"}l' (<- 'e')
        )
    )

    define factive as(
        [substring] R1 among(
            '{a'}' (double)
            '{e'}' (double)
        )
        delete
        undouble
    )

    define plural as (
        [substring] R1 among(
            '{a'}k' (<- 'a')
            '{e'}k' (<- 'e')
            '{o"}k' (delete)
            'ak' (delete)
            'ok' (delete)
            'ek' (delete)
            'k' (delete)
        )
    )

    define owned as (
        [substring] R1 among (
            'ok{e'}' '{o"}k{e'}' 'ak{e'}' 'ek{e'}' (delete)
            '{e'}k{e'}' (<- 'e')
            '{a'}k{e'}' (<- 'a')
            'k{e'}' (delete)
            '{e'}{e'}i' (<- 'e')
            '{a'}{e'}i' (<- 'a')
            '{e'}i'  (delete)
            '{e'}{e'}' (<- 'e')
            '{e'}' (delete)
        )
    )

    define sing_owner as (
        [substring] R1 among(
            '{u"}nk' 'unk' (delete)
            '{a'}nk' (<- 'a')
            '{e'}nk' (<- 'e')
            'nk' (delete)
            '{a'}juk' (<- 'a')
            '{e'}j{u"}k' (<- 'e')
            'juk' 'j{u"}k' (delete)
            'uk' '{u"}k' (delete)
            'em' 'om' 'am' (delete)
            '{a'}m' (<- 'a')
            '{e'}m' (<- 'e')
            'm' (delete)
            'od' 'ed' 'ad' '{o"}d' (delete)
            '{a'}d' (<- 'a')
            '{e'}d' (<- 'e')
            'd' (delete)
            'ja' 'je' (delete)
            'a' 'e' 'o' (delete)
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define plur_owner as (
        [substring] R1 among(
            'jaim' 'jeim' (delete)
            '{a'}im' (<- 'a')
            '{e'}im' (<- 'e')
            'aim' 'eim' (delete)
            'im' (delete)
            'jaid' 'jeid' (delete)
            '{a'}id' (<- 'a')
            '{e'}id' (<- 'e')
            'aid' 'eid' (delete)
            'id' (delete)
            'jai' 'jei' (delete)
            '{a'}i' (<- 'a')
            '{e'}i' (<- 'e')
            'ai' 'ei' (delete)
            'i' (delete)
            'jaink' 'jeink' (delete)
            'eink' 'aink' (delete)
            '{a'}ink' (<- 'a')
            '{e'}ink' (<- 'e')
            'ink'
            'jaitok' 'jeitek' (delete)
            'aitok' 'eitek' (delete)
            '{a'}itok' (<- 'a')
            '{e'}itek' (<- 'e')
            'itek' (delete)
            'jeik' 'jaik' (delete)
            'aik' 'eik' (delete)
            '{a'}ik' (<- 'a')
            '{e'}ik' (<- 'e')
            'ik' (delete)
        )
    )
)

define stem as (
    do mark_regions
    backwards (
        do instrum
        do case
        do case_special
        do case_other
        do factive
        do owned
        do sing_owner
        do plur_owner
        do plural
    )
)