Hungarian stemming algorithm

Contributed by Anna Tordai, University of Amsterdam

Here is a sample of Hungarian vocabulary, with the stemmed forms that will be generated by this algorithm:

word              stem              word           stem
babaháznak        babaház           muattta        muattt
babakocsi         babakocs          mukkot         muk
babakocsijáért    babakocs          mulandóság     mulandóság
babakocsit        babakocs          mulandóságot   mulandóság
babakocsiért      babakocs          mulasszátok    mulasszát
babból            bab               mulasztanak    mulaszt
bab               bab               mulasztotta    mulasztott
babgulyás         babgulyás         mulasztottam   mulasztott
babgulyást        babgulyás         mulasztották   mulasztotta
babona            babon             mulaszt        mulasz
babonákkal        babona            mulaszthatom   mulaszthat
babonás           babonás           mulasztás      mulasztás
babrálgatta       babrálgatt        mulasztásban   mulasztás
babrálni          babráln           mulasztásból   mulasztás
babrál            babrál            mulasztásnál   mulasztás
babrált           babrál            mulasztással   mulasztás
babrálva          babrálv           mulasztásának  mulasztás
babusgatnak       babusgat          mulasztásánál  mulasztás
baba              ba                mulasztásáért  mulasztás
babái             baba              mulasztási     mulasztás
babák             baba              mulasztásos    mulasztásos
babákkal          baba              mulasztó       mulasztó
babázni           babázn            mulathatnánk   mulathatna
babérfa           babérf            mulathattunk   mulathatt
babérokat         babér             mulatna        mulatn
babért            bab               mulat          mul
bacchánsnők       bacchánsnő        mulatnak       mulat
badacsonyi        badacsony         mulatni        mulatn
badarság          badarság          mulattak       mulatt
badarságok        badarság          mulattat       mulatt
baedeker          baedeker          mulattatta     mulattatt
baglyokat         bagly             mulatott       mulatot
bagolyszemüveges  bagolyszemüveges  mulatozott     mulatozot
bagót             bagó              mulatozáshoz   mulatozás
bajbajutott       bajbajutot        mulatozást     mulatozás
bajbajutottak     bajbajutott       mulatság       mulatság
bajbajutottakat   bajbajutott       mulatságnak    mulatság
bajbajutottakon   bajbajutott       mulatságot     mulatság
bajlódjanak       bajlód            mulatságos     mulatságos
bajlódni          bajlódn           mulatt         mulat

This stemming algorithm removes the inflectional suffixes of nouns. Nouns are inflected for case, person/possession and number.

Letters in Hungarian include the following accented forms:

á   é   í   ó   ö   ő   ú   ü   ű

The following letters are vowels:

a   á   e   é   i   í   o   ó   ö   ő   u   ú   ü   ű

The following letters are digraphs:

cs   dz   dzs   gy   ly   ny   ty   zs

A double consonant is defined as:

bb   cc   ccs   dd   ff   gg   ggy   jj   kk   ll   lly   mm   nn   nny   pp   rr   ss   ssz   tt   tty   vv   zz   zzs

If the word begins with a vowel, R1 is defined as the region after the first consonant or digraph in the word. If the word begins with a consonant, R1 is defined as the region after the first vowel in the word. If the word does not contain both a vowel and a consonant, R1 is the null region at the end of the word.

For example:

    t ó b a n           consonant-vowel
       |.....|          R1 is 'b a n'

    a b l a k a n       vowel-consonant
       |.........|      R1 is 'l a k a n'

    a c s o n y         vowel-digraph
         |.....|        R1 is 'o n y'

    c v s
     --->|<---          null R1 region
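
The R1 rule can be sketched in Python as follows. This is an illustrative sketch, not the reference implementation: the name `r1_start` is hypothetical, and the digraph set is taken from the `mark_regions` routine in the Snowball source at the end of this page (which includes `sz` and omits `dz`), on the assumption that the source is authoritative.

```python
# Assumption: vowel and digraph sets follow the Snowball source on this page.
VOWELS = set("aáeéiíoóöőuúüű")
DIGRAPHS = ("dzs", "cs", "gy", "ly", "ny", "sz", "ty", "zs")  # longest first


def r1_start(word):
    """Return the index at which R1 begins; len(word) means a null R1."""
    n = len(word)
    if n == 0:
        return 0
    if word[0] in VOWELS:
        # Skip the leading vowels to reach the first consonant.
        i = 0
        while i < n and word[i] in VOWELS:
            i += 1
        if i == n:
            return n  # no consonant at all: null R1
        # R1 starts after that consonant, or after the digraph it begins.
        for digraph in DIGRAPHS:
            if word.startswith(digraph, i):
                return i + len(digraph)
        return i + 1
    # Word begins with a consonant: R1 starts after the first vowel.
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return i + 1
    return n  # no vowel at all: null R1
```

For the examples above, `"tóban"[r1_start("tóban"):]` gives `ban` and `"acsony"[r1_start("acsony"):]` gives `ony`.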

‘Delete if in R1’ means that the suffix should be removed if it is in region R1 but not if it is outside.

Do steps 1 to 9 in turn.

Step 1: Remove instrumental case

Search for one of the following suffixes and perform the action indicated.
al   el
delete if in R1 and preceded by a double consonant, and remove one of the double consonants. (In the case of consonant plus digraph, such as ccs, remove a c).
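
As a rough illustration of step 1, the Python sketch below removes `al` or `el` and then drops the second-to-last remaining letter. That turns `kk` into `k` and `ccs` into `cs`. The function name `remove_instrumental` is hypothetical, and `p1` (the index where R1 starts) is assumed to be computed by the caller.

```python
DOUBLES = ("bb", "cc", "ccs", "dd", "ff", "gg", "ggy", "jj", "kk", "ll", "lly",
           "mm", "nn", "nny", "pp", "rr", "ss", "ssz", "tt", "tty", "vv", "zz",
           "zzs")


def remove_instrumental(word, p1):
    """Step 1 sketch; p1 is the index where R1 starts (supplied by caller)."""
    for suffix in ("al", "el"):
        start = len(word) - len(suffix)
        if word.endswith(suffix) and start >= p1:
            rest = word[:start]
            if rest.endswith(DOUBLES):
                # Undouble: delete the second-to-last letter of what remains.
                return rest[:-2] + rest[-1]
    return word
```

With `p1 = 2`, `remove_instrumental("babákkal", 2)` yields `babák`; a word such as `asztal`, where `al` is not preceded by a double consonant, is left unchanged.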

Step 2: Remove frequent cases

Search for the longest among the following suffixes and perform the action indicated.
ban   ben   ba   be   ra   re   nak   nek   val   vel   tól   től   ról   ről   ból   ből   hoz   hez   höz   nál   nél   ig   at   et   ot   öt   ért   képp   képpen   kor   ul   ül   vá   vé   onként   enként   anként   ként   en   on   an   ön   n   t
delete if in R1
if the remaining word ends in á, replace it by a
if the remaining word ends in é, replace it by e
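
A minimal Python sketch of step 2, under two assumptions taken from the Snowball source at the end of this page: the longest matching suffix is chosen first and only then tested against R1 (with no fallback to a shorter suffix), and the final á/é is shortened only if it also lies in R1. The name `remove_case` is hypothetical, and `p1` is the R1 start index supplied by the caller.

```python
CASE_SUFFIXES = (
    "ban", "ben", "ba", "be", "ra", "re", "nak", "nek", "val", "vel",
    "tól", "től", "ról", "ről", "ból", "ből", "hoz", "hez", "höz",
    "nál", "nél", "ig", "at", "et", "ot", "öt", "ért", "képp", "képpen",
    "kor", "ul", "ül", "vá", "vé", "onként", "enként", "anként", "ként",
    "en", "on", "an", "ön", "n", "t",
)
SHORTEN = {"á": "a", "é": "e"}


def remove_case(word, p1):
    """Step 2 sketch; p1 is the index where R1 starts (supplied by caller)."""
    matches = [s for s in CASE_SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)      # longest match wins
    start = len(word) - len(suffix)
    if start < p1:                      # longest match is outside R1: give up
        return word
    word = word[:start]
    # Shorten a final á/é, but only if that letter is itself inside R1.
    if word and word[-1] in SHORTEN and len(word) - 1 >= p1:
        word = word[:-1] + SHORTEN[word[-1]]
    return word
```

For instance, `remove_case("babaháznak", 2)` gives `babaház` and `remove_case("almát", 2)` gives `alma`.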

Step 3: Remove special cases

Search for the longest among the following suffixes and perform the action indicated.
án   ánként
replace by a if in R1
én
replace by e if in R1

Step 4: Remove other cases

Search for the longest among the following suffixes and perform the action indicated.
astul   estül   stul   stül
delete if in R1
ástul
replace with a if in R1
éstül
replace with e if in R1
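
Steps like this one pair each suffix with either a deletion or a one-letter replacement, so they can be expressed as a lookup table. The Python sketch below shows that idea for step 4. The names `CASE_OTHER` and `apply_longest` are hypothetical, `p1` is the R1 start index supplied by the caller, and the example words are illustrative inputs only.

```python
# Suffix -> replacement; an empty string means "delete".
CASE_OTHER = {
    "astul": "", "estül": "", "stul": "", "stül": "",
    "ástul": "a",
    "éstül": "e",
}


def apply_longest(word, p1, table):
    """Apply the action of the longest matching suffix if it lies in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < p1:
        return word                     # suffix not in R1: leave word alone
    return word[:start] + table[suffix]
```

For example, `apply_longest("ruhástul", 2, CASE_OTHER)` gives `ruha`, because `ástul` is preferred over the shorter `stul`.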

Step 5: Remove factive case

Search for one of the following suffixes and perform the action indicated.
á   é
delete if in R1 and preceded by a double consonant, and remove one of the double consonants (as in step 1).

Step 6: Remove owned

Search for the longest among the following suffixes and perform the action indicated.
oké   öké   aké   eké   ké   éi   é
delete if in R1
áké   áéi
replace with a if in R1
éké   ééi   éé
replace with e if in R1

Step 7: Remove singular owner suffixes

Search for the longest among the following suffixes and perform the action indicated.
ünk   unk   nk   juk   jük   uk   ük   em   om   am   m   od   ed   ad   öd   d   ja   je   a   e   o
delete if in R1
ánk   ájuk   ám   ád   á
replace with a if in R1
énk   éjük   ém   éd   é
replace with e if in R1

Step 8: Remove plural owner suffixes

Search for the longest among the following suffixes and perform the action indicated.
jaim   jeim   aim   eim   im   jaid   jeid   aid   eid   id   jai   jei   ai   ei   i   jaink   jeink   eink   aink   ink   jaitok   jeitek   aitok   eitek   itek   jeik   jaik   aik   eik   ik
delete if in R1
áim   áid   ái   áink   áitok   áik
replace with a if in R1
éim   éid   éi   éink   éitek   éik
replace with e if in R1

Step 9: Remove plural suffixes

Search for the longest among the following suffixes and perform the action indicated.
ák
replace with a if in R1
ék
replace with e if in R1
ök   ok   ek   ak   k
delete if in R1
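
The plural step shows why the longest match matters: `babák` ends in both `ák` and `k`, and only the longer suffix restores the short stem vowel. A small Python sketch of this step follows. The name `strip_plural` is hypothetical, `p1` is the R1 start index supplied by the caller, and the `ék` entry is taken from the `plural` routine in the Snowball source at the end of this page.

```python
# Suffix -> replacement; an empty string means "delete".
PLURAL = {"ák": "a", "ék": "e", "ök": "", "ok": "", "ek": "", "ak": "", "k": ""}


def strip_plural(word, p1):
    """Step 9 sketch; p1 is the index where R1 starts (supplied by caller)."""
    matches = [s for s in PLURAL if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)      # prefer 'ák' over plain 'k'
    start = len(word) - len(suffix)
    if start < p1:
        return word
    return word[:start] + PLURAL[suffix]
```

With `p1 = 2`, `strip_plural("babák", 2)` gives `baba` and `strip_plural("badarságok", 2)` gives `badarság`, matching the sample vocabulary at the top of this page.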

The full algorithm in Snowball

/*
Hungarian Stemmer
Removes noun inflections
*/

routines (
    mark_regions
    R1
    v_ending
    case
    case_special
    case_other
    plural
    owned
    sing_owner
    plur_owner
    instrum
    factive
    undouble
    double
)

externals ( stem )

integers ( p1 )
groupings ( v )

stringescapes {}

/* special characters */

stringdef a'  '{U+00E1}'  //a-acute
stringdef e'  '{U+00E9}'  //e-acute
stringdef i'  '{U+00ED}'  //i-acute
stringdef o'  '{U+00F3}'  //o-acute
stringdef o"  '{U+00F6}'  //o-umlaut
stringdef oq  '{U+0151}'  //o-double acute
stringdef u'  '{U+00FA}'  //u-acute
stringdef u"  '{U+00FC}'  //u-umlaut
stringdef uq  '{U+0171}'  //u-double acute

define v 'aeiou{a'}{e'}{i'}{o'}{o"}{oq}{u'}{u"}{uq}'

define mark_regions as (

    $p1 = limit

    (v goto non-v
     among('cs' 'gy' 'ly' 'ny' 'sz' 'ty' 'zs' 'dzs') or next
     setmark p1)
    or

    (non-v gopast v setmark p1)
)

backwardmode (

    define R1 as $p1 <= cursor

    define v_ending as (
        [substring] R1 among(
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define double as (
        test among('bb' 'cc' 'ccs' 'dd' 'ff' 'gg' 'ggy' 'jj' 'kk' 'll' 'lly' 'mm'
        'nn' 'nny' 'pp' 'rr' 'ss' 'ssz' 'tt' 'tty' 'vv' 'zz' 'zzs')
    )

    define undouble as (
        next [hop 1] delete
    )

    define instrum as(
        [substring] R1 among(
            'al' (double)
            'el' (double)
        )
        delete
        undouble
    )


    define case as (
        [substring] R1 among(
            'ban' 'ben'
            'ba' 'be'
            'ra' 're'
            'nak' 'nek'
            'val' 'vel'
            't{o'}l' 't{oq}l'
            'r{o'}l' 'r{oq}l'
            'b{o'}l' 'b{oq}l'
            'hoz' 'hez' 'h{o"}z'
            'n{a'}l' 'n{e'}l'
            'ig'
            'at' 'et' 'ot' '{o"}t'
            '{e'}rt'
            'k{e'}pp' 'k{e'}ppen'
            'kor'
            'ul' '{u"}l'
            'v{a'}' 'v{e'}'
            'onk{e'}nt' 'enk{e'}nt' 'ank{e'}nt'
            'k{e'}nt'
            'en' 'on' 'an' '{o"}n'
            'n'
            't'
        )
        delete
        v_ending
    )

    define case_special as(
        [substring] R1 among(
            '{e'}n' (<- 'e')
            '{a'}n' (<- 'a')
            '{a'}nk{e'}nt' (<- 'a')
        )
    )

    define case_other as(
        [substring] R1 among(
            'astul' 'est{u"}l' (delete)
            'stul' 'st{u"}l' (delete)
            '{a'}stul' (<- 'a')
            '{e'}st{u"}l' (<- 'e')
        )
    )

    define factive as(
        [substring] R1 among(
            '{a'}' (double)
            '{e'}' (double)
        )
        delete
        undouble
    )

    define plural as (
        [substring] R1 among(
            '{a'}k' (<- 'a')
            '{e'}k' (<- 'e')
            '{o"}k' (delete)
            'ak' (delete)
            'ok' (delete)
            'ek' (delete)
            'k' (delete)
        )
    )

    define owned as (
        [substring] R1 among (
            'ok{e'}' '{o"}k{e'}' 'ak{e'}' 'ek{e'}' (delete)
            '{e'}k{e'}' (<- 'e')
            '{a'}k{e'}' (<- 'a')
            'k{e'}' (delete)
            '{e'}{e'}i' (<- 'e')
            '{a'}{e'}i' (<- 'a')
            '{e'}i'  (delete)
            '{e'}{e'}' (<- 'e')
            '{e'}' (delete)
        )
    )

    define sing_owner as (
        [substring] R1 among(
            '{u"}nk' 'unk' (delete)
            '{a'}nk' (<- 'a')
            '{e'}nk' (<- 'e')
            'nk' (delete)
            '{a'}juk' (<- 'a')
            '{e'}j{u"}k' (<- 'e')
            'juk' 'j{u"}k' (delete)
            'uk' '{u"}k' (delete)
            'em' 'om' 'am' (delete)
            '{a'}m' (<- 'a')
            '{e'}m' (<- 'e')
            'm' (delete)
            'od' 'ed' 'ad' '{o"}d' (delete)
            '{a'}d' (<- 'a')
            '{e'}d' (<- 'e')
            'd' (delete)
            'ja' 'je' (delete)
            'a' 'e' 'o' (delete)
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define plur_owner as (
        [substring] R1 among(
            'jaim' 'jeim' (delete)
            '{a'}im' (<- 'a')
            '{e'}im' (<- 'e')
            'aim' 'eim' (delete)
            'im' (delete)
            'jaid' 'jeid' (delete)
            '{a'}id' (<- 'a')
            '{e'}id' (<- 'e')
            'aid' 'eid' (delete)
            'id' (delete)
            'jai' 'jei' (delete)
            '{a'}i' (<- 'a')
            '{e'}i' (<- 'e')
            'ai' 'ei' (delete)
            'i' (delete)
            'jaink' 'jeink' (delete)
            'eink' 'aink' (delete)
            '{a'}ink' (<- 'a')
            '{e'}ink' (<- 'e')
            'ink' (delete)
            'jaitok' 'jeitek' (delete)
            'aitok' 'eitek' (delete)
            '{a'}itok' (<- 'a')
            '{e'}itek' (<- 'e')
            'itek' (delete)
            'jeik' 'jaik' (delete)
            'aik' 'eik' (delete)
            '{a'}ik' (<- 'a')
            '{e'}ik' (<- 'e')
            'ik' (delete)
        )
    )
)

define stem as (
    do mark_regions
    backwards (
        do instrum
        do case
        do case_special
        do case_other
        do factive
        do owned
        do sing_owner
        do plur_owner
        do plural
    )
)