Here is a sample of Hungarian vocabulary, with the stemmed forms that will be generated by this algorithm:
word | stem | word | stem
babaháznak | babaház | muattta | muattt
babakocsi | babakocs | mukkot | muk
babakocsijáért | babakocs | mulandóság | mulandóság
babakocsit | babakocs | mulandóságot | mulandóság
babakocsiért | babakocs | mulasszátok | mulasszát
babból | bab | mulasztanak | mulaszt
bab | bab | mulasztotta | mulasztott
babgulyás | babgulyás | mulasztottam | mulasztott
babgulyást | babgulyás | mulasztották | mulasztotta
babona | babon | mulaszt | mulasz
babonákkal | babona | mulaszthatom | mulaszthat
babonás | babonás | mulasztás | mulasztás
babrálgatta | babrálgatt | mulasztásban | mulasztás
babrálni | babráln | mulasztásból | mulasztás
babrál | babrál | mulasztásnál | mulasztás
babrált | babrál | mulasztással | mulasztás
babrálva | babrálv | mulasztásának | mulasztás
babusgatnak | babusgat | mulasztásánál | mulasztás
baba | ba | mulasztásáért | mulasztás
babái | baba | mulasztási | mulasztás
babák | baba | mulasztásos | mulasztásos
babákkal | baba | mulasztó | mulasztó
babázni | babázn | mulathatnánk | mulathatna
babérfa | babérf | mulathattunk | mulathatt
babérokat | babér | mulatna | mulatn
babért | bab | mulat | mul
bacchánsnők | bacchánsnő | mulatnak | mulat
badacsonyi | badacsony | mulatni | mulatn
badarság | badarság | mulattak | mulatt
badarságok | badarság | mulattat | mulatt
baedeker | baedeker | mulattatta | mulattatt
baglyokat | bagly | mulatott | mulatot
bagolyszemüveges | bagolyszemüveges | mulatozott | mulatozot
bagót | bagó | mulatozáshoz | mulatozás
bajbajutott | bajbajutot | mulatozást | mulatozás
bajbajutottak | bajbajutott | mulatság | mulatság
bajbajutottakat | bajbajutott | mulatságnak | mulatság
bajbajutottakon | bajbajutott | mulatságot | mulatság
bajlódjanak | bajlód | mulatságos | mulatságos
bajlódni | bajlódn | mulatt | mulat
The algorithm is described in the paper as "Light". It primarily aims to remove noun inflections ("all noun cases, plural and frequent owners"). This means it also stems adjectives ("the two are linked because of similar morphology"). Some verb forms are stemmed, but really only as a side-effect when a rule to remove a noun suffix matches a verb form as well. The paper presents the case that stemming verbs matters less for retrieval.
The Hungarian language has these digraphs (plus the trigraph dzs): cs, dz, dzs, gy, ly, ny, sz, ty, zs.
However, treating these specially makes no difference to the results of the algorithm on valid Hungarian words, so (since Snowball 2.3.0) the algorithm doesn't treat digraphs specially, except that some of the double consonants include digraphs, e.g. ccs.
This stemming algorithm removes the inflectional suffixes of nouns. Nouns are inflected for case, person/possession and number.
Letters in Hungarian include the following accented forms:
á é í ó ö ő ú ü ű
The following letters are vowels:
a e i o u á é í ó ö ő ú ü ű
For the purposes of this algorithm we define a consonant as a character which is not a vowel.
A double consonant is defined as one of:
bb cc ccs dd ff gg ggy jj kk ll lly mm nn nny pp rr ss ssz tt tty vv zz zzs
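As a rough illustration of this definition, the following Python sketch tests whether a string ends in a double consonant. The list is copied from the `double` routine in the Snowball source; the function name is hypothetical and not part of any Snowball API.

```python
# Double consonants, copied from the `double` routine in the Snowball source.
DOUBLE_CONSONANTS = (
    "bb", "cc", "ccs", "dd", "ff", "gg", "ggy", "jj", "kk", "ll", "lly", "mm",
    "nn", "nny", "pp", "rr", "ss", "ssz", "tt", "tty", "vv", "zz", "zzs",
)


def ends_with_double_consonant(text: str) -> bool:
    """Return True if `text` ends with one of the listed double consonants."""
    # str.endswith accepts a tuple and succeeds if any entry matches.
    return text.endswith(DOUBLE_CONSONANTS)


print(ends_with_double_consonant("babákk"))  # True  (ends in kk)
print(ends_with_double_consonant("babák"))   # False
```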
If the word begins with a vowel, R1 is defined as the region after the first consonant in the word. If the word begins with a consonant, it is defined as the region after the first vowel in the word. If the word does not contain both a vowel and consonant, R1 is the null region at the end of the word.
For example:
    t ó b a n          consonant-vowel
       |.....|         R1 is 'b a n'
    a b l a k a n      vowel-consonant
       |.........|     R1 is 'l a k a n'
    a c s o n y        vowel-consonant (the digraph cs is not treated specially)
       |.......|       R1 is 's o n y'
    c v s
     --->|<---         null R1 region
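The definition of R1 can be sketched in Python as below. This is a minimal illustration that follows the `mark_regions` routine in the Snowball source, not the generated stemmer code; `r1_start` and `r1` are hypothetical names chosen for this example.

```python
# Vowels, as in the `v` grouping of the Snowball source.
VOWELS = set("aeiouáéíóöőúüű")


def r1_start(word: str) -> int:
    """Return the index where R1 begins; len(word) means R1 is null."""
    if not word:
        return 0
    if word[0] in VOWELS:
        # Word starts with a vowel: R1 begins after the first consonant.
        for i, ch in enumerate(word):
            if ch not in VOWELS:
                return i + 1
    else:
        # Word starts with a consonant: R1 begins after the first vowel.
        for i, ch in enumerate(word):
            if ch in VOWELS:
                return i + 1
    # No vowel/consonant pair found: R1 is the null region at the end.
    return len(word)


def r1(word: str) -> str:
    """Return the R1 region of `word` as a string."""
    return word[r1_start(word):]


print(r1("tóban"))    # ban
print(r1("ablakan"))  # lakan
print(repr(r1("cvs")))  # ''
```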
‘Delete if in R1’ means that the suffix should be removed if it is in region R1 but not if it is outside.
Do steps 1 to 9 in turn
Step 1: Remove instrumental case
Step 2: Remove frequent cases
Step 3: Remove special cases:
Step 4: Remove other cases:
Step 5: Remove factive case
Step 6: Remove owned
Step 7: Remove singular owner suffixes
Step 8: Remove plural owner suffixes
Step 9: Remove plural suffixes
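To make the shape of the steps more concrete, here is a minimal Python sketch of Steps 1 and 9 only, transcribed from the `instrum` and `plural` routines in the Snowball source. It is an illustration under those assumptions rather than a full port: the function names are hypothetical and the other seven steps are omitted. As with Snowball's `among`, the longest matching suffix is chosen first and the step does nothing if that suffix is not in R1.

```python
VOWELS = set("aeiouáéíóöőúüű")
DOUBLES = ("bb", "cc", "ccs", "dd", "ff", "gg", "ggy", "jj", "kk", "ll", "lly",
           "mm", "nn", "nny", "pp", "rr", "ss", "ssz", "tt", "tty", "vv", "zz",
           "zzs")
# Step 9 suffixes with their replacement ("" means delete).
PLURAL_RULES = [("ák", "a"), ("ék", "e"), ("ök", ""), ("ak", ""),
                ("ok", ""), ("ek", ""), ("k", "")]


def r1_start(word: str) -> int:
    """Index where R1 begins; len(word) means R1 is null."""
    starts_with_vowel = bool(word) and word[0] in VOWELS
    for i, ch in enumerate(word):
        # Vowel-initial words stop at the first consonant, others at the first vowel.
        if (ch in VOWELS) != starts_with_vowel:
            return i + 1
    return len(word)


def remove_instrumental(word: str) -> str:
    """Step 1: delete 'al'/'el' in R1 after a double consonant, then undouble."""
    p1 = r1_start(word)
    for suffix in ("al", "el"):
        cut = len(word) - len(suffix)
        if word.endswith(suffix) and cut >= p1 and word[:cut].endswith(DOUBLES):
            base = word[:cut]
            # Undouble: drop the second-to-last character (ss -> s, ccs -> cs).
            return base[:-2] + base[-1:]
    return word


def remove_plural(word: str) -> str:
    """Step 9: apply the rule for the longest matching suffix, if it is in R1."""
    p1 = r1_start(word)
    for suffix, replacement in sorted(PLURAL_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            cut = len(word) - len(suffix)
            return word[:cut] + replacement if cut >= p1 else word
    return word


print(remove_instrumental("mulasztással"))            # mulasztás
print(remove_plural("babák"))                         # baba
print(remove_plural(remove_instrumental("babákkal")))  # baba
```

The three results agree with the sample vocabulary above; for other words the omitted steps would also apply, so this sketch alone will not reproduce every stem in the table.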
Mar 2025 (Snowball 2.3.0): Removed special handling of digraphs. We were ensuring R1 didn't start in the middle of a digraph (except that "dz" was missing from the Snowball implementation although included in the algorithm description). However, having R1 start in the middle of a digraph would only make a difference to the stemming if we removed a suffix that started with the last character of the digraph (or with "zs" in the case of "dzs").
No suffixes we remove start with y or z.
Two suffixes start with s (stul and stül) so removing special handling of cs and dzs makes a difference to some inputs but not to any inputs which are valid Hungarian words.
Removing the digraph handling speeds up stemming (by ~2% on the current sample vocabulary list).
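The claims about suffix initials can be checked mechanically. The Python sketch below assumes the suffix list has been transcribed completely and accurately by hand from the `among(...)` lists in the Snowball source that follows; `starting_with` is a hypothetical helper name.

```python
# Every suffix handled by the algorithm, transcribed by hand from the
# among(...) lists in the Snowball source (assumed complete and accurate).
SUFFIXES = """
al el
ban ben ba be ra re nak nek val vel tól től ról ről ból ből hoz hez höz
nál nél ig at et ot öt ért képp képpen kor ul ül vá vé onként enként
anként ként en on an ön n t
én án ánként
astul estül stul stül ástul éstül
á é
ák ék ök ak ok ek k
oké öké aké eké éké áké ké ééi áéi éi éé é
ünk unk ánk énk nk ájuk éjük juk jük uk ük em om am ám ém m
od ed ad öd ád éd d ja je a e o á é
jaim jeim áim éim aim eim im jaid jeid áid éid aid eid id
jai jei ái éi ai ei i jaink jeink eink aink áink éink ink
jaitok jeitek aitok eitek áitok éitek itek jeik jaik aik eik áik éik ik
""".split()


def starting_with(letter: str) -> list[str]:
    """Return the distinct suffixes that begin with `letter`, sorted."""
    return sorted({s for s in SUFFIXES if s.startswith(letter)})


# The digraphs end in s, y or z, so only these initials matter.
print(starting_with("y"))  # []
print(starting_with("z"))  # []
print(starting_with("s"))  # ['stul', 'stül']
```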
/*
Hungarian Stemmer
Removes noun inflections
*/

routines (
    mark_regions
    R1
    v_ending
    case
    case_special
    case_other
    plural
    owned
    sing_owner
    plur_owner
    instrum
    factive
    undouble
    double
)

externals ( stem )

integers ( p1 )
groupings ( v )

stringescapes {}

/* special characters */

stringdef a'  '{U+00E1}'  //a-acute
stringdef e'  '{U+00E9}'  //e-acute
stringdef i'  '{U+00ED}'  //i-acute
stringdef o'  '{U+00F3}'  //o-acute
stringdef o"  '{U+00F6}'  //o-umlaut
stringdef oq  '{U+0151}'  //o-double acute
stringdef u'  '{U+00FA}'  //u-acute
stringdef u"  '{U+00FC}'  //u-umlaut
stringdef uq  '{U+0171}'  //u-double acute

define v 'aeiou{a'}{e'}{i'}{o'}{o"}{oq}{u'}{u"}{uq}'

define mark_regions as (

    $p1 = limit

    (
        // Word start with a vowel, start R1 after: V...C
        v
        do (gopast non-v setmark p1)
    ) or (
        // Word start with a non-vowel, start R1 after: C...V
        gopast v setmark p1
    )
)

backwardmode (

    define R1 as $p1 <= cursor

    define v_ending as (
        [substring] R1 among(
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define double as (
        test among('bb' 'cc' 'ccs' 'dd' 'ff' 'gg' 'ggy' 'jj' 'kk' 'll' 'lly' 'mm'
                   'nn' 'nny' 'pp' 'rr' 'ss' 'ssz' 'tt' 'tty' 'vv' 'zz' 'zzs')
    )

    define undouble as (
        next [hop 1] delete
    )

    define instrum as(
        [substring] R1 among(
            'al' (double)
            'el' (double)
        )
        delete
        undouble
    )

    define case as (
        [substring] R1 among(
            'ban' 'ben'
            'ba' 'be'
            'ra' 're'
            'nak' 'nek'
            'val' 'vel'
            't{o'}l' 't{oq}l'
            'r{o'}l' 'r{oq}l'
            'b{o'}l' 'b{oq}l'
            'hoz' 'hez' 'h{o"}z'
            'n{a'}l' 'n{e'}l'
            'ig'
            'at' 'et' 'ot' '{o"}t'
            '{e'}rt'
            'k{e'}pp' 'k{e'}ppen'
            'kor'
            'ul' '{u"}l'
            'v{a'}' 'v{e'}'
            'onk{e'}nt' 'enk{e'}nt' 'ank{e'}nt'
            'k{e'}nt'
            'en' 'on' 'an' '{o"}n'
            'n'
            't'
        )
        delete
        v_ending
    )

    define case_special as(
        [substring] R1 among(
            '{e'}n' (<- 'e')
            '{a'}n' (<- 'a')
            '{a'}nk{e'}nt' (<- 'a')
        )
    )

    define case_other as(
        [substring] R1 among(
            'astul' 'est{u"}l' (delete)
            'stul' 'st{u"}l' (delete)
            '{a'}stul' (<- 'a')
            '{e'}st{u"}l' (<- 'e')
        )
    )

    define factive as(
        [substring] R1 among(
            '{a'}' (double)
            '{e'}' (double)
        )
        delete
        undouble
    )

    define plural as (
        [substring] R1 among(
            '{a'}k' (<- 'a')
            '{e'}k' (<- 'e')
            '{o"}k' (delete)
            'ak' (delete)
            'ok' (delete)
            'ek' (delete)
            'k' (delete)
        )
    )

    define owned as (
        [substring] R1 among (
            'ok{e'}' '{o"}k{e'}' 'ak{e'}' 'ek{e'}' (delete)
            '{e'}k{e'}' (<- 'e')
            '{a'}k{e'}' (<- 'a')
            'k{e'}' (delete)
            '{e'}{e'}i' (<- 'e')
            '{a'}{e'}i' (<- 'a')
            '{e'}i' (delete)
            '{e'}{e'}' (<- 'e')
            '{e'}' (delete)
        )
    )

    define sing_owner as (
        [substring] R1 among(
            '{u"}nk' 'unk' (delete)
            '{a'}nk' (<- 'a')
            '{e'}nk' (<- 'e')
            'nk' (delete)
            '{a'}juk' (<- 'a')
            '{e'}j{u"}k' (<- 'e')
            'juk' 'j{u"}k' (delete)
            'uk' '{u"}k' (delete)
            'em' 'om' 'am' (delete)
            '{a'}m' (<- 'a')
            '{e'}m' (<- 'e')
            'm' (delete)
            'od' 'ed' 'ad' '{o"}d' (delete)
            '{a'}d' (<- 'a')
            '{e'}d' (<- 'e')
            'd' (delete)
            'ja' 'je' (delete)
            'a' 'e' 'o' (delete)
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define plur_owner as (
        [substring] R1 among(
            'jaim' 'jeim' (delete)
            '{a'}im' (<- 'a')
            '{e'}im' (<- 'e')
            'aim' 'eim' (delete)
            'im' (delete)
            'jaid' 'jeid' (delete)
            '{a'}id' (<- 'a')
            '{e'}id' (<- 'e')
            'aid' 'eid' (delete)
            'id' (delete)
            'jai' 'jei' (delete)
            '{a'}i' (<- 'a')
            '{e'}i' (<- 'e')
            'ai' 'ei' (delete)
            'i' (delete)
            'jaink' 'jeink' (delete)
            'eink' 'aink' (delete)
            '{a'}ink' (<- 'a')
            '{e'}ink' (<- 'e')
            'ink'
            'jaitok' 'jeitek' (delete)
            'aitok' 'eitek' (delete)
            '{a'}itok' (<- 'a')
            '{e'}itek' (<- 'e')
            'itek' (delete)
            'jeik' 'jaik' (delete)
            'aik' 'eik' (delete)
            '{a'}ik' (<- 'a')
            '{e'}ik' (<- 'e')
            'ik' (delete)
        )
    )
)

define stem as (
    do mark_regions
    backwards (
        do instrum
        do case
        do case_special
        do case_other
        do factive
        do owned
        do sing_owner
        do plur_owner
        do plural
    )
)