In March 2012, Jim O’Regan sent us an implementation of Ljiljana Dolamic's Czech stemmer. For many years it was available only on the website; it was finally merged in Snowball 3.1.0 in 2026.
The merged version is based on Jim's, but resolves discrepancies between it, the original paper (linked above), and the light and aggressive Czech stemmers from UniNE which are also by Ljiljana Dolamic. We have made significant further improvements after evaluating our stemmer on a Czech vocabulary list.
We've chosen the "light" versions of Dolamic's and O’Regan's stemmers as our starting point. The other Snowball stemmers are generally light rather than aggressive, and Dolamic's aggressive variant is reported to overstem.
We've focussed on stemming noun forms by removing case suffixes and possessive suffixes. There's significant overlap in Czech noun and adjectival suffixes, and some overlap with verbs, so many adjectives and some verbs are also handled.
Like most of the other Snowball stemmers, the Czech stemmer defines an R1 region to reduce overstemming. The definition of R1 used for Czech is similar to that for other languages, but also takes into account syllabic consonants: in Czech, an r or l between two consonants, or after a consonant at the end of a word, can act as a syllable nucleus in place of a vowel.
For example, vlna (wool) has declension vlna, vlny, vlně, vlnu, vlně, vlnou and we want R1 to start right after vln.
Some sources say m and/or n can also act in this way, but these seem to be extremely rarely used in practice, and including them actually causes false conflations while almost never helping, so our stemmer only considers r and l as special for determining R1.
R1 is also required to start at least 3 characters into the word, which helped to reduce overstemming in our tests.
(The stemmer in the paper and the UniNE versions both enforce a minimum stem length using a simple character count; Jim's Snowball version defines a region R1, but his definition doesn't take into account syllabic consonants.)
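The R1 rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the Snowball implementation: the function name `r1_start` and the two character sets are made up for this example, and positions are counted in characters.

```python
from typing import Optional

VOWELS = set("aeiouyáěéíóúůý")
# Assumption for this sketch: only 'l' and 'r' are treated as syllabic.
VOWEL_LIKE = VOWELS | set("lr")


def r1_start(word: str) -> Optional[int]:
    """Return the index where R1 starts, or None for words under 3 characters."""
    n = len(word)
    if n < 3:
        return None
    if word[0] in VOWELS:
        # The first character only counts if it is a true vowel.
        i = 1
    else:
        # Skip the first character, then find the first vowel or syllabic
        # consonant and step past it.
        i = 1
        while i < n and word[i] not in VOWEL_LIKE:
            i += 1
        if i == n:
            return n  # nothing vowel-like found: R1 is empty
        i += 1
    # Step past the first non-vowel that follows.
    while i < n and word[i] in VOWELS:
        i += 1
    if i == n:
        return n  # no following non-vowel: R1 is empty
    # R1 must start at least 3 characters into the word.
    return max(i + 1, 3)
```

With these assumptions, `r1_start("vlna")` is 3, so R1 starts right after vln.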
A hard final consonant in the word stem is softened before certain suffixes, which is reflected in how written words are spelled. For example kluk + -i → kluci. Ideally we want to reverse this when removing such a suffix, but this is tricky because the word stem could end in c - for example, vejce + -i → vejci. In some cases (such as vejci) we can safely make vejk our stem - it's linguistically incorrect, but doesn't collide with the stem of an unrelated word. In some other cases we don't palatalise a trailing consonant because our testing suggested it hindered more than it helped.
(The stemmer in the paper, the UniNE versions, and Jim's Snowball version all palatalise in more cases.)
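As a rough illustration of the reversal step, the sketch below applies the rules used after an -i or -í suffix has been removed. It is a simplified Python rendering, not the stemmer itself: the names `AFTER_I_RULES` and `undo_softening_after_i` are hypothetical, and suffix removal and R1 checks are deliberately left out.

```python
# Endings and their replacements; None means "leave the stem unchanged".
# The longest matching ending wins, so exceptions such as 'nc' take
# priority over the general 'c' -> 'k' rule.
AFTER_I_RULES = {
    "c": "k",
    "nc": None,      # e.g. financí
    "avc": None,     # e.g. nástavci
    "ovc": None,     # e.g. pískovci
    "ínc": "ínk",    # e.g. Gruzínci
    "čt": "ck",
    "št": "sk",      # e.g. čeština
    "ášt": None,     # e.g. plášti
    "dešt": None,    # e.g. dešti
    "išt": None,     # e.g. bojišti
    "íšt": None,     # e.g. příští
    "lešt": None,    # e.g. kleští
    "poušt": None,   # e.g. poušti
}


def undo_softening_after_i(stem: str) -> str:
    """Reverse consonant softening on a stem whose -i suffix was removed."""
    for ending in sorted(AFTER_I_RULES, key=len, reverse=True):
        if stem.endswith(ending):
            replacement = AFTER_I_RULES[ending]
            if replacement is None:
                return stem
            return stem[: len(stem) - len(ending)] + replacement
    return stem
```

For example, kluci with -i removed is kluc, which this sketch maps to kluk, while vejc maps to vejk and financ is left alone.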
For nouns in which the stem ends with a consonant group, a floating -e- is usually inserted between the last two consonants in cases with no ending. For example, zámku and zámek. Our stemmer removes this in most cases, so both of these stem to zámk. (This is not done by the other stemmer versions.)
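The handling of a floating -e- in words ending -ek can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `drop_floating_e` and the `r1` parameter (the index where R1 starts) are hypothetical, and only the -ek ending is covered.

```python
VOWELS = set("aeiouyáěéíóúůý")
# Endings before -ek where the -e- is part of the stem and is kept.
KEEP_E_AFTER = ("dot", "obl", "sn")  # dotek, oblek, česnek


def drop_floating_e(word: str, r1: int) -> str:
    """Turn a final -ek into -k when the -e- looks like a floating vowel."""
    if not word.endswith("ek"):
        return word
    before = word[:-2]
    # The -ek ending must lie entirely inside R1.
    if len(before) < r1:
        return word
    # The -e- must follow a consonant.
    if not before or before[-1] in VOWELS:
        return word
    if before.endswith(KEEP_E_AFTER):
        return word
    return before + "k"
```

Assuming R1 starts at index 3 for zámek, this yields zámk, which is also what zámku becomes once the ordinary case suffix -u is removed.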
Sometimes a vowel changes in the word stem when adding certain suffixes - for example, the genitive plural of smlouva is smluv. This seems to be hard to reverse without affecting words where the vowel hasn't changed, so our stemmer does not currently try to do anything special for such words (nor do the other versions).
routines (
    R1
    palatalise_e
    palatalise_i
    mark_regions
    possessive_suffix
    case_suffix
)
externals ( stem )
integers ( p1 x )
groupings ( env_ending ev_ending v v_or_syllabic_c )
stringescapes {}
stringdef a' '{U+00E1}' // á
stringdef cv '{U+010D}' // č
stringdef dv '{U+010F}' // ď
stringdef e' '{U+00E9}' // é
stringdef ev '{U+011B}' // ě
stringdef i' '{U+00ED}' // í
stringdef nv '{U+0148}' // ň
stringdef o' '{U+00F3}' // ó
stringdef rv '{U+0159}' // ř
stringdef sv '{U+0161}' // š
stringdef tv '{U+0165}' // ť
stringdef u' '{U+00FA}' // ú
stringdef uo '{U+016F}' // ů
stringdef y' '{U+00FD}' // ý
stringdef zv '{U+017E}' // ž
define v 'aeiouy{a'}{ev}{e'}{i'}{o'}{u'}{uo}{y'}'
// Some consonants in Czech can be syllabic - if these occur between two other
// consonants then they act in a vowel-like way and it is helpful to include
// them in the definition of R1.
//
// Some sources also list 'm' and 'n' as syllabic consonants for Czech but they
// seem to be much rarer and including them makes no difference to the results
// of stemming any words in our sample vocabulary list. Checking on a larger
// vocabulary list (also from wikipedia but with a lower cut-off frequency)
// all but one of the affected words don't seem to actually be Czech words.
define v_or_syllabic_c v + 'lr'
// Letters that can occur before -ev. Actual known exceptions include
// 'j' (objev, projev, výjev) and 'ř' (ohřev)
define ev_ending 'hknrtz'
// Letters that can occur before -eň. Actual known exceptions include
// 'g' (Irgeň), 'l' (zeleň), 'm' (kameň) and 'ř' (třeň).
define env_ending 'bc{cv}dhkprs{sv}tvz{zv}'
define mark_regions as (
    test (hop 3 setmark x) // Signals f if the input < 3 characters.
    $p1 = limit
    do (
        // A syllabic consonant must occur between two consonants, or be
        // preceded by a consonant and at the end of the word.
        //
        // Instead of literally testing that, we handle the first character
        // specially by only checking if it's a vowel; for subsequent
        // characters we know that the character before is a consonant because
        // otherwise we'd have stopped already.
        //
        // We also don't actually need to check the character after, since
        // if it's a vowel then that vowel means we'd end up at the same
        // position after `gopast non-v` anyway, and if it's the end of the
        // word then there's no non-v after it.
        (v or (next gopast v_or_syllabic_c)) gopast non-v
        setmark p1
        try($p1 < x $p1 = x) // at least 3
    )
)
backwardmode (
    define R1 as $p1 <= cursor
    define palatalise_e as (
        [substring] among (
            // -c -> -k
            'c'
            (<- 'k')
            'nc' // e.g. finance
            'avc' // e.g. dravce
            'ovc' // e.g. jalovce
            ()
            '{i'}nc' // e.g. podmínce
            (<- '{i'}nk')
        )
    )
    define palatalise_i as (
        [substring] among (
            // -c -> -k
            'c'
            (<- 'k')
            'nc' // e.g. financí
            'avc' // e.g. nástavci
            'ovc' // e.g. pískovci
            ()
            '{i'}nc' // e.g. Gruzínci
            (<- '{i'}nk')
            '{cv}t' (<- 'ck')
            // -št -> -sk
            '{sv}t' // e.g. čeština
            (<-'sk')
            '{a'}{sv}t' // e.g. plášti
            'de{sv}t' // e.g. dešti
            'i{sv}t' // e.g. bojišti
            '{i'}{sv}t' // e.g. příští
            'le{sv}t' // e.g. kleští
            'pou{sv}t' // e.g. poušti, spouští
            ()
        )
    )
    define possessive_suffix as (
        [substring] R1 among (
            'ov' '{uo}v'
            (delete)
            'in'
            (
                delete
                try palatalise_i
            )
        )
    )
    define case_suffix as (
        setlimit tomark p1 for ( [substring] ) among (
            'atech'
            'at{uo}m'
            '{a'}ch' '{y'}ch' 'ov{e'}' '{y'}mi'
            'ata' 'aty' 'ama' 'ami' 'ovi'
            'at' '{a'}m' 'us' '{uo}m' '{y'}m' 'mi' 'ou'
            '{e'}ho' '{e'}m' '{e'}mu'
            'u' 'y' '{uo}' 'a' 'o' '{a'}' '{e'}' '{y'}'
            (delete)
            '{ev}' '{ev}tem' '{ev}mi' '{ev}te' '{ev}ti'
            '{ev}m' // e.g. koněm
            (
                delete
            )
            'e' 'ech' 'em' 'emi'
            (
                delete
                try palatalise_e
            )
            'ete' 'eti' 'etem'
            (
                // t-stem neuter nouns
                among (
                    '{cv}' // e.g. dvojč-etem
                    'l' // e.g. batol-ete
                    '{rv}' // e.g. zvíř-ete
                    's' // e.g. pras-ete
                    '{zv}' // e.g. páž-ete
                    (delete)
                    'e{cv}' // e.g. pečet-i
                    'tl' // e.g. atlet-i
                    'es' // e.g. deset-i
                    ''
                    (
                        // Remove -e, -i, or -em; stem now ends -et so no palatalise
                        // step is needed.
                        <-'et'
                    )
                )
            )
            'eb'
            ( // Conflate e.g. skladeb with skladba, skladbě, skladby, etc.
                test non-v
                not 't{rv}' // potřeb
                <-'b'
            )
            'ec'
            ( // Conflate e.g. obec with obce, obcemi, obci, obcí, obcích.
                test non-v
                delete attach 'c'
                try palatalise_e
            )
            'ek'
            ( // Conflate e.g. článek with článkem, článku, článků, článkům, články.
                test non-v
                not among (
                    'dot' // dotek
                    'obl' // oblek
                    'sn' // česnek
                )
                <-'k'
            )
            '{ev}k'
            ( // Conflate e.g. daněk with daňka, daňkem, daňki, daňkovi, daňků, etc.
                'n'] <-'{nv}k'
            )
            'e{nv}'
            ( // Conflate e.g. Plzeň with Plzně, Plzni, Plzní.
                test env_ending
                <-'n' // -eň -> -n not -ň
            )
            // -eš can also lose the -e- but this seems very uncommon - the only
            // example I've seen is Aleš (a male given name or diminutive name)
            // but we require 3 characters before R1 so this won't be considered
            // for stemming.
            //
            // Also this can decline as Alše or Aleše, etc, and these alternative
            // declensions mean that just removing the -e- from Aleš would not
            // really help (especially as the forms which keep the -e- seem
            // more common, at least based on cs.wikipedia.org) so if we did this
            // we would need to also remove -e- from Aleše, etc, which seems a lot
            // of complication for a single word.
            'et'
            ( // Conflate e.g. počet with počte, počtu, počty, etc.
                among (
                    'uc' // e.g. tucet but not dvacet.
                    '{cv}'
                    'h'
                    'ok' // e.g. loket but not paket.
                    'kar' // e.g. karet but not cigaret.
                )
                <-'t'
            )
            'ev'
            ( // Conflate e.g. církev with církve, církvemi, církví, etc.
                ev_ending
                <-'v'
            )
            '{tv}' '{tv}mi'
            ( // Conflate e.g. oběť and oběťmi with obětech; hradišť with hradišti.
                <-'t'
            )
            'i'
            '{i'}' '{i'}ch' '{i'}ho' '{i'}m' '{i'}mi' '{i'}mu'
            (
                delete
                try palatalise_i
            )
        )
    )
)
define stem as (
    mark_regions // Signals f if the input has < 3 characters.
    backwards (
        do case_suffix
        do possessive_suffix
    )
)