Here is a sample of Finnish vocabulary, with the stemmed forms that will be generated by this algorithm:
word | stem | word | stem
edeltäjien | edeltäj | innostu | innostu
edeltäjiensä | edeltäjie | innostua | innostu
edeltäjiinsä | edeltäj | innostuessaan | innostue
edeltäjistään | edeltäj | innostui | innostui
edeltäjiä | edeltäj | innostuimme | innostui
edeltäjiään | edeltäjiä | innostuin | innostu
edeltäjä | edeltäj | innostuisi | innostui
edeltäjälleen | edeltäj | innostuisivat | innostuisiv
edeltäjän | edeltäj | innostuivat | innostuiv
edeltäjäni | edeltäj | innostukseen | innostuks
edeltäjänsä | edeltäj | innostuksella | innostuks
edeltäjänä | edeltäj | innostuksen | innostuks
edeltäjässä | edeltäj | innostuksensa | innostuks
edeltäjästä | edeltäj | innostuksessa | innostuks
edeltäjästään | edeltäj | innostuksessaan | innostuks
edeltäjät | edeltäj | innostuksesta | innostuks
edeltäjää | edeltäj | innostuksissaan | innostuks
edeltäjään | edeltäj | innostumaan | innostum
edeltäjäänsä | edeltäj | innostuminen | innostumin
edeltäneelle | edeltän | innostun | innostu
edeltäneellä | edeltän | innostuneelle | innostun
edeltäneeltä | edeltän | innostuneempia | innostun
edeltäneen | edeltän | innostuneen | innostun
edeltäneenä | edeltän | innostuneena | innostun
edeltäneeseen | edeltän | innostuneesta | innostun
edeltäneessä | edeltän | innostuneesti | innostun
edeltäneestä | edeltän | innostuneet | innostun
edeltäneet | edeltän | innostuneiden | innostun
edeltäneiden | edeltän | innostuneiksi | innostun
edeltäneissä | edeltän | innostunein | innostun
edeltäneitä | edeltän | innostuneina | innostun
edeltänyt | edeltäny | innostuneissa | innostun
edeltänyttä | edeltänyt | innostuneisuus | innostuneisuus
edeltävien | edeltäv | innostuneita | innostun
edeltäviin | edeltäv | innostunut | innostunu
edeltävinä | edeltäv | innostunutta | innostunut
edeltävissä | edeltäv | innostus | innostus
edeltävä | edeltäv | innostusta | innostu
edeltävälle | edeltäv | innostustaan | innostu
edeltävällä | edeltäv | innostutaan | innostu
Finnish is not an Indo-European language, but belongs to the Finno-Ugric group, which in turn belongs to the Uralic group (*). Distinctions between a-, i- and d-suffixes can be made in Finnish, but they are much less sharply separated than in an Indo-European language. The system of endings is extremely elaborate, but strictly defined, and applies equally to all nominals, that is, to nouns, adjectives and pronouns. Verb endings closely resemble nominal endings, which again makes Finnish very different from any Indo-European language.
More problematic than the endings themselves is the change that a stem can undergo as a result of taking a particular ending. A stem typically has two forms, strong and weak, where one class of ending follows the strong form and the complementary class the weak. Normalising strong and weak forms after ending removal is not generally possible, although the common case where strong and weak forms differ only in the single or double form of a final consonant can be dealt with.
Finnish includes the following accented forms: ä ö
The following letters are vowels: a e i o u y ä ö
R1 and R2 are then defined in the usual way (see the note on R1 and R2).
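As a rough illustration, the region marking done by the mark_regions routine in the Snowball source below can be sketched in Python. The function name mark_regions is borrowed from that source; the helper name and the choice to return plain offsets are assumptions made for this sketch only.

```python
VOWELS = set("aeiouyäö")


def mark_regions(word):
    """Return (p1, p2), the offsets at which R1 and R2 start in `word`."""

    def region_start(start):
        # Skip to the first vowel at or after `start` ...
        i = start
        while i < len(word) and word[i] not in VOWELS:
            i += 1
        # ... then to the first non-vowel that follows it.
        while i < len(word) and word[i] in VOWELS:
            i += 1
        # The region begins just past that non-vowel; if there is no
        # such position the region is empty and starts at the word end.
        return i + 1 if i < len(word) else len(word)

    p1 = region_start(0)
    p2 = region_start(p1)
    return p1, p2


p1, p2 = mark_regions("innostuksessa")
print("innostuksessa"[p1:], "innostuksessa"[p2:])
```

For innostuksessa this gives R1 = nostuksessa and R2 = tuksessa.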
Do each of steps 1, 2, 3, 4, 5 and 6.
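Step 1: particles etc
Search for the longest among the following suffixes in R1, and perform the action indicated:
a) kin kaan kään ko kö han hän pa pä: delete if preceded by n, t or a vowel
b) sti: delete if in R2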
Note that only the suffix needs to be in R1; the n, t or vowel of 1(a) is not required to be. The same applies in the steps below.
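The particle_etc routine in the Snowball source below can be read as the following Python sketch. The function name remove_particle and the convention of passing the R1 and R2 start offsets as p1 and p2 are assumptions of the sketch, not part of the algorithm's definition.

```python
VOWELS = set("aeiouyäö")
PARTICLES = ("kin", "kaan", "kään", "ko", "kö", "han", "hän", "pa", "pä")


def remove_particle(word, p1, p2):
    """Apply step 1 to `word`, given the R1/R2 start offsets p1 and p2."""
    # Candidate suffixes must lie entirely inside R1.
    in_r1 = [s for s in PARTICLES + ("sti",)
             if word.endswith(s) and len(word) - len(s) >= p1]
    if not in_r1:
        return word
    suffix = max(in_r1, key=len)  # longest match wins
    start = len(word) - len(suffix)
    if suffix == "sti":
        # The adverb ending must also lie in R2.
        allowed = start >= p2
    else:
        # The preceding n, t or vowel need not be inside R1.
        allowed = start > 0 and word[start - 1] in (VOWELS | {"n", "t"})
    return word[:start] if allowed else word


print(remove_particle("innostuneesti", 2, 5))
```

With the offsets p1 = 2 and p2 = 5 for innostuneesti, the ending sti lies in R2 and is removed, leaving innostunee for the later steps.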
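Step 2: possessives
Search for the longest among the following suffixes in R1, and perform the action indicated:
a) si: delete if not preceded by k
b) ni: delete; if what remains ends kse, replace that with ksi
c) nsa nsä mme nne: delete
d) an: delete if preceded by one of ta ssa sta lla lta na
e) än: delete if preceded by one of tä ssä stä llä ltä nä
f) en: delete if preceded by one of lle ine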
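A hedged Python reading of the possessive routine in the Snowball source below may help here. The name remove_possessive and the p1 parameter (the start offset of R1) are illustrative assumptions.

```python
def remove_possessive(word, p1):
    """Apply step 2 to `word`, given the R1 start offset p1."""
    suffixes = ("si", "ni", "nsa", "nsä", "mme", "nne", "an", "än", "en")
    in_r1 = [s for s in suffixes
             if word.endswith(s) and len(word) - len(s) >= p1]
    if not in_r1:
        return word
    suffix = max(in_r1, key=len)  # longest match wins
    stem = word[:len(word) - len(suffix)]

    if suffix == "si":
        # A preceding k means the ending is really ksi, so keep it.
        return word if stem.endswith("k") else stem
    if suffix == "ni":
        # kseni = ksi + ni, so restore ksi after removing ni.
        return stem[:-3] + "ksi" if stem.endswith("kse") else stem
    if suffix in ("nsa", "nsä", "mme", "nne"):
        return stem

    # Vn possessives are removed only when they follow a case ending.
    case_endings = {
        "an": ("ta", "ssa", "sta", "lla", "lta", "na"),
        "än": ("tä", "ssä", "stä", "llä", "ltä", "nä"),
        "en": ("lle", "ine"),
    }
    return stem if stem.endswith(case_endings[suffix]) else word


print(remove_possessive("edeltäjästään", 2))
```

Here the än of edeltäjästään follows the case ending stä, so it is removed, while the än of edeltäjän follows no case ending and is left for step 3.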
The remaining steps require a few definitions.
Define a v (vowel) as one of a e i o u y ä ö.
Define a V (restricted vowel) as one of a e i o u ä ö.
So Vi means a V followed by letter i.
Define LV (long vowel) as one of aa ee ii oo uu ää öö.
Define a c (consonant) as a character from ASCII a-z which isn't in
v (originally this was "a character other than a v", but since
2018-04-11 we've changed this definition to prevent the stemmer from altering
sequences of digits).
So cv means a c followed by a v.
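These definitions translate directly into character sets. The sketch below is one possible Python rendering; the constant and helper names are made up for illustration.

```python
import string

V_ALL = set("aeiouyäö")                            # v: any vowel
V_RESTRICTED = set("aeiouäö")                      # V: vowels other than y
LONG_VOWELS = {ch * 2 for ch in V_RESTRICTED}      # LV: aa ee ii oo uu ää öö
CONSONANTS = set(string.ascii_lowercase) - V_ALL   # c: ASCII a-z minus v


def ends_with_Vi(text):
    """True if `text` ends with a V followed by the letter i."""
    return len(text) >= 2 and text[-1] == "i" and text[-2] in V_RESTRICTED


def ends_with_LV(text):
    """True if `text` ends with a long vowel."""
    return text[-2:] in LONG_VOWELS


def ends_with_cv(text):
    """True if `text` ends with a c followed by a v."""
    return len(text) >= 2 and text[-2] in CONSONANTS and text[-1] in V_ALL


print("".join(sorted(CONSONANTS)))
print("7" in CONSONANTS)
```

Because c is restricted to ASCII a-z, a digit such as 7 is neither a vowel nor a consonant, which is what keeps digit sequences from being altered.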
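Step 3: cases
Search for the longest among the following suffixes in R1, and perform the action indicated:
a) han hen hin hon hän hön: delete if preceded by the vowel the suffix contains (a, e, i, o, ä, ö respectively)
b) siin den tten: delete if preceded by Vi
c) seen: delete if preceded by LV
d) a ä: delete if preceded by cv
e) tta ttä: delete if preceded by e
f) ta tä ssa ssä sta stä lla llä lta ltä lle na nä ksi ine: delete
g) n: delete, and if preceded by LV or ie, also delete the last vowel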
So aarteisiin → aartei, the longest matching suffix being siin, preceded as it is by Vi. But adressiin → adressi: here siin does not count as a match, because it is not preceded by Vi, so the longest matching suffix is n, and the last vowel of the preceding LV is then removed as well.
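The two examples can be reproduced with a deliberately partial Python sketch. It covers only the endings siin, seen and n from the case_ending routine in the Snowball source below; the remaining case endings (including den, tten and the h-illatives, which would take precedence as longer matches) are omitted. The function name and the p1 parameter are assumptions of the sketch.

```python
V_RESTRICTED = set("aeiouäö")
LONG_VOWELS = {ch * 2 for ch in V_RESTRICTED}


def remove_n_type_ending(word, p1):
    """Partial step 3: handles only siin, seen and n, given R1 offset p1."""

    def in_r1(suffix):
        return word.endswith(suffix) and len(word) - len(suffix) >= p1

    # siin counts as a match only when preceded by Vi.
    if in_r1("siin"):
        stem = word[:-4]
        if len(stem) >= 2 and stem[-1] == "i" and stem[-2] in V_RESTRICTED:
            return stem
    # seen counts as a match only when preceded by a long vowel.
    if in_r1("seen") and word[:-4][-2:] in LONG_VOWELS:
        return word[:-4]
    # Otherwise fall back to the shorter suffix n.
    if in_r1("n"):
        stem = word[:-1]
        if stem[-2:] in LONG_VOWELS or stem.endswith("ie"):
            return stem[:-1]  # also drop the last vowel
        return stem
    return word


print(remove_n_type_ending("aarteisiin", 3))
print(remove_n_type_ending("adressiin", 2))
```

The second call shows the fallback: siin is rejected for adressiin, so n is removed together with the final i of the long vowel ii.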
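Step 4: other endings
Search for the longest among the following suffixes in R2, and perform the action indicated:
a) mpi mpa mpä mmi mma mmä: delete if not preceded by po
b) impi impa impä immi imma immä eja ejä: delete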
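Based on the other_endings routine in the Snowball source below, step 4 might be sketched in Python as follows. The function name remove_other_ending and the p2 parameter (the start offset of R2) are illustrative assumptions.

```python
COMPARATIVE = ("mpi", "mpa", "mpä", "mmi", "mma", "mmä")
UNCONDITIONAL = ("impi", "impa", "impä", "immi", "imma", "immä", "eja", "ejä")


def remove_other_ending(word, p2):
    """Apply step 4 to `word`, given the R2 start offset p2."""
    # Candidate suffixes must lie entirely inside R2.
    in_r2 = [s for s in COMPARATIVE + UNCONDITIONAL
             if word.endswith(s) and len(word) - len(s) >= p2]
    if not in_r2:
        return word
    suffix = max(in_r2, key=len)  # longest match wins
    stem = word[:len(word) - len(suffix)]
    # Comparative endings are kept when they follow po.
    if suffix in COMPARATIVE and stem.endswith("po"):
        return word
    return stem


print(remove_other_ending("innostuneempi", 5))
```

In the sample vocabulary, innostuneempia reaches this step as innostuneempi once step 3 has removed the final a, and the comparative ending mpi is then stripped.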
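Step 5: plurals
If an ending was removed in step 3, delete a final i or j if in R1. Otherwise, delete a final t in R1 if it follows a vowel, and, if a t is removed, delete a final mma or imma in R2, unless the mma is preceded by po.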
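The i_plural and t_plural routines in the Snowball source below, together with the line of stem that chooses between them, can be approximated by the Python sketch here. The function name and the explicit ending_removed argument are assumptions; in the Snowball source that flag is a boolean set by step 3.

```python
VOWELS = set("aeiouyäö")


def remove_plural(word, p1, p2, ending_removed):
    """Apply step 5; `ending_removed` says whether step 3 deleted a suffix."""
    if ending_removed:
        # i-plural: drop a final i or j that lies in R1.
        if len(word) > p1 and word[-1] in ("i", "j"):
            return word[:-1]
        return word

    # t-plural: drop a final t that follows a vowel, both inside R1.
    if not (len(word) - 2 >= p1 and word[-1] == "t" and word[-2] in VOWELS):
        return word
    word = word[:-1]

    # Only after a t has been removed: drop imma or mma if it lies in R2,
    # except that mma is kept when it follows po.
    for suffix in ("imma", "mma"):
        start = len(word) - len(suffix)
        if word.endswith(suffix) and start >= p2:
            if suffix == "mma" and word[:start].endswith("po"):
                return word
            return word[:start]
    return word


print(remove_plural("edeltäji", 2, 4, True))
print(remove_plural("edeltäjät", 2, 4, False))
```

The first call corresponds to edeltäjien after step 3 has removed en; the second to edeltäjät, where no case ending was removed and the plural t is dropped instead.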
Step 6: tidying up
a) If R1 ends LV delete the last letter
b) If R1 ends cX, where c is a consonant and X is one of a ä e i, delete the last letter
c) If R1 ends oj or uj delete the last letter
d) If R1 ends jo delete the last letter
Do step (e), which is not restricted to R1.
e) If the word ends with a double consonant followed by zero or more vowels, remove the last consonant (so eläkk → eläk, aatonaatto → aatonaato)
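Steps (a) to (e) can be sketched in Python as below, following the tidy routine in the Snowball source. The function name tidy is taken from that source, while the p1 parameter (the start offset of R1) and the helper name are assumptions of the sketch.

```python
import string

VOWELS = set("aeiouyäö")
CONSONANTS = set(string.ascii_lowercase) - VOWELS
LONG_VOWELS = {"aa", "ee", "ii", "oo", "uu", "ää", "öö"}


def tidy(word, p1):
    """Apply step 6 to `word`, given the R1 start offset p1."""

    def r1():
        return word[p1:]

    # (a) undouble a final long vowel
    if r1()[-2:] in LONG_VOWELS:
        word = word[:-1]
    # (b) remove a final a, ä, e or i that follows a consonant
    if len(r1()) >= 2 and r1()[-1] in "aäei" and r1()[-2] in CONSONANTS:
        word = word[:-1]
    # (c) oj or uj loses the j
    if r1().endswith(("oj", "uj")):
        word = word[:-1]
    # (d) jo loses the o
    if r1().endswith("jo"):
        word = word[:-1]
    # (e) not restricted to R1: skip any trailing vowels, then remove the
    # last consonant if it is doubled
    i = len(word)
    while i > 0 and word[i - 1] in VOWELS:
        i -= 1
    if i >= 2 and word[i - 1] in CONSONANTS and word[i - 2] == word[i - 1]:
        word = word[:i - 1] + word[i:]
    return word


print(tidy("eläkk", 2), tidy("aatonaatto", 3))
```

The same function turns innostunee into innostun, because (a) and (b) apply one after the other: the long vowel is undoubled and the remaining e then follows a consonant.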
/* Finnish stemmer.
   Numbers in square brackets refer to the sections in
   Fred Karlsson, Finnish: An Essential Grammar. Routledge, 1999
   ISBN 0-415-20705-3
*/
routines (
    mark_regions
    R2
    particle_etc possessive
    LONG VI
    case_ending
    i_plural
    t_plural
    other_endings
    tidy
)
externals ( stem )
integers ( p1 p2 )
strings ( x )
booleans ( ending_removed )
groupings ( AEI C V1 V2 particle_end )
stringescapes {}
/* special characters */
stringdef a" '{U+00E4}'
stringdef o" '{U+00F6}'
define AEI 'a{a"}ei'
define C 'bcdfghjklmnpqrstvwxz'
define V1 'aeiouy{a"}{o"}'
define V2 'aeiou{a"}{o"}'
define particle_end V1 + 'nt'
define mark_regions as (
    $p1 = limit
    $p2 = limit
    goto V1 gopast non-V1 setmark p1
    goto V1 gopast non-V1 setmark p2
)
backwardmode (
    define R2 as $p2 <= cursor
    define particle_etc as (
        setlimit tomark p1 for ([substring])
        among(
            'kin'
            'kaan' 'k{a"}{a"}n'
            'ko'   'k{o"}'
            'han'  'h{a"}n'
            'pa'   'p{a"}'    // Particles [91]
                (particle_end)
            'sti'             // Adverb [87]
                (R2)
        )
        delete
    )
    define possessive as (    // [36]
        setlimit tomark p1 for ([substring])
        among(
            'si'
                (not 'k' delete)    // take 'ksi' as the Translative case
            'ni'
                (delete ['kse'] <- 'ksi')    // kseni = ksi + ni
            'nsa' 'ns{a"}'
            'mme'
            'nne'
                (delete)
            /* Now for Vn possessives after case endings: [36] */
            'an'
                (among('ta' 'ssa' 'sta' 'lla' 'lta' 'na') delete)
            '{a"}n'
                (among('t{a"}' 'ss{a"}' 'st{a"}'
                       'll{a"}' 'lt{a"}' 'n{a"}') delete)
            'en'
                (among('lle' 'ine') delete)
        )
    )
    define LONG as
        among('aa' 'ee' 'ii' 'oo' 'uu' '{a"}{a"}' '{o"}{o"}')
    define VI as ('i' V2)
    define case_ending as (
        setlimit tomark p1 for ([substring])
        among(
            'han'    ('a')          //-.
            'hen'    ('e')          // |
            'hin'    ('i')          // |
            'hon'    ('o')          // |
            'h{a"}n' ('{a"}')       // Illative [43]
            'h{o"}n' ('{o"}')       // |
            'siin'   VI             // |
            'seen'   LONG           //-'
            'den'    VI
            'tten'   VI             // Genitive plurals [34]
                ()
            'n'                     // Genitive or Illative
                ( try ( LONG        // Illative
                        or 'ie'     // Genitive
                          and next ]
                      )
                  /* otherwise Genitive */
                )
            'a' '{a"}'              //-.
                (V1 C)              // |
            'tta' 'tt{a"}'          // Partitive [32]
                ('e')               // |
            'ta' 't{a"}'            //-'
            'ssa' 'ss{a"}'          // Inessive [41]
            'sta' 'st{a"}'          // Elative [42]
            'lla' 'll{a"}'          // Adessive [44]
            'lta' 'lt{a"}'          // Ablative [51]
            'lle'                   // Allative [46]
            'na' 'n{a"}'            // Essive [49]
            'ksi'                   // Translative [50]
            'ine'                   // Comitative [51]
            /* Abessive and Instructive are too rare for
               inclusion [51] */
        )
        delete
        set ending_removed
    )
    define other_endings as (
        setlimit tomark p2 for ([substring])
        among(
            'mpi' 'mpa' 'mp{a"}'
            'mmi' 'mma' 'mm{a"}'       // Comparative forms [85]
                (not 'po')             //-improves things
            'impi' 'impa' 'imp{a"}'
            'immi' 'imma' 'imm{a"}'    // Superlative forms [86]
            'eja' 'ej{a"}'             // indicates agent [93.1B]
        )
        delete
    )
    define i_plural as (    // [26]
        setlimit tomark p1 for ([substring])
        among(
            'i' 'j'
        )
        delete
    )
    define t_plural as (    // [26]
        setlimit tomark p1 for (
            ['t'] test V1
            delete
        )
        setlimit tomark p2 for ([substring])
        among(
            'mma' (not 'po')    //-mmat endings
            'imma'              //-immat endings
        )
        delete
    )
    define tidy as (
        setlimit tomark p1 for (
            do ( LONG and ([next] delete ) )    // undouble vowel
            do ( [AEI] C delete )               // remove trailing a, a", e, i
            do ( ['j'] 'o' or 'u' delete )
            do ( ['o'] 'j' delete )
        )
        goto non-V1 [C] -> x x delete    // undouble consonant
    )
)
define stem as (
    do mark_regions
    unset ending_removed
    backwards (
        do particle_etc
        do possessive
        do case_ending
        do other_endings
        (ending_removed do i_plural) or do t_plural
        do tidy
    )
)