Here is a sample of French vocabulary, with the stemmed forms that will be generated by this algorithm:
word | stem | word | stem
continu | continu | main | main
continua | continu | mains | main
continuait | continu | maintenaient | mainten
continuant | continu | maintenait | mainten
continuation | continu | maintenant | mainten
continue | continu | maintenir | mainten
continuel | continuel | maintenue | maintenu
continuelle | continuel | maintien | maintien
continuellement | continuel | maintint | maintint
continuelles | continuel | maire | mair
continuels | continuel | maires | mair
continuer | continu | mairie | mair
continuera | continu | mais | mais
continuerait | continu | maison | maison
continueront | continu | maisons | maison
continuez | continu | maistre | maistr
continuité | continu | maitre | maitr
continuons | continuon | majestueuse | majestu
continué | continu | majestueusement | majestu
contorsions | contors | majestueux | majestu
contour | contour | majesté | majest
contournait | contourn | majeur | majeur
contournant | contourn | majeure | majeur
contourne | contourn | major | major
contours | contour | majordome | majordom
contractait | contract | majordomes | majordom
contracter | contract | majorité | major
contractions | contract | majorités | major
contracté | contract | mal | mal
contractée | contract | malacca | malacc
contractés | contract | malade | malad
contradictoirement | contradictoir | malades | malad
contradictoires | contradictoir | maladie | malad
contraindre | contraindr | maladies | malad
contraint | contraint | maladive | malad
contrainte | contraint | maladresse | maladress
contraintes | contraint | maladresses | maladress
contraire | contrair | maladroit | maladroit
contraires | contrair | maladroite | maladroit
contraria | contrari | maladroitement | maladroit
Letters in French include the following accented forms:
â à ç ë é ê è ï î ô û ù
The following letters are vowels:
a e i o u y â à ë é ê è ï î ô û ù
The first step removes elisions. If the word starts with one of c d j l m n s t or qu, followed by an apostrophe (') then remove this unless it comprises the whole word.
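The elision rule can be sketched in Python as follows. This is a minimal illustration rather than the reference implementation: the function name `remove_elision` is hypothetical, and only the plain ASCII apostrophe is handled.

```python
import re

# Elision prefixes named in the text: one of c d j l m n s t, or qu,
# followed by an apostrophe.
_ELISION = re.compile(r"^(?:[cdjlmnst]|qu)'")


def remove_elision(word: str) -> str:
    """Strip a leading elision such as l' or qu', unless it is the whole word."""
    match = _ELISION.match(word)
    if match and match.end() < len(word):
        return word[match.end():]
    return word
```

For instance, `remove_elision("l'homme")` gives `homme`, while a bare `l'` is left unchanged.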
Assume the word is in lower case. Then, taking the letters in turn from the beginning to end of the word, put u or i into upper case when it is both preceded and followed by a vowel; put y into upper case when it is either preceded or followed by a vowel; and put u into upper case when it follows q. For example,
jouer → joUer
ennuie → ennuIe
yeux → Yeux
quand → qUand
croyiez → croYiez
In the last example, y becomes Y because it is between two vowels, but i does not become I because it is between Y and e, and Y is not defined as a vowel above.
(The upper case forms are not then classed as vowels — see note on vowel marking.)
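The marking rule can be illustrated with a short Python sketch that follows the prose description literally, scanning left to right and testing the preceding letter in its already-marked form. The name `mark_vowels` is hypothetical, the vowel set is taken from the `v` grouping in the Snowball source, and the Snowball `prelude` routine may behave differently on unusual vowel sequences, so treat this as an approximation.

```python
# Vowel set as defined by the `v` grouping in the Snowball source.
VOWELS = set("aeiouyâàëéêèïîôûù")


def mark_vowels(word: str) -> str:
    """Upper-case u, i and y where the text says they act as consonants."""
    chars = list(word)
    n = len(chars)
    for k, ch in enumerate(chars):
        # chars[k - 1] may already be upper case, in which case it no
        # longer counts as a vowel.
        prev_vowel = k > 0 and chars[k - 1] in VOWELS
        next_vowel = k + 1 < n and chars[k + 1] in VOWELS
        if ch in "ui" and prev_vowel and next_vowel:
            chars[k] = ch.upper()
        elif ch == "y" and (prev_vowel or next_vowel):
            chars[k] = "Y"
        elif ch == "u" and k > 0 and chars[k - 1] == "q":
            chars[k] = "U"
    return "".join(chars)
```

Applied to the five example words above, the sketch reproduces the marked forms shown, including `croYiez`, where the `i` stays in lower case because the preceding `Y` is no longer a vowel.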
Replace ë and ï with He and Hi. The H marks the vowel as having originally had a diaeresis, while the vowel itself, lacking an accent, is able to match suffixes beginning in e or i.
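As a small illustration of the diaeresis marking, assuming a hypothetical helper name `mark_diaeresis`:

```python
def mark_diaeresis(word: str) -> str:
    """Rewrite ë as He and ï as Hi so that later suffix tests see a plain e or i."""
    return word.replace("ë", "He").replace("ï", "Hi")
```

With this, `noël` becomes `noHel` and `haï` becomes `haHi`.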
If the word begins with two vowels, RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. (Exceptionally, par, col or tap, at the beginning of a word is also taken to define RV as the region to their right.)
For example,
a i m e r     a d o r e r     v o l e r     t a p i s
     |...|         |.....|       |.....|         |...|
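The RV definition can be sketched as a function returning the index at which RV starts. The name `rv_start` is hypothetical, and the vowel set is copied from the `v` grouping in the Snowball source; in the full algorithm the word would already have been through elision removal and vowel marking.

```python
# Vowel set as defined by the `v` grouping in the Snowball source.
VOWELS = set("aeiouyâàëéêèïîôûù")


def rv_start(word: str) -> int:
    """Return the index where RV begins, or len(word) if RV is empty."""
    # Two leading vowels: RV is the region after the third letter.
    if len(word) >= 3 and word[0] in VOWELS and word[1] in VOWELS:
        return 3
    # Exception list: par, col and tap at the start of the word.
    if word.startswith(("par", "col", "tap")):
        return 3
    # Otherwise: the region after the first vowel not at the beginning.
    for k in range(1, len(word)):
        if word[k] in VOWELS:
            return k + 1
    return len(word)
```

For the four words above, `word[rv_start(word):]` yields `er`, `rer`, `ler` and `is` respectively.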
R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel. (See note on R1 and R2.)
For example:
f a m e u s e m e n t
     |......R1.......|
           |...R2....|
Note that R1 can contain RV (adorer), and RV can contain R1 (voler).
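A minimal Python sketch of the R1 and R2 definitions follows. The helper names `region_after` and `r1_r2` are hypothetical, and the vowel set is again the `v` grouping from the Snowball source.

```python
# Vowel set as defined by the `v` grouping in the Snowball source.
VOWELS = set("aeiouyâàëéêèïîôûù")


def region_after(word: str, start: int = 0) -> int:
    """Index just past the first non-vowel that follows a vowel at or after start."""
    for k in range(start + 1, len(word)):
        if word[k] not in VOWELS and word[k - 1] in VOWELS:
            return k + 1
    return len(word)


def r1_r2(word: str):
    """Return the start indices (p1, p2) of R1 and R2."""
    p1 = region_after(word)
    p2 = region_after(word, p1)  # same rule, applied inside R1
    return p1, p2
```

For `fameusement` this gives R1 = `eusement` and R2 = `ement`, matching the diagram above.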
Below, ‘delete if in R2’ means that a found suffix should be removed if it lies entirely in R2, but not if it overlaps R2 and the rest of the word. ‘delete if in R1 and preceded by X’ means that X itself does not have to come in R1, while ‘delete if preceded by X in R1’ means that X, like the suffix, must be entirely in R1.
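The difference between these conditions can be made concrete with a small sketch in which a region is represented by its start index. All three function names are hypothetical.

```python
def suffix_in_region(word: str, suffix: str, start: int) -> bool:
    """True if word ends with suffix and the suffix lies entirely in the region."""
    return word.endswith(suffix) and len(word) - len(suffix) >= start


def in_region_and_preceded_by(word: str, suffix: str, x: str, start: int) -> bool:
    """'in R and preceded by X': only the suffix has to lie in the region."""
    head = word[:len(word) - len(suffix)]
    return suffix_in_region(word, suffix, start) and head.endswith(x)


def preceded_by_in_region(word: str, suffix: str, x: str, start: int) -> bool:
    """'preceded by X in R': X and the suffix must both lie in the region."""
    return suffix_in_region(word, x + suffix, start)
```

Taking `fameusement`, whose R2 starts at index 6, the suffix `ement` is in R2 and is preceded by `eus`, but `eus` is not itself in R2, so the first test succeeds and the second fails.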
Start with step 1.
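Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the action indicated.
ance iqUe isme able iste eux ances iqUes ismes ables istes
    delete if in R2
atrice ateur ation atrices ateurs ations
    delete if in R2
    if preceded by ic, delete if in R2, else replace by iqU
logie logies
    replace with log if in R2
usion ution usions utions
    replace with u if in R2
ence ences
    replace with ent if in R2
ement ements
    delete if in RV
    if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
    if preceded by eus, delete if in R2, else replace by eux if in R1, otherwise,
    if preceded by abl or iqU, delete if in R2, otherwise,
    if preceded by ièr or Ièr, replace by i if in RV
ité ités
    delete if in R2
    if preceded by abil, delete if in R2, else replace by abl, otherwise,
    if preceded by ic, delete if in R2, else replace by iqU, otherwise,
    if preceded by iv, delete if in R2
if ive ifs ives
    delete if in R2
    if preceded by at, delete if in R2 (and if further preceded by ic, delete if in R2, else replace by iqU)
eaux
    replace with eau
aux
    replace with al if in R1
euse euses
    delete if in R2, else replace by eux if in R1
issement issements
    delete if in R1 and preceded by a non-vowel
amment
    replace with ant if in RV
emment
    replace with ent if in RV
ment ments
    delete if preceded by a vowel in RV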
In steps 2a and 2b all tests are confined to the RV region.
Do step 2a if either no ending was removed by step 1, or if one of the endings amment, emment, ment, ments was found.
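Step 2a: Verb suffixes beginning i
Search for the longest among the following suffixes and, if found, delete it if the preceding character is neither a vowel nor H.
îmes ît îtes i ie ies ir ira irai iraIent irais irait iras irent irez iriez irions irons iront is issaIent issais issait issant issante issantes issants isse issent isses issez issiez issions issons it
(Note that the preceding character itself must also be in RV.)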
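Step 2a can be sketched in Python as below. The suffix list is copied from the `i_verb_suffix` routine in the Snowball source; the function name `step_2a`, its `(word, altered)` return shape, and the `pv` parameter (the start index of RV) are assumptions made for this illustration.

```python
# Vowel set as defined by the `v` grouping in the Snowball source.
VOWELS = set("aeiouyâàëéêèïîôûù")

# Suffix list copied from the Snowball routine i_verb_suffix.
I_VERB_SUFFIXES = (
    "îmes ît îtes i ie ies ir ira irai iraIent irais irait iras irent irez "
    "iriez irions irons iront is issaIent issais issait issant issante "
    "issantes issants isse issent isses issez issiez issions issons it"
).split()


def step_2a(word: str, pv: int):
    """Return (new_word, altered); pv is the index at which RV starts."""
    rv = word[pv:]
    matches = [s for s in I_VERB_SUFFIXES if rv.endswith(s)]
    if not matches:
        return word, False
    suffix = max(matches, key=len)  # only the longest match is considered
    cut = len(word) - len(suffix)
    # The preceding character must lie in RV and be neither a vowel nor H.
    if cut > pv and word[cut - 1] not in VOWELS and word[cut - 1] != "H":
        return word[:cut], True
    return word, False
```

With RV starting at index 2, `maintenir` becomes `mainten` and `mairie` becomes `mair`, in line with the sample vocabulary, while `haHi` (from `haï`) is left alone because the `i` is preceded by H.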
Do step 2b if step 2a was done, but failed to remove a suffix.
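Step 2b: Other verb suffixes
Search for the longest among the following suffixes, and perform the action indicated.
ions
    delete if in R2
é ée ées és èrent er era erai eraIent erais erait eras erez eriez erions erons eront ez iez
    delete
âmes ât âtes a ai aIent ais ait ant ante antes ants as asse assent asses assiez assions
    delete
    if preceded by e, delete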
If the last step to be obeyed (step 1, 2a or 2b) altered the word, do step 3.
Step 3
Replace final Y with i or final ç with c.
Alternatively, if the last step to be obeyed did not alter the word, do step 4.
Step 4: Residual suffix
If the word ends s, not preceded by a, i (unless itself preceded by H), o, u, è or s, delete it.
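The final-s rule can be sketched as follows. The name `drop_final_s` is hypothetical, and the set of letters that protect the s is taken from the `keep_with_s` grouping in the Snowball source.

```python
# Letters after which a final s is kept (the keep_with_s grouping).
KEEP_WITH_S = set("aiouès")


def drop_final_s(word: str) -> str:
    """Delete a final s unless it follows a, i, o, u, è or s; Hi does not protect it."""
    if not word.endswith("s"):
        return word
    before = word[:-1]
    if before.endswith("Hi") or (before and before[-1] not in KEEP_WITH_S):
        return before
    return word
```

So `maisons` loses its s, `mais` and `pas` keep theirs, and `haHis` becomes `haHi` because the `i` is preceded by H.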
In the rest of step 4, all tests are confined to the RV region.
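Search for the longest among the following suffixes, and perform the action indicated.
ion
    delete if in R2 and preceded by s or t
ier ière Ier Ière
    replace with i
e
    delete
(So note that ion is removed only when it is in R2, as well as being in RV, and preceded by s or t, which must be in RV.)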
Always do steps 5 and 6.
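The ordering of the steps can be summarised in a short control-flow sketch. The step functions are passed in as parameters and are placeholders: here step 1 is assumed to return `(word, altered, ment_ending)`, where `ment_ending` reports that amment, emment, ment or ments was found, steps 2a and 2b return `(word, altered)`, and the remaining steps return the word. These signatures are assumptions for the illustration, not part of the algorithm definition.

```python
def run_steps(word, step1, step2a, step2b, step3, step4, step5, step6):
    """Apply steps 1 to 6 in the order described in the text."""
    word, altered, ment_ending = step1(word)
    if not altered or ment_ending:
        # Step 2a runs if step 1 removed nothing, or found a -ment ending.
        word, altered = step2a(word)
        if not altered:
            # Step 2b runs only if step 2a was tried and removed nothing.
            word, altered = step2b(word)
    # Step 3 follows a successful removal; step 4 is the fallback.
    word = step3(word) if altered else step4(word)
    # Steps 5 and 6 are always done.
    return step6(step5(word))
```

The flag `altered` always refers to the last step that was tried, which is what decides between step 3 and step 4.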
Step 5: Undouble
If the word ends enn, onn, ett, ell or eill, delete the last letter.
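A minimal Python sketch of this step, based on the endings tested by the `un_double` routine in the Snowball source; the function name `undouble` is hypothetical.

```python
def undouble(word: str) -> str:
    """Drop the last letter of a word ending enn, onn, ett, ell or eill."""
    if word.endswith(("enn", "onn", "ett", "ell", "eill")):
        return word[:-1]
    return word
```

For example, `continuell` (what remains of `continuelle` once its final e has been removed) becomes `continuel`.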
Step 6: Un-accent
If the word ends é or è followed by at least one non-vowel, remove the accent from the e.
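This step can be sketched in Python as below, following the `un_accent` routine in the Snowball source: one or more trailing non-vowels must be preceded directly by é or è. The function name `un_accent` is reused for convenience, and the vowel set is the `v` grouping from the source.

```python
# Vowel set as defined by the `v` grouping in the Snowball source.
VOWELS = set("aeiouyâàëéêèïîôûù")


def un_accent(word: str) -> str:
    """Replace é or è by e when it is followed only by one or more non-vowels."""
    k = len(word)
    # Skip back over the trailing run of non-vowels.
    while k > 0 and word[k - 1] not in VOWELS:
        k -= 1
    if 0 < k < len(word) and word[k - 1] in "éè":
        return word[:k - 1] + "e" + word[k:]
    return word
```

So `céd` becomes `ced`, whereas `été` is unchanged because no non-vowel follows its final é.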
And finally:
Turn any remaining I, U and Y letters in the word back into lower case.
Turn He and Hi back into ë and ï, and remove any remaining H.
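These two closing operations can be sketched in Python as follows. The function name `postlude` mirrors the Snowball routine of the same name, but the sequence of string replacements is an assumption of this sketch rather than a transcription of that routine.

```python
def postlude(word: str) -> str:
    """Undo the prelude marking: restore ë and ï, drop stray H, lower-case I, U, Y."""
    # Restore the diaeresis forms first, so that any H left over is a stray one.
    word = word.replace("He", "ë").replace("Hi", "ï").replace("H", "")
    return word.replace("I", "i").replace("U", "u").replace("Y", "y")
```

For example, `joU` becomes `jou` and `haHi` becomes `haï`.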
routines (
    elisions
    prelude postlude mark_regions
    RV R1 R2
    standard_suffix
    i_verb_suffix
    verb_suffix
    residual_suffix
    un_double
    un_accent
)
externals ( stem )
integers ( pV p1 p2 )
groupings ( elision_char v keep_with_s )
stringescapes {}
/* special characters */
stringdef a^ '{U+00E2}' // a-circumflex
stringdef a` '{U+00E0}' // a-grave
stringdef cc '{U+00E7}' // c-cedilla
stringdef e" '{U+00EB}' // e-diaeresis (rare)
stringdef e' '{U+00E9}' // e-acute
stringdef e^ '{U+00EA}' // e-circumflex
stringdef e` '{U+00E8}' // e-grave
stringdef i" '{U+00EF}' // i-diaeresis
stringdef i^ '{U+00EE}' // i-circumflex
stringdef o^ '{U+00F4}' // o-circumflex
stringdef u^ '{U+00FB}' // u-circumflex
stringdef u` '{U+00F9}' // u-grave
define v 'aeiouy{a^}{a`}{e"}{e'}{e^}{e`}{i"}{i^}{o^}{u^}{u`}'
// Single character elisions
define elision_char 'cdjlmnst'
define elisions as
(
    [ (elision_char or 'qu') '{'}' ] not atlimit delete
)
define prelude as repeat goto (
    (  v [ ('u' ] v <- 'U') or
           ('i' ] v <- 'I') or
           ('y' ] <- 'Y')
    )
    or
    (  [ '{e"}' ] <- 'He' )
    or
    (  [ '{i"}' ] <- 'Hi' )
    or
    (  ['y'] v <- 'Y' )
    or
    (  'q' ['u'] <- 'U' )
)
define mark_regions as (
    $pV = limit
    $p1 = limit
    $p2 = limit  // defaults
    do (
        ( v v next )
        or
        among ( // this exception list begun Nov 2006
            'par'  // paris, parie, pari
            'col'  // colis
            'tap'  // tapis
            // extensions possible here
        )
        or
        ( next gopast v )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)
define postlude as repeat (
    [substring] among(
        'I'  (<- 'i')
        'U'  (<- 'u')
        'Y'  (<- 'y')
        'He' (<- '{e"}')
        'Hi' (<- '{i"}')
        'H'  (delete)
        ''   (next)
    )
)
backwardmode (
    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor
    define standard_suffix as (
        [substring] among(
            'ance' 'iqUe' 'isme' 'able' 'iste' 'eux'
            'ances' 'iqUes' 'ismes' 'ables' 'istes'
                ( R2 delete )
            'atrice' 'ateur' 'ation'
            'atrices' 'ateurs' 'ations'
                ( R2 delete
                  try ( ['ic'] (R2 delete) or <-'iqU' )
                )
            'logie'
            'logies'
                ( R2 <- 'log' )
            'usion' 'ution'
            'usions' 'utions'
                ( R2 <- 'u' )
            'ence'
            'ences'
                ( R2 <- 'ent' )
            'ement'
            'ements'
                (
                    RV delete
                    try (
                        [substring] among(
                            'iv'   (R2 delete ['at'] R2 delete)
                            'eus'  ((R2 delete) or (R1<-'eux'))
                            'abl' 'iqU'
                                   (R2 delete)
                            'i{e`}r' 'I{e`}r'      //)
                                   (RV <-'i')      //)--new 2 Sept 02
                        )
                    )
                )
            'it{e'}'
            'it{e'}s'
                (
                    R2 delete
                    try (
                        [substring] among(
                            'abil' ((R2 delete) or <-'abl')
                            'ic'   ((R2 delete) or <-'iqU')
                            'iv'   (R2 delete)
                        )
                    )
                )
            'if' 'ive'
            'ifs' 'ives'
                (
                    R2 delete
                    try ( ['at'] R2 delete ['ic'] (R2 delete) or <-'iqU' )
                )
            'eaux' (<- 'eau')
            'aux'  (R1 <- 'al')
            'euse'
            'euses'((R2 delete) or (R1<-'eux'))
            'issement'
            'issements'(R1 non-v delete) // verbal
            // fail(...) below forces entry to verb_suffix. -ment typically
            // follows the p.p., e.g 'confus{e'}ment'.
            'amment'   (RV fail(<- 'ant'))
            'emment'   (RV fail(<- 'ent'))
            'ment'
            'ments'    (test(v RV) fail(delete))
                       // v is e,i,u,{e'},I or U
        )
    )
    define i_verb_suffix as setlimit tomark pV for (
        [substring] among (
            '{i^}mes' '{i^}t' '{i^}tes' 'i' 'ie' 'ies' 'ir' 'ira' 'irai'
            'iraIent' 'irais' 'irait' 'iras' 'irent' 'irez' 'iriez'
            'irions' 'irons' 'iront' 'is' 'issaIent' 'issais' 'issait'
            'issant' 'issante' 'issantes' 'issants' 'isse' 'issent' 'isses'
            'issez' 'issiez' 'issions' 'issons' 'it'
                (not 'H' non-v delete)
        )
    )
    define verb_suffix as setlimit tomark pV for (
        [substring] among (
            'ions'
                (R2 delete)
            '{e'}' '{e'}e' '{e'}es' '{e'}s' '{e`}rent' 'er' 'era' 'erai'
            'eraIent' 'erais' 'erait' 'eras' 'erez' 'eriez' 'erions'
            'erons' 'eront' 'ez' 'iez'
            // 'ons' //-best omitted
                (delete)
            '{a^}mes' '{a^}t' '{a^}tes' 'a' 'ai' 'aIent' 'ais' 'ait' 'ant'
            'ante' 'antes' 'ants' 'as' 'asse' 'assent' 'asses' 'assiez'
            'assions'
                (delete
                 try(['e'] delete)
                )
        )
    )
    define keep_with_s 'aiou{e`}s'
    define residual_suffix as (
        try(['s'] test ('Hi' or non-keep_with_s) delete)
        setlimit tomark pV for (
            [substring] among(
                'ion'           (R2 's' or 't' delete)
                'ier' 'i{e`}re'
                'Ier' 'I{e`}re' (<-'i')
                'e'             (delete)
            )
        )
    )
    define un_double as (
        test among('enn' 'onn' 'ett' 'ell' 'eill') [next] delete
    )
    define un_accent as (
        atleast 1 non-v
        [ '{e'}' or '{e`}' ] <-'e'
    )
)
define stem as (
    do elisions
    do prelude
    do mark_regions
    backwards (
        do (
            (
                ( standard_suffix or
                  i_verb_suffix or
                  verb_suffix
                )
                and
                try( [ ('Y'   ] <- 'i' ) or
                       ('{cc}'] <- 'c' )
                )
            ) or
            residual_suffix
        )
        // try(['ent'] RV delete) // is best omitted
        do un_double
        do un_accent
    )
    do postlude
)