French stemming algorithm

Here is a sample of French vocabulary, with the stemmed forms that will be generated by this algorithm:

word	⇒	stem
continu	⇒	continu
continua	⇒	continu
continuait	⇒	continu
continuant	⇒	continu
continuation	⇒	continu
continue	⇒	continu
continuel	⇒	continuel
continuelle	⇒	continuel
continuellement	⇒	continuel
continuelles	⇒	continuel
continuels	⇒	continuel
continuer	⇒	continu
continuera	⇒	continu
continuerait	⇒	continu
continueront	⇒	continu
continuez	⇒	continu
continuité	⇒	continu
continuons	⇒	continuon
continué	⇒	continu
contorsions	⇒	contors
contour	⇒	contour
contournait	⇒	contourn
contournant	⇒	contourn
contourne	⇒	contourn
contours	⇒	contour
contractait	⇒	contract
contracter	⇒	contract
contractions	⇒	contract
contracté	⇒	contract
contractée	⇒	contract
contractés	⇒	contract
contradictoirement	⇒	contradictoir
contradictoires	⇒	contradictoir
contraindre	⇒	contraindr
contraint	⇒	contraint
contrainte	⇒	contraint
contraintes	⇒	contraint
contraire	⇒	contrair
contraires	⇒	contrair
contraria	⇒	contrari

word	⇒	stem
main	⇒	main
mains	⇒	main
maintenaient	⇒	mainten
maintenait	⇒	mainten
maintenant	⇒	mainten
maintenir	⇒	mainten
maintenue	⇒	maintenu
maintien	⇒	maintien
maintint	⇒	maintint
maire	⇒	mair
maires	⇒	mair
mairie	⇒	mair
mais	⇒	mais
maison	⇒	maison
maisons	⇒	maison
maistre	⇒	maistr
maitre	⇒	maitr
majestueuse	⇒	majestu
majestueusement	⇒	majestu
majestueux	⇒	majestu
majesté	⇒	majest
majeur	⇒	majeur
majeure	⇒	majeur
major	⇒	major
majordome	⇒	majordom
majordomes	⇒	majordom
majorité	⇒	major
majorités	⇒	major
mal	⇒	mal
malacca	⇒	malacc
malade	⇒	malad
malades	⇒	malad
maladie	⇒	malad
maladies	⇒	malad
maladive	⇒	malad
maladresse	⇒	maladress
maladresses	⇒	maladress
maladroit	⇒	maladroit
maladroite	⇒	maladroit
maladroitement	⇒	maladroit

Design Notes

In French the verb endings -ent and -ons cannot be removed without unacceptable overstemming (of accident and garçons for example). The -ons form is rarer, but -ent forms are quite common, and will appear regularly throughout a stemmed vocabulary.

The rule to replace -oux with -ou will produce linguistically incorrect stems by removing x from époux and jaloux, but this is harmless as no unwanted conflation results.

The rule to replace -aux with -al stems travaux to traval and vitraux to vitral, whereas -ail would be linguistically correct; similarly for esquimaux, noyaux and tuyaux, where -au would be more correct. None of these seems to result in unwanted conflation though.

The suffix -eux is removed if in R2. For a few short words (cheveux, jeux, lieux and neveux) removing -x would conflate a plural form with its singular, but the complexity of the rule needed to avoid adversely affecting other short words ending -eux does not seem justified for just four cases of understemming.

The stemming algorithm

Letters in French include the following accented forms:

â à ç ë é ê è ï î ô û ù

The following letters are vowels:

a e i o u y â à ë é ê è ï î ô û ù

The first step removes elisions. If the word starts with one of c d j l m n s t or qu, followed by an apostrophe (') which is not at the end of the word, then remove the start of the word up to and including this apostrophe.
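The elision rule can be sketched with a regular expression. This is a minimal illustration rather than the reference implementation; the name remove_elision is hypothetical, and only the plain ASCII apostrophe is handled, as in the description above.

```python
import re

# One of c d j l m n s t, or "qu", then an apostrophe that is not word-final.
_ELISION = re.compile(r"^(?:[cdjlmnst]|qu)'(?=.)")

def remove_elision(word: str) -> str:
    """Strip a leading elided form such as l' or qu' (hypothetical helper)."""
    return _ELISION.sub("", word)

print(remove_elision("l'avion"))      # avion
print(remove_elision("qu'il"))        # il
print(remove_elision("d'"))           # d'  (apostrophe ends the word, so nothing is removed)
print(remove_elision("aujourd'hui"))  # aujourd'hui  (apostrophe is not part of a listed prefix)
```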

Assume the word is in lower case. Then, taking the letters in turn from the beginning to end of the word, put u or i into upper case when it is both preceded and followed by a vowel; put y into upper case when it is either preceded or followed by a vowel; and put u into upper case when it follows q. For example,

jouer	→	joUer
ennuie	→	ennuIe
yeux	→	Yeux
quand	→	qUand
croyiez	→	croYiez

In the last example, y becomes Y because it is between two vowels, but i does not become I because it is between Y and e, and Y is not defined as a vowel above.

(The upper case forms are not then classed as vowels — see note on vowel marking.)

Replace ë and ï with He and Hi. The H marks the vowel as having originally had a diaeresis, while the vowel itself, lacking an accent, is able to match suffixes beginning in e or i.
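The marking rules above can be sketched in Python as a single left-to-right pass followed by the diaeresis replacement. The function name mark_vowels is hypothetical, and the sketch follows the prose description; it is not guaranteed to agree with the Snowball prelude on every unusual letter sequence.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def mark_vowels(word: str) -> str:
    """Upper-case u, i, y that act as consonants, then rewrite ë and ï (sketch)."""
    chars = list(word)
    for i, ch in enumerate(chars):
        # chars[i - 1] may already be upper case, in which case it is not a vowel.
        prev_v = i > 0 and chars[i - 1] in VOWELS
        next_v = i + 1 < len(chars) and chars[i + 1] in VOWELS
        if ch in "ui" and prev_v and next_v:
            chars[i] = ch.upper()
        elif ch == "y" and (prev_v or next_v):
            chars[i] = "Y"
        elif ch == "u" and i > 0 and chars[i - 1] == "q":
            chars[i] = "U"
    return "".join(chars).replace("ë", "He").replace("ï", "Hi")

for w in ["jouer", "ennuie", "yeux", "quand", "croyiez", "naïve"]:
    print(w, "->", mark_vowels(w))
# jouer -> joUer, ennuie -> ennuIe, yeux -> Yeux,
# quand -> qUand, croyiez -> croYiez, naïve -> naHive
```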

RV is defined as the region to the right of the first of these which is true (in the examples, | marks the start of RV):

The word starts with two vowels followed by another letter - e.g. aim|er
The word starts par - e.g. par|ie
The word starts col - e.g. col|is
The word starts tap - e.g. tap|is
The word starts ni followed by any vowel - e.g. nia|ises
The first vowel not at the beginning of the word - e.g. ado|rer, vo|ler
The end of the word - e.g. arcs| (RV is an empty region at the end of the word)

R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel. (See note on R1 and R2.)

For example:

    f a m e u s e m e n t
         |......R1.......|
               |...R2....|

Note that R1 can contain RV (adorer), and RV can contain R1 (voler).
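The region definitions can be sketched as follows. The helper names are hypothetical, and the sketch assumes the word has already had its consonantal u, i and y put into upper case (so they do not count as vowels); the examples use words where that marking changes nothing.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word: str) -> int:
    """Index at which RV begins, trying the cases in the order listed."""
    n = len(word)
    if n >= 3 and word[0] in VOWELS and word[1] in VOWELS:
        return 3
    if word.startswith(("par", "col", "tap")):
        return 3
    if n >= 3 and word.startswith("ni") and word[2] in VOWELS:
        return 3
    for i in range(1, n):          # first vowel not at the beginning
        if word[i] in VOWELS:
            return i + 1
    return n                       # empty region at the end of the word

def past_vowel_then_non_vowel(word: str, start: int) -> int:
    """Index just after the first non-vowel following a vowel, else len(word)."""
    n, i = len(word), start
    while i < n and word[i] not in VOWELS:
        i += 1
    while i < n and word[i] in VOWELS:
        i += 1
    return min(i + 1, n)

def regions(word: str):
    """Return the (RV, R1, R2) substrings of word."""
    p1 = past_vowel_then_non_vowel(word, 0)
    p2 = past_vowel_then_non_vowel(word, p1)
    return word[rv_start(word):], word[p1:], word[p2:]

print(regions("fameusement"))  # ('meusement', 'eusement', 'ement')
print(regions("adorer"))       # ('rer', 'orer', 'er')
print(regions("voler"))        # ('ler', 'er', '')
```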

Below, ‘delete if in R2’ means that a found suffix should be removed if it lies entirely in R2, but not if it overlaps R2 and the rest of the word. ‘delete if in R1 and preceded by X’ means that X itself does not have to come in R1, while ‘delete if preceded by X in R1’ means that X, like the suffix, must be entirely in R1.
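The "in R2" test can be pictured as a comparison of offsets. A minimal sketch, assuming each region is represented by the index at which it starts (the function name and this representation are illustrative only):

```python
def suffix_in_region(word: str, suffix: str, region_start: int) -> bool:
    """True if word ends with suffix and the suffix lies entirely in the region."""
    return word.endswith(suffix) and len(word) - len(suffix) >= region_start

# For fameusement, R1 starts at index 3 and R2 at index 6 (see the diagram above).
word = "fameusement"
print(suffix_in_region(word, "ement", 6))     # True: ement lies entirely in R2
print(suffix_in_region(word, "eusement", 6))  # False: it overlaps R2 and the rest of the word
print(suffix_in_region(word, "eusement", 3))  # True: it lies entirely in R1
```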

Start with step 1

Step 1: Standard suffix removal

Search for the longest among the following suffixes, and perform the action indicated.

ance iqUe isme able iste eux ances iqUes ismes ables istes: delete if in R2
atrice ateur ation atrices ateurs ations: delete if in R2; if preceded by ic, delete if in R2, else replace by iqU
logie logies: replace with log if in R2
usion ution usions utions: replace with u if in R2
ence ences: replace with ent if in R2
ement ements: delete if in RV; then, if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2); otherwise, if preceded by eus, delete if in R2, else replace by eux if in R1; otherwise, if preceded by abl or iqU, delete if in R2; otherwise, if preceded by ièr or Ièr, replace by i if in RV
ité ités: delete if in R2; then, if preceded by abil, delete if in R2, else replace by abl; otherwise, if preceded by ic, delete if in R2, else replace by iqU; otherwise, if preceded by iv, delete if in R2
if ive ifs ives: delete if in R2; if preceded by at, delete if in R2 (and if further preceded by ic, delete if in R2, else replace by iqU)
eaux: replace with eau
aux: replace with al if in R1
oux: replace with ou if preceded by one of b h j l n p
euse euses: delete if in R2, else replace by eux if in R1
issement issements: delete if in R1 and preceded by a non-vowel
amment: replace with ant if in RV
emment: replace with ent if in RV
ment ments: delete if preceded by a vowel in RV

Do step 2a if either no ending was removed by step 1, or one of the endings amment, emment, ment, ments was found.
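To illustrate the "longest suffix, then region test" pattern of step 1, here is a sketch covering only the logie, usion/ution, ence, eaux, aux and oux rules. All names are hypothetical, every other step 1 rule is deliberately left out, and the input is assumed to be already marked, so this is not a usable step 1 on its own.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def _past_vowel_then_non_vowel(word, start):
    """Index just after the first non-vowel following a vowel, else len(word)."""
    n, i = len(word), start
    while i < n and word[i] not in VOWELS:
        i += 1
    while i < n and word[i] in VOWELS:
        i += 1
    return min(i + 1, n)

def step1_subset(word):
    """Apply a handful of step 1 rules (illustrative subset only)."""
    p1 = _past_vowel_then_non_vowel(word, 0)    # start of R1
    p2 = _past_vowel_then_non_vowel(word, p1)   # start of R2
    in_r1 = lambda cut: cut >= p1
    in_r2 = lambda cut: cut >= p2
    # (suffix, replacement, condition on the index where the suffix starts)
    rules = [(s, "log", in_r2) for s in ("logie", "logies")]
    rules += [(s, "u", in_r2) for s in ("usion", "ution", "usions", "utions")]
    rules += [(s, "ent", in_r2) for s in ("ence", "ences")]
    rules += [
        ("eaux", "eau", lambda cut: True),
        ("aux", "al", in_r1),
        ("oux", "ou", lambda cut: cut > 0 and word[cut - 1] in "bhjlnp"),
    ]
    matches = [r for r in rules if word.endswith(r[0])]
    if not matches:
        return word
    suffix, replacement, allowed = max(matches, key=lambda r: len(r[0]))
    cut = len(word) - len(suffix)
    return word[:cut] + replacement if allowed(cut) else word

for w in ["travaux", "bateaux", "époux", "doux", "différence"]:
    print(w, "->", step1_subset(w))
# travaux -> traval, bateaux -> bateau, époux -> épou,
# doux -> doux, différence -> différent
```

Note that only the longest matching suffix is considered: if its condition fails, no shorter suffix is tried.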

Step 2a: Verb suffixes beginning i

Search for the longest among the following suffixes in RV and if found, delete if the preceding character is also in RV and is neither a vowel nor H.

îmes ît îtes i ie ies ir ira irai iraIent irais irait iras irent irez iriez irions irons iront is issaIent issais issait issant issante issantes issants isse issent isses issez issiez issions issons it

Do step 2b if step 2a was done, but failed to remove a suffix.
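Step 2a can be sketched as below. The function names are hypothetical, the input is assumed to be already marked (so I, U, Y and H count as non-vowels), and the suffix list is copied from the step above.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

I_SUFFIXES = """îmes ît îtes i ie ies ir ira irai iraIent irais irait iras irent
irez iriez irions irons iront is issaIent issais issait issant issante issantes
issants isse issent isses issez issiez issions issons it""".split()

def rv_start(word):
    """Index at which RV begins (see the definition of RV)."""
    n = len(word)
    if n >= 3 and word[0] in VOWELS and word[1] in VOWELS:
        return 3
    if word.startswith(("par", "col", "tap")):
        return 3
    if n >= 3 and word.startswith("ni") and word[2] in VOWELS:
        return 3
    for i in range(1, n):
        if word[i] in VOWELS:
            return i + 1
    return n

def step_2a(word):
    """Delete the longest i-verb suffix in RV if a non-vowel other than H precedes it in RV."""
    rv = rv_start(word)
    in_rv = [s for s in I_SUFFIXES
             if word.endswith(s) and len(word) - len(s) >= rv]
    if not in_rv:
        return word
    cut = len(word) - len(max(in_rv, key=len))
    before = cut - 1
    if before >= rv and word[before] not in VOWELS and word[before] != "H":
        return word[:cut]
    return word

for w in ["finissons", "finir", "partir", "ami"]:
    print(w, "->", step_2a(w))
# finissons -> fin, finir -> fin, partir -> part, ami -> ami
```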

Step 2b: Other verb suffixes

Search for the longest among the following suffixes in RV, and perform the action indicated.

ions: delete if in R2
é ée ées és èrent er era erai eraIent erais erait eras erez eriez erions erons eront ez iez: delete
âmes ât âtes a ai aIent ait ant ante antes ants as asse assent asses assiez assions: delete; if preceded by e which is also in RV, delete the e as well
ais aise aises: delete unless preceded by one of: al preceded by exactly one character, auv, épl

If the last step to be obeyed — either step 1, 2a or 2b — altered the word, do step 3

Step 3

Replace final Y with i or final ç with c
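Step 3 is small enough to sketch directly; the function name is hypothetical and the input is assumed to be a word already altered by step 1, 2a or 2b.

```python
def step_3(word: str) -> str:
    """Replace a final Y with i, or a final ç with c."""
    if word.endswith("Y"):
        return word[:-1] + "i"
    if word.endswith("ç"):
        return word[:-1] + "c"
    return word

print(step_3("essaY"))     # essai
print(step_3("commenç"))   # commenc
print(step_3("contract"))  # contract
```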

Alternatively, if the last step to be obeyed did not alter the word, do step 4

Step 4: Residual suffix

If the word ends s, not preceded by a, i (unless itself preceded by H), o, u, è or s, delete it.

In the rest of step 4, all tests are confined to the RV region.

Search for the longest among the following suffixes, and perform the action indicated.

ion: delete if in R2 and preceded by s or t
ier ière Ier Ière: replace with i
e: delete

(So note that ion is removed only when it is in R2 — as well as being in RV — and preceded by s or t which must be in RV.)
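Step 4 can be sketched as below, under the assumption that RV and R2 are computed once on the marked word before any suffix is removed. The helper names are hypothetical, and the sketch covers step 4 only, so it should only be applied to words that steps 1, 2a and 2b left unaltered.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def _past_vowel_then_non_vowel(word, start):
    n, i = len(word), start
    while i < n and word[i] not in VOWELS:
        i += 1
    while i < n and word[i] in VOWELS:
        i += 1
    return min(i + 1, n)

def rv_start(word):
    n = len(word)
    if n >= 3 and word[0] in VOWELS and word[1] in VOWELS:
        return 3
    if word.startswith(("par", "col", "tap")):
        return 3
    if n >= 3 and word.startswith("ni") and word[2] in VOWELS:
        return 3
    for i in range(1, n):
        if word[i] in VOWELS:
            return i + 1
    return n

def step_4(word):
    """Residual suffix removal (sketch)."""
    rv = rv_start(word)
    p2 = _past_vowel_then_non_vowel(word, _past_vowel_then_non_vowel(word, 0))
    # Final s: kept after a, i, o, u, è or s, except that Hi does not protect it.
    if word.endswith("s") and len(word) > 1:
        if word[:-1].endswith("Hi") or word[-2] not in "aiouès":
            word = word[:-1]
    # The remaining tests are confined to RV; only the longest match is used.
    for suffix in sorted(["ion", "ier", "ière", "Ier", "Ière", "e"],
                         key=len, reverse=True):
        cut = len(word) - len(suffix)
        if word.endswith(suffix) and cut >= rv:
            if suffix == "ion":
                if cut >= p2 and cut - 1 >= rv and word[cut - 1] in "st":
                    word = word[:cut]
            elif suffix == "e":
                word = word[:cut]
            else:
                word = word[:cut] + "i"
            break
    return word

for w in ["maisons", "maire", "mais", "majordomes", "dimension", "premier"]:
    print(w, "->", step_4(w))
# maisons -> maison, maire -> mair, mais -> mais,
# majordomes -> majordom, dimension -> dimens, premier -> premi
```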

Always do steps 5 and 6.

Step 5: Undouble

If the word ends enn, onn, ett, ell or eill, delete the last letter
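A minimal sketch of the undoubling step (the function name is hypothetical):

```python
def undouble(word: str) -> str:
    """Drop the last letter of a word ending enn, onn, ett, ell or eill."""
    if word.endswith(("enn", "onn", "ett", "ell", "eill")):
        return word[:-1]
    return word

print(undouble("continuell"))  # continuel
print(undouble("abandonn"))    # abandon
print(undouble("maison"))      # maison
```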

Step 6: Un-accent

If the word ends é or è followed by at least one non-vowel, remove the accent from the e.
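The un-accent step can be sketched with a regular expression that requires the accented e to be followed only by non-vowels up to the end of the word. The function name is hypothetical.

```python
import re

_VOWELS = "aeiouyâàëéêèïîôûù"
# é or è, then one or more non-vowels, then the end of the word.
_FINAL_ACCENT = re.compile(rf"[éè]([^{_VOWELS}]+)$")

def un_accent(word: str) -> str:
    """Remove the accent from a final é or è that is followed by non-vowels."""
    return _FINAL_ACCENT.sub(r"e\1", word)

print(un_accent("répét"))    # répet
print(un_accent("célèbr"))   # célebr
print(un_accent("majesté"))  # majesté  (no non-vowel follows the é)
```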

And finally:

Turn any remaining I, U and Y letters in the word back into lower case.

Turn He and Hi back into ë and ï, and remove any remaining H.
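The final clean-up can be sketched as a sequence of string replacements. The function name is hypothetical; He and Hi are restored before any stray H is removed, so that the longer markers take priority.

```python
def postlude(word: str) -> str:
    """Undo the marking applied before stemming."""
    word = word.replace("He", "ë").replace("Hi", "ï").replace("H", "")
    return word.replace("I", "i").replace("U", "u").replace("Y", "y")

print(postlude("joUer"))    # jouer
print(postlude("naHiv"))    # naïv
print(postlude("croYiez"))  # croyiez
```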

History of functional changes to the algorithm

September 2002: New rule for -ièr
Snowball 2.0.0: Suffixes that begin with a diaeresis are now removed (done by replacing ë and ï with He and Hi during stemming, then undoing this afterwards).
Snowball 3.0.0: Added a new first step which removes elisions.
Snowball 3.0.0: Added rule to replace -oux with -ou.
Snowball 3.0.0: Added RV exceptions for ni followed by a vowel.
Snowball 3.0.0: Restrict removal of -ais; remove -aise and -aises.

The same algorithm in Snowball

routines (
           elisions
           prelude postlude mark_regions
           RV R1 R2
           standard_suffix
           i_verb_suffix
           verb_suffix
           residual_suffix
           un_double
           un_accent
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( elision_char v keep_with_s oux_ending )

stringescapes {}

/* special characters */

stringdef a^   '{U+00E2}'  // a-circumflex
stringdef a`   '{U+00E0}'  // a-grave
stringdef cc   '{U+00E7}'  // c-cedilla

stringdef e"   '{U+00EB}'  // e-diaeresis (rare)
stringdef e'   '{U+00E9}'  // e-acute
stringdef e^   '{U+00EA}'  // e-circumflex
stringdef e`   '{U+00E8}'  // e-grave
stringdef i"   '{U+00EF}'  // i-diaeresis
stringdef i^   '{U+00EE}'  // i-circumflex
stringdef o^   '{U+00F4}'  // o-circumflex
stringdef u^   '{U+00FB}'  // u-circumflex
stringdef u`   '{U+00F9}'  // u-grave

define v 'aeiouy{a^}{a`}{e"}{e'}{e^}{e`}{i"}{i^}{o^}{u^}{u`}'

// Replace -oux with -ou if preceded by one of these letters.
define oux_ending 'bhjlnp'

// Single character elisions
define elision_char 'cdjlmnst'

define elisions as
(
    [ (elision_char or 'qu') '{'}' ] not atlimit delete
)

define prelude as repeat goto (

    (  v [ ('u' ] v <- 'U') or
           ('i' ] v <- 'I') or
           ('y' ] <- 'Y')
    )
    or
    (  [ '{e"}' ] <- 'He' )
    or
    (  [ '{i"}' ] <- 'Hi' )
    or
    (  ['y'] v <- 'Y' )
    or
    (  'q' ['u'] <- 'U' )
)

define mark_regions as (

    $pV = limit
    $p1 = limit
    $p2 = limit  // defaults

    do (
        ( v v next )
        or
        among ( // Exception list:
            'par'    // paris, parie, pari
            'col'    // colis
            'tap'    // tapis
                ()
            'ni' (v) // niais/nierais/nié/niâmes/nièrent
            // extensions possible here
        )
        or
        ( next gopast v )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (

    [substring] among(
        'I' (<- 'i')
        'U' (<- 'u')
        'Y' (<- 'y')
        'He' (<- '{e"}')
        'Hi' (<- '{i"}')
        'H' (delete)
        ''  (next)
    )
)

backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define standard_suffix as (
        [substring] among(

            'ance' 'iqUe' 'isme' 'able' 'iste' 'eux'
            'ances' 'iqUes' 'ismes' 'ables' 'istes'
               ( R2 delete )
            'atrice' 'ateur' 'ation'
            'atrices' 'ateurs' 'ations'
               ( R2 delete
                 try ( ['ic'] (R2 delete) or <-'iqU' )
               )
            'logie'
            'logies'
               ( R2 <- 'log' )
            'usion' 'ution'
            'usions' 'utions'
               ( R2 <- 'u' )
            'ence'
            'ences'
               ( R2 <- 'ent' )
            'ement'
            'ements'
            (
                RV delete
                try (
                    [substring] among(
                        'iv'   (R2 delete ['at'] R2 delete)
                        'eus'  ((R2 delete) or (R1<-'eux'))
                        'abl' 'iqU'
                               (R2 delete)
                        'i{e`}r' 'I{e`}r'
                               (RV <-'i')
                    )
                )
            )
            'it{e'}'
            'it{e'}s'
            (
                R2 delete
                try (
                    [substring] among(
                        'abil' ((R2 delete) or <-'abl')
                        'ic'   ((R2 delete) or <-'iqU')
                        'iv'   (R2 delete)
                    )
                )
            )
            'if' 'ive'
            'ifs' 'ives'
            (
                R2 delete
                try ( ['at'] R2 delete ['ic'] (R2 delete) or <-'iqU' )
            )
            'eaux' (<- 'eau')
            'aux'  (R1 <- 'al')
            'oux'  (oux_ending <- 'ou')
            'euse'
            'euses'((R2 delete) or (R1<-'eux'))

            'issement'
            'issements'(R1 non-v delete) // verbal

            // fail(...) below forces entry to verb_suffix. -ment typically
            // follows the p.p., e.g 'confus{e'}ment'.

            'amment'   (RV fail(<- 'ant'))
            'emment'   (RV fail(<- 'ent'))
            'ment'
            'ments'    (test(v RV) fail(delete))
                       // v is e,i,u,{e'},I or U
        )
    )

    define i_verb_suffix as setlimit tomark pV for (
        [substring] among (
            '{i^}mes' '{i^}t' '{i^}tes' 'i' 'ie' 'ies' 'ir' 'ira' 'irai'
            'iraIent' 'irais' 'irait' 'iras' 'irent' 'irez' 'iriez'
            'irions' 'irons' 'iront' 'is' 'issaIent' 'issais' 'issait'
            'issant' 'issante' 'issantes' 'issants' 'isse' 'issent' 'isses'
            'issez' 'issiez' 'issions' 'issons' 'it'
                (not 'H' non-v delete)
        )
    )

    define verb_suffix as (
        setlimit tomark pV for ([substring])
        among (
            'ions'
                (R2 delete)

            '{e'}' '{e'}e' '{e'}es' '{e'}s' '{e`}rent' 'er' 'era' 'erai'
            'eraIent' 'erais' 'erait' 'eras' 'erez' 'eriez' 'erions'
            'erons' 'eront' 'ez' 'iez'

            // 'ons' //-best omitted

                (delete)

            '{a^}mes' '{a^}t' '{a^}tes' 'a' 'ai' 'aIent' 'ait' 'ant'
            'ante' 'antes' 'ants' 'as' 'asse' 'assent' 'asses' 'assiez'
            'assions'
                ( try('e' RV]) delete )

            'ais' 'aise' 'aises'
                ( not among (
                      'al'     // balais, calais, galais, malais, palais, valais
                          (next atlimit)
                      'auv'    // mauvais
                      '{e'}pl' // déplais
                          ()
                  )
                  delete )
            'eais'
                (delete)
        )
    )

    define keep_with_s 'aiou{e`}s'

    define residual_suffix as (
        try(['s'] test ('Hi' or non-keep_with_s) delete)
        setlimit tomark pV for (
            [substring] among(
                'ion'           (R2 's' or 't' delete)
                'ier' 'i{e`}re'
                'Ier' 'I{e`}re' (<-'i')
                'e'             (delete)
            )
        )
    )

    define un_double as (
        test among('enn' 'onn' 'ett' 'ell' 'eill') [next] delete
    )

    define un_accent as (
        atleast 1 non-v
        [ '{e'}' or '{e`}' ] <-'e'
    )
)

define stem as (

    do elisions
    do prelude
    do mark_regions
    backwards (

        do (
            (
                 ( standard_suffix or
                   i_verb_suffix or
                   verb_suffix
                 )
                 and
                 try( [ ('Y'   ] <- 'i' ) or
                        ('{cc}'] <- 'c' )
                 )
            ) or
            residual_suffix
        )

        // try(['ent'] RV delete) // is best omitted

        do un_double
        do un_accent
    )
    do postlude
)