Spanish stemming algorithm

Links to resources

Here is a sample of Spanish vocabulary, with the stemmed forms that will be generated by this algorithm:

word stem          word stem
che
checa
checar
checo
checoslovaquia
chedraoui
chefs
cheliabinsk
chelo
chemical
chemicalweek
chemise
chepo
cheque
chequeo
cheques
cheraw
chesca
chester
chetumal
chetumaleños
chevrolet
cheyene
cheyenne
chi
chía
chiapaneca
chiapas
chiba
chic
chica
chicago
chicana
chicano
chicas
chicharrones
chichen
chichimecas
chicles
chico
che
chec
chec
chec
checoslovaqui
chedraoui
chefs
cheliabinsk
chel
chemical
chemicalweek
chemis
chep
chequ
cheque
chequ
cheraw
chesc
chest
chetumal
chetumaleñ
chevrolet
cheyen
cheyenn
chi
chi
chiapanec
chiap
chib
chic
chic
chicag
chican
chican
chic
chicharron
chich
chichimec
chicl
chic
torá
tórax
torcer
toreado
toreados
toreándolo
torear
toreara
torearlo
toreó
torero
toreros
torio
tormenta
tormentas
tornado
tornados
tornar
tornen
torneo
torneos
tornillo
tornillos
torniquete
torno
toro
toronto
toros
torpedearon
torpeza
torrado
torralba
torre
torrencial
torrenciales
torrente
torreon
torreón
torres
torrescano
tor
torax
torc
tor
tor
tor
tor
tor
tor
tore
torer
torer
tori
torment
torment
torn
torn
torn
torn
torne
torne
tornill
tornill
torniquet
torn
tor
toront
tor
torped
torpez
torr
torralb
torr
torrencial
torrencial
torrent
torreon
torreon
torr
torrescan

The stemming algorithm

Letters in Spanish include the following accented forms,

á   é   í   ó   ú   ü   ñ

The following letters are vowels:

a   e   i   o   u   á   é   í   ó   ú   ü

R2 is defined in the usual way — see the note on R1 and R2.

RV is defined as follows (and this is not the same as the French stemmer definition):

If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter. But RV is the end of the word if these positions cannot be found.

For example,

    m a c h o     o l i v a     t r a b a j o     á u r e o
         |...|         |...|         |.......|         |...|

Always do steps 0 and 1.

Step 0: Attached pronoun

Search for the longest among the following suffixes
me   se   sela   selo   selas   selos   la   le   lo   las   les   los   nos

and delete it, if comes after one of
(a) iéndo   ándo   ár   ér   ír
(b) ando   iendo   ar   er   ir
(c) yendo following u

in RV. In the case of (c), yendo must lie in RV, but the preceding u can be outside it.

In the case of (a), deletion is followed by removing the acute accent (for example, haciéndolahaciendo).

Step 1: Standard suffix removal

Search for the longest among the following suffixes, and perform the action indicated.
anza   anzas   ico   ica   icos   icas   ismo   ismos   able   ables   ible   ibles   ista   istas   oso   osa   osos   osas   amiento   amientos   imiento   imientos
delete if in R2
adora   ador   ación   adoras   adores   aciones   ante   antes   ancia   ancias
delete if in R2
if preceded by ic, delete if in R2
logía   logías
replace with log if in R2
ución   uciones
replace with u if in R2
encia   encias
replace with ente if in R2
amente
delete if in R1
if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
if preceded by os, ic or ad, delete if in R2
mente
delete if in R2
if preceded by ante, able or ible, delete if in R2
idad   idades
delete if in R2
if preceded by abil, ic or iv, delete if in R2
iva   ivo   ivas   ivos
delete if in R2
if preceded by at, delete if in R2

Do step 2a if no ending was removed by step 1.

Step 2a: Verb suffixes beginning y

Search for the longest among the following suffixes in RV, and if found, delete if preceded by u.
ya   ye   yan   yen   yeron   yendo   yo   yó   yas   yes   yais   yamos
(Note that the preceding u need not be in RV.)

Do Step 2b if step 2a was done, but failed to remove a suffix.

Step 2b: Other verb suffixes

Search for the longest among the following suffixes in RV, and perform the action indicated.
en   es   éis   emos
delete, and if preceded by gu delete the u (the gu need not be in RV)
arían   arías   arán   arás   aríais   aría   aréis   aríamos   aremos   ará   aré   erían   erías   erán   erás   eríais   ería   eréis   eríamos   eremos   erá   eré   irían   irías   irán   irás   iríais   iría   iréis   iríamos   iremos   irá   iré   aba   ada   ida   ía   ara   iera   ad   ed   id   ase   iese   aste   iste   an   aban   ían   aran   ieran   asen   iesen   aron   ieron   ado   ido   ando   iendo   ió   ar   er   ir   as   abas   adas   idas   ías   aras   ieras   ases   ieses   ís   áis   abais   íais   arais   ierais     aseis   ieseis   asteis   isteis   ados   idos   amos   ábamos   íamos   imos   áramos   iéramos   iésemos   ásemos
delete

Always do step 3.

Step 3: residual suffix

Search for the longest among the following suffixes in RV, and perform the action indicated.
os   a   o   á   í   ó
delete if in RV
e   é
delete if in RV, and if preceded by gu with the u in RV delete the u

And finally:

Remove acute accents

The same algorithm in Snowball

routines (
           postlude mark_regions
           RV R1 R2
           attached_pronoun
           standard_suffix
           y_verb_suffix
           verb_suffix
           residual_suffix
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( v )

stringescapes {}

/* special characters */

stringdef a'   '{U+00E1}'  // a-acute
stringdef e'   '{U+00E9}'  // e-acute
stringdef i'   '{U+00ED}'  // i-acute
stringdef o'   '{U+00F3}'  // o-acute
stringdef u'   '{U+00FA}'  // u-acute
stringdef u"   '{U+00FC}'  // u-diaeresis
stringdef n~   '{U+00F1}'  // n-tilde

define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'

define mark_regions as (

    $pV = limit
    $p1 = limit
    $p2 = limit  // defaults

    do (
        ( v (non-v gopast v) or (v gopast non-v) )
        or
        ( non-v (non-v gopast v) or (v next) )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (
    [substring] among(
        '{a'}' (<- 'a')
        '{e'}' (<- 'e')
        '{i'}' (<- 'i')
        '{o'}' (<- 'o')
        '{u'}' (<- 'u')
        // and possibly {u"}->u here, or in prelude
        ''     (next)
    ) //or next
)

backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define attached_pronoun as (
        [substring] among(
            'me' 'se'  'sela' 'selo' 'selas' 'selos' 'la' 'le' 'lo'
            'las' 'les' 'los' 'nos'
        )
        substring RV among(
            'i{e'}ndo' (] <- 'iendo')
            '{a'}ndo'  (] <- 'ando')
            '{a'}r'    (] <- 'ar')
            '{e'}r'    (] <- 'er')
            '{i'}r'    (] <- 'ir')
            'ando'
            'iendo'
            'ar' 'er' 'ir'
                       (delete)
            'yendo'    ('u' delete)
        )
    )

    define standard_suffix as (
        [substring] among(

            'anza' 'anzas'
            'ico' 'ica' 'icos' 'icas'
            'ismo' 'ismos'
            'able' 'ables'
            'ible' 'ibles'
            'ista' 'istas'
            'oso' 'osa' 'osos' 'osas'
            'amiento' 'amientos'
            'imiento' 'imientos'
            (
                R2 delete
            )
            'adora' 'ador' 'aci{o'}n'
            'adoras' 'adores' 'aciones'
            'ante' 'antes' 'ancia' 'ancias'// Note 1
            (
                R2 delete
                try ( ['ic'] R2 delete )
            )
            'log{i'}a'
            'log{i'}as'
            (
                R2 <- 'log'
            )
            'uci{o'}n' 'uciones'
            (
                R2 <- 'u'
            )
            'encia' 'encias'
            (
                R2 <- 'ente'
            )
            'amente'
            (
                R1 delete
                try (
                    [substring] R2 delete among(
                        'iv' (['at'] R2 delete)
                        'os'
                        'ic'
                        'ad'
                    )
                )
            )
            'mente'
            (
                R2 delete
                try (
                    [substring] among(
                        'ante' // Note 1
                        'able'
                        'ible' (R2 delete)
                    )
                )
            )
            'idad'
            'idades'
            (
                R2 delete
                try (
                    [substring] among(
                        'abil'
                        'ic'
                        'iv'   (R2 delete)
                    )
                )
            )
            'iva' 'ivo'
            'ivas' 'ivos'
            (
                R2 delete
                try (
                    ['at'] R2 delete // but not a further   ['ic'] R2 delete
                )
            )
        )
    )

    define y_verb_suffix as (
        setlimit tomark pV for ([substring]) among(
            'ya' 'ye' 'yan' 'yen' 'yeron' 'yendo' 'yo' 'y{o'}'
            'yas' 'yes' 'yais' 'yamos'
                ('u' delete)
        )
    )

    define verb_suffix as (
        setlimit tomark pV for ([substring]) among(

            'en' 'es' '{e'}is' 'emos'
                (try ('u' test 'g') ] delete)

            'ar{i'}an' 'ar{i'}as' 'ar{a'}n' 'ar{a'}s' 'ar{i'}ais'
            'ar{i'}a' 'ar{e'}is' 'ar{i'}amos' 'aremos' 'ar{a'}'
            'ar{e'}'
            'er{i'}an' 'er{i'}as' 'er{a'}n' 'er{a'}s' 'er{i'}ais'
            'er{i'}a' 'er{e'}is' 'er{i'}amos' 'eremos' 'er{a'}'
            'er{e'}'
            'ir{i'}an' 'ir{i'}as' 'ir{a'}n' 'ir{a'}s' 'ir{i'}ais'
            'ir{i'}a' 'ir{e'}is' 'ir{i'}amos' 'iremos' 'ir{a'}'
            'ir{e'}'

            'aba' 'ada' 'ida' '{i'}a' 'ara' 'iera' 'ad' 'ed'
            'id' 'ase' 'iese' 'aste' 'iste' 'an' 'aban' '{i'}an'
            'aran' 'ieran' 'asen' 'iesen' 'aron' 'ieron' 'ado'
            'ido' 'ando' 'iendo' 'i{o'}' 'ar' 'er' 'ir' 'as'
            'abas' 'adas' 'idas' '{i'}as' 'aras' 'ieras' 'ases'
            'ieses' '{i'}s' '{a'}is' 'abais' '{i'}ais' 'arais'
            'ierais'  'aseis' 'ieseis' 'asteis' 'isteis' 'ados'
            'idos' 'amos' '{a'}bamos' '{i'}amos' 'imos'
            '{a'}ramos' 'i{e'}ramos' 'i{e'}semos' '{a'}semos'
                (delete)
        )
    )

    define residual_suffix as (
        [substring] among(
            'os'
            'a' 'o' '{a'}' '{i'}' '{o'}'
                ( RV delete )
            'e' '{e'}'
                ( RV delete try( ['u'] test 'g' RV delete ) )
        )
    )
)

define stem as (
    do mark_regions
    backwards (
        do attached_pronoun
        do ( standard_suffix or
             y_verb_suffix or
             verb_suffix
           )
        do residual_suffix
    )
    do postlude
)

/*
    Note 1: additions of 15 Jun 2005
*/