Portuguese stemming algorithm

Links to resources

Here is a sample of Portuguese vocabulary, with the stemmed forms that will be generated by this algorithm:

word stem          word stem
boa
boainain
boas
bôas
boassu
boataria
boate
boates
boatos
bob
boba
bobagem
bobagens
bobalhões
bobear
bobeira
bobinho
bobinhos
bobo
bobs
boca
bocadas
bocadinho
bocado
bocaiúva
boçal
bocarra
bocas
bode
bodoque
body
boeing
boem
boemia
boêmio
boêmios
bogotá
boi
bóia
boiando
boa
boainain
boas
bôas
boassu
boat
boat
boat
boat
bob
bob
bobag
bobagens
bobalhõ
bob
bobeir
bobinh
bobinh
bob
bobs
boc
boc
bocadinh
boc
bocaiúv
boçal
bocarr
boc
bod
bodoqu
body
boeing
boem
boem
boêmi
boêmi
bogot
boi
bói
boi
quiabo
quicaram
quickly
quieto
quietos
quilate
quilates
quilinhos
quilo
quilombo
quilométricas
quilométricos
quilômetro
quilômetros
quilos
química
químicas
químico
químicos
quimioterapia
quimioterápicos
quimono
quincas
quinhão
quinhentos
quinn
quino
quinta
quintal
quintana
quintanilha
quintão
quintessência
quintino
quinto
quintos
quintuplicou
quinze
quinzena
quiosque
quiab
quic
quickly
quiet
quiet
quilat
quilat
quilinh
quil
quilomb
quilométr
quilométr
quilômetr
quilômetr
quil
químic
químic
químic
químic
quimioterap
quimioteráp
quimon
quinc
quinhã
quinhent
quinn
quin
quint
quintal
quintan
quintanilh
quintã
quintessent
quintin
quint
quint
quintuplic
quinz
quinzen
quiosqu

The stemming algorithm

Letters in Portuguese include the following accented forms,

á   é   í   ó   ú   â   ê   ô   ç   ã   õ   ü
The following letters are vowels:
a   e   i   o   u   á   é   í   ó   ú   â   ê   ô
And the two nasalised vowel forms,
ã   õ

should be treated as a vowel followed by a consonant.

ã and õ are therefore replaced by a~ and o~ in the word, where ~ is a separate character to be treated as a consonant. And then —

R2 (see the note on R1 and R2) and RV have the same definition as in the Spanish stemmer.

Always do step 1.

Step 1: Standard suffix removal

Search for the longest among the following suffixes, and perform the action indicated.
eza   ezas   ico   ica   icos   icas   ismo   ismos   ável   ível   ista   istas   oso   osa   osos   osas   amento   amentos   imento   imentos   adora   ador   aça~o   adoras   adores   aço~es   ante   antes   ância
delete if in R2
logia   logias
replace with log if in R2
ução   uções
replace with u if in R2
ência   ências
replace with ente if in R2
amente
delete if in R1
if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
if preceded by os, ic or ad, delete if in R2
mente
delete if in R2
if preceded by ante, avel or ível, delete if in R2
idade   idades
delete if in R2
if preceded by abil, ic or iv, delete if in R2
iva   ivo   ivas   ivos
delete if in R2
if preceded by at, delete if in R2
ira   iras
replace with ir if in RV and preceded by e

Do step 2 if no ending was removed by step 1.

Step 2: Verb suffixes

Search for the longest among the following suffixes in RV, and if found, delete.
ada   ida   ia   aria   eria   iria   ará   ara   erá   era   irá   ava   asse   esse   isse   aste   este   iste   ei   arei   erei   irei   am   iam   ariam   eriam   iriam   aram   eram   iram   avam   em   arem   erem   irem   assem   essem   issem   ado   ido   ando   endo   indo   ara~o   era~o   ira~o   ar   er   ir   as   adas   idas   ias   arias   erias   irias   arás   aras   erás   eras   irás   avas   es   ardes   erdes   irdes   ares   eres   ires   asses   esses   isses   astes   estes   istes   is   ais   eis   íeis   aríeis   eríeis   iríeis   áreis   areis   éreis   ereis   íreis   ireis   ásseis   ésseis   ísseis   áveis   ados   idos   ámos   amos   íamos   aríamos   eríamos   iríamos   áramos   éramos   íramos   ávamos   emos   aremos   eremos   iremos   ássemos   êssemos   íssemos   imos   armos   ermos   irmos   eu   iu   ou   ira   iras

If the last step to be obeyed — either step 1 or 2 — altered the word, do step 3
Step 3
Delete suffix i if in RV and preceded by c

Alternatively, if neither steps 1 nor 2 altered the word, do step 4

Step 4: Residual suffix

If the word ends with one of the suffixes
os   a   i   o   á   í   ó

in RV, delete it

Always do step 5

Step 5:

If the word ends with one of

e   é   ê

in RV, delete it, and if preceded by gu (or ci) with the u (or i) in RV, delete the u (or i).

Or if the word ends ç remove the cedilla

And finally:

Turn a~, o~ back into ã, õ

The same algorithm in Snowball

routines (
           prelude postlude mark_regions
           RV R1 R2
           standard_suffix
           verb_suffix
           residual_suffix
           residual_form
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( v )

stringescapes {}

/* special characters */

stringdef a'   '{U+00E1}'  // a-acute
stringdef a^   '{U+00E2}'  // a-circumflex e.g. 'bota^nico
stringdef e'   '{U+00E9}'  // e-acute
stringdef e^   '{U+00EA}'  // e-circumflex
stringdef i'   '{U+00ED}'  // i-acute
stringdef o^   '{U+00F4}'  // o-circumflex
stringdef o'   '{U+00F3}'  // o-acute
stringdef u'   '{U+00FA}'  // u-acute
stringdef cc   '{U+00E7}'  // c-cedilla

stringdef a~   '{U+00E3}'  // a-tilde
stringdef o~   '{U+00F5}'  // o-tilde


define v 'aeiou{a'}{e'}{i'}{o'}{u'}{a^}{e^}{o^}'

define prelude as repeat (
    [substring] among(
        '{a~}' (<- 'a~')
        '{o~}' (<- 'o~')
        ''     (next)
    ) //or next
)

define mark_regions as (

    $pV = limit
    $p1 = limit
    $p2 = limit  // defaults

    do (
        ( v (non-v gopast v) or (v gopast non-v) )
        or
        ( non-v (non-v gopast v) or (v next) )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (
    [substring] among(
        'a~' (<- '{a~}')
        'o~' (<- '{o~}')
        ''   (next)
    ) //or next
)

backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define standard_suffix as (
        [substring] among(

            'eza' 'ezas'
            'ico' 'ica' 'icos' 'icas'
            'ismo' 'ismos'
            '{a'}vel'
            '{i'}vel'
            'ista' 'istas'
            'oso' 'osa' 'osos' 'osas'
            'amento' 'amentos'
            'imento' 'imentos'

           'adora' 'ador' 'a{cc}a~o'
           'adoras' 'adores' 'a{cc}o~es'  // no -ic test
           'ante' 'antes' '{a^}ncia' // Note 1
            (
                R2 delete
            )
            'logia'
            'logias'
            (
                R2 <- 'log'
            )
            'u{cc}a~o' 'u{cc}o~es'
            (
                R2 <- 'u'
            )
            '{e^}ncia' '{e^}ncias'
            (
                R2 <- 'ente'
            )
            'amente'
            (
                R1 delete
                try (
                    [substring] R2 delete among(
                        'iv' (['at'] R2 delete)
                        'os'
                        'ic'
                        'ad'
                    )
                )
            )
            'mente'
            (
                R2 delete
                try (
                    [substring] among(
                        'ante' // Note 1
                        'avel'
                        '{i'}vel' (R2 delete)
                    )
                )
            )
            'idade'
            'idades'
            (
                R2 delete
                try (
                    [substring] among(
                        'abil'
                        'ic'
                        'iv'   (R2 delete)
                    )
                )
            )
            'iva' 'ivo'
            'ivas' 'ivos'
            (
                R2 delete
                try (
                    ['at'] R2 delete // but not a further   ['ic'] R2 delete
                )
            )
            'ira' 'iras'
            (
                RV 'e'  // -eira -eiras usually non-verbal
                <- 'ir'
            )
        )
    )

    define verb_suffix as setlimit tomark pV for (
        [substring] among(
            'ada' 'ida' 'ia' 'aria' 'eria' 'iria' 'ar{a'}' 'ara' 'er{a'}'
            'era' 'ir{a'}' 'ava' 'asse' 'esse' 'isse' 'aste' 'este' 'iste'
            'ei' 'arei' 'erei' 'irei' 'am' 'iam' 'ariam' 'eriam' 'iriam'
            'aram' 'eram' 'iram' 'avam' 'em' 'arem' 'erem' 'irem' 'assem'
            'essem' 'issem' 'ado' 'ido' 'ando' 'endo' 'indo' 'ara~o'
            'era~o' 'ira~o' 'ar' 'er' 'ir' 'as' 'adas' 'idas' 'ias'
            'arias' 'erias' 'irias' 'ar{a'}s' 'aras' 'er{a'}s' 'eras'
            'ir{a'}s' 'avas' 'es' 'ardes' 'erdes' 'irdes' 'ares' 'eres'
            'ires' 'asses' 'esses' 'isses' 'astes' 'estes' 'istes' 'is'
            'ais' 'eis' '{i'}eis' 'ar{i'}eis' 'er{i'}eis' 'ir{i'}eis'
            '{a'}reis' 'areis' '{e'}reis' 'ereis' '{i'}reis' 'ireis'
            '{a'}sseis' '{e'}sseis' '{i'}sseis' '{a'}veis' 'ados' 'idos'
            '{a'}mos' 'amos' '{i'}amos' 'ar{i'}amos' 'er{i'}amos'
            'ir{i'}amos' '{a'}ramos' '{e'}ramos' '{i'}ramos' '{a'}vamos'
            'emos' 'aremos' 'eremos' 'iremos' '{a'}ssemos' '{e^}ssemos'
            '{i'}ssemos' 'imos' 'armos' 'ermos' 'irmos' 'eu' 'iu' 'ou'

            'ira' 'iras'
                (delete)
        )
    )

    define residual_suffix as (
        [substring] among(
            'os'
            'a' 'i' 'o' '{a'}' '{i'}' '{o'}'
                ( RV delete )
        )
    )

    define residual_form as (
        [substring] among(
            'e' '{e'}' '{e^}'
                ( RV delete [('u'] test 'g') or
                             ('i'] test 'c') RV delete )
            '{cc}' (<-'c')
        )
    )
)

define stem as (
    do prelude
    do mark_regions
    backwards (
        do (
            ( ( standard_suffix or verb_suffix )
              and do ( ['i'] test 'c' RV delete )
            )
            or residual_suffix
        )
        do residual_form
    )
    do postlude
)

/*
    Note 1: additions of 15 Jun 2005
*/