The Schinke Latin stemming algorithm

Links to resources

(A note by Martin Porter.)

The Schinke Latin stemming algorithm is described in,

Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52: 172-187.

Its distinctive feature is that it stems each word to two forms, a noun-based form and a verb-based form. For example,

                NOUN        VERB
                ----        ----
    aquila      aquil       aquila
    portat      portat      porta
    portis      port        por

Here (slightly reformatted) are the rules of the stemmer,

1. (start)

2.  Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u',
    respectively.

3.  If the word ends in '-que' then
        if the word is on the list shown in Figure 4, then
            write the original word to both the noun-based and verb-based
            stem dictionaries and go to 8.
        else remove '-que'

    [Figure 4 was

        atque quoque neque itaque absque apsque abusque adaeque adusque denique
        deque susque oblique peraeque plenisque quandoque quisque quaeque
        cuiusque cuique quemque quamque quaque quique quorumque quarumque
        quibusque quosque quasque quotusquisque quousque ubique undique usque
        uterque utique utroque utribique torque coque concoque contorque
        detorque decoque excoque extorque obtorque optorque retorque recoque
        attorque incoque intorque praetorque]

4.  Match the end of the word against the suffix list shown in Figure 6(a),
    removing the longest matching suffix (if any).

    [Figure 6(a) was

        -ibus -ius  -ae   -am   -as   -em   -es   -ia
        -is   -nt   -os   -ud   -um   -us   -a    -e
        -i    -o    -u]

5.  If the resulting stem contains at least two characters then write this stem
    to the noun-based stem dictionary.

6.  Match the end of the word against the suffix list shown in Figure 6(b),
    identifying the longest matching suffix (if any).

    [Figure 6(b) was

        -iuntur -beris  -erunt  -untur  -iunt   -mini   -ntur   -stis
        -bor    -ero    -mur    -mus    -ris    -sti    -tis    -tur
        -unt    -bo     -ns     -nt     -ri     -m      -r      -s
        -t]

    If any of the following suffixes are found then convert them as shown:

        '-iuntur', '-erunt', '-untur', '-iunt', and '-unt', to '-i';
        '-beris', '-bor', and '-bo' to '-bi';
        '-ero' to '-eri'

    else remove the suffix in the normal way.

7.  If the resulting stem contains at least two characters then write this stem
    to the verb-based stem dictionary.

8.  (end)
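
As a cross-check, rules 2 to 7 can be transcribed fairly literally into Python. This is only a sketch of the rules as listed above, not the authors' code: the names (schinke_stem, longest_suffix and the constants) are invented for illustration, lowercasing the input is an added assumption, and a stem shorter than two characters is returned as None to stand for "nothing written to the dictionary".

```python
# Figure 4: words ending in -que that are kept whole.
QUE_WORDS = frozenset("""
    atque quoque neque itaque absque apsque abusque adaeque adusque denique
    deque susque oblique peraeque plenisque quandoque quisque quaeque
    cuiusque cuique quemque quamque quaque quique quorumque quarumque
    quibusque quosque quasque quotusquisque quousque ubique undique usque
    uterque utique utroque utribique torque coque concoque contorque
    detorque decoque excoque extorque obtorque optorque retorque recoque
    attorque incoque intorque praetorque
""".split())

# Figure 6(a): noun suffixes, all simply removed.
NOUN_SUFFIXES = ("ibus", "ius", "ae", "am", "as", "em", "es", "ia", "is",
                 "nt", "os", "ud", "um", "us", "a", "e", "i", "o", "u")

# Figure 6(b): verb suffixes that are converted rather than removed.
VERB_CONVERSIONS = {
    "iuntur": "i", "erunt": "i", "untur": "i", "iunt": "i", "unt": "i",
    "beris": "bi", "bor": "bi", "bo": "bi",
    "ero": "eri",
}

# Figure 6(b): verb suffixes that are removed in the normal way.
VERB_REMOVALS = ("mini", "ntur", "stis", "mur", "mus", "ris", "sti", "tis",
                 "tur", "ns", "nt", "ri", "m", "r", "s", "t")

VERB_SUFFIXES = tuple(VERB_CONVERSIONS) + VERB_REMOVALS


def longest_suffix(word, suffixes):
    """Return the longest suffix in `suffixes` that ends `word`, or ''."""
    return max((s for s in suffixes if word.endswith(s)), key=len, default="")


def schinke_stem(word):
    """Return (noun_stem, verb_stem); a stem is None if under 2 characters."""
    # Rule 2 (lowercasing is an assumption, not part of the published rules).
    word = word.lower().replace("j", "i").replace("v", "u")

    # Rule 3: keep listed -que words whole, otherwise strip the enclitic.
    if word.endswith("que"):
        if word in QUE_WORDS:
            return word, word
        word = word[:-3]

    # Rules 4 and 5: noun-based stem.
    suffix = longest_suffix(word, NOUN_SUFFIXES)
    noun = word[:len(word) - len(suffix)]

    # Rules 6 and 7: verb-based stem, starting again from the same word.
    suffix = longest_suffix(word, VERB_SUFFIXES)
    verb = word[:len(word) - len(suffix)] + VERB_CONVERSIONS.get(suffix, "")

    return (noun if len(noun) >= 2 else None,
            verb if len(verb) >= 2 else None)
```

Calling schinke_stem on 'aquila', 'portat' and 'portis' gives the three rows of the table at the top of this note. Note that rule 6 is applied to the word as it stood after rule 3, not to the noun stem produced by rule 4.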

Unfortunately I was not able to make the rules match the examples given, which led to the following email correspondence,

From: Martin Porter
To: Peter Willett
Date: Mon Sep 10 15:11:51 2001
Subject: Re: Stemming algorithms

> ... I'm no longer working in the IR area,
>spending all of my time on computational chemistry/drug discovery
>research but I guess that Mark Sanderson would be interested in
>Snowball - do you mind if I pass your email onto him?

Peter,

Well, actually, I do have a question, if you can cast your mind back. I've
implemented the Latin Stemmer in Snowball (see below: you'll have to guess the
semantics, but I'm sure you'll agree the syntax looks nice), and find that Fig
5 of the 1996 Schinke paper doesn't correspond to the algorithm of fig 7, but to
the algorithm with the extra rules concerning -ba-, -bi-, -sse- mentioned on
page 182. Which is the "correct" algorithm - with or without those rules? If
with, what is the exact criterion for their removal? A bigger problem is why
the -nt is not removed from 'Apparebunt', given -nt as an ending in 6(a). Is
-nt a misprint?

Sorry to bother you with this, but the paper says you are the one "to whom all
correspondence should be addressed" :-)

Martin


 Here is your algorithm in Snowball. The generated code will do about 1 million
 Latin word in 5 seconds:

 -------

strings ( noun_form  verb_form )

routines (

   map_letters
   que_word
)

externals ( stem )

define map_letters as (

    do repeat ( goto ( ['j'] ) <- 'i' )
    do repeat ( goto ( ['v'] ) <- 'u' )
)

backwardmode (

    define que_word as (

        ['que'] (
            among (
                'at' 'quo' 'ne' 'ita' 'abs' 'aps' 'abus' 'adae' 'adus'
                'deni' 'de' 'sus' 'obli' 'perae' 'plenis' 'quando' 'quis'
                'quae' 'cuius' 'cui' 'quem' 'quam' 'qua' 'qui'
                'quorum' 'quarum' 'quibus' 'quos' 'quas' 'quotusquis'
                'quous' 'ubi' 'undi' 'us' 'uter' 'uti' 'utro' 'utribi'
                'tor' 'co' 'conco' 'contor' 'detor' 'deco' 'exco' 'extor'
                'obtor' 'optor' 'retor' 'reco' 'attor' 'inco' 'intor'
                'praetor'
            ) atlimit ]
            => noun_form
            => verb_form
        ) or fail(delete)
    )
)

define stem as (

    map_letters

    backwards (
        que_word or (
            => noun_form
            => verb_form

            $noun_form backwards try (
                [substring] hop 2
                among (
                    'ibus' 'ius' 'ae' 'am' 'as' 'em' 'es' 'ia' 'is' 'nt'
                    'os' 'ud' 'um' 'us' 'a' 'e' 'i' 'o' 'u'
                        (delete)
                )
            )

            $verb_form backwards try (
                [substring] hop 2
                among (
                    'iuntur' 'erunt' 'untur' 'iunt' 'unt'
                         (<-'i')
                    'beris' 'bor' 'bo'
                         (<-'bi')
                    'ero'
                         (<-'eri')
                    'mini' 'ntur' 'stis' 'mur' 'mus' 'ris' 'sti' 'tis'
                    'tur' 'ns' 'nt' 'ri' 'm' 'r' 's' 't'
                         (delete)
                )
            )
        )
    )

    /* the stemmed words are left in noun-form and verb-form, and can
       be picked up as C strings at z->S[0] and z->S[1] through the API. */
)

From: Peter Willett
To: Martin Porter
Date: Mon Sep 10 20:25:24 2001
Subject: Re: Stemming algorithms

Martin

Sorry - I just cannot answer.  Robertson has retired to Dorset while
Schinke is now in - I think - Canada

Peter

Following this, I was unable to contact Schinke, and so the problems have remained unresolved.

The linked zip file contains the stemmer, the generated C version, and sample data. (The stemmer differs slightly from the version in the email above in that it assembles the noun- and verb-forms of the stem in a single string, separated by a space.) voc.txt is a sample vocabulary, and joined.txt is the vocabulary joined with the two stemmed forms as three-column output.
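
For anyone post-processing that output, here is a minimal Python sketch of how the space-separated stem pair and a three-column line might be pulled apart. The function names are hypothetical, and the sketch assumes that columns are separated by whitespace and that every line carries exactly three fields; the exact layout of joined.txt is not specified here, so check it against the file itself.

```python
def split_forms(combined):
    """Split a 'noun verb' string into its (noun_stem, verb_stem) pair."""
    noun, _, verb = combined.partition(" ")
    return noun, verb


def parse_joined_line(line):
    """Split one three-column line into (word, noun_stem, verb_stem).

    Assumes whitespace-separated columns; raises ValueError otherwise.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
    return tuple(fields)
```

With these, split_forms("port por") yields the noun and verb stems of 'portis' from the example table.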