Sesotho stemming algorithm

Links to resources

Introduction

The Sesotho (Southern Sotho) stemming algorithm was developed by Kamohelo Lebjane. Sesotho (ISO 639-1: st) is a Southern Bantu language spoken by approximately 5.6 million people in South Africa and Lesotho.

Like other Bantu languages, Sesotho is agglutinative - words are formed by attaching prefixes and suffixes to root morphemes. A single root can appear with many different affixes, making stemming important for information retrieval.

Stemming Approach

The algorithm removes noun class prefixes and verb/nominal suffixes to extract root forms. Sesotho uses a noun class system where prefixes indicate grammatical categories (e.g., mo-/ba- for people, le-/ma- for certain objects, se-/li- for things). Verb suffixes encode tense, aspect, and voice (e.g., -ile for perfect tense, -isa for causative, -wa for passive).

The stemmer processes suffixes before prefixes and ensures a minimum stem length is preserved to avoid over-stemming.

The algorithm in Snowball

/* stringescapes UTF-8 */

/*
  Sesotho stemmer for the Snowball project
  ----------------------------------------
  Author: Kamohelo Lebjane
  Purpose: To reduce Sesotho words to their morphological stems.

  Language notes:
  Sesotho (Southern Sotho) is an agglutinative Bantu language.
  Words often contain prefixes for noun classes and suffixes
  for tense, aspect, or derivation.

  An agglutinative language is a type of language that primarily forms words
  by stringing together morphemes (word parts)—each typically representing a single
  grammatical meaning—without significant modification to their forms (agglutinations).
  In such languages, affixes (prefixes, suffixes, infixes, or circumfixes)
  are added to a root word in a linear and systematic way, creating complex
  words that encode detailed grammatical information.

  Examples:
    baruti      -> rut (root)
    moruti      -> rut (root)
    rutile      -> rut (root)

  The rules below remove common noun class prefixes
  and common verb suffixes, keeping the main root form.
*/

/* --- Routine declarations --- */
routines (
  mark_regions
  remove_noun_prefixes
  remove_verb_suffixes
  remove_nominal_suffixes
)

/* --- External declarations --- */
externals ( stem )

/* --- Groupings --- */
groupings ( v )

/* --- Character sets --- */
define v 'aeiou'

/* --- Integer for tracking position --- */
integers ( pV )

/* --- Mark vowel region --- */
define mark_regions as (
  // Set pV after the first vowel and at least 2 characters into the string.
  // Signals f if the string doesn't contain a vowel or is shorter than 2
  // characters.
  test (gopast v  setmark pV)
  test (hop 2 do ($(cursor > pV) setmark pV))
)

/* --- Remove noun class prefixes --- */
define remove_noun_prefixes as (
  [substring] among(
    'mo'   'ba'   'me'   'le'
    'ma'   'se'   'boi'   'li'
  )
  /* Require at least two characters remain */
  test (next not atlimit)
  /* Only delete if there's a vowel after the cursor position */
  gopast v delete
)

backwardmode (

  /* --- Remove verb suffixes (from end of word) --- */
  define remove_verb_suffixes as (
    setlimit tomark pV for (
      [substring] among(
        'ile'   /* perfect tense */
        'isa'   /* causative */
        'etse'  /* applicative */
        'ela'   /* applicative */
        'ang'   /* plural imperative */
        'ong'    /* continuous/derived form */
        'eng'
        'wa'    /* passive */
        'a'     /* infinitive marker */
        (delete)
      )
    )
  )

  /* --- Remove nominal suffixes --- */
  define remove_nominal_suffixes as (
    setlimit tomark pV for (
      [substring] among(
        'nyana'   /* diminutive form */
        'ana'     /* diminutive form */
        'ano'
        'oa'
        'i'
        (delete)
      )
    )
  )

)

/* --- MAIN STEMMER --- */
define stem as (
  mark_regions // Signals f if the string is too short to stem.
  backwards (
    do remove_nominal_suffixes
    do remove_verb_suffixes
  )
  do remove_noun_prefixes
)