The apostrophe character

Representing apostrophe is problematic for several reasons:

  1. There are two Unicode characters for apostrophe, U+0027 and U+2019. The former is also in both ASCII and ISO-8859-1 (Latin1), whereas the latter is in neither. Compare:

            Hamlet's father's ghost (U+0027)
            Hamlet’s father’s ghost (U+2019)
    
  2. Although conceptually different from an apostrophe, a single closing quote is also represented by character U+2019.

  3. Character U+0027 is used not only for apostrophe but also in place of the single closing quote and the single opening quote (U+2018).

  4. A fourth character, U+201B, like U+2018 but with the tail ‘rising’ instead of ‘descending’, is also sometimes used as apostrophe (in the house style of certain publishers, for surnames like M’Coy and so on).
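
As a quick way to tell the characters in the list above apart, the following minimal Python sketch prints each code point with its name from the Unicode character database. The helper name describe and the tuple CANDIDATES are illustrative only and are not part of Snowball.

```python
import unicodedata

# The four code points discussed in the list above.
CANDIDATES = ("\u0027", "\u2018", "\u2019", "\u201b")


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in CANDIDATES:
    print(describe(ch))
```

Running this on a sample of your own text is an easy way to find out which of these characters your data actually contains.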

Catalan

Some Catalan pronouns can attach before or after a verb which starts/ends in a vowel. The pronoun drops its vowel and an apostrophe is added between the pronoun and verb. These are handled by our Catalan stemmer.

Dutch

The Kraaij-Pohlmann stemmer for Dutch (Kraaij, 1994, 1995) removes hyphen and treats apostrophe as part of the alphabet (so ’s, ’tje and ’je are three of their endings). Kraaij-Pohlmann is the default Dutch stemmer since Snowball 3.0.0.

Dutch porter

The previous default Dutch stemmer was Martin Porter's, which assumes that hyphen and apostrophe have already been removed from the word to be stemmed. We still provide it for users who have data processed with it. Since its purpose is compatibility with existing data, we have not updated its handling of apostrophes.

English

In the English stemming algorithm, it is assumed that apostrophe is represented by U+0027. This makes it ASCII compatible. Clearly other codes for apostrophe can be mapped to this code prior to stemming.
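
A minimal sketch of such a mapping in Python is shown below. Which characters to fold into U+0027 is an assumption made here for illustration (the three typographic characters discussed earlier), and the names APOSTROPHE_VARIANTS and normalize_apostrophes are hypothetical, not part of Snowball.

```python
# Assumption: fold these three typographic characters into ASCII U+0027.
# Adjust the table to suit the data being indexed.
APOSTROPHE_VARIANTS = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201B: "'",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
}


def normalize_apostrophes(text: str) -> str:
    """Return text with the variants above replaced by U+0027."""
    return text.translate(APOSTROPHE_VARIANTS)
```

For example, normalize_apostrophes("Hamlet’s father’s ghost") returns the same phrase written with U+0027, which can then be tokenised and passed to the stemmer.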

In English orthography, apostrophe has one of three functions.

  1. It indicates a contraction in what is now accepted as a single word: o’clock, O’Reilly, M’Coy. Except in proper names such forms are rare: the apostrophe in Hallowe’en is disappearing, and in ’bus has disappeared.

  2. It indicates a standard contraction with auxiliary or modal verbs: you’re, isn’t, we’d. There are about forty of these forms in contemporary English, and their use is increasing as they displace the full forms that were at one time used in formal documents. Although they can be reduced to word pairs, it is more convenient to treat them as single items (usually stopwords) in IR work. Preserving the apostrophe is then important, so that he’ll, she’ll, we’ll, we’d are not equated with hell, shell, well, wed etc.

  3. It is used to form the ‘English genitive’, John's book, the horses’ hooves etc. This is a development of (1), where historically the apostrophe stood for an elided e. (Similarly the printed form ’d for ed was very common before the nineteenth century.) Although in decline (witness pigs trotters, Girls School Trust), its use continues in contemporary English, where it is fiercely promoted as correct grammar, despite (or it might be closer to the truth to say because of) its complete semantic redundancy.

For these reasons, the English stemmer treats apostrophe as if it were a letter, but removes it from the beginning of a word, where it might have stood for an opening quote, and from the end of a word, where it might have stood for a closing quote or been an apostrophe following s. The form ’s is also treated as an ending.
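
The apostrophe handling described above can be sketched in a few lines of Python. This is a simplified illustration only, not the Snowball implementation: it assumes the word already uses U+0027, it treats 's' (the ending 's followed by a trailing apostrophe) as an ending as well, and it ignores every other stemming step. The function name strip_english_apostrophes is hypothetical.

```python
def strip_english_apostrophes(word: str) -> str:
    """Drop a leading apostrophe, then the longest of 's', 's or ' at the end."""
    if word.startswith("'"):
        word = word[1:]
    for ending in ("'s'", "'s", "'"):  # try the longest ending first
        if word.endswith(ending):
            return word[: -len(ending)]
    return word
```

With this sketch, John's becomes John and horses' becomes horses, while he'll is left unchanged because its apostrophe is internal, so it is not conflated with hell.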

Porter

We provide a reference implementation of the original Porter stemmer as described in Martin Porter's 1980 paper. The paper does not include any special handling of apostrophes and, since this is intended as a reference implementation, neither does ours.

Lovins

We provide an implementation of Beth Lovins' very early stemming algorithm. This handles -'s and -s' suffixes.

Esperanto

Inflections of 'sti are expanded into forms of esti. The words l' and un' become la and unu. A final apostrophe becomes after certain known stems, or else o.

French

Since Snowball 3.0.0, the French stemmer removes elisions (e.g. d'-, l'-, m'-, qu'-).
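
As a toy illustration of removing an elided prefix, the Python sketch below strips only the four examples named above. It is not the Snowball French stemmer's code, the real set of elisions may be larger, and the names ELIDED_PREFIXES and strip_french_elision are hypothetical. It assumes the apostrophe is U+0027.

```python
# Only the example prefixes named in the text; the real list may be longer.
ELIDED_PREFIXES = ("qu'", "d'", "l'", "m'")


def strip_french_elision(word: str) -> str:
    """Remove one elided prefix if something remains after it."""
    lowered = word.lower()
    for prefix in ELIDED_PREFIXES:
        if lowered.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word
```

For example, l'homme becomes homme and qu'il becomes il, while maison is returned unchanged.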

Irish

Irish has some contractions which appear as prefixes. Our Irish stemmer handles d'-, m'- and b'-.

Polish

Polish uses an apostrophe to separate loanwords from native suffixes, for example: olly'ego, george'a. The correct use is to mark the elision of the final sound of a loanword before a Polish inflectional ending, but apparently it's also often used with any loanword.

Since Snowball 3.1.0, the Polish stemmer removes an apostrophe if the stem ends with one after it has removed a noun, adjective or verb suffix.
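
The order of operations (remove a suffix first, then drop a trailing apostrophe from what is left) can be sketched in Python as follows. The two-entry suffix list is a made-up placeholder chosen to cover the examples above, not the Polish stemmer's real suffix set, and the names TOY_SUFFIXES and toy_polish_stem are hypothetical.

```python
# Hypothetical, deliberately tiny suffix list, longest first.
TOY_SUFFIXES = ("ego", "a")


def toy_polish_stem(word: str) -> str:
    """Remove one toy suffix, then a trailing apostrophe left on the stem."""
    for suffix in TOY_SUFFIXES:
        if len(word) > len(suffix) and word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only strip the apostrophe once a suffix has been removed.
            return stem[:-1] if stem.endswith("'") else stem
    return word
```

Under these assumptions olly'ego becomes olly and george'a becomes george, while a word with no matching suffix is returned as it is.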

Russian

Our Russian stemmer handles suffixes which include an apostrophe.

Turkish

In modern Turkish orthography, an apostrophe is used to separate proper names from any suffixes - for example Türkiye'dir ("it is Turkey"). Since Snowball 3.0.0, our Turkish stemmer removes such suffixes.
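
A minimal sketch of the idea in Python, assuming the apostrophe is U+0027 and that everything after the first apostrophe is a suffix. The actual Turkish stemmer may apply further conditions, and the function name strip_turkish_proper_noun_suffix is hypothetical.

```python
def strip_turkish_proper_noun_suffix(word: str) -> str:
    """Keep only the part before the first apostrophe, if there is one."""
    head, sep, _suffix = word.partition("'")
    # Leave the word alone if it has no apostrophe or starts with one.
    return head if sep and head else word
```

For example, Türkiye'dir becomes Türkiye, and a word without an apostrophe is returned unchanged.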