The apostrophe character

Representing apostrophe is problematical for various reasons:

  1. There are two Unicode characters for apostrophe, U+0027 (also ASCII hex 27), and U+2019. Compare:

            Hamlet's father's ghost (U+0027)
            Hamlet’s father’s ghost (U+2019)

  2. Although conceptually different from an apostrophe, a single closing quote is also represented by character U+2019.

  3. Character U+0027 is used not only for apostrophe but also for single closing quote and for single opening quote (whose own character is U+2018).

  4. A fourth character, U+201B, like U+2018 but with the tail ‘rising’ instead of ‘descending’, is also sometimes used as apostrophe (in the house style of certain publishers, for surnames like M’Coy and so on).

In the English stemming algorithm, it is assumed that apostrophe is represented by U+0027. This makes it ASCII compatible. Clearly other codes for apostrophe can be mapped to this code prior to stemming.
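The mapping mentioned above might be sketched as follows in Python. The helper name normalize_apostrophes and the choice of which characters to map are assumptions made for illustration, not part of the stemmer itself. In particular, mapping U+2018 as well is a judgment call, based on U+0027 also doing duty as a single opening quote.

```python
# Hypothetical pre-processing helper, not part of the stemmer.
# Characters treated as apostrophe-like here (an assumption):
#   U+2018 single opening quote
#   U+2019 single closing quote / typographic apostrophe
#   U+201B opening quote with rising tail
APOSTROPHE_LIKE = "\u2018\u2019\u201B"

# Translation table mapping each of these characters to U+0027.
_TO_ASCII_APOSTROPHE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def normalize_apostrophes(text: str) -> str:
    """Map apostrophe-like characters to U+0027 prior to stemming."""
    return text.translate(_TO_ASCII_APOSTROPHE)
```

With this sketch, normalize_apostrophes("Hamlet\u2019s father\u2019s ghost") yields the U+0027 form of the phrase shown in the comparison above.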

In English orthography, apostrophe has one of three functions.

  1. It indicates a contraction in what is now accepted as a single word: o’clock, O’Reilly, M’Coy. Except in proper names such forms are rare: the apostrophe in Hallowe’en is disappearing, and in ’bus has disappeared.

  2. It indicates a standard contraction with auxiliary or modal verbs: you’re, isn’t, we’d. There are about forty of these forms in contemporary English, and their use is increasing as they displace the full forms that were at one time used in formal documents. Although they can be reduced to word pairs, it is more convenient to treat them as single items (usually stopwords) in IR work. And then preserving the apostrophe is important, so that he’ll, she’ll, we’ll are not equated with hell, shell, well etc.

  3. It is used to form the ‘English genitive’: John’s book, the horses’ hooves etc. This is a development of (1), where historically the apostrophe stood for an elided e. (Similarly the printed form ’d for ed was very common before the nineteenth century.) Although in decline (witness pigs trotters, Girls School Trust), its use continues in contemporary English, where it is fiercely promoted as correct grammar, despite (or it might be closer to the truth to say because of) its complete semantic redundancy.

For these reasons, the English stemmer treats apostrophe as a letter, but removes it from the beginning of a word, where it might have stood for an opening quote, and from the end of a word, where it might have stood for a closing quote or been an apostrophe following s. The form ’s is also treated as an ending.
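The treatment just described can be illustrated with a minimal Python sketch. It assumes that apostrophes have already been mapped to U+0027, and the function name strip_apostrophes is hypothetical. Including the combined ending 's' (a genitive followed by a closing quote) and trying the longest ending first are assumptions of this sketch rather than something stated in the text above. An apostrophe inside a word is left alone, since it is treated as a letter.

```python
def strip_apostrophes(word: str) -> str:
    """Illustrative only: remove a leading apostrophe and a trailing
    apostrophe-based ending from a word that uses U+0027."""
    # A leading apostrophe might have stood for an opening quote.
    if word.startswith("'"):
        word = word[1:]
    # Try the longest ending first, so that 's is removed as a whole
    # rather than leaving a stray apostrophe behind.
    # The ending 's' is an assumption of this sketch.
    for suffix in ("'s'", "'s", "'"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    # Word-internal apostrophes (he'll, o'clock) are kept.
    return word
```

Under these assumptions, John's becomes John, horses' becomes horses, and 'quoted' becomes quoted, while he'll is returned unchanged and so is not confused with hell.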