Snowball

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

It was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language - currently ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
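To make the search benefit concrete, here is a minimal Python sketch of a stem-keyed index. The toy_stem function and its suffix list are hypothetical, invented only for this illustration: they are not the Snowball English stemmer, and the names build_index and search are likewise assumptions. The sketch only shows how conflating word forms to one stem lets a query match documents that contain other forms.

```python
from collections import defaultdict

# Toy suffix list, longest first. This is NOT the real English
# stemmer's rule set; it exists only to illustrate conflation.
TOY_SUFFIXES = ("ions", "ion", "ive", "ing", "ed")


def toy_stem(word):
    """Strip the first matching toy suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "a stable connection",
    2: "connecting two networks",
    3: "connective tissue",
    4: "unrelated text",
}
index = build_index(docs)
print(search(index, "connected"))  # prints [1, 2, 3]
```

None of the documents contains the word connected itself, yet the query finds the first three because both the query and the indexed words are reduced to the same stem before comparison.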

This stem form is often a word itself, but not always, since text search systems (the intended field of use) do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.

Please address all Snowball-related mail to the snowball-discuss mailing list.

Any such mail sent directly to individual developers may be answered less speedily, and in any case they reserve the right to post their answers on snowball-discuss.

Major events

  • Nov 2020 - Yiddish stemming algorithm contributed by Assaf Urieli.
  • Oct 2019 - Serbian stemming algorithm contributed by Stefan Petkovic and Dragan Ivanovic.
  • Oct 2019 - Snowball 2.0.0 released!
  • Aug 2019 - Hindi stemming algorithm contributed by Olly Betts.
  • Aug 2019 - Basque and Catalan merged into the distribution.
  • Oct 2018 - Greek stemming algorithm contributed by Oleg Smirnov.
  • Jun 2018 - Object Pascal backend from Wout van Wezel merged.
  • May 2018 - Lithuanian stemming algorithm contributed by Dainius Jocas.
  • May 2018 - Indonesian stemming algorithm contributed by Olly Betts.
  • Mar 2018 - C# backend contributed by Cesar Souza.
  • Mar 2018 - JavaScript backend merged.
  • Jun 2017 - Go backend contributed by Marty Schoch.
  • Mar 2017 - Rust backend contributed by Jakob Demler.
  • Jan 2016 - Arabic stemming algorithm contributed by Assem Chelli.
  • Oct 2015 - Tamil stemming algorithm contributed by Damodharan Rajalingam.
  • Sep 2015 - New home for snowball on snowballstem.org.
  • Sep 2014 - Martin Porter retires from snowball development.
  • May 2012 - Contributed stemmers for Irish and Czech.
  • Jul 2010 - Contributed stemmers for Armenian, Basque, Catalan.
  • Mar 2007 - Romanian stemmer.
  • Jan 2007 - Turkish stemmer. Contributed by Evren (Kapusuz) Cilden.
  • Sep 2006 - Hungarian stemmer. Contributed by Anna Tordai.
  • Jun 2006 - Supported and updated Python bindings.
  • May 2005 - UTF-8 Unicode support.
  • Sep 2002 - Finnish stemmer.
  • Jul 2002 - ISO Latin I as default. The use of MS DOS Latin I is now history, but the old versions of the Snowball stemmers are still accessible on the site.
  • May 2002 - Unicode support
  • Feb 2002 - Java support. Richard has modified the snowball code generator to produce Java output as well as ANSI C output. This means that pure Java systems can now use the snowball stemmers.