Snowball is a small string processing language for creating
stemming algorithms for use in Information Retrieval, plus a collection of
stemming algorithms implemented using it.
It was originally designed and built by Martin Porter. Martin retired from
development in 2014 and Snowball is now maintained as a community project.
Martin originally chose the name Snowball as a tribute to SNOBOL, the
excellent string handling language from the 1960s. It now also serves as a
metaphor for how the project grows by gathering contributions over time.
The Snowball compiler translates a Snowball program into source code in another
language. The currently supported target languages are Ada, ISO C, C#, Go, Java,
JavaScript, Object Pascal, Python and Rust.
What is Stemming?
Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps connection, connections,
connective, connected, and connecting to connect.
So searching for connected would also find documents which only
have the other forms.
This stem form is often a word itself, but not always: text search systems,
which are the intended field of use, do not require it. We also aim to
conflate words with the same meaning, rather than all words with a common
linguistic root (so awe and awful don't have the same stem). Over-stemming is
more problematic than under-stemming, so we tend not to stem in cases that are
hard to resolve. If you want to always reduce words to a root form and/or get
a root form which is itself a word, then Snowball's stemming algorithms likely
aren't the right answer.
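To make the search behaviour described above concrete, here is a minimal sketch in Python. It does not use Snowball's English stemmer: toy_stem and its suffix list are hypothetical stand-ins invented for this illustration, chosen only so that the five example words conflate to connect. The point is the indexing pattern, in which documents and queries are stemmed the same way, rather than the stemming rules themselves.

```python
from collections import defaultdict

# Hypothetical suffix list, for illustration only. The real Snowball English
# stemmer uses a far more careful set of rules than plain suffix stripping.
TOY_SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Map each stem to the ids of the documents containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the indexed text, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


documents = {
    1: "a connection was made",
    2: "the connective tissue",
    3: "still connecting",
    4: "nothing relevant here",
}
index = build_index(documents)

# None of the documents contains the literal word "connected".
print(search(index, "connected"))  # prints [1, 2, 3]
```

Note that the toy rules also turn nothing into noth, which is not a word. That is harmless here for the same reason given above: the stem is only used as an index key, so it does not need to be a word itself.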
Please address all Snowball-related mail to the snowball-discuss mailing list.
Any such mail sent directly to individual developers may be answered less
speedily, and in any case they reserve the right to post their answers on snowball-discuss.
Major events
- Sep 2023 - Estonian stemming algorithm contributed by Linda Freienthal.
- Nov 2021 - Snowball 2.2.0 released!
- Jan 2021 - Snowball 2.1.0 released.
- Jan 2021 - Armenian stemmer from Astghik Mkrtchyan merged into the distribution.
- Jan 2021 - Ada backend contributed by Stephane Carrez.
- Nov 2020 - Yiddish stemming algorithm contributed by Assaf Urieli.
- Oct 2019 - Serbian stemming algorithm contributed by Stefan Petkovic and Dragan Ivanovic.
- Oct 2019 - Snowball 2.0.0 released.
- Aug 2019 - Hindi stemming algorithm contributed by Olly Betts.
- Aug 2019 - Basque and Catalan merged into the distribution.
- Oct 2018 - Greek stemming algorithm contributed by Oleg Smirnov.
- Jun 2018 - Object Pascal backend from Wout van Wezel merged.
- May 2018 - Lithuanian stemming algorithm contributed by Dainius Jocas.
- May 2018 - Indonesian stemming algorithm contributed by Olly Betts.
- Mar 2018 - C# backend contributed by Cesar Souza.
- Mar 2018 - JavaScript backend merged.
- Jun 2017 - Go backend contributed by Marty Schoch.
- Mar 2017 - Rust backend contributed by Jakob Demler.
- Jan 2016 - Arabic stemming algorithm contributed by Assem Chelli.
- Oct 2015 - Tamil stemming algorithm contributed by Damodharan Rajalingam.
- Sep 2015 - New home for Snowball on snowballstem.org.
- Sep 2014 - Martin Porter retires from Snowball development.
- May 2012 - Contributed stemmers for Irish and Czech.
- Jul 2010 - Contributed stemmers for Armenian, Basque, Catalan.
- Mar 2007 - Romanian stemmer.
- Jan 2007 - Turkish stemmer. Contributed by Evren (Kapusuz) Cilden.
- Sep 2006 - Hungarian stemmer. Contributed by Anna Tordai.
- Jun 2006 - Supported and updated Python bindings.
- May 2005 - UTF-8 Unicode support.
- Sep 2002 - Finnish stemmer.
- Jul 2002 - ISO Latin I as default.
  The use of MS DOS Latin I is now history, but the old versions of the
  Snowball stemmers are still accessible on the site.
- May 2002 - Unicode support.
- Feb 2002 - Java support.
  Richard has modified the Snowball code generator to produce Java output as
  well as ANSI C output. This means that pure Java systems can now use the
  Snowball stemmers.