An Object Pascal codegenerator for Snowball

Here is the original correspondence,

From: Wout van Wezel <wout@vanwezel.com>
To: martin.porter@grapeshot.co.uk
Date: Mon Jan 24 21:44:23 2005
Subject: Snowball extension

Dear Mr. Porter,

Somebody extended Snowball for me to create Pascal stemmers. I wanted the
stemming algorithms in Pascal so they can be compiled in my information
retrieval system (http://www.collectionconnection.nl) which is created in
Delphi. I would not mind sharing the extensions, especially given the nature
of the Snowball project. If you are interested, please let me know, and I
will send you the sources. Everything seems to be working fine, but
unfortunately I won't be able to maintain the Pascal-extension code when the
core Snowball program changes since my knowledge of C is approximately zero.
Here are the changes the developer made:

"In order to support Delphi files generation I've added file
generator_delphi.c (in compiler directory). Also modified:
1) header.h: added output_delphi and make_delphi fields to struct options.
Also added forward declarations to Delphi's generator functions;
2) driver.c: modified in order to support new command line option "-d" and
call Delphi generator if it's specified;

Modified GNUmakefile at the root of the snowball tree in order to add
generator_delphi.c into the compilation process.

Folder Delphi added. It contains 3 files:
1) SnowballProgram.pas - base class for all generated stemmers;
2) Test.bpr - template for the sample projects;
3) Generate.pl - Perl script that generates all sample stemmers.

File algorithms/finnish/stem.sbl modified. There were two groupings: v and V. I
renamed V to V2 because Delphi Language is case-insensitive."

I apologize for emailing you directly, but offering the source code on the
mailing list could result in parallel versions instead of a single CVS
version and I didn't think that would be good either.

Best regards,
Wout van Wezel

From: martin.porter@grapeshot.co.uk (Martin Porter)
To: Wout van Wezel <wout@vanwezel.com>
Cc: richard@tartarus.org
Date: Mon Jan 24 21:54:15 2005
Subject: Re: Snowball extension

Wout,

Thank you for this email. I am very busy with a number of other things at the
moment, but hope to send you a sensible reply in a few days. Meanwhile I'm
copying your email to Richard Boulton, who is equally involved with the
Snowball work.

Martin

From: martin.porter@grapeshot.co.uk (Martin Porter)
To: Wout van Wezel <wout@vanwezel.com>
Cc: richard@tartarus.org
Date: Tue Jan 25 08:49:07 2005
Subject: Re: Snowball extension

Wout,

I think it would be a shame to lose the work you have done, but at the moment
I'm not sure about the best way to make it publicly available from the Snowball
site. I will talk the matter over with Richard Boulton. Could I suggest that,
despite your misgivings, it should be announced on Snowball discuss? We then
get a record on the site that Pascal versions of the stemmers do exist.

Maintenance is of course the main issue here. On
http://tartarus.org/~martin/PorterStemmer/ I have about 14 versions of the
Porter stemmer in various languages, only four of which I wrote myself.
Inevitably I get queries about versions of the stemmer written in programming
languages I am not familiar with, and they are difficult to answer, especially
when contact with the author has been lost.

We would need to assess the code. For example,

>File algorithms/finnish/stem.sbl modified. There were two groupings: v and V. I
>renamed V to V2 because Delphi Language is case-insensitive."

This is not a good solution, since it constrains the writing of Snowball
scripts. The name translation should reflect case. There are various ways of
doing this, for example,

stemmer -> stemmer_lllllll
Stemmer -> stemmer_ullllll
STEMMER -> stemmer_uuuuuuu

where the pattern of u's and l's shows the upper/lower case usage of the
original name.

Martin

From: Wout van Wezel <wout@vanwezel.com>
To: Martin Porter <martin.porter@grapeshot.co.uk>
Cc: richard@tartarus.org
Date: Tue Jan 25 10:44:36 2005
Subject: Re: Snowball extension

Dear Martin/Richard,

I've attached the files I got from the developer. The changed sources are in
the Snowball tree. The developer has explained in 'readme.doc' what he did.
Be aware that I don't have an understanding of the Snowball or C language
myself. If you think it would be useful for a more general audience, I can
ask the developer to work on the 'case' problem.

ps, I don't mind a message on the discussion list. Also, I would be happy to
send the Pascal stemmers or the adapted Snowball program to people from the
list that think they could use it.

Best regards,
Wout

Attachment: stemming.zip

From: martin.porter@grapeshot.co.uk (Martin Porter)
To: Wout van Wezel <wout@vanwezel.com>
Cc: richard@tartarus.org
Date: Wed Apr 20 21:37:42 2005
Subject: Pascal codegenerator for Snowball

Wout,

I have only recently looked through the large tgz file you sent me in
January. Here are some initial thoughts,

The work of your student was done to a high standard, and the delivered system,
with software extensions and documentation, is very nice. The only real slip
is that the upper/lower case distinction of Snowball names was not preserved
in the generated Pascal code. But I have altered the Finnish stemmer so that V
and v are called V2 and V1 now, so you won't have to create a separate version.

I have not yet talked to Richard Boulton about this, but I think we should keep
the tgz file on the Snowball site with a note about its purpose, but I don't
think it is worth incorporating the Pascal codegenerator into the main Snowball
system, since it is unlikely there would be a large demand for Pascal versions
of the stemmers.

There is of course a danger that as Snowball extends, use of the Pascal
codegenerator will become increasingly difficult, but I am beginning to think
that Snowball is now fairly fixed, so that should not be a problem.

Martin
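
The case-preserving name translation suggested in the reply of 25 January
(stemmer -> stemmer_lllllll, and so on) can be illustrated with a short
sketch. This is not code from the Pascal codegenerator or from the Snowball
compiler. The function name encode_case is hypothetical, Python is used only
for brevity, and the treatment of digits and underscores as 'l' is an
assumption, since the correspondence shows letters only.

```python
def encode_case(name: str) -> str:
    """Translate a case-sensitive name into one that stays unique
    in a case-insensitive language such as Delphi/Object Pascal.

    Hypothetical helper: the scheme follows the examples in the
    correspondence; it is not taken from any Snowball source file.
    """
    # One pattern character per character of the original name:
    # 'u' marks an upper-case character, 'l' marks anything else
    # (lower-case letters and, by assumption, digits and underscores).
    pattern = "".join("u" if ch.isupper() else "l" for ch in name)
    # Lower-cased name, an underscore, then the case pattern.
    return f"{name.lower()}_{pattern}"


if __name__ == "__main__":
    for name in ("stemmer", "Stemmer", "STEMMER", "v", "V"):
        print(name, "->", encode_case(name))
```

Under this sketch the two Finnish groupings v and V would become v_l and v_u,
which remain distinct when case is ignored, so the Snowball script itself
would not need to be changed.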