Two Romanian stemmers

In 2006 we received, in swift succession, two stemmers for Romanian written in Snowball. Here is the original correspondence:

From: Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
To: snowball-discuss
Date: Wed, 07 Jun 2006 00:06:30 +0200
Subject: [Snowball-discuss] romanian stemmer

Hello everyone,

my name is Erwin Glockner, I'm a student of computational linguistics in
Heidelberg, Germany. Together with my fellow students Doina Gliga and
Marina Stegarescu we started to write a romanian stemmer in Snowball.
We planned to finish the stemmer until end of this month. We would be
happy if the stemmer would be accepted as part of the Snowball-distribution.
There is still some work to do, e.g. evaluating the stemmer, making a
stopwords-list, unicode support, etc. After finishing this we will send
you our stemmer with the corresponding files, but I couldn't find any
email address to whom the stemmer should be sent to.
Could please someone tell me the address(es)?

With kind regards,
E. Glockner, D. Gliga, M. Stegarescu.

From: Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
To: richard@lemurconsulting.com,
    martin.porter@grapeshot.co.uk
Date: Tue Jul 18 19:43:39 2006
Subject: romanian stemmer

Dear Mr. Porter, dear Mr. Boulton,

we finally finished the Romanian stemmer. Unfortunately evaluation took
more time than expected.
However, it was an interesting experience creating the stemmer, and we
are happy to send you the result of our work.
The attachment-file is a Tarball-zipped file with (hopefully) all files
needed. The files and the stemmer as well are encoded in UTF-8. Please
inform us if something is missing.

We would be happy if the Romanian stemmer would be accepted and
integrated into the official Snowball distribution. We agree of course
to license the stemmer under the same terms as the existing snowball
software.

We're looking forward to hear from you soon.


With kind regards,

Marina, Doina and Erwin.

Attachment: [romanian1.tgz]

From: Irina Tirdea <irina.tirdea@gmail.com>
To: richard@lemurconsulting.com,
    martin.porter@grapeshot.co.uk
Date: Mon Jul 31 10:19:51 2006
Subject: Romanian stemmer

Hello,

My name is Irina Tirdea and I have developed a Romanian stemmer in Snowball
as part of my bachelor thesis, in Bucharest, Romania. I am sending you the
code attached (with vocabulary and stop word list files) and I hope you will
accept and integrate it as a part of the Snowball project. I am ready to
release the stemmer under the BSD license, just as the Snowball software.
The files have been written in UTF-8 encoding (on a Linux system).

Looking forward to hear from you.

Kind regards,
Irina Tirdea

Attachment: [romanian2.tgz]

From: martin.porter@grapeshot.co.uk (Martin Porter)
To: snowball-discuss
Cc: atordai@science.uva.nl,
    eglockne@ix.urz.uni-heidelberg.de,
    irina.tirdea@gmail.com
Date: Mon Jul 31 10:43:05 BST 2006
Subject: Tardy response to submissions to Snowball

I am sending this general email as a kind of apology, for having done nothing
so far on the following generously sent Snowball submissions:

7 June, from E. Glockner: a Romanian stemmer
8 June, from A. Tordai: a Hungarian stemmer

and this morning another Romanian stemmer arrived,

31 July, from I. Tirdea, a Romanian stemmer

After the first submission I promised to look at it "next week", so Mr Glockner
has probably been wondering what has happened. [. . .] I will make a point of
looking at these submissions this week,

More soon,

Martin

From: martin.porter@grapeshot.co.uk (Martin Porter)
To: snowball-discuss
Cc: irina.tirdea@gmail.com,
    eglockne@ix.urz.uni-heidelberg.de,
    mstegare@hotmail.com,
    doina_gliga@yahoo.co.uk,
    eglockner@hotmail.com
Date: Wed Sep 06 12:39:16 BST 2006
Subject: Romanian stemmer

To the originators of the Romanian stemmers,

I have now found time to do some preliminary work on the Romanian stemmer. I
should explain that part of the complication has been the receipt, no more
than ten days apart, of two Romanian stemmers in Snowball, the first
(romanian1) from [Glockner, Gliga, and Stegarescu] in Heidelberg, the second
(romanian2) from Tirdea in Bucharest.

[. . . .]

I have put together a vocabulary by combining the vocabularies provided with
romanian1 and romanian2. This appears in column 1. Column 2 is the stemmed
form produced by romanian1, and column 3 the stemmed form produced by
romanian2. If the entry in column 3 is blank, both stemmers are producing the
same result.

You might care to compare the two approaches.

My own feeling is that romanian1 does a more thorough job of ending removal,
but unlike romanian2 has a habit of discarding too much from short words.
aberant->ab, abatere->ab, aburi->ab are examples of this. In romanian1 the R2
test is rarely used (it seems to me that 'R1 or R2' is equivalent to 'R1',
since p2 is never to the left of p1.)

I might have a go at making some modifications here. Needless to say, I am
not familiar with Romanian, but the similarity to the other Romance
languages, especially Italian, enables one to grasp the essential features of
the morphology.

What we would like to do is to have a single stemmer for release from the
snowball site, if that is possible, and giving all necessary credits, along
the lines of the recent addition,

http://snowballstem.org/algorithms/hungarian/stemmer.html

Hope to hear from you,

Martin Porter

In the end we decided to produce our own Romanian stemmer, as described on the Romanian stemmer page. Both submitted stemmers include stop word lists, available inside the tarballs.
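
The remark in the last message, that 'R1 or R2' is equivalent to 'R1', follows from how the two regions are defined. The Python below is a minimal sketch of this. It assumes the usual Snowball definitions: R1 starts after the first non-vowel following a vowel, and R2 is found the same way with the search starting inside R1. The vowel set and the function names are illustrative assumptions, not taken from either submitted stemmer, and the sketch ignores any special marking of letters that a real stemmer might do first.

```python
# Assumed vowel set for illustration only.
VOWELS = set("aeiouăâî")


def region_start(word: str, start: int = 0) -> int:
    """Return the index just after the first non-vowel that follows a vowel,
    scanning from `start`; return len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def p1_p2(word: str) -> tuple[int, int]:
    """Return (p1, p2): the start positions of R1 and R2."""
    p1 = region_start(word)
    p2 = region_start(word, p1)  # the search for p2 begins inside R1
    return p1, p2


def suffix_in_region(word: str, suffix: str, p: int) -> bool:
    """True if `word` ends with `suffix` and the suffix lies wholly at or after p."""
    return word.endswith(suffix) and len(word) - len(suffix) >= p


def r1_or_r2_equals_r1(word: str) -> bool:
    """Check, for every suffix of `word`, that 'in R1 or in R2' equals 'in R1'."""
    p1, p2 = p1_p2(word)
    for k in range(len(word) + 1):
        suffix = word[k:]
        in_r1 = suffix_in_region(word, suffix, p1)
        in_r2 = suffix_in_region(word, suffix, p2)
        if (in_r1 or in_r2) != in_r1:
            return False
    return True


if __name__ == "__main__":
    for w in ["aberant", "abatere", "aburi"]:
        print(w, p1_p2(w), r1_or_r2_equals_r1(w))
    # aberant (2, 4) True
    # abatere (2, 4) True
    # aburi (2, 4) True
```

Because the search for p2 starts at p1, p2 can never be smaller than p1, so anything inside R2 is already inside R1. Under these assumptions p1 is 2 for all three of the words quoted in the message, which is consistent with a rule guarded only by an R1 test being able to cut each of them back to ab.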