[#6615] Current German stemmer removes umlauts

Submitted by:
Nobody
Date Submitted:
2019-02-22 18:23
Assigned to:
Nobody (None)
Priority:
3
State:
Summary:
Current German stemmer removes umlauts

Detailed description
Anonymous message posted by j.rabenschlag@gmail.com

Hi,

The current German stemmer in your package removes umlauts.

Example:

words <- "groß Größe größer"
SnowballC::wordStem(words, language = "german")
[1] "gross Grosse gross"

The Snowball project provides a variant stemmer called "german2" that is meant to prevent this problem: http://snowball.tartarus.org/algorithms/german2/stemmer.html

Could you implement this?

Some more info:
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Thanks,
Johannes

Comments:

Date: 2019-02-22 18:27
Sender: Milan Bouchet-Valat

That stemmer isn't included in the tarball of the C library that Snowball distributes (http://snowball.tartarus.org/dist/libstemmer_c.tgz). There's probably a reason for that, but you'd be better off asking the Snowball maintainers.
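
One way to check this locally is to list the stemmers compiled into the installed package. The sketch below assumes SnowballC is installed and relies on its getStemLanguages() function. The variable names are illustrative, and no output is shown because the result depends on the libstemmer version bundled with the package.

```r
# Assumption: the SnowballC package is installed.
# getStemLanguages() returns the names of the stemmers that were
# compiled into the bundled libstemmer C library.
langs <- SnowballC::getStemLanguages()

# Check whether the standard and the variant German stemmer are present.
has_german  <- "german"  %in% langs
has_german2 <- "german2" %in% langs

print(c(german = has_german, german2 = has_german2))
```

If the "german2" entry comes back FALSE, then wordStem(words, language = "german2") cannot work with that build, which would match the comment above.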
