From Fedora Project Wiki

Revision as of 09:05, 6 November 2008

Fix the dictionary proliferation problem

Summary

Fix the proliferation of dictionaries in the OS.

Owners

Current status

  • Targeted release: Fedora 9
  • Last modified: 2008-04-07
  • Percentage of completion: 100%
  • This is complete: all major applications and the default GNOME/KDE spell checking now go through hunspell. All that remains is to package dictionaries for the lesser-used languages that do not yet have a sufficiently active Fedora-using language community to have packaged a dictionary for their language.

Usage cases/rationale

We have separate dictionaries for each language for OpenOffice.org, Firefox, Thunderbird, and aspell (which GNOME and KDE use). This is wasteful: the same language ends up with several duplicate dictionaries.

Benefit to Fedora

We get code reuse, a smaller distribution, and a decreased memory footprint.

Scope

Requires changing the OpenOffice.org, Thunderbird, Firefox, and dictionary packages.

Test Plan

Test spell checking in all apps.
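
A backend-level smoke test could be sketched as follows. It only checks that hunspell itself flags a misspelling and does not exercise the applications. It assumes the `hunspell` command-line tool and an `en_US` dictionary are installed; the helper names `build_command` and `misspelled` are made up for this sketch.

```python
import shutil
import subprocess


def build_command(lang):
    """Command line asking hunspell to list misspelled words read from stdin."""
    # -d selects the dictionary, -l prints only the misspelled words.
    return ["hunspell", "-d", lang, "-l"]


def misspelled(words, lang="en_US"):
    """Return the words hunspell rejects, or None if the check cannot run."""
    if shutil.which("hunspell") is None:
        return None  # hunspell is not installed on this machine
    result = subprocess.run(
        build_command(lang),
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # for example, no dictionary found for this language
    return result.stdout.split()


if __name__ == "__main__":
    print(misspelled(["hello", "wrold"]))
```

A `None` result means the check could not run, which is itself worth investigating on a default install.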

Dependencies

None.

Details

  1. Split out hunspell from OpenOffice.org - rhbz#214764 complete
  2. Make OpenOffice.org use it - rhbz#214764 complete
  3. Split out the dictionaries into separate packages - rhbz#218769 (english) complete
  4. Make OpenOffice.org use system dictionaries - complete
  5. Make gedit/xchat use it, i.e. enchant. By default, enchant already generally prefers hunspell over aspell; it just needs to be told where the dictionaries are - complete
  6. Make evolution use it, i.e. gnome-spell. gnome-spell can be patched to use enchant to achieve this - rhbz#426347 complete
  7. Make tomboy/pidgin use it, i.e. gtkspell. Same story as gnome-spell - rhbz#245888 complete
  8. Make Firefox (and other gecko apps) use it - rhbz#218762 complete, upstream state is now resolved
  9. Make KDE use enchant and/or hunspell - complete - KDE 4 already defaults to enchant in Sonnet. (For K3Spell, see "legacy KSpell" below.) The aspell backend was dropped entirely in Rawhide. For kdelibs3:
    • The legacy KSpell uses command-line spellcheckers. Kevin Kofler wrote a patch to support hunspell, and kde-settings in Rawhide was changed to make it the default.
    • The newer KSpell2 API is plugin-based and uses libraries. It is what KDE 4's Sonnet is based on. Kevin Kofler backported Sonnet's enchant backend. The aspell and ispell backends were dropped in Rawhide.
    See the fedora-devel-list message.
  10. Remove copy of hunspell from enchant - rhbz#426402 complete
  11. Remove copy of hunspell from xulrunner - complete
  12. Split enchant to have a separate enchant-aspell rpm so that aspell support can optionally be removed - rhbz#426402 complete
  13. Prefer hunspell over aspell as the default for install in comps. See the table below for mismatches in language support. rhbz#439037 complete
  14. Repackage/replace the aspell dictionaries with hunspell dictionaries - 80% complete; see the table below for language support
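
Several of the steps above amount to pointing every application at one shared set of dictionary files. A hunspell dictionary is a pair of files, an `.aff` affix file and a `.dic` word list, so a quick way to see what a shared directory offers is to list the complete pairs. The sketch below does that; the directory `/usr/share/myspell` is an assumption about where the system dictionaries live, and the function name `installed_dictionaries` is made up for this sketch.

```python
from pathlib import Path


def installed_dictionaries(directory):
    """Return dictionary names that have both an .aff and a .dic file."""
    root = Path(directory)
    if not root.is_dir():
        return []
    aff = {p.stem for p in root.glob("*.aff")}
    dic = {p.stem for p in root.glob("*.dic")}
    # A dictionary is usable only when both halves are present.
    return sorted(aff & dic)


if __name__ == "__main__":
    # Assumed location of the shared dictionaries; adjust as needed.
    for name in installed_dictionaries("/usr/share/myspell"):
        print(name)
```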

Optional

  1. Write an aspell compatibility layer so aspell apps can use the same dictionaries - no volunteer, so deferred. Is this necessary at all? All major desktop apps now work out of the box.
  2. Make vim use hunspell - rhbz#219777, patch available. Not necessary if vim continues not to use any spell checking, but preferred over introducing the built-in vim spell checker, which has yet another format that hunspell dictionaries must be converted to.

Dictionaries

1. Language Support Matrix (glibc upwards)

Language Code Language aspell hunspell notes
aa Afar afarfriends.org hosted ALSEC report.
af Afrikaans aspell-af hunspell-af
am Amharic available And one for non-commercial use
an Aragonese www.iea.es, see Spain: Lexicography In Iberian Languages
ar Arabic aspell-ar hunspell-ar
as Assamese www.xobdo.net
ast Asturian www.academiadelallingua.com, see Spain: Lexicography In Iberian Languages
az Azeri (Latin) hunspell-az
be Belarusian hunspell-be
ber Amazigh (Tifinagh) hunspell-ber
ber Amazigh (Latin)
bg Bulgarian aspell-bg hunspell-bg
bn Bengali aspell-bn hunspell-bn
bo Tibetan bo.openoffice.org. Latest language support update.
br Breton aspell-br hunspell-br
bs Bosnian From a pure spelling-dictionary point of view, would there be differences from hunspell-hr?
byn Blin Blin Orthography: A History and an Assessment
ca Catalan aspell-ca hunspell-ca
crh Crimean Tatar corpus
cs Czech aspell-cs hunspell-cs
csb Kashubian hunspell-csb
cy Welsh aspell-cy hunspell-cy
da Danish aspell-da hunspell-da
de German aspell-de hunspell-de
dz Dzongkha crubadan corpus building
el Greek aspell-el hunspell-el
en English aspell-en hunspell-en
es Spanish aspell-es hunspell-es
et Estonian hunspell-ee
eu Basque hunspell-eu
fa Farsi hunspell-fa
fi Finnish The Finnish community has a parallel Voikko solution, with an enchant backend, an OpenOffice.org extension, and a Firefox extension.
fil Filipino hunspell-tl Filipino is effectively an official Tagalog-based language
fo Faeroese aspell-fo hunspell-fo
fr French aspell-fr hunspell-fr
fur Friulian hunspell-fur
fy Frisian hunspell-fy
ga Irish aspell-ga hunspell-ga
gd Scots Gaelic aspell-gd hunspell-gd
gez Ge'ez Ge'ez Frontier Foundation
gl Galician aspell-gl hunspell-gl
gu Gujarati aspell-gu hunspell-gu
gv Manx convertible
ha Hausa crubadan possible wordlist, www.dictionary.kasahorow.com
he Hebrew aspell-he hunspell-he
hi Hindi aspell-hi hunspell-hi
hr Croatian aspell-hr hunspell-hr
hsb Upper Sorbian hunspell-hsb
hu Hungarian hunspell-hu
hy Armenian hunspell-hy
id Indonesian aspell-id hunspell-id
ig Igbo crubadan, www.dictionary.kasahorow.com
ik Inupiaq Iñupiaq parser project. Broken download link to MSWord dictionary
is Icelandic aspell-is hunspell-is
it Italian aspell-it hunspell-it
iu Inuktitut www.livingdictionary.com
ja Japanese
ka Georgian ka.openoffice.org : Crubadan is aware of 29023 words
kk Kazakh available
kl Kalaallisut Greenlandic parser project
km Khmer hunspell-km
kn Kannada BharateeyaOO.o
ko Korean
ku Kurdish (Latin) hunspell-ku
ku Kurdish (Arabic)
kw Cornish crubadan corpus building
ky Kyrgyz OOo localization beginnings
lg Luganda A general translation effort.
li Limburgish crubadan corpus building
lo Lao Lao OOo localization
lt Lithuanian hunspell-lt
lv Latvian hunspell-lv
mai Maithili maithiliacademy.org
mg Malagasy hunspell-mg
mi Maori hunspell-mi
mk Macedonian hunspell-mk
ml Malayalam aspell-ml hunspell-ml
mn Mongolian hunspell-mn
mr Marathi aspell-mr hunspell-mr
ms Malay hunspell-ms
mt Maltese hunspell-mt
nb Bokmaal aspell-no hunspell-nb
nds Lowlands Saxon hunspell-nds
ne Nepali hunspell-ne
nl Dutch aspell-nl hunspell-nl
nn Nynorsk aspell-no hunspell-nn
nr Ndebele (Southern) hunspell-nr
nso Sotho (Northern) hunspell-nso
oc Occitan hunspell-oc
om Oromo crubadan corpus building. Oromo wiki entry
or Oriya aspell-or hunspell-or
pa Punjabi aspell-pa hunspell-pa
pap Papiamento crubadan corpus building
pl Polish aspell-pl hunspell-pl
pt Portuguese aspell-pt hunspell-pt
ro Romanian hunspell-ro
ru Russian aspell-ru hunspell-ru
rw Kinyarwanda hunspell-rw
sa Sanskrit An apparent effort to create a Sanskrit hunspell dictionary
sc Sardinian hunspell-sc
se Sami, Northern available A colossal 50Megs
shs Secwepemctsin www.native-languages.org
si Sinhala A very small wordlist
sid Sidamo Some info
sk Slovak aspell-sk hunspell-sk
sl Slovenian aspell-sl hunspell-sl
so Somali An apparent effort to create a Somali hunspell dictionary
sq Albanian hunspell-sq
sr Serbian aspell-sr hunspell-sr
ss Swati hunspell-ss
st Sotho (Southern) hunspell-st
sv Swedish aspell-sv hunspell-sv
ta Tamil aspell-ta hunspell-ta
te Telugu aspell-te hunspell-te
tg Tajik An apparent effort to create a Tajik hunspell dictionary
th Thai hunspell-th
ti Tigrigna non-commercial use
tig Tigre crubadan corpus building
tk Turkmen hunspell-tk
tl Tagalog hunspell-tl
tn Tswana hunspell-tn
tr Turkish available But, as with Finnish and Voikko, the typical solution for Turkish has been the Zemberek library, with an enchant backend, an OpenOffice.org extension, and a Firefox extension.
ts Tsonga hunspell-ts
tt Tatar available Hard to see where this came from originally, and what license it is under exactly; GPLv2+ (?). Perhaps it is an original work of ALT Linux and that is actually the canonical upstream?
ug Uyghur www.uyghurdictionary.org
uk Ukrainian hunspell-uk
ur Urdu hunspell-ur
uz Uzbek hunspell-uz
ve Venda hunspell-ve
vi Vietnamese hunspell-vi
wa Walloon hunspell-wa
wo Wolof www.alfanet.anafa.org makes Wolof localizations of Firefox and AbiWord. www.dictionary.kasahorow.com
xh Xhosa hunspell-xh
yi Yiddish The uspell spell-checker
yo Yoruba An apparent effort to create a Yoruba hunspell dictionary. www.dictionary.kasahorow.com
zh Chinese
zu Zulu hunspell-zu
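
The mismatch in language support mentioned in the Details list can be read off the matrix by grouping languages according to which packages exist for them. The sketch below is only an illustration of that bookkeeping: `SAMPLE_ROWS` holds four rows copied by hand from the table above, and the `coverage` function and its group names are made up for this sketch.

```python
# (language code, aspell package or None, hunspell package or None),
# copied by hand from four rows of the matrix above.
SAMPLE_ROWS = [
    ("af", "aspell-af", "hunspell-af"),
    ("az", None, "hunspell-az"),
    ("hu", None, "hunspell-hu"),
    ("ja", None, None),
]


def coverage(rows):
    """Group language codes by which spell-checker packages exist for them."""
    groups = {"both": [], "aspell_only": [], "hunspell_only": [], "neither": []}
    for code, aspell, hunspell in rows:
        if aspell and hunspell:
            key = "both"
        elif aspell:
            key = "aspell_only"
        elif hunspell:
            key = "hunspell_only"
        else:
            key = "neither"
        groups[key].append(code)
    return groups


if __name__ == "__main__":
    for key, codes in coverage(SAMPLE_ROWS).items():
        print(f"{key}: {', '.join(codes) or '-'}")
```

An entry under `aspell_only` would flag a language that loses coverage when hunspell becomes the default; entries under `neither` are the languages still waiting for any packaged dictionary.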


2. Language Support Matrix (extra OOo recognized not in glibc)

Language Code Language hunspell notes
ak Akan small list of unknown licence. www.dictionary.kasahorow.com
az Azeri (Cyrillic)
bm Bambara Online Dictionary
brx Bodo Online Dictionary
cop Coptic available
cv Chuvash From this forum looks like there was a cv_RU-1.00.zip but download site is gone/down.
dgo Dogri Central Institute for Indian Languages
dv Dhivehi English-Dhivehi dictionary
ee Ewe online dictionary
eo Esperanto available
fj Fijian hunspell-fj
gsc Gascon Non-Commercial BY-NC-ND license
gug Guarani crubadan corpus building
hil Hiligaynon convertible
ia Interlingua available
ks Kashmiri online dictionary
kok Konkani online dictionary (http://www.savemylanguage.org/)
la Latin hunspell-la
lb Luxembourgish available We don't have the EUPL on our licence list yet
lg Ganda online dictionary
ln Lingala crubadan corpus building
mos Mossi info, dictionary effort (hunspell has no problem with utf-8 .dic files FWIW)
mni Manipuri some info
my Burmese online dictionary
ny Nyanja convertible
quh Quechua South Bolivia current effort
qul Quechua North Bolivia current effort
rm Raeto-Romance
sat Santali online dictionary
sd Sindhi online dictionary
sg Sango www.dictionary.kasahorow.com
sjd Sami, Kildin Northern Sami
sma Sami, Southern Northern Sami
smj Sami, Lule Northern Sami
smn Sami, Inari Northern Sami
sms Sami, Skolt Northern Sami
sw Swahili hunspell-sw
tet Tetum available

User experience

Should not affect user experience.

Contingency plan

Continue to ship older dictionaries.

Documentation

[1]

Release Notes

There is a new default spell checking back-end, hunspell, for both the GNOME and KDE desktops, as well as applications such as OpenOffice.org, Firefox, and other XULRunner-based applications. This common back-end includes a set of shared, multi-lingual dictionaries for use with hunspell. This feature uses a single set of common dictionaries regardless of the application, which gives consistent suggestions for misspelled words and uses less disk space by eliminating duplicate dictionaries.

Comments

Note that JDS is going down this route as well.

The OpenOffice.org hunspell dictionary list of working dictionaries

The mozilla hunspell dictionary list of tri-licensed dictionaries

The firefox extension list of available language extensions

How to build a dictionary

How to convert an ispell affix to hunspell .aff

Language Codes

A somewhat related issue.

Will help with adding Indic hunspell dictionaries to Fedora - paragn.

php5 and bluefish still link to aspell at least - kmaraas. (It's not practical for me to port everything, just the core default installed components and the default spell-checking solutions for the main desktop environments and applications - caolanm)

Ubuntu is now following the Fedora practice as well.