From Fedora Project Wiki
m (→‎Software Investigation and Evaluation: Remove test references tag)
(Reorganize headers, including breaking Software Investigation and Evaluation into In Progress and Not Suitable)
Line 1: Line 1:
{{header|infra}}
{{header|infra}}


= Project Sponsor =
= Points of Contact =
 
=== Project Sponsor ===
'''Name:''' Mike McGrath<br>
'''Name:''' Mike McGrath<br>
'''Fedora Account Name:''' mmcgrath<br>
'''Fedora Account Name:''' mmcgrath<br>
Line 7: Line 9:
'''Infrastructure Sponsor:''' mmcgrath<br>
'''Infrastructure Sponsor:''' mmcgrath<br>


== Secondary Contact info ==
=== Secondary Contact info ===
'''Name:''' Huzaifa Sidhpurwala<br>
'''Name:''' Huzaifa Sidhpurwala<br>
'''Fedora Account Name:''' huzaifas<br>
'''Fedora Account Name:''' huzaifas<br>
Line 25: Line 27:
'''Expiration/Delivery Date (required):''' F13<br>
'''Expiration/Delivery Date (required):''' F13<br>


Description/Summary:
=== Description/Summary ===
 
Fedora needs a search engine<ref name="Trac">{{cite web|url=https://fedorahosted.org/fedora-infrastructure/ticket/1055|title=Fedora Search Engine|publisher=[[Infrastructure/Tickets]]}}</ref>
Fedora needs a search engine<ref name="Trac">{{cite web|url=https://fedorahosted.org/fedora-infrastructure/ticket/1055|title=Fedora Search Engine|publisher=[[Infrastructure/Tickets]]}}</ref>


Requirements:
=== Requirements ===
 
* Crawl the web sites (wiki and non-wiki)
* Crawl the web sites (wiki and non-wiki)
* Search the web sites (wiki and non-wiki)
* Search the web sites (wiki and non-wiki)


Preferences:
=== Preferences ===
 
* Python-based (no Java)
* Python-based (no Java)
* Programmable keywords to have control over what pages get displayed for certain keywords
* Programmable keywords to have control over what pages get displayed for certain keywords
* XML or library interface so other applications can use it
* XML or library interface so other applications can use it


Project plan (Detailed):
=== Project Plan ===
# Investigate and evaluate existing open source search engines
# Investigate and evaluate existing open source search engines
# Select candidate software
# Select candidate software
Line 45: Line 50:
# Deploy
# Deploy


== Specific resources needed ==
=== Resources Needed ===
 
* Public Test for testing candidate software
* Public Test for testing candidate software
* Permanent home(s) for deployment
* Permanent home(s) for deployment
Line 52: Line 58:


== Software Investigation and Evaluation ==
== Software Investigation and Evaluation ==
=== In Progress ===


* HtdigSearch <ref name="HtdigSearch">{{cite web|url=https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch|title=HtdigSearch Extension|publisher=[[MediaWiki]]}}</ref>
* HtdigSearch <ref name="HtdigSearch">{{cite web|url=https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch|title=HtdigSearch Extension|publisher=[[MediaWiki]]}}</ref>
Line 80: Line 88:
* Plucene
* Plucene
: Perl port of Lucene
: Perl port of Lucene
=== Not Suitable ===


* DataparkSearch <ref name="DataparkSearch">{{cite web|url=http://www.dataparksearch.org/|title=DataparkSearch|publisher=DataparkSearch}}</ref>
* DataparkSearch <ref name="DataparkSearch">{{cite web|url=http://www.dataparksearch.org/|title=DataparkSearch|publisher=DataparkSearch}}</ref>
: Not suitable
: written in C
: written in C


* Egothor <ref name="Egothor">{{cite web|url=http://www.egothor.org/|title=Egothor|publisher=Egothor}}</ref>
* Egothor <ref name="Egothor">{{cite web|url=http://www.egothor.org/|title=Egothor|publisher=Egothor}}</ref>
: Not suitable
: written in Java
: written in Java


* Grub <ref name="Grub">{{cite web|url=http://grub.org/|title=Grub|publisher=Wikia, Inc.}}</ref>
* Grub <ref name="Grub">{{cite web|url=http://grub.org/|title=Grub|publisher=Wikia, Inc.}}</ref>
: Not suitable
: written in C#
: written in C#


* ht://dig <ref name="htDig">{{cite web|url=http://www.htdig.org/|title=ht://Dig|publisher=The ht://Dig Group}}</ref>
* ht://dig <ref name="htDig">{{cite web|url=http://www.htdig.org/|title=ht://Dig|publisher=The ht://Dig Group}}</ref>
: Not suitable
:* written in C++
:* written in C++
:* not actively maintained
:* not actively maintained


* Indri <ref name="Indri">{{cite web|url=http://www.lemurproject.org/indri/|title=Indri|publisher=The Lemur Project}}</ref>
* Indri <ref name="Indri">{{cite web|url=http://www.lemurproject.org/indri/|title=Indri|publisher=The Lemur Project}}</ref>
: Not suitable
: written in C/C++
: written in C/C++


* Isearch <ref name="Isearch">{{cite web|url=http://isite.awcubed.com/|title=Isearch|publisher=Isite}}</ref>
* Isearch <ref name="Isearch">{{cite web|url=http://isite.awcubed.com/|title=Isearch|publisher=Isite}}</ref>
: Not suitable
: written in C++
: written in C++


* Lucene <ref name="Lucene">{{cite web|url=http://lucene.apache.org/|title=Lucene|publisher=Apache Software Foundation}}</ref>
* Lucene <ref name="Lucene">{{cite web|url=http://lucene.apache.org/|title=Lucene|publisher=Apache Software Foundation}}</ref>
: Not suitable
:* originally in Java, ported to others
:* originally in Java, ported to others
:* Perl ports are Plucene and KinoSearch; Ruby port is Ferret
:* Perl ports are Plucene and KinoSearch; Ruby port is Ferret
Line 113: Line 116:


* mnoGoSearch <ref name="mnoGoSearch">{{cite web|url=http://www.mnogosearch.org/|title=mnoGoSearch|publisher=LavTech}}</ref>
* mnoGoSearch <ref name="mnoGoSearch">{{cite web|url=http://www.mnogosearch.org/|title=mnoGoSearch|publisher=LavTech}}</ref>
: Not suitable
: written in C
: written in C


* MWSearch <ref name="MWSearch">{{cite web|url=https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch|title=MWSearch Extension|publisher=[[MediaWiki]]}}</ref>
* MWSearch <ref name="MWSearch">{{cite web|url=https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch|title=MWSearch Extension|publisher=[[MediaWiki]]}}</ref>
: Not suitable
:* Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
:* Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
:* EzMwLucene is wiki-only, therefore MWSearch is wiki-only
:* EzMwLucene is wiki-only, therefore MWSearch is wiki-only


* Nutch <ref name="Nutch">{{cite web|url=http://lucene.apache.org/nutch/|title=Nutch|publisher=Apache Software Foundation}}</ref>
* Nutch <ref name="Nutch">{{cite web|url=http://lucene.apache.org/nutch/|title=Nutch|publisher=Apache Software Foundation}}</ref>
: Not suitable
:* written in Java
:* written in Java
:* based on Lucene
:* based on Lucene


* RigorousSearch <ref name="RigorousSearch">{{cite web|url=https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch|title=RigorousSearch Extension|publisher=[[MediaWiki]]}}</ref>
* RigorousSearch <ref name="RigorousSearch">{{cite web|url=https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch|title=RigorousSearch Extension|publisher=[[MediaWiki]]}}</ref>
: Not suitable
: Crawls the MediaWiki database, not the web site.  It doesn't work for non-MediaWiki web sites, including any non-wiki web site.
: Crawls the MediaWiki database, not the web site.  It doesn't work for non-MediaWiki web sites, including any non-wiki web site.


* Sphinx <ref name="Sphinx">{{cite web|url=http://sphinxsearch.com/|title=Sphinx|publisher=Sphinx Technologies}}</ref>
* Sphinx <ref name="Sphinx">{{cite web|url=http://sphinxsearch.com/|title=Sphinx|publisher=Sphinx Technologies}}</ref>
: Not suitable
: written in C++
: written in C++


* Swish-e <ref name="Swish-e">{{cite web|url=http://swish-e.org/|title=Swish-e|publisher=Swish-e}}</ref>
* Swish-e <ref name="Swish-e">{{cite web|url=http://swish-e.org/|title=Swish-e|publisher=Swish-e}}</ref>
: Not suitable
: written in C
: written in C
: Swish++ is a rewrite in C++
: Swish++ is a rewrite in C++


* Terrier (TERabyte RetrIEveR) <ref name="Terrier">{{cite web|url=http://ir.dcs.gla.ac.uk/terrier/|title=Terrier|publisher=Terrier Project}}</ref>
* Terrier (TERabyte RetrIEveR) <ref name="Terrier">{{cite web|url=http://ir.dcs.gla.ac.uk/terrier/|title=Terrier|publisher=Terrier Project}}</ref>
: Not suitable
: written in Java
: written in Java


* Xapian <ref name="Xapian">{{cite web|url=http://xapian.org/|title=Xapian|publisher=Xapian Project}}</ref>
* Xapian <ref name="Xapian">{{cite web|url=http://xapian.org/|title=Xapian|publisher=Xapian Project}}</ref>
: Not suitable
: written in C++
: written in C++


* YaCy <ref name="YaCy">{{cite web|url=http://yacy.net/|title=YaCy|publisher=Karlsruhe Institute of Technology}}</ref>
* YaCy <ref name="YaCy">{{cite web|url=http://yacy.net/|title=YaCy|publisher=Karlsruhe Institute of Technology}}</ref>
: Not suitable
: written in C
: written in C


* Zettair <ref name="Zettair">{{cite web|url=http://www.seg.rmit.edu.au/zettair/|title=Zettair|publisher=Search Engine Group, Royal Melbourne Institute of Technology}}</ref>
* Zettair <ref name="Zettair">{{cite web|url=http://www.seg.rmit.edu.au/zettair/|title=Zettair|publisher=Search Engine Group, Royal Melbourne Institute of Technology}}</ref>
: Not suitable
: written in C
: written in C


== Public Testing ==
== Public Testing ==
<tbd>
<tbd>


== Deployment Plan ==
== Deployment Plan ==
<tbd>
<tbd>


= References =
= References =
{{reflist}}
{{reflist}}


[[Category:Infrastructure]]
[[Category:Infrastructure]]

Revision as of 22:50, 12 October 2009


Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine[1]

Requirements

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
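
As a rough illustration of the two requirements, the sketch below crawls a site breadth-first and builds a word index that can then be searched. It is not taken from any of the candidate packages. The names crawl, search and the fetch callable are hypothetical, and a real crawler would also need to stay within the Fedora sites and respect robots.txt.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class _PageParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, fetch):
    """Breadth-first crawl from start_url.

    fetch(url) is a caller-supplied (hypothetical) function that returns
    the page HTML as a string, or None if the page cannot be retrieved.
    Returns a mapping of lower-cased word -> set of URLs containing it.
    """
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = _PageParser()
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index[word].add(url)
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))
```

Because fetch is passed in, the same code can be pointed at wiki and non-wiki pages alike, or at an in-memory dictionary for testing.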

Preferences

  • Python-based (no Java)
  • Programmable keywords, to control which pages are displayed for certain keywords
  • XML or library interface so other applications can use it
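
One possible reading of the last two preferences is sketched below: a curated keyword table that moves chosen pages to the front of a result list, and an XML rendering of the results that other applications could parse. The function names, the table layout, the XML element names and the example.org URLs are all assumptions made for illustration; none of them comes from the candidate software.

```python
import xml.etree.ElementTree as ET

# Hypothetical curated table: keyword -> pages that should be listed first.
PINNED = {
    "install": ["https://example.org/install-guide"],
}


def apply_pins(query, results, pinned=PINNED):
    """Return results with pages pinned to the query's keywords moved to the front."""
    front = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in front:
                front.append(url)
    # Keep the remaining results in their original order, without duplicates.
    return front + [url for url in results if url not in front]


def to_xml(query, results):
    """Serialize a ranked result list as a small XML document (assumed schema)."""
    root = ET.Element("results", {"query": query})
    for rank, url in enumerate(results, start=1):
        ET.SubElement(root, "result", {"rank": str(rank)}).text = url
    return ET.tostring(root, encoding="unicode")
```

Keeping the pinning step separate from the search back end means the keyword table could be maintained independently of whichever engine is eventually selected.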

Project Plan

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Resources Needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

In Progress

  • HtdigSearch [2]
    Huzaifa (in progress)
  • SphinxSearch [3]
    Huzaifa (in progress)
  • Ferret
    Ruby port of Lucene
  • Gonzui [4] (specializes in source code search)
    • written in Ruby
    • not actively maintained
  • KinoSearch
    Perl port of Lucene
  • Namazu [5]
    written in Perl
  • OpenFTS [6]
    Not suitable
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Plucene
    Perl port of Lucene

Not Suitable

  • DataparkSearch [7]
    written in C
  • Egothor [8]
    written in Java
  • Grub [9]
    written in C#
  • ht://dig [10]
    • written in C++
    • not actively maintained
  • Indri [11]
    written in C/C++
  • Isearch [12]
    written in C++
  • Lucene [13]
    • originally in Java, ported to others
    • Perl ports are Plucene and KinoSearch; Ruby port is Ferret
    • see Lucene Implementations [14]
  • mnoGoSearch [15]
    written in C
  • MWSearch [16]
    • Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
    • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
  • Nutch [17]
    • written in Java
    • based on Lucene
  • RigorousSearch [18]
    Crawls the MediaWiki database, not the web site. It doesn't work for non-MediaWiki web sites, including any non-wiki web site.
  • Sphinx [19]
    written in C++
  • Swish-e [20]
    written in C
    Swish++ is a rewrite in C++
  • Terrier (TERabyte RetrIEveR) [21]
    written in Java
  • Xapian [22]
    written in C++
  • YaCy [23]
    written in C
  • Zettair [24]
    written in C
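
The reasons recorded in the evaluation lists above appear to fall into a few recurring kinds: implementation language, maintenance status, and wiki-only coverage. The sketch below encodes that reading as a small screening function. It is an interpretation of the notes, not a rule stated by the project; the Candidate fields and the objections function are hypothetical, and treating every non-Python language as an objection is an assumption based on the "Python-based (no Java)" preference.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    language: str
    maintained: bool = True
    wiki_only: bool = False


def objections(candidate):
    """List the reasons a candidate would be flagged under the assumed criteria."""
    reasons = []
    if candidate.language != "Python":  # assumption: stated preference is Python
        reasons.append(f"written in {candidate.language}")
    if not candidate.maintained:
        reasons.append("not actively maintained")
    if candidate.wiki_only:  # requirement: wiki and non-wiki sites
        reasons.append("wiki-only")
    return reasons


if __name__ == "__main__":
    for c in (
        Candidate("Egothor", "Java"),
        Candidate("ht://dig", "C++", maintained=False),
    ):
        print(c.name, "->", "; ".join(objections(c)) or "no objections")
```

An empty list means only that none of these assumed criteria applies; functionality, performance, and impact would still need the testing described in the project plan.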

Public Testing

<tbd>

Deployment Plan

<tbd>

References

  1. "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055. 
  2. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch. 
  3. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch. 
  4. "Gonzui". SourceForge. http://gonzui.sourceforge.net/. 
  5. "Namazu". Namazu Project. http://www.namazu.org/. 
  6. "OpenFTS". SourceForge. http://openfts.sourceforge.net/. 
  7. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/. 
  8. "Egothor". Egothor. http://www.egothor.org/. 
  9. "Grub". Wikia, Inc.. http://grub.org/. 
  10. "ht://Dig". The ht://Dig Group. http://www.htdig.org/. 
  11. "Indri". The Lemur Project. http://www.lemurproject.org/indri/. 
  12. "Isearch". Isite. http://isite.awcubed.com/. 
  13. "Lucene". Apache Software Foundation. http://lucene.apache.org/. 
  14. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations. 
  15. "mnoGoSearch". LavTech. http://www.mnogosearch.org/. 
  16. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch. 
  17. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/. 
  18. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch. 
  19. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/. 
  20. "Swish-e". Swish-e. http://swish-e.org/. 
  21. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/. 
  22. "Xapian". Xapian Project. http://xapian.org/. 
  23. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/. 
  24. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/.