From Fedora Project Wiki
(→‎Software Investigation and Evaluation: Moved C/C++ to "In Progress"; Moved "not maintained" to "Not Suitable")
Revision as of 16:01, 21 October 2009


Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine that covers all of its web sites, wiki and non-wiki. [1]

Requirements

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
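The two requirements can be illustrated with a minimal sketch, which is not any of the candidate packages evaluated below. It assumes a caller-supplied fetch function and uses a small in-memory set of hypothetical pages (the wiki.example and docs.example URLs are placeholders) instead of real HTTP requests. It follows links breadth-first, builds an inverted index, and answers queries by intersecting the URL sets for each query word.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect outgoing links and lower-cased words from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    index = defaultdict(set)  # word -> set of URLs containing it
    queue, seen = [start_url], {start_url}
    while queue:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser(url)
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        for word in parser.words:
            index[word].add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))


# Hypothetical pages standing in for one wiki site and one non-wiki site.
PAGES = {
    "http://wiki.example/Main": '<a href="/Install">Install</a> Fedora wiki home',
    "http://wiki.example/Install": '<a href="http://docs.example/guide">guide</a> install Fedora',
    "http://docs.example/guide": "Fedora installation guide",
}

INDEX = crawl("http://wiki.example/Main", PAGES.get)
```

Calling search(INDEX, "install fedora") returns only the two wiki pages, since the docs page contains "installation" but not "install". A real deployment would also need politeness rules (robots.txt, rate limits), stemming, and ranking, which this sketch leaves out.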

Preferences

  • Python-based (no Java)
  • Programmable keywords, to control which pages are displayed for particular search terms
  • XML or library interface so other applications can use it
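One way to picture the last two preferences is the sketch below. It is an illustration only: the PINNED table, the function names, the XML element names, and the example.org URLs are all hypothetical and do not come from any of the evaluated packages. It places administrator-chosen pages ahead of the engine's own results for selected keywords, then serializes the final list as XML so another application could consume it.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword table: search term -> pages to show first.
PINNED = {
    "download": ["http://example.org/get-fedora"],
    "install": ["http://example.org/install-guide"],
}


def apply_keywords(query, organic_results, pinned=PINNED):
    """Return results with pinned pages for any query word placed first."""
    ordered = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in ordered:
                ordered.append(url)
    # Append the engine's own results, skipping pages already pinned.
    for url in organic_results:
        if url not in ordered:
            ordered.append(url)
    return ordered


def results_to_xml(query, urls):
    """Serialize a ranked result list as a small XML document (a string)."""
    root = ET.Element("results", query=query)
    for rank, url in enumerate(urls, start=1):
        ET.SubElement(root, "result", rank=str(rank)).text = url
    return ET.tostring(root, encoding="unicode")


organic = ["http://example.org/wiki/Fedora", "http://example.org/install-guide"]
ranked = apply_keywords("Install Fedora", organic)
xml_text = results_to_xml("Install Fedora", ranked)
```

Keeping the keyword table separate from the index means it can be edited without re-crawling. The same two functions also serve as a library interface for Python callers that do not need XML.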

Project Plan

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Resources Needed

  • Public test instance for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

In Progress

  • DataparkSearch [2]
    written in C
  • Indri [3]
    written in C/C++
  • Isearch [4]
    written in C++
  • KinoSearch [5] - Allen investigating
    Perl port of Lucene
  • mnoGoSearch [6]
    written in C
  • Namazu [7] - Huzaifa investigating
    written in Perl
  • Swish-e [8]
    written in C
    Swish++ is a rewrite in C++
  • Xapian [9]
    written in C++
  • YaCy [10]
    written in C
  • Zettair [11]
    written in C

Not Suitable

  • Egothor [12]
    written in Java
  • Grub [13]
    written in C#
  • Heritrix [14]
    • written in Java
    • archives content rather than simply indexing it
  • ht://Dig [15]
    • written in C++
    • not actively maintained
  • HtdigSearch [16]
    It's just a MediaWiki plugin, not suitable for searching non-wiki sites
  • Lucene [17]
    written in Java, but ported to others [18]
  • MWSearch [19]
    • Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
    • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
  • Nutch [20]
    • written in Java
    • based on Lucene
  • OpenFTS [21]
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Plucene [22]
    • Perl port of Lucene
    • not actively maintained
  • RigorousSearch [23]
    Crawls the MediaWiki database, not the web site
    Doesn't work for non-MediaWiki web sites, including any non-wiki web site
  • Sphinx [24]
    written in C++
    designed to index SQL tables, not web pages
  • SphinxSearch [25]
    Written in C++
    Wiki-only (?)
  • Terrier (TERabyte RetrIEveR) [26]
    written in Java

Public Testing

<tbd>

Deployment Plan

<tbd>

References

  1. "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055. 
  2. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/. 
  3. "Indri". The Lemur Project. http://www.lemurproject.org/indri/. 
  4. "Isearch". Isite. http://isite.awcubed.com/. 
  5. "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/. 
  6. "mnoGoSearch". LavTech. http://www.mnogosearch.org/. 
  7. "Namazu". Namazu Project. http://www.namazu.org/. 
  8. "Swish-e". Swish-e. http://swish-e.org/. 
  9. "Xapian". Xapian Project. http://xapian.org/. 
  10. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/. 
  11. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/. 
  12. "Egothor". Egothor. http://www.egothor.org/. 
  13. "Grub". Wikia, Inc. http://grub.org/. 
  14. "Heritrix". Internet Archive. http://crawler.archive.org/. 
  15. "ht://Dig". The ht://Dig Group. http://www.htdig.org/. 
  16. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch. 
  17. "Lucene". Apache Software Foundation. http://lucene.apache.org/. 
  18. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations. 
  19. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch. 
  20. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/. 
  21. "OpenFTS". SourceForge. http://openfts.sourceforge.net/. 
  22. "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25. 
  23. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch. 
  24. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/. 
  25. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch. 
  26. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/.