From Fedora Project Wiki

Latest revision as of 20:35, 9 February 2012

Sphinx

  • The SphinxSearch extension page (http://www.mediawiki.org/wiki/Extension:SphinxSearch) is helpful; I used their config and modified it to our needs.
  • The Sphinx indexer simply runs on a cron, so that part is simple.
  • As for the front end, we are going to look at packaging the MW extension linked above.
    • The extension depends on sphinxapi.php, which is in the libsphinxclient package, at /usr/share/doc/libsphinxclient-0.9.9/sphinxapi.php.
    • The extension does not seem to work with MW 1.16, but we want to upgrade eventually anyway.
  • Sphinx does not crawl; it only indexes databases, which kind of defeats the purpose for us.
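
A minimal sketch of the cron-driven indexing mentioned above, in Python. It only builds the `indexer` command line and the matching crontab entry. The config path `/etc/sphinx/sphinx.conf`, the schedule, and the index name `wiki_main` are assumptions for illustration, not values taken from our setup.

```python
import shlex


def indexer_command(config_path, indexes=None, rotate=True):
    """Build the argv list for one run of the Sphinx `indexer` tool."""
    cmd = ["indexer", "--config", config_path]
    if rotate:
        # --rotate lets a running searchd pick up the rebuilt index files.
        cmd.append("--rotate")
    if indexes:
        cmd.extend(indexes)   # reindex only the named indexes
    else:
        cmd.append("--all")   # reindex everything listed in the config
    return cmd


def crontab_line(schedule, config_path, indexes=None):
    """Format a crontab entry that runs the indexer on the given schedule."""
    argv = indexer_command(config_path, indexes)
    return schedule + " " + " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical schedule and path: hourly at minute 15.
    print(crontab_line("15 * * * *", "/etc/sphinx/sphinx.conf"))
```

The printed line can be dropped into a crontab as-is; how often to reindex depends on how quickly the wiki database changes.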

Xapian

  • Doesn't have a crawler built in.
  • Most of the work is done via Omega; Xapian just backs it.
  • Hacky way to crawl sites: crawl with htdig, then convert the output into a format Omega understands and can index.
  • htdig is unsupported and OLD.
  • htdig seems to segfault on https sites in my testing.
  • Omega's default UI is ugly, but that is changeable.
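
A rough sketch of the "convert into a format Omega understands" step, in Python. It assumes the crawled pages are already available as dictionaries and writes them in the dump format that Omega's scriptindex reads, as I understand it: one `name=value` line per field, embedded newlines continued with a leading `=`, and a blank line between records. The field names `url`, `title`, and `text` are hypothetical and would need a matching index script; parsing htdig's own output is not shown.

```python
def to_scriptindex_record(fields):
    """Render one page as a scriptindex-style record of name=value lines."""
    lines = []
    for name, value in fields.items():
        # Normalise line endings and drop leading/trailing newlines so the
        # record never starts or ends with an empty continuation.
        value = str(value).replace("\r\n", "\n").strip("\n")
        # A newline inside a value is continued by starting the next line with "=".
        lines.append(name + "=" + value.replace("\n", "\n="))
    return "\n".join(lines)


def to_scriptindex_dump(pages):
    """Join records with a blank line, which separates records in the dump."""
    return "\n\n".join(to_scriptindex_record(page) for page in pages) + "\n"


if __name__ == "__main__":
    # Hypothetical crawl results standing in for converted htdig output.
    pages = [
        {"url": "https://example.org/a", "title": "A", "text": "line one\nline two"},
        {"url": "https://example.org/b", "title": "B", "text": "x"},
    ]
    print(to_scriptindex_dump(pages), end="")
```

The resulting file would then be fed to scriptindex together with an index script that says how each field is indexed.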

Mnogosearch

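  • Link: http://www.mnogosearch.org/
  • Looks nice. Has a somewhat nice UI, and is customizable.
  • Built-in crawler, with a default config file of about 1000 lines (including comments).
  • CGI barfs when there are results: bug 19129 (http://mnogosearch.org/bugs/index.php?id=19129) and bug 19141 (http://mnogosearch.org/bugs/index.php?id=19141) upstream.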
    • Being able to view results might be important, in a search engine. :)

Others to try

  • Apache Lucene (with Apache Nutch to crawl).
    • Relies heavily on Java, so probably out of the question (Lucene is Java, Nutch is a Tomcat servlet. Nuff said.)
  • Datapark Search (http://www.dataparksearch.org/)
    • Fork of Mnogosearch?
    • Written in C.
  • ASPseek
    • C++
    • Last copyright year on their site (http://www.aspseek.org/) is 2003. Is it unmaintained?