INDRI
- Parses PDF, HTML, XML, and TREC documents
- API can be used from Java, PHP, or C++
- Works on Windows, Linux, Solaris and Mac OS X
- Can be used on a cluster of machines for faster indexing and retrieval
- Last update 01/05/2015
Lucene
- 100%-pure Java
- small RAM requirements: only 1MB heap
- index size roughly 20-30% the size of text indexed
- fast, memory-efficient and typo-tolerant
- ranked searching
- many powerful query types
- Cross-Platform Solution
- pluggable ranking models
Lucene implementations in languages other than Java:
C++, .NET, Objective-C, C, Python, Perl, Ruby, Common Lisp, Zend Framework for PHP 5, etc.
Managing Gigabytes for Java
- written in Java
- efficient implementation of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries
- Indices can be built for a collection split in several parts, and combined later
- Indices can be clustered both lexically and documentally
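As a rough illustration of two of the features listed above (indices built per part and combined later, and phrase queries), here is a minimal sketch in Python. It is not MG4J's API or implementation; the function names `build_index`, `merge_indexes`, and `phrase_query` are hypothetical, and the index is a simple in-memory positional inverted index.

```python
from collections import defaultdict


def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def merge_indexes(*parts):
    """Combine indexes that were built over disjoint parts of a collection."""
    merged = defaultdict(dict)
    for part in parts:
        for term, postings in part.items():
            for doc_id, positions in postings.items():
                if doc_id in merged[term]:
                    raise ValueError(f"document {doc_id!r} appears in more than one part")
                merged[term][doc_id] = list(positions)
    return merged


def phrase_query(index, phrase):
    """Return sorted ids of documents that contain the phrase's terms consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        # The phrase starts at position p if term i occurs at position p + i.
        starts = set(index[terms[0]][doc_id])
        for offset, t in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[t][doc_id]}
        if starts:
            hits.append(doc_id)
    return sorted(hits)


# Index two parts of a collection separately, then combine them.
part_a = build_index({1: "open source search engine", 2: "search engine library"})
part_b = build_index({3: "engine search tools"})
combined = merge_indexes(part_a, part_b)
print(phrase_query(combined, "search engine"))  # prints [1, 2]
```

Document 3 contains both terms but in the wrong order, so only the stored positions let the query reject it. Real engines keep postings compressed on disk and merge them in sorted order rather than in memory.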
Swish-e
- can index plain text, e-mail, PDF, HTML, XML, Microsoft® Word/PowerPoint/Excel
- Includes a web spider for indexing remote documents over HTTP
- Can report structural errors in your XML and HTML documents
Xapian - written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, Ruby, Lua, Erlang and Node.js. The latest stable version is 1.2.20, released on 2015-03-04.
ht://Dig Search Engine Software
Zettair (written in C).