title: Searching documents with Apache Solr author: name: Alex Volkov twitter: a_volkov output: basic.html controls: true
--
- Makes it easy to develop search application.
- Built on Lucene
- Used by DucksDucksGo, Ticketmaster, Instagram and Netflix
--
- Needed better search than Mezzanine CMS provices
- Needed an ability to search PDF and Word documents
- Used Django haystack as a frontend to Solr.
- Some works still needed to replace built-in search in Mezzaning
--
- Elastic Search
- Whoosh
- Xapian
--
Solr is built as a web service, running on Jetty web server, On port 8983, with RESTful API with JSON responses.
--
--
- PySolr -- A python library to interact with all the messy XML
--
- Apache Tika -- used to be stand-alone project for document extraction, now a part of solr
- Documents can be indexed directly or extracted and indexed.
--
$ wget http://apache.mirror.vexxhost.com/lucene/solr/5.3.1/solr-5.3.1.tgz
$ tar xzf solr-5.3.1.tgz
$ cd solr-5.3.1
--
$ ./bin/solr start -e schemaless
Navigate with a browser to solr port, the page will show all kinds of stats.
--
Installation
$ pip install pysolr
Configuration
import pysolr
solr = pysolr.Solr(
'http://localhost:8983/solr/gettingstarted',
timeout=10
)
--
with open('testfile.doc', 'rb') as fh:
# only extract data
data = solr.extract(fh, extract_only=True)
content = extract_content(data['contents'])
#prepare index
index = [{
'id': 'that_word_document',
'title': content,
}]
solr.add(index)
--
results = solr.search('indexed token')
Results contain a list of document ids/scores
--
solr.delete=(id='that_word_document')
--
solr.delete=(q='*:*')
--
See solr_test.py
--