Merge pull request #1876 from datadryad/solr-docs
Improved SOLR documentation
ahamelers authored Oct 10, 2024
2 parents c60b06d + 3c2b736 commit e994f43
Showing 2 changed files with 68 additions and 7 deletions.
2 changes: 2 additions & 0 deletions documentation/external_services/amazon_aws_ec2_setup.md
@@ -122,6 +122,8 @@ mysql_stg.sh < myfile.sql
SOLR setup
============

For general SOLR information, see [solr.md](../solr.md)

SOLR should be installed on a separate machine from the Rails server!

All of these tools are outdated. To make SOLR work with Dryad's old
73 changes: 66 additions & 7 deletions documentation/solr.md
@@ -1,24 +1,82 @@
# How to back up or repopulate SOLR indexes

## SOLR Setup
SOLR Setup
===========

Right now our SOLR setup is quite manual and we manually copy over some configuration files to the SOLR server.
Our SOLR setup is largely manual: we copy some configuration files to the SOLR server by hand
and create the core manually. Running SOLR itself is fairly simple, since it is a Java application
that can be extracted and run.

The [README.md](config/solr_config/README.md) details how to set up the blacklight core and schema and
our additions to it, which should always be checked into our repository after we make updates to the
schema.

## Adding data to SOLR
Config files
------------

- solrconfig.xml - basic SOLR server config
- schema.xml - fields to store and how they are processed
- stopwords.txt - basic "meaningless" words that will be ignored
- stopwords_en.txt - stopwords specific to English
- synonyms.txt - words that should be indexed/queried together
- blacklight.yml - basic Blacklight config
- settings.yml - GeoBlacklight config

We haven't done formal backups of the SOLR data since it is fairly easy to repopulate from our database
and in fact we have regenerated it on multiple occasions (every time we make schema changes or add facets).

Adding data to SOLR
====================

The [rsolr:reindex](lib/tasks/rsolr.rake) rake task is used to repopulate the SOLR index from the database
and it runs quickly (somewhere in the range of a few minutes to a couple hours if I recall).

## Backing up SOLR
Using SOLR
===========

The Dryad search API is largely a passthrough to SOLR. See the [search API documentation](apis/search.md) for details.

SOLR UI
--------

If you want to view a live SOLR instance, you need to add your local IP to the relevant
SOLR security group and then access the instance on its SOLR port, e.g.,
http://34.222.121.163:8983/

To get details about terms in the index:
1. select "geoblacklight" core in the left menu
2. select Schema
3. select a field
4. Load Term Info
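
The same term information can usually be fetched without the UI through SOLR's Luke request handler. The sketch below only builds the request URL. The base URL `http://localhost:8983/solr` and the field name `dc_subject_sm` are placeholders chosen for illustration, and it assumes the `/admin/luke` handler is reachable on the core.

```python
from urllib.parse import urlencode


def term_info_url(field, num_terms=10,
                  base="http://localhost:8983/solr",  # placeholder host, not a real Dryad server
                  core="geoblacklight"):
    """Build a Luke request handler URL that lists the top terms for one field."""
    params = urlencode({"fl": field, "numTerms": num_terms, "wt": "json"})
    return f"{base}/{core}/admin/luke?{params}"


# "dc_subject_sm" is a hypothetical field name; substitute one from the Schema page.
print(term_info_url("dc_subject_sm"))
```

Opening the printed URL in a browser (from an IP allowed by the security group) should return JSON describing the field and its most frequent terms.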

About indexes:
- Fields with _s are the original string
- _sort is a processed version suitable for sorting
- _ti is tokenized for searching in the index

Query parsing
-------------

SOLR has the notion of a "default" field that responds to unstructured queries.
You can always override this by specifying a field name in the query.

- [Basic overview of SOLR queries](https://yonik.com/solr/query-syntax/)
- [Full details in the SOLR docs](https://solr.apache.org/guide/6_6/the-standard-query-parser.html)
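
As a rough illustration of the difference between an unstructured query and a field-specific one, the sketch below builds two `select` URLs. The base URL and the field name `dc_title_ti` are assumptions made for the example; check the Schema page for the fields that actually exist.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # placeholder host, not a real Dryad server
CORE = "geoblacklight"


def select_url(query, rows=10):
    """Build a URL for the core's standard /select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{CORE}/select?{params}"


# Unstructured query: SOLR searches the default field.
print(select_url("plankton"))

# Field-specific query: "dc_title_ti" is a hypothetical tokenized field name.
print(select_url('dc_title_ti:"plankton survey"'))
```

`urlencode` takes care of escaping the colon, quotes, and spaces, which is easy to get wrong when building query strings by hand.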


Security
========

- The SOLR application has no internal security -- anyone who has access can add/delete documents
- We enforce security by limiting access to the relevant EC2 IP addresses. If you start/stop one of
  the Dryad servers, it may be assigned a new IP address, which will cause searches to be blocked
  by the SOLR server's security group. You will need to edit the security group to allow the Rails
  server to access SOLR again.
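
When searches start failing after a server restart, a quick way to confirm that the security group is the cause is to test whether the SOLR port is reachable from the Rails server. This is a minimal sketch using only the Python standard library; the function name and the default timeout are illustrative choices, and a `False` result only shows that the TCP connection failed, not why.

```python
import socket


def solr_port_reachable(host, port=8983, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Replace 127.0.0.1 with the SOLR server's address when running on the Rails server.
    print(solr_port_reachable("127.0.0.1"))
```

A connection that times out rather than being refused is typical of traffic dropped by a security group, which suggests the Rails server's new IP address has not been allowed yet.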


Backing up SOLR
===============

We haven't done formal backups of the SOLR data since it is fairly easy to repopulate from our database
and in fact we have regenerated it on multiple occasions (every time we make schema changes or add facets).

SOLR also offers a backup and restore functionality, so it could be manually backed up from one server and
restored onto another. The [SOLR backup and restore documentation](https://solr.apache.org/guide/solr/latest/deployment-guide/backup-restore.html)
@@ -27,3 +85,4 @@ gives information about how to back up a core and restore it.
While it's possible to use this option rather than repopulating the indexes from the database, I suspect it
would not offer much of an advantage over running the repopulation rake task. There may be other reasons
we want to keep backups or automate them, though.
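
If a manual backup is ever wanted, SOLR's replication handler accepts `backup`, `restore`, and `restorestatus` commands on a single core. The sketch below only builds the request URLs. The base URL, the snapshot name `pre_schema_change`, and the location `/var/solr/backups` are made-up values, and it assumes the replication handler is available on the core.

```python
from urllib.parse import urlencode


def replication_url(command, base="http://localhost:8983/solr",  # placeholder host
                    core="geoblacklight", **params):
    """Build a URL for the core's replication handler (backup, restore, restorestatus)."""
    query = urlencode({"command": command, **params, "wt": "json"})
    return f"{base}/{core}/replication?{query}"


# Snapshot name and location below are hypothetical examples.
backup = replication_url("backup", name="pre_schema_change", location="/var/solr/backups")
restore = replication_url("restore", name="pre_schema_change", location="/var/solr/backups")
status = replication_url("restorestatus")

for url in (backup, restore, status):
    print(url)
```

Restores run asynchronously, so polling the `restorestatus` URL is the usual way to find out when one has finished.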
