This document is for people who are trying to stand up an instance of Hamlet on localhost in order to write code. It assumes you are generally familiar with setting up development environments (for instance, that you can install Python dependencies and stand up local Postgres).
You will need:
- git
- pipenv (https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv)
- the Heroku CLI
- postgres
- Get the code and dependencies
  - `git clone https://github.com/MITLibraries/hamlet.git`
  - `cd hamlet`
  - `pipenv install`
    - If you want to do neural net training or OCR source files, there are additional non-pip dependencies; see below.
  - `pipenv shell`
- Set up your postgres database
  - Create a database
    - The name of this database should be `hamlet`, or else set an environment variable `DJANGO_DB` with its name
  - Create a database user
    - The name of this user should be `hamlet`, or else set an environment variable `DJANGO_DB_USER` with its name
  - Grant all privileges on your database to your user
  - Set an environment variable `DJANGO_DB_PASSWORD` with your database user's password
  - `python manage.py migrate`
  - Ask Andy or Andromeda for a data dump to populate the db.
- Set an environment variable `DJANGO_SETTINGS_MODULE=hamlet.settings.local`
- `python manage.py createsuperuser` and follow the prompts - this will let you log in at `/admin`
- `python manage.py collectstatic --noinput`
- `heroku local`
Run tests with `python manage.py test --settings=hamlet.settings.test`. This ensures that they use the test neural net. The primary keys of objects in the test file are written around the assumption that they will be present in both the test net and the fixtures.
You can generate additional fixtures with statements like `python manage.py dumpdata theses.Person --pks=63970,29903 > hamlet/theses/fixtures/authors.json`, but make sure to include the pks of all objects already in the fixtures (or write the new dump to a separate file and then merge it with the existing one; you can't just append the files, because the resulting JSON syntax would be invalid). Also make sure that the theses you use are in fact present in the test neural net.
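If you do dump to a separate file, a merge along these lines keeps the JSON valid, since each fixture is a top-level JSON list (the file names here are illustrative):

```python
import json

# Combine an existing fixture with a newly dumped one; both are JSON lists,
# so concatenating them and re-serializing keeps the syntax valid.
with open('hamlet/theses/fixtures/authors.json') as f:
    existing = json.load(f)
with open('new_authors.json') as f:  # hypothetical output of a separate dumpdata run
    new = json.load(f)

with open('hamlet/theses/fixtures/authors.json', 'w') as f:
    json.dump(existing + new, f, indent=2)
```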
If you are seeing unpredictable test failures (e.g. tests that succeed in isolation but fail in a suite), make sure you're using the right settings file.
For the most part, dependencies are installed via pipenv. There's a `.env` file (kept out of version control) for use by `pipenv shell`. It specifies:

- `DJANGO_SETTINGS_MODULE`
  - `DJANGO_SETTINGS_MODULE='hamlet.settings.local'` (for using `heroku local`)
  - `DJANGO_SETTINGS_MODULE='hamlet.settings.base'` (for `python manage.py runserver`)
- `DJANGO_DB_PASSWORD='(your password)'`
- `DJANGO_DEBUG_IS_TRUE='True'` (if you want)
- `DSPACE_OAI_IDENTIFIER`
- `DSPACE_OAI_URI`

The latter two are only relevant if you plan to be downloading files or metadata from DSpace; they can be omitted or given dummy values otherwise.
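Putting that together, a working `.env` for local development might look like this (the password and DSpace values are placeholders):

```
DJANGO_SETTINGS_MODULE='hamlet.settings.local'
DJANGO_DB_PASSWORD='your-password-here'
DJANGO_DEBUG_IS_TRUE='True'
DSPACE_OAI_IDENTIFIER='dummy'
DSPACE_OAI_URI='dummy'
```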
Some dependencies require extra help. However, you only need to bother if you are running the functions that rely on those dependencies.
- In order to run the Django app, including the test suite:
  - python-magic needs libmagic (`brew install libmagic` on OSX).
  - captcha says it needs `apt-get -y install libz-dev libjpeg-dev libfreetype6-dev python-dev` or similar. You can't yum install them on AWS, but the captcha works anyway, so maybe it's lying.
- In order to run OCR on thesis PDFs:
  - tika requires Java
- In order to train neural nets:
  - nltk may require installing corpora through the python shell (see the example just below this list)
  - gensim wants a C compiler (it can run without one, but will be 70x slower; a single neural net training run can take literally days in this case)
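For the nltk corpora, installation through the Python shell looks roughly like this; which corpora you actually need depends on what the training code reports as missing, so the name below is only an example:

```python
import nltk

# Download whichever corpus the training run reports as missing;
# 'stopwords' is only an illustrative name.
nltk.download('stopwords')
```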
You may set the following environment variables to configure your database:

- `DJANGO_DB_ENGINE` (default `django.db.backends.postgresql`)
- `DJANGO_DB` (default `hamlet`)
- `DJANGO_DB_USER` (default `hamlet`)
- `DJANGO_DB_PASSWORD` (no default)
- `DJANGO_DB_HOST` (default `localhost`)
If you are running with the Postgres default, be sure to stand up Postgres; create the named database and user; and grant privileges on your database to your user.
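For reference, a Django settings module typically consumes these variables along the following lines; this is a sketch of the pattern, not hamlet's actual settings code, which may differ in detail:

```python
import os

# Sketch: assemble Django's DATABASES setting from the environment
# variables above, falling back to the documented defaults.
DATABASES = {
    'default': {
        'ENGINE': os.environ.get('DJANGO_DB_ENGINE', 'django.db.backends.postgresql'),
        'NAME': os.environ.get('DJANGO_DB', 'hamlet'),
        'USER': os.environ.get('DJANGO_DB_USER', 'hamlet'),
        'PASSWORD': os.environ.get('DJANGO_DB_PASSWORD', ''),  # no default in practice
        'HOST': os.environ.get('DJANGO_DB_HOST', 'localhost'),
    }
}
```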
If you need to edit styles, edit the files in `hamlet/static/sass/apps/`. Don't edit CSS directly - these changes will be blown away during asset precompilation.

To run the asset pipeline manually:

- run `python manage.py collectstatic`, using `hamlet.settings.base`
- run `python manage.py compress`
- then run `python manage.py collectstatic`, using `hamlet.settings.local`
The static asset pipeline runs automatically; see `.ebextensions/02_python.config`.
Hamlet needs a neural net in order to operate. By default it uses the files in `hamlet/testmodels`, but this is configurable.

`hamlet.model` is a copy of `all_theses_no_split_w4_s52.model`. This is a model trained with a window size of 4 and a step of 52. It is kept out of version control because it is too big.

`hamlet/testmodels/` contains some smaller models that are not suitable for production but are usable for testing (and small enough to be pushed to GitHub, although it will complain, and hence to be used on Travis). You can configure your local settings to point at these files, and that will suffice for development. These models don't represent the entire MIT thesis collection (that's what lets them be smaller), so don't be surprised if documents of interest are not present.

`hamlet.settings.local` defaults to using the test model, since it is checked into version control. If you have a different model you want to use, set `DJANGO_MODEL_PATH=/full/path/to/model` in `.env`.
To check that a particular thesis is represented in your model:

- Make sure your settings file points to the desired `MODEL_FILE`
- Run `python manage.py shell`, then:

```python
from gensim.models.doc2vec import Doc2Vec
from django.conf import settings

model = Doc2Vec.load(settings.MODEL_FILE)
identifier = '1721.1-%d.txt' % YOUR_THESIS_IDENTIFIER  # substitute the thesis identifier
identifier in model.docvecs.doctags.keys()
```
If you don't have a target thesis object but you need one you know is in the neural net, look at the output of `model.docvecs.doctags.keys()`. This is a list of filenames of text files from DSpace; they are all of the format `1721.1-NUMBER.txt`, where `NUMBER` is the identifier of the thesis. You can look up `Thesis` objects in your database by this identifier (which is `Thesis.identifier`, not the primary key).
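For instance, a lookup along these lines should work in `python manage.py shell`; the import path and the doctag value below are assumptions for illustration, not taken from the codebase:

```python
from hamlet.theses.models import Thesis  # assumed import path for the Thesis model

doctag = '1721.1-39087.txt'  # illustrative value from model.docvecs.doctags.keys()
number = doctag.replace('1721.1-', '').replace('.txt', '')
thesis = Thesis.objects.get(identifier=number)
```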
You can start up a running instance locally using docker compose:
```
$ docker-compose up
```

By default, this will use the test model included in the codebase. If you'd like to use a different model, you will need to add it to a hamlet docker volume. This volume is mounted as `/data` in the container. You'll also need to set the `DJANGO_MODEL_PATH` env var to point to the model within that docker volume. Assuming your model files are in the top directory of the project:

```
$ docker run --name hamlet-data --mount type=volume,src=hamlet,target=/data busybox true
$ for f in hamlet.model*; do docker cp $f hamlet-data:/data; done
$ docker rm hamlet-data
$ export DJANGO_MODEL_PATH=/data/hamlet.model
$ docker-compose up
```