Updated README with CI/CD instructions included
liubovpashkova committed Jul 5, 2024
1 parent 80d0c86 commit ce2e76c
# PanKB LLM (PRE-PROD)

## Overview

A GenAI Assistant based on Langchain + Streamlit + Azure Cosmos DB for MongoDB (vCore) + Docker.

Authors:
- Binhuan Sun ([email protected]): data preprocessing, LLM, DEV vector DB creation (Chroma), retriever, Streamlit web app.
- Pashkova Liubov ([email protected]): migration of the DEV vector DB (Chroma) to the PROD instance (Azure Cosmos DB for MongoDB (vCore)) and adjustment of the embedding choice to the Cosmos DB limitations, PROD DB index creation, dockerization, integration of the Streamlit app with the Django framework templates, GitHub repo setup and maintenance.

## Important considerations & limitations

The DB population process took 93 minutes (TODO: introduce multithreading to speed it up). The populated collection occupies ~1.0 GiB of MongoDB storage, including the indexes.
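DB population is dominated by I/O (embedding API calls and DB writes), so a thread pool is one plausible way to attack the 93-minute runtime. The sketch below is not code from this repo: the `embed_batch` worker, batch size, and document list are hypothetical placeholders for the real embedding-and-insert step.

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(batch):
    # Placeholder for the real per-batch work (embed the documents, then
    # insert the vectors into the MongoDB collection). Both steps are
    # I/O-bound, which is why threads can help despite the GIL.
    return [f"vector-for-{doc}" for doc in batch]

def populate_parallel(docs, batch_size=16, workers=8):
    # Split the documents into batches and process them concurrently.
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # preserves input order
    return [vec for batch in results for vec in batch]
```

Worker and batch counts would need tuning against the embedding API's rate limits and the M30 tier's write throughput.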

Please note the following limitations and considerations:
- With an Azure Cosmos DB for MongoDB instance as the vector DB, only embeddings with a dimensionality <= 2000 can be used, because 2000 is the maximum number of dimensions Azure Cosmos DB for MongoDB supports. This may even be for the better: larger embeddings are more expensive and do not always deliver a significant performance gain. Examples: https://platform.openai.com/docs/guides/embeddings
- We have to create a similarity index; its dimensionality must match the dimensionality of the embeddings.
- The M30 tier hosting our Azure Cosmos DB for MongoDB instance supports only the <i>vector-ivf</i> index type. Creating a <i>vector-hnsw</i> index would require an upgrade to the M40 tier (780.42 USD per month instead of the 211.36 USD we currently pay for M30).
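For reference, a <i>vector-ivf</i> index on Azure Cosmos DB for MongoDB (vCore) is created via the `createIndexes` database command. The sketch below only builds the command document; the vector field name `contentVector` and the 1536-dimension choice are illustrative assumptions (the dimensions must match whatever embedding model is actually configured), and running it requires a live `pymongo` connection.

```python
def build_vector_index_command(collection, field, dimensions, num_lists=1):
    # Cosmos DB for MongoDB (vCore) vector index definition.
    # "kind" must be "vector-ivf" on the M30 tier (vector-hnsw needs M40),
    # and "dimensions" must equal the embedding dimensionality (<= 2000).
    return {
        "createIndexes": collection,
        "indexes": [{
            "name": "VectorSearchIndex",
            "key": {field: "cosmosSearch"},
            "cosmosSearchOptions": {
                "kind": "vector-ivf",
                "numLists": num_lists,
                "similarity": "COS",
                "dimensions": dimensions,
            },
        }],
    }

# With pymongo, one would run: db.command(build_vector_index_command(...))
cmd = build_vector_index_command("pankb_vector_store", "contentVector", 1536)
```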

## Scripts execution

Create the .env file in the following format:
```
OPENAI_API_KEY=<insert the API key here without quotes>
COHERE_API_KEY=<insert the API key here without quotes>
TOGETHER_API_KEY=<insert the API key here without quotes>
GOOGLE_API_KEY=<insert the API key here without quotes>
ANTHROPIC_API_KEY=<insert the API key here without quotes>
REPLICATE_API_TOKEN=<insert the API key here without quotes>
VOYAGE_API_KEY=<insert the API key here without quotes>
## MongoDB-PROD (Azure Cosmos DB for MongoDB) Connection String
# Had to multiply maxIdleTimeMS by 10 to handle
# urllib3.exceptions.ProtocolError:
# ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
MONGODB_CONN_STRING="<insert the connection string here with quotes>"
```
The connection string and API keys can be obtained from the authors.
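The application reads these values from the environment (typically via a library such as `python-dotenv`, though this README does not specify how). Purely for illustration, a minimal stdlib-only reader for a file in the format above might look like:

```python
import os

def load_env(path=".env"):
    # Minimal .env reader: skips blank lines and '#' comments, splits on
    # the first '=', and strips optional surrounding double quotes from
    # the value (as used for MONGODB_CONN_STRING above).
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')
```

In practice `python-dotenv`'s `load_dotenv()` covers more edge cases (export prefixes, multi-line values) and would be the safer choice.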

The DB population script does not have to be executed in a Docker container. It can be run with the following commands:
```
pip3 install -r requirements.txt
python3 make_vectordb.py ./Paper_all pankb_vector_store
```
The first command installs all the requirements. The second one runs the script with two command-line arguments: the folder containing the documents to feed to the LLM and the name of the MongoDB collection that will hold the vector DB.

## Deployment (CI/CD)

Every time someone pushes to the `pre-prod` repo (usually from the DEV server), the changes to the AI Assistant web application are AUTOMATICALLY deployed to the PRE-PROD server. The automation (CI/CD) is achieved with GitHub Actions enabled for the repository; the respective config file is `.github/workflows/deploy-preprod-to-azurevm.yml`. For the automated deployment to work, you should set the values of the following GitHub Actions secrets:
```
PANKB_PREPROD_HOST - the PRE-PROD server IP address
PANKB_PREPROD_SSH_USERNAME - the SSH user name used to connect to the PRE-PROD server
PANKB_PREPROD_PRIVATE_SSH_KEY - the SSH key used to connect to the PRE-PROD server
OPENAI_API_KEY
COHERE_API_KEY
TOGETHER_API_KEY
GOOGLE_API_KEY
ANTHROPIC_API_KEY
REPLICATE_API_TOKEN
VOYAGE_API_KEY
PANKB_PREPROD_MONGODB_CONN_STRING - the MongoDB PRE-PROD (Azure Cosmos DB for MongoDB) connection string
```
These secrets are encrypted and safely stored on GitHub in the "Settings - Secrets and variables - Actions - Repository secrets" section. There you can also add new GitHub Actions secrets and edit existing ones; however, to change a secret's name, you have to remove the existing secret and add a new one in its place.
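The actual workflow lives in `.github/workflows/deploy-preprod-to-azurevm.yml`. A hypothetical skeleton of such a workflow (the trigger branch, the `appleboy/ssh-action` step, and the remote commands are illustrative assumptions, not the repo's actual config) could look like:

```yaml
name: Deploy PRE-PROD to Azure VM
on:
  push:
    branches: [pre-prod]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Pull changes and rebuild the container over SSH
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.PANKB_PREPROD_HOST }}
          username: ${{ secrets.PANKB_PREPROD_SSH_USERNAME }}
          key: ${{ secrets.PANKB_PREPROD_PRIVATE_SSH_KEY }}
          script: |
            cd pankb-llm && git pull
            docker compose up -d --build --force-recreate
```

The API-key secrets would additionally have to be passed through to the container's environment, e.g. by writing them into the `.env` file on the server.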

The command for building and rebuilding the docker container with the Streamlit app inside:
```
docker compose up -d --build --force-recreate
```
The dockerized Streamlit app does not have to be run inside <i>tmux</i>: it will always be up and running, even after the VM is rebooted (achieved with the `restart: always` option in the docker compose file).
After the GitHub Actions deployment job has run successfully, the web application should be available at <a href="https://pankb.org/ai_assistant" target="_blank">pankb.org/ai_assistant</a>.
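The reboot-survival behaviour comes from the service definition in the compose file. A hypothetical fragment of a `docker-compose.yml` consistent with the container name, image, and port mapping shown in the `docker ps` output below (the build context and exact file contents are assumptions):

```yaml
services:
  pankb-llm:
    build: .
    image: pankb_llm:latest
    container_name: pankb-llm
    ports:
      - "8501:8501"
    restart: always   # restart the container on failure and after VM reboots
```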

The status of the docker container can be checked with the following command:
```
docker ps
```
In case of a successful deployment, the command should produce approximately the following output (among other containers):
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
54d89d7c4fad pankb_llm:latest "streamlit run strea…" 10 minutes ago Up 10 minutes 0.0.0.0:8501->8501/tcp, :::8501->8501/tcp pankb-llm
```

## Availability

Currently, the Streamlit app is available as a Django application at <a href="https://pankb.org/ai_assistant" target="_blank">pankb.org/ai_assistant</a>.
