We built a RAG system that runs locally on the CPU in a fully offline mode. It uses open-source large language models to perform retrieval-augmented generation over your documents.
- FAISS (Facebook AI Similarity Search): fast, efficient, and scalable vector search for document retrieval.
- BAAI/bge-reranker-base: a reranking model that reorders the retrieved results so the most relevant and accurate passages are returned.
- Minimal CPU and RAM usage
- Runs locally, even in a fully offline environment (for PDFs and other documents)
- Uses a highly efficient, quantized model
- Multilingual support for over 29 languages, including Chinese
- Fast inference
- Intuitive UI
- Add new documents to the system without a complete reindexing pass, so new knowledge is integrated dynamically and flexibly (see the sketch after this list).
- Built with a focus on minimizing memory usage: the system relies on lightweight retrieval structures such as FAISS (or alternatives like inverted indices) to manage large document collections without excessive memory consumption.
- Low Latency
- Total memory usage: 338 MB (model) + 121 MB (embeddings)
- The 1.1 GB reranking model is loaded lazily: only when it is actually needed, and only once per session.
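Adding new documents without reindexing (mentioned above) amounts to another `add` call on the existing FAISS index; nothing is rebuilt. A minimal sketch under the same assumptions as the earlier snippet, reusing its hypothetical `embedder`, `index`, and `docs` objects:

```python
import numpy as np  # embedder, index, and docs carry over from the previous sketch


def add_documents(new_docs, embedder, index, docs):
    """Embed new documents and append them to the existing FAISS index
    without rebuilding it; `docs` keeps the id -> text mapping in sync."""
    vecs = embedder.encode(new_docs, normalize_embeddings=True)
    index.add(np.asarray(vecs, dtype="float32"))  # incremental add, no reindexing
    docs.extend(new_docs)


add_documents(["A newly uploaded PDF page about GPUs."], embedder, index, docs)
print(index.ntotal)  # total number of vectors now in the index
```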
NVIDIA GPUs with compute capability 5.0+ can be used for acceleration, because the system runs its models through Ollama, and that is the minimum GPU compute capability Ollama supports.
> git clone https://github.com/ParamThakkar123/Secure-Local-Offline-Rag-System.git
> cd Secure-Local-Offline-Rag-System
> pip install -r requirements.txt
Download the Ollama app and run it
> ollama pull qwen2:0.5b-instruct-q3_K_S
> ollama pull nextfire/paraphrase-multilingual-minilm:l12-v2
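These two models cover embeddings and generation respectively. A hedged sketch of how they might be called from Python, assuming the `ollama` client package (`pip install ollama`) and that the Ollama app is running; this is illustrative, not the app's actual code:

```python
import ollama

# Embedding model pulled above: turns a chunk of text into a vector
# that can be stored in the FAISS index.
emb = ollama.embeddings(
    model="nextfire/paraphrase-multilingual-minilm:l12-v2",
    prompt="FAISS stores document vectors for similarity search.",
)
print(len(emb["embedding"]))  # embedding dimensionality

# Generation model pulled above: answers the question given retrieved context.
reply = ollama.chat(
    model="qwen2:0.5b-instruct-q3_K_S",
    messages=[{
        "role": "user",
        "content": "Using this context: <retrieved chunks>, answer: What does FAISS do?",
    }],
)
print(reply["message"]["content"])
```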
> streamlit run app.py

Alternatively, if the `streamlit` command is not on your PATH:

> python -m streamlit run app.py