Searchlight is a text-processing API for PDFs, written in Python. It processes documents to highlight specified search words. Its features include word search, unique word counts, search-word highlighting, and integration with MongoDB and AWS S3.
- Word Search: Search for specific words in a PDF.
- Unique Words Count: Count the number of unique words in a PDF.
- Highlighting: Highlight the search word in the PDF.
- MongoDB Integration: Store data and results in MongoDB.
- AWS S3 Integration: Upload and retrieve PDFs from an AWS S3 bucket.
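To make the word search and unique word count features concrete, here is a minimal sketch that works on text already extracted from a PDF. The helper names (`tokenize`, `unique_word_count`, `keyword_occurrences`) and the case-insensitive matching are assumptions for illustration; Searchlight's own implementation may tokenize and match differently.

```python
import re
from collections import Counter
from typing import List

# Hypothetical helpers for illustration; not taken from the Searchlight codebase.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return [word.lower() for word in WORD_RE.findall(text)]


def unique_word_count(text: str) -> int:
    """Count distinct words, ignoring case."""
    return len(set(tokenize(text)))


def keyword_occurrences(text: str, keyword: str) -> int:
    """Count how often the keyword appears as a whole word, ignoring case."""
    return Counter(tokenize(text))[keyword.lower()]


if __name__ == "__main__":
    sample = "Searchlight highlights words. Words are counted; searchlight counts them."
    print(unique_word_count(sample))             # 7
    print(keyword_occurrences(sample, "words"))  # 2
```

Matching on whole tokens means a search for "word" does not count "words"; a substring search would behave differently, so the choice depends on what the highlighter is expected to mark.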
- Clone the repository
git clone https://github.com/tratum/Searchlight.git
- Navigate to the project directory
cd Searchlight
- Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows, use `.venv\Scripts\activate`
- Install the required dependencies
pip install -r requirements.txt
- Create a .env file in the root directory
touch .env # run from the Searchlight project root
- Open the .env file and configure your MongoDB and AWS S3 settings
ATLAS_URI=your_mongodb_uri
DB_NAME=your_db_name
COLLECTION_NAME=your_collection_name
RAW_COLLECTION_NAME=your_collection_name
USER_COLLECTION_NAME=tbl_users
AWS_ACCESS_KEY='your_aws_access_key'
AWS_SECRET_KEY='your_aws_secret_access_key'
BUCKET_NAME='your_s3_bucket_name'
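A quick way to catch a misconfigured .env before starting the server is to check that every variable is present. The sketch below uses only the standard library. The function names are hypothetical, and treating all eight variables as required is an assumption; the project itself may load its settings differently (for example with a dotenv library).

```python
import os
from typing import Dict, Mapping, Optional

# Variable names come from the .env listing; treating all as required is an assumption.
REQUIRED_KEYS = (
    "ATLAS_URI", "DB_NAME", "COLLECTION_NAME", "RAW_COLLECTION_NAME",
    "USER_COLLECTION_NAME", "AWS_ACCESS_KEY", "AWS_SECRET_KEY", "BUCKET_NAME",
)


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around '=' and optional surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the required settings, raising KeyError if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

For example, `load_settings(parse_env_text(open(".env").read()))` validates the file directly, while `load_settings()` checks the variables already exported in the process environment.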
- Start the API server
python -m uvicorn main:app --reload
- Use the following endpoint to upload a PDF and perform text processing
http://127.0.0.1:8000/searchlight/upload
Mandatory parameters:
- keyword: The word to search for and highlight in the PDF.
- pdf: The PDF file to process.
Here is an example of how to use the API with cURL:
curl -X POST "http://127.0.0.1:8000/searchlight/upload" -F "keyword=example" -F "pdf=@/path/to/your/document.pdf"
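The same request can be sent from Python. The following is a sketch using only the standard library: it builds the multipart form by hand with the `keyword` and `pdf` fields from the cURL example. The function names are hypothetical, and because the response format is not documented here, the sketch returns the raw response bytes without interpreting them.

```python
import os
import urllib.request
import uuid
from typing import Tuple

UPLOAD_URL = "http://127.0.0.1:8000/searchlight/upload"


def build_multipart(keyword: str, pdf_bytes: bytes, filename: str) -> Tuple[bytes, str]:
    """Build a multipart/form-data body with 'keyword' and 'pdf' fields."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="keyword"\r\n\r\n'
        f"{keyword}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="pdf"; filename="{filename}"\r\n'
        f"Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(keyword: str, pdf_path: str, url: str = UPLOAD_URL) -> bytes:
    """POST the PDF and keyword to the upload endpoint; return the raw response body."""
    with open(pdf_path, "rb") as handle:
        pdf_bytes = handle.read()
    body, content_type = build_multipart(keyword, pdf_bytes, os.path.basename(pdf_path))
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running locally, `upload_pdf("example", "/path/to/your/document.pdf")` mirrors the cURL call. A third-party HTTP client would shorten this considerably; the standard library is used here only to keep the sketch dependency-free.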
Contributions are welcome! Please open an issue or submit a pull request for any changes or improvements.
This project is licensed under the MIT License. See the LICENSE file for details.
- This project is built with FastAPI.