This repository provides tools to download, process, and search through 3GPP documents via an inverted index, using Elasticsearch and Flask.
- main.ipynb: A Jupyter notebook to run all the scripts in order.
- dl_docs.py: Download documents (in .zip format) from a 3GPP website.
- process_docs.py: Scan '.doc' files, extract raw text, and save them as '.txt' files.
- inverse_index.py: Tokenize the '.txt' files and index them using Elasticsearch. Also contains a function to search for documents directly.
- app.py: A Flask application to search for a given 3GPP specification and display matching filenames along with links to view/download the files.
- index.html: A web page to render the search form and display the results.
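To illustrate the idea behind the inverted index, here is a minimal pure-Python sketch. It is not the code in inverse_index.py, which delegates tokenizing and indexing to Elasticsearch; the function names, the tokenizing rule, and the sample filenames are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # .get() avoids creating empty entries for unknown tokens.
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


# Hypothetical sample documents (name -> extracted text).
docs = {
    "a.txt": "Radio resource control protocol",
    "b.txt": "Radio access network overview",
}
index = build_inverted_index(docs)
print(search(index, "radio"))           # ['a.txt', 'b.txt']
print(search(index, "radio protocol"))  # ['a.txt']
```

Looking up a term is a dictionary access rather than a scan over every file, which is why this structure suits full-text search.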
conda create -n search-3gpp python=3.11
conda activate search-3gpp
pip install -r requirements.txt
Install the following system dependencies:
# For antiword
sudo apt-get install antiword
# For Elasticsearch (assuming Debian/Ubuntu)
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.0-amd64.deb
sudo apt-get update && sudo dpkg -i elasticsearch-7.10.0-amd64.deb
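Before running the scripts, it can help to confirm that both system dependencies are usable. The sketch below is not part of the repository; the helper names are hypothetical, and the URL assumes Elasticsearch is listening on its default HTTP port 9200 on the local machine.

```python
import http.client
import shutil
import urllib.error
import urllib.request


def tool_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def elasticsearch_reachable(url="http://127.0.0.1:9200", timeout=2.0):
    """Return True if anything answers HTTP at `url` (assumed default address)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


print("antiword on PATH:", tool_available("antiword"))
print("Elasticsearch reachable:", elasticsearch_reachable())
```

The second check only succeeds once the Elasticsearch service is running, so a False result right after installation is expected.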
.
├── requirements.txt
└── src
    ├── app.py
    ├── dl_docs.py
    ├── inverse_index.py
    ├── main.ipynb
    ├── process_docs.py
    └── templates
        └── index.html
- Download Documents
python src/dl_docs.py --base_url [BASE_URL] --save_dir [SAVE_DIR] --max_files [MAX_FILES]
- Process Documents
python src/process_docs.py --src_dir [SRC_DIR] --dest_dir [DEST_DIR]
- Start Elasticsearch Instance
sudo service elasticsearch start
- Index Documents to Elasticsearch
python src/inverse_index.py --idx_name [IDX_NAME] --docs_dir [DOCS_DIR] --reset_idx [RESET_IDX]
- Run Flask Application
python src/app.py
Visit http://127.0.0.1:5000/ in your browser to use the application.
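The download, process, and index steps above can also be chained from Python. The following is a rough sketch, not the contents of main.ipynb. The flags are the ones listed above, but the helper names are hypothetical, and it assumes that the download directory is what process_docs.py reads and that its output directory is what inverse_index.py indexes. The accepted format of the --reset_idx value is not documented here, so it is passed through unchanged.

```python
import shlex
import subprocess
import sys


def build_commands(base_url, save_dir, max_files, txt_dir, idx_name, reset_idx):
    """Return the three script invocations as argument lists, in run order."""
    py = sys.executable  # use the interpreter of the active environment
    return [
        [py, "src/dl_docs.py", "--base_url", base_url,
         "--save_dir", save_dir, "--max_files", str(max_files)],
        # Assumption: the processing step reads from the download directory.
        [py, "src/process_docs.py", "--src_dir", save_dir, "--dest_dir", txt_dir],
        # Assumption: the indexing step reads the extracted '.txt' files.
        [py, "src/inverse_index.py", "--idx_name", idx_name,
         "--docs_dir", txt_dir, "--reset_idx", str(reset_idx)],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Placeholder values only; replace them before calling run_all(commands).
commands = build_commands("[BASE_URL]", "[SAVE_DIR]", "[MAX_FILES]",
                          "[DEST_DIR]", "[IDX_NAME]", "[RESET_IDX]")
for cmd in commands:
    print(shlex.join(cmd))
```

Elasticsearch must already be running before the indexing command executes, and the Flask application is started separately afterwards.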