This tool returns the type of each token in a PDF. The possible token types are:
- FORMULA
- FOOTNOTE
- LIST
- TABLE
- FIGURE
- TITLE
- TEXT
- HEADER
- PAGE_NUMBER
- IMAGE_CAPTION
- FOOTER
- TABLE_OF_CONTENT
- MARK
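Downstream code may find it convenient to treat the labels above as a closed set. The following is a minimal sketch, assuming the labels are handled as plain strings; the `TokenType` enum and the `parse_token_type` helper are hypothetical names for illustration and are not part of this repository:

```python
from enum import Enum


class TokenType(Enum):
    """Token types listed above (illustrative only, not the tool's own API)."""
    FORMULA = "FORMULA"
    FOOTNOTE = "FOOTNOTE"
    LIST = "LIST"
    TABLE = "TABLE"
    FIGURE = "FIGURE"
    TITLE = "TITLE"
    TEXT = "TEXT"
    HEADER = "HEADER"
    PAGE_NUMBER = "PAGE_NUMBER"
    IMAGE_CAPTION = "IMAGE_CAPTION"
    FOOTER = "FOOTER"
    TABLE_OF_CONTENT = "TABLE_OF_CONTENT"
    MARK = "MARK"


def parse_token_type(label: str) -> TokenType:
    """Look up a label by name, ignoring case and surrounding whitespace.

    Raises KeyError if the label is not one of the listed types.
    """
    return TokenType[label.strip().upper()]
```

Looking labels up through the enum means an unexpected label fails loudly instead of passing through silently.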
Create venv:
make install_venv
Get the token types from a PDF:
source venv/bin/activate
python src/predict.py /path/to/pdf
Get the labeled data tool from the GitHub repository:
https://github.com/huridocs/pdf-labeled-data
Change the paths in src/config.py:
LABELED_DATA_ROOT_PATH = /path/to/pdf-labeled-data/project
TRAINED_MODEL_PATH = /path/to/save/trained/model
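Before training, it may be worth confirming that both paths point somewhere usable. The following is a small sketch, assuming the two settings are ordinary filesystem paths; `check_paths` is a hypothetical helper and is not part of src/config.py:

```python
from pathlib import Path
from typing import List


def check_paths(labeled_data_root: str, trained_model_path: str) -> List[str]:
    """Return a list of problems found; an empty list means both paths look usable."""
    problems = []
    # The labeled data must already exist as a directory.
    if not Path(labeled_data_root).is_dir():
        problems.append(f"labeled data directory not found: {labeled_data_root}")
    # The model file need not exist yet, but the directory it is saved into should.
    model_dir = Path(trained_model_path).parent
    if not model_dir.is_dir():
        problems.append(f"model output directory not found: {model_dir}")
    return problems
```

Calling `check_paths` with the two values from the config and printing any returned messages gives an early warning before a training run starts.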
Create venv:
make install_venv
Train a new model:
source venv/bin/activate
python src/train.py
Get the token types from a PDF using the trained model:
python src/predict.py /path/to/pdf /path/to/model
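The two invocations of src/predict.py in this document differ only in the optional model path. The following is a sketch of calling the script from Python, assuming it is run from the repository root with the venv's interpreter active. `build_predict_command` and `run_predict` are hypothetical helpers, and nothing is assumed about the format of the script's output:

```python
import subprocess
import sys
from typing import List, Optional


def build_predict_command(pdf_path: str, model_path: Optional[str] = None) -> List[str]:
    """Build the argument list for src/predict.py; the model path is optional."""
    command = [sys.executable, "src/predict.py", pdf_path]
    if model_path is not None:
        command.append(model_path)
    return command


def run_predict(pdf_path: str, model_path: Optional[str] = None) -> str:
    """Run the script and return whatever it writes to standard output, if anything."""
    result = subprocess.run(
        build_predict_command(pdf_path, model_path),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode output as str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when a PDF path contains spaces.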