GitHub - huridocs/pdf-tokens-type-labeler

PDF tokens type labeler

This tool returns the type of each token in a PDF.

Tokens Types List

FORMULA
FOOTNOTE
LIST
TABLE
FIGURE
TITLE
TEXT
HEADER
PAGE_NUMBER
IMAGE_CAPTION
FOOTER
TABLE_OF_CONTENT
MARK
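
When consuming the labels from Python, it can help to have the list above as an enumeration. The following is a minimal sketch only: the class name `TokenType` and the helper `parse_token_type` are hypothetical and are not taken from this repository, which may represent the types differently.

```python
from enum import Enum


class TokenType(Enum):
    """Hypothetical enumeration mirroring the token types listed in this README."""

    FORMULA = "FORMULA"
    FOOTNOTE = "FOOTNOTE"
    LIST = "LIST"
    TABLE = "TABLE"
    FIGURE = "FIGURE"
    TITLE = "TITLE"
    TEXT = "TEXT"
    HEADER = "HEADER"
    PAGE_NUMBER = "PAGE_NUMBER"
    IMAGE_CAPTION = "IMAGE_CAPTION"
    FOOTER = "FOOTER"
    TABLE_OF_CONTENT = "TABLE_OF_CONTENT"
    MARK = "MARK"


def parse_token_type(label: str) -> TokenType:
    """Map a label string to a TokenType, ignoring case and surrounding whitespace.

    Raises KeyError if the label is not one of the listed types.
    """
    return TokenType[label.strip().upper()]
```

Looking up by name keeps unknown labels from passing silently: anything outside the list raises a `KeyError`.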

Quick Start

Create venv:

make install_venv

Get the token types from a PDF:

source venv/bin/activate
python src/predict.py /path/to/pdf

Train a new model

Get the labeled data tool from the GitHub repository:

https://github.com/huridocs/pdf-labeled-data

Change the paths in src/config.py:

LABELED_DATA_ROOT_PATH = /path/to/pdf-labeled-data/project
TRAINED_MODEL_PATH = /path/to/save/trained/model
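
Since src/config.py is a Python file, the two values would normally be written as quoted strings or path objects. The sketch below shows one possible form together with a small sanity check to run before training. It is an illustration under assumptions: the use of `pathlib` and the helper `check_paths` are hypothetical and not part of the repository, and the paths are the placeholders from above.

```python
from pathlib import Path
from typing import List

# Placeholder locations from the README; replace them with your own.
LABELED_DATA_ROOT_PATH = Path("/path/to/pdf-labeled-data/project")
TRAINED_MODEL_PATH = Path("/path/to/save/trained/model")


def check_paths(labeled_data_root: Path, trained_model_path: Path) -> List[str]:
    """Return a list of problems found with the configured paths (empty if none).

    Hypothetical helper: it only checks that the labeled data directory exists
    and that the directory meant to hold the trained model exists.
    """
    problems = []
    if not labeled_data_root.is_dir():
        problems.append(f"labeled data directory not found: {labeled_data_root}")
    if not trained_model_path.parent.is_dir():
        problems.append(f"model output directory not found: {trained_model_path.parent}")
    return problems


if __name__ == "__main__":
    for problem in check_paths(LABELED_DATA_ROOT_PATH, TRAINED_MODEL_PATH):
        print(problem)
```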

Create venv:

make install_venv

Train a new model:

source venv/bin/activate
python src/train.py

Use a custom model

python src/predict.py /path/to/pdf /path/to/model
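
The two command forms shown in this README (with and without a model path) can also be driven from another Python script. This is a minimal sketch: the function names are hypothetical, it must be run from the repository root with the venv active, and it assumes that src/predict.py writes its results to standard output, which this README does not state.

```python
import subprocess
import sys
from typing import List, Optional


def build_predict_command(pdf_path: str, model_path: Optional[str] = None) -> List[str]:
    """Build the argument list for src/predict.py.

    The model path is appended only when given, matching the two
    invocations shown in the README.
    """
    command = [sys.executable, "src/predict.py", str(pdf_path)]
    if model_path is not None:
        command.append(str(model_path))
    return command


def run_predict(pdf_path: str, model_path: Optional[str] = None) -> str:
    """Run the prediction script and return whatever it printed.

    Assumption: the script reports its results on stdout.
    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    completed = subprocess.run(
        build_predict_command(pdf_path, model_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.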

