This project is no longer actively maintained. We are focusing our efforts on developing a new and improved version, which can be found in the following repository:
https://github.com/huridocs/pdf-document-layout-analysis
We encourage you to check out the new project, as it offers enhanced features and an updated codebase.
Thank you for your understanding and continued support!
A Docker-powered service for OCRing PDFs
- Redis server for managing queues
- Docker (install)
- Docker-compose (install)
- Note: On macOS, Docker-compose is installed with Docker
Start the service:
./run start
This script starts the service with the default configuration. You can override the default values with the file ./src/config.yml (you may need to create it) with the following values:
redis_host: localhost
redis_port: 6379
service_host: localhost
service_port: 5050
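To illustrate how the override behaves, the sketch below merges the values from a flat config.yml over the defaults listed above. It is only an approximation: the parser handles simple `key: value` lines, and the names `DEFAULTS`, `parse_flat_yaml`, and `load_config` are hypothetical, not the service's actual loader.

```python
from pathlib import Path

# Default values as listed above.
DEFAULTS = {
    "redis_host": "localhost",
    "redis_port": 6379,
    "service_host": "localhost",
    "service_port": 5050,
}


def parse_flat_yaml(text):
    """Parse simple `key: value` lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        value = value.strip()
        # Purely numeric values (the ports) become integers; the rest stay strings.
        values[key.strip()] = int(value) if value.isdigit() else value
    return values


def load_config(path="./src/config.yml"):
    """Return the defaults, overridden by any values found in the file."""
    config = dict(DEFAULTS)
    config_file = Path(path)
    if config_file.exists():
        config.update(parse_flat_yaml(config_file.read_text()))
    return config
```

Keys missing from the file keep their default value, so a config.yml containing only `redis_host` would change that one setting.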
A virtual env is needed for some of the development tasks
./run install_venv
Start the service for testing (with a Redis server included)
./run start:testing
Check service is up and get general info on supported languages and other important information:
curl localhost:5050/info
Test OCR is working (basic sync method)
curl -X POST -F 'file=@./src/test_files/sample-english.pdf' localhost:5050 --output english.pdf
If no language is specified, English is used by default. To specify a language for better OCR results:
curl -X POST -F 'language=fr' -F 'file=@./src/test_files/sample-french.pdf' localhost:5050 --output french.pdf
Remember you can check supported languages on localhost:5050/info
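The same synchronous request can be made from Python. The sketch below uses only the standard library and assumes, as the curl examples suggest, that the service accepts a multipart form with a `file` field and an optional `language` field and returns the OCRed PDF in the response body. The helper names `build_multipart` and `ocr_pdf` are hypothetical.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_name, file_bytes):
    """Encode form fields and one PDF as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def ocr_pdf(pdf_path, output_path, language=None, url="http://localhost:5050"):
    """POST a PDF to the service and save the PDF it sends back."""
    with open(pdf_path, "rb") as pdf:
        file_bytes = pdf.read()
    # Only send the language field when one was requested.
    fields = {"language": language} if language else {}
    body, content_type = build_multipart(fields, os.path.basename(pdf_path), file_bytes)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as out:
        out.write(response.read())
```

With the service running, `ocr_pdf("./src/test_files/sample-french.pdf", "french.pdf", language="fr")` would correspond to the French curl command above.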
To list all available commands, just run ./run. Some useful commands:
./run test
./run linter
./run check_format
./run formatter
- Asynchronous OCR
- HTTP server
- Retrieve OCRed PDF
- Queue processor
- Service configuration
- Troubleshooting
- Upload PDF file to the OCR service
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5050/upload/[namespace]
- Add OCR task to queue
To add an OCR task to the queue, send a message to the ocr_tasks Redis queue. The params should include the filename and, optionally, a supported language.
Python code:
from rsmq import RedisSMQ
# Replace host and port with those of your Redis server
queue = RedisSMQ(host="localhost", port=6379, qname="ocr_tasks", quiet=True)
# The message is a JSON string, so inner values use double quotes
message_json = '{"tenant": "tenant_name", "task": "ocr", "params": {"filename": "pdf_file_name.pdf", "language": "fr"}}'
queue.sendMessage().message(message_json).execute()
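Hand-written JSON strings are easy to get wrong because of the nested quotes. One way to avoid that is to build the message with `json.dumps`, as in this minimal sketch. The helper name `build_ocr_task` is hypothetical; the field names are the ones shown in the message above.

```python
import json


def build_ocr_task(tenant, filename, language=None):
    """Return the JSON string for an OCR task on the ocr_tasks queue."""
    params = {"filename": filename}
    if language:
        # The language is optional, so it is only included when given.
        params["language"] = language
    return json.dumps({"tenant": tenant, "task": "ocr", "params": params})
```

The returned string can be passed to `queue.sendMessage().message(...)` in place of the literal `message_json`.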
- Retrieve OCRed PDF
Upon completion of the OCR process, a message is placed in the ocr_results Redis queue. This response is, for now, using specific Uwazi terminology. To check if the process for a specific file has been completed:
# Replace host and port with those of your Redis server
queue = RedisSMQ(host="localhost", port=6379, qname="ocr_results", quiet=True)
# With exceptions(False), execute() returns False when the queue is empty
results_message = queue.receiveMessage().exceptions(False).execute()
# results_message["message"] contains the following information:
# {
# "tenant": "namespace",
# "task": "pdf_name.pdf",
# "success": true,
# "error_message": "",
# "file_url": "http://localhost:5050/processed_pdf/[namespace]/[pdf_name]"
# }
curl -X GET http://localhost:5050/processed_pdf/[namespace]/[pdf_name]
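Since the result arrives asynchronously, a client usually has to poll the queue. The sketch below shows one possible polling loop. It assumes the result message has the fields listed above, and it takes the queue access as a `receive` callable so it does not depend on a running Redis server. The function name `wait_for_result` and its parameters are hypothetical, and deleting the message from the queue after reading it is not handled here.

```python
import json
import time


def wait_for_result(receive, timeout=60.0, poll_interval=1.0):
    """Poll `receive` until a result message arrives and return its file_url.

    `receive` is any callable that returns a dict with a "message" key
    (a JSON string), or a falsy value when the queue is empty.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        received = receive()
        if not received:
            # Nothing in the queue yet; wait before asking again.
            time.sleep(poll_interval)
            continue
        result = json.loads(received["message"])
        if not result.get("success"):
            raise RuntimeError(result.get("error_message") or "OCR failed")
        return result["file_url"]
    raise TimeoutError("no OCR result received before the timeout")
```

With a RedisSMQ queue on ocr_results, the callable could be `lambda: queue.receiveMessage().exceptions(False).execute()`, and the returned URL can then be downloaded as in the curl command above.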
The HTTP server container is written in Python 3.9 and uses the FastAPI web framework.
If the service is running, the endpoint definitions can be found at the following URL:
http://localhost:5050/docs
The endpoint code can be found in the file ./src/app.py.
Errors are logged in the file ./data/service.log.
The Queue processor container is written in Python 3.9 and is in charge of communications with the Redis queue.
The code can be found in the file ./src/QueueProcessor.py, and it uses the RedisSMQ library to interact with the Redis queues.
On macOS, the following config.yml can be used to access Redis on the host's localhost:
redis_host: host.docker.internal
redis_port: 6379
service_host: localhost
service_port: 5050
Solution: Set the RAM available to the Docker containers to 3 GB or 4 GB.