SwissInfo - WWII Radio Bulletins importers #148

Open
6 tasks
piconti opened this issue Jan 16, 2025 · 0 comments

Similarly to #147, create the importer for the SwissInfo radio bulletins.

The WWII radio bulletin data is already on the EPFL NAS. Unfortunately, it is in text-embedded PDF format, so the OCR text first needs to be extracted from the PDFs.

Action points for this issue are:

  • Look at how the OCR text can be extracted from the PDFs and what output formats this would produce.
    • TETML? Other tools? Look at libraries that do this.
    • Explore the solutions proposed by swissAI: PDF-Extract-Kit and docling.
  • Based on the structure of the resulting OCR data, determine whether the Radio Bulletin CI schema would work for it.
  • Implement an importer that creates radio-bulletin content-items.
  • Document and comment the code relating to this.
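
As a starting point for the first and third action points, here is a rough sketch of what extraction plus content-item creation could look like. It is only an illustration under several assumptions: `pypdf` is my own pick for reading the embedded text layer (it is not one of the tools listed above), and the `build_content_item` function, its field names and the `"radio_bulletin"` type value are hypothetical placeholders, not the actual Radio Bulletin CI schema.

```python
def extract_page_texts(pdf_path):
    """Return the embedded text layer of each page of a PDF.

    Assumption: uses the third-party `pypdf` package (pip install pypdf),
    imported lazily so the rest of this sketch runs without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() reads text already stored in the PDF; it does not run OCR.
    return [page.extract_text() or "" for page in reader.pages]


def normalize(text):
    """Collapse whitespace runs inside each line and drop empty lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def build_content_item(bulletin_id, page_texts):
    """Assemble a HYPOTHETICAL content item from per-page text.

    The keys below are placeholders for illustration only; the real
    schema still has to be checked against the extracted data.
    """
    pages = [normalize(text) for text in page_texts]
    return {
        "id": bulletin_id,
        "type": "radio_bulletin",  # placeholder value
        "n_pages": len(pages),
        # Pages that contain no text are skipped when joining.
        "text": "\n\n".join(page for page in pages if page),
    }
```

Usage would be along the lines of `build_content_item("some-id", extract_page_texts("bulletin.pdf"))`, where both the identifier and the file name are made up. Keeping extraction separate from item creation means the extraction backend (TETML, PDF-Extract-Kit, docling, ...) could be swapped later without touching the importer logic.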