Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Inline Math Support #517

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from
Open

Add Inline Math Support #517

wants to merge 4 commits into from

Conversation

tarun-menta
Copy link
Collaborator

Inline math is now detected from surya and spliced into the existing provider lines

Complete refactor of OCRBuilder and LayoutBuilder, OCR decisions are now made per-line, instead of for a full page.

Copy link
Owner

@VikParuchuri VikParuchuri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • High level, I think an approach where we don't combine all the logic into the OCRBuilder is more maintainable (see comments)
  • Needs tests

"block_id": block.id,
"token_count": token_count
})
for block in page.contained_blocks(document):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean!

from marker.schema.polygon import PolygonBox
from marker.schema.registry import get_block_class
from marker.schema.text.line import Line
from marker.schema.text.span import Span
from marker.settings import settings
from marker.util import matrix_intersection_area, rescale_bbox


class OcrBuilder(BaseBuilder):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the right way to do this is to add a separate linebuilder that gets the lines and adds them to the pages. Then we can use the same lines in the layout builder and the ocr builder.

Merging the layout line logic into the ocr builder doesn't seem like the right way to go, since the ocr builder is specific to recognizing text.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically:

  • create document
  • do line and math detection, and break detected lines properly
  • pass inline math, etc, to the providers, so they can break lines properly (you can use the keep_chars flag when calling pdftext to keep individual characters and positions inside spans, which will help you break them properly, and be a lot cleaner)
  • run layout builder, using detected lines for heuristics
  • if needed, run ocr builder (using the previous lines)
  • run processor for equations

What do you think? I think this will be cleaner than combining all the logic into the ocr builder

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think passing inline math to the providers isn't too clean. We can include chars in the provider output instead. How about:

  • Create document, provider
  • Run LayoutBuilder
  • Run LineBuilder
    • Detects and splits text and inline math lines
    • Use layout heuristics to decide on good/bad provider lines
    • Splice inline boxes into good provider lines
    • Merge good provider lines , detected text lines (after filtering for duplicates, with empty text)
  • Run OCR Builder
    • Finds all blocks with lines that need OCR (extraction_method='surya')
    • Can run a heuristic to decide whether to OCR or not
    • Run OCR, or delete the empty text lines

Copy link
Owner

@VikParuchuri VikParuchuri Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good! Only issue is I'm not sure all future providers can return characters, but we can have a flag for providers that do, so we only run the line splitting on those - what do you think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fall back to line splitting on spans for those, and then completely skip for anything else.

Might be an interesting LLM processor we can write for this down the line too.

@@ -80,3 +80,12 @@ def matrix_intersection_area(boxes1: List[List[float]], boxes2: List[List[float]
height = np.maximum(0, max_y - min_y)

return width * height # Shape: (N, M)

def rescale_bbox(bbox: List[float], old_size=tuple[float], new_size=tuple[float]):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use PolygonBox.from_bbox(bbox).rescale(old_size, new_size).bbox versus adding a new function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants