Add Inline Math Support #517

tarun-menta · 2025-01-30T14:20:45Z

Inline math is now detected by surya and spliced into the existing provider lines.

Complete refactor of OCRBuilder and LayoutBuilder. OCR decisions are now made per line instead of for a full page.

VikParuchuri

At a high level, I think an approach where we don't combine all the logic into the OCRBuilder is more maintainable (see comments).
Needs tests

VikParuchuri · 2025-01-30T17:54:51Z

marker/processors/equation.py

-                    "block_id": block.id,
-                    "token_count": token_count
-                })
+            for block in page.contained_blocks(document):


Very clean!

VikParuchuri · 2025-01-30T17:55:01Z

marker/builders/ocr.py

 from marker.schema.polygon import PolygonBox
 from marker.schema.registry import get_block_class
 from marker.schema.text.line import Line
 from marker.schema.text.span import Span
 from marker.settings import settings
+from marker.util import matrix_intersection_area, rescale_bbox


 class OcrBuilder(BaseBuilder):


I think the right way to do this is to add a separate linebuilder that gets the lines and adds them to the pages. Then we can use the same lines in the layout builder and the ocr builder.

Merging the layout line logic into the ocr builder doesn't seem like the right way to go, since the ocr builder is specific to recognizing text.

Basically:

1. Create document
2. Do line and math detection, and break detected lines properly
3. Pass inline math, etc., to the providers, so they can break lines properly (you can use the keep_chars flag when calling pdftext to keep individual characters and positions inside spans, which will help you break them properly, and be a lot cleaner)
4. Run layout builder, using detected lines for heuristics
5. If needed, run ocr builder (using the previous lines)
6. Run processor for equations

What do you think? I think this will be cleaner than combining all the logic into the ocr builder
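To make the character-level idea concrete, here is a rough sketch of how per-character positions could be used to break a line around inline math. Everything in it is hypothetical: `Char`, `split_line_by_math`, and the "center falls inside the math box" rule are illustrative assumptions, not marker or pdftext APIs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    """Hypothetical per-character record: text plus horizontal extent."""
    text: str
    x0: float
    x1: float


def split_line_by_math(
    chars: List[Char], math_spans: List[Tuple[float, float]]
) -> List[Tuple[str, bool]]:
    """Group consecutive characters into (text, is_math) segments.

    A character is treated as math when its horizontal center lies inside
    any (x_start, x_end) math span. This rule is an assumption of the sketch.
    """
    segments: List[Tuple[str, bool]] = []
    for ch in chars:
        center = (ch.x0 + ch.x1) / 2
        is_math = any(lo <= center <= hi for lo, hi in math_spans)
        if segments and segments[-1][1] == is_math:
            # Same kind as the previous character: extend the open segment.
            segments[-1] = (segments[-1][0] + ch.text, is_math)
        else:
            segments.append((ch.text, is_math))
    return segments


# Five characters, each 10 units wide; the math box covers x in [20, 40].
line = [Char(c, i * 10.0, (i + 1) * 10.0) for i, c in enumerate("ab+cd")]
segments = split_line_by_math(line, [(20.0, 40.0)])
```

With only span-level positions the same loop could run over spans instead of characters, at the cost of coarser breaks.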

I think passing inline math to the providers isn't too clean. We can include chars in the provider output instead. How about:

1. Create document, provider
2. Run LayoutBuilder
3. Run LineBuilder
   - Detects and splits text and inline math lines
   - Use layout heuristics to decide on good/bad provider lines
   - Splice inline boxes into good provider lines
   - Merge good provider lines, detected text lines (after filtering for duplicates, with empty text)
4. Run OCR Builder
   - Finds all blocks with lines that need OCR (extraction_method='surya')
   - Can run a heuristic to decide whether to OCR or not
   - Run OCR, or delete the empty text lines
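The merge and OCR-selection steps above could look roughly like the sketch below. It is a sketch under stated assumptions: `LineRec`, `merge_lines`, `lines_needing_ocr`, the 0.5 overlap threshold, and the `"pdftext"` label are all made up for illustration; only the `extraction_method='surya'` marker comes from this thread.

```python
from dataclasses import dataclass
from typing import List

BBox = List[float]  # [x0, y0, x1, y1]


@dataclass
class LineRec:
    """Hypothetical line record used only in this sketch."""
    bbox: BBox
    text: str
    extraction_method: str


def area(box: BBox) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: BBox, b: BBox) -> float:
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return width * height


def merge_lines(
    provider_lines: List[LineRec],
    detected_boxes: List[BBox],
    overlap_threshold: float = 0.5,  # assumed value, not from the PR
) -> List[LineRec]:
    """Keep good provider lines; add detected lines they do not already cover."""
    merged = list(provider_lines)
    for box in detected_boxes:
        box_area = area(box)
        if box_area == 0:
            continue
        covered = any(
            intersection_area(box, p.bbox) / box_area >= overlap_threshold
            for p in provider_lines
        )
        if not covered:
            # Empty text marks the line as still needing OCR.
            merged.append(LineRec(bbox=box, text="", extraction_method="surya"))
    return merged


def lines_needing_ocr(lines: List[LineRec]) -> List[LineRec]:
    return [ln for ln in lines if ln.extraction_method == "surya" and not ln.text]


provider = [LineRec([0, 0, 100, 10], "hello", "pdftext")]
detected = [[0, 0, 100, 10], [0, 20, 100, 30]]  # first box duplicates the provider line
merged = merge_lines(provider, detected)
todo = lines_needing_ocr(merged)
```

The OCR builder would then either recognize the lines in `todo` or drop them, depending on whatever heuristic is chosen.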

That sounds good! Only issue is I'm not sure all future providers can return characters, but we can have a flag for providers that do, so we only run the line splitting on those. What do you think?

We can fall back to line splitting on spans for those, and then completely skip for anything else.

Might be an interesting LLM processor we can write for this down the line too.
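The fallback order described here (characters, then spans, then skip) might be expressed as a small capability check. The names below (`SplitGranularity`, `choose_split_granularity`, and the two boolean flags) are hypothetical and not part of marker's provider interface.

```python
from enum import Enum


class SplitGranularity(Enum):
    CHARS = "chars"  # provider returns per-character positions
    SPANS = "spans"  # provider returns spans only
    NONE = "none"    # skip inline math splitting entirely


def choose_split_granularity(returns_chars: bool, returns_spans: bool) -> SplitGranularity:
    """Pick the finest splitting level the provider can support."""
    if returns_chars:
        return SplitGranularity.CHARS
    if returns_spans:
        return SplitGranularity.SPANS
    return SplitGranularity.NONE
```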

VikParuchuri · 2025-01-30T18:03:08Z

marker/util.py

@@ -80,3 +80,12 @@ def matrix_intersection_area(boxes1: List[List[float]], boxes2: List[List[float]
    height = np.maximum(0, max_y - min_y)

    return width * height  # Shape: (N, M)
+
+def rescale_bbox(bbox: List[float], old_size=tuple[float], new_size=tuple[float]):


You can use PolygonBox.from_bbox(bbox).rescale(old_size, new_size).bbox instead of adding a new function.
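For reference, the arithmetic a bbox rescale performs can be sketched standalone. This is not marker's implementation: `rescale_bbox_sketch` is a hypothetical name, and it assumes sizes are (width, height) pairs and the bbox is [x0, y0, x1, y1].

```python
from typing import List, Tuple


def rescale_bbox_sketch(
    bbox: List[float],
    old_size: Tuple[float, float],
    new_size: Tuple[float, float],
) -> List[float]:
    """Scale a bbox from one page size to another, axis by axis."""
    old_width, old_height = old_size
    new_width, new_height = new_size
    scale_x = new_width / old_width
    scale_y = new_height / old_height
    x0, y0, x1, y1 = bbox
    return [x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y]


scaled = rescale_bbox_sketch([10, 20, 30, 40], (100, 200), (200, 100))
```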

tarun-menta added 4 commits January 30, 2025 18:44

Initial inline math support

f695b49

Run OCR on any missing lines from good pages

c32eb88

Cleanup debug statements

8c9a4fb

Move line splitting for math boxes from surya into marker

c7c7dcf

VikParuchuri requested changes Jan 30, 2025
