-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Inline Math Support #517
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- High level, I think an approach where we don't combine all the logic into the OCRBuilder is more maintainable (see comments)
- Needs tests
"block_id": block.id, | ||
"token_count": token_count | ||
}) | ||
for block in page.contained_blocks(document): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clean!
from marker.schema.polygon import PolygonBox | ||
from marker.schema.registry import get_block_class | ||
from marker.schema.text.line import Line | ||
from marker.schema.text.span import Span | ||
from marker.settings import settings | ||
from marker.util import matrix_intersection_area, rescale_bbox | ||
|
||
|
||
class OcrBuilder(BaseBuilder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the right way to do this is to add a separate linebuilder that gets the lines and adds them to the pages. Then we can use the same lines in the layout builder and the ocr builder.
Merging the layout line logic into the ocr builder doesn't seem like the right way to go, since the ocr builder is specific to recognizing text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically:
- create document
- do line and math detection, and break detected lines properly
- pass inline math, etc, to the providers, so they can break lines properly (you can use the keep_chars flag when calling pdftext to keep individual characters and positions inside spans, which will help you break them properly, and be a lot cleaner)
- run layout builder, using detected lines for heuristics
- if needed, run ocr builder (using the previous lines)
- run processor for equations
What do you think? I think this will be cleaner than combining all the logic into the ocr builder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think passing inline math to the providers isn't too clean. We can include chars in the provider output instead. How about:
- Create document, provider
- Run LayoutBuilder
- Run LineBuilder
- Detects and splits text and inline math lines
- Use layout heuristics to decide on good/bad provider lines
- Splice inline boxes into good provider lines
- Merge good provider lines , detected text lines (after filtering for duplicates, with empty text)
- Run OCR Builder
- Finds all blocks with lines that need OCR (
extraction_method='surya'
) - Can run a heuristic to decide whether to OCR or not
- Run OCR, or delete the empty text lines
- Finds all blocks with lines that need OCR (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good! Only issue is I'm not sure all future providers can return characters, but we can have a flag for providers that do, so we only run the line splitting on those - what do you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can fall back to line splitting on spans for those, and then completely skip for anything else.
Might be an interesting LLM processor we can write for this down the line too.
@@ -80,3 +80,12 @@ def matrix_intersection_area(boxes1: List[List[float]], boxes2: List[List[float] | |||
height = np.maximum(0, max_y - min_y) | |||
|
|||
return width * height # Shape: (N, M) | |||
|
|||
def rescale_bbox(bbox: List[float], old_size=tuple[float], new_size=tuple[float]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use PolygonBox.from_bbox(bbox).rescale(old_size, new_size).bbox versus adding a new function.
Inline math is now detected from
surya
and spliced into the existing provider linesComplete refactor of
OCRBuilder
andLayoutBuilder
, OCR decisions are now made per-line, instead of for a full page.