I am trying to run marker with parallel processing on 3 PDFs (as a sample of a larger number of PDFs I need to process in parallel). For that, I tried the "marker" and "marker_chunk_convert" CLIs. However, I'm getting a higher processing time than the sum of the processing times of the individual PDF files using the "marker_single" recipe. One would expect the total parallel processing time to be roughly the longest of the three PDFs (assuming one chunk, and that the number of PDFs in that chunk is smaller than or equal to the number of devices/workers), which is the case in my experiment.
What am I missing? Please advise.
Here are the details.
For single PDF file processing (one file at a time), I used:
marker_single path_to_pdf_file --output_dir path_to_input_folder
Resulting processing times of the individual PDF files: 31 sec, 42 sec, and 26 sec
For parallel processing of multiple PDF files, I used:
marker path_to_input_folder --output_dir path_to_input_folder --workers 3
Resulting processing time: 1 min 48 sec, which is greater even than the sum of the processing times of all three PDFs (31 + 42 + 26 = 99 sec = 1 min 39 sec), let alone the expected parallel time of roughly max(31, 42, 26) = 42 sec
As an alternative for parallel processing of multiple PDF files, I used:
NUM_DEVICES=3 NUM_WORKERS=3 NUM_CHUNKS=1 marker_chunk_convert path_to_input_folder --output_dir path_to_input_folder
Resulting processing time: 1 min 52 sec
OS: Ubuntu 20.04
H/W: 128-core CPU
Python 3.10
On CPU, a single PDF's inference might already be saturating your compute, meaning you won't get benefits from additional workers. The multiprocessing is more useful on GPU/MPS.
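If CPU parallelism is still worth pursuing, one thing to rule out is thread oversubscription: three workers each spawning a full-width thread pool on a 128-core box can end up slower than one. A minimal sketch, assuming marker's PyTorch backend respects the standard OMP_NUM_THREADS / MKL_NUM_THREADS environment variables (the thread counts here are illustrative, not tuned):
# Cap each worker's CPU thread pool so the 3 workers split the 128 cores
# instead of each trying to claim all of them
OMP_NUM_THREADS=40 MKL_NUM_THREADS=40 marker path_to_input_folder --output_dir path_to_input_folder --workers 3
Whether this helps depends on how much of marker's runtime is in threaded ops; if a single conversion genuinely saturates all cores, capping threads will just slow each worker down proportionally.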
How can I force GPU processing? I am using the following command:
marker /path/to/input/folder --workers 4 --skip_existing --output_dir Markdown-Table-Output --disable_image_extraction --converter_cls marker.converters.table.TableConverter
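Not an authoritative answer, but marker's README documents a TORCH_DEVICE setting for forcing inference onto a particular torch device. Assuming a CUDA-enabled PyTorch install, something like this should work:
# Assumes marker reads TORCH_DEVICE from the environment (per its README)
# and that a CUDA device is actually visible to PyTorch
TORCH_DEVICE=cuda marker /path/to/input/folder --workers 4 --skip_existing --output_dir Markdown-Table-Output --disable_image_extraction --converter_cls marker.converters.table.TableConverter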