
Add benchmarking for inference notebooks #184

Open
Shreyanand opened this issue Jul 14, 2022 · 4 comments
Labels
sparsification Indicates that the issue exists to achieve model sparsification.

Comments

@Shreyanand
Member

To evaluate sparsification results, we need to measure the performance of each inference step: relevance and KPI extraction.

  • As part of this issue, create a notebook called benchmarks.ipynb in the demo2 directory. This notebook will load the relevance model and the KPI extraction model and run inference on a large number of PDFs (145 samples).

  • Results should look something like this:

    • Relevance: [t1, t2, ... t145] distribution of 145 inference times; find its min, mean, max, and std
    • KPI extraction: [t1, t2, ... t145] distribution of 145 inference times; find its min, mean, max, and std
    • This should borrow inference code from the infer_relevance and infer_kpi notebooks.
  • Second, get the performance metrics for each model. Look at the end of the train_relevance and train_kpi_extraction notebooks, borrow the relevant code and the test dataset, and compute the performance metrics: F1 score, recall, precision, and accuracy.

  • Results should look something like this:

    • Relevance: F1 score, recall, precision, and accuracy (on the test set, ~30 files assuming an 80/20 split; double-check this)
    • KPI extraction: F1 score, recall, precision, and accuracy
  • Print the model size in MB for both models (a rough sketch of these benchmarking steps follows after this list).

@Shreyanand added the sparsification label on Jul 14, 2022
@Shreyanand
Member Author

@rishirich please post any updates here on the approach you are taking to solve this issue.

@rishirich
Contributor

@Shreyanand I was trying to take direct measurements of the time taken for each PDF, but then figured that the time taken per PDF is largely dictated by its number of pages and overall text density.
I did a deep dive into the code and checked how the data was gathered and how it was chunked.
I think a more accurate way of benchmarking would be to create chunks out of each individual page (chunk = number of questions × number of paragraphs), run inference on that chunk (i.e., a page), and then proceed to the next.
Once all the pages in the PDF are processed, we can note the average time taken per page in that PDF.
A benefit of this method is that the text density of each page in the PDF is taken into account.

We can then run this for all the PDFs, get the average inference time per page per PDF per question, and collect these per-PDF averages to compute their mean, min, max, and std. This way we account for the average text density per page per PDF, and the varying sizes (number of pages) won't dictate the average inference time per PDF.

After this, for any particular PDF, we can multiply this average by the number of pages in that PDF to get the expected inference time, and also record the actual inference time for comparison.

@MichaelTiemannOSC
Contributor

We discussed in the Data Extraction weekly meeting that the extractor's pattern for recognizing paragraphs (a newline, or perhaps a pair of newlines) was creating pessimal results for CDP documents, where a paragraph is a short sentence ("State the global scope 1 CO2 emissions (in megatons)") and the answer is even shorter ("1000"). Many small paragraphs are not conducive to its method of extraction, and they also create lots of fruitless paragraphs to search. The team will try a new approach: use a question-number regexp (one that would match C4.1a, C4.2, etc.) and treat all the text in between as sentences. This will both create a lot more context and greatly reduce the number of paragraphs that have to be searched.

Bottom line: number of "paragraphs" as well as pages should be measured.

@MichaelTiemannOSC
Contributor

@DaBeIDS @MichaelTiemannOSC for visibility

rishirich added a commit to rishirich/aicoe-osc-demo that referenced this issue Aug 10, 2022
Contains modified inference code for both the Relevance and KPI inference phases of the inference pipeline to add benchmarking steps.
Both models have been benchmarked thoroughly, and the benchmarks include the following metrics:
1) Relevance Model:
    - Total Number of Data Points Processed
    - Total Inference Time
    - Average Number of Pages Per PDF
    - Average Inference Time Per PDF
    - Minimum Inference Time of PDF
    - Maximum Inference Time of PDF
    - Std of Inference Times of PDFs
    - Average Time Per Data Point Processed
    - Average Data Points Processed Per Second
2) KPI Model:
    - Total Number of Data Points Processed
    - Total Inference Time
    - Average Inference Time Per CSV
    - Minimum Inference Time of CSV
    - Maximum Inference Time of CSV
    - Std of Inference Times of CSVs
    - Average Time Per Data Point Processed
    - Average Data Points Processed Per Second

Signed-off-by: [Rishikesh Gawade](https://github.com/rishirich/)