train_kpi_extraction is buggy with 145 training files #183

Open
Shreyanand opened this issue Jul 14, 2022 · 2 comments

Labels
bug: Something isn't working
nlp-internal: Indicates that the issue exists to improve the internal NLP model and its code
productization: Indicates that the issue exists to track improvements for pipelines, images, trino, superset, etc.

Comments

@Shreyanand (Member)

The train_kpi_extraction notebook errors out when we run it with 145 files. The pipeline is here and the screenshot shows the error.
[Screenshot from 2022-07-14 12-49-15 showing the error]

@Shreyanand added the bug, productization, and nlp-internal labels on Jul 14, 2022
@MichaelTiemannOSC (Contributor)

I don't know about this screenshot, but there's a bug I found in the (very old) FARM code. The setting of pred_id in farm/modeling/prediction_head.py should only use basket.id_internal if basket.id_external is None. I'm reporting this in such a hokey way because the GitHub code for the version supported by Red Hat is not at my fingertips. Without this fix, we get JSON errors when the id_external is zero and the id_internal is some compound thing like "188-0". Here is to_qa_preds with the fix applied:

    def to_qa_preds(self, top_preds, no_ans_gaps, baskets):
        """ Groups Span objects together in a QAPred object  """
        ret = []

        # Iterate over each set of document-level predictions
        for pred_d, no_ans_gap, basket in zip(top_preds, no_ans_gaps, baskets):

            # Unpack document offsets, clear text and id
            token_offsets = basket.samples[0].tokenized["document_offsets"]
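            # Fix: fall back to id_internal only when id_external is None;
            # an id_external of 0 is a valid id and must not be treated as missing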
            pred_id = basket.id_external if basket.id_external is not None else basket.id_internal

            # These options reflect the different input dicts that can be assigned to the basket
            # before any kind of normalization or preprocessing can happen
            question_names = ["question_text", "qas", "questions"]
            doc_names = ["document_text", "context", "text"]

            document_text = try_get(doc_names, basket.raw)
            question = self.get_question(question_names, basket.raw)
            ground_truth = self.get_ground_truth(basket)

            curr_doc_pred = QAPred(id=pred_id,
                                   prediction=pred_d,
                                   context=document_text,
                                   question=question,
                                   token_offsets=token_offsets,
                                   context_window_size=self.context_window_size,
                                   aggregation_level="document",
                                   ground_truth_answer=ground_truth,
                                   no_answer_gap=no_ans_gap)

            ret.append(curr_doc_pred)
        return ret
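
To see why the explicit `is not None` guard matters: assuming the old line used a truthiness fallback (e.g. `or`), an id_external of zero would be discarded in favor of the compound internal id. A minimal illustration, with hypothetical values taken from the comment above:

    id_external = 0          # a valid external id
    id_internal = "188-0"    # a compound internal id

    # Truthiness fallback: 0 is falsy, so the wrong id wins
    buggy_pred_id = id_external or id_internal                                # -> "188-0"

    # Explicit None check: 0 is kept, as it should be
    fixed_pred_id = id_external if id_external is not None else id_internal  # -> 0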

@Shreyanand (Member, Author) commented Jul 20, 2022

@MichaelTiemannOSC I tried disabling multiprocessing and it seems to have solved the problem here as well, although it has been >24 hrs since I started training and it's 75% done.
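
For reference, a minimal sketch of that workaround, assuming the notebook builds a FARM DataSilo for preprocessing (the processor and batch_size names below stand in for whatever the notebook already defines); in FARM, max_processes=1 keeps everything in the main process:

    from farm.data_handler.data_silo import DataSilo

    # max_processes=1 disables FARM's multiprocessing pool during preprocessing;
    # much slower (hence the >24 hr run), but sidesteps the crash.
    data_silo = DataSilo(
        processor=processor,    # the QA processor the notebook already sets up
        batch_size=batch_size,
        max_processes=1,
    )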
