train_kpi_extraction is buggy with 145 training files #183

Open
Shreyanand opened this issue Jul 14, 2022 · 2 comments

Labels
bug: Something isn't working
nlp-internal: Indicates that the issue exists to improve the internal NLP model and its code
productization: Indicates that the issue exists to track improvements for pipelines, images, trino, superset, etc.

Comments

@Shreyanand (Member)

The train_kpi_extraction notebook errors out when we run it with 145 files. The pipeline is here and the screenshot shows the error.
[Screenshot from 2022-07-14 12-49-15 showing the error]

@Shreyanand added the bug, productization, and nlp-internal labels on Jul 14, 2022
@MichaelTiemannOSC (Contributor)

I don't know about this screenshot, but there's a bug I found in the (very old) FARM code. The setting of pred_id in farm/modeling/prediction_head.py should only use basket.id_internal if basket.id_external is None. I'm reporting this in such a hokey way because the GitHub code for the version supported by Red Hat is not at my fingertips. Without this fix, we get JSON errors when the id_external is zero and the id_internal is some compound thing like "188-0". Here is to_qa_preds with the fix applied:

    def to_qa_preds(self, top_preds, no_ans_gaps, baskets):
        """ Groups Span objects together in a QAPred object  """
        ret = []

        # Iterate over each set of document-level predictions
        for pred_d, no_ans_gap, basket in zip(top_preds, no_ans_gaps, baskets):

            # Unpack document offsets, clear text and id
            token_offsets = basket.samples[0].tokenized["document_offsets"]
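            # Fix: fall back to id_internal only when id_external is None;
            # an id_external of 0 is a valid id and must not be treated as missing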
            pred_id = basket.id_external if basket.id_external is not None else basket.id_internal

            # These options reflect the different input dicts that can be assigned to the basket
            # before any kind of normalization or preprocessing can happen
            question_names = ["question_text", "qas", "questions"]
            doc_names = ["document_text", "context", "text"]

            document_text = try_get(doc_names, basket.raw)
            question = self.get_question(question_names, basket.raw)
            ground_truth = self.get_ground_truth(basket)

            curr_doc_pred = QAPred(id=pred_id,
                                   prediction=pred_d,
                                   context=document_text,
                                   question=question,
                                   token_offsets=token_offsets,
                                   context_window_size=self.context_window_size,
                                   aggregation_level="document",
                                   ground_truth_answer=ground_truth,
                                   no_answer_gap=no_ans_gap)

            ret.append(curr_doc_pred)
        return ret
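
To see why the explicit `is not None` guard matters: assuming the old line used a truthiness fallback (e.g. `or`), an id_external of zero would be discarded in favor of the compound internal id. A minimal illustration, with hypothetical values taken from the comment above:

    id_external = 0          # a valid external id
    id_internal = "188-0"    # a compound internal id

    # Truthiness fallback: 0 is falsy, so the wrong id wins
    buggy_pred_id = id_external or id_internal                                # -> "188-0"

    # Explicit None check: 0 is kept, as it should be
    fixed_pred_id = id_external if id_external is not None else id_internal  # -> 0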

@Shreyanand (Member, Author) commented Jul 20, 2022

@MichaelTiemannOSC I tried disabling multiprocessing and it seems to have solved the problem here as well, although it has been >24 hrs since I started training and it's 75% done.
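
For reference, a minimal sketch of that workaround, assuming the notebook builds a FARM DataSilo for preprocessing (the processor and batch_size names below stand in for whatever the notebook already defines); in FARM, max_processes=1 keeps everything in the main process:

    from farm.data_handler.data_silo import DataSilo

    # max_processes=1 disables FARM's multiprocessing pool during preprocessing;
    # much slower (hence the >24 hr run), but sidesteps the crash.
    data_silo = DataSilo(
        processor=processor,    # the QA processor the notebook already sets up
        batch_size=batch_size,
        max_processes=1,
    )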
