DestT5 (Correcting Semantic Parses with Natural Language through Dynamic Schema Encoding)

Dataset & code for DestT5 (NLP for ConvAI, ACL 2023)

Link to Paper 📝

Model Diagram

If you use this dataset or repository, please cite the following paper:

@inproceedings{glenn2023correcting,
  author = {Parker Glenn and Parag Pravin Dakle and Preethi Raghavan},
  title = "Correcting Semantic Parses with Natural Language through Dynamic Schema Encoding",
  booktitle = "Proceedings of the 5th Workshop on NLP for Conversational AI",
  publisher = "Association for Computational Linguistics",
  year = "2023"
}

Performance

The table below reports the exact-match accuracy (EM%) and execution accuracy (EX%) of DestT5 on the SPLASH dataset, as well as on the auxiliary test sets available in the NLEdit codebase.

| Model | Metric | Seq2Struct (SPLASH) | EditSQL | TaBERT | RAT-SQL | T5-Large |
|---|---|---|---|---|---|---|
| DestT5 (parkervg/destt5-schema-prediction with parkervg/destt5-text2sql) | EM% | 53.43 | 31.82 | 31.47 | 28.37 | 26.1 |
| | EX% | 56.86 | 40.3 | 28.84 | 36.53 | 30.43 |
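
To make the execution-accuracy metric concrete, here is a minimal sketch of one simplified reading of it: a predicted query counts as correct when it returns the same rows as the gold query on the same database. The helper name `execution_match`, the toy `singer` table, and the order-insensitive comparison are all assumptions for illustration. This is not the evaluation script behind the numbers reported here.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute is treated as incorrect.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not matter (an assumption).
    return Counter(predicted_rows) == Counter(gold_rows)


# Toy in-memory database; hypothetical and unrelated to the real Spider schemas.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 31), ("Bob", 45), ("Cy", 27)],
)

gold = "SELECT name FROM singer WHERE age > 30"
same_result = "SELECT name FROM singer WHERE NOT age <= 30"
wrong_result = "SELECT name FROM singer WHERE age < 30"

print(execution_match(conn, same_result, gold))
print(execution_match(conn, wrong_result, gold))
```

In this sketch the second query is written differently from the gold query but returns the same rows, so it passes the execution check even though a literal string comparison would reject it.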

T5-large Dataset

The file data/splash-t5-3vnuv1vf.json contains 112 annotations for interactive semantic parsing.

Each annotation starts from a randomly selected erroneous parse produced by tscholak/3vnuv1vf on the Spider dataset and adds natural language feedback that corrects it.
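
A quick way to inspect the file is to load it with the standard `json` module. The sketch below assumes only that the file is valid JSON. It does not assume any field names, since the annotation schema is not described here, and the helper names `load_annotations` and `describe` are made up for illustration.

```python
import json
from pathlib import Path


def load_annotations(path):
    """Parse the JSON file at `path` and return its top-level object."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def describe(annotations):
    """Return the container type and size, without assuming any field names."""
    return type(annotations).__name__, len(annotations)


# Usage, from the repository root:
#   annotations = load_annotations("data/splash-t5-3vnuv1vf.json")
#   print(describe(annotations))
```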

Model Training

Our codebase is based on the great implementation of Picard. Specifically, we make the following updates to DataTrainingArguments in seq2seq/utils/dataset.py to re-create the experiments described in the paper.

use_gold_concepts: bool = field(
    default=False,
    metadata={
        "help": "Whether or not to serialize input only with columns/tables/values present in the gold query."
    },
)

use_serialization_file: Optional[List[str]] = field(
    default=None,
    metadata={
        "help": "If specified, points to the output of a T5 concept prediction model. Uses predictions as serialization to current text-to-sql model"
    },
)

include_explanation: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Boolean defining whether to serialize explanation in SPLASH training"
    },
)

include_question: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Boolean defining whether to serialize question in SPLASH training"
    },
)

splash_train_with_spider: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Boolean defining whether to interleave Spider train set with Splash train"
    },
)

shuffle_splash_feedback: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Test to see if model is actually using feedback, by running evaluation on test set with shuffled feedback"
    },
)

shuffle_splash_question: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Test to see if model is actually using question, by running evaluation on test set with shuffled questions"
    },
)

task_type: Optional[str] = field(
    default="text2sql",
    metadata={"help": "One of text2sql, schema_prediction"},
)

spider_eval_on_splash: Optional[bool] = field(
    default=False,
    metadata={"help": "Whether we're running a Spider model on SPLASH. Only use question, in that case."},
)
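
As a rough illustration of how these fields might be set from a JSON-style config, the sketch below mirrors a subset of them in a standalone dataclass and fills it from a dictionary. The class `DestT5DataArgs`, the helper `args_from_config`, and the config values are hypothetical. The repository's own argument parsing is not shown here and may work differently.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DestT5DataArgs:
    """Hypothetical stand-in holding a subset of the fields listed above."""

    use_gold_concepts: bool = False
    use_serialization_file: Optional[List[str]] = None
    include_explanation: Optional[bool] = False
    include_question: Optional[bool] = False
    splash_train_with_spider: Optional[bool] = False
    task_type: Optional[str] = "text2sql"


def args_from_config(config: dict) -> DestT5DataArgs:
    """Build the dataclass from a config dict, ignoring keys it does not define."""
    known = {f.name for f in fields(DestT5DataArgs)}
    args = DestT5DataArgs(**{k: v for k, v in config.items() if k in known})
    # The help text names exactly two task types, so reject anything else.
    if args.task_type not in ("text2sql", "schema_prediction"):
        raise ValueError(f"unsupported task_type: {args.task_type!r}")
    return args


# Illustrative values only; these are not taken from the repository's config files.
config = {
    "task_type": "schema_prediction",
    "include_question": True,
    "include_explanation": True,
    "learning_rate": 1e-4,  # not a data argument, so it is ignored here
}
args = args_from_config(config)
print(args)
```

Unset fields keep the defaults shown in the listing, so a config only needs to name the options an experiment changes.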

Usage

First, clone the repo.

This repo uses submodules, which can be initialized and fetched with the following commands.

git submodule init
git submodule update

Then, create a destt5 conda env with the following command.

conda env create --file env.yml

Download Datasets

This work requires both the Spider dataset and the Splash dataset.

First, download Spider.zip here. Place this file in seq2seq/datasets/spider.

Then

Then, to train DestT5, run the following command.

python -m seq2seq.run_seq2seq ./seq2seq/configs/question/text2sql-t5-base-schema-generator.json