A repo for language reconstruction experiments designed for generating coherent and fluent natural language from synthetic texts

d-gurgurov/Language-Reconstruction-LowRes-Languages

Low-Resource Machine Translation with Intermediary Language from Language Reconstruction Model

Project Overview

This project addresses the challenge of translating between low-resource languages (LRLs) using an intermediary language, leveraging existing data and models. The approach involves a multi-stage process where a low-resource machine translation (MT) model is improved by incorporating synthetic data generated through intermediary translations.

Approach

  1. Initial Translation: Use an English-to-LRL MT model to translate a portion of parallel data (PDfirsthalf) to an intermediary "BadLRL" representation.
  2. Model Training for Refinement (Reconstruction Model): Train an MT model to translate from "BadLRL" to LRL, so that noisy "BadLRL" output is refined into "good" LRL.
    • Generate synthetic "BadLRL" data by scrambling LRL sentences or translating them incorrectly from English.
  3. Synthetic Data Generation: Use the two previous systems to create refined synthetic parallel data (PDsynthetic).
  4. Final Model Training: Train the final MT model using both the second half of the parallel data (PDsecondhalf) and the refined synthetic data (PDsynthetic).
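
The data flow of steps 1 to 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the repository's code: the function names, the `en_to_lrl` and `reconstruct` callables (stand-ins for the English-to-LRL MT model and the reconstruction model), the even split, the extra English sentences fed into step 3, and the use of word shuffling as the scrambling method are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)
Translator = Callable[[str], str]


def split_parallel_data(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split (English, LRL) pairs into PDfirsthalf and PDsecondhalf."""
    mid = len(pairs) // 2
    return pairs[:mid], pairs[mid:]


def scramble(sentence: str, rng: random.Random) -> str:
    """Create a synthetic "BadLRL" sentence by shuffling word order (assumed method)."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def build_reconstruction_data(
    first_half: List[Pair], en_to_lrl: Translator, rng: random.Random
) -> List[Pair]:
    """Steps 1-2: collect (BadLRL, LRL) training pairs for the reconstruction model."""
    data: List[Pair] = []
    for english, lrl in first_half:
        data.append((en_to_lrl(english), lrl))  # BadLRL from the weak MT model
        data.append((scramble(lrl, rng), lrl))  # BadLRL from scrambling
    return data


def build_synthetic_parallel_data(
    english_sentences: List[str], en_to_lrl: Translator, reconstruct: Translator
) -> List[Pair]:
    """Step 3: translate English to BadLRL, then refine it to get PDsynthetic."""
    return [(en, reconstruct(en_to_lrl(en))) for en in english_sentences]


def final_training_data(second_half: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Step 4: combine PDsecondhalf with PDsynthetic."""
    return list(second_half) + list(synthetic)
```

Passing the models in as plain callables keeps the sketch independent of any particular MT toolkit.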

Schematics

BadLRL Process

Literature Review

Most Relevant Sources

  1. Improving Neural Machine Translation Models with Monolingual Data
    Sennrich et al. (2016)

    • Key Contribution: The original back-translation paper.
    1. Pairs monolingual target-side training data with an automatic back-translation, and treats the result as additional parallel training data.
    2. Use the trained model to translate monolingual sentences from the target language back into the source language. This creates synthetic parallel data, which can be used to increase the amount of training data.
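
A minimal sketch of back-translation as summarized above, assuming a hypothetical `target_to_source` callable that stands in for a trained target-to-source model. The function names are illustrative only.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: List[str], target_to_source: Callable[[str], str]
) -> List[Pair]:
    """Pair each real target sentence with a synthetic source sentence."""
    return [(target_to_source(tgt), tgt) for tgt in target_monolingual]


def augment(real_pairs: List[Pair], synthetic_pairs: List[Pair]) -> List[Pair]:
    """Treat the synthetic pairs as additional parallel training data."""
    return list(real_pairs) + list(synthetic_pairs)
```

Only the source side is synthetic, so the model always learns to produce real target sentences.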
  2. Bi-Directional Differentiable Input Reconstruction for Low-Resource Neural Machine Translation
    Niu et al. (2019)

    • Key Contribution: Proposes a bi-directional NMT model that learns to reconstruct the original input from the translation.
    1. The authors propose a bi-directional NMT model that incorporates a reconstruction task to better utilize the limited parallel data.
    2. Input reconstruction. The model learns to translate in both directions (source to target and target to source) and also tries to reconstruct the original input from the translation.
      • The model first translates the input sentence from the source language to the target language.
      • Then, it attempts to reconstruct the original input by translating the target language output back to the source language.
    3. This method helps to extract more information from limited parallel data.
  3. Trivial Transfer Learning for Low-Resource Neural Machine Translation
    Kocmi and Bojar (2018)

    • Key Contribution: Describes the transfer learning approach where a well-trained high-resource MT model is adapted to low-resource languages.
    1. NMT model for the high-resource language pair is trained until convergence. This model is called “parent”.
    2. After that, the child model is trained without any restart, i.e. only by changing the training corpora to the low-resource language pair.
    3. The same hyperparameters are used.
  4. Iterative Back-Translation for Neural Machine Translation
    Hoang et al. (2018)

    • Key Contribution: Extends basic back-translation by iterating the process to progressively improve synthetic data and translation models.
    1. Iterative back-translation extends the basic back-translation method by repeating the process multiple times to progressively improve the quality of the synthetic data and the translation models.
    2. Back-translated data is used to build better translation systems in the forward and backward directions, which in turn are used to re-back-translate the monolingual data.
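
The iterative back-translation loop from the last entry can be sketched as below. This is a simplified illustration, not the procedure from the paper in full: `train` is a hypothetical callable that takes (input, output) pairs and returns a translation function, and the fixed number of rounds is an assumption.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (input sentence, output sentence)
Translator = Callable[[str], str]
Trainer = Callable[[List[Pair]], Translator]


def iterative_back_translation(
    parallel: List[Pair],
    mono_src: List[str],
    mono_tgt: List[str],
    train: Trainer,
    rounds: int = 2,
) -> Tuple[Translator, Translator]:
    """Alternately retrain forward and backward models on back-translated data."""
    reversed_parallel = [(tgt, src) for src, tgt in parallel]
    forward = train(parallel)            # source -> target
    backward = train(reversed_parallel)  # target -> source
    for _ in range(rounds):
        # Back-translate target monolingual data: (synthetic source, real target).
        synthetic_for_forward = [(backward(t), t) for t in mono_tgt]
        # Back-translate source monolingual data: (synthetic target, real source).
        synthetic_for_backward = [(forward(s), s) for s in mono_src]
        forward = train(parallel + synthetic_for_forward)
        backward = train(reversed_parallel + synthetic_for_backward)
    return forward, backward
```

Each round regenerates the synthetic data with the models from the previous round, so the synthetic pairs can improve as the models do.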

Additional Relevant Sources

  1. Understanding Back-Translation at Scale
    Edunov et al. (2018)

    • Key Contribution: Analyzes the effects of different back-translation techniques and strategies, including sampling and noise addition, which can inform the synthetic data generation process in this project.
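    1. Back-translation usually generates synthetic sources with beam search or greedy search.
    2. Both approximate the single most likely output, which leads to less-rich translations.
    3. Sampling from the model distribution, or adding noise to beam search outputs, gives a stronger training signal in all but resource-poor settings.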
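
A minimal sketch of adding noise to a back-translated sentence, in the spirit of the noised beam outputs discussed by Edunov et al. The function name `add_noise` and the filler token are hypothetical. The default probabilities and the three-position reordering limit follow our reading of the paper and should be treated as assumptions to verify against it.

```python
import random
from typing import List


def add_noise(
    tokens: List[str],
    rng: random.Random,
    p_drop: float = 0.1,
    p_blank: float = 0.1,
    max_shift: int = 3,
    filler: str = "<BLANK>",
) -> List[str]:
    """Delete, blank out, and locally reorder tokens of a synthetic source sentence."""
    # Drop each token independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # Replace each surviving token by a filler with probability p_blank.
    blanked = [filler if rng.random() < p_blank else t for t in kept]
    # Sort by (index + uniform noise) so no token moves more than max_shift positions.
    keys = [i + rng.random() * (max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=keys.__getitem__)
    return [blanked[i] for i in order]
```

Passing a seeded `random.Random` makes the noise reproducible across runs.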
  2. Enhancement of Encoder and Attention Using Target Monolingual Corpora in Neural Machine Translation
    Imamura et al. (2018)

    • Key Contribution: Proposes methods to enhance MT models using target language monolingual data.
    1. Extends the method proposed by Sennrich et al. (2016) to enhance the encoder and attention using target monolingual corpora.
    2. Proposed method generates multiple source sentences by sampling when each target sentence is translated back.
  3. Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT
    Chronopoulou et al. (2020)

    • Key Contribution: Explores the fine-tuning of pre-trained language models for low-resource language translation.
    1. The monolingual LM is fine-tuned on both languages and is then used to initialize an unsupervised NMT model.
    2. To reuse the pretrained LM, they have to modify its predefined vocabulary, to account for the new language.
    3. A novel vocabulary extension method is proposed.
  4. Copied Monolingual Data Improves Low-Resource Neural Machine Translation
    Currey et al. (2017)

    • Key Contribution: Discusses copying monolingual data to the source side.
    1. Proposed copying monolingual data to the source side for low-resource NMT.
    2. The copied pairs have the same monolingual sentence on both sides, so source and target are identical.
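
A minimal sketch of the copying idea from Currey et al., assuming target-side monolingual sentences are simply mixed into the parallel corpus. The function name `add_copied_data` is hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def add_copied_data(parallel: List[Pair], target_monolingual: List[str]) -> List[Pair]:
    """Append (target, target) pairs built from monolingual target sentences."""
    copied = [(sentence, sentence) for sentence in target_monolingual]
    return list(parallel) + copied
```

Unlike back-translation, this needs no reverse model, which is why it is attractive when only a small parallel corpus exists.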
  5. Neural Proto-Language Reconstruction
    Cui et al. (2024)

    • Key Contribution: Introduces VAE-Transformer for proto-language reconstruction and data augmentation.
    1. Data Augmentation: The authors develop methods to predict missing data in the dataset, which helps improve the model's performance and stability.
    2. VAE-Transformer: They enhance the Transformer model by adding a Variational Autoencoder (VAE) structure. This addition helps the model better capture the relationships between daughter languages and their proto-forms.
      • VAEs create a more regularized latent space compared to standard autoencoders.
    3. Neural Machine Translation: They adapt techniques from neural machine translation to the task of proto-language reconstruction.
  6. Simple and Effective Noisy Channel Modeling for Neural Machine Translation
    Yee et al. (2019)

    • Key Contribution: Explores noisy channel models to utilize unpaired data.
    1. Direct models cannot naturally take advantage of unpaired data.
    2. Channel model and language model utilized instead.
    3. Standard sequence to sequence models are a simple parameterization for the channel probability that naturally exploits the entire source.
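
A simplified reranking sketch of the noisy channel idea. By Bayes' rule, p(y|x) is proportional to p(x|y) p(y), so a channel model and a language model can score candidates alongside the direct model. The `Candidate` type, the single weight `lam`, and the lack of length normalization are assumptions made for illustration, not the exact formulation of Yee et al.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    direct_logprob: float   # log p(y | x) from the direct model
    channel_logprob: float  # log p(x | y) from the channel model
    lm_logprob: float       # log p(y) from the language model


def noisy_channel_score(candidate: Candidate, lam: float = 1.0) -> float:
    """Combine the direct score with the weighted channel and LM scores."""
    return candidate.direct_logprob + lam * (
        candidate.channel_logprob + candidate.lm_logprob
    )


def rerank(candidates: List[Candidate], lam: float = 1.0) -> List[Candidate]:
    """Sort candidates from best to worst combined score."""
    return sorted(candidates, key=lambda c: noisy_channel_score(c, lam), reverse=True)
```

The language model term is where unpaired target data enters, since it can be trained on monolingual text alone.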
  7. A Survey on Text Generation Using Generative Adversarial Networks
    de Rosa and Papa (2021)

    • Key Contribution: Reviews GAN-based methods for text generation, which might give us some insights into generating high-quality synthetic data.
  8. Neural Machine Translation for Low-resource Languages: A Survey
    Ranathunga et al. (2023)

    • Key Contribution: Reviews most synthetic data generation methods for high- and low-resource languages.

Side questions:

  1. (?) Could the "BadLRL to LRL" model be trained jointly with the main translation model?
  2. (?) Could Conditional Variational Autoencoders (CVAEs) be used to create a good synthetic corpus?
    • Encoder: Compresses the input data into a latent space, represented as a probability distribution (usually Gaussian).
    • Latent Space: A compact representation of the input that captures important features of the data.
    • Decoder: Takes a sample from the latent space together with the condition, and generates output data based on these inputs.
  3. (?) Diffusion-like models for language reconstruction (Refer to diffusion_lm)
  4. (?) Is SacreBLEU the most commonly used evaluation tool these days?
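
For the conditional VAE question above, the two pieces that differ from a plain autoencoder can be sketched in pure Python: sampling from the Gaussian latent distribution and the KL term that regularizes it. This is a toy sketch with plain lists instead of tensors. All function names are hypothetical, and the encoder and decoder networks themselves are omitted.

```python
import math
import random
from typing import List


def sample_latent(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def decoder_input(z: List[float], condition: List[float]) -> List[float]:
    """The decoder sees the latent sample concatenated with the condition."""
    return list(z) + list(condition)
```

In a real model the encoder would predict `mu` and `log_var` from the input sentence, and the condition could be, for example, an encoding of the "BadLRL" sentence. That choice is an open design question here.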

Data Sources
