Concepts:
- Transformers
- Shallow heuristics
- Narrow training distribution
Points of interest:
- Ability to 'reason'
- Different steps of training
- Transfer learning
LMs: factual knowledge + reasoning + contextual tracking.
- useful to train and evaluate Language Models (LMs) that are much smaller than state-of-the-art models (below 10 million total parameters: Small Language Models (SLMs)) or that have much simpler architectures (e.g. only one transformer block).
- demonstrates reasoning capabilities.
- multidimensional score for the model (grammar, creativity, instruction-following), unlike very structured standard benchmarks.
- useful for teams working on low-resource / specialized domains + provides a new perspective on the capabilities of LMs
- generative models trained on TinyStories show behaviors similar to Large Language Models (LLMs).
- conducting extensive experiments on different hyperparameters, architectures, and training methods reveals insights into the performance and quality of these models even with limited computational resources.
- models trained on TinyStories appear to be substantially more interpretable than larger ones, with clear attention patterns and meaningful neuron activations.
- visualization and analysis of attention and activation maps provide insights into the generation process and story content, enhancing our understanding of how these models operate.
- models trained on TinyStories can produce results comparable to much larger models like GPT2-XL, demonstrating the effectiveness of this approach in generating high-quality text.
- introduced in "Attention Is All You Need" (Vaswani et al., 2017)
- neural network architecture primarily used for natural language processing.
- key feature: the attention mechanism, which lets it capture complex relationships in sequential data (see the sketch after this list).
- excel in tasks like machine translation and text generation.
- famous models such as GPT and BERT employ transformer architectures.
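To make the attention mechanism concrete, a minimal PyTorch sketch of scaled dot-product attention; the tensor shapes and toy usage are arbitrary and not tied to any model in this project:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (batch, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                # attention pattern
    return weights @ v                                     # weighted sum of values

# Toy usage: batch of 2 sequences, 5 tokens, 16-dim vectors (arbitrary sizes).
q = k = v = torch.randn(2, 5, 16)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 16])
```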
- simple rules or low-complexity methods used to solve problems or make decisions.
- can be quick to apply but may also yield approximate results.
- Data collection.
- Preprocessing: cleaning and formatting the text data, including tokenization (breaking text into individual words or subwords), handling special characters, and potentially applying techniques like stemming or lemmatization (see the sketch after this list).
  a. stemming: reducing words to their root by removing suffixes, e.g. "eating", "eats", "eaten" become "eat". Fast, but may produce imperfect results as it does not consider context.
  b. lemmatization: reducing words to their canonical form, e.g. "better" becomes "good", "running" becomes "run".
- Token embedding: converting tokens into vectors that can be understood by the model. Involves word embeddings (e.g., Word2Vec, GloVe) or subword tokenization with learned embeddings (e.g., Byte Pair Encoding, SentencePiece).
- Model training: involves feeding the tokenized and embedded text into the model and adjusting its parameters (e.g., weights in neural networks) iteratively to minimize prediction errors.
  a. optimization algorithms: improve the model's ability to generate coherent and contextually relevant text, while minimizing errors and maximizing language understanding.
- Evaluation: assessing the performance of the trained model using various metrics such as perplexity, accuracy, or BLEU score (Bilingual Evaluation Understudy).
- Fine-tuning (optional): fine-tuning the pre-trained model on a specific task or domain to improve its performance for a particular application.
- Deployment.
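To illustrate the preprocessing and tokenization steps above, a minimal sketch using NLTK for stemming/lemmatization and a Hugging Face tokenizer for subword tokenization; the "gpt2" checkpoint and the sample words/sentence are only examples:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from transformers import AutoTokenizer

nltk.download("wordnet", quiet=True)  # lexicon needed by the lemmatizer

# a. stemming: strip suffixes without looking at context
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["eating", "eats"]])  # ['eat', 'eat']

# b. lemmatization: map words to a canonical dictionary form (needs a POS hint)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'

# Subword tokenization (Byte Pair Encoding in GPT-2's case)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer("Once upon a time")["input_ids"]
print(ids, tokenizer.convert_ids_to_tokens(ids))
```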
- transferring a pre-trained model's learned knowledge to a new related task or domain.
- pre-trained model is used as a starting point.
- allows saving time + resources.
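A minimal sketch of the transfer-learning starting point with the Hugging Face API; the "gpt2" checkpoint is just an example of a pre-trained model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from a pre-trained checkpoint instead of random weights ("gpt2" is an example).
checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The pre-trained weights are then fine-tuned on the new task/domain,
# which saves time and compute compared to training from scratch.
```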
Generate a train dataset (size: 70) & a test dataset (size: 30):
$D_{\text{train}} = \{ (x_i, y_i) : 1 \le i \le 100 \wedge (i \bmod 10) \notin \{1, 3, 7\} \}$
$D_{\text{test}} = \{ (x_i, y_i) : 1 \le i \le 100 \wedge (i \bmod 10) \in \{1, 3, 7\} \}$
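A minimal sketch of this split, assuming the 100 (x_i, y_i) pairs are held in a Python list; the placeholder pairs below are hypothetical:

```python
# Placeholder pairs standing in for the real (x_i, y_i), 1 <= i <= 100.
pairs = [(f"x{i}", f"y{i}") for i in range(1, 101)]

test_residues = {1, 3, 7}
d_train = [p for i, p in enumerate(pairs, start=1) if i % 10 not in test_residues]
d_test = [p for i, p in enumerate(pairs, start=1) if i % 10 in test_residues]

assert len(d_train) == 70 and len(d_test) == 30
```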
Difficulty: ⭐ Duration: 30 minutes
! Decode the tokenizer.pad_token to add it to our synthetic completions.
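A minimal sketch of that note, assuming a GPT-2-style tokenizer whose pad token is mapped to the EOS token; the completions list is a hypothetical stand-in for the synthetic completions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default

# Decode the pad token id back to text and append it to each synthetic completion.
pad_text = tokenizer.decode([tokenizer.pad_token_id])
completions = ["The cat sat", "Once upon a time"]  # hypothetical completions
padded = [c + pad_text for c in completions]
```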
📚 Doc:
- Hugging Face 🤗 Create dataset
- stackoverflow
Reasons for avoiding the generate() function (see the decoding sketch after this list):
- Customization of the generation process
- Performance
- Flexibility: ability to add extra features or custom preprocessing steps to the generation process
- Control over model + behavior
- Needed in both training and inference stages (generate() can only be used at inference time)
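A minimal sketch of generation through direct forward calls instead of generate(); greedy decoding and the "gpt2" checkpoint are assumptions, the point is a fully customizable loop usable at both training and inference time:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def greedy_complete(prompt, max_new_tokens=20):
    """Greedy decoding with explicit forward passes (no generate())."""
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                      # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)   # append the chosen token
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(greedy_complete("Once upon a time"))
```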
- Create dataset_batches by stepping through the dataset with step=batch_size. Advantages: optimization + parallelism, since PyTorch/TensorFlow are optimized for processing batches of data in parallel (see the sketch below).
- Iterate over the batches and, for each one, get the prompts/completions.
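A minimal sketch of this batching, assuming a Hugging Face Dataset with "prompt"/"completion" columns; the toy dataset and column names are assumptions:

```python
from datasets import Dataset

# Toy dataset standing in for the real one.
dataset = Dataset.from_dict({
    "prompt": [f"prompt {i}" for i in range(20)],
    "completion": [f"completion {i}" for i in range(20)],
})

batch_size = 8  # example value

# Step through the dataset with step=batch_size: each slice is one batch,
# which PyTorch/TensorFlow can then process in parallel.
for start in range(0, len(dataset), batch_size):
    batch = dataset[start : start + batch_size]   # dict of column -> list
    prompts, completions = batch["prompt"], batch["completion"]
    # ... tokenize the batch and run it through the model here
```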
- Activation:
- Training stage -> learn model params
- Calculated + used to adjust model params to minimize the loss on the training data
- Deactivation:
- Inference / evaluation stages
- No parameter adjustments are made because the parameters have already been learned during training.
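Assuming activation/deactivation here refers to gradient computation, a minimal PyTorch sketch with a toy model standing in for the LM:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # toy model standing in for the LM
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))

# Training stage: gradients are activated, computed from the loss,
# and used to adjust the parameters.
model.train()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Inference / evaluation stage: gradients are deactivated because the
# parameters have already been learned; no adjustment happens here.
model.eval()
with torch.no_grad():
    predictions = model(x).argmax(dim=-1)
```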
Difficulty: ⭐⭐ Duration: 1h
📚 Doc:
- Hugging Face 🤗 Evaluate predictions
- Hugging Face 🤗 Utilities for Tokenizers: understand PreTrainedTokenizerBase, params + returns
- Hugging Face 🤗 Inference
- Inference PyTorch Models
- Init DummyModel class constructor
- Implement a customized forward method
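A minimal sketch of what the DummyModel constructor and customized forward could look like; the vocab_size argument and the all-zero (uniform) logits are assumptions, not the assignment's actual solution:

```python
import torch
import torch.nn as nn

class DummyModel(nn.Module):
    """Baseline model with a constructor and a customized forward (illustrative only)."""

    def __init__(self, vocab_size):
        super().__init__()            # init the nn.Module machinery
        self.vocab_size = vocab_size  # assumed constructor argument

    def forward(self, input_ids, attention_mask=None):
        # Return uniform (all-zero) logits over the vocabulary for every position:
        # shape (batch, seq_len, vocab_size), like a causal LM would.
        batch, seq_len = input_ids.shape
        return torch.zeros(batch, seq_len, self.vocab_size)

# Usage with arbitrary shapes.
model = DummyModel(vocab_size=50257)
out = model(torch.randint(0, 50257, (2, 5)))
print(out.shape)  # torch.Size([2, 5, 50257])
```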
Difficulty: ⭐⭐⭐⭐ Duration: 2h
📚 Doc:
Difficulty: ⭐⭐⭐ Duration: 1h
Hyperparameters: settings that control the training process and can influence the performance of the model. They can include:
- learning rate
- batch size
- number of epochs
- optimizer
- regularization parameters
- dropout rate
- model architecture choices: number of layers in a neural network, number of neurons/layer, ...
What I tried:
- changed the learning rate, number of epochs, and batch_size (see the sketch below)
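A minimal sketch of where those hyperparameters plug into a Hugging Face Trainer run; the "gpt2" checkpoint, the toy corpus, and the concrete values are assumptions:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "gpt2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Toy corpus standing in for the real training data.
train_data = Dataset.from_dict({"text": ["Once upon a time there was a cat.",
                                         "The sun was shining brightly."]})
tokenized_train = train_data.map(lambda b: tokenizer(b["text"]),
                                 batched=True, remove_columns=["text"])

# Hyperparameters under test (values are examples, not the ones actually used).
args = TrainingArguments(
    output_dir="out",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    weight_decay=0.01,  # regularization parameter
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```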
📚 Doc:
- Hugging Face 🤗 Causal language modeling | Train
- Hugging Face 🤗 Evaluate - A library for easily evaluating machine learning models and datasets
- Hugging Face 🤗 Evaluate - transformers
- Stackoverflow - Using huggingface transformers trainer method for hugging face datasets
- Error using transformers Trainer - remove_unused_columns=False
- PyTorch 🔥 torch.optim
- Exploring a task that requires a balance between different kinds of words: nouns, verbs, adjectives, numbers.
- Task 1: OK
- Task 2: truncate completions to improve DummyModel's accuracy
- Task 3:
- constructor
- investigate further whether the forward parameters are useful
- logits instantiation / initialization / storage
- Task 4: develop training stage
- See the impact of a different dataset (size & quality)
- Test the hypothesis with different SLMs
📚 Doc:
Difficulty: ⭐ Duration: 30 minutes