🌳 Tip 🌳
For more tips on training neural networks, check out:
- A Recipe for Training Neural Networks (Karpathy 2019)
- NLP's Clever Hans Moment has Arrived (Heinzerling 2019): an excellent writeup on trying to understand what exactly your neural network learns, and techniques to ensure that your model works correctly with textual data.
- An overview of gradient descent optimization algorithms (Ruder 2016)
- [E] When building a neural network, should you overfit or underfit it first?
- [E] Write the vanilla gradient update.
- Neural network in simple Numpy. 26. [E] Write in plain NumPy the forward and backward pass for a two-layer feed-forward neural network with a ReLU layer in between. 27. [M] Implement vanilla dropout for the forward and backward pass in NumPy.
- Activation functions.
28. [E] Draw the graphs for sigmoid, tanh, ReLU, and leaky ReLU.
29. [E] Pros and cons of each activation function.
30. [E] Is ReLU differentiable? What to do when it’s not differentiable?
31. [M] Derive derivatives for sigmoid function
$$\sigma(x)$$ when$$x$$ is a vector. - [E] What’s the motivation for skip connection in neural works?
- Vanishing and exploding gradients. 32. [E] How do we know that gradients are exploding? How do we prevent it? 33. [E] Why are RNNs especially susceptible to vanishing and exploding gradients?
- [M] Weight normalization separates a weight vector’s norm from its gradient. How would it help with training?
- [M] When training a large neural network, say a language model with a billion parameters, you evaluate your model on a validation set at the end of every epoch. You realize that your validation loss is often lower than your train loss. What might be happening?
- [E] What criteria would you use for early stopping?
- [E] Gradient descent vs SGD vs mini-batch SGD.
- [H] It’s a common practice to train deep learning models using epochs: we sample batches from data without replacement. Why would we use epochs instead of just sampling data with replacement?
- [M] Your model’ weights fluctuate a lot during training. How does that affect your model’s performance? What to do about it?
- Learning rate. 34. [E] Draw a graph number of training epochs vs training error for when the learning rate is: 1. too high 2. too low 3. acceptable. 35. [E] What’s learning rate warmup? Why do we need it?
- [E] Compare batch norm and layer norm.
- [M] Why is squared L2 norm sometimes preferred to L2 norm for regularizing neural networks?
- [E] Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly less than 1. What is this useful for?
- It’s a common practice for the learning rate to be reduced throughout the training. 36. [E] What’s the motivation? 37. [M] What might be the exceptions?
- Batch size. 38. [E] What happens to your model training when you decrease the batch size to 1? 39. [E] What happens when you use the entire training data in a batch? 40. [M] How should we adjust the learning rate as we increase or decrease the batch size?
- [M] Why is Adagrad sometimes favored in problems with sparse gradients?
- Adam vs. SGD. 41. [M] What can you say about the ability to converge and generalize of Adam vs. SGD? 42. [M] What else can you say about the difference between these two optimizers?
- [M] With model parallelism, you might update your model weights using the gradients from each machine asynchronously or synchronously. What are the pros and cons of asynchronous SGD vs. synchronous SGD?
- [M] Why shouldn’t we have two consecutive linear layers in a neural network?
- [M] Can a neural network with only RELU (non-linearity) act as a linear classifier?
- [M] Design the smallest neural network that can function as an XOR gate.
- [E] Why don’t we just initialize all weights in a neural network to zero?
- Stochasticity. 43. [M] What are some sources of randomness in a neural network? 44. [M] Sometimes stochasticity is desirable when training neural networks. Why is that?
- Dead neuron. 45. [E] What’s a dead neuron? 46. [E] How do we detect them in our neural network? 47. [M] How to prevent them?
- Pruning. 48. [M] Pruning is a popular technique where certain weights of a neural network are set to 0. Why is it desirable? 49. [M] How do you choose what to prune from a neural network?
- [H] Under what conditions would it be possible to recover training data from the weight checkpoints?
- [H] Why do we try to reduce the size of a big trained model through techniques such as knowledge distillation instead of just training a small model from the beginning?
This book was created by Chip Huyen with the help of wonderful friends. For feedback, errata, and suggestions, the author can be reached here. Copyright ©2021 Chip Huyen.