-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transformer Day: Annotated miniGPT #179
Comments
this part here is not very clear https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L88 the use of Lambda is a bit confusing. In general, once we have stuff running, we should change the short name to long self-descriptive names. |
It seems like you could replace both |
I would convert this class Block(nn.Module):
""" an unassuming Transformer block """
def __init__(self, config):
super().__init__()
self.ln_1 = nn.LayerNorm(config.n_embd)
self.attn = CausalSelfAttention(config)
self.ln_2 = nn.LayerNorm(config.n_embd)
self.mlp = nn.ModuleDict(dict(
c_fc = nn.Linear(config.n_embd, 4 * config.n_embd),
c_proj = nn.Linear(4 * config.n_embd, config.n_embd),
act = NewGELU(),
dropout = nn.Dropout(config.resid_pdrop),
))
m = self.mlp
self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x)))) # MLP forward
def forward(self, x):
x = x + self.attn(self.ln_1(x))
x = x + self.mlpf(self.ln_2(x))
return x into class Block(nn.Module):
""" an unassuming Transformer block """
def __init__(self, config):
super().__init__()
self.layer_norm_1 = nn.LayerNorm(config.n_embd)
self.dropout_1 = nn.Dropout(config.resid_pdrop)
self.causal_multi_head_attention = CausalSelfAttention(config)
self.layer_norm_2 = nn.LayerNorm(config.n_embd)
self.dropout_2 = nn.Dropout(config.att_pdrop)
self.mlp = nn.ModuleDict(dict(
c_fc = nn.Linear(config.n_embd, 4 * config.n_embd),
c_proj = nn.Linear(4 * config.n_embd, config.n_embd),
act = NewGELU()
))
def feed_forward(self, x):
return self.mlp.c_proj(self.mlp.act(self.mlp.c_fc(x)))
def forward(self, x):
# NOTE: We need to remove dropout from CausalSelfAttention
x = x + self.dropout_1(self.causal_multi_head_attention(self.layer_norm_1(x)))
x = x + self.dropout_2(self.feed_forward(self.layer_norm_2(x)))
return x |
Can you update with the |
It would look something like this: class Block(nn.Module):
""" an unassuming Transformer block """
def __init__(self, config):
super().__init__()
self.ln_1 = nn.LayerNorm(config.n_embd)
self.attn = CausalSelfAttention(config)
self.ln_2 = nn.LayerNorm(config.n_embd)
self.mlp = nn.Sequential(
nn.Linear(config.n_embd, 4 * config.n_embd),
NewGELU(),
nn.Linear(4 * config.n_embd, config.n_embd),
nn.Dropout(config.resid_pdrop)
)
def forward(self, x):
block_in = x
# x = x + self.attn(self.ln_1(x))
attn_out = block_in + self.attn(self.ln_1(block_in))
# x = x + self.mlp(self.ln_2(x))
block_out = attn_out + self.mlp(self.ln_2(attn_out))
return block_out If you do it this way, note that there's no need to distinguish between By the same token, maybe there are clearer names for |
Naming convention of @ramon-astudillo looks good, if we want to avoid from numerical indices we can use pre/post_attention e.g replace self.layer_norm_1 with self.layer_norm_pre_att and self.layer_norm_2 with self.layer_norm_post_att. I will start to make changes locally then we can merge after testing with notebooks. |
this one integrates both mine and @bpopeters class Block(nn.Module):
""" an unassuming Transformer block """
def __init__(self, config):
super().__init__()
# Causal Multi Head Attention
# NOTE: We need to remove dropout from CausalSelfAttention
self.causal_multi_head_attention = CausalSelfAttention(config)
# Feeed Forward
self.feed_forward = nn.Sequential(
nn.Linear(config.n_embd, 4 * config.n_embd),
NewGELU(),
nn.Linear(4 * config.n_embd, config.n_embd),
)
# Dropout and Layer normalization
self.layer_norm_1 = nn.LayerNorm(config.n_embd)
self.dropout_1 = nn.Dropout(config.resid_pdrop)
self.layer_norm_2 = nn.LayerNorm(config.n_embd)
self.dropout_2 = nn.Dropout(config.att_pdrop)
def forward(self, x):
block_in = x
attn_out = block_in + self.dropout_1(self.causal_multi_head_attention(self.layer_norm_1(block_in)))
block_out = attn_out + self.dropout_2(self.feed_forward(self.layer_norm_2(attn_out)))
return block_out |
@grigvardanyan you can keep changes in a branch and then, once we have a running version from @venelink and @gonmelo we can start moving parts there piece by piece to ensure nothing breaks |
I tested changed version of Casual self attention Layer it works fine I can push it now. e.g #TODO Complete Transformer Layer self.causal_multi_head_attention = # self.feed_forward = nn.Sequential(OrderedDict([ self.layer_norm_1 = # |
Objective: Improve clarity of miniGPT through comments or code modifications that do not alter functionality.
Branch: https://github.com/LxMLS/lxmls-toolkit/tree/transformer-day
Expected Finishing date: Ideally before June 12 meeting. If not during that week.
The text was updated successfully, but these errors were encountered: