Replace custom transformer implementation with x-transformers #77
Conversation
task_queue_manager,
checkpoint=None,
so init is separate from build?
Init is indeed separate from build.
Build is done the same way at start of training, when resuming training, and when translating (although the last will only build the necessary subset of components).
Setting the initial value for the parameters is different in all three.
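A minimal sketch of the split, with hypothetical names (the real Mammoth functions differ); the model is built the same way, only the source of the initial parameter values changes:

    import torch.nn as nn

    # Hypothetical sketch only; not the actual Mammoth interface.
    def init_parameters(model: nn.Module, mode: str, checkpoint=None):
        """Same built model, three different sources for the initial parameter values."""
        if mode == "train":            # fresh training run: random initialization
            for p in model.parameters():
                if p.dim() > 1:
                    nn.init.xavier_uniform_(p)
        elif mode == "resume":         # resuming training: restore everything from the checkpoint
            model.load_state_dict(checkpoint["model"])
        elif mode == "translate":      # translating: only the built subset of components is restored
            model.load_state_dict(checkpoint["model"], strict=False)
        return model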
ok, worth documenting for people interested in using HF models for init
mostly ok, but needs some cleaning to make it less of a nightmare for others to work with it. i left both comments and review comments.
-copy_attn >> ${LOG_FILE} 2>&1
[ "$?" -eq 0 ] || error_exit
echo "Succeeded" | tee -a ${LOG_FILE}
would be great to have something of the sort, even if minimal
There is no copy attention anymore.
We should add a few new cases here, though.
mammoth/tests/test_beam_search.py
Outdated
@@ -316,6 +385,7 @@ def test_beam_is_done_when_n_best_beams_eos_using_min_length(self):
        beam.update_finished()
        self.assertTrue(beam.done)

    @unittest.skip('attention no longer returned')
you do have a keyword about returning attn in the model though?
Yes, x-transformers is able to return the attention.
However, it is returned in a different format from the previous implementation, and the use cases for it have been removed. I decided to remove the return instead of reimplementing it.
Maybe the test should be removed as well, unless someone feels like reimplementing this feature and fixing the test.
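For reference, a minimal sketch of what retrieving attention from x-transformers could look like, assuming its return_attn keyword (untested here, and the format differs from the old implementation):

    import torch
    from x_transformers import TransformerWrapper, Decoder

    # Sketch only, not wired into Mammoth: x-transformers can hand back
    # the attention maps, but in its own per-layer format.
    model = TransformerWrapper(
        num_tokens=1000,
        max_seq_len=256,
        attn_layers=Decoder(dim=512, depth=2, heads=8),
    )
    tokens = torch.randint(0, 1000, (1, 32))
    logits, attn_maps = model(tokens, return_attn=True)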
Then remove the test; I'd rather we avoid keeping relics from unsupported features in our codebase, since they make it hard to know what we actually support at a glance.
@@ -148,13 +172,17 @@ def output(self, step, num_steps, learning_rate, start, metadata=None):
        meta_str = ''
        if num_steps > 0:
            step_fmt = "%s/%5d" % (step_fmt, num_steps)
        acc = self.accuracy()
        acc_str = f'{acc:6.2f}' if acc is not None else '--'
do we have cases where acc/ppl are none? this is weird
ppl shouldn't be None, unless the model blew up. I think we check for it somewhere else. If not, we should. Stopping a model blowout isn't the responsibility of the stats logger.
acc will be None when you don't want to waste compute on it.
well, we don't have anything stopping a model blowout.
All this isn't very "halt and catch fire"-y.
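If we ever add a halt-and-catch-fire check, it would presumably look something like this and live in the training loop rather than the stats logger (hypothetical sketch, not part of this PR):

    import math

    # Hypothetical guard in the trainer, outside the stats logger.
    def check_not_diverged(ppl):
        if ppl is None or not math.isfinite(ppl) or ppl > 1e6:
            raise RuntimeError(f"Perplexity is {ppl}; training has likely diverged. Halting.")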
@@ -0,0 +1,391 @@
import click
why should we sell this as part of our package?
i am not convinced that this tool makes a lot of sense as part of our lib: it's a cool in-house tidbit, but i'd either explicitly define it in a test package or just not include it
I would like to use it to create a quickstart that can be run directly from the repo without needing to download any additional resources. If we don't want to do that, then we can remove the script.
then maybe move to something like an examples directory? where you could also save *.sh files for explicit config-config calls, wrapper examples, etc.
External dependencies for layer architectures #56
Supports label smoothing out of the box.
Because the strided sequence is offset by the value specified in the config for that corpus, the strided line numbers are not divisible by the stride (unless offset is zero). Taking the original offset into account allows verifying that the loaded line number is reachable with the given stride.
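Concretely, the check amounts to something like this (illustrative sketch, not the exact Mammoth code):

    # A line index is reachable by this reader iff it lies on the strided
    # sequence offset, offset + stride, offset + 2 * stride, ...
    def is_reachable(line_number: int, stride: int, offset: int) -> bool:
        return (line_number - offset) % stride == 0

    # e.g. stride=4, offset=1 yields lines 1, 5, 9, ...: none divisible by 4,
    # but all satisfy the check above.
    assert is_reachable(9, stride=4, offset=1)
    assert not is_reachable(8, stride=4, offset=1)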
Translation runs, and the output is somehow related to the input. However, it definitely doesn't work correctly. For example, you get (x * 2) - 1 rows of output for x rows of input.
Greedy decoder tests are still turned off
This structure allows retrieval of AdaptedAttentionLayers using the layer_stack_index and the xcoder_id. The existing structure requires knowing a task_id in which the component is used. This task_id can be difficult to acquire in some contexts.
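Roughly the difference, with hypothetical container names (the values stand in for AdaptedAttentionLayers instances):

    # Old structure: a component is reachable only through a task that uses it.
    components_by_task = {
        "task_a": {0: "attention layers for stack 0 of encoder enc_a"},
    }
    layers = components_by_task["task_a"][0]          # a task_id must be known

    # New structure: keyed directly on (layer_stack_index, xcoder_id).
    components_by_id = {
        (0, "enc_a"): "attention layers for stack 0 of encoder enc_a",
    }
    layers = components_by_id[(0, "enc_a")]           # no task_id required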
scikit-learn==1.2.0 tqdm==4.66.2
Force-pushed from 1b3eaf9 to ad9de67.
If it works, it's good enough for me. I'm continuing the convo on a few key points but this is more or less mergeable as is
elif opts.reset_optim == 'keep_states':
    # Reset options, keep optimizer.
    optim_state_dict = ckpt_state_dict
return optim_opts, optim_state_dict
pass
i don't recall using those options in recent past, but i didn't touch optim restarting (that was joseph and raul, iirc?)
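For reference, a plausible reading of the four choices, inferred from the option names and the inline comment (hypothetical summary, not the actual code):

    def resolve_optim_restart(opts, ckpt_optim_opts, ckpt_state_dict):
        # Hypothetical summary of the reset_optim branches.
        if opts.reset_optim == 'none':           # keep optimizer options and state from the checkpoint
            return ckpt_optim_opts, ckpt_state_dict
        elif opts.reset_optim == 'all':          # fresh optimizer built from the current opts
            return opts, None
        elif opts.reset_optim == 'states':       # keep checkpoint options, discard optimizer state
            return ckpt_optim_opts, None
        elif opts.reset_optim == 'keep_states':  # reset options, keep optimizer state
            return opts, ckpt_state_dict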
return tmp_layer_types, tmp_layer_structs, tmp_layer_dropouts


class LoraAdapterLayer(nn.Module):
we didn't support lora previously, did we? if so, might be good to explicitly mention in the global description of the PR
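For context, this is what a LoRA adapter layer looks like in general form (generic sketch of the technique, not Mammoth's LoraAdapterLayer):

    import torch.nn as nn

    class LoraAdapter(nn.Module):
        """Generic low-rank adapter: y = W x + scale * B (A x), with W frozen."""

        def __init__(self, in_features, out_features, rank=8, alpha=16.0):
            super().__init__()
            self.base = nn.Linear(in_features, out_features)
            self.base.weight.requires_grad_(False)       # frozen pretrained weight
            self.lora_a = nn.Linear(in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)            # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.lora_b(self.lora_a(x))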
mammoth/opts.py
Outdated
action='store_true',
help="Dump samples when building vocab. Warning: this may slow down the process.",
)
if build_vocab_only:
do we still support a vocab building script? i thought this had been nuked, but apparently not well enough?
@@ -584,6 +538,12 @@ def _add_train_general_opts(parser):
    choices=['none', 'all', 'states', 'keep_states'],
    help="Optimization resetter when train_from.",
)
group.add(
not sure this is the best thing to add in an official lib, but hey, whatever. Happy to discuss it if you want a bone to be picked
@@ -312,11 +383,31 @@ def test_returns_correct_scores_non_deterministic_topp(self):
        valid_score_dist_1 = torch.log_softmax(torch.tensor([6.0, 5.0, 4.0, 3.0, 2.0, 1.0]), dim=0)
        valid_score_dist_2 = torch.log_softmax(torch.tensor([6.0, 1.0]), dim=0)
        eos_idx = 2
        src_len = 67
        lengths = torch.randint(0, 30, (batch_sz,))
        samp = GreedySearch(
minor: this is fairly hard to read and would really benefit from kwargs
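For example (hypothetical keyword names, purely to illustrate the kwargs suggestion; the real GreedySearch signature differs):

    # Hypothetical: the same call with keyword arguments is self-documenting.
    samp = GreedySearch(
        eos=eos_idx,
        batch_size=batch_sz,
        min_length=0,
        max_length=30,
        sampling_temp=1.0,
        keep_topk=1,
    )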
"-x_transformers", | ||
"--x_transformers", | ||
help="For a complete list of options, see" | ||
" https://github.com/lucidrains/x-transformers/blob/main/x_transformers/x_transformers.py ." |
ideally, point to the github @ commit hash for the pinned version?
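e.g. a sketch of the idea (placeholder revision; the real value would come from the pinned requirement):

    # Hypothetical: point the help text at the pinned revision instead of `main`.
    X_TRANSFORMERS_PIN = "<pinned commit or tag from requirements>"  # placeholder
    x_transformers_help = (
        "For a complete list of options, see "
        f"https://github.com/lucidrains/x-transformers/blob/{X_TRANSFORMERS_PIN}"
        "/x_transformers/x_transformers.py"
    )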
raise ValueError(
    f'"{unsupported_key}" is not supported in Mammoth.'
)
# Mammoth has a different default value than x-transformers,
needs to be in the docs somewhere — perhaps in the help of the corresponding flag in opts?
would be great to have a 102 or whatever showcasing x-transformers config
This flag can be passed through to x-transformers using the x_transformer_opts dict; there is no need for special handling. The flag is much less visible now, and a user noticing that --transformer_ff has been removed will need to dig deep to find it and learn the changed parameterization. Not sure where to document this. Maybe we need a new page for migration tips.
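A migration note could boil down to something like this (sketch; assumes an ff_mult-style multiplier in x-transformers and the x_transformer_opts passthrough mentioned above):

    # Hypothetical migration example. Old-style flags:
    #   --transformer_ff 2048 (with a model dimension of 512)
    # With x-transformers, the feed-forward width is a multiplier of the model
    # dimension, passed through the opts dict:
    x_transformer_opts = {
        "ff_mult": 4,   # 4 * 512 = 2048, matching the old transformer_ff
    }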
Force-pushed from c343807 to 8064500.
In addition to replacing the transformer implementation, many secondary changes were needed.
Note that the x-transformers migration enables many new pieces of functionality directly from x-transformers, which won't be listed here.
The code runs (training and translation), but further testing is necessary to ensure correct functioning.
Closes #56