Replace custom transformer implementation with x-transformers #77

Merged
merged 38 commits into main from feat/x-transformers on Sep 30, 2024

Conversation

@Waino Waino (Collaborator) commented Sep 16, 2024

In addition to replacing the transformer implementation, many secondary changes were needed:

  • Custom embeddings were removed, because x-transformers includes its own embeddings as an integral part of the model.
  • Changes in model structure necessitated a rewrite of the old saving and loading code.
  • The decoding (greedy and beam search) required extensive modifications.

Note that the x-transformers migration enables many new pieces of functionality directly from x-transformers, which won't be listed here. In addition:

  • We now support LoRA adapters in addition to the old extra FF layer adapters. These are feasible with x-transformers, but would have been challenging before.

The code runs (training and translation), but further testing is necessary to ensure correct functioning.

Closes #56
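
For readers unfamiliar with the library, here is a minimal sketch of the kind of model construction x-transformers provides, adapted from its README (hyperparameters are illustrative; Mammoth's actual wiring differs):

import torch
from x_transformers import TransformerWrapper, Decoder

# Token embeddings are built into TransformerWrapper, which is why the custom
# embedding classes could be dropped in this PR.
model = TransformerWrapper(
    num_tokens=20000,
    max_seq_len=1024,
    attn_layers=Decoder(dim=512, depth=6, heads=8),
)

tokens = torch.randint(0, 20000, (1, 128))
logits = model(tokens)  # shape: (1, 128, 20000)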

task_queue_manager,
checkpoint=None,
Collaborator

so init is separate from build?

Collaborator Author

Init is indeed separate from build.
Build is done the same way at the start of training, when resuming training, and when translating (although the last only builds the necessary subset of components).
Setting the initial value for the parameters is different in all three.
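
A generic PyTorch sketch of the pattern described above (function names and structure are hypothetical, not Mammoth's actual API): the same build step constructs the module structure in every entry point, while the parameter values come from a different source in each.

import torch.nn as nn

def build_model(dim: int = 8) -> nn.Module:
    # Build: define the module structure only; says nothing about weight values.
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

# 1) Fresh training: build, then initialize parameters explicitly.
model = build_model()
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# 2) Resuming training: build identically, then restore from a checkpoint.
checkpoint = {"model": model.state_dict()}
resumed = build_model()
resumed.load_state_dict(checkpoint["model"])

# 3) Translation: build (and load) only the components the task needs;
#    here everything is loaded for simplicity.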

Collaborator

ok, worth documenting for people interested in using HF models for init

@TimotheeMickus TimotheeMickus (Collaborator) left a comment

mostly ok, but needs some cleaning to make it less of a nightmare for others to work with. i left both comments and review comments.

mammoth/modules/adapters.py (outdated; resolved)
mammoth/opts.py (outdated; resolved)
-copy_attn >> ${LOG_FILE} 2>&1
[ "$?" -eq 0 ] || error_exit
echo "Succeeded" | tee -a ${LOG_FILE}

Collaborator

would be great to have something of the sort, even if minimal

Collaborator Author

There is no copy attention anymore.

We should add a few new cases here, though.

@@ -316,6 +385,7 @@ def test_beam_is_done_when_n_best_beams_eos_using_min_length(self):
        beam.update_finished()
        self.assertTrue(beam.done)

    @unittest.skip('attention no longer returned')
Collaborator

you do have a keyword about returning attn in the model though?

Collaborator Author

Yes, x-transformers is able to return the attention.
However, it is returned in a different format from the previous implementation, and the use cases for it have been removed. I decided to remove the return instead of reimplementing it.

Maybe the test should be removed as well, unless someone feels like reimplementing this feature and fixing the test.

Collaborator

Then remove the test, I'd rather we avoid keeping relics from unsupported features in our codebase — makes it hard to know what we actually support at a glance

mammoth/train_single.py (outdated; resolved)
mammoth/utils/statistics.py (resolved)
@@ -148,13 +172,17 @@ def output(self, step, num_steps, learning_rate, start, metadata=None):
        meta_str = ''
        if num_steps > 0:
            step_fmt = "%s/%5d" % (step_fmt, num_steps)
        acc = self.accuracy()
        acc_str = f'{acc:6.2f}' if acc is not None else '--'
Collaborator

do we have cases where acc/ppl are none? this is weird

Collaborator Author

ppl shouldn't be None, unless the model blew up.
I think we check for it somewhere else. If not, we should. Stopping a model blowout isn't the responsibility of the stats logger.

acc will be None when you don't want to waste compute on it.
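
A generic illustration of the pattern being discussed (not the actual mammoth.utils.statistics code): accuracy() returns None when the counts needed to compute it were never accumulated, and the formatter then prints '--'.

class StatsSketch:
    def __init__(self, n_correct=None, n_words=0):
        self.n_correct = n_correct   # None => accuracy tracking disabled
        self.n_words = n_words

    def accuracy(self):
        if self.n_correct is None or self.n_words == 0:
            return None
        return 100.0 * self.n_correct / self.n_words

acc = StatsSketch(n_correct=None, n_words=128).accuracy()
acc_str = f'{acc:6.2f}' if acc is not None else '--'
print(acc_str)  # prints '--'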

Collaborator

well, we don't have anything stopping a model blowout.
All this isn't very "halt and catch fire"-y.

@@ -0,0 +1,391 @@
import click
Collaborator

why should we sell this as part of our package?

i am not convinced that this tool makes a lot of sense as part of our lib — it's a cool inhouse tidbit, but i'd either explicitly define it in a test package or just not include it

Collaborator Author

I would like to use it to create a quickstart that can be run directly from the repo without needing to download any additional resources. If we don't want to do that, then we can remove the script.

Collaborator

then maybe move to something like an examples directory? where you could also save *.sh files for explicit config-config calls, wrapper examples, etc.

mammoth/inputters/dataset.py (outdated; resolved)
mammoth/distributed/components.py (resolved)
@Waino Waino self-assigned this Sep 23, 2024
  • Supports label smoothing out of the box.
  • Because the strided sequence is offset by the value specified in the config for that corpus, the strided line numbers are not divisible by the stride (unless the offset is zero). Taking the original offset into account allows verifying that the loaded line number is reachable with the given stride (see the sketch below).
  • Translation runs, and the output is somehow related to the input. However, it definitely doesn't work correctly: for example, you get (x * 2) - 1 rows of output for x rows of input. Greedy decoder tests are still turned off.
  • This structure allows retrieval of AdaptedAttentionLayers using the layer_stack_index and the xcoder_id. The existing structure requires knowing a task_id in which the component is used; this task_id can be difficult to acquire in some contexts.
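
A minimal sketch of the reachability check described in the note above (function name is illustrative, not the actual Mammoth helper):

def line_is_reachable(line_number: int, stride: int, offset: int) -> bool:
    """True if `line_number` can be produced by starting at `offset`
    and stepping through the corpus with the given `stride`."""
    return (line_number - offset) % stride == 0

# With stride=4 and offset=1, reachable line numbers are 1, 5, 9, ...
assert line_is_reachable(9, stride=4, offset=1)
assert not line_is_reachable(10, stride=4, offset=1)
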
@Waino Waino force-pushed the feat/x-transformers branch from 1b3eaf9 to ad9de67 Compare September 30, 2024 06:45
@TimotheeMickus TimotheeMickus (Collaborator) left a comment

If it works, it's good enough for me. I'm continuing the convo on a few key points but this is more or less mergeable as is

elif opts.reset_optim == 'keep_states':
# Reset options, keep optimizer.
optim_state_dict = ckpt_state_dict
return optim_opts, optim_state_dict
pass
Collaborator

i don't recall using those options in the recent past, but i didn't touch optim restarting (that was joseph and raul, iirc?)

return tmp_layer_types, tmp_layer_structs, tmp_layer_dropouts


class LoraAdapterLayer(nn.Module):
Collaborator

we didn't support lora previously, did we? if so, might be good to explicitly mention in the global description of the PR
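
For readers who haven't seen LoRA before, a minimal sketch of a LoRA-style adapter layer (illustrative only; the actual LoraAdapterLayer added in this PR may differ in details): a low-rank update scaled by alpha / rank is added on top of the wrapped representation, with the up-projection initialized to zero so the adapter starts as a no-op.

import torch
import torch.nn as nn

class LoraAdapterSketch(nn.Module):
    def __init__(self, dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # project dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)     # project rank -> dim
        self.scaling = alpha / rank
        nn.init.normal_(self.down.weight, std=0.02)
        nn.init.zeros_(self.up.weight)                 # starts as a no-op update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scaling * self.up(self.down(x))

x = torch.randn(2, 10, 512)
print(LoraAdapterSketch(dim=512)(x).shape)  # torch.Size([2, 10, 512])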

mammoth/opts.py (outdated)
action='store_true',
help="Dump samples when building vocab. Warning: this may slow down the process.",
)
if build_vocab_only:
Collaborator

do we still support a vocab building script? i thought this had been nuked, but apparently not well enough?

@@ -584,6 +538,12 @@ def _add_train_general_opts(parser):
choices=['none', 'all', 'states', 'keep_states'],
help="Optimization resetter when train_from.",
)
group.add(
Collaborator

not sure this is the best thing to add in an official lib, but hey, whatever. Happy to discuss it if you want a bone to be picked

mammoth/opts.py (outdated; resolved)
@@ -312,11 +383,31 @@ def test_returns_correct_scores_non_deterministic_topp(self):
valid_score_dist_1 = torch.log_softmax(torch.tensor([6.0, 5.0, 4.0, 3.0, 2.0, 1.0]), dim=0)
valid_score_dist_2 = torch.log_softmax(torch.tensor([6.0, 1.0]), dim=0)
eos_idx = 2
src_len = 67
lengths = torch.randint(0, 30, (batch_sz,))
samp = GreedySearch(
Collaborator

minor: this is fairly hard to read and would really benefit from kwargs

"-x_transformers",
"--x_transformers",
help="For a complete list of options, see"
" https://github.com/lucidrains/x-transformers/blob/main/x_transformers/x_transformers.py ."
Collaborator

ideally, point to the github @ commit hash for the pinned version?
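
A small sketch of that suggestion (the pin variable is a placeholder, not an existing Mammoth constant): derive the help-text URL from the pinned x-transformers version so the link matches the installed code.

X_TRANSFORMERS_PIN = "<pinned-commit-or-tag>"  # placeholder; the real pin lives in the dependency spec
help_text = (
    "For a complete list of options, see "
    f"https://github.com/lucidrains/x-transformers/blob/{X_TRANSFORMERS_PIN}/"
    "x_transformers/x_transformers.py ."
)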

raise ValueError(
f'"{unsupported_key}" is not supported in Mammoth.'
)
# Mammoth has a different default value than x-transformers,
Collaborator

needs to be in the docs somewhere — perhaps in the help of the corresponding flag in opts?

Collaborator

would be great to have a 102 or whatever showcasing x-transformers config

This flag can be passed through to x-transformers using the
x_transformer_opts dict; there is no need for special handling.
The flag is much less visible now, and a user noticing that
--transformer_ff has been removed will need to dig deep to find it
and learn the changed parameterization.

Not sure where to document this. Maybe we need a new page for migration
tips.
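
As a concrete illustration of the migration tip above (dict and key names follow the commit note and are otherwise assumptions, not verified Mammoth options): x-transformers expresses the feed-forward width as a multiplier of the model dimension rather than as an absolute size, so an old --transformer_ff value maps onto a multiplier in the pass-through dict.

model_dim = 512
x_transformer_opts = {
    # assumed key: x-transformers' feed-forward multiplier; 512 * 4 = 2048,
    # roughly matching an old "--transformer_ff 2048" setting
    "ff_mult": 4,
}
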
@Waino Waino mentioned this pull request Sep 30, 2024
@Waino Waino force-pushed the feat/x-transformers branch from c343807 to 8064500 Compare September 30, 2024 11:48
@Waino Waino merged commit 794bdf8 into main Sep 30, 2024
2 checks passed
@Waino Waino deleted the feat/x-transformers branch September 30, 2024 12:09
Successfully merging this pull request may close these issues: External dependencies for layer architectures.