Replace custom transformer implementation with x-transformers #77

Merged
merged 38 commits into main from feat/x-transformers on Sep 30, 2024

Conversation

@Waino Waino (Collaborator) commented Sep 16, 2024

In addition to replacing the transformer implementation, many secondary changes were needed:

  • Custom embeddings were removed, because x-transformers includes its own embeddings as an integral part of the model.
  • Changes in model structure necessitated a rewrite of the old saving and loading code.
  • The decoding (greedy and beam search) required extensive modifications.

Note that the x-transformers migration enables many new pieces of functionality directly from x-transformers, which won't be listed here. In addition:

  • We now support LoRA adapters in addition to the old extra FF layer adapters. These are feasible with x-transformers, but would have been challenging before.

The code runs (training and translation), but further testing is necessary to ensure correct functioning.

Closes #56
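
For readers unfamiliar with the library, here is a minimal sketch of the kind of model construction x-transformers provides, adapted from its README (hyperparameters are illustrative; Mammoth's actual wiring differs):

import torch
from x_transformers import TransformerWrapper, Decoder

# Token embeddings are built into TransformerWrapper, which is why the custom
# embedding classes could be dropped in this PR.
model = TransformerWrapper(
    num_tokens=20000,
    max_seq_len=1024,
    attn_layers=Decoder(dim=512, depth=6, heads=8),
)

tokens = torch.randint(0, 20000, (1, 128))
logits = model(tokens)  # shape: (1, 128, 20000)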

task_queue_manager,
checkpoint=None,
Collaborator

so init is separate from build?

Collaborator Author

Init is indeed separate from build.
Build is done the same way at the start of training, when resuming training, and when translating (although the last only builds the necessary subset of components).
Setting the initial value for the parameters is different in all three.
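
A generic PyTorch sketch of the pattern described above (function names and structure are hypothetical, not Mammoth's actual API): the same build step constructs the module structure in every entry point, while the parameter values come from a different source in each.

import torch.nn as nn

def build_model(dim: int = 8) -> nn.Module:
    # Build: define the module structure only; says nothing about weight values.
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

# 1) Fresh training: build, then initialize parameters explicitly.
model = build_model()
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# 2) Resuming training: build identically, then restore from a checkpoint.
checkpoint = {"model": model.state_dict()}
resumed = build_model()
resumed.load_state_dict(checkpoint["model"])

# 3) Translation: build (and load) only the components the task needs;
#    here everything is loaded for simplicity.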

Collaborator

ok, worth documenting for people interested in using HF models for init

@TimotheeMickus TimotheeMickus (Collaborator) left a comment

mostly ok, but needs some cleaning to make it less of a nightmare for others to work with. i left both comments and review comments.

mammoth/modules/adapters.py (outdated; resolved)
mammoth/opts.py (outdated; resolved)
-copy_attn >> ${LOG_FILE} 2>&1
[ "$?" -eq 0 ] || error_exit
echo "Succeeded" | tee -a ${LOG_FILE}

Collaborator

would be great to have something of the sort, even if minimal

Collaborator Author

There is no copy attention anymore.

We should add a few new cases here, though.

@@ -316,6 +385,7 @@ def test_beam_is_done_when_n_best_beams_eos_using_min_length(self):
        beam.update_finished()
        self.assertTrue(beam.done)

    @unittest.skip('attention no longer returned')
Collaborator

you do have a keyword about returning attn in the model though?

Collaborator Author

Yes, x-transformers is able to return the attention.
However, it is returned in a different format from the previous implementation, and the use cases for it have been removed. I decided to remove the return instead of reimplementing it.

Maybe the test should be removed as well, unless someone feels like reimplementing this feature and fixing the test.

Collaborator

Then remove the test, I'd rather we avoid keeping relics from unsupported features in our codebase — makes it hard to know what we actually support at a glance

mammoth/train_single.py (outdated; resolved)
mammoth/utils/statistics.py (resolved)
@@ -148,13 +172,17 @@ def output(self, step, num_steps, learning_rate, start, metadata=None):
        meta_str = ''
        if num_steps > 0:
            step_fmt = "%s/%5d" % (step_fmt, num_steps)
        acc = self.accuracy()
        acc_str = f'{acc:6.2f}' if acc is not None else '--'
Collaborator

do we have cases where acc/ppl are none? this is weird

Collaborator Author

ppl shouldn't be None, unless the model blew up.
I think we check for it somewhere else. If not, we should. Stopping a model blowout isn't the responsibility of the stats logger.

acc will be None when you don't want to waste compute on it.
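
A generic illustration of the pattern being discussed (not the actual mammoth.utils.statistics code): accuracy() returns None when the counts needed to compute it were never accumulated, and the formatter then prints '--'.

class StatsSketch:
    def __init__(self, n_correct=None, n_words=0):
        self.n_correct = n_correct   # None => accuracy tracking disabled
        self.n_words = n_words

    def accuracy(self):
        if self.n_correct is None or self.n_words == 0:
            return None
        return 100.0 * self.n_correct / self.n_words

acc = StatsSketch(n_correct=None, n_words=128).accuracy()
acc_str = f'{acc:6.2f}' if acc is not None else '--'
print(acc_str)  # prints '--'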

Collaborator

well, we don't have anything stopping a model blowout.
All this isn't very "halt and catch fire"-y.

@@ -0,0 +1,391 @@
import click
Collaborator

why should we sell this as part of our package?

i am not convinced that this tool makes a lot of sense as part of our lib — it's a cool inhouse tidbit, but i'd either explicitly define it in a test package or just not include it

Collaborator Author

I would like to use it to create a quickstart that can be run directly from the repo without needing to download any additional resources. If we don't want to do that, then we can remove the script.

Collaborator

then maybe move to something like an examples directory? where you could also save *.sh files for explicit config-config calls, wrapper examples, etc.

mammoth/inputters/dataset.py (outdated; resolved)
mammoth/distributed/components.py (resolved)
@Waino Waino self-assigned this Sep 23, 2024
  • Supports label smoothing out of the box.
  • Because the strided sequence is offset by the value specified in the config for that corpus, the strided line numbers are not divisible by the stride (unless the offset is zero). Taking the original offset into account allows verifying that the loaded line number is reachable with the given stride (see the sketch below).
  • Translation runs, and the output is somehow related to the input. However, it definitely doesn't work correctly: for example, you get (x * 2) - 1 rows of output for x rows of input. Greedy decoder tests are still turned off.
  • This structure allows retrieval of AdaptedAttentionLayers using the layer_stack_index and the xcoder_id. The existing structure requires knowing a task_id in which the component is used; this task_id can be difficult to acquire in some contexts.
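
A minimal sketch of the reachability check described in the note above (function name is illustrative, not the actual Mammoth helper):

def line_is_reachable(line_number: int, stride: int, offset: int) -> bool:
    """True if `line_number` can be produced by starting at `offset`
    and stepping through the corpus with the given `stride`."""
    return (line_number - offset) % stride == 0

# With stride=4 and offset=1, reachable line numbers are 1, 5, 9, ...
assert line_is_reachable(9, stride=4, offset=1)
assert not line_is_reachable(10, stride=4, offset=1)
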
@Waino Waino force-pushed the feat/x-transformers branch from 1b3eaf9 to ad9de67 Compare September 30, 2024 06:45
@TimotheeMickus TimotheeMickus (Collaborator) left a comment

If it works, it's good enough for me. I'm continuing the convo on a few key points but this is more or less mergeable as is

elif opts.reset_optim == 'keep_states':
# Reset options, keep optimizer.
optim_state_dict = ckpt_state_dict
return optim_opts, optim_state_dict
pass
Collaborator

i don't recall using those options in the recent past, but i didn't touch optim restarting (that was joseph and raul, iirc?)

return tmp_layer_types, tmp_layer_structs, tmp_layer_dropouts


class LoraAdapterLayer(nn.Module):
Collaborator

we didn't support lora previously, did we? if so, might be good to explicitly mention in the global description of the PR
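
For readers who haven't seen LoRA before, a minimal sketch of a LoRA-style adapter layer (illustrative only; the actual LoraAdapterLayer added in this PR may differ in details): a low-rank update scaled by alpha / rank is added on top of the wrapped representation, with the up-projection initialized to zero so the adapter starts as a no-op.

import torch
import torch.nn as nn

class LoraAdapterSketch(nn.Module):
    def __init__(self, dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # project dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)     # project rank -> dim
        self.scaling = alpha / rank
        nn.init.normal_(self.down.weight, std=0.02)
        nn.init.zeros_(self.up.weight)                 # starts as a no-op update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scaling * self.up(self.down(x))

x = torch.randn(2, 10, 512)
print(LoraAdapterSketch(dim=512)(x).shape)  # torch.Size([2, 10, 512])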

mammoth/opts.py (outdated)
action='store_true',
help="Dump samples when building vocab. Warning: this may slow down the process.",
)
if build_vocab_only:
Collaborator

do we still support a vocab building script? i thought this had been nuked, but apparently not well enough?

@@ -584,6 +538,12 @@ def _add_train_general_opts(parser):
choices=['none', 'all', 'states', 'keep_states'],
help="Optimization resetter when train_from.",
)
group.add(
Collaborator

not sure this is the best thing to add in an official lib, but hey, whatever. Happy to discuss it if you want a bone to be picked

mammoth/opts.py (outdated; resolved)
@@ -312,11 +383,31 @@ def test_returns_correct_scores_non_deterministic_topp(self):
valid_score_dist_1 = torch.log_softmax(torch.tensor([6.0, 5.0, 4.0, 3.0, 2.0, 1.0]), dim=0)
valid_score_dist_2 = torch.log_softmax(torch.tensor([6.0, 1.0]), dim=0)
eos_idx = 2
src_len = 67
lengths = torch.randint(0, 30, (batch_sz,))
samp = GreedySearch(
Collaborator

minor: this is fairly hard to read and would really benefit from kwargs

"-x_transformers",
"--x_transformers",
help="For a complete list of options, see"
" https://github.com/lucidrains/x-transformers/blob/main/x_transformers/x_transformers.py ."
Collaborator

ideally, point to the github @ commit hash for the pinned version?
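
A small sketch of that suggestion (the pin variable is a placeholder, not an existing Mammoth constant): derive the help-text URL from the pinned x-transformers version so the link matches the installed code.

X_TRANSFORMERS_PIN = "<pinned-commit-or-tag>"  # placeholder; the real pin lives in the dependency spec
help_text = (
    "For a complete list of options, see "
    f"https://github.com/lucidrains/x-transformers/blob/{X_TRANSFORMERS_PIN}/"
    "x_transformers/x_transformers.py ."
)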

raise ValueError(
f'"{unsupported_key}" is not supported in Mammoth.'
)
# Mammoth has a different default value than x-transformers,
Collaborator

needs to be in the docs somewhere — perhaps in the help of the corresponding flag in opts?

Collaborator

would be great to have a 102 or whatever showcasing x-transformers config

This flag can be passed through to x-transformers using the
x_transformer_opts dict; there is no need for special handling.
The flag is much less visible now, and a user noticing that
--transformer_ff has been removed will need to dig deep to find it
and learn the changed parameterization.

Not sure where to document this. Maybe we need a new page for migration
tips.
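
As a concrete illustration of the migration tip above (dict and key names follow the commit note and are otherwise assumptions, not verified Mammoth options): x-transformers expresses the feed-forward width as a multiplier of the model dimension rather than as an absolute size, so an old --transformer_ff value maps onto a multiplier in the pass-through dict.

model_dim = 512
x_transformer_opts = {
    # assumed key: x-transformers' feed-forward multiplier; 512 * 4 = 2048,
    # roughly matching an old "--transformer_ff 2048" setting
    "ff_mult": 4,
}
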
@Waino Waino mentioned this pull request Sep 30, 2024
@Waino Waino force-pushed the feat/x-transformers branch from c343807 to 8064500 Compare September 30, 2024 11:48
@Waino Waino merged commit 794bdf8 into main Sep 30, 2024
2 checks passed
@Waino Waino deleted the feat/x-transformers branch September 30, 2024 12:09
Successfully merging this pull request may close these issues: External dependencies for layer architectures.