
Question about MQAR eval of Based #29

Open
Hprairie opened this issue Oct 1, 2024 · 3 comments

Hprairie commented Oct 1, 2024

Hey, thanks for the great work. I could be wrong, but I feel like there is a disconnect between what is described in the Based paper and what is used in the Figure 2 config for the MQAR eval. In the paper, Based is presented as an architecture that mixes linear attention layers with sliding window attention layers; however, the MQAR experiment configs appear to contain no sliding window attention layers, just a mix of BaseConv layers and linear attention layers. In the appendix you note that BaseConv can improve performance, which I assume is why it is used in these experiments. However, if sliding windows were indeed not used, I'm curious whether you have any ablations on using small sliding windows for MQAR.
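
Just to make the ablation I'm asking about concrete, here is a minimal, generic sketch of a small sliding-window causal attention mask. This is not the Based or Zoology implementation; the sequence length and window size are illustrative assumptions.

    import torch

    def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
        # True where position i may attend to position j: causal (j <= i)
        # and within the last `window_size` positions.
        i = torch.arange(seq_len).unsqueeze(1)
        j = torch.arange(seq_len).unsqueeze(0)
        return (j <= i) & (i - j < window_size)

    # Window of 3: each token sees itself and the two previous tokens.
    print(sliding_window_causal_mask(seq_len=8, window_size=3).int())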

Again I really appreciate the work, super cool stuff to think about!

Hprairie commented Oct 2, 2024

Also, unrelated to the Based model, I'm a little confused about the MQAR task in general when random_non_queries=True is passed. The implementation does the following:

    if random_non_queries:
        inputs[inputs == 0] = torch.randint(vocab_size, size=inputs.shape)[inputs == 0]

This has a very high chance of colliding with a key-value pair in the "context" section, especially with long sequences such as the test set of based_figure_2. Doesn't this create a significant mismatch between the train and test sets? I don't think this detracts from the idea that linear attention struggles with the task; however, the task becomes quite different.
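
As a rough illustration of the concern (this is a toy sketch, not the Zoology generator), the following fills the zero positions of a fake MQAR-style sequence with random vocabulary tokens, as in the snippet above, and counts how many of the filled tokens coincide with keys already present in the context. The vocabulary size, sequence length, and key/value ranges are illustrative assumptions.

    import torch

    vocab_size = 8192
    seq_len = 1024
    num_kv_pairs = 64

    torch.manual_seed(0)

    # Fake "context": num_kv_pairs key/value tokens followed by zero padding.
    # Keys come from the lower half of the vocab, values from the upper half,
    # purely for illustration.
    keys = torch.randperm(vocab_size // 2 - 1)[:num_kv_pairs] + 1
    values = torch.randint(vocab_size // 2, vocab_size, (num_kv_pairs,))
    inputs = torch.zeros(seq_len, dtype=torch.long)
    inputs[: 2 * num_kv_pairs] = torch.stack([keys, values], dim=1).flatten()

    # The step in question: fill zero (non-query) positions with random tokens.
    rand = torch.randint(vocab_size, size=inputs.shape)
    inputs[inputs == 0] = rand[inputs == 0]

    # How many filled positions now duplicate a key from the context?
    filled = inputs[2 * num_kv_pairs :]
    collisions = torch.isin(filled, keys).sum().item()
    print(f"{collisions} of {filled.numel()} filled tokens duplicate a context key")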

@AsadMir10

If you already have a trained model and data saved as .npy files (e.g., from the attention-360 model with the Pile as the dataset) and want to use the Zoology repo to test MQAR results directly on that data, it's a bit involved, as Zoology isn't designed for direct compatibility with pre-existing .npy files; it's mainly set up for generating synthetic data.
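
One generic workaround, outside Zoology's own data pipeline (which this sketch does not use), is to wrap the pre-existing .npy arrays in a plain PyTorch dataset and run your trained model over that loader directly. The file names and array contents below are hypothetical.

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Hypothetical files: token inputs and target labels saved as .npy arrays.
    inputs = torch.from_numpy(np.load("mqar_inputs.npy")).long()
    targets = torch.from_numpy(np.load("mqar_targets.npy")).long()

    loader = DataLoader(TensorDataset(inputs, targets), batch_size=32)
    for batch_inputs, batch_targets in loader:
        # Run the trained model on batch_inputs and score against batch_targets.
        pass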

@Hprairie

Sorry, yeah, my question isn't about that; it was more about how the synthetic data is generated, and about the differences between the Based models used for the MQAR eval and for the pretraining eval.
