Question about MQAR eval of Based #29
Comments
Also, unrelated to the Based model, I think I'm a little confused about the MQAR task in general. When the data generator hits

```python
if random_non_queries:
    inputs[inputs == 0] = torch.randint(vocab_size, size=inputs.shape)[inputs == 0]
```

we can see that this has a very high chance of clobbering a KV pair in the "context" section, especially with long sequences such as the test set of based_figure_2. Doesn't this create a significant mismatch between the train and test sets? I don't think this detracts from the idea that linear attention struggles with the task; however, the task then becomes quite different.
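To make the concern concrete, here is a minimal, self-contained sketch (not the actual Zoology generator; the toy layout of keys and values is assumed purely for illustration) of what that fill touches in a zero-padded MQAR-style sequence, and how the filler tokens can collide with key tokens already present in the context:

```python
import torch

# Minimal sketch, not the actual Zoology generator: the key/value layout
# below is a toy assumption just to show what the random fill touches.
torch.manual_seed(0)
vocab_size = 64
seq_len = 32

# Toy context: four key/value pairs at the front, remaining positions 0-padded.
inputs = torch.zeros(seq_len, dtype=torch.long)
kv_pairs = torch.tensor([10, 40, 11, 41, 12, 42, 13, 43])  # k1 v1 k2 v2 ...
inputs[: len(kv_pairs)] = kv_pairs

# The line in question: every 0 position is replaced by a token drawn
# uniformly from the whole vocabulary, including the key/value ranges.
inputs[inputs == 0] = torch.randint(vocab_size, size=inputs.shape)[inputs == 0]

# Filler tokens can now coincide with keys that already appear in the context.
keys = kv_pairs[::2]
collisions = torch.isin(inputs[len(kv_pairs):], keys).sum().item()
print(f"filler positions that collide with a key: {collisions}")
```

Since `torch.randint` draws from the full vocabulary, nothing stops a "non-query" filler position from looking identical to a real key, and the longer the padded region, the more likely that is.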
If you already have a trained model and data saved as .npy files (e.g., from the attention-360 model with the Pile as the dataset) and want to use the Zoology repo to test MQAR results directly on that data, it's a bit complex: Zoology isn't designed for direct compatibility with pre-existing .npy files; it's mainly set up for generating synthetic data.
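If it helps, one rough workaround is to skip Zoology's synthetic data builders entirely and wrap the saved arrays in a plain PyTorch dataset yourself; this is only a sketch, and the file names and array shapes below are hypothetical:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical file names; Zoology has no built-in loader for arbitrary .npy data.
inputs = torch.from_numpy(np.load("pile_eval_inputs.npy")).long()  # (num_seqs, seq_len)
labels = torch.from_numpy(np.load("pile_eval_labels.npy")).long()  # (num_seqs, seq_len)

# Iterate over batches and feed them to the already-trained model's forward pass.
loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=False)
```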
Sorry, yeah, my question isn't about that; it was more about how the synthetic data is being generated, and about the differences between the Based models used for the MQAR eval and for the pretraining eval.
Hey, thanks for the great work. I could be wrong, but I feel like there is a disconnect between what is described in the Based paper and what is used in the Figure 2 config for the MQAR eval. In the paper, Based is presented as an architecture that mixes linear attention layers with sliding-window attention layers; however, in the MQAR experiment configs there appear to be no sliding-window attention layers, just a mix of BaseConv layers and linear attention layers. In the appendix you note that BaseConv can improve performance, which I am assuming is why those layers are used in these experiments. But if it's the case that sliding windows were not used, I'm curious whether you have any ablations on using small sliding windows for MQAR?
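To make concrete what such an ablation would vary, here is a minimal sketch of the causal mask a small sliding-window attention layer applies; the window size is arbitrary, not a value taken from the paper or the configs:

```python
import torch

# Sketch of the masking a small sliding-window attention layer applies.
# Position i may attend to positions [i - window + 1, i] (causal and local).
def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]  # rel[i, j] = i - j
    return (rel >= 0) & (rel < window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
```

The ablation I have in mind would just sweep `window` to see how small it can get before MQAR accuracy drops relative to the BaseConv + linear attention configs.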
Again I really appreciate the work, super cool stuff to think about!