This repository has been archived by the owner on Aug 1, 2024. It is now read-only.

Query on Alphabet Consistency Across Different Scales of ESM2 Models [8M, 35M, 150M, 650M, 3B, 15B] #668

Open
CNwangbin opened this issue Mar 8, 2024 · 0 comments

Comments

@CNwangbin

I am currently exploring the potential of leveraging the ESM2 series of models for a project involving protein sequence analysis. Given the diversity in the scale of models available, I have a specific question that I hope you could help clarify.
Could you please confirm if all these variants of the ESM2 models use an identical alphabet for encoding protein sequences into tokens? Essentially, I am interested in understanding whether the token sequences generated from the same protein sequence would be identical across these different model scales.
The reason behind this inquiry is to ensure that our preprocessing pipeline remains consistent and compatible when utilizing multiple versions of the ESM2 models for comparative analysis.
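One way to check this empirically is to encode the same sequence with each checkpoint's alphabet and compare the resulting token ids. The following is a minimal sketch, not a confirmed answer from the maintainers. `compare_tokenizations` and `load_esm2_encoders` are hypothetical helper names introduced here. The loader assumes the `fair-esm` package is installed and relies on `esm.pretrained.load_model_and_alphabet` and `Alphabet.encode`; it downloads full model weights, which is impractical for the largest checkpoints.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Encoder = Callable[[str], Iterable[int]]


def compare_tokenizations(
    encoders: Dict[str, Encoder], sequence: str
) -> Tuple[Dict[str, List[int]], bool]:
    """Encode `sequence` with every encoder and report whether all ids match.

    `encoders` maps a label (e.g. a checkpoint name) to a callable that turns
    a protein sequence into token ids. Must contain at least one entry.
    """
    encoded = {name: list(encode(sequence)) for name, encode in encoders.items()}
    reference = next(iter(encoded.values()))
    all_equal = all(ids == reference for ids in encoded.values())
    return encoded, all_equal


def load_esm2_encoders(model_names: Iterable[str]) -> Dict[str, Encoder]:
    """Hypothetical helper: build one encoder per ESM2 checkpoint.

    Assumption: `fair-esm` is installed. Each call to
    `load_model_and_alphabet` downloads that checkpoint's weights.
    """
    import esm  # imported lazily so the comparison helper has no dependency

    encoders: Dict[str, Encoder] = {}
    for name in model_names:
        _model, alphabet = esm.pretrained.load_model_and_alphabet(name)
        # Assumption: `encode` maps residues to ids; special tokens such as
        # BOS/EOS may only be added by the batch converter.
        encoders[name] = alphabet.encode
    return encoders


# Example usage (not executed here because it downloads weights):
# encoders = load_esm2_encoders(["esm2_t6_8M_UR50D", "esm2_t12_35M_UR50D"])
# ids_by_model, identical = compare_tokenizations(encoders, "MKTAYIAKQR")
# print(identical)
```

If the ids differ for any pair of checkpoints, the preprocessing pipeline would need a per-model tokenization step; if they agree on a representative set of sequences, a single shared step should be enough for that set.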
