
Token indices sequence length is longer than the specified maximum sequence length for this model (587 > 512). Running this sequence through the model will result in indexing errors #60

Open
LinaDongXMU opened this issue Jan 10, 2025 · 4 comments

Comments

@LinaDongXMU

Hello authors,
Thanks for building rxnmapper for atom-atom mapping assignments. When testing on a large set of data items, I found the following error:

Some weights of the model checkpoint at /miniconda3/envs/rxnmapper/lib/python3.6/site-packages/rxnmapper/models/transformers/albert_heads_8_uspto_all_1310k were not used when initializing AlbertModel: ['predictions.decoder.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.dense.weight', 'predictions.decoder.bias', 'predictions.LayerNorm.weight', 'predictions.LayerNorm.bias']

  • This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Token indices sequence length is longer than the specified maximum sequence length for this model (587 > 512). Running this sequence through the model will result in indexing errors

Here is one example item that produced this error:
CC(C)CC@HC(=O)NC@@HC(=O)NC@@HC(=O)NC@HC(C)C>>CC@HC(=O)O

I need your help to get atom-atom mapping on such items. I'm looking forward to your answer, and thanks!

@LinaDongXMU
Author

CC(C)C[C@H](NC(=O)C@HNC(=O)C@HNC(=O)[C@@H](NC(=O)C@HNC(=O)[C@H](Cc1cnc[nH]1)NC(=O)C@HNC(=O)CNC(=O)C@HNC(=O)[C@H](CC(C)C)NC(=O)C@HNC(=O)C@HNC(=O)C@HNC(=O)[C@@H](NC(=O)C@@HCc1ccccc1)C(C)C)C(C)C)C(=O)NC@@HC(=O)NC@@HC(=O)N[C@H](C(=O)NC@@HC(=O)NCC(=O)NC@@HC(=O)NC@@HC(=O)NCC(=O)NC@@HC(=O)NC@@HC(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)NC@@HC(=O)NC@@HC(=O)O)C@@HO)C(C)C>>C[C@H](NC(=O)C@HNC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)C@HNC(=O)[C@H](Cc1ccccc1)NC(=O)C@HNC(=O)CNC(=O)C@HNC(=O)C(N)CCC(=O)O)C@@HO)C(=O)O

@avaucher
Member

Hi @LinaDongXMU,

The warning about some model weights not being used is fine; you can ignore it.

The first SMILES you posted, CC(C)C[C@H]C(=O)N[C@@H]C(=O)N[C@@H]C(=O)N[C@H]C(C)C>>C[C@H]C(=O)O, works for me. Can you post the code you used?

The second one (the long one) has more tokens (587) than the model can handle (maximum 512). A new model with a longer context would need to be trained for that.
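To catch this case before running the model, the token count can be estimated up front. A minimal sketch, using the widely used SMILES tokenization regex from the Molecular Transformer line of work; rxnmapper's actual tokenizer and special-token handling may differ slightly, so treat the count as an approximation:

```python
import re

# Common SMILES tokenization regex; rxnmapper's own tokenizer may count
# slightly differently, so this is an approximation.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def count_tokens(rxn_smiles: str) -> int:
    """Approximate the number of model tokens in a reaction SMILES."""
    return len(SMILES_REGEX.findall(rxn_smiles))

def fits_in_model(rxn_smiles: str, max_len: int = 512, n_special: int = 2) -> bool:
    """Check whether the sequence likely fits, leaving room for the
    special tokens the model adds around the input."""
    return count_tokens(rxn_smiles) + n_special <= max_len
```

A reaction for which `fits_in_model` returns `False` would trigger exactly the indexing warning quoted above.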

@LinaDongXMU
Author

Thanks for answering. The first SMILES is actually the same as the second one; I don't know why it was not fully displayed, so I sent it again. About the new model you mentioned: do you mean I would need to provide longer training data, retrain a new model, and then test with it? In fact, I have many data items longer than 512 tokens (they all give the error shown above) and they are not mapped. I don't know how to get them mapped; can you give me some suggestions?

@avaucher
Member

Indeed, a new model would need to be retrained, with a longer context window.

To map long SMILES strings, some alternatives (off the top of my head) could be:

  • use other atom-mapping software
  • cut off parts of the reactants that you know are not involved in the reaction, to make it fit into 512 tokens
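The second suggestion can be automated as a pre-filter that splits a dataset into reactions that fit the context window and reactions needing other treatment. A sketch, again using an approximate regex tokenizer (an assumption, not rxnmapper's exact tokenizer):

```python
import re

# Approximate SMILES tokenizer; rxnmapper's real tokenizer may count
# slightly differently.
TOKEN_RE = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def partition_by_length(reactions, max_len=512, margin=2):
    """Split reactions into (mappable, too_long) by approximate token
    count, reserving `margin` slots for special tokens."""
    mappable, too_long = [], []
    for rxn in reactions:
        if len(TOKEN_RE.findall(rxn)) + margin <= max_len:
            mappable.append(rxn)
        else:
            too_long.append(rxn)
    return mappable, too_long
```

Only the `mappable` bucket would then be passed to rxnmapper; the `too_long` bucket needs trimming of uninvolved fragments or a different atom-mapping tool.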
