
Token indices sequence length is longer than the specified maximum sequence length for this model (587 > 512). Running this sequence through the model will result in indexing errors #60

Open
LinaDongXMU opened this issue Jan 10, 2025 · 4 comments

Comments

@LinaDongXMU

Hello authors,
Thanks for building rxnmapper for atom-atom mapping assignments. When testing on a large set of data items, I found the following error:

Some weights of the model checkpoint at /miniconda3/envs/rxnmapper/lib/python3.6/site-packages/rxnmapper/models/transformers/albert_heads_8_uspto_all_1310k were not used when initializing AlbertModel: ['predictions.decoder.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.dense.weight', 'predictions.decoder.bias', 'predictions.LayerNorm.weight', 'predictions.LayerNorm.bias']

  • This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Token indices sequence length is longer than the specified maximum sequence length for this model (587 > 512). Running this sequence through the model will result in indexing errors

Here is one example item that produced this error:
CC(C)CC@HC(=O)NC@@HC(=O)NC@@HC(=O)NC@HC(C)C>>CC@HC(=O)O

I need your help to get atom-atom mapping on such items. I'm looking forward to your answer, and thanks!

@LinaDongXMU
Author

CC(C)C[C@H](NC(=O)C@HNC(=O)C@HNC(=O)[C@@H](NC(=O)C@HNC(=O)[C@H](Cc1cnc[nH]1)NC(=O)C@HNC(=O)CNC(=O)C@HNC(=O)[C@H](CC(C)C)NC(=O)C@HNC(=O)C@HNC(=O)C@HNC(=O)[C@@H](NC(=O)C@@HCc1ccccc1)C(C)C)C(C)C)C(=O)NC@@HC(=O)NC@@HC(=O)N[C@H](C(=O)NC@@HC(=O)NCC(=O)NC@@HC(=O)NC@@HC(=O)NCC(=O)NC@@HC(=O)NC@@HC(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)NC@@HC(=O)NC@@HC(=O)O)C@@HO)C(C)C>>C[C@H](NC(=O)C@HNC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)C@HNC(=O)[C@H](Cc1ccccc1)NC(=O)C@HNC(=O)CNC(=O)C@HNC(=O)C(N)CCC(=O)O)C@@HO)C(=O)O

@avaucher
Member

Hi @LinaDongXMU,

The warning about some model weights not being used is fine; you can ignore it.

The first SMILES you posted, CC(C)C[C@H]C(=O)N[C@@H]C(=O)N[C@@H]C(=O)N[C@H]C(C)C>>C[C@H]C(=O)O, works for me. Can you post the code you used?

The second one (the long one) has more tokens (587) than the model can handle (maximum 512). A new model with a longer context would need to be trained for that.
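To catch this case before running the model, the token count can be estimated up front. A minimal sketch, using the widely used SMILES tokenization regex from the Molecular Transformer line of work; rxnmapper's actual tokenizer and special-token handling may differ slightly, so treat the count as an approximation:

```python
import re

# Common SMILES tokenization regex; rxnmapper's own tokenizer may count
# slightly differently, so this is an approximation.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def count_tokens(rxn_smiles: str) -> int:
    """Approximate the number of model tokens in a reaction SMILES."""
    return len(SMILES_REGEX.findall(rxn_smiles))

def fits_in_model(rxn_smiles: str, max_len: int = 512, n_special: int = 2) -> bool:
    """Check whether the sequence likely fits, leaving room for the
    special tokens the model adds around the input."""
    return count_tokens(rxn_smiles) + n_special <= max_len
```

A reaction for which `fits_in_model` returns `False` would trigger exactly the indexing warning quoted above.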

@LinaDongXMU
Author

Thanks for answering. The first SMILES is actually the same as the second one; I don't know why it was not fully displayed, so I sent it again. About the new model you mentioned: do you mean I would need to provide longer training data, retrain a new model, and then test with it? In fact, I have many data items longer than 512 tokens (they all give the error shown above) and they are not mapped. I don't know how to get them mapped; can you give me some suggestions?

@avaucher
Member

Indeed, a new model would need to be retrained, with a longer context window.

To map long SMILES strings, some alternatives (off the top of my head) could be:

  • use other atom-mapping software
  • cut off parts of the reactants that you know are not involved in the reaction, to make it fit into 512 tokens
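The second suggestion can be automated as a pre-filter that splits a dataset into reactions that fit the context window and reactions needing other treatment. A sketch, again using an approximate regex tokenizer (an assumption, not rxnmapper's exact tokenizer):

```python
import re

# Approximate SMILES tokenizer; rxnmapper's real tokenizer may count
# slightly differently.
TOKEN_RE = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def partition_by_length(reactions, max_len=512, margin=2):
    """Split reactions into (mappable, too_long) by approximate token
    count, reserving `margin` slots for special tokens."""
    mappable, too_long = [], []
    for rxn in reactions:
        if len(TOKEN_RE.findall(rxn)) + margin <= max_len:
            mappable.append(rxn)
        else:
            too_long.append(rxn)
    return mappable, too_long
```

Only the `mappable` bucket would then be passed to rxnmapper; the `too_long` bucket needs trimming of uninvolved fragments or a different atom-mapping tool.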
