KeyError: 'lemma' #48

Open
Bachstelze opened this issue May 26, 2022 · 4 comments

Comments

@Bachstelze

Following the code from https://trankit.readthedocs.io/en/latest/training.html#training-a-lemmatizer, I get a KeyError: 'lemma':

Setting up training config...
Initialized lemmatizer trainer
Training dictionary-based lemmatizer

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-9-a90867cc5ef3> in <module>()
     11 
     12 # start training
---> 13 trainer.train()

3 frames

/content/trankit/trankit/tpipeline.py in train(self)
    680             self._train_posdep()
    681         elif self._task == 'lemmatize':
--> 682             self._train_lemma()
    683         elif self._task == 'ner':
    684             self._train_ner()

/content/trankit/trankit/tpipeline.py in _train_lemma(self)
    581 
    582     def _train_lemma(self):
--> 583         self._lemma_model.train()
    584 
    585     def _train_ner(self):

/content/trankit/trankit/models/lemma_model.py in train(self)
    379             self.config.logger.info("Training dictionary-based lemmatizer")
    380             self.trainer.train_dict(
--> 381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
    383                          type(token[ID]) == tuple and len(token[ID]) == 2)])

/content/trankit/trankit/models/lemma_model.py in <listcomp>(.0)
    381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
--> 383                          type(token[ID]) == tuple and len(token[ID]) == 2)])
    384             dev_preds = self.trainer.predict_dict(
    385                 [[token[TEXT], token[UPOS]] for sentence in self.dev_batch.doc for token in sentence if

KeyError: 'lemma'

The latest version of https://github.com/UniversalDependencies/UD_Thai-PUD is used as training and development data.
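Since the traceback fails on token[LEMMA], one quick thing to check is whether the treebank actually carries lemmas. The following is a minimal sketch, not part of Trankit: lemma_stats is a hypothetical helper that reads CoNLL-U text and counts regular tokens whose LEMMA column (the third column) is filled or left as "_". It assumes the standard 10-column, tab-separated layout and skips multiword token ranges, as the Trankit code in the traceback does, as well as empty nodes.

```python
from collections import Counter


def lemma_stats(conllu_text):
    """Count regular tokens with and without a usable LEMMA value.

    Hypothetical helper for inspecting a treebank; not a Trankit function.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            counts["malformed"] += 1  # not a standard CoNLL-U token line
            continue
        token_id, form, lemma = cols[0], cols[1], cols[2]
        if "-" in token_id or "." in token_id:
            continue  # multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        # "_" marks an unspecified lemma, unless the word form itself is "_"
        if lemma == "" or (lemma == "_" and form != "_"):
            counts["missing"] += 1
        else:
            counts["present"] += 1
    return counts


sample = (
    "# text = a b\n"
    "1\ta\ta\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "2\tb\t_\tVERB\t_\t_\t1\tdep\t_\t_\n"
    "\n"
)
stats = lemma_stats(sample)
print(stats["present"], stats["missing"])
```

To check a real file, read it with open(path, encoding="utf-8").read() and pass the text in. If "missing" covers all or most tokens, the treebank has no lemma annotation to train on.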

@Bachstelze
Author

There are no lemmas in the training data, so a lemmatizer can't be trained?! Can't I use the other parts of the pipeline?
When I run

from trankit import Pipeline
p = Pipeline(lang='customized', cache_dir='./save_dir')

the following error occurs:

BadZipFile: File is not a zip file
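BadZipFile is raised by Python's zipfile module when a file it is asked to open is not a valid archive. The thread does not show which file Trankit tries to open here, so the following is only a generic diagnostic sketch. find_invalid_zips is a hypothetical helper, and it assumes the offending file has a .zip extension somewhere under the cache directory.

```python
import zipfile
from pathlib import Path


def find_invalid_zips(root):
    """Return files under root that end in .zip but are not valid zip archives.

    Hypothetical diagnostic helper; not a Trankit function.
    """
    return sorted(
        path
        for path in Path(root).rglob("*.zip")
        if path.is_file() and not zipfile.is_zipfile(path)
    )


if __name__ == "__main__":
    # './save_dir' is the cache_dir used in the snippet above
    for bad in find_invalid_zips("./save_dir"):
        print("not a valid zip archive:", bad)
```

A file listed here may be a truncated download or an unrelated file saved under a .zip name. An empty result means the cause lies elsewhere.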

@gcelano

gcelano commented Aug 26, 2024

I get the same error when trying to train the lemmatizer:

Setting up training config...
Initialized lemmatizer trainer
Training dictionary-based lemmatizer
Traceback (most recent call last):
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/custom_train00.py", line 15, in <module>
    trainer.train()
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/tpipeline.py", line 683, in train
    self._train_lemma()
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/tpipeline.py", line 584, in _train_lemma
    self._lemma_model.train()
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/models/lemma_model.py", line 381, in train
    [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
  File "/home/celano/Documents/parser_Ancient_Greek_Latin/trankit-master-lemmatizer/trankit/models/lemma_model.py", line 381, in <listcomp>
    [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
KeyError: 'lemma'

@GioDH18

GioDH18 commented Jan 5, 2025

I am also getting this error, even though the .conllu file I am loading has the lemmas in the second column, as I think should be expected. Has anyone found a solution to this error? Is it a problem with the training data or Trankit itself?

@GioDH18

GioDH18 commented Jan 5, 2025

Never mind, it appears that the lemmatization pipeline has issues handling "_" in the lemma slot of CoNLL-U files. I ended up just deleting the affected sentences from the data. I don't know if that is the same issue others have faced, but I hope this helps!
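The workaround described above (dropping sentences that have "_" in the lemma slot) could be scripted roughly as follows. This is a sketch under assumptions, not the commenter's actual script: the function names are hypothetical, the input is assumed to be standard 10-column CoNLL-U, and multiword token ranges and empty nodes are ignored when checking lemmas.

```python
def split_sentences(conllu_text):
    """Yield each sentence as a list of lines (comments and token lines)."""
    block = []
    for line in conllu_text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def has_missing_lemma(block):
    """True if any regular token in the sentence has an unspecified lemma."""
    for line in block:
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # leave malformed lines to a proper validator
        token_id, form, lemma = cols[0], cols[1], cols[2]
        if "-" in token_id or "." in token_id:
            continue  # multiword ranges and empty nodes
        if lemma == "" or (lemma == "_" and form != "_"):
            return True
    return False


def drop_sentences_without_lemmas(conllu_text):
    """Return CoNLL-U text with only the fully lemmatized sentences kept."""
    kept = [b for b in split_sentences(conllu_text) if not has_missing_lemma(b)]
    return "".join("\n".join(b) + "\n\n" for b in kept)


sample = (
    "# sent_id = 1\n1\tdogs\tdog\tNOUN\t_\t_\t0\troot\t_\t_\n\n"
    "# sent_id = 2\n1\tcats\t_\tNOUN\t_\t_\t0\troot\t_\t_\n\n"
)
filtered = drop_sentences_without_lemmas(sample)
print(filtered)
```

Dropping whole sentences rather than single tokens keeps the remaining token IDs and dependency heads consistent, at the cost of losing some training data.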
