IndoGPT: TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found #6

Open
AbdiHaryadi opened this issue Jan 5, 2024 · 0 comments

I tried to run the example notebook examples/load_indonlg.ipynb to test the IndoGPT model. I ran it on Colaboratory with Python 3.10, which matches the Python version required in setup.py. Here are all of my code cells:

# Cell 1: install the toolkit from source
!git clone https://github.com/indobenchmark/indobenchmark-toolkit.git
!pip install /content/indobenchmark-toolkit/.

# Cell 2: imports
import os, sys
sys.path.append("/content/indobenchmark-toolkit")
import torch
from transformers import GPT2LMHeadModel
from src.indobenchmark import IndoNLGTokenizer
from torch.utils.data import DataLoader

# Cell 3: helper to count (trainable) model parameters
def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())

%%time
# Cell 4: load the pretrained IndoGPT model and its tokenizer
gpt_model = GPT2LMHeadModel.from_pretrained("indobenchmark/indogpt")
gpt_tokenizer = IndoNLGTokenizer.from_pretrained("indobenchmark/indogpt")

# Cell 5: generate from a prompt, then decode the output ids
gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
gpt_out = gpt_model.generate(**gpt_input)
gpt_tokenizer.decode(gpt_out[0])  # <-- Error occurs here.

However, the last cell triggered an error. Here is the full traceback:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

[<ipython-input-6-d94e6e0a3dd3>](https://localhost:8080/#) in <cell line: 3>()
      1 gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
      2 gpt_out = gpt_model.generate(**gpt_input)
----> 3 gpt_tokenizer.decode(gpt_out[0])

3 frames

[/usr/local/lib/python3.10/dist-packages/indobenchmark/tokenization_indonlg.py](https://localhost:8080/#) in decode(self, inputs, skip_special_tokens)
    343 
    344     def decode(self, inputs, skip_special_tokens=False):
--> 345         outputs = super().decode(inputs, skip_special_tokens=skip_special_tokens)
    346         return outputs.replace(' ','').replace('▁', ' ')
    347 

[/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py](https://localhost:8080/#) in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3744         token_ids = to_py_obj(token_ids)
   3745 
-> 3746         return self._decode(
   3747             token_ids=token_ids,
   3748             skip_special_tokens=skip_special_tokens,

[/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py](https://localhost:8080/#) in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens, **kwargs)
   1022                 current_sub_text.append(token)
   1023         if current_sub_text:
-> 1024             sub_texts.append(self.convert_tokens_to_string(current_sub_text))
   1025 
   1026         if spaces_between_special_tokens:

[/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py](https://localhost:8080/#) in convert_tokens_to_string(self, tokens)
    987 
    988     def convert_tokens_to_string(self, tokens: List[str]) -> str:
--> 989         return " ".join(tokens)
    990 
    991     def _decode(

TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found
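
For context, the join that fails is " ".join(tokens) inside convert_tokens_to_string, so at least one decoded token is coming back as a tokenizers.AddedToken object instead of a plain string. For now I can get readable text by decoding manually and coercing each token to str before joining, then applying the same post-processing that IndoNLGTokenizer.decode does. This is only a sketch: convert_ids_to_tokens and str() on an AddedToken (which returns its content) are standard transformers/tokenizers behavior, but I have not verified that it reproduces the tokenizer's intended output exactly.

# Workaround sketch: decode manually, coercing any AddedToken objects to str
# before joining, then mimic the post-processing in tokenization_indonlg.py.
tokens = gpt_tokenizer.convert_ids_to_tokens(gpt_out[0].tolist())
text = " ".join(str(t) for t in tokens)         # str(AddedToken) yields its content
text = text.replace(' ', '').replace('▁', ' ')  # SentencePiece: '▁' marks a word boundary
print(text.strip())
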

I also tried pip install indobenchmark-toolkit instead of cloning from GitHub, but the result was the same as before. Is there any solution for this?
