IndoGPT: TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found #6

Open
AbdiHaryadi opened this issue Jan 5, 2024 · 0 comments

I tried to run the example notebook examples/load_indonlg.ipynb to test the IndoGPT model. I ran it on Colaboratory with Python 3.10, which matches the Python version required in setup.py. Here are all of my code cells:

# Cell 1: install the toolkit from source
!git clone https://github.com/indobenchmark/indobenchmark-toolkit.git
!pip install /content/indobenchmark-toolkit/.

# Cell 2: imports
import os, sys
sys.path.append("/content/indobenchmark-toolkit")
import torch
from transformers import GPT2LMHeadModel
from src.indobenchmark import IndoNLGTokenizer
from torch.utils.data import DataLoader

# Cell 3: helper to count (trainable) model parameters
def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())

%%time
# Cell 4: load the pretrained IndoGPT model and its tokenizer
gpt_model = GPT2LMHeadModel.from_pretrained("indobenchmark/indogpt")
gpt_tokenizer = IndoNLGTokenizer.from_pretrained("indobenchmark/indogpt")

# Cell 5: generate from a prompt, then decode the output ids
gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
gpt_out = gpt_model.generate(**gpt_input)
gpt_tokenizer.decode(gpt_out[0])  # <-- Error occurs here.

However, the last cell triggered an error. Here is the full traceback:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

[<ipython-input-6-d94e6e0a3dd3>](https://localhost:8080/#) in <cell line: 3>()
      1 gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
      2 gpt_out = gpt_model.generate(**gpt_input)
----> 3 gpt_tokenizer.decode(gpt_out[0])

3 frames

[/usr/local/lib/python3.10/dist-packages/indobenchmark/tokenization_indonlg.py](https://localhost:8080/#) in decode(self, inputs, skip_special_tokens)
    343 
    344     def decode(self, inputs, skip_special_tokens=False):
--> 345         outputs = super().decode(inputs, skip_special_tokens=skip_special_tokens)
    346         return outputs.replace(' ','').replace('▁', ' ')
    347 

[/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py](https://localhost:8080/#) in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3744         token_ids = to_py_obj(token_ids)
   3745 
-> 3746         return self._decode(
   3747             token_ids=token_ids,
   3748             skip_special_tokens=skip_special_tokens,

[/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py](https://localhost:8080/#) in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens, **kwargs)
   1022                 current_sub_text.append(token)
   1023         if current_sub_text:
-> 1024             sub_texts.append(self.convert_tokens_to_string(current_sub_text))
   1025 
   1026         if spaces_between_special_tokens:

[/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py](https://localhost:8080/#) in convert_tokens_to_string(self, tokens)
    987 
    988     def convert_tokens_to_string(self, tokens: List[str]) -> str:
--> 989         return " ".join(tokens)
    990 
    991     def _decode(

TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found
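
For context, the join that fails is " ".join(tokens) inside convert_tokens_to_string, so at least one decoded token is coming back as a tokenizers.AddedToken object instead of a plain string. For now I can get readable text by decoding manually and coercing each token to str before joining, then applying the same post-processing that IndoNLGTokenizer.decode does. This is only a sketch: convert_ids_to_tokens and str() on an AddedToken (which returns its content) are standard transformers/tokenizers behavior, but I have not verified that it reproduces the tokenizer's intended output exactly.

# Workaround sketch: decode manually, coercing any AddedToken objects to str
# before joining, then mimic the post-processing in tokenization_indonlg.py.
tokens = gpt_tokenizer.convert_ids_to_tokens(gpt_out[0].tolist())
text = " ".join(str(t) for t in tokens)         # str(AddedToken) yields its content
text = text.replace(' ', '').replace('▁', ' ')  # SentencePiece: '▁' marks a word boundary
print(text.strip())
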

I also tried pip install indobenchmark-toolkit instead of cloning from GitHub, but the result was the same as before. Is there any solution for this?
