Question about tokenizer.tokenize #86

Open
lsx0930 opened this issue Jun 26, 2022 · 1 comment

Comments


lsx0930 commented Jun 26, 2022

I've looked at the TF tokenizer code: given a sentence or a single char as input, it returns a single sentence or a single char. The torch tokenizer also takes a sentence or a single char as input, but it returns a list of tokens for the sentence, or a list for the single char.

The critical problem is this: if the input single char is itself an unknown (UNK-type) character, PyTorch's tokenizer.tokenize(char) actually returns an empty list instead of [UNK]. I'm curious why the PyTorch version behaves this way, because it makes it impossible to align the training data x with the labels...
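A minimal sketch of a per-character workaround, assuming the HuggingFace transformers BertTokenizer (the function name tokenize_aligned and the bert-base-chinese checkpoint are only illustrative, not from this repo): tokenize each character separately and substitute [UNK] whenever tokenizer.tokenize() returns an empty list, so the token sequence stays the same length as the label sequence.

```python
from transformers import BertTokenizer

# Checkpoint name is an assumption for illustration.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def tokenize_aligned(chars):
    """Tokenize character by character, keeping a 1:1 mapping to labels."""
    tokens = []
    for ch in chars:
        pieces = tokenizer.tokenize(ch)
        if not pieces:
            # Characters the tokenizer silently drops would otherwise shift
            # x out of alignment with the label sequence, so map them to [UNK].
            tokens.append("[UNK]")
        else:
            # A single character normally yields one piece; keep the first
            # to preserve the 1:1 alignment with the labels.
            tokens.append(pieces[0])
    return tokens
```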


jenfung commented Sep 27, 2022

Did you ever solve this? I can't get them aligned either...
