How to set Sudachipy split_mode in the JapaneseTokenizer #12871

ryanheise · 2023-07-29T04:29:50Z

I think the answer in #8027 (comment) is now out of date.

I notice there is a more recent config option called split_mode, so I tried this:

nlp = spacy.load("ja_core_news_lg", config={"nlp.tokenizer.split_mode":"B"})

However, it had no effect and still behaved as if split mode "A" were in effect.

I think I have the correct config key here, but is anyone able to point out what I'm doing wrong?

Also, @polm in #8027 mentioned that:

> since the word vectors and models all depend on the tokenizer mode, I wouldn't expect to be able to use the pretrained models this way.

Hopefully this is also out of date now that the config option is exposed with the intention of allowing you to choose among different split modes?

ryanheise · 2023-07-29T05:02:44Z

Author

OK, so I did some debugging, and the option is definitely being received by the tokenizer, but I noticed that try_sudachi_import is being called twice. The first time it gets split_mode "B" but the second time it gets split_mode None. The first is called from JapaneseTokenizer.__init__ and the second is called from from_disk.

If I edit the spaCy code in from_disk to set split_mode to "B", it works correctly.

Is this a bug? Or if not, how would I set split_mode to "B" without editing the spaCy code directly?


adrianeboyd · 2023-07-31T06:25:40Z

You don't want to change the split_mode in trained pipelines like ja_core_news_lg because none of the pipeline components have been trained with this tokenization, and you can end up with nonsense output in a lot of cases. The split_mode setting is saved with the pipeline and loaded when the tokenizer is loaded from disk.
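
The load order described in this thread (the tokenizer is constructed with the override, then its saved settings are restored from disk) can be modeled in a few lines. This is a toy sketch, not spaCy's code: `ToyTokenizer`, `toy_load`, and the dictionary-based `saved_cfg` are hypothetical stand-ins, and the fallback to "A" when no mode is stored is an assumption based on the behavior reported above.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer whose settings are saved with the pipeline."""

    def __init__(self, split_mode=None):
        # Step 1: the constructor receives the value from the config override.
        self.split_mode = split_mode

    def from_disk(self, saved_cfg):
        # Step 2: loading restores the setting stored with the pipeline,
        # replacing whatever the constructor was given.
        self.split_mode = saved_cfg.get("split_mode")
        return self

    @property
    def effective_mode(self):
        # Assumption: a missing (None) mode behaves like mode "A".
        return self.split_mode or "A"


def toy_load(saved_cfg, overrides):
    """Mimic the two-step load: construct from overrides, then restore from disk."""
    tokenizer = ToyTokenizer(split_mode=overrides.get("split_mode"))
    return tokenizer.from_disk(saved_cfg)


if __name__ == "__main__":
    tokenizer = toy_load({"split_mode": None}, {"split_mode": "B"})
    print(tokenizer.split_mode, tokenizer.effective_mode)
```

In this model the override is received in step 1 but is gone after step 2, which matches the two `try_sudachi_import` calls observed during debugging.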

If you have a new blank pipeline, you can pick whichever split_mode you'd like and train new components using that tokenization:

nlp = spacy.blank("ja", config={"nlp.tokenizer.split_mode":"B"})

4 replies

ryanheise Jul 31, 2023
Author

Thanks. Hmm, the results were looking good enough to me, but I wonder if your comment also applies to retokenization. If I were to leave the default split mode that it was trained on but then retokenize by merging tokens in a fashion that would resemble split mode B, would that also be something that I don't want to do for the same reason?

adrianeboyd Jul 31, 2023

If you retokenize after running the existing pipeline components and you're happy with the results, then that should be fine. What matters in terms of getting the expected output for the trained pipeline components is the tokenization in the doc at the point in the pipeline when they are run.

I mean, in general you can do whatever you want with the tokenization as long as you're happy with the results. After loading the pipeline, you can replace nlp.tokenizer however you'd like and still run the pipeline with the existing pipeline components. (You just can't do this easily through config overrides. The way settings are stored in both config.cfg and the individual saved components is confusing and we're aware of this, but we haven't found a better solution yet.)
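
One way to replace `nlp.tokenizer` after loading is to build a new Japanese tokenizer on the loaded pipeline's vocab and assign it. A minimal sketch, assuming spaCy v3 with SudachiPy and the ja_core_news_lg package installed; `load_with_split_mode` is a hypothetical helper name, not part of spaCy. The caveat from earlier in the thread still applies: the trained components were not trained on this tokenization, so the output needs checking.

```python
def load_with_split_mode(name="ja_core_news_lg", split_mode="B"):
    """Load a trained pipeline, then swap in a tokenizer with another split mode.

    Hypothetical helper. The imports are local so that defining the function
    does not require spaCy to be installed.
    """
    import spacy
    from spacy.lang.ja import JapaneseTokenizer

    nlp = spacy.load(name)
    # Build the new tokenizer on the pipeline's own vocab so that it shares
    # the same string store and vectors as the trained components.
    nlp.tokenizer = JapaneseTokenizer(nlp.vocab, split_mode=split_mode)
    return nlp
```

After `nlp = load_with_split_mode()`, calling `nlp(text)` runs the existing components on the new tokenization, and `nlp.tokenizer.split_mode` can be inspected to confirm which mode is active.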

ryanheise Jul 31, 2023
Author

> What matters in terms of getting the expected output for the trained pipeline components is the tokenization in the doc at the point in the pipeline when they are run.

Would the pipeline be invoked again on retokenize? If it's not, then I'm curious where the default pos and dep tags for the newly merged tokens come from.

Although if it is running the pipeline again, and I'm retokenizing in the same way that split mode B would have done, would that effectively give the same results as if I had replaced the tokenizer with split mode B after loading? Since if they are effectively equivalent approaches, then it would seem that replacing the tokenizer after load would be more efficient than retokenizing.

adrianeboyd Jul 31, 2023

Retokenizing only modifies a Doc; no pipeline components are run. The retokenizer doesn't know anything about which pipeline(s) were involved in creating the doc originally.
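
For the retokenization approach discussed in this thread, one option is to work out which runs of adjacent fine-grained tokens make up each coarser token, then merge those runs. A minimal sketch: `group_fine_tokens` and `merge_groups` are hypothetical helpers, the token lists are hard-coded for illustration rather than produced by a tokenizer, and `merge_groups` assumes it is handed a spaCy `Doc`.

```python
def group_fine_tokens(fine, coarse):
    """Return (start, end) index ranges into `fine` that spell each `coarse` token.

    Both lists must concatenate to the same string, and every coarse token
    must be made of whole fine tokens; otherwise ValueError is raised.
    """
    spans = []
    i = 0
    for word in coarse:
        start = i
        built = ""
        # Consume fine tokens until they cover the coarse token's length.
        while i < len(fine) and len(built) < len(word):
            built += fine[i]
            i += 1
        if built != word:
            raise ValueError(f"cannot align {word!r} with fine tokens at index {start}")
        spans.append((start, i))
    if i != len(fine):
        raise ValueError("fine tokens left over after alignment")
    return spans


def merge_groups(doc, spans):
    """Merge each multi-token range in place using the Doc's retokenizer."""
    with doc.retokenize() as retokenizer:
        for start, end in spans:
            if end - start > 1:
                retokenizer.merge(doc[start:end])
    return doc


if __name__ == "__main__":
    fine = ["選挙", "管理", "委員", "会"]   # illustrative fine-grained split
    coarse = ["選挙", "管理", "委員会"]     # illustrative coarser split
    print(group_fine_tokens(fine, coarse))
```

Because merging only edits the `Doc`, the tagger and parser are not rerun on the merged tokens. As far as spaCy's retokenizer documentation describes it, a merged token takes its attributes from the root of the merged span unless `attrs` are passed to `merge`.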
