en_core_web_trf doesn't train when using the same custom .spacy that works with en_core_web_lg #13278
Unanswered
Meiling-Sun
asked this question in
Help: Coding & Implementations
I created train.spacy from my custom data. train.spacy has 50 docs, but each doc has more than 60,000 tokens, because the annotation is at the document level. I successfully trained the en_core_web_lg model with this train.spacy. But when I use the same train.spacy file with the en_core_web_trf model, even though there is no error, it looks like the model didn't do anything. Is there a maximum number of tokens per doc in en_core_web_trf? Or what else could cause this kind of failure? The output looks as follows, and there is no model-best output.
[2024-01-26 19:01:59,830] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory:
/scratch/global_1/msun/output_gpu_acc_chunk10
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
[2024-01-26 19:02:02,738] [INFO] Set up nlp object from config
[2024-01-26 19:02:02,748] [DEBUG] Loading corpus from path: ../spacy_files/half_half/test.spacy
[2024-01-26 19:02:02,748] [DEBUG] Loading corpus from path: ../spacy_files/half_half/train.spacy
[2024-01-26 19:02:02,749] [INFO] Pipeline: ['transformer', 'ner']
[2024-01-26 19:02:02,749] [INFO] Resuming training for: ['ner', 'transformer']
[2024-01-26 19:02:02,753] [INFO] Created vocabulary
[2024-01-26 19:02:02,754] [INFO] Finished initializing nlp object
[2024-01-26 19:02:02,754] [INFO] Initialized pipeline components: []
✔ Initialized pipeline
============================= Training pipeline =============================
[2024-01-26 19:02:02,763] [DEBUG] Loading corpus from path: ../spacy_files/half_half/test.spacy
[2024-01-26 19:02:02,764] [DEBUG] Loading corpus from path: ../spacy_files/half_half/train.spacy
[2024-01-26 19:02:02,813] [DEBUG] Removed existing output directory: /scratch/global_1/msun/output_gpu_acc_chunk10/model-last
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E # LOSS TRANS... LOSS NER ENTS_F ENTS_P ENTS_R SCORE
✔ Saved pipeline to output directory
/scratch/global_1/msun/output_gpu_acc_chunk10/model-last
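On the max-length question: as far as I can tell, en_core_web_trf has no hard per-doc token limit, because the transformer component splits each doc into overlapping strided spans internally before passing them to the model. Long docs should therefore be accepted, though they can exhaust GPU memory. The relevant section of the trf config looks roughly like this (the values shown are what I believe the spacy-transformers defaults are, not verified against this exact pipeline):

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```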
The `spacy debug data` results are as follows:
============================ Data file validation ============================
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Pipeline can be initialized with data
✔ Corpus is loadable
=============================== Training stats ===============================
Language: en
Training pipeline: transformer, ner
50 training docs
52 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (50)
============================== Vocab & Vectors ==============================
ℹ 503866 total word(s) in the data (40884 unique)
ℹ No word vectors present in the package
========================== Named Entity Recognition ==========================
ℹ 3 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries
================================== Summary ==================================
✔ 7 checks passed
✘ 1 error
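In case it helps, here is how I could pre-chunk the long docs into shorter ones before training. This is a minimal sketch: the file names and the 512-token window are assumptions, and entities that cross a chunk boundary would be dropped by `Span.as_doc`, so sentence-aligned splits may be preferable.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Load the existing document-level annotations
# (file names here are assumptions, not the real paths).
db_in = DocBin().from_disk("train.spacy")
db_out = DocBin()

WINDOW = 512  # tokens per chunk; an arbitrary illustrative value

for doc in db_in.get_docs(nlp.vocab):
    for start in range(0, len(doc), WINDOW):
        span = doc[start:start + WINDOW]
        # Span.as_doc() copies only the entities that fall entirely
        # inside the span; entities crossing a chunk boundary are lost.
        db_out.add(span.as_doc())

db_out.to_disk("train_chunked.spacy")
```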