
Training the fulltext model #1240

Open
martasoricetti opened this issue Jan 30, 2025 · 7 comments

@martasoricetti

Operating System and architecture (arm64, amd64, x86, etc.)

No response

What is your Java version

No response

Log and information

Good morning,
I'm currently developing a tool that, starting from a PDF, returns a JSON file containing its citations, together with each citation's context (the sentence where the in-text reference pointer appears) and the section titles. The starting point for the citation extraction is Grobid, and I'm trying to improve its performance by training the segmentation and, in particular, the fulltext models.
I've developed an evaluation script focused on the output of my tool, and I've obtained the following results for the fulltext model:

  • grobid fulltext default model: F1 score 0.8191;
  • fulltext model trained with my corpus: F1 score 0.7988.

I'm surprised, because the training texts available in the Grobid git repository are just 40, while the training dataset I used is composed of 84 academic papers coming from different disciplines. Furthermore, I suppose that the Grobid base fulltext model is more generic, while the training dataset I used is specific to my needs. I just wanted to know whether the Grobid base model has been trained using only the texts available in the repository under grobid-trainer/resources/dataset/fulltext/corpus, or whether other documents have been used. Thank you
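(For reference, the number of shipped training examples can be checked directly in a Grobid checkout; the path below is as it appears in the 0.8.1 tree, so adjust it if your layout differs:

ls grobid-trainer/resources/dataset/fulltext/corpus/tei | wc -l
)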

Further information

No response

@lfoppiano
Collaborator

Hi @martasoricetti, I haven't done the last training of the fulltext model myself, but if I'm not mistaken, we moved to fully open-access data a while ago.
Meanwhile, could you give some more details on how you evaluated? Did you evaluate the labels of the model itself or the final XML output? Could you also give some information on the composition of your test set?

@kermitt2
Owner

kermitt2 commented Feb 3, 2025

Hello! The fulltext model is the only model trained with some additional non-sharable training data. So you're correct: not all the training data used to create this model is under grobid-trainer/resources/dataset/fulltext/corpus, but we cannot share these annotated full texts. The idea was to move to fully open-access data, but we have not been efficient at all in adding new annotated full texts :)

What you can do @martasoricetti is to use incremental training: start from the existing model and further train it with your additional training data, see https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/#train-and-evaluation-separately-and-using-more-parameters-full-mode - then you should normally get the best of both and improve your evaluation over the current fulltext model.
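For reference, a sketch of the incremental-training invocation on Linux (the -i flag enables incremental training; the version, heap size, and library path are assumptions to adapt to your install):

java -Xmx4096m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep \
  -jar grobid-trainer/build/libs/grobid-trainer-0.8.1-onejar.jar 0 fulltext -gH grobid-home -i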

@lfoppiano
Collaborator

Drat! Sorry for the incorrect information :-) Thanks @kermitt2 for the correction 🙂

@martasoricetti
Author

Thank you for your answers! I've tried to follow your suggestion and use incremental training, but the system crashed. I'm running Grobid on Ubuntu 22.04.5 LTS with the following Java version:
openjdk version "17.0.13" 2024-10-15
OpenJDK Runtime Environment (build 17.0.13+11-Ubuntu-2ubuntu122.04)
OpenJDK 64-Bit Server VM (build 17.0.13+11-Ubuntu-2ubuntu122.04, mixed mode, sharing)

marta@marta-CREFG-XX:~/Scrivania/grobid-0.8.1$ java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-trainer/build/libs/grobid-trainer-0.8.1-onejar.jar 0 fulltext -gH grobid-home -i
path2GbdHome=grobid-home   path2GbdProperties=grobid-home/config/grobid.properties
sourceTEIPathLabel: /home/marta/Scrivania/grobid-0.8.1/grobid-home/../grobid-trainer/resources/dataset/fulltext/corpus/tei
sourceRawPathLabel: /home/marta/Scrivania/grobid-0.8.1/grobid-home/../grobid-trainer/resources/dataset/fulltext/corpus/raw
trainingOutputPath: /home/marta/Scrivania/grobid-0.8.1/grobid-home/tmp/fulltext10577192399332200381.train
evalOutputPath: null
84 tei files
	epsilon: 1.0E-4
	window: 20
	nb max iterations: 1500
	nb threads: 20
	incremental training from: /home/marta/Scrivania/grobid-0.8.1/grobid-home/models/fulltext/model.wapiti
* Load previous model
* Load patterns
* Load training data
* Resync the model
* Summary
    nb train:    84
    nb labels:   23
    nb blocks:   2567369
    nb features: 59049993
* Train the model with l-bfgs
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000070327600a2a1, pid=6514, tid=6629
#
# JRE version: OpenJDK Runtime Environment (17.0.13+11) (build 17.0.13+11-Ubuntu-2ubuntu122.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.13+11-Ubuntu-2ubuntu122.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libwapiti.so+0xa2a1]  grd_subemp+0x61
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/marta/Scrivania/grobid-0.8.1/core.6514)
#
# An error report file with more information is saved as:
# /home/marta/Scrivania/grobid-0.8.1/hs_err_pid6514.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-17
#
Aborted (core dumped)

hs_err_pid6514.log

@martasoricetti
Author

> Hello! The fulltext model is the only model trained with some additional non-sharable training data. […]

Is it possible to know at least the number of texts you used to train the fulltext model?

@lfoppiano
Collaborator

> marta@marta-CREFG-XX:~/Scrivania/grobid-0.8.1$ java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-

How much RAM do you have? I'm not sure it's an OOM (if you run sudo dmesg, you should see something indicating that the process was killed; see an example here), but I would recommend increasing -Xmx1024m by at least three times.
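For example, a quick check for OOM-killer traces (a sketch; the grep patterns match the usual Ubuntu kernel log messages):

sudo dmesg | grep -i -E "out of memory|killed process"

If nothing shows up, rerun the training with a larger heap, e.g. -Xmx4096m instead of -Xmx1024m.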

@martasoricetti
Author

> How much RAM do you have? […] I would recommend increasing -Xmx1024m by at least three times.

My total RAM is 16091040 kB, and I've already tried increasing -Xmx1024m, but the result doesn't change. I also checked dmesg, but I don't see any OOM (out of memory) errors.
