
Training the fulltext model #1240

Open
martasoricetti opened this issue Jan 30, 2025 · 7 comments

@martasoricetti

Operating System and architecture (arm64, amd64, x86, etc.)

No response

What is your Java version

No response

Log and information

Good morning,
I'm currently developing a tool that, starting from a PDF, returns a JSON file containing its citations, together with each citation's context (the sentence where the in-text reference pointer appears) and the section titles. The starting point for the citation extraction is Grobid, and I'm trying to improve its performance by training the segmentation and, in particular, the fulltext models.
I've developed an evaluation script focused on the output of my tool, and I've obtained the following results for the fulltext model:

  • grobid fulltext default model: F1 score 0.8191;
  • fulltext model trained with my corpus: F1 score 0.7988.

I'm surprised, because the training texts available in the Grobid git repository are just 40, while the training dataset I used is composed of 84 academic papers coming from different disciplines. Furthermore, I suppose that the Grobid base fulltext model is more generic, while the training dataset I used is specific to my needs. I just wanted to know whether the Grobid base model has been trained using only the texts available in the repository under grobid-trainer/resources/dataset/fulltext/corpus, or whether other documents have been used. Thank you
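(For reference, the number of shipped training examples can be checked directly in a Grobid checkout; the path below is as it appears in the 0.8.1 tree, so adjust it if your layout differs:

ls grobid-trainer/resources/dataset/fulltext/corpus/tei | wc -l
)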

Further information

No response

@lfoppiano
Collaborator

Hi @martasoricetti, I haven't done the last training of the fulltext model myself, but if I'm not mistaken, we moved to fully open-access data a while ago.
Meanwhile, could you give some more details on how you evaluated? Did you evaluate the labels of the model itself or the final XML output? Could you also give some information on the composition of your test set?

@kermitt2
Owner

kermitt2 commented Feb 3, 2025

Hello! The fulltext model is the only model trained with some additional non-sharable training data. So you're correct: not all the training data used to create this model is under grobid-trainer/resources/dataset/fulltext/corpus, but we cannot share these annotated full texts. The idea was to move to fully open-access data, but we have not been efficient at all in adding new annotated full texts :)

What you can do @martasoricetti is to use incremental training: start from the existing model and further train it with your additional training data, see https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/#train-and-evaluation-separately-and-using-more-parameters-full-mode - then you should normally get the best of both and improve your evaluation over the current fulltext model.
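For reference, a sketch of the incremental-training invocation on Linux (the -i flag enables incremental training; the version, heap size, and library path are assumptions to adapt to your install):

java -Xmx4096m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep \
  -jar grobid-trainer/build/libs/grobid-trainer-0.8.1-onejar.jar 0 fulltext -gH grobid-home -i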

@lfoppiano
Collaborator

Drat! Sorry for the incorrect information :-) Thanks @kermitt2 for the correction 🙂

@martasoricetti
Author

Thank you for your answers! I've tried to follow your suggestion and use incremental training, but the system crashed. I'm running Grobid on Ubuntu 22.04.5 LTS with the following Java version:
openjdk version "17.0.13" 2024-10-15
OpenJDK Runtime Environment (build 17.0.13+11-Ubuntu-2ubuntu122.04)
OpenJDK 64-Bit Server VM (build 17.0.13+11-Ubuntu-2ubuntu122.04, mixed mode, sharing)

marta@marta-CREFG-XX:~/Scrivania/grobid-0.8.1$ java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-trainer/build/libs/grobid-trainer-0.8.1-onejar.jar 0 fulltext -gH grobid-home -i
path2GbdHome=grobid-home   path2GbdProperties=grobid-home/config/grobid.properties
sourceTEIPathLabel: /home/marta/Scrivania/grobid-0.8.1/grobid-home/../grobid-trainer/resources/dataset/fulltext/corpus/tei
sourceRawPathLabel: /home/marta/Scrivania/grobid-0.8.1/grobid-home/../grobid-trainer/resources/dataset/fulltext/corpus/raw
trainingOutputPath: /home/marta/Scrivania/grobid-0.8.1/grobid-home/tmp/fulltext10577192399332200381.train
evalOutputPath: null
84 tei files
	epsilon: 1.0E-4
	window: 20
	nb max iterations: 1500
	nb threads: 20
	incremental training from: /home/marta/Scrivania/grobid-0.8.1/grobid-home/models/fulltext/model.wapiti
* Load previous model
* Load patterns
* Load training data
* Resync the model
* Summary
    nb train:    84
    nb labels:   23
    nb blocks:   2567369
    nb features: 59049993
* Train the model with l-bfgs
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000070327600a2a1, pid=6514, tid=6629
#
# JRE version: OpenJDK Runtime Environment (17.0.13+11) (build 17.0.13+11-Ubuntu-2ubuntu122.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.13+11-Ubuntu-2ubuntu122.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libwapiti.so+0xa2a1]  grd_subemp+0x61
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/marta/Scrivania/grobid-0.8.1/core.6514)
#
# An error report file with more information is saved as:
# /home/marta/Scrivania/grobid-0.8.1/hs_err_pid6514.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-17
#
Aborted (core dumped)

hs_err_pid6514.log

@martasoricetti
Author

> Hello! The fulltext model is the only model trained with some additional non-sharable training data. […]

Is it possible to know at least the number of texts you used to train the fulltext model?

@lfoppiano
Collaborator

> marta@marta-CREFG-XX:~/Scrivania/grobid-0.8.1$ java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-

How much RAM do you have? I'm not sure it's an OOM (if you run sudo dmesg, you should see something indicating that the process was killed; see an example here), but I would recommend increasing -Xmx1024m by at least three times.
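For example, a quick check for OOM-killer traces (a sketch; the grep patterns match the usual Ubuntu kernel log messages):

sudo dmesg | grep -i -E "out of memory|killed process"

If nothing shows up, rerun the training with a larger heap, e.g. -Xmx4096m instead of -Xmx1024m.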

@martasoricetti
Author

> How much RAM do you have? […] I would recommend increasing -Xmx1024m by at least three times.

My total RAM is 16091040 kB, and I've already tried increasing -Xmx1024m, but the result doesn't change. I also checked dmesg, but I don't see any OOM (out of memory) errors.
