Training the fulltext model #1240
Hi @martasoricetti, I haven't done the last training for the fulltext model. If I'm not mistaken, we moved to fully open-access training data a while ago.
Hello! The fulltext model is the only model trained with some additional non-shareable training data, so not all of the training data used to create this model is openly available. What you can do, @martasoricetti, is use incremental training: start from the existing model and further train it with your additional training data, see https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/#train-and-evaluation-separately-and-using-more-parameters-full-mode - then you will normally get the best of both, and your evaluation should improve over the current fulltext model.
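For readers following the linked documentation page, the incremental-training invocation can be sketched roughly as below. The version number, heap size, and the incremental flag (`-i`) are assumptions here, not verified against a specific GROBID release; check the linked page for the exact syntax of your version.

```shell
# Sketch only: assemble the trainer command for incremental fulltext training.
# GROBID_VERSION and the -i (incremental) flag are assumptions; verify them
# against the GROBID training documentation for your installed version.
GROBID_VERSION=0.8.0
CMD="java -Xmx4g -jar grobid-trainer/build/libs/grobid-trainer-${GROBID_VERSION}-onejar.jar 0 fulltext -gH grobid-home -i"
echo "$CMD"   # inspect the command, then run it from the GROBID root directory
```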
Drat! Sorry for the incorrect information :-) Thanks @kermitt2 for the correction 🙂
Thank you for your answers! I've tried to follow your suggestion and use incremental training, but the system crashed. I'm running Grobid on Ubuntu 22.04.5 LTS, with the following version of Java: openjdk version "17.0.13" 2024-10-15.
Is it possible to know just the number of texts you used for training the fulltext model?
How much RAM do you have? I'm not sure it is an OOM (if you run sudo dmesg, you should see something that might indicate the process was killed, see an example here), but I would recommend increasing the memory allocated to the JVM (the -Xmx value).
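The dmesg check suggested above can be done with a grep like the following. Since a real kernel log needs root on a live machine, the `printf` line below is a hypothetical sample log line standing in for the `sudo dmesg` output:

```shell
# On the real machine you would run:
#   sudo dmesg | grep -iE 'out of memory|killed process'
# Here a hypothetical sample line stands in for the dmesg output.
printf 'Out of memory: Killed process 12345 (java)\n' \
  | grep -iE 'out of memory|killed process'
```

No output from the grep on the real log means the kernel OOM killer did not terminate the process, which points to a crash cause other than system memory exhaustion.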
My total RAM is 16091040 kB, and I've already tried increasing the -Xmx value beyond 1024m, but the result doesn't change. I also checked dmesg, but I don't see any OOM (Out of Memory) errors.
Operating System and architecture (arm64, amd64, x86, etc.)
No response
What is your Java version
No response
Log and information
Good morning,
I'm currently working on developing a tool that, starting from a PDF, returns a JSON file containing its citations, together with the citations' contexts (the sentence where the in-text reference pointer appears) and the section titles. The extraction of the citations starts from Grobid, and I'm trying to improve its performance by training the segmentation and fulltext models in particular. I've developed a script focused on the output of my tool to evaluate its performance, and I've obtained the following results for the fulltext model:
I'm surprised because there are only 40 training texts available in the Grobid git repository, while the training dataset I used is composed of 84 academic papers from different disciplines. Furthermore, I suppose the base fulltext Grobid model is more generic, while the training dataset I used is specific to my needs. I just wanted to know whether the base Grobid model has been trained using only the texts available in the repository in grobid-trainer/resources/fulltext/corpus, or whether other documents have been used. Thank you

Further information
No response
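The citation-context extraction described in the question (in-text reference pointer, its enclosing sentence, and the section title) can be sketched against GROBID's TEI output. This assumes GROBID was run with sentence segmentation enabled, so sentences are wrapped in TEI `<s>` elements; the element names and the `ref type="bibr"` convention follow GROBID's TEI output, but treat this as an illustrative sketch, not the asker's actual tool:

```python
# Sketch: extract (section title, citing sentence, reference target) triples
# from GROBID TEI XML. Assumes sentence segmentation was enabled, so each
# sentence is a TEI <s> element, and in-text reference pointers are
# <ref type="bibr" target="#bNN"> elements.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace used by GROBID

def citation_contexts(tei_xml: str):
    """Yield (section_title, sentence_text, ref_target) per in-text citation."""
    root = ET.fromstring(tei_xml)
    for div in root.iter(TEI + "div"):
        head = div.find(TEI + "head")
        section = "".join(head.itertext()).strip() if head is not None else ""
        for sentence in div.iter(TEI + "s"):
            text = "".join(sentence.itertext()).strip()
            for ref in sentence.iter(TEI + "ref"):
                if ref.get("type") == "bibr":
                    yield section, text, ref.get("target")
```

For example, a `<div>` headed "Introduction" containing the sentence `Prior work <ref type="bibr" target="#b0">[1]</ref> exists.` yields the triple `("Introduction", "Prior work [1] exists.", "#b0")`, which can then be serialized to JSON.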