Add entry point for extracting datasets from TEI #4

Draft
wants to merge 47 commits into
base: master
47 commits
5e8ecfd
make gradle build and add github actions
lfoppiano Mar 29, 2024
d545b5d
read grobid-home from configuration
lfoppiano Mar 29, 2024
33648de
disable superfluous tests
lfoppiano Mar 29, 2024
49a07b6
fix build
lfoppiano Mar 29, 2024
5d2872e
add simple test on analyzer to get started
lfoppiano Mar 29, 2024
8bc2987
enable jacoco report
lfoppiano Mar 29, 2024
fd84d88
fix build docker
lfoppiano Mar 29, 2024
ffb5bea
disable docker build for the moment
lfoppiano Mar 29, 2024
bb48f37
add parameter to enable/disable sentence segmentation for TEI processing
lfoppiano Apr 18, 2024
f05f68b
Update docker build (#1)
lfoppiano Apr 26, 2024
981ac95
implement tei processing for datasets
lfoppiano Apr 26, 2024
d668625
fix output JSON streaming
lfoppiano Apr 26, 2024
33d4f13
Merge branch 'master' into add-tei-processing-dataset
lfoppiano May 1, 2024
288850f
add the rest of the processing
lfoppiano May 2, 2024
12dcc37
disable broken tests
lfoppiano May 2, 2024
23c2dd5
add XML JATS entry point
lfoppiano May 2, 2024
0213c78
add CC-BY sample documents
lfoppiano May 2, 2024
52ffc23
revert to the original port
lfoppiano May 2, 2024
4448437
enable TEI processing in UI - javascript joy
lfoppiano May 2, 2024
4aad23d
correct parameter
lfoppiano May 2, 2024
6989335
attach URLs obtained from Grobid's TEI
lfoppiano May 6, 2024
7f0cdd5
fix frontend
lfoppiano May 7, 2024
1c5ff72
fix github action
lfoppiano May 7, 2024
4cd7390
fix wrong ifs - thanks intellij!
lfoppiano May 9, 2024
df86b81
avoid exception when entities are empty
lfoppiano May 9, 2024
843463c
avoid injecting null stuff
lfoppiano May 9, 2024
1b1da5f
reduce the timeout for checking the disambiguation service
lfoppiano May 12, 2024
75dd711
fix the convention for sentence segmentation and enable it
lfoppiano May 20, 2024
758f418
update examples
lfoppiano May 21, 2024
91fe70d
add sequence (sentence, paragraph) identifier in each mention
lfoppiano May 21, 2024
cc1cd2a
Fix sentence switch
lfoppiano May 21, 2024
c58502e
Fix incorrect xpath on children
lfoppiano May 23, 2024
6977bda
Cleanup text when extracting from XML, normalise unicode character, r…
lfoppiano Jun 4, 2024
cc01140
Fix bug in the xpaths that were used wrongly to select sentences or p…
lfoppiano Jun 4, 2024
3c3af44
Try to get possible sections in the <back> in which the das is hidden…
lfoppiano Jun 4, 2024
7b6fe06
update to grobid 0.8.1, and catch up other changes
lfoppiano Sep 14, 2024
2162720
retrieve URLs from the TEI XML in all the sections that are of interest
lfoppiano Oct 13, 2024
a2b5bbb
update github actions
lfoppiano Oct 13, 2024
e3a4890
fix xpath to fall back into div into TEI/back
lfoppiano Oct 13, 2024
371f520
cleanup
lfoppiano Oct 13, 2024
1483aab
fix reference mapping
lfoppiano Oct 13, 2024
4ab67a6
fix references extraction
lfoppiano Oct 14, 2024
774dd78
fix regression
lfoppiano Oct 22, 2024
b18454b
cosmetics
lfoppiano Oct 22, 2024
962f7eb
fix regressions in the way we attach references from TEI
lfoppiano Oct 22, 2024
3b343c6
allow xml:id to be string using a wrapper that generates integer to m…
lfoppiano Jan 1, 2025
f58c493
fix extraction of urls that are not well formed (supplementary-materi…
lfoppiano Jan 2, 2025
65 changes: 65 additions & 0 deletions .github/workflows/ci-build-manual.yml
@@ -0,0 +1,65 @@
name: Build and push a development version on docker

on:
  workflow_dispatch:
    inputs:
      custom_tag:
        type: string
        description: Docker image tag
        required: true
        default: "latest-develop"

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

  docker-build:
    needs: [ build ]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          sudo rm -rf "/usr/local/share/boost"
          sudo rm -rf "$AGENT_TOOLSDIRECTORY"
          sudo rm -rf /opt/hostedtoolcache
          sudo rm -rf /opt/google/chrome
          sudo rm -rf /opt/microsoft/msedge
          sudo rm -rf /opt/microsoft/powershell
          sudo rm -rf /opt/pipx
          sudo rm -rf /usr/lib/mono
          sudo rm -rf /usr/local/julia*
          sudo rm -rf /usr/local/lib/android
          sudo rm -rf /usr/local/lib/node_modules
          sudo rm -rf /usr/local/share/chromium
          sudo rm -rf /usr/local/share/powershell
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /usr/share/swift
      - uses: actions/checkout@v4
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v6
        with:
          dockerfile: Dockerfile.datastet
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          image: lfoppiano/datastet
          registry: docker.io
          pushImage: true
          tags: |
            latest-develop, ${{ github.event.inputs.custom_tag}}
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
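The workflow above only starts by hand through `workflow_dispatch` and takes one required input, `custom_tag`. As a rough sketch, it could be triggered from a terminal with the GitHub CLI. This assumes `gh` is installed and authenticated and that the workflow file is reachable from the repository's default branch. The function name `trigger_dev_build` and the `DRY_RUN` variable are hypothetical, introduced only for illustration.

```shell
# Hypothetical wrapper around `gh workflow run` for ci-build-manual.yml.
# $1 is the value passed to the `custom_tag` input (defaults to the
# workflow's own default, "latest-develop").
trigger_dev_build() {
  tag="${1:-latest-develop}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry run (assumption for this sketch): print the command only.
    printf 'gh workflow run ci-build-manual.yml -f custom_tag=%s\n' "$tag"
  else
    # -f passes a string input to the workflow_dispatch event.
    gh workflow run ci-build-manual.yml -f "custom_tag=$tag"
  fi
}

DRY_RUN=1
trigger_dev_build my-feature
# prints: gh workflow run ci-build-manual.yml -f custom_tag=my-feature
```

With `DRY_RUN` unset, the same call would dispatch the workflow for real, so the dry-run switch is a convenient way to check the command first.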
71 changes: 71 additions & 0 deletions .github/workflows/ci-build.yml
@@ -0,0 +1,71 @@
name: Build unstable

on: [push]

concurrency:
  group: gradle
  # cancel-in-progress: true


jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

      - name: Test with Gradle Jacoco and Coveralls
        run: ./gradlew test jacocoTestReport coveralls --no-daemon

      - name: Coveralls GitHub Action
        uses: coverallsapp/github-action@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          format: jacoco

  docker-build:
    needs: [ build ]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          sudo rm -rf "/usr/local/share/boost"
          sudo rm -rf "$AGENT_TOOLSDIRECTORY"
          sudo rm -rf /opt/hostedtoolcache
          sudo rm -rf /opt/google/chrome
          sudo rm -rf /opt/microsoft/msedge
          sudo rm -rf /opt/microsoft/powershell
          sudo rm -rf /opt/pipx
          sudo rm -rf /usr/lib/mono
          sudo rm -rf /usr/local/julia*
          sudo rm -rf /usr/local/lib/android
          sudo rm -rf /usr/local/lib/node_modules
          sudo rm -rf /usr/local/share/chromium
          sudo rm -rf /usr/local/share/powershell
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /usr/share/swift
      - uses: actions/checkout@v4
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v6
        with:
          dockerfile: Dockerfile.datastet
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          image: lfoppiano/datastet
          registry: docker.io
          pushImage: ${{ github.event_name != 'pull_request' }}
          tags: latest-develop
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
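The `Create more disk space` step in the workflow above spells out one `rm -rf` per directory and lists `/usr/share/dotnet` twice. The repeat is harmless, but a loop over a path list makes it easier to spot. The following is only a sketch of that idea, not the workflow's actual step: `free_space` is a hypothetical helper, it omits `sudo`, and it is exercised here on a throwaway temporary directory rather than on the runner's real paths.

```shell
# Hypothetical helper: remove each listed path once, skipping duplicates,
# and report only the paths that actually existed.
free_space() {
  printf '%s\n' "$@" | awk '!seen[$0]++' | while IFS= read -r path; do
    if [ -e "$path" ]; then
      rm -rf -- "$path"
      printf 'removed %s\n' "$path"
    fi
  done
}

# Demo on a scratch directory (stand-ins for the runner's tool directories).
demo="$(mktemp -d)"
mkdir -p "$demo/dotnet" "$demo/ghc"
free_space "$demo/dotnet" "$demo/ghc" "$demo/dotnet" "$demo/missing"
```

The demo prints one `removed` line for each of the two directories that exist. The duplicate and the missing path are skipped silently.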
32 changes: 32 additions & 0 deletions .github/workflows/ci-integration-manual.yml
@@ -0,0 +1,32 @@
name: Run integration tests manually

on:
  # push:
  #   branches:
  #     - master
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout grobid home
        uses: actions/checkout@v4
        with:
          repository: kermitt2/grobid
          path: ./grobid
      - name: Checkout Datastet
        uses: actions/checkout@v4
        with:
          path: ./grobid/datastet
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build and run integration tests
        working-directory: ./grobid/datastet
        run: ./gradlew copyModels integration --no-daemon

74 changes: 74 additions & 0 deletions .github/workflows/ci-release.yml
@@ -0,0 +1,74 @@
name: Build release

on:
  workflow_dispatch:
  push:
    tags:
      - 'v*'

concurrency:
  group: docker
  cancel-in-progress: true


jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

      - name: Test with Gradle Jacoco and Coveralls
        run: ./gradlew test jacocoTestReport coveralls --no-daemon

      - name: Coveralls GitHub Action
        uses: coverallsapp/github-action@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          format: jacoco


  docker-build:
    needs: [build]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
      - name: Set tags
        id: set_tags
        run: |
          DOCKER_IMAGE=lfoppiano/datastet
          VERSION=""
          if [[ $GITHUB_REF == refs/tags/v* ]]; then
            VERSION=${GITHUB_REF#refs/tags/v}
          fi
          if [[ $VERSION =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
            TAGS="${VERSION}"
          else
            TAGS="latest"
          fi
          echo "TAGS=${TAGS}"
          echo ::set-output name=tags::${TAGS}
      - uses: actions/checkout@v4
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v6
        with:
          dockerfile: Dockerfile.local
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          image: lfoppiano/datastet
          registry: docker.io
          pushImage: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.set_tags.outputs.tags }}
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
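The `Set tags` step maps the Git ref to a Docker tag: a ref of the form `refs/tags/vX.Y.Z` yields `X.Y.Z`, and anything else falls back to `latest`. The sketch below restates that mapping as a standalone POSIX `sh` function so it can be tried outside a runner. `derive_tag` is a hypothetical name, and the `case`/`grep -E` form is a port of the step's bash `[[ ... ]]` tests, not the PR's code. It also writes the result through the `$GITHUB_OUTPUT` file, which GitHub documents as the replacement for the deprecated `::set-output` command, but only when that variable is set.

```shell
# Hypothetical helper mirroring the "Set tags" step: Git ref in, Docker tag out.
derive_tag() {
  ref="$1"
  version=""
  case "$ref" in
    # Strip the "refs/tags/v" prefix from version tags.
    refs/tags/v*) version="${ref#refs/tags/v}" ;;
  esac
  # Accept only plain MAJOR.MINOR.PATCH versions; everything else is "latest".
  if printf '%s\n' "$version" | grep -Eq '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$'; then
    printf '%s\n' "$version"
  else
    printf '%s\n' "latest"
  fi
}

derive_tag "refs/tags/v0.8.1"       # prints: 0.8.1
derive_tag "refs/tags/v0.8.1-rc1"   # prints: latest
derive_tag "refs/heads/master"      # prints: latest

# Inside a workflow step, publish the value as the step output "tags".
if [ -n "${GITHUB_OUTPUT:-}" ]; then
  printf 'tags=%s\n' "$(derive_tag "${GITHUB_REF:-}")" >> "$GITHUB_OUTPUT"
fi
```

Note that a pre-release tag such as `v0.8.1-rc1` does not match the version pattern, so it would be pushed as `latest`, just as in the original step.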