Update benchmark dataset #11

Merged
merged 56 commits into from
Dec 10, 2024
a2c2492
first commit
doncamilom Nov 29, 2024
94c258d
split - working
doncamilom Nov 29, 2024
d2a3056
add task class
doncamilom Nov 29, 2024
ca2e86a
add evaluator, and logger. update task
doncamilom Nov 29, 2024
9d53610
add basetype and update task to load from graph
doncamilom Dec 2, 2024
637f7b1
format
doncamilom Dec 2, 2024
439eeca
update tree metrics calc
doncamilom Dec 2, 2024
fd60893
add basetypes: every model for this task should output like this
doncamilom Dec 2, 2024
3ea8d68
formatting
doncamilom Dec 2, 2024
ea2a3aa
update docs test
doncamilom Dec 2, 2024
4cb6c3f
move files around
doncamilom Dec 2, 2024
e6725da
add metrics class
doncamilom Dec 2, 2024
de2cdde
add class for metrics and doccstrs
doncamilom Dec 2, 2024
2ba75ef
update actions
doncamilom Dec 2, 2024
68e7c8a
done updateing classes. some things to fix
doncamilom Dec 2, 2024
0a01309
adding tests
doncamilom Dec 3, 2024
13cade0
fixing metrics to pass tests
doncamilom Dec 3, 2024
18e802d
make testing bit faster
doncamilom Dec 3, 2024
6c09190
fix iso test
doncamilom Dec 3, 2024
937ab0a
fix prune function test
doncamilom Dec 3, 2024
f2375c7
fix another test <> metric
doncamilom Dec 3, 2024
be311a8
fix test - exact match
doncamilom Dec 3, 2024
ed74372
fixed one more test
doncamilom Dec 3, 2024
6934e39
finished tests
doncamilom Dec 3, 2024
32382ca
clean up
doncamilom Dec 3, 2024
6ec8674
code to evaluate pre-run methods
doncamilom Dec 3, 2024
2e41f77
add preprocessing to compare graphs based on names. maybe will have t…
doncamilom Dec 3, 2024
6a52876
add preprocessing step into Task - ensures graphs can be compared
doncamilom Dec 4, 2024
47e9376
add wandb reporting into GOSyBench calss
doncamilom Dec 4, 2024
6a99c06
update results notebook
doncamilom Dec 4, 2024
67ef05e
remove stuff from api
doncamilom Dec 4, 2024
749c9bd
rm graph preprocess from task
doncamilom Dec 4, 2024
06ff0b6
linting
doncamilom Dec 4, 2024
c3b8a66
rm unnecesary
doncamilom Dec 4, 2024
39d11b1
clean out the stuff from jasyntho in api.py
doncamilom Dec 4, 2024
459301c
add optimized solver to find long path with smiles (+ tests)
doncamilom Dec 6, 2024
8d6a5b7
api jasyntho TODO
doncamilom Dec 6, 2024
a3d8342
format
doncamilom Dec 6, 2024
b65f9e6
rm extra function
doncamilom Dec 6, 2024
fbff8d9
scrip to calculate stats of dataset. simple call to packages api
doncamilom Dec 6, 2024
4ade52b
fix calc max smiles path
doncamilom Dec 6, 2024
44c6da4
WIP adding the ground truth graphs
doncamilom Dec 6, 2024
fdc70bc
Merge branch 'main' of https://github.com/schwallergroup/syn2act into…
doncamilom Dec 6, 2024
68b945a
update smiles for one paper
doncamilom Dec 6, 2024
38ef800
update smiles
doncamilom Dec 6, 2024
3eebd1b
update package -> transition to gosybench
doncamilom Dec 9, 2024
b7b28dd
test
doncamilom Dec 9, 2024
d7b8911
update workflows to install pkg
doncamilom Dec 9, 2024
80d7225
update workflow
doncamilom Dec 9, 2024
2df8dfe
update workflows
doncamilom Dec 9, 2024
ad8d70f
hm
doncamilom Dec 9, 2024
a0025dd
s
doncamilom Dec 9, 2024
996a8f1
Update README.md
doncamilom Dec 10, 2024
e3acb29
update names
doncamilom Dec 10, 2024
5eb3f95
update txo
doncamilom Dec 10, 2024
e3da370
update version
doncamilom Dec 10, 2024
7 changes: 6 additions & 1 deletion .github/workflows/tests.yml
@@ -71,7 +71,12 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
-        run: pip install tox
+        run:
+          pip install tox
+      - name: Install test dependencies
+        run: pip install .[tests]
+      - name: Install jasyntho dependencies
+        run: pip install .[jasyntho]
       - name: Test with pytest and generate coverage file
         run:
           tox run -e py
6 changes: 3 additions & 3 deletions CITATION.cff
@@ -1,8 +1,8 @@
 cff-version: 1.0.2
 message: "If you use this software, please cite it as below."
-title: "jasyntho"
+title: "gosybench"
 authors:
   - name: "Andres M Bran"
-version: 0.0.1-dev
+version: 0.0.1
 doi:
-url: "https://github.com/schwallergroup/jasyntho"
+url: "https://github.com/schwallergroup/gosybench"
26 changes: 0 additions & 26 deletions MANIFEST.in

This file was deleted.

199 changes: 98 additions & 101 deletions README.md
@@ -1,128 +1,139 @@
<!--
<p align="center">
<img src="https://github.com/schwallergroup/jasyntho/raw/main/docs/source/logo.png" height="150">
<img src="https://github.com/schwallergroup/gosybench/raw/main/docs/source/logo.png" height="150">
</p>
-->

<h1 align="center">
jasyntho
GOSyBench
</h1>


[![tests](https://github.com/schwallergroup/jasyntho/actions/workflows/tests.yml/badge.svg)](https://github.com/schwallergroup/jasyntho)
[![DOI:10.1101/2020.07.15.204701](https://zenodo.org/badge/DOI/10.48550/arXiv.2304.05376.svg)](https://doi.org/10.48550/arXiv.2304.05376)
[![PyPI](https://img.shields.io/pypi/v/jasyntho)](https://img.shields.io/pypi/v/jasyntho)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/jasyntho)](https://img.shields.io/pypi/pyversions/jasyntho)
[![Documentation Status](https://readthedocs.org/projects/jasyntho/badge/?version=latest)](https://jasyntho.readthedocs.io/en/latest/?badge=latest)
[![tests](https://github.com/schwallergroup/gosybench/actions/workflows/tests.yml/badge.svg)](https://github.com/schwallergroup/gosybench)
[![DOI:10.18653/v1/2024.langmol-1.9](https://zenodo.org/badge/DOI/10.18653/v1/2024.langmol-1.9.svg)](https://aclanthology.org/2024.langmol-1.9/)
[![PyPI](https://img.shields.io/pypi/v/gosybench)](https://img.shields.io/pypi/v/gosybench)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/gosybench)](https://img.shields.io/pypi/pyversions/gosybench)
[![Documentation Status](https://readthedocs.org/projects/gosybench/badge/?version=latest)](https://gosybench.readthedocs.io/en/latest/?badge=latest)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Cookiecutter template from @SchwallerGroup](https://img.shields.io/badge/Cookiecutter-schwallergroup-blue)](https://github.com/schwallergroup/liac-repo)
[![Learn more @SchwallerGroup](https://img.shields.io/badge/Learn%20%0Amore-schwallergroup-blue)](https://schwallergroup.github.io)


A library for extraction of implicit scientific insights from total synthesis documents.
A benchmark for Knowledge Graph Extraction from Total Synthesis documents.

## 💪 Getting Started

Extracting the full synthetic sequence from a paper's SI

```python
from jasyntho import SynthTree
from gosybench.basetypes import STree
from gosybench.evaluate import GOSyBench
from gosybench.metrics import GraphEval, TreeMetrics

doc_src = 'tests/examples/synth_SI_sub.pdf' # Src doc is typically an SI
stree = SynthTree(doc_src, OPENAI_API_KEY) # Extract data and create synthetic tree

mtree = stree.merged_trees # Synthetic sequence
def test_method(path: str) -> STree:
    # Define your method for KGE here.
    return STree(products=[], graph=nx.DiGraph())

gosybench = GOSyBench(
    project="my-eval",
    describe=TreeMetrics(),
    metrics=GraphEval(),
)

# TODO: Create visualization
print(mtree)
# Evaluate
gosybench.evaluate(test_method)
```

## 🚀 Installation

The most recent code and data can be installed directly from GitHub with:

```bash
21
├── 22
│   ├── S1
│   │   ├── cyclohexane
│   │   └── MeMgBr
│   ├── HBr
│   ├── DCM
...
$ pip install git+https://github.com/schwallergroup/gosybench.git
```

Running segmentation of a single synthesis paragraph
Optionally, you can install **Jasyntho**, our package for KGE.

```python
from jasyntho.segment import SegFlanT5

paragraph = (
"To a rapidly stirred solution of saturated aqueous ammonium hydroxide (50 mL) and ice in a 0 deg. C. bath was added "
"2,4-dichloro-5-nitropyrimidine (6.0 g, 31 mmol) in portions. The resulting yellow foamy mixture was allowed to stir "
"for 30 min, at which point the precipitate was isolated by filtration. The solid was rinsed several times with ice-cold "
"water and once with ice cold ethanol to give a peach-colored solid. The crude solid was purified by adsorption onto 18 g "
"silica gel, followed by silica gel chromatography, eluting with 0-20% MeOH/dichloromethane to give "
"2-chloro-5-nitropyrimidin-4-amine as an off-white solid. MS (ES+): 175 (M+H)+; Calc. for C4H3ClN4O2=174.55."
)
```bash
$ pip install "git+https://github.com/schwallergroup/gosybench.git#egg=gosybench[jasyntho]"
```

segment = SegFlanT5()
segm_prg = segment(paragraph)

print(segm_prg)
```
---

Produces
## 🚀 Advanced Usage

<details>
<summary>See advanced usage.</summary>
<br>


## 🌱 Jasyntho

Jasyntho is a package for Knowledge Graph Extraction of Total Syntheses.
It relies on LLMs for some core functionalities.

Make sure to create an `.env` file with the API keys of the LLM providers you want to use:
```bash
[
{
'text segment': "'To a rapidly stirred solution of saturated aqueous ammonium hydroxide (50 mL) and ice in a 0 deg. C. bath was added 2,4-dichloro-5-nitropyrimidine (6.0 g, 31 mmol) in portions. The resulting yellow foamy mixture was allowed to stir for 30 min, at which point the precipitate was isolated by filtration.'",
'text class': 'reaction set-up',
'step order': '1'
},
{
'text segment': "'The solid was rinsed several times with ice-cold water and once with ice cold ethanol to give a peach-colored solid.'",
'text class': 'work-up',
'step order': '2'
},
{
'text segment': "'The crude solid was purified by adsorption onto 18 g silica gel, followed by silica gel chromatography, eluting with 0-20% MeOH/dichloromethane to give 2-chloro-5-nitropyrimidin-4-amine as an off-white solid.'",
'text class': 'purification',
'step order': '3'
},
{
'text segment': "'MS (ES+): 175 (M+H)+; Calc. for C4H3ClN4O2=174.55.'",
'text class': 'analysis',
'step order': '4'
}
]
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

## 🚀 Installation

<!-- Uncomment this section after your first ``tox -e finish``
The most recent release can be installed from
[PyPI](https://pypi.org/project/jasyntho/) with:
Download the paper you want to extract in a directory like this

```shell
$ pip install jasyntho
```bash
jacs.9b12546
├── doi.txt
├── paper.pdf
└── si_0.pdf
```
-->

The most recent code and data can be installed directly from GitHub with:
```paper.pdf``` is the main article, and ```si_0.pdf``` is the Supplementary Information of that article.

```bash
$ pip install git+https://github.com/schwallergroup/jasyntho.git
Then, use Jasyntho like:

```python

from jasyntho import SynthTree

tree = SynthTree.from_dir(path)
tree.rxn_extract = ExtractReaction(llm=model)

tree.raw_prods = await tree.async_extract_rss(
    mode=method, si_select=si_select
)
tree.products = [p for p in tree.raw_prods if not p.isempty()]
tree.full_g = tree.get_full_graph(tree.products)
```
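
Note that the snippet above calls `await` at module level, which only works in an async-aware shell such as a Jupyter notebook, and it leaves `path`, `model`, `method`, `si_select`, and `ExtractReaction` for the caller to define or import. In a plain script, the call would have to live inside a coroutine driven by `asyncio.run`. The sketch below shows that pattern only: `fake_extract` is a hypothetical stand-in for `tree.async_extract_rss` (its real signature and return type are not shown in this diff), so nothing here depends on Jasyntho being installed.

```python
import asyncio


async def fake_extract(mode: str, si_select: bool) -> list[str]:
    """Hypothetical stand-in for `tree.async_extract_rss` (not the real API)."""
    await asyncio.sleep(0)  # yield to the event loop once, as an I/O call would
    # `si_select` is ignored by this stub. One entry is empty on purpose,
    # to exercise the filtering step in `run_extraction`.
    return [f"{mode}-product-1", "", f"{mode}-product-2"]


async def run_extraction() -> list[str]:
    # In a script, `await` must appear inside a coroutine such as this one.
    raw_prods = await fake_extract(mode="text", si_select=True)
    # Mirrors the README's `if not p.isempty()` filter; here "empty" is "".
    return [p for p in raw_prods if p]


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs the coroutine, and closes the loop.
    print(asyncio.run(run_extraction()))
```

With the real package, the body of `run_extraction` would presumably hold the `SynthTree` calls from the README snippet instead of the stub.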


### Command Line Interface
</details>

The jasyntho command line tool is automatically installed. It can
be used from the shell with the `--help` flag to show all subcommands:

```shell
$ jasyntho --help
## ✅ Citation

Andres M Bran, Zlatko Jončev, and Philippe Schwaller. 2024. Knowledge Graph Extraction from Total Synthesis Documents. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 74–84, Bangkok, Thailand. Association for Computational Linguistics.
```bibtex
@inproceedings{m-bran-etal-2024-knowledge,
title = "Knowledge Graph Extraction from Total Synthesis Documents",
author = "M Bran, Andres and Jon{\v{c}}ev, Zlatko and Schwaller, Philippe",
booktitle = "Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)",
year = "2024",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.langmol-1.9",
doi = "10.18653/v1/2024.langmol-1.9",
pages = "74--84",
}
```

> TODO show the most useful thing the CLI does! The CLI will have documentation auto-generated
> by `sphinx`.










## 🛠️ For Developers
@@ -134,28 +145,14 @@ $ jasyntho --help
 ## 👐 Contributing

 Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See
-[CONTRIBUTING.md](https://github.com/schwallergroup/jasyntho/blob/master/.github/CONTRIBUTING.md) for more information on getting involved.
+[CONTRIBUTING.md](https://github.com/schwallergroup/gosybench/blob/master/.github/CONTRIBUTING.md) for more information on getting involved.

 ## 👋 Attribution

 ### ⚖️ License

 The code in this package is licensed under the MIT License.

-<!--
-### 📖 Citation
-
-Citation goes here!
--->
-
-<!--
-### 🎁 Support
-
-This project has been supported by the following organizations (in alphabetical order):
-
-- [Harvard Program in Therapeutic Science - Laboratory of Systems Pharmacology](https://hits.harvard.edu/the-program/laboratory-of-systems-pharmacology/)
-
--->

 <!--
 ### 💰 Funding
@@ -185,8 +182,8 @@ The final section of the README is for if you want to get involved by making a c
 To install in development mode, use the following:

 ```bash
-$ git clone git+https://github.com/schwallergroup/jasyntho.git
-$ cd jasyntho
+$ git clone git+https://github.com/schwallergroup/gosybench.git
+$ cd gosybench
 $ pip install -e .
 ```

@@ -199,15 +196,15 @@ run reproducibly with:
 $ tox
 ```

-Additionally, these tests are automatically re-run with each commit in a [GitHub Action](https://github.com/schwallergroup/jasyntho/actions?query=workflow%3ATests).
+Additionally, these tests are automatically re-run with each commit in a [GitHub Action](https://github.com/schwallergroup/gosybench/actions?query=workflow%3ATests).

 ### 📖 Building the Documentation

 The documentation can be built locally using the following:

 ```shell
-$ git clone git+https://github.com/schwallergroup/jasyntho.git
-$ cd jasyntho
+$ git clone git+https://github.com/schwallergroup/gosybench.git
+$ cd gosybench
 $ tox -e docs
 $ open docs/build/html/index.html
 ```
@@ -230,7 +227,7 @@ $ tox -e finish
 This script does the following:

 1. Uses [Bump2Version](https://github.com/c4urself/bump2version) to switch the version number in the `setup.cfg`,
-`src/jasyntho/version.py`, and [`docs/source/conf.py`](docs/source/conf.py) to not have the `-dev` suffix
+`src/gosybench/version.py`, and [`docs/source/conf.py`](docs/source/conf.py) to not have the `-dev` suffix
 2. Packages the code in both a tar archive and a wheel using [`build`](https://github.com/pypa/build)
 3. Uploads to PyPI using [`twine`](https://github.com/pypa/twine). Be sure to have a `.pypirc` file configured to avoid the need for manual input at this
 step
19 changes: 19 additions & 0 deletions scripts/benchmarks/dataset_stats.py
@@ -0,0 +1,19 @@
+"""Compute statistics for the benchmark dataset."""
+
+from gosybench.evaluate import GOSyBench
+from gosybench.logger import setup_logger
+from gosybench.metrics import TreeMetrics
+
+logger = setup_logger(__package__)
+
+
+def main():
+    gosybench = GOSyBench(
+        project="GOSyBench-stats",
+        describe=TreeMetrics(),
+    )
+    gosybench.evaluate(None)
+
+
+if __name__ == "__main__":
+    main()
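
The script above delegates all computation to `TreeMetrics`, whose implementation is not part of this diff. As a rough illustration of the kind of statistic a tree-metrics step might report, the sketch below computes the longest root-to-leaf path of a synthesis graph using only the standard library. The function name `longest_path_length` and the adjacency-dict representation are assumptions made for this example and are not gosybench's API; the toy graph reuses the labels from the tree printed in the old README. The sketch assumes the graph is acyclic, since a cycle would recurse without end.

```python
from functools import lru_cache

# Toy synthesis graph (hypothetical data layout): each product maps to the
# substances it is made from. Labels follow the tree in the old README.
GRAPH = {
    "21": ["22"],
    "22": ["S1", "HBr", "DCM"],
    "S1": ["cyclohexane", "MeMgBr"],
}


def longest_path_length(graph: dict[str, list[str]], root: str) -> int:
    """Return the number of edges on the longest path from `root` to a leaf.

    Assumes `graph` is acyclic; this is an illustration, not gosybench code.
    """

    @lru_cache(maxsize=None)
    def depth(node: str) -> int:
        children = graph.get(node, [])
        if not children:
            return 0  # a leaf (no recorded precursors) ends the path
        return 1 + max(depth(child) for child in children)

    return depth(root)


if __name__ == "__main__":
    print(longest_path_length(GRAPH, "21"))
```

Memoizing `depth` keeps the traversal linear in the number of edges even when an intermediate is shared by several products.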