Showing 14 changed files with 916 additions and 66 deletions.
@@ -5,4 +5,5 @@ matplotlib
 gensim
 pandas
 regex
-bump-my-version
+bump-my-version
+supermat
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,4 +1,3 @@
-# transform tei annotation into prodigy annotations
 import argparse
 import os
 from html import escape
@@ -0,0 +1,240 @@
import argparse
import csv
import os
from pathlib import Path

from bs4 import BeautifulSoup, Tag

from supermat.supermat_tei_parser import get_children_list_grouped

paragraph_id = 'paragraph_id'


def write_on_file(fw, filename, sentenceText, dic_token):
    links = len([token for token in dic_token if token[5] != '_'])
    has_links = 0 if links == 0 else 1
    fw.writerow([filename, sentenceText, has_links])


def process_file(finput):
    filename = Path(finput).name.split(".superconductors")[0]
    with open(finput, encoding='utf-8') as fp:
        doc = fp.read()

    # mod_tags = re.finditer(r'(</\w+>) ', doc)
    # for mod in mod_tags:
    #     doc = doc.replace(mod.group(), ' ' + mod.group(1))
    # print(doc)
    soup = BeautifulSoup(doc, 'xml')

    paragraphs_grouped = get_children_list_grouped(soup)

    # First pass: index every annotated entity. Entities carrying an 'xml:id' are
    # possible link destinations; entities carrying a 'corresp' are link sources
    # pointing at those ids.
    dic_dest_relationships = {}
    dic_source_relationships = {}
    ient = 1
    i = 0
    for para_id, paragraph in enumerate(paragraphs_grouped):
        for sent_id, sentence in enumerate(paragraph):
            j = 0
            for item in sentence.contents:
                if type(item) is Tag:
                    if 'type' not in item.attrs:
                        raise Exception("RS without type is invalid. Stopping")
                    entity_class = item.attrs['type']
                    entity_text = item.text

                    if len(item.attrs) > 0:
                        if 'xml:id' in item.attrs:
                            if item.attrs['xml:id'] not in dic_dest_relationships:
                                dic_dest_relationships[item.attrs['xml:id']] = [i + 1, j + 1, ient, entity_text,
                                                                                entity_class, para_id, sent_id]

                        if 'corresp' in item.attrs:
                            if (i + 1, j + 1) not in dic_source_relationships:
                                dic_source_relationships[i + 1, j + 1] = [item.attrs['corresp'].replace('#', ''), ient,
                                                                          entity_text, entity_class, para_id, sent_id]
                    j += 1
                    ient += 1
            i += 1

    output = []
    output_idx = []

    # Columns of the output table. 'mapping' records, for each label, which output
    # rows already contain a given entity, so that linked entities can be merged.
    struct = {
        'id': None,
        'filename': None,
        'passage_id': None,
        'material': None,
        'tcValue': None,
        'pressure': None,
        'me_method': None,
        'sentence': None
    }
    mapping = {}

    for label in list(struct.keys()):
        if label not in mapping:
            mapping[label] = {}

    # Second pass: resolve each 'corresp' link against the indexed destinations and
    # collect one tabular row per group of linked entities.
    for par_num, token_num in dic_source_relationships:
        source_item = dic_source_relationships[par_num, token_num]
        source_entity_id = source_item[1]
        source_id = str(par_num) + '-' + str(token_num)
        source_text = source_item[2]
        source_label = source_item[3]

        # destination_xml_id: Use this to pick up information from dic_dest_relationship
        destination_xml_id = source_item[0]

        for des in destination_xml_id.split(","):
            destination_item = dic_dest_relationships[str(des)]

            destination_id = destination_item[2]
            destination_text = destination_item[3]
            destination_label = destination_item[4]
            destination_para = destination_item[5]
            destination_sent = destination_item[6]
            if destination_label != label:
                continue

            # try:
            #     relationship_name = get_relationship_name(source_label, destination_label)
            # except Exception as e:
            #     return []

            if source_label not in mapping:
                mapping[source_label] = {}

            if destination_id in mapping[destination_label]:
                indexes_in_output_table = mapping[destination_label][destination_id]
                for index_in_output_table in indexes_in_output_table:
                    if source_label in output[index_in_output_table]:
                        row_copy = {key: value for key, value in output[index_in_output_table].items()}
                        row_copy[destination_label] = destination_text
                        row_copy[source_label] = source_text
                        row_copy['filename'] = filename
                        row_copy[paragraph_id] = destination_para
                        output.append(row_copy)
                        # output.append({destination_label: destination_text, source_label: source_text})
                    else:
                        output[index_in_output_table][source_label] = source_text
            elif source_entity_id in mapping[source_label]:
                indexes_in_output_table = mapping[source_label][source_entity_id]
                for index_in_output_table in indexes_in_output_table:
                    if destination_label in output[index_in_output_table]:
                        # output.append({destination_label: destination_text, source_label: source_text})
                        # if source_label in output[index_in_output_table]:
                        #     output.append({destination_label: destination_text, source_label: source_text})
                        # else:
                        row_copy = {key: value for key, value in output[index_in_output_table].items()}
                        row_copy[source_label] = source_text
                        row_copy[destination_label] = destination_text
                        row_copy['filename'] = filename
                        row_copy[paragraph_id] = destination_para
                        output.append(row_copy)
                    else:
                        output[index_in_output_table][destination_label] = destination_text
            else:
                output.append({
                    destination_label: destination_text,
                    source_label: source_text,
                    'filename': filename,
                    paragraph_id: destination_para})
                output_idx.append({
                    destination_label: destination_id,
                    source_label: source_id,
                    'filename': filename,
                    paragraph_id: destination_para
                })

            current_index = len(output) - 1
            if destination_id not in mapping[destination_label]:
                mapping[destination_label][destination_id] = set()
                mapping[destination_label][destination_id].add(current_index)
            else:
                mapping[destination_label][destination_id].add(current_index)

            if source_entity_id not in mapping[source_label]:
                mapping[source_label][source_entity_id] = set()
                mapping[source_label][source_entity_id].add(current_index)
            else:
                mapping[source_label][source_entity_id].add(current_index)

    return output


def writeOutput(data, output_path, format):
    delimiter = '\t' if format == 'tsv' else ','
    fw = csv.writer(open(output_path, encoding='utf-8', mode='w'), delimiter=delimiter, quotechar='"')
    columns = ['id', 'filename', paragraph_id, 'material', 'tcValue', 'pressure', 'me_method']
    fw.writerow(columns)
    for d in data:
        fw.writerow([d[c] if c in d else '' for c in columns])


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Converter XML (Supermat) to tabular values (CSV, TSV)")

    parser.add_argument("--input", help="Input file or directory", required=True)
    parser.add_argument("--output", help="Output directory", required=True)
    parser.add_argument("--recursive", action="store_true", default=False,
                        help="Process input directory recursively. If input is a file, this parameter is ignored.")
    parser.add_argument("--format", default='csv', choices=['tsv', 'csv'],
                        help="Output format.")
    parser.add_argument("--filter", default='all', choices=['all', 'oa', 'non-oa'],
                        help="Extract data from a certain type of licensed documents")

    args = parser.parse_args()

    input = args.input
    output = args.output
    recursive = args.recursive
    format = args.format
    filter = args.filter

    if os.path.isdir(input):
        path_list = []

        if recursive:
            for root, dirs, files in os.walk(input):
                for file_ in files:
                    if not file_.lower().endswith(".xml"):
                        continue

                    if filter == 'oa':
                        if '-CC' not in file_:
                            continue
                    elif filter == 'non-oa':
                        if '-CC' in file_:
                            continue

                    abs_path = os.path.join(root, file_)
                    path_list.append(abs_path)

        else:
            path_list = Path(input).glob('*.xml')

        data_sorted = []
        for path in path_list:
            print("Processing: ", path)
            file_data = process_file(path)
            data = sorted(file_data, key=lambda k: k[paragraph_id])
            data_sorted.extend(data)

        if os.path.isdir(str(output)):
            output_path = os.path.join(output, "output") + "." + format
        else:
            parent_dir = Path(output).parent
            output_path = os.path.join(parent_dir, "output." + format)

    elif os.path.isfile(input):
        input_path = Path(input)
        data = process_file(input_path)
        data_sorted = sorted(data, key=lambda k: k[paragraph_id])
        output_filename = input_path.stem
        output_path = os.path.join(output, str(output_filename) + "." + format)

    data = [{**record, **{"id": idx}} for idx, record in enumerate(data_sorted)]

    writeOutput(data, output_path, format)
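
For reference, a minimal sketch of how the converter added in this commit might be driven programmatically rather than through the command line. The module name xml2csv is a placeholder, since the diff does not show the new file's name; process_file, writeOutput, and the paragraph_id key are the ones defined above, and the input file name is hypothetical.

# Usage sketch. Assumptions: the new script is importable as "xml2csv";
# "example.superconductors.tei.xml" is a hypothetical annotated TEI input.
from xml2csv import process_file, writeOutput, paragraph_id

records = process_file("example.superconductors.tei.xml")       # one row per group of linked entities
records = sorted(records, key=lambda r: r[paragraph_id])        # same ordering used by the __main__ block
records = [{**r, "id": idx} for idx, r in enumerate(records)]   # assign sequential row ids
writeOutput(records, "example.csv", "csv")                      # columns: id, filename, paragraph_id, material, tcValue, pressure, me_method

The equivalent command-line call would be something like: python <new script>.py --input example.superconductors.tei.xml --output ./out --format csv, with --recursive and --filter applying only to directory inputs.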