Unicode issue - Non english characters as question mark #9

Open · punchouty opened this issue Jul 25, 2018 · 7 comments

punchouty commented Jul 25, 2018

I am able to download, decompress, load, and run everything successfully. But there is one big issue: non-English characters in crossrefworks.json are replaced by question marks ('?').

Below is an example of one such record.

{ "_id" : { "$oid" : "58d96fec0c62134f84023a29" }, "indexed" : { "date-parts" : [ [ 2016, 10, 25 ] ], "date-time" : "2016-10-25T11:26:19Z", "timestamp" : { "$numberLong" : "1477394779953" } }, "reference-count" : 97, "publisher" : "SAGE Publications", "issue" : "5", "content-domain" : { "domain" : [], "crossmark-restriction" : false }, "short-container-title" : [ "European Journal of Cardiovascular Prevention & Rehabilitation" ], "cited-count" : 0, "published-print" : { "date-parts" : [ [ 2006, 10 ] ] }, "DOI" : "10.1097/01.hjr.0000224482.95597.7a", "type" : "journal-article", "created" : { "date-parts" : [ [ 2006, 9, 22 ] ], "date-time" : "2006-09-22T08:19:38Z", "timestamp" : { "$numberLong" : "1158913178000" } }, "page" : "687-694", "source" : "CrossRef", "title" : [ "ESC Study Group of Sports Cardiology Position Paper on adverse cardiovascular effects of doping in athletes" ], "prefix" : "http://id.crossref.org/prefix/10.1177", "volume" : "13", "author" : [ { "given" : "Asterios", "family" : "Deligiannis", "affiliation" : [] }, { "given" : "Hans", "family" : "Bj??rnstad", "affiliation" : [] }, { "given" : "Francois", "family" : "Carre", "affiliation" : [] }, { "given" : "Hein", "family" : "Heidb??chel", "affiliation" : [] }, { "given" : "Evangelia", "family" : "Kouidi", "affiliation" : [] }, { "given" : "Nicole M.", "family" : "Panhuyzen-Goedkoop", "affiliation" : [] }, { "given" : "Fabio", "family" : "Pigozzi", "affiliation" : [] }, { "given" : "Wilhelm", "family" : "Sch??nzer", "affiliation" : [] }, { "given" : "Luc", "family" : "Vanhees", "affiliation" : [] } ], "member" : "http://id.crossref.org/member/179", "container-title" : [ "European Journal of Cardiovascular Prevention & Rehabilitation" ], "original-title" : [], "deposited" : { "date-parts" : [ [ 2011, 7, 28 ] ], "date-time" : "2011-07-28T15:46:48Z", "timestamp" : { "$numberLong" : "1311868008000" } }, "score" : 1, "subtitle" : [ "" ], "short-title" : [], "issued" : { "date-parts" : [ [ 2006, 10 ] ] }, "URL" : "http://dx.doi.org/10.1097/01.hjr.0000224482.95597.7a", "ISSN" : [ "1741-8267" ], "citing-count" : 97, "subject" : [ "Medicine(all)" ] }
dhimmel commented Jul 25, 2018

non-English characters in crossrefworks.json are replaced by question marks

Yuck, that looks like some sort of encoding issue at some point in the pipeline. At what stage exactly did you notice the question marks?

The API call for the above record does seem to encode the unicode characters in the names properly (i.e. not as Heidb??chel).
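
For reference, a minimal sketch (assuming the requests library is available) to print the author surnames returned by the live API for this DOI:

import requests

# Fetch the current Crossref metadata for the affected DOI and list the author family names.
url = 'https://api.crossref.org/works/10.1097/01.hjr.0000224482.95597.7a'
work = requests.get(url).json()['message']
for author in work['author']:
    print(author['family'])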

punchouty commented Jul 25, 2018

The issue is in the file downloaded from figshare. To reproduce, run the following Java code on the decompressed file:

package com.racloop.crossref;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class FileRead {

    public static void main(String[] args) {
        String filePath = "/Volumes/Seagate/crossrefworks.json";
        BufferedReader fileReader = null;
        try {
            fileReader = new BufferedReader(new FileReader(filePath));
            // Scan line by line until the record with this ObjectId is found.
            String line;
            while ((line = fileReader.readLine()) != null) {
                if (line.contains("58d96fec0c62134f84023a29")) {
                    System.out.println(line);
                    break;
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (fileReader != null) {
                try {
                    fileReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
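
For what it's worth, here is a sketch of the same check in Python with an explicit UTF-8 decode; if the question marks still show up, a platform-default-charset issue on the reading side can be ruled out:

# Read the decompressed dump as UTF-8 and print the record with the target ObjectId.
with open('/Volumes/Seagate/crossrefworks.json', encoding='utf-8') as handle:
    for line in handle:
        if '58d96fec0c62134f84023a29' in line:
            print(line)
            break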

dhimmel commented Jul 25, 2018

I've confirmed the question marks using:

xzcat data/mongo-export/crossref-works.json.xz | grep "58d96fec0c62134f84023a29"

I'll look into where these question marks first infiltrated the pipeline.

dhimmel commented Jul 25, 2018

It looks like the question marks are part of our MongoDB database (and did not get inserted by mongoexport), according to the following Python:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
crossref_db = client.crossref
# find_one returns the matching document directly (find would return a cursor)
result = crossref_db.works.find_one({'DOI': '10.1097/01.hjr.0000224482.95597.7a'})
result['author']

Which returns:

[{'given': 'Asterios', 'family': 'Deligiannis', 'affiliation': []},
 {'given': 'Hans', 'family': 'Bj??rnstad', 'affiliation': []},
 {'given': 'Francois', 'family': 'Carre', 'affiliation': []},
 {'given': 'Hein', 'family': 'Heidb??chel', 'affiliation': []},
 {'given': 'Evangelia', 'family': 'Kouidi', 'affiliation': []},
 {'given': 'Nicole M.', 'family': 'Panhuyzen-Goedkoop', 'affiliation': []},
 {'given': 'Fabio', 'family': 'Pigozzi', 'affiliation': []},
 {'given': 'Wilhelm', 'family': 'Sch??nzer', 'affiliation': []},
 {'given': 'Luc', 'family': 'Vanhees', 'affiliation': []}]

So the question is now whether our pipeline clobbered the unicode characters or whether there was an upstream issue that we inherited.
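
One quick sanity check, as a sketch reusing crossref_db from above: confirm that the stored values literally contain the '?' character, rather than this being a display artifact of the terminal:

# The repr output shows exactly what is stored in MongoDB for each family name.
doc = crossref_db.works.find_one({'DOI': '10.1097/01.hjr.0000224482.95597.7a'})
for author in doc['author']:
    print(repr(author['family']), '?' in author['family'])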

dhimmel commented Jul 25, 2018

Note that I have found an instance where unicode content was encoded properly:

crossref_db.works.find_one({'DOI': '10.7717/peerj.100'})['author']

See "Angélica" below:

 {'given': 'Angélica L.',
  'family': 'Gonzalez',
  'affiliation': [{'name': 'Department of Zoology, University of British Columbia, Vancouver, Canada'}]},

I'm thinking that the problem with the metadata for https://doi.org/10.1097/01.hjr.0000224482.95597.7a was upstream. If we take a look at the current metadata:

curl --location --silent https://api.crossref.org/works/10.1097/01.hjr.0000224482.95597.7a | python -m json.tool

I've pasted some of the relevant timestamp fields:

        "created": {
            "date-parts": [
                [
                    2006,
                    9,
                    22
                ]
            ],
            "date-time": "2006-09-22T08:19:38Z",
            "timestamp": 1158913178000
        },
        "deposited": {
            "date-parts": [
                [
                    2017,
                    12,
                    29
                ]
            ],
            "date-time": "2017-12-29T04:03:00Z",
            "timestamp": 1514520180000
        },
        "indexed": {
            "date-parts": [
                [
                    2018,
                    5,
                    3
                ]
            ],
            "date-time": "2018-05-03T03:51:58Z",
            "timestamp": 1525319518794
        },

I'm not sure what indexed means, but it looks like the publisher re-deposited the information to Crossref on 2017-12-29, at which point I'm guessing they fixed the author names. You'll find that publishers often deposit incorrect information, and this may be a case where they have improved their metadata since we queried Crossref. If this specific record is important to you, it may be fixed in @bnewbold's slightly more recent Crossref dump, made using this codebase, at https://archive.org/download/crossref_doi_dump_201801.
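
To line up the timeline, here is a sketch that compares the deposited timestamp of our local copy against the live API record (reusing crossref_db from the earlier snippet and assuming the requests library):

import requests

doi = '10.1097/01.hjr.0000224482.95597.7a'
local = crossref_db.works.find_one({'DOI': doi})
live = requests.get('https://api.crossref.org/works/' + doi).json()['message']
# The local copy should show the 2011-07-28 deposit; the live record shows the 2017-12-29 re-deposit.
print('local deposited:', local['deposited']['date-time'])
print('live deposited: ', live['deposited']['date-time'])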

bnewbold commented Jul 25, 2018

For DOI 10.1097/01.hjr.0000224482.95597.7a, the 2018-01 dump seems to have "correct" unicode characters (not question marks).

dhimmel commented Jul 25, 2018

the 2018-01 dump seems to have "correct" unicode characters (not question marks).

Glad that the error was not on our end!
