How to cite specs #179
A Digital Object Identifier (DOI) for GA4GH standards would be nice and would lend some extra stability to the process. I'm not sure what is involved in minting one's own or getting someone else to mint them for GA4GH. Another option would be to deposit them at Zenodo. |
I think this is something to push upstream to GA4GH coordinators as we should have the same mechanism available to all published documents. I like the DOI approach (or DOI plus something else). |
Agree with @jkbonfield and @michaelmhoffman. I can see why a DOI approach would make sense and also having a consistent approach to our documents. |
To summarise and expand on the discussion on Slack: I think when we have discussed this issue in the past, the recommendation has been to cite the appropriate URL and/or an appropriate paper such as one of these primordial papers. To the extent that DOIs are useful identifiers for use in papers' lists of citations, and to the extent that it is useful for papers to directly reference specification documents, it would be useful to mint DOIs for format and protocol specifications. It would be best to have a pan-GA4GH approach to this, so 👍 to addressing this via e.g. @ga4gh/TASC. I believe LSG has not done anything in this space to date; I wonder whether any other work streams have. Options for GA4GH to investigate would appear to be:
Another question to be investigated would be whether a direct DOI identifies e.g. “the SAM format” as a platonic ideal, or whether we would want to mint DOIs corresponding to particular versions of specifications. If the latter, note that “which editions of specifications are the important ones, and how do we refer to them (permanently)” is an ongoing discussion already and has different answers depending on how work streams have organised their work. |
The following also came up in the Slack conversation:
IMHO DOIs would not add a lot of value in other contexts. In particular, there would not be any point in writing out a specification DOI in the headers of e.g. a BED file. A file can be determined to be a BED file by other means (perhaps via a magic number in future; today, by recognising the tabular data or via the filename extension), so adding an explicit DOI as well as other BED-specific magic number headers wouldn't really add anything. One of the few similar things in other file formats would be XHTML-style DTDs, which HTML5 has moved away from. |
I think this is actually two-fold: author credit and specification document linkage. If we need to cite a specific version of a standard it ought to already be in the file headers ( For standards with existing papers, the authors almost certainly want their work cited. So for SAM/BAM it's the original paper. For CRAM 3.0 and earlier it'd probably need to be the pre-CRAM EBI paper (Fritz et al) in lieu of anything better, and for CRAM 3.1 it'd be my own paper. For BED I guess it's BEDtools. For VCF it'd be Petr's 2011 paper. However, those papers typically serve to track author credit and don't specifically cite the version of the standard being used by the file. Quite often they're way out of date too. Plus, if we're looking at this from a credit perspective, often the people doing the legwork now aren't the same ones who originally published. E.g. citations to the VCF spec don't give credit to any of the current specification maintainers. At the very least each standard should probably come with a citation section outlining the preferred mechanism to cite it. This can include past papers, but should also recommend a modern citeable object where they substantially differ in authorship. We ought to lodge a new version every time there is a formal update to the specification version number. One possibility is looking into protocols.io. I don't know if they specifically accept file formats or network protocols, but it does feel like a good fit to me, and they have linkage within their protocols to other protocols, so this feels like a foundation level for them. |
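The point above about file headers already recording the specification version could be sketched roughly as follows. This is only an illustration, assuming the SAM `@HD` header line with its `VN` tag and the VCF `##fileformat=VCFvX.Y` line; the function name `spec_version` is hypothetical and not part of any GA4GH tool.

```python
import re
from typing import Optional


def spec_version(first_line: str) -> Optional[str]:
    """Return the format name and spec version declared on a header line.

    Assumption: only the SAM "@HD ... VN:x.y" line and the VCF
    "##fileformat=VCFvX.Y" line are recognised; anything else gives None.
    """
    line = first_line.rstrip("\n")
    if line.startswith("@HD"):
        # SAM header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("VN:"):
                return "SAM " + field[3:]
        return None
    match = re.match(r"##fileformat=VCFv(\d+\.\d+)", line)
    if match:
        return "VCF " + match.group(1)
    return None
```

A version string recovered this way says which edition of the specification a file claims to follow, but it does nothing for author credit, which is the other half of the problem described above.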
There is a UCSC Genome Browser paper where BED is originally mentioned,
which I treat as the canonical journal article reference for BED. We also
describe the drafting of the GA4GH BED standard in the Acidbio paper so one
could make a case for citing that too if you are using the GA4GH BED spec.
There are a lot of advantages to using DOIs for a centralized authoritative
reference to GA4GH specifications or other documents, without regard to the
cultural role of DOIs in academic credit. It was actually academic credit
where I thought of this, however: our faculty annual activity report asks for
a list of documents (mostly journal articles) with a column for the
document DOI. Having one would be very helpful here and in many other
contexts. It would also be a lot easier to track citations to DOIs through
existing mechanisms, whether they be scholarly citations or altmetrics.
|
Structured CrossRef metadata may be useful to authoritatively record some of the things people are asking about in the GA4GH Connect session on Product Approval. cc @susanfairley |
Raising this issue with TASC: ga4gh/TASC#39 |
The question of whether DOIs are the right solution for identifiers in the biomedical domain was addressed as part of the FORCE11 Data Citation Implementation Pilot. The paper published by the identifiers group is at https://www.nature.com/articles/sdata201829 and an accompanying editorial is at https://www.nature.com/articles/sdata201895 . Another paper outlining the work with publishers is at https://pubmed.ncbi.nlm.nih.gov/30457573/ . The use of compact identifiers as an alternative to DOIs might be compared with the DOI approach above. The work of the group was driven by the Joint Declaration of Data Citation Principles (JDDCP). |
These are both about sets of data as deposited in data repositories. They don't appear to discuss standards specification documents, which may have their own considerations that may be different from the considerations for data sets. |
Data citation is key to reproducible science, but that is a different nuance to assigning credit where it's due which is the traditional role of paper citations. Unless there is evidence that grant funding agencies are tracking data citations as well as document citations, then it seems to not be a good fit. Specifically in the past we have had comments from people working on GA4GH specifications that the time they can contribute is limited because they're in an academic position where their "worth" is judged by funding agencies on paper citations and if work isn't towards a paper then it won't be valued or judged by the people that ultimately pay their wages. This is a rather stark and sad state of affairs which I wish wasn't so short-sighted, but it can be a real problem for some people. So my ideal would be to ensure that everyone working on updating GA4GH specifications can do so in the knowledge that significant contributions will lead to citations that may be acceptable to grant funding agencies. There is potentially a conversation to be had with them, but the easiest path is a minimal journal submission. Something that's not a fully fledged peer review with months of round-trips faffary, and more along the lines of deposit and get a DOI. Perhaps the "publish immediately with subsequent open/on-going peer-review" model fits; more of a social network style of publishing, or something specific such as protocols.io (hard to see how that integrates with a file format though which could be used in any number of ways). |
Reposting some things I added in chat today in the TASC call. One of the identifier types created by some of those involved with the FORCE11 effort is the RRID (Research Resource Identifier). RRIDs got some traction with publishers. The idea behind RRIDs is to make citable any resource used by a researcher. They cover cell lines, plasmids, antibodies and organisms, plus "Tools and Resources", which is where standards might sit. Using RRIDs, here’s how one would cite samtools: RRID:SCR_002105. Clearly there is some ambiguity for the cram example: is it referencing the format or the toolkit? So it is probably less than a perfect solution out of the box. But how can we do a “yes, and” with it, i.e. build on it rather than starting from ground zero ourselves? Also, their idea of a "tool/resource" doesn't appear to have encompassed standards: standards are not explicitly there, but service resource is, which is close. That would probably need disambiguation of the service spec and an instance/implementation of the spec. So does one walk away if they haven't addressed standards? Open questions |
When we submitted our Samtools-update, Bcftools and Htslib papers last year to GigaScience it was a hard requirement that we also submit the RRID numbers. It turned out these already existed, so we just used them, but I shared your concerns about the quality of the metadata being listed. I don't think we bothered to log in and claim ownership though, as it was a bit of a can of worms we didn't have time to open. |
Great to hear a real experience with it. Thanks for sharing @jkbonfield . GigaScience was always along for the ride with people involved in the FORCE11 and Research Data Alliance work on citation. We need these ground breakers, and to sustain and support one another in best practices like this. |
A standard is not a "tool" or "resource" like a piece of software or a cell line. A standard is a document. There is a long history of citing standards as documents. There are even normative international and national standards that specify how a standard is to be cited as a document in a bibliography. For example, ISO 690 describes this in section 8.11.4, "Standards", which is a subdivision of 8.11, "Reports in series and similar information resources". National Information Standards Organization standards have an ISSN and an ISBN. In my experience, RRIDs are not used for documents, because we have other ways to cite documents and other identifiers for them. In particular, when RRIDs are used, they can replace a traditional citation to a document in the bibliography. Encouraging this would be an error here and would decrease the visibility of GA4GH standards. GA4GH should not be introducing novelty by citing standards documents by RRID. We should instead use the established ways people identify documents. |
I think the real issue is how to easily generate documents that can be cited in the traditional ways. I agree RRID may not be appropriate for documentation. Publishing in mainstream journals is seriously hard work, taking up months of time. Having something gnarly like that to periodically wade through can be a barrier to getting new people on board. There are however more "light-touch" journals that publish immediately and have more social-media style ongoing review. Some are even dedicated to minimalist work, such as getting a citeable DOI for a new software release. They may not be directly suitable to a standards body though. If GA4GH has its own official mechanism, or a collaboration with someone doing a similar job, for generating permanent citeable DOIs related to each specification version, then the specifications could be officially cited as documents while also giving credit to the authors (and accounting for ongoing turnover of people involved). This doesn't remove the ability to publish in mainstream journals if people wish - it's just a choice people can make. |
How can I cite the current specs?
For instance, SAM was defined in http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp352, but the specifications have changed a lot since then. In addition, the SAM tag definitions are also in active development.
Thanks in advance!
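For what it's worth, the publisher URL in the question already embeds a DOI, so the paper at least can be cited by a stable link. A minimal sketch, assuming the common "10.prefix/suffix" DOI shape and the https://doi.org/ resolver; the pattern is a simplification and the name `doi_link` is hypothetical:

```python
import re
from typing import Optional

# Simplified DOI shape: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
# Assumption: the suffix ends at whitespace, "?" or "#".
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def doi_link(url: str) -> Optional[str]:
    """Return a https://doi.org/ link for the first DOI found in url, if any."""
    match = DOI_PATTERN.search(url)
    if match is None:
        return None
    return "https://doi.org/" + match.group(0)


print(doi_link(
    "http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp352"
))
```

This only covers the original SAM paper; it does not identify any particular version of the specification, which is the open question in this issue.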