
Notes on search process

Nic edited this page Aug 12, 2022 · 1 revision

The overall goal is to identify software that was produced with NSF funding, so that we can

  • measure quality / accessibility / sustainability of that software
  • contact and survey / interview PIs about their planned and actual software development with NSF funding
  • use Augur metrics to describe the features of development that correlate with sustainable software

The first step is to identify software that was funded through NSF grants. There are two approaches:

  1. Search award abstracts for promises of software development. This can cover both the proposal abstract and the outcomes statement. Eventually we might verify / evaluate the software that was reported in outcomes, but that is not necessary at this stage.
  2. Search open-source repositories where researchers might mention that the software is supported by an NSF grant. As with approach 1, we might eventually go back and evaluate that software, but for now we just want to identify it.

With either approach, we will miss a large portion of the total software that was produced under grant-funded research. For that reason, we will also do an email survey (which has massive downsides of its own), but that is in the future.

For approach 1:

  • We want to identify awards that were likely to produce software. The only way to do this is to find where software is explicitly promised (in the abstract) or reported (in the outcomes).
  • RegEx vs. classifier: we need labeled examples to compare the two, which is why we are labeling award abstracts and outcomes.
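
As a rough idea of what the RegEx side could look like, here is a minimal sketch. The verb and noun lists, the 80-character window, and the names `SOFTWARE_PROMISE` and `mentions_software` are all assumptions made up for illustration, not a vetted pattern.

```python
import re

# Hypothetical pattern: a development verb followed, within the same
# sentence and at most ~80 characters later, by a software-like noun.
# Both word lists are illustrative guesses and would need tuning
# against the labeled abstracts.
SOFTWARE_PROMISE = re.compile(
    r"\b(?:develop|build|release|implement|create|distribute)\w*\b"
    r"[^.]{0,80}?"
    r"\b(?:software|source code|open[- ]source|toolkit|package|library|pipeline)\b",
    re.IGNORECASE,
)


def mentions_software(text: str) -> bool:
    """Return True if the text pairs a development verb with a software noun."""
    return SOFTWARE_PROMISE.search(text) is not None


if __name__ == "__main__":
    abstract = "This project will develop an open-source toolkit for seismic analysis."
    print(mentions_software(abstract))
```

A pattern like this will have false positives and false negatives, which is exactly what the labeled examples are meant to measure.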

Next steps for Approach 1:

  • Choose a CISE subset to label
  • Software might be any of the following: source code, executables, tools, 'software infrastructure', or software services (e.g. a processing pipeline). Software might also be defined or described simply as programs, procedures, or routines that are to be executed computationally.
    • Things that might be software, but that we should flag to discuss: models, algorithms, databases
    • Things that are not software: websites, data

We need ~50 examples of software promises to start testing classifier vs. RegEx, but let's not get lost in the details. We want to identify awards, not develop the best classifier in the world.
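
To keep that comparison lightweight, something like the following could score any candidate (a RegEx or a classifier) against the labeled examples. This is only a sketch: `precision_recall`, `keyword_baseline`, and the four example sentences are hypothetical stand-ins, not project data.

```python
from typing import Callable, Iterable, Tuple


def precision_recall(
    predict: Callable[[str], bool],
    labeled: Iterable[Tuple[str, bool]],
) -> Tuple[float, float]:
    """Score a predictor against (text, promises_software) pairs."""
    tp = fp = fn = 0
    for text, promises_software in labeled:
        guess = predict(text)
        if guess and promises_software:
            tp += 1  # correctly flagged
        elif guess and not promises_software:
            fp += 1  # flagged, but no software promise
        elif not guess and promises_software:
            fn += 1  # missed a real software promise
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def keyword_baseline(text: str) -> bool:
    """Hypothetical stand-in for a RegEx or classifier."""
    return "software" in text.lower()


# Made-up examples, only to show the shape of the labeled data.
EXAMPLES = [
    ("We will release open-source software.", True),
    ("The software industry is studied.", False),
    ("A new Python package will be built.", True),
    ("We study bird migration.", False),
]

if __name__ == "__main__":
    print(precision_recall(keyword_baseline, EXAMPLES))
```

Since the goal is to identify awards, recall probably matters more than precision here: a false positive costs a quick manual check, while a miss is lost.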

Next steps for Approach 2:

  • Identify likely repositories
    • GitHub, GitLab, PyPI, CRAN, etc.
  • Come up with a systematic way to search
  • Come up with a structure to record data (we might record this manually or by scraping, depending on how many we find in repositories)
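
A possible starting point for the search and the record structure is sketched below. Everything here is an assumption: the field names in `RepoRecord`, the `AWARD_MENTION` pattern, and the premise that award IDs show up in README or acknowledgment text as 7-digit numbers near the words "NSF" or "National Science Foundation".

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical pattern: "NSF" or "National Science Foundation", then up to
# 60 non-digit characters, an optional directorate-style prefix such as
# "OAC-", and a 7-digit award number.
AWARD_MENTION = re.compile(
    r"(?:NSF|National Science Foundation)\D{0,60}?(?:[A-Z]{2,4}-)?(\d{7})\b"
)


@dataclass
class RepoRecord:
    """One row per repository that appears to reference NSF support."""
    host: str                    # e.g. "GitHub", "GitLab", "PyPI", "CRAN"
    repo_url: str
    award_ids: List[str] = field(default_factory=list)
    evidence: str = ""           # the text where the grant is mentioned
    recorded_by: str = "manual"  # "manual" or "scraper"


def find_award_ids(text: str) -> List[str]:
    """Return the sorted, de-duplicated award numbers mentioned in the text."""
    return sorted(set(AWARD_MENTION.findall(text)))


if __name__ == "__main__":
    readme = (
        "This work was supported by the National Science Foundation under "
        "Grant No. OAC-1234567 and NSF award #7654321."
    )
    record = RepoRecord(
        host="GitHub",
        repo_url="https://example.org/hypothetical/repo",  # placeholder URL
        award_ids=find_award_ids(readme),
        evidence=readme,
    )
    print(record.award_ids)
```

The same record works whether rows are filled in by hand or by a scraper, so the decision about scraping can wait until we know how many repositories turn up.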