
Commit

Merge pull request #948 from bressler1995/open-science-101
Address minor consistency differences between MOOC GitHub
bressler95tops authored Dec 14, 2024
2 parents 9182feb + c01504c commit a1ae8a8
Showing 10 changed files with 155 additions and 80 deletions.
2 changes: 1 addition & 1 deletion Module_1/Lesson_2/readme.md
@@ -155,7 +155,7 @@ In 2022 though, NASA decided to fund a challenge open to the public to develop n

### Quality and Diversity of Scholarly Communications

Furthermore, open science improves the state of scientific literature. Scientific journals have traditionally faced the severe issue of publication bias, where journal articles overwhelmingly feature novel and positive results, according to a 2018 [study](https://pubmed.ncbi.nlm.nih.gov/30523135/). This results in a state where scientific results in certain disciplines published may have a number of exaggerated effects, or even be "false positives" (wrongly claiming that an effect exists), making it difficult to evaluate the trustworthiness of published results, according to a 2011 and 2016 study. Open science practices, such as registered reports, mitigate publication bias and improve the trustworthiness of the scientific literature. Registered reports are journal publication formats that peer-review and accept articles before data collection is undertaken, eliminating the pressure to distort results, according to a 2022 [study](https://www.nature.com/articles/s41562-021-01193-7). Other open science practices, such as pre-registration, also allows a partial look into projects that for various reasons (such as lack of funding, logistical issues or shifts in organizational priorities) have not been completed or disseminated, according to a 2023 [study](https://pubmed.ncbi.nlm.nih.gov/34396837/), giving these projects a publicly available output that can help inform about the current state research.
Furthermore, open science improves the state of the scientific literature. Scientific journals have traditionally faced the severe issue of publication bias, where journal articles overwhelmingly feature novel and positive results, according to a 2018 [study](https://pubmed.ncbi.nlm.nih.gov/30523135/). As a result, published findings in certain disciplines may report exaggerated effects, or even be "false positives" (wrongly claiming that an effect exists), making it difficult to evaluate the trustworthiness of published results, according to studies from 2011 and 2016 ([1](https://journals.sagepub.com/doi/10.1177/0956797611417632), [2](https://elifesciences.org/articles/21451)). Open science practices, such as registered reports, mitigate publication bias and improve the trustworthiness of the scientific literature. Registered reports are journal publication formats that peer-review and accept articles before data collection is undertaken, eliminating the pressure to distort results, according to a 2022 [study](https://www.nature.com/articles/s41562-021-01193-7). Other open science practices, such as pre-registration, also allow a partial look into projects that, for various reasons (such as lack of funding, logistical issues, or shifts in organizational priorities), have not been completed or disseminated, according to a 2023 [study](https://pubmed.ncbi.nlm.nih.gov/34396837/), giving these projects a publicly available output that can help inform about the current state of research.

<img src="../images/media/image254.png" style="width: 350px; height: auto;" />

2 changes: 1 addition & 1 deletion Module_2/Lesson_2/readme.md
@@ -182,7 +182,7 @@ Metadata can facilitate the assessment of dataset quality and data sharing by an

Metadata enhances searchability and findability of the data by potentially allowing other machines to read and interpret datasets.

According to [The University of Pittsburgh](https://pitt.libguides.com/metadatadiscovery/metadata-standards), "A metadata standard is a high level document which establishes a common way of structuring and understanding data, and includes principles and implementation issues for utilizing the standard."
According to [the University of Pittsburgh](https://pitt.libguides.com/metadatadiscovery/metadata-standards), "A metadata standard is a high level document which establishes a common way of structuring and understanding data, and includes principles and implementation issues for utilizing the standard."
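
To make the idea of structured metadata concrete, here is a minimal sketch of a standards-style record, written as a Python dictionary and serialized to JSON. The field names loosely follow Dublin Core conventions, and every value is a hypothetical placeholder rather than part of any official schema.

```python
import json

# Illustrative metadata record with Dublin Core-style fields.
# All names and values are placeholders; follow the standard used in
# your domain or requested by your data repository.
dataset_metadata = {
    "title": "Example Surface Temperature Observations, 2020-2023",
    "creator": "Example Research Group",
    "date": "2024-01-15",
    "format": "text/csv",
    "identifier": "https://doi.org/10.xxxx/example",  # placeholder DOI
    "license": "CC-BY-4.0",
    "description": "Hourly surface temperature readings from an example field campaign.",
}

# JSON keeps the record both human-readable and machine-readable.
print(json.dumps(dataset_metadata, indent=2))
```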

Many standards exist for metadata fields and structures to describe general data information. It is a best practice to use a standard that is commonly used in your domain, when applicable, or that is requested by your data repository. Examples of metadata standards for different domains include:

6 changes: 3 additions & 3 deletions Module_2/Lesson_4/readme.md
@@ -284,13 +284,13 @@ From VS Code you can:
- Upload your changes directly to GitHub.
- Download changes from other team members to your local system.

**IDE Example: RStudio – IDE**
### IDE Example: RStudio

While Visual Studio Code is a more generic IDE where you can use plugins to specialize it, there are also IDEs, such as RStudio, that have specialized features for specific languages right out of the gate.

Researchers conducting statistical analysis tend to use the coding languages of R and Python. RStudio has built-in tools for that very purpose, including data visualization.

<img src="../images/media/image36.jpeg" style="width:100%;height:auto;" />
<img src="../images/media/image36.jpeg" style="width:100%;height:auto;" />

Source: https://en.wikipedia.org/wiki/File:RStudio_IDE_screenshot.png
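
As a rough illustration of the statistical analysis and visualization workflow these IDEs are built around, the short Python sketch below summarizes and plots a toy dataset with pandas and matplotlib. The column names and values are invented for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy measurements; in practice this might come from a file,
# e.g. pd.read_csv("observations.csv").
df = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "temperature_c": [14.2, 15.1, 9.8, 10.4],
})

# Per-site statistical summary.
print(df.groupby("site")["temperature_c"].describe())

# Quick box plot of the same data.
df.boxplot(column="temperature_c", by="site")
plt.suptitle("")  # drop the automatic group-by title
plt.title("Temperature by site (example data)")
plt.ylabel("Temperature (°C)")
plt.show()
```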

@@ -439,7 +439,7 @@ Cons:
- Google Cloud
- Microsoft Azure

Many data providers, especially of large datasets, are migrating their data to the Cloud to increase accessibility and to make use of the large storage capacity that the Cloud provides. For instance, NASA Earthdata (which houses all NASA Earth science data) is now using AWS to store the majority of its data. Many Cloud providers also have a number of publicly available datasets, including [Google Cloud](https://cloud.google.com/storage/docs/public-datasets/#%3A~%3Atext%3DAvailable%20public%20datasets%20on%20Cloud%20Storage%201%20ERA5%3A%2Cfrom%202015%20through%20the%20present.%20...%20More%20items) and [AWS](https://registry.opendata.aws/)[.](https://cloud.google.com/storage/docs/public-datasets/#%3A~%3Atext%3DAvailable%20public%20datasets%20on%20Cloud%20Storage%201%20ERA5%3A%2Cfrom%202015%20through%20the%20present.%20...%20More%20items)
Many data providers, especially of large datasets, are migrating their data to the Cloud to increase accessibility and to make use of the large storage capacity that the Cloud provides. For instance, NASA Earthdata (which houses all NASA Earth science data) is now using AWS to store the majority of its data. Many Cloud providers also have a number of publicly available datasets, including [Google Cloud](https://cloud.google.com/storage/docs/public-datasets/#:~:text=Available%20public%20datasets%20on%20Cloud%20Storage%201%20ERA5%3A,from%202015%20through%20the%20present.%20...%20More%20items) and [AWS](https://registry.opendata.aws/).

When choosing a computing platform, it is important to consider where your datasets are saved and how big the datasets are. For instance, when working with small datasets, it is often preferable to use a personal computer since data download will take minimal time and large computing resources likely aren’t needed. When working with large datasets, however, it is best to minimize the amount of downloading and uploading data that is needed, as this can take significant amounts of time and internet bandwidth. If your large datasets are stored on the Cloud already, it is typically best to use Cloud resources for the computation as well, and likewise for HPC use.
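
As a minimal sketch of working with data where it already lives, the example below lists a few objects from a public Amazon S3 bucket using anonymous (unsigned) access with boto3. The bucket and prefix names are placeholders; substitute the values documented by your data provider.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: many public datasets permit unsigned, read-only access.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Placeholder bucket and prefix; replace with your provider's documented values.
response = s3.list_objects_v2(
    Bucket="example-public-dataset-bucket",
    Prefix="example/prefix/",
    MaxKeys=10,
)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```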

27 changes: 18 additions & 9 deletions Module_3/Lesson_1/readme.md
@@ -56,20 +56,29 @@ Data are any type of information that is collected, observed, or created in the

Data includes:

**Primary (raw) data** – Primary data refers to data that are directly collected or created by researchers. Research questions guide the collection of the data. Typically, a researcher will formulate a question, develop a methodology and start collecting the data. Some examples of primary data include:
<details>
<summary><span>Primary (raw) data</span></summary>
Primary data refers to data that are directly collected or created by researchers. Research questions guide the collection of the data. Typically, a researcher will formulate a question, develop a methodology and start collecting the data. Some examples of primary data include:

- Responses to interviews, questionnaires, and surveys.
- Data acquired from recorded measurements, including remote sensing data.
- Data acquired from physical samples and specimens, which form the basis of many studies.
- Data generated from models and simulations.

**Secondary & Processed data** – Secondary data typically refers to data that is used by someone different from who collected or generated the data. Often, this may include data that has been processed from its raw state to be more readily usable by others.

**Published data** – Published data are the data shared to address a particular scientific study and/or for general use. While published data can overlap with primary and secondary data types, we have "published data" as its own category to emphasize that such datasets are ideally well-documented and easy to use.

**Metadata** – Metadata are a special type of data that describe other data or objects (e.g. samples). They are often used to provide a standard set of information about a dataset to enable easy use and interpretation of the data.

The term open data is defined in the open data handbook from the Open Knowledge Foundation:
</details>
<details>
<summary><span>Secondary & Processed data</span></summary>
Secondary data typically refers to data that is used by someone different from who collected or generated the data. Often, this may include data that has been processed from its raw state to be more readily usable by others.
</details>
<details>
<summary><span>Published data</span></summary>
Published data are the data shared to address a particular scientific study and/or for general use. While published data can overlap with primary and secondary data types, we have "published data" as its own category to emphasize that such datasets are ideally well-documented and easy to use.
</details>
<details>
<summary><span>Metadata</span></summary>
Metadata are a special type of data that describe other data or objects (e.g. samples). They are often used to provide a standard set of information about a dataset to enable easy use and interpretation of the data.

</details><br><br>

The term open data is defined in the Open Data Handbook from the Open Knowledge Foundation:

<img style="width:100%;height:auto;" src="../images/media/opendatahandbookquote.jpg">

75 changes: 41 additions & 34 deletions Module_3/Lesson_2/readme.md
@@ -274,34 +274,39 @@ Match the repository type to the correct definition.

Using open data for your project is contingent on a number of factors including quality of data, access and reuse conditions, data findability, and more. A few essential elements that enable you to assess the relevance and usability of datasets include (adapted from the [GODAN Action Open Data course](https://aims.gitbook.io/open-data-mooc/unit-3-using-open-data/lesson-2.2-quality-and-provenance)):

**Practical Questions**

- Is the data well described?
- Is the reason the data is collected clear? Is the publisher’s use for the data clear?
- Are any other existing uses of the data outlined?
- Is the data accessible?
- Is the data timestamped or up to date?
- Will the data be available for at least a year?
- Will the data be updated regularly?
- Is there a quality control process?

**Technical Questions**

- Is the data available in a format appropriate for the content?
- Is the data available from a consistent location?
- Is the data well-structured and machine-readable?
- Are complex terms and acronyms in the data defined?
- Does the data use a schema or data standard?
- Is there an API available for accessing the data?
- What tools or software are needed to use this data?

**Social Questions**

- Is there an existing community of users of the data?
- Is the data already relied upon by large numbers of people?
- Is the data officially supported?
- Are service level agreements available for the data?
- It is clear who maintains and can be contacted about the data?
<details>
<summary><span>Practical Questions</span></summary>

- Is the data well described?
- Is the reason the data is collected clear? Is the publisher’s use for the data clear?
- Are any other existing uses of the data outlined?
- Is the data accessible?
- Is the data timestamped or up to date?
- Will the data be available for at least a year?
- Will the data be updated regularly?
- Is there a quality control process?
</details>
<details>
<summary><span>Technical Questions</span></summary>

- Is the data available in a format appropriate for the content?
- Is the data available from a consistent location?
- Is the data well-structured and machine-readable?
- Are complex terms and acronyms in the data defined?
- Does the data use a schema or data standard?
- Is there an API available for accessing the data?
- What tools or software are needed to use this data?
</details>
<details>
<summary><span>Social Questions</span></summary>

- Is there an existing community of users of the data?
- Is the data already relied upon by large numbers of people?
- Is the data officially supported?
- Are service level agreements available for the data?
- Is it clear who maintains the data and who can be contacted about it?
</details>

[[cite: https://aims.gitbook.io/open-data-mooc/unit-3-using-open-data/lesson-2.2-quality-and-provenance](https://aims.gitbook.io/open-data-mooc/unit-3-using-open-data/lesson-2.2-quality-and-provenance)]

@@ -338,15 +343,17 @@ Most datasets require (at a minimum) that you list the data’s producers, name

### Citing Open Data: Examples

**Example from a NASA Distributed Active Archive Center (DAAC)**

Matthew Rodell and Hiroko Kato Beaudoing, NASA/GSFC/HSL (08.16.2007), GLDAS CLM Land Surface Model L4 3 Hourly 1.0 x 1.0 degree Subsetted, version 001, Greenbelt, Maryland, USA:Goddard Earth Sciences Data and Information Services Center (GES DISC), Accessed on July 12th, 2018 at doi:10.5067/83NO2QDLG6M0

**Example from NASA Planetary Data System (PDS)**

Justin N. Maki. (2004). MER 1 MARS MICROSCOPIC IMAGER RADIOMETRIC
<details>
<summary><span>Example from a NASA Distributed Active Archive Center (DAAC)</span></summary>
Matthew Rodell and Hiroko Kato Beaudoing, NASA/GSFC/HSL (08.16.2007), GLDAS CLM Land Surface Model L4 3 Hourly 1.0 x 1.0 degree Subsetted, version 001, Greenbelt, Maryland, USA:Goddard Earth Sciences Data and Information Services Center (GES DISC), Accessed on July 12th, 2018 at doi:10.5067/83NO2QDLG6M0
</details>
<details>
<summary><span>Example from NASA Planetary Data System (PDS)</span></summary>
Justin N. Maki. (2004). MER 1 MARS MICROSCOPIC IMAGER RADIOMETRIC RDR OPS V1.0 [Data set]. NASA Planetary Data System. [https://doi.org/10.17189/1520416](https://doi.org/10.17189/1520416)
</details>

## Lesson 2: Summary

Expand Down
44 changes: 35 additions & 9 deletions Module_3/Lesson_3/readme.md
@@ -76,15 +76,41 @@ Some examples of open data formats include:

*Select each card to find out more information.*

| | |
|---|---|
| Comma Separated Values (CSV) | For simplicity, readability, compatibility, easy data exchange. |
| Hierarchical Data Format (HDF) | For efficient storing and retrieving data, compression, multi-dimensional support. |
| Network Common Data Form (NetCDF) | For self-describing and portability, efficient data subsetting (extract specific portions of large datasets), standardization and interoperability. |
| Investigation-Study- Assay (ISA) model for life science studies | For structured data organization, data integration and interoperability among experiments, reproducibility and transparency. |
| Flexible Image Transport System (FITS) | As a standard for astronomical data, flexible and extensible metadata and image headers, efficient data compression and archiving of large datasets. |
| Common Data Format (CDF) | For self-describing format readable across multiple operating systems, programming languages, and software environments, multidimensional data, and metadata inclusion. |
| Microsoft Word (.doc/.docx) | A proprietary file format used to store word processing data. |
<details>
<summary><span>Comma Separated Values (CSV)</span></summary>

For simplicity, readability, compatibility, and easy data exchange.
</details>
<details>
<summary><span>Hierarchical Data Format (HDF)</span></summary>

For efficient storing and retrieving data, compression, multi-dimensional support.
</details>
<details>
<summary><span>Network Common Data Form (NetCDF)</span></summary>

For self-description and portability, efficient data subsetting (extracting specific portions of large datasets), standardization, and interoperability.
</details>
<details>
<summary><span>Investigation-Study-Assay (ISA) model for life science studies</span></summary>

For structured data organization, data integration and interoperability among experiments, reproducibility and transparency.
</details>
<details>
<summary><span>Flexible Image Transport System (FITS)</span></summary>

As a standard for astronomical data, flexible and extensible metadata and image headers, efficient data compression and archiving of large datasets.
</details>
<details>
<summary><span>Common Data Format (CDF)</span></summary>

For a self-describing format readable across multiple operating systems, programming languages, and software environments, with support for multidimensional data and metadata inclusion.
</details>
<details>
<summary><span>Microsoft Word (.doc/.docx)</span></summary>

A proprietary file format used to store word processing data.
</details>
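
To illustrate how open formats keep data usable from common tooling, here is a small sketch that reads a CSV file with pandas and a NetCDF file with xarray. The file names are placeholders, and xarray assumes a NetCDF backend such as netCDF4 is installed.

```python
import pandas as pd
import xarray as xr

# Placeholder file names; any CSV or NetCDF file of your own will do.
table = pd.read_csv("observations.csv")       # simple tabular data
print(table.head())

dataset = xr.open_dataset("model_output.nc")  # self-describing, multi-dimensional data
print(dataset)                                # prints dimensions, coordinates, variables, attributes
dataset.close()
```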

By embracing open standards, authors can avoid unnecessary barriers and maximize their chances of making data useful to their communities.

