First draft of data analysis environments #13

Open
wants to merge 6 commits into
base: trunk

LittleAprilFool

Sorry, I couldn't finish the whole draft on time. I will keep adding to this pr as I make more progress.

@LittleAprilFool changed the title from "🚧 wip - data cleaning" to "First draft of data analysis environments" on Nov 15, 2019

@cyrus- cyrus- left a comment

April, great job covering a lot of different topics. This is very well organized. Most of my comments have to do with wording changes and organizational things (in particular, each paper should have a summary). In addition, you should add a bit more detail about how the various tools you discuss were evaluated.

src/index.rst Outdated
:caption: Data Analysis Environments
:hidden:

data-analysis-environments.rst

Let's put Data Analysis Environments under the Live Programming chapter, rather than in its own top-level chapter.


Overview
========
With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997.

Before defining "data science", you should introduce "data analysis", since that is the title of the section. Something like "Data scientists use data analysis environments for a variety of data science activities. The term data science was first..."

University of Michigan statistics professor C.F. Jeff Wu popularized the term "data science" in his talk "Statistics = Data Science?" and he identified 3 aspects to data science which differentiate it from pure statistics :cite:`donoho:2017`: 1) data collection, 2) data modeling and analysis, 3) problem solving and decision support.

(Do statisticians not do "data modeling and analysis"?)


This article reflects on the past 50 years of history of data science. It introduces how the field of data science emerged as a superset of statistics and machine learning, driven by commercial rather than intellectual developments. It also envisions how the field of data science will grow in the next 50 years.

Data science has grown rapidly over the last decade with the rise of big data.

"big data" is kind of a meaningless buzzword at this point, so let's avoid that and instead say something like "as datasets have increased rapidly in size and ubiquity".

It never comes to a consensus on the workflow of data science.

awkward sentence, perhaps just remove it entirely.

Statistical analysis is the process to generate mathematically rigorous evaluations from data.
There are many statistical tests designed for different contexts and purposes, which may stand only under specific preconditions.
Thus, it is a difficult task for data science workers, especially people with little or no statistical expertise, to decide which statistical tests to use given a specific dataset and hypotheses.
Tea is a high-level declarative language to translate users' hypotheses and domain knowledge into all valid statistical tests :cite:`jun:2019`.

move to paper summary

There are many computational notebook platforms designed for different analysis languages and environments, for example, `Apache Zeppelin`_, `Spark Notebook`_, `Observable`_, `RStudio`_, `Wolfram Notebooks`_.
Among these computational notebook platforms, `Jupyter Notebook`_ is the most widely used one.
It evolved from IPython, which is a terminal-based interactive shell for creating interactive visualizations for scientific computing.
Wrapping IPython as the kernel, Jupyter Notebook has a powerful graphical interface that allows users to edit and execute "cells" -- small chunks of code or markdown text.

Mention that Jupyter now supports more than just Python.

.. bibliography:: data-analysis-environments2.bib
:filter: key == 'rule:2018'

Kery et al. took an interview approach to study how data scientists kept track of variants they explored in Jupyter notebook :cite:`kery:2018`.

Paper summary

To address the challenges in informal versioning, they designed Variolite, a code editing tool with local versioning control :cite:`kery:2017`.
Variolite is an Atom editor extension that enables users to version a section of the code based on users' selection.
They later integrated this design into Jupyter notebook with Verdant :cite:`kery:2019`.
They designed an enhanced history view with algorithmic and visualization techniques to help data science workers better forage for past analysis choices.

Talk about evaluation

.. bibliography:: data-analysis-environments2.bib
:filter: key == 'kery:2019'

Head et al. took a different design approach :cite:`head:2019`.

Paper summary + talk about evaluation

@cyrus- cyrus- left a comment

Looks great -- main issue is that you are describing the papers in the text rather than in the corresponding paper summary. Please reorganize so that the details of specific papers are in the paper summaries (it's okay if the paper summaries are long). Second draft grade will be sent by email.


@inproceedings{zhang:2017,
author = {Zhang, Xiong and Guo, Philip J.},
title = {DS.Js: Turn Any Webpage into an Example-Centric Live Programming Environment for Learning Data Science},

put abbreviations in titles in braces, e.g. {DS.js}

@inproceedings{zhang:2019,
author = {Zhang, Xiong and Guo, Philip J.},
title = {Mallard: Turn the Web into a Contextualized Prototyping Environment for Machine Learning},
booktitle = {Proceedings of the 32Nd Annual ACM Symposium on User Interface Software and Technology},

32Nd -> 32nd


@inproceedings{xia:2018,
author = {Xia, Haijun and Henry Riche, Nathalie and Chevalier, Fanny and De Araujo, Bruno and Wigdor, Daniel},
title = {DataInk: Direct and Creative Data-Oriented Drawing},

{DataInk}

@@ -16,6 +16,379 @@ REPLs

Data Analysis Environments
==========================
Data scientists use data analysis environments for a variety of data science activities.
With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997 :cite:`donoho:2017`.

Start this sentence with "The term data science was first distinguished from pure statistics in 1997"

University of Michigan statistics professor C.F. Jeff Wu popularized the term "data science" in his talk "Statistics = Data Science?".

no need to call out University of Michigan here


is there a link to the talk you can include?

It helps data science workers present, reproduce, and share their analysis, and collaborate on it.
There are many computational notebook platforms designed for different analysis languages and environments, for example, `Apache Zeppelin`_, `Spark Notebook`_, `Observable`_, `RStudio`_, `Wolfram Notebooks`_.
Among these computational notebook platforms, `Jupyter Notebook`_ supports more than 40 programming languages and has been widely used for writing and sharing computational narratives in various contexts.
It evolved from IPython, which is a terminal-based interactive shell for creating interactive visualizations for scientific computing.

see if you can find a paper about IPython to cite

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Although computational notebooks are designed to support not only performing, but also documenting and sharing analysis, most people consider them personal, exploratory, and messy.

Rule et al. :cite:`rule:2018` analyzed over 1 million notebooks on GitHub and found that they often lack explanatory text.

same

This paper reports a large scale analysis of over 1 million open-source computational notebooks.
The results show that only one in four held explanatory text.

Kery et al. took an interview approach to study how data scientists kept track of variants they explored in Jupyter notebook :cite:`kery:2018`.

same

To address the challenges in informal versioning, they designed Variolite, a code editing tool with local versioning control :cite:`kery:2017`.
Variolite is an Atom editor extension that enables users to version a section of the code based on users' selection.
A preliminary usability study shows that 9 out of 10 participants found the tool easy to use and all 10 of them would consider using it in real life.
They later integrated this design into Jupyter notebook with Verdant :cite:`kery:2019`.

same


This paper explores the design space in notebook code environments to help data scientists forage for information in their history.

Head et al. took a different design approach :cite:`head:2019`.

same

@cyrus- cyrus- left a comment

Looks great! Final grades will be sent by email soon.

Base automatically changed from master to trunk February 14, 2021 02:52

2 participants