First draft of data analysis environments #13
base: trunk
April, great job covering a lot of different topics. This is very well organized. Most of my comments have to do with wording changes and organizational things (in particular, each paper should have a summary). In addition, you should add a bit more detail about how the various tools you discuss were evaluated.
src/index.rst
Outdated
:caption: Data Analysis Environments
:hidden:

data-analysis-environments.rst
Let's put Data Analysis Environments under the Live Programming chapter, rather than in its own top-level chapter.
src/data-analysis-environments.rst
Outdated
Overview
========
With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997.
Before defining "data science", you should introduce "data analysis", since that is the title of the section. Something like "Data scientists use data analysis environments for a variety of data science activities. The term data science was first..."
src/data-analysis-environments.rst
Outdated
Overview
========
With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997.
University of Michigan statistics professor C.F. Jeff Wu popularized the term "data science" in his talk "Statistics = Data Science?" and he identified 3 aspects to data science which differentiate it from pure statistics :cite:`donoho:2017`: 1) data collection, 2) data modeling and analysis, 3) problem solving and decision support.
(Do statisticians not do "data modeling and analysis"?)
src/data-analysis-environments.rst
Outdated
This article reflects on the past 50 years of history of data science. It introduces how the field of data science emerged as a superset of statistics and machine learning, driven by commercial rather than intellectual developments. It also visions how the field of data science will grow in the next 50 years.

Data science has grown rapidly over the last decade with the rise of big data.
"big data" is kind of a meaningless buzzword at this point, so let's avoid that and instead say something like "as datasets have increased rapidly in size and ubiquity".
src/data-analysis-environments.rst
Outdated
This article reflects on the past 50 years of history of data science. It introduces how the field of data science emerged as a superset of statistics and machine learning, driven by commercial rather than intellectual developments. It also visions how the field of data science will grow in the next 50 years.

Data science has grown rapidly over the last decade with the rise of big data.
It never comes to a consensus on the workflow of data science.
awkward sentence, perhaps just remove it entirely.
src/data-analysis-environments.rst
Outdated
Statistical analysis is the process to generate mathematically rigorous evaluations from data.
There are many statistical tests designed for different contexts and purposes, which may stand only under specific preconditions.
Thus, it is a difficult task for data science workers, especially people with little or no statistical expertise, to decide which statistical tests to use given a specific dataset and hypotheses.
Tea is a high-level declarative language to translate users' hypotheses and domain knowledge into all valid statistical tests :cite:`jun:2019`.
move to paper summary
src/data-analysis-environments.rst
Outdated
There are many computational notebook platforms designed for different analysis languages and environments, for example, `Apache Zeppelin`_, `Spark Notebook`_, `Observable`_, `RStudio`_, `Wolfram Notebooks`_.
Among these computational notebook platforms, `Jupyter Notebook`_ is the most widely used one.
It evolved from IPython, which is a terminal-based interactive shell for creating interactive visualizations for scientific computing.
Wrapping IPython as the kernel, Jupyter Notebook has a powerful graphical interface that allows users to edit and execute "cells" -- small chunks of code or markdown text.
Mention that Jupyter now supports more than just Python.
src/data-analysis-environments.rst
Outdated
.. bibliography:: data-analysis-environments2.bib
:filter: key == 'rule:2018'

Kery et al. took an interview approach to study how data scientists kept track of variants they explored in Jupyter notebook :cite:`kery:2018`.
Paper summary
src/data-analysis-environments.rst
Outdated
To address the challenges in informal versioning, they designed Variolite, a code editing tool with local versioning control :cite:`kery:2017`.
Variolite is an Atom editor extension that enables users to version a section of the code based on users' selection.
They later integrated this design into Jupyter notebook with Verdant :cite:`kery:2019`.
They designed an enhanced history view with algorithmic and visualization techniques for data science workers to better foraging past analysis choices.
Talk about evaluation
src/data-analysis-environments.rst
Outdated
.. bibliography:: data-analysis-environments2.bib
:filter: key == 'kery:2019'

Head et al. took a different design approach :cite:`head:2019`.
Paper summary + talk about evaluation
Looks great -- main issue is that you are describing the papers in the text rather than in the corresponding paper summary. Please reorganize so that the details of specific papers are in the paper summaries (it's okay if the paper summaries are long). Second draft grade will be sent by email.
src/data-analysis-environments.bib
Outdated
@inproceedings{zhang:2017,
author = {Zhang, Xiong and Guo, Philip J.},
title = {DS.Js: Turn Any Webpage into an Example-Centric Live Programming Environment for Learning Data Science},
put abbreviations in titles in braces, e.g. {DS.js}
src/data-analysis-environments.bib
Outdated
@inproceedings{zhang:2019,
author = {Zhang, Xiong and Guo, Philip J.},
title = {Mallard\&\#58; Turn the Web into a Contextualized Prototyping Environment for Machine Learning},
booktitle = {Proceedings of the 32Nd Annual ACM Symposium on User Interface Software and Technology},
32Nd -> 32nd
src/data-analysis-environments2.bib
Outdated
@inproceedings{xia:2018,
author = {Xia, Haijun and Henry Riche, Nathalie and Chevalier, Fanny and De Araujo, Bruno and Wigdor, Daniel},
title = {DataInk: Direct and Creative Data-Oriented Drawing},
{DataInk}
src/live-programming.rst
Outdated
@@ -16,6 +16,379 @@ REPLs

Data Analysis Environments
==========================
Data scientists use data analysis environments for a variety of data science activities.
With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997 :cite:`donoho:2017`.
Start this sentence with "The term data science was first distinguished from pure statistics in 1997"
src/live-programming.rst
Outdated
@@ -16,6 +16,379 @@ REPLs

Data Analysis Environments
==========================
Data scientists use data analysis environments for a variety of data science activities.
With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997 :cite:`donoho:2017`.
University of Michigan statistics professor C.F. Jeff Wu popularized the term "data science" in his talk "Statistics = Data Science?".
no need to call out University of Michigan here
is there a link to the talk you can include?
src/live-programming.rst
Outdated
It helps data science workers to present, reproduce, share, and collaborate their analysis.
There are many computational notebook platforms designed for different analysis languages and environments, for example, `Apache Zeppelin`_, `Spark Notebook`_, `Observable`_, `RStudio`_, `Wolfram Notebooks`_.
Among these computational notebook platforms, `Jupyter Notebook`_ supports more than 40 programming languages and has been widely used for writing and sharing computational narratives in various contexts.
It evolved from IPython, which is a terminal-based interactive shell for creating interactive visualizations for scientific computing.
see if you can find a paper about IPython to cite
src/live-programming.rst
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Although computational notebooks are designed to support not only performing, but also documenting and sharing analysis, most people consider it personal, exploratory, and messy.

Rule et al. :cite:`rule:2018` analyzed over 1 million notebooks on GitHub and found them often lack explanatory text.
same
src/live-programming.rst
Outdated
This paper reports a large scale analysis of over 1 million open-source computational notebooks.
The results show that only one in four held explanatory text.

Kery et al. took an interview approach to study how data scientists kept track of variants they explored in Jupyter notebook :cite:`kery:2018`.
same
To address the challenges in informal versioning, they designed Variolite, a code editing tool with local versioning control :cite:`kery:2017`.
Variolite is an Atom editor extension that enables users to version a section of the code based on users' selection.
A preliminary usability study shows that 9 out of 10 participants found the tool easy to use and all 10 of them would consider use it in real life.
They later integrated this design into Jupyter notebook with Verdant :cite:`kery:2019`.
same
src/live-programming.rst
Outdated
This paper explores the design space in notebook code enviroments to help data scientists forage for information in their history.

Head et al. took a different design approach :cite:`head:2019`.
same
Looks great! Final grades will be sent by email soon.
Sorry, I couldn't finish the whole draft on time. I will keep adding to this PR as I make more progress.