First draft of data analysis environments #13

Open

LittleAprilFool wants to merge 6 commits into fplab:trunk from LittleAprilFool:master

LittleAprilFool commented Nov 15, 2019

Sorry, I couldn't finish the whole draft on time. I will keep adding to this pr as I make more progress.


          🚧 wip - data cleaning

8a564fd

LittleAprilFool changed the title ~~🚧 wip - data cleaning~~ First draft of data analysis environments

LittleAprilFool added 2 commits

November 17, 2019 18:59


          🚧 wip computational notebooks

8ee48de


          ✨ finish the first draft

31c743d

cyrus- requested changes

Contributor

cyrus- left a comment

April, great job covering a lot of different topics. This is very well organized. Most of my comments have to do with wording changes and organizational things (in particular, each paper should have a summary). In addition, you should add a bit more detail about how the various tools you discuss were evaluated.

src/index.rst Outdated

+                :caption: Data Analysis Environments
+                :hidden:
+                data-analysis-environments.rst

Contributor

cyrus- Nov 19, 2019

Let's put Data Analysis Environments under the Live Programming chapter, rather than in its own top-level chapter.

src/data-analysis-environments.rst Outdated

+              Overview
+              ========
+              With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997.

Contributor

cyrus- Nov 19, 2019

Before defining "data science", you should introduce "data analysis", since that is the title of the section. Something like "Data scientists use data analysis environments for a variety of data science activities. The term data science was first..."

src/data-analysis-environments.rst Outdated

+              Overview
+              ========
+              With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997.
+              University of Michigan statistics professor C.F. Jeff Wu popularized the term "data science" in his talk "Statistics = Data Science?" and he identified 3 aspects to data science which differentiate it from pure statistics :cite:`donoho:2017`: 1) data collection, 2) data modeling and analysis, 3) problem solving and decision support.

Contributor

cyrus- Nov 19, 2019

(Do statisticians not do "data modeling and analysis"?)

src/data-analysis-environments.rst Outdated


+                This article reflects on the past 50 years of history of data science. It introduces how the field of data science emerged as a superset of statistics and machine learning, driven by commercial rather than intellectual developments. It also visions how the field of data science will grow in the next 50 years.
+              Data science has grown rapidly over the last decade with the rise of big data.

Contributor

cyrus- Nov 19, 2019

"big data" is kind of a meaningless buzzword at this point, so let's avoid that and instead say something like "as datasets have increased rapidly in size and ubiquity".

src/data-analysis-environments.rst Outdated

+                This article reflects on the past 50 years of history of data science. It introduces how the field of data science emerged as a superset of statistics and machine learning, driven by commercial rather than intellectual developments. It also visions how the field of data science will grow in the next 50 years.
+              Data science has grown rapidly over the last decade with the rise of big data.
+              It never comes to a consensus on the workflow of data science.

Contributor

cyrus- Nov 19, 2019

awkward sentence, perhaps just remove it entirely.

src/data-analysis-environments.rst Outdated

+              Statistical analysis is the process to generate mathematically rigorous evaluations from data.
+              There are many statistical tests designed for different contexts and purposes, which may stand only under specific preconditions.
+              Thus, it is a difficult task for data science workers, especially people with little or no statistical expertise, to decide which statistical tests to use given a specific dataset and hypotheses.
+              Tea is a high-level declarative language to translate users' hypotheses and domain knowledge into all valid statistical tests :cite:`jun:2019`.

Contributor

cyrus- Nov 21, 2019

move to paper summary

src/data-analysis-environments.rst Outdated

+              There are many computational notebook platforms designed for different analysis languages and environments, for example, `Apache Zeppelin`_, `Spark Notebook`_, `Observable`_, `RStudio`_, `Wolfram Notebooks`_.
+              Among these computational notebook platforms, `Jupyter Notebook`_ is the most widely used one.
+              It evolved from IPython, which is a terminal-based interactive shell for creating interactive visualizations for scientific computing.
+              Wrapping IPython as the kernel, Jupyter Notebook has a powerful graphical interface that allows users to edit and execute "cells" -- small chunks of code or markdown text.

Contributor

cyrus- Nov 21, 2019

Mention that Jupyter now supports more than just Python.

src/data-analysis-environments.rst Outdated

+                .. bibliography:: data-analysis-environments2.bib
+                  :filter: key == 'rule:2018'
+              Kery et al. took an interview approach to study how data scientists kept track of variants they explored in Jupyter notebook :cite:`kery:2018`.

Contributor

cyrus- Nov 21, 2019

Paper summary

src/data-analysis-environments.rst Outdated

+              To address the challenges in informal versioning, they designed Variolite, a code editing tool with local versioning control :cite:`kery:2017`.
+              Variolite is an Atom editor extension that enables users to version a section of the code based on users' selection.
+              They later integrated this design into Jupyter notebook with Verdant :cite:`kery:2019`.
+              They designed an enhanced history view with algorithmic and visualization techniques for data science workers to better foraging past analysis choices.

Contributor

cyrus- Nov 21, 2019

Talk about evaluation

src/data-analysis-environments.rst Outdated

+                .. bibliography:: data-analysis-environments2.bib
+                  :filter: key == 'kery:2019'
+              Head et al. took a different design approach :cite:`head:2019`.

Contributor

cyrus- Nov 21, 2019

Paper summary + talk about evaluation


          ✨ second draft

9ed340c

cyrus- requested changes

Contributor

cyrus- left a comment

Looks great -- main issue is that you are describing the papers in the text rather than in the corresponding paper summary. Please reorganize so that the details of specific papers are in the paper summaries (it's okay if the paper summaries are long). Second draft grade will be sent by email.

src/data-analysis-environments.bib Outdated

+              @inproceedings{zhang:2017,
+               author = {Zhang, Xiong and Guo, Philip J.},
+               title = {DS.Js: Turn Any Webpage into an Example-Centric Live Programming Environment for Learning Data Science},

Contributor

cyrus- Dec 3, 2019

put abbreviations in titles in braces, e.g. {DS.js}

src/data-analysis-environments.bib Outdated

+              @inproceedings{zhang:2019,
+               author = {Zhang, Xiong and Guo, Philip J.},
+               title = {Mallard\&\#58; Turn the Web into a Contextualized Prototyping Environment for Machine Learning},
+               booktitle = {Proceedings of the 32Nd Annual ACM Symposium on User Interface Software and Technology},

Contributor

cyrus- Dec 3, 2019

32Nd -> 32nd

src/data-analysis-environments2.bib Outdated

+              @inproceedings{xia:2018,
+               author = {Xia, Haijun and Henry Riche, Nathalie and Chevalier, Fanny and De Araujo, Bruno and Wigdor, Daniel},
+               title = {DataInk: Direct and Creative Data-Oriented Drawing},

Contributor

cyrus- Dec 3, 2019

{DataInk}

src/live-programming.rst Outdated

@@ -16,6 +16,379 @@ REPLs
               Data Analysis Environments
               ==========================
+              Data scientists use data analysis environments for a variety of data science activities.
+              With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997 :cite:`donoho:2017`.

Contributor

cyrus- Dec 6, 2019

Start this sentence with "The term data science was first distinguished from pure statistics in 1997"

src/live-programming.rst Outdated

@@ -16,6 +16,379 @@ REPLs
               Data Analysis Environments
               ==========================
+              Data scientists use data analysis environments for a variety of data science activities.
+              With the notion of extracting knowledge and insights from data, the term data science was first distinguished from pure statistics in 1997 :cite:`donoho:2017`.
+              University of Michigan statistics professor C.F. Jeff Wu popularized the term "data science" in his talk "Statistics = Data Science?".

Contributor

cyrus- Dec 6, 2019

no need to call out University of Michigan here

Contributor

cyrus- Dec 6, 2019

is there a link to the talk you can include?

src/live-programming.rst Outdated

+              It helps data science workers to present, reproduce, share, and collaborate their analysis.
+              There are many computational notebook platforms designed for different analysis languages and environments, for example, `Apache Zeppelin`_, `Spark Notebook`_, `Observable`_, `RStudio`_, `Wolfram Notebooks`_.
+              Among these computational notebook platforms, `Jupyter Notebook`_ supports more than 40 programming languages and has been widely used for writing and sharing computational narratives in various contexts.
+              It evolved from IPython, which is a terminal-based interactive shell for creating interactive visualizations for scientific computing.

Contributor

cyrus- Dec 6, 2019

see if you can find a paper about IPython to cite

src/live-programming.rst Outdated

+              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+              Although computational notebooks are designed to support not only performing, but also documenting and sharing analysis, most people consider it personal, exploratory, and messy.
+              Rule et al. :cite:`rule:2018` analyzed over 1 million notebooks on GitHub and found them often lack explanatory text.

Contributor

cyrus- Dec 6, 2019

same

src/live-programming.rst Outdated

+                This paper reports a large scale analysis of over 1 million open-source computational notebooks.
+                The results show that only one in four held explanatory text.
+              Kery et al. took an interview approach to study how data scientists kept track of variants they explored in Jupyter notebook :cite:`kery:2018`.

Contributor

cyrus- Dec 6, 2019

same

src/live-programming.rst

+              To address the challenges in informal versioning, they designed Variolite, a code editing tool with local versioning control :cite:`kery:2017`.
+              Variolite is an Atom editor extension that enables users to version a section of the code based on users' selection.
+              A preliminary usability study shows that 9 out of 10 participants found the tool easy to use and all 10 of them would consider use it in real life.
+              They later integrated this design into Jupyter notebook with Verdant :cite:`kery:2019`.

Contributor

cyrus- Dec 6, 2019

same

src/live-programming.rst Outdated


+                This paper explores the design space in notebook code enviroments to help data scientists forage for information in their history.
+              Head et al. took a different design approach :cite:`head:2019`.

Contributor

cyrus- Dec 6, 2019

same

LittleAprilFool added 2 commits

December 16, 2019 20:44


          ✨ final version

315839a


          🐛 fix a small issue

2df3803

cyrus- approved these changes

Contributor

cyrus- left a comment

Looks great! Final grades will be sent by email soon.

Base automatically changed from master to trunk

February 14, 2021 02:52
