forked from r4bds/r4bds.github.io
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproject_faq.qmd
289 lines (175 loc) · 18.5 KB
/
project_faq.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
# Project FAQ {.unnumbered #sec-projfaq}
## Text in the Quarto Docs
### How much text should we add beyond code?
- The emphasis for the communication of the course is on the presentation. Hence, think of the report text as "What information would I like to have / need in order to follow the code and project?"
### What should the narrative be?
- Do not think of the text as constituting a report in a classical sense, as you have likely done in other courses. Rather, the text constitutes the underpinnings of a technical report, i.e. the aim is to support the code and clarify what is going on enabling the reader to easily follow the code
## Data
### Is this data set ok to use?
- Where can I find data? Rethink the question: Discuss in your group what are your interests and then find data related to your problem, there are literally terabytes of publicly available data, i.e. don’t just google “data”, be specific
- In order for you to demonstrate that you master the entire bio data science cycle, you must choose a bio data set, which requires cleaning and tidying
- It would be good, if the data set requires joining, if not, consider artificially splitting it, to demonstrate that you master joining
- One approach could be to work on reproducing results from a paper. Here, you can even consider improving the data visualisations or adding something extra
- Do not use a data set from one of the course exercises. Doing so would not allow you to demonstrate that you are independently capable of performing a bio data science analysis
- You are 100% free to choose bio data from any resource as long as it meets the above requirements
- Be sure to understand that the purpose of the project is for you to synthesise an end-to-end Bio Data Science project demonstrating, that you have met the learning objectives of the course, as stated at the individual labs and on the DTU course base. Hence and importantly, your data must allow you to do so!
### How should we handle NAs in the data?
- Depends, does your data set come with a readme, which allows you to make a decision regarding NAs?
- Be careful not to simply drop all NAs as you might drop observations, where columns of interest have complete data
- Whether to remove NAs or not cannot be universally answered, as it is specific to the challenge / data at hand. If it's meaningful for you to keep NAs, then do so, if not, then remove them.
### How should we handle binary variables?
- In case your variable contains categories, i.e. values, where the question “Is one larger than the other” is nonsensical, then encode as factors
### How do we handle nonsensical data points?
- If you believe the data points are nonsensical, e.g. manual typing errors, then exclude them, but be very clear about which data (groups of) points were excluded and why
### Can we subset the data?
- Yes, if you are particularly interested in a subset of the data, then that is perfectly fine. Just be aware, that any choice you take with the data, you must be able to account for why you took that choice
- You should NOT just sample 100 observations as was done in the exercises to reduce runtime. However, if your data set is very large, consider down-sampling either randomly or by some metric of association. For the latter e.g. perform a test and identify the top X observations, save those to file and continue from there.
- For large data, try appending “.gz” to your files, when you write them using `write_tsv()`, this will invoke gzip-compression
## Input/output files
### How can we get a better overview of the data file flow?
- Consider creating a flow chart of “your data journey”. It will force you to think about how files are connected and how input/output flows, check out e.g. [draw.io](http://draw.io/) for this
### How can we write a file with e.g. factor encoding or a nested tibble?
- This can only be done by writing an R-object. However, this is for future reference - in this course if at all possible, stick to flat text files, e.g. .tsv or .csv
### How do we handle different naming conventions?
- **DO NOT** do a manual search and replace in Excel. That defeats the whole purpose of this course
- Fix names using a programmatic approach. The `stringr`-package is extremely useful in this context
### Where should we place external files?
- Consider creating an “images” folder in the “doc” folder, from where you can input images to your presentation
## Modelling
### What should we include in the modelling part?
- We have worked with fitting a linear regression and we have done a PCA. You could do something along those lines or something of similar complexity
- This is **not** a modelling course. We briefly visited modelling to bridge the process of going from the raw data to analysis ready and then communication via data visualisation. Therefore, do not start fitting a full fledged machine/deep learning model, it is a time-void, which you cannot “afford” to get sucked into
- Also, mind that the focus of this course is to make you realise that even though you have been trained in delivering results, then the process of arriving at the results in a reproducible manner is equally important - The process is the product!
### Do we HAVE to do a PCA?
- Does it make sense to do a PCA-analysis in your project? If you have group labels and you want to visualise to see if there is a separation, then a PCA is a good first step
### Our model is not "performing very well". What should we do?
- Leave it as is. The focus of this course is not on the results, but on the reproducible process of arriving and communicating said results
## Coding
### What goes in “augment”?
- If you google “Augment“, you will get “make (something) greater by adding to it; increase”, in other words, when you add e.g. variables to your data, which was not there initially, e.g. like we did, when we calculated the BMI from existing information on weight and height
- You should think of the augment script as the place, where you create your "database" for everything that happens afterwards, i.e. all your analyses and ideas
- Create that “database” and then load it into your downstream scripts
- Augment == Adding something
- Be aware, that if you downstream discover a need to create a variable, you will need to go back to the augment and update
### Can we mix base R and tidyverse?
- No, you might as well get used to it - Tidyverse all the way! (This is a tidyverse course)
- If you use base, like e.g. `my_data$my_var <- 1:4` or `my_data$my_var[1]` or similar, then you will get points deducted in the evaluation of your project
- Again, the process is the product and it matters much if you are using the course correct dialect of `R`. What you choose to do after the course is your choice
- Using base R, where course content contains describes the tidyverse alternative is no different from using python and I suppose you wouldn'e expect to be able to hand in python code as the final product in an R course
### What if we can’t make it work in tidyverse but only in base R?
- If you can make it work in base R, you can make it work in tidyverse
### Is it okay to use e.g. the base function sum()?
- Yes, think of it this way: R has as core functionality to do statistics. Tidyverse is for performing the data manipulation surrounding these calculations. Therefore, using functions such as sum(), mean(), sd(), etc. is naturally fine
- E.g. recall, when we did the `group_by()` $\rightarrow$ `summarise()`-workflow, we used the descriptive statistics functions
- Also, we used the `lm()`-function for fitting a linear regression.
- `Tidyverse` aims to replace the inconsistent and low legibility of base R with respect to data manipulation and visualisation - It does not replace the core functionality of doing statistics
### How do we know if there is a tidyverse function we should use rather than base?
- Identify the general area of what you’re working with. E.g. if strings, then go to your RStudio session and find the console and type `stringr::` and hit tab, then you can look through the functions
- Look and search in the [R4DS2e](https://r4ds.hadley.nz)-book
- Ask on the [Posit community pages](https://community.rstudio.com)
- Also generally base will use e.g. `read.table()`, where tidyverse will use `read_table()`, i.e. a period vs. an underscore
### Can we use package X for analysis/visualisation?
- If the package performs a lot of the work, which you are to demonstrate that you can do, then absolutely no. E.g. **DO NOT** use packages, where you call a plot-my-data-function and the ggplot-magick happens that will **NOT** allow you to demonstrate that you have met the course learning objectives
- For an example of and-then-magic-happens, see e.g. the `corrplot`-package
- Basically, any package, which automatically does "things", that are part of the learning objectives, will not allow you to demonstrate, that you have met the learning objectives!
### How do we run the entire project incl. the presentation?
- Make sure to include a programmatic call to render a file, i.e. in your run-all qmd file, when you include sub-documents and run the run-all file, the sub-files will not be generated, hence you need to include a programmatic call in your run-all file, which will also render the seperate sub-documents
- Note, you can include one Quarto document in another
### Should we create a Shiny app?
- The project deliverables are the GitHub repo and your presentation
- Remember, you do this project not for me as "your teacher", but for you to internalize the knowledge you have been exposed to during the initial 10 weeks of teaching
- If you find Shiny interesting/fun, then by all means, please do create an app
- Note, Shiny apps are optional and will not give extra credit - Be careful with your time!
### Should we create a package?
- If you find Rpackages interesting/fun, then by all means, please do create an Rpackage
- Note, Rpackages are optional and will not give extra credit - Be careful with your time!
### When should we create functions?
- Remember DRY (Don't-Repeat-Yourself), so generally, if you do something more than once, it is a function
- However, we do not focus much on functions in this course, so don’t put too much effort into creating functions
- Place functions in `99_proj_functions.R` as illustrated in the overview
- **DO NOT** hide your code away in functions, so that your main script just becomes 10 function calls. For this process-oriented course, show the code in your main `qmd`-documents, i.e. 01_, 02_, …
### Can we use loops?
- No, you should instead embrace functional programming and use functions from the `purrr`-package. Revisit lab 6, if needed
### Can we directly copy/paste a code chunk from an online source?
- No, that would be plagiarism. Understand the steps in the chunk and make the code your own
- I acknowledge that for coding this is not completely black and white, but please refrain from a direct copy/paste
### Can we use chatGPT or similar for coding?
- No, of course not, the coding has to be done by the students, not an AI
- How would you know if we used an AI? I may not be able to tell or I may be, I wouldn't recommend taking the risk
- Once through this course, AIs can be a powerful allied, but it requires skills and experience to use an AI productively, otherwise you will end up in "traps"
## Coding style
### How should we comment on our code?
- Remember, the point of writing verbose tidyverse code is that the code-becomes-the-comments. Think of it this way: The pipeline is the text on a page in a book, so before your pipeline, put a header/title on what is happening below, just as a title/header in a book
- The title/header will be what is generally going on, e.g. “Normalise all gene expression values using standard score approach” and the pipeline will be how that is actually done
- The pipe “%>%” is pronounced as “then”, when you “read” your code
### How should we style our code?
- Strictly adhere to the [Course Style Guide](code_styling.qmd) as introduced in the course
- Make sure that all scripts are styled the same way and be consistent in your coding style
- Your code should not be >80 chars wide, follow the vertical line in your editor
### How should we style our plots?
- Be concise, “less is more”
- Make some nice and relevant plots. Show us that you’ve learned data visualisation. Remember legends, titles etc.
- Do not put 3 messages in 1 plot. Instead put each message into a different plot
- Be **very careful** with matching the text size in your plots to that of your presentation (trial-and-error)
## GitHub
### Where should we place the project?
- The project must be placed on the course GitHub `rforbiodatascienceXX`, where `XX` is the course year
### Should our GitHub be public or private?
- Public, show the world what you can do!
### Should we keep all the data on GitHub?
- No, GitHub is meant for code, not data, see the project organisation
### What goes in the "doc" folder?
- The project organisation chart is a generic figure. You should not hand in a report. Your deliverable/product outcome for the project period is: 1) Your GitHub repository and 2) A presentation. Therefore, the "doc" folder would in this case contain your presentation
### Why are there multiple “qmd-reports” in the “results” folder?
- In case you have a computationally intensive part of the project it can be an advantage to split into sub-reports and then collect in a master report (think e.g. large LaTeX reports)
- Overview, even if your report is not computationally intensive, then it may be long and splitting into sub-reports facilitate better overview
### Should we include a README on GitHub?
- That would be a good way to briefly introduce what the contents of this repo is and something one would usually do
### Should everything be on Github, also plots we don’t present?
- Yes, everything!
## Exam/hand-in deliverables
### What does "Independently identify and adapt relevant novel state-of-the-art data science tools" mean?
- This is the exact result of completing the project work and my reason for not doing hand-holding in the project period. You are training your competences in that exact learning objective, so that once you leave this course, you will be able to do just that. This is absolute key as a modern bio data scientist. As an example, I programmed in a programming language called Perl, when I did my bioinformatics PhD, in fact the majority of my fellow phd students did that. I spend a lot of time getting really good at Perl... Today, I never use it... As in ever-never! Technology is under constants development and those with the capabilities stated in this LO will have a competitive edge and remain forefront in their working life
### What is the final product?
- A GitHub repository organised according to principles of reproducible data analysis
- A Quarto HTML-presentation following the IMRAD structure as elaborated in the [project description](project_description.qmd)
- This final presentation in HTML-format should be uploaded to DTU Learn
### Do we need to hand in a report?
- No, no report is required!
## Presentation
### Should we include code in our presentation?
- Your presentation should follow the standard scientific IMRAD-structure, i.e. introduction, materials-and-methods, results, and discussion.
- Include certain decisions you took with the data and your reasoning herfore.
- If you found a particular challenging problem in coding, to which you found an elegant solution, include that in your presentation as an example
- The presentation is your chance to practice communicating insights to stakeholders
- Your audience will be same-level bioinformaticians
- Perhaps consider a graphical representation of your process going from raw to analysis-ready data
### Should we include tables or plots?
- What makes sense? If you only have two numerical values, then perhaps a table is fine. If you have 100 observations, then a plot is likely better
- Remember, we are as humans evolutionary encoded to interpret visual information, not numbers
### How do we add plots to the Quarto presentation?
- Output the plot as a png file using the function ggsave and include that png in your presentation. Think dynamically, we don’t want to type anything manually in our presentation
### How do we get the presentation in wide-/full screen?
- Hit w and f while the HTML presentation is loaded
### Should we split our presentation into sub-presentations?
- No, this is not necessary given the extend/size of this project. Just create one `qmd`-file, which outputs an HTML-presentation
### We are working on re-creating paper results. Should we re-create the exact same plots?
- Ideally, you re-create the plots and then you create your own improved version of the visualisation
### Where should we “place” our Shiny app in our presentation?
- Demo it briefly at the end
### How can we illustrate our data handling?
- Consider creating a flow chart, what are input files and how are they connected to final output?
### How much code should we include in the presentation?
- The exam deliverables are two-fold. Code is the GitHub repo and your ability to communicate biological insights is in the presentation. Therefore, code could be included in the presentation if you faced and solved a particularly challenging problem in a clever way
### How do we include data numbers in the presentation?
- You could from your script output a results table with relevant numbers as a .tsv file and then read that into the presentation and extract the numbers
- Remember, think dynamical reporting, what if your input data changes and you manually entered the numbers in your presentation? Then you wouldn’t be much farther than a powerpoint
### What goes into the materials and methods section?
- Materials: What data did you use and where did you get it from?
- Methods: Which modelling did you use? Think of the methods section as a recipe for how to go from raw to results => Flow chart?
### Should we interpret our results?
- Yes, you have to think about it as a “normal” project presentation
### What about all the stuff we tried, which did not work?
- In the presentation, you should focus on what worked and what results you arrived at
- Exclude dead-ends from the presentation, but… Leave them in the repo, perhaps with an initial comment signifying that the following turned out to be a dead-end
- Think about this - Scenario: You’re working in a company and you and your team of 3 spend 6 months on a project, which turns out to be a dead-end. For future reference: Would the company be interested in knowing that this project was a dead end or should you delete everything and never speak of it again?