Commit

Corrected spelling mistakes in M5, M6 and M8
LucasLista committed Jan 7, 2025
1 parent eec418d commit 55e5a20
Showing 3 changed files with 32 additions and 32 deletions.
28 changes: 14 additions & 14 deletions s2_organisation_and_version_control/code_structure.md
@@ -27,15 +27,15 @@ or maintain
(PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997

We are here going to focus on the organization of data science projects and machine learning projects. The core
-difference this kind of projects introduces compared to more traditional systems is *data*. The key to modern machine
+difference this kind of project introduces compared to more traditional systems is *data*. The key to modern machine
learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that
data should influence our choice of code structure. If we had another kind of application, then the layout of our
codebase should probably be different.

## Cookiecutter

We are in this course going to use the tool [cookiecutter](https://cookiecutter.readthedocs.io/en/latest/README.html),
-which is tool for creating projects from *project templates*. A project template is in short just an overall structure
+which is a tool for creating projects from *project templates*. A project template is in short just an overall structure
of how you want your folders, files etc. to be organized from the beginning. For this course we are going to be using a
custom [MLOps template](https://github.com/SkafteNicki/mlops_template). The template is essentially a fork of the
[cookiecutter data science template](https://github.com/drivendata/cookiecutter-data-science) that has been used for a
@@ -87,7 +87,7 @@ a lot of projects using `setup.py + setup.cfg`, so it is good to at least know a

=== "pyproject.toml"

-`pyproject.toml` is the new standardized way of describing project metadata in a declaratively way, introduced in
+`pyproject.toml` is the new standardized way of describing project metadata in a declarative way, introduced in
[PEP 621](https://peps.python.org/pep-0621/). It is written in [toml format](https://toml.io/en/) which is easy to
read. At the very least your `pyproject.toml` file should include the `[build-system]` and `[project]` sections:

@@ -159,8 +159,8 @@ a lot of projects using `setup.py + setup.cfg`, so it is good to at least know a
)
```

-Essentially, the it is the exact same meta information as in `pyproject.toml`, just written directly in Python
-syntax instead of `toml`. Because there was a wish to deperate this meta information into a separate file, the
+Essentially, it is the exact same meta information as in `pyproject.toml`, just written directly in Python
+syntax instead of `toml`. Because there was a wish to separate this meta information into a separate file, the
`setup.cfg` file was created which can contain the exact same information as `setup.py` just in a declarative
config.

@@ -173,7 +173,7 @@ a lot of projects using `setup.py + setup.cfg`, so it is good to at least know a
# ...
```

-This non-standardized way of providing meta information regarding a package was essentially what lead to the
+This non-standardized way of providing meta information regarding a package was essentially what led to the
creation of `pyproject.toml`.

Regardless of what way a project is configured, after creating the above files, the correct way to install them would be
@@ -188,7 +188,7 @@ pip install -e .
!!! note "Developer mode in Python"

The `-e` is short for `--editable` mode also called
-[developer mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html). Since we will continuously
+[developer mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html). Since we will be continuously
iterating on our package this is the preferred way to install our package, because that means that we do not have
to run `pip install` every time we make a change. Essentially, in developer mode changes in the Python source code
can immediately take place without requiring a new installation.
@@ -236,20 +236,20 @@ your head around where files are located.
When asked for a project name you should follow the
[PEP8](https://peps.python.org/pep-0008/#package-and-module-names) guidelines for naming packages. This means
that the name should be all lowercase and if you want to separate words, you should use underscores. For example
-`my_project` is a valid name, while `MyProject` is not. Additionally, the packaage name cannot start with a
+`my_project` is a valid name, while `MyProject` is not. Additionally, the package name cannot start with a
number.

??? note "Flat-layout vs src-layout"

There are two common choices on how layout your source directory. The first is called *src-layout*
-where the source code is always place in a `src/<project_name>` folder and the second is called *flat-layout*
-where the source code is place is just placed in a `<project_name>` folder. The template we are using in this
+where the source code is always placed in a `src/<project_name>` folder and the second is called *flat-layout*
+where the source code is just placed in a `<project_name>` folder. The template we are using in this
course is using the src-layout, but there are
[pros and cons](https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/) for both.

3. After having created your new project, the first step is to also create a corresponding virtual environment and
-install any needed requirements. If you have a virtual environment from yesterday feel free to use that else create
-an new. Then install the project in that environment
+install any needed requirements. If you have a virtual environment from yesterday feel free to use that, otherwise create
+a new one. Then install the project in that environment

```bash
pip install -e .
@@ -270,7 +270,7 @@ your head around where files are located.
5. This template comes with a `tasks.py` which uses the [invoke](https://www.pyinvoke.org/) framework to define project
tasks. You can learn more about the framework in the last optional [module](cli.md) in today's session. However, for
now just know that `tasks.py` is a file that can be used to specify common tasks that you want to run in your
-project. It is similar to `Markefile`s if you are familiar with them. Try out some of the pre-defined tasks:
+project. It is similar to `Makefile`s if you are familiar with them. Try out some of the pre-defined tasks:
```bash
# first install invoke
@@ -350,7 +350,7 @@ your head around where files are located.
12. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)
-13. (Optional) Lets say that you are not satisfied with the template I have recommended that you use, which is
+13. (Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is
completely fine. What should you then do? You should of course create your own template! This is actually not that
hard to do.

28 changes: 14 additions & 14 deletions s2_organisation_and_version_control/dvc.md
@@ -8,15 +8,15 @@

!!! warning

-Since August 2024, Google have changed their policy for the Google Drive API. This means that the proceduce for
-setting up DVC with Google Drive has changed. The following exercises therefore needs extra authentication to work.
+Since August 2024, Google has changed their policy for the Google Drive API. This means that the procedure for
+setting up DVC with Google Drive has changed. The following exercises therefore need extra authentication to work.
You therefore have two options:

1. Skip these exercises for now. We are going to revisit DVC later in the course when we get access to a more
permanent storage solution in this [module](../s6_the_cloud/using_the_cloud.md).

2. Follow the instructions below to authenticate DVC with Google Drive. As a starting point read the following
-[Github issue](https://github.com/iterative/dvc/issues/10516#issuecomment-2289652067) and then follow the
+[GitHub issue](https://github.com/iterative/dvc/issues/10516#issuecomment-2289652067) and then follow the
instructions
[here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#using-a-custom-google-cloud-project-recommended).
for setting up a custom Google Cloud project.
@@ -34,7 +34,7 @@ Because this is an important concept there exist a couple of frameworks that hav
[DVC](https://dvc.org/), [DAGsHub](https://dagshub.com/), [Hub](https://www.activeloop.ai/),
[Modelstore](https://modelstore.readthedocs.io/en/latest/) and [ModelDB](https://github.com/VertaAI/modeldb/).
Regardless of what framework, they all implement somewhat the same concept: instead of storing the actual data files
-or in general storing any large *artifacts* files we instead store a pointer to these large flies. We then version
+or in general storing any large *artifacts* files we instead store a pointer to these large files. We then version
control the point instead of the artifact.

<figure markdown>
@@ -45,7 +45,7 @@ control the point instead of the artifact.
</figure>

We are in this course going to use `DVC` provided by [iterative.ai](https://iterative.ai/) as they also provide tools
-for automatizing machine learning, which we are going to focus on later.
+for automating machine learning, which we are going to focus on later.

## DVC: What is it?

@@ -147,7 +147,7 @@ it contains excellent tutorials.
`dvc` converts the data into [content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
which makes data much faster to get. Finally, make sure that your data is not stored in your GitHub repository.

-After authenticating the first time, DVC should be setup without having to authenticate again. If you for some
+After authenticating the first time, DVC should be set up without having to authenticate again. If you for some
reason encounter that DVC fails to authenticate, you can try to reset the authentication. Locate the file
`$CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json` where `$CACHE_HOME` depends on your operating system:

@@ -163,7 +163,7 @@ it contains excellent tutorials.

Delete the complete `{gdrive_client_id}` folder and retry authenticating with `dvc push`.

-9. After completing the above steps, it is very easy for others (or yourself) to get setup with both
+9. After completing the above steps, it is very easy for others (or yourself) to get set up with both
code and data by simply running

```bash
@@ -177,7 +177,7 @@ it contains excellent tutorials.

10. Let's now look at the process of creating a new version of our data. We are going to add some new data to our
dataset and version control this as well. The new data can be downloaded from this
-[Google Driver folder](https://drive.google.com/drive/folders/1JTjbom7IrB41Chx6uxLCN16ZwIxHHVw1?usp=sharing)
+[Google Drive folder](https://drive.google.com/drive/folders/1JTjbom7IrB41Chx6uxLCN16ZwIxHHVw1?usp=sharing)
or by running these two commands:
```bash
@@ -186,7 +186,7 @@ it contains excellent tutorials.
```
Copy the data to your `data/raw` folder and then rerun your data pipeline to incorporate the new data into the
-files in your `processed` folder. The new data should are 4 files with train images and 4 files with train targets,
+files in your `processed` folder. The new data should be 4 files with train images and 4 files with train targets,
a total of 20000 additional observations.
11. Redo the above steps, adding the new data using `dvc`, committing and tagging the metafiles e.g. the following
@@ -211,7 +211,7 @@ it contains excellent tutorials.
your model checkpoints.

In general `dvc` is a great framework for version-controlling data and models. However, it is important to note that it
-does have some performance issue when dealing with datasets that consist of many files. Therefore, if you are ever
+does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever
working with a dataset that consists of many small files, it can be a
[good idea to](https://fizzylogic.nl/2023/01/13/did-you-know-dvc-doesn-t-handle-large-datasets-neither-did-we-and-here-s-how-we-fixed-it):

@@ -228,10 +228,10 @@ working with a dataset that consists of many small files, it can be a
??? success "Solution"

Similar to a git repository having a `.git` directory, a repository using dvc needs to have a `.dvc` folder.
-Alternatively you can you the `dvc status` command.
+Alternatively you can use the `dvc status` command.

2. Assume you just added a folder called `data/` that you want to track with `dvc`. What is the sequence of 5 commands
-to successful version control the folder? (assuming you already setup a remote)
+to successfully version control the folder? (assuming you already set up a remote)

??? success "Solution"

@@ -246,6 +246,6 @@ working with a dataset that consists of many small files, it can be a
That's all for today. With the combined power of `git` and `dvc` we should be able to version control everything in
our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that `dvc`
offers more than just data version control, so if you want to deep dive into `dvc` we recommend their
-[pipeline](https://dvc.org/doc/user-guide/project-structure/pipelines-files) feature and how this can be used to setup
-version controlled [experiments](https://dvc.org/doc/command-reference/exp). Note that we are going to revisit `dvc`
+[pipeline](https://dvc.org/doc/user-guide/project-structure/pipelines-files) feature and how this can be used to set up
+version-controlled [experiments](https://dvc.org/doc/command-reference/exp). Note that we are going to revisit `dvc`
later for a more permanent (and large-scale) storage solution.
8 changes: 4 additions & 4 deletions s2_organisation_and_version_control/git.md
@@ -103,7 +103,7 @@ working together on the same project.

### ❔ Exercises

-1. In your GitHub account create an repository, where the intention is that you upload the code from the final
+1. In your GitHub account create a repository, where the intention is that you upload the code from the final
exercise from yesterday

1. After creating the repository, clone it to your computer
@@ -240,7 +240,7 @@ working together on the same project.
4. Finally, commit the merge and try to push.
8. (Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning.
-However, if you are using a proper editor they also have build in support for version control. We recommend getting
+However, if you are using a proper editor they also have built-in support for version control. We recommend getting
familiar with these features (here is a tutorial for
[VS Code](https://code.visualstudio.com/docs/editor/versioncontrol))
@@ -250,7 +250,7 @@ working together on the same project.
??? success "Solution"
-You can check if there is a ".git" directory. Alternative you can use the `git status` command.
+You can check if there is a ".git" directory. Alternatively you can use the `git status` command.
2. Explain what the file `gitignore` is used for?
@@ -288,7 +288,7 @@ That covers the basics of git to get you started. In the exercise folder you can
with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub:
in browser editor. Sometimes you have a small edit that you want to make, but still would like to do this in a
IDE/editor. Or you may be in the situation where you are working from another device than your usual developer machine.
-GitHub has an built-in editor that can simply be enabled by changing any URL from
+GitHub has a built-in editor that can simply be enabled by changing any URL from
```bash
https://github.com/username/repository
