Document dated [15, Jan 2023]
This space is for notes on the Data Engineering Zoomcamp by DataTalksClub.
DataTalksClub Data Engineering GitHub
DataTalksClub YouTube Playlist 1 - With Airflow (from the 2022 Zoomcamp session)
DataTalksClub YouTube Playlist 2 - With Prefect (ongoing session as of Jan 2023)
It is a series of weekly online sessions focused on developing oneself along a Data Engineering path.
The programme comprises multiple online sessions spread over 6 weeks, each focusing on a different aspect of Data Engineering.
- Week 1 : Introduction & Prerequisites
- Week 2 : Ingestion & Orchestration
- Week 3 : Data Warehouse
- Week 4 : Analytics Engineering
- Week 5 : Batch Processing
- Week 6 : Stream Processing
At the end of the programme we focus on a standalone project: building a data pipeline based on what we have learned, in order to get certified.
Required Software Installation
- Docker
- Google Cloud CLI
- Python
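A quick way to confirm the three tools above are installed is to look for their executables on the PATH. This is a minimal sketch, assuming the usual executable names (`docker`, `gcloud` for the Google Cloud CLI, and `python3`); `check_tool` is a hypothetical helper name, not part of the course material.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper for these notes.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Assumed executable names for Docker, Google Cloud CLI and Python.
for tool in docker gcloud python3; do
    check_tool "$tool"
done
```

A "missing" line only means the executable is not on the PATH of the current shell; the tool may still be installed elsewhere.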
What is Docker?
Docker-compose
Technologies and topics we will learn here
- GCP (Google Cloud Platform) - Data Lake
- BigQuery - Data Warehouse
- Docker - Containerization
- Spark - Distributed Processing
- Kafka - Streaming
- dbt - Data Transformation
- SQL - Data Analysis & Exploration
- Airflow/Prefect - Pipeline Orchestration
How much time do we need to spend on average?
- 3 to 4 hours a week should suffice, though this varies with everyone's own pace.
Are we building a project through this course?
- During the course we build a project with New York taxi data and complete homework. At the end of the course we build our own project to get certified.
How can I get internships?
- Follow current courses to build a strong foundation, and showcase your work on LinkedIn, GitHub and other public forums that recruiters can access, so they can see your interest in the field/domain.
What cloud technologies can we use?
- GCP is the one we use predominantly, but using AWS is also fine. The main reason for choosing GCP is the easy accessibility of BigQuery. (To do: point out the alternatives for each technology.)
Thoughts on interviews and data structures
- Learning algorithms and data structures helps a lot, both in the field and in interviews.
Designing Data Intensive Applications - Martin Kleppmann
Database Internals - Alex Petrov
DataTalksClub (no date) DataTalksClub/data-engineering-zoomcamp: Free Data Engineering Course!, GitHub. Available at: https://github.com/DataTalksClub/data-engineering-zoomcamp
OpenJDK Issues
- When I run
java -version
on a Mac M2 and get the error "The operation couldn’t be completed. Unable to locate a Java Runtime.", it means the OpenJDK installed from brew was not symlinked properly, so the system Java wrappers cannot find it. To fix this, run the command below.
sudo ln -sfn $(brew --prefix)/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
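To see whether the symlink from the command above is in place, its path can be inspected before re-running `java -version`. This is a small sketch; `jdk_link_status` is a hypothetical helper name, and the path is the one used in the fix above.

```shell
#!/bin/sh
# Print "linked" if the path is a symlink whose target exists,
# "dangling" if it is a symlink pointing nowhere,
# and "no symlink" otherwise.
# jdk_link_status is a hypothetical helper for these notes.
jdk_link_status() {
    if [ -L "$1" ] && [ -e "$1" ]; then
        echo "linked"
    elif [ -L "$1" ]; then
        echo "dangling"
    else
        echo "no symlink"
    fi
}

# Path taken from the ln command above.
jdk_link_status /Library/Java/JavaVirtualMachines/openjdk-11.jdk
```

If this prints "dangling", the likely cause is that the brew path on the left-hand side of the `ln` command does not exist, for example because a different OpenJDK version is installed.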
Docker installation Issues
- Use
brew install docker
to install Docker on a Mac M1/M2. In case of the daemon error "Cannot connect to the Docker daemon" on macOS, install docker-machine as well:
brew install docker-machine
- Some useful links for the Docker installation:
- Docker installation by Vivek Suresh
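The daemon error mentioned above can be checked for directly before starting the course exercises. The sketch below assumes that `docker info` exits with a non-zero status when the client cannot reach the daemon; `daemon_status` is a hypothetical helper name, and its optional argument exists only so the function can be tried without Docker installed.

```shell
#!/bin/sh
# Print whether the Docker daemon answers the client.
# daemon_status is a hypothetical helper; the client name defaults to
# "docker" and is a parameter only to make the function easy to exercise.
daemon_status() {
    client="${1:-docker}"
    if "$client" info >/dev/null 2>&1; then
        echo "daemon reachable"
    else
        echo "daemon not reachable"
    fi
}

daemon_status
```

Note that "daemon not reachable" is also printed when the `docker` client itself is not installed, so it is worth confirming the client is on the PATH first.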