Document dated [15, Jan 2023]

This space is for learning notes from the Data Engineering Zoomcamp by DataTalksClub.

DataTalksClub Data Engineering GitHub

DataTalksClub YouTube Playlist 1 - With Airflow (from the 2022 Zoomcamp session)

DataTalksClub YouTube Playlist 2 - With Prefect (ongoing session as of January 2023)

(DataTalksClub)

Objective

These are weekly online sessions focused on developing oneself along a Data Engineering path.

The programme comprises multiple online sessions spread over 6 weeks, each focusing on a different aspect of Data Engineering.

  • Week 1 : Introduction & Prerequisites
  • Week 2 : Ingestion & Orchestration
  • Week 3 : Data Warehouse
  • Week 4 : Analytics Engineering
  • Week 5 : Batch Processing
  • Week 6 : Stream Processing

At the end of the programme, we will focus on a standalone project: building a data pipeline based on what we have learned, in order to get certified.

Technologies We will Learn

Week 1 Learning

Required Software Installation

  • Docker
  • Google Cloud CLI
  • Python

Docker

What is Docker?

Docker-compose

Google Cloud Platform

Technologies

  • Technologies and sections which we will learn here
    • GCP (Google Cloud Platform) - Data Lake
    • BigQuery - Data Warehouse
    • Docker - Containerization
    • Spark - Distributed Processing
    • Kafka - Streaming
    • dbt - Data Transformation
    • SQL - Data Analysis & Exploration
    • Airflow/Prefect - Pipeline Orchestration

FAQs

  • How much time do we need to spend on average?

    • 3 to 4 hours a week should suffice, though this varies with everyone's own pace.
  • Are we building a project through this course?

    • Yes. During the course we will build a project with New York taxi data and complete homework; at the end of the course we will build our own project to get certified.
  • How can I get internships?

    • Get to know the current courses, build a strong foundation, and showcase your work on LinkedIn, GitHub, and other public forums where recruiters can see your interest in the field/domain.
  • Which cloud technologies can we use?

    • GCP is the one we use predominantly, but it is also OK to use AWS. The main reason for choosing GCP is the easy accessibility of BigQuery. (To do: list the alternatives for these technologies.)
  • Thoughts related to interview perspective and data structures

    • Learning algorithms and data structures helps a lot, both in the field and in interviews.

Book Recommendations

Designing Data-Intensive Applications - Martin Kleppmann

Database Internals - Alex Petrov

Reference

DataTalksClub (no date) DataTalksClub/data-engineering-zoomcamp: Free Data Engineering Course!, GitHub. Available at: https://github.com/DataTalksClub/data-engineering-zoomcamp

Issues faced and fixes

OpenJDK Issues

  • If running java -version on a Mac M2 returns "The operation couldn’t be completed. Unable to locate a Java Runtime.", the OpenJDK installed via brew did not create its symlink properly. To fix this, run the command below.

sudo ln -sfn $(brew --prefix)/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
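
To confirm that the symlink is in place after running the command above, a quick check can be sketched as follows. This is a minimal sketch and not part of the course material: the function name `check_jdk_link` is hypothetical, and the default path assumes the same OpenJDK 11 location used in the `ln` command.

```shell
#!/bin/sh
# Hypothetical helper (not from the course material): reports whether a path
# is a symlink. With no argument it checks the OpenJDK 11 location that the
# ln command above writes to (an assumption carried over from that command).
check_jdk_link() {
    link_path="${1:-/Library/Java/JavaVirtualMachines/openjdk-11.jdk}"
    if [ -L "$link_path" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

check_jdk_link
```

If this reports that the link is present, running java -version again should show the installed OpenJDK instead of the "Unable to locate a Java Runtime" message.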

Docker installation Issues

  • Use brew install docker to install Docker on a Mac M1/M2. If you then get the daemon error "Cannot connect to the Docker daemon" on macOS, install docker-machine with brew install docker-machine.
  • Attaching some useful links for the docker installation
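
The "Cannot connect to the Docker daemon" error means the docker client is installed but cannot reach a running daemon. A minimal sketch for telling those situations apart is shown below. The function name `docker_status` is hypothetical, and the sketch assumes that `docker info` exits with a non-zero status when the daemon is unreachable, which is the behaviour of recent Docker CLI versions.

```shell
#!/bin/sh
# Hypothetical helper (not from the course material): reports whether the
# docker CLI is installed and, if so, whether it can reach a running daemon.
# Assumes `docker info` exits non-zero when the daemon cannot be reached.
docker_status() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker CLI not installed"
    elif docker info >/dev/null 2>&1; then
        echo "daemon reachable"
    else
        echo "daemon not reachable"
    fi
}

docker_status
```

A "daemon not reachable" result corresponds to the error described above: the client is installed but no daemon is running or configured for it.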