GitHub - kidae92/data_engineer_should_know: Things a data engineer should know

This repository collects everything a data engineer should know. For sources, see the bottom of each document.

The documents in this repository are also available on the tech blog (https://dhkdn9192.github.io).

1. Data Engineering

Technical questions a data engineer should be able to answer

1-1. Hadoop Ecosystem

Apache Hadoop
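- If the HDFS replication factor is changed from 3 to 5, how many failures can be tolerated at most?
- Number of JournalNode failures that can be tolerated
- Erasure Coding in Hadoop 3.x
- Why YARN was introduced
- HA consensus of HDFS
- The process for detecting and handling corrupted blocks
- Parquet and columnar storage
- Parquet compression algorithms
- Standby Namenode vs Secondary Namenode
- YARN scheduler
- HDFS read/write/replication procedures
- Differences between SQL on an RDBMS and Hadoop MapReduce
- MapReduce spilling
- The vm.swappiness setting on Hadoop servers
- Which XML configuration file should be modified to set HDFS write options on the client?
- How can a clustered service be updated without downtime? (Rolling Restart)
- Differences between WebHDFS and HttpFS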
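
For the first two questions in the list above, the underlying arithmetic can be sketched as follows. This is a minimal illustration, not the answer given in the linked documents: it assumes every replica of a block sits on a different DataNode, counts only whether at least one copy survives, and assumes the JournalNodes need a strict majority to accept edits. The function names are hypothetical.

```python
def max_replica_losses(replication_factor: int) -> int:
    """Replicas of one block that can be lost while one copy still survives.

    Assumption: each replica is stored on a different DataNode.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


def max_journalnode_failures(num_journalnodes: int) -> int:
    """JournalNode failures tolerated while a strict majority stays up."""
    if num_journalnodes < 1:
        raise ValueError("need at least one JournalNode")
    return (num_journalnodes - 1) // 2


if __name__ == "__main__":
    for rf in (3, 5):
        print(f"replication factor {rf}: survives {max_replica_losses(rf)} replica losses")
    for n in (3, 4, 5):
        print(f"{n} JournalNodes: tolerates {max_journalnode_failures(n)} failures")
```

Under these assumptions an even number of JournalNodes tolerates no more failures than the odd number just below it, which is why odd-sized quorums are the usual choice.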
Apache Spark
- RDD, DataFrame, Dataset
- SparkContext and SparkSession
- Memory structure of a Spark Executor
- Performance comparison of Scala UDFs and Python UDFs in PySpark
- Performance differences among the Spark APIs by language
- Custom partitioning of RDDs
- RDD Aggregation: groupByKey vs reduceByKey
- Differences between repartition and coalesce
- Spark access first n rows: take() vs limit()
- Efficient DataFrame join strategies
- Spark's memoryOverhead setting and OutOfMemoryError
- "Exceeding memory limits" problems that can be solved just by raising memoryOverhead (Parquet)
- How do the spark.executor.memoryOverhead and spark.memory.offHeap.size settings differ?
- What are the main Spark performance improvements from Project Tungsten?
- Java serialization vs Kryo serialization
- Data source formats available in Spark, such as ORC and Parquet, and their compression algorithms
- When running a Spark job on k8s, how should logs be checked after the job finishes? (Spark History Server? AWS S3 logging?)
- What happens if a Spark job is allocated far too much memory/CPU?
- What is Spark bucketing?
Apache Flink
- Batch processing and stream processing
Apache Druid
- Key features of Druid
- Druid architecture
Apache HBase
- Major Compaction vs Minor Compaction
- Region Server architecture
- Time series Row key design: Salting, Empty region
- Region's locality
Apache Hive
- Partition, Bucket, Index
- Why isn't the metastore in hdfs?
- Which is faster, SORT BY or ORDER BY in HiveQL?
- What is HCatalog?
- What is a Hive UDF?
Apache Impala
- The PARQUET_FALLBACK_SCHEMA_RESOLUTION option for schema changes when using Impala with Parquet
Apache Kafka
- Is it better for Kafka to have more partitions?
- Kafka Streams Topology
- The role of ZooKeeper in Kafka
- Kafka + Spark Streaming: comparison of the two integration approaches
- Kafka + Spark Streaming: choosing the number of partitions and consumers
- Exactly-once delivery in Kafka
- Monitoring Kafka lag with Burrow and Telegraf
- ISR (In Sync Replica)
- What is Kafka's Controller Broker (KafkaController)?
- dead letter queue
Apache Oozie
- Pain points encountered while using Oozie
Apache Airflow
- Executor Types: Local vs Remote (link)
- Celery concepts and the Celery Executor
CDH setup
- ~~Set up Virtual Box~~
- ~~Install Cloudera Manager~~
Common Questions

1-2. ELK Stack

1-3. Kubernetes and Docker

Docker
- Container vs VM
- Difference between Docker and process
Kubernetes Cluster
- Pod
- Replica Set
- Deployment
- Service
- Namespace

1-4. AWS

Amazon EC2
Amazon S3
- S3 vs EFS vs EBS
- Differences between s3, s3n, and s3a
Amazon Redshift
- Things Amazon Redshift does not support
Amazon EMR
- Node Types: Master, Core, Task Nodes

2. Computer Science

2-1. Operating System

2-2. Database

2-3. Network

2-4. Programming Language

Java
- Differences between interfaces and abstract classes, and polymorphism
- JVM, JIT Compiler, GC
- GC summary
- Java memory leaks
- On-heap and off-heap
- Why use StringBuffer or StringBuilder instead of String
- static declarations and GC
- Primitive type, Reference type, Wrapper class
Scala
- Functional programming properties of Scala
- Pass-by-name in Scala
- Companion Object
- Case class
Python
- GIL(Global Interpreter Lock)

2-5. Data Structure and Algorithm

Array vs Linked List
Stack and Queue
- Implementing a Queue with Stacks
Tree
- Binary Search Tree (BST)
- AVL Tree
- Heap
Hash Table
Graph
- Dijkstra algorithm
Sorting
Recursion
Dynamic Programming
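
The queue-from-stacks item listed above under Stack and Queue can be sketched with two Python lists used strictly as stacks. This is one common approach, not necessarily the one the linked document takes; the class and method names are hypothetical.

```python
class StackQueue:
    """FIFO queue built from two LIFO stacks (Python lists)."""

    def __init__(self):
        self._inbox = []   # receives newly enqueued items
        self._outbox = []  # holds items in dequeue order

    def enqueue(self, item):
        self._inbox.append(item)

    def dequeue(self):
        # Refill the outbox only when it is empty; popping everything
        # from the inbox reverses the order, so the oldest item ends on top.
        if not self._outbox:
            while self._inbox:
                self._outbox.append(self._inbox.pop())
        if not self._outbox:
            raise IndexError("dequeue from empty queue")
        return self._outbox.pop()

    def __len__(self):
        return len(self._inbox) + len(self._outbox)


if __name__ == "__main__":
    q = StackQueue()
    for value in (1, 2, 3):
        q.enqueue(value)
    print(q.dequeue(), q.dequeue(), q.dequeue())
```

Each item is pushed and popped at most twice in total, so enqueue and dequeue are O(1) amortized even though a single dequeue may move many items.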

2-6. Common Sense

MVC Pattern
Object-oriented concepts and terms: DTO, DAO, VO
Idempotence
Testing tools and procedures
Measuring traffic/transaction volume
Why use the Singleton pattern

3. GoF Design Pattern and Architecture Pattern

GoF (Gang of Four) refers to the authors of the 1995 book "Design Patterns: Elements of Reusable Object-Oriented Software": Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides.

4. Designing Data-Intensive Applications

Notes on the design of data-intensive applications

OLTP와 OLAP

5. Fields of Study

A summary of research areas of interest, such as machine learning and data analysis, and of completed projects

Anomaly Detection
Churn Prediction
NLP
Recommender System
ideas
- A tool for managing Python packages uniformly across every node of a PySpark cluster
- A streaming version of Apache Nutch: a Spark-based web crawler
