Distributed Data Mining Practical Course

This repository contains applications developed during the Distributed Data Mining Practical Course at the Technical University of Munich. During the course, I wrote several programs for Big Data processing and Data Mining using different languages and frameworks. Additionally, I used Terraform and Ansible to automate setting up the environment required to run these programs.

Technologies

During the course, I used the following technologies:

  • Apache Spark SQL - Spark library for data processing
  • Apache Spark MLlib - Spark library for machine learning
  • Hadoop Distributed File System (HDFS) - distributed file system for storing large files
  • Hadoop YARN - resource manager responsible for scheduling and running applications in a cluster
  • Hadoop MapReduce - framework for processing large amounts of data using the MapReduce model
  • Dask - Python library for parallel computing using multithreading and distributed processing
  • Terraform - tool for automating infrastructure provisioning using Infrastructure as Code
  • Ansible - tool for automating node/VM configuration

During the course, I used the following languages:

  • Scala - Spark SQL, Spark MLlib
  • Java - Hadoop MapReduce, Spark SQL
  • Python - PySpark, Dask
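
For readers unfamiliar with the MapReduce model mentioned in the technology list, here is a minimal sketch of its three phases (map, shuffle, reduce) as a word count in plain Python. It is an illustration only: it does not use Hadoop and is not taken from the course programs, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on in-memory data (hypothetical helper)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["spark hadoop", "Hadoop dask"]))
```

In a real cluster the map and reduce steps run in parallel on different nodes and the shuffle moves data over the network; this sketch runs everything in one process to show only the data flow.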