Distributed Data Mining Practical Course

This repository contains applications developed during the Distributed Data Mining Practical Course at the Technical University of Munich. During the course, I wrote several programs for Big Data processing and data mining in different languages and frameworks. I also used Terraform and Ansible to automate setting up the environment required to run these programs.

Technologies

During the course, I used the following technologies:

Apache Spark SQL - Spark library for data processing
Apache Spark MLlib - Spark library for machine learning
Hadoop Distributed File System (HDFS) - distributed file system for storing large files
Hadoop YARN - resource manager responsible for running and scheduling applications in a cluster
Hadoop MapReduce - framework for processing large amounts of data with the MapReduce programming model
Dask - Python library for parallel computing, from multithreading on a single machine to distributed execution on a cluster
Terraform - tool for automating infrastructure provisioning using Infrastructure as Code
Ansible - tool for automating the configuration of nodes/VMs
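
To illustrate the MapReduce programming model mentioned in the list above, here is a minimal single-process word count sketch in plain Python. It is not taken from this repository and does not use Hadoop; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are hypothetical and only mimic the map, shuffle, and reduce steps that a framework would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three steps in order (hypothetical helper for this sketch)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Made-up sample input, used only for illustration.
    sample = ["big data mining", "distributed data mining"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run on many nodes in parallel and the shuffle moves data between them over the network; this sketch only shows the data flow between the three steps.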

During the course, I used the following languages:

Scala - Spark SQL, Spark MLlib
Java - Hadoop MapReduce, Spark SQL
Python - PySpark, Dask
