This repository contains applications developed during the Distributed Data Mining Practical Course at the Technical University of Munich. During the course, I wrote several programs for Big Data processing and Data Mining using different languages and frameworks. Additionally, I used Terraform and Ansible to automate setting up the environment required to run these programs.
While working on the projects, I used the following technologies:
- Apache Spark SQL - Spark library for data processing
- Apache Spark MLlib - Spark library for machine learning
- Hadoop Distributed File System (HDFS) - distributed file system for storing large files
- Hadoop YARN - resource manager responsible for running and scheduling applications in a cluster
- Hadoop MapReduce - framework for processing large amounts of data using the MapReduce model
- Dask - Python library for parallelizing computations using multithreading and distributed processing
- Terraform - tool for automating infrastructure provisioning using Infrastructure as Code
- Ansible - tool for automating the configuration of nodes/VMs
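Running a real Hadoop MapReduce job requires a cluster, but the programming model itself is simple: a map phase emits key-value pairs, the framework shuffles them by key, and a reduce phase aggregates each group. As a rough sketch of that model (the function names here are illustrative, not Hadoop's Java API), the classic word-count job can be expressed in plain Python:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data mining", "data mining with spark"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
result = reduce_phase(shuffle(pairs))
# result == {'big': 1, 'data': 2, 'mining': 2, 'with': 1, 'spark': 1}
```

In Hadoop, the same map and reduce functions run in parallel across HDFS blocks on many nodes, with YARN scheduling the containers.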
During the course, I used the following languages:
- Scala - Spark SQL, Spark MLlib
- Java - Hadoop MapReduce, Spark SQL
- Python - PySpark, Dask
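The common pattern behind Dask (and PySpark) is splitting data into partitions, processing each partition independently on a worker, and combining the partial results. As a minimal standard-library analogue of that idea (this uses `concurrent.futures`, not Dask's own API, purely for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def partition_sum(chunk):
    # Work applied to each partition independently
    return sum(x * x for x in chunk)

data = list(range(1000))
# Split the data into four partitions of 250 elements each
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

# Fan the partitions out to worker threads, then combine the partial results
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partition_sum, chunks))

total = sum(partials)
# total == sum(x * x for x in range(1000)) == 332833500
```

Dask generalizes this scheme: the same partitioned computation can run on a thread pool, a process pool, or a distributed cluster without changing the user code.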