
bigdata-project

Prepare Environment

Clone project:

```
git clone git@github.com:ZakiChammaa/bigdata-project.git
```

Create and activate a virtual environment:
```
virtualenv -p python3 virtualenv
source virtualenv/bin/activate
```
Install the required libraries:
```
pip install -r requirements.txt
```

How to run

Build machine learning model:

You can build either the decision tree model or the random forest model. Note that you have to delete the model/ directory every time you want to build a new one.

To build the decision tree model:
```
python ml/decision_trees.py
```
To build the random forest model:
```
python ml/random_forest.py
```
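
Because a stale model/ directory has to be removed before each build, the clean-up and the build can be scripted together. The following is a minimal sketch and not part of the repository: the helper names `clear_model_dir` and `rebuild_model` are hypothetical, and it assumes the model/ directory sits in the directory you run the script from.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def clear_model_dir(model_dir="model"):
    """Delete a previously saved model directory; return True if one was removed."""
    path = Path(model_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


def rebuild_model(script="ml/decision_trees.py", model_dir="model"):
    """Remove the old model, then run one of the build scripts named above."""
    clear_model_dir(model_dir)
    # Assumption: this is run from the repository root, like the commands above.
    subprocess.run([sys.executable, script], check=True)
```

Calling `rebuild_model("ml/random_forest.py")` would then switch to the random forest model without a manual delete step.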
Stream the data and evaluate:

To stream the data, open two terminals.

On the first one, run the following:
```
python server.py localhost 9999
```
On the second terminal, run the following:
```
./virtualenv/bin/spark-submit streaming.py localhost 9999
```
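
To make the two-terminal setup concrete, here is a minimal sketch of what a line-streaming socket server can look like. It is not the repository's server.py: the function name `stream_lines`, the `delay` and `ready` parameters, and the one-record-per-line format are assumptions for illustration. Newline-delimited UTF-8 text is the format that socket text sources such as PySpark's `socketTextStream` read.

```python
import socket
import time


def stream_lines(lines, host="localhost", port=9999, delay=0.0, ready=None):
    """Serve a single client: send each line newline-terminated, then close."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        # Allow quick restarts on the same port.
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        if ready is not None:
            ready.set()  # hypothetical hook: signals that the server is listening
        conn, _ = srv.accept()  # blocks until the streaming job connects
        with conn:
            for line in lines:
                conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
                time.sleep(delay)  # optional pacing between records
```

The server has to be started first, since it blocks in `accept()` until the consumer on the second terminal connects to the same host and port.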
When the data is done streaming, kill the program and run the following to get the accuracy:
```
python test_accuracy.py
```
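
This README does not say how test_accuracy.py computes its result or what accuracy.txt contains. As a rough sketch only, assuming accuracy means the fraction of records whose predicted label equals the true label, the calculation looks like this (the function name `accuracy` and its list inputs are hypothetical):

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)
```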

Note that the data is already cleaned up and is available in the data folder. If you want to run the preprocessing script, run the following:

```
python data_preprocessing.py
```
