PyCIRAS (Python Code Insight and Repository Analysis System) is a tool for mining, analyzing, and visualizing data from Git repositories. It was developed as part of a research study and is intended to help researchers and developers extract insights from code repositories.
- Clone and manage multiple Git repositories.
- Extract metadata, unit testing data, and code quality metrics from repositories.
- Store raw data in an SQLite database or as JSON, and process it into CSV files.
- Interactive data analysis and visualization using JupyterLab Notebooks.
- Clone the repository:
git clone https://github.com/Majistaten/PyCIRAS
cd PyCIRAS
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the required dependencies:
pip install -r requirements.txt
- Python 3.10 or higher
- GitHub API token (optional, but needed for metadata and stargazer mining)
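The two prerequisites above can be checked before starting a long mining run. The following is a minimal sketch, not part of PyCIRAS itself; the environment variable name `GITHUB_TOKEN` is an assumption made for illustration, so check `.env.example` for the name the tool actually reads.

```python
import os
import sys

MIN_PYTHON = (3, 10)
# Hypothetical variable name, used only for this sketch.
TOKEN_VAR = "GITHUB_TOKEN"


def check_prerequisites(env, version_info=sys.version_info):
    """Report whether the Python version and the optional token look usable."""
    token = env.get(TOKEN_VAR, "").strip()
    return {
        "python_ok": tuple(version_info[:2]) >= MIN_PYTHON,
        "token_present": bool(token),
    }


report = check_prerequisites(os.environ)
if not report["python_ok"]:
    print("Python 3.10 or higher is required.")
if not report["token_present"]:
    print("No token found: metadata and stargazer mining may not work.")
```

Passing the environment in as a plain mapping keeps the check easy to exercise with different values.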
- Clone the reproduction package:
git clone https://github.com/Majistaten/PyCIRAS-reproduction-package
- Unzip the content of the reproduction package into the `out/data` folder and ensure that the name of the folder is `2024-03-29_11-30`.
- Start JupyterLab:
jupyter lab
- Open and interact with the provided notebook `notebooks/thesis.ipynb` to conduct the data analysis and visualization.
- Start JupyterLab:
jupyter lab
- Open the provided notebook `notebooks/DEMO.ipynb` to see a demonstration of the PyCIRAS tool.
- Start JupyterLab:
jupyter lab
- Define the repositories to mine by either:
  - Entering the repository URLs in the `repos.txt` file, one URL per line.
  - Creating a list of repository URLs in a Jupyter notebook. For example, change the content of the `repos` list in `notebooks/thesis.ipynb`.
- Fine-tune the mining process by adjusting the content in the `utility/config.py` file.
- Rename `.env.example` to `.env`.
- (optional) Enter your GitHub API token and NTFY credentials in the `.env` file.
- Clone the repositories:
import pyciras
# Change the parameters according to your needs.
pyciras.run_repo_cloner(
    repo_urls=None,  # replace with the list of repository URLs if not using repos.txt
    chunk_size=100,
    multiprocessing=True,
)
- Mine the repositories:
import pyciras
# Change the parameters according to your needs.
pyciras.run_mining(
    repo_urls=None,  # replace with the list of repository URLs if not using repos.txt
    chunk_size=1,
    multiprocessing=False,
    persist_repos=False,
    stargazers=True,
    metadata=True,
    test=True,
    git=True,
    lint=True,
)
- Create a new Jupyter notebook or make the needed changes in `notebooks/thesis.ipynb` and start analyzing the mined data.
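When building the `repo_urls` list in a notebook instead of relying on `repos.txt`, a small helper can turn a one-URL-per-line file into a clean list. The sketch below is an illustration only: `parse_repo_urls`, `load_repo_urls`, and `chunked` are hypothetical helpers, and the batching shown is just one plausible reading of what a `chunk_size` parameter does, not a description of how PyCIRAS implements it.

```python
from pathlib import Path


def parse_repo_urls(text):
    """Return the non-empty lines of a repos.txt-style string, without duplicates."""
    stripped = (line.strip() for line in text.splitlines())
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(line for line in stripped if line))


def load_repo_urls(path="repos.txt"):
    """Read a one-URL-per-line file into a list of URLs."""
    return parse_repo_urls(Path(path).read_text(encoding="utf-8"))


def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]
```

The resulting list could then be passed as `repo_urls` to `pyciras.run_repo_cloner` or `pyciras.run_mining` in place of `None`.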
This module is responsible for all I/O and data management. It consists of four submodules:
- `repo_management`: Handles cloning, storing, removing, and loading Git repositories for mining operations.
- `database_management`: Inserts raw data from mining operations into an SQLite database.
- `data_management`: Processes raw JSON data into a flat format and creates CSV files using the Pandas library.
- `database_models`: Contains the SQLAlchemy models for the SQLite database.
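To illustrate what processing raw JSON into a flat format means in practice, the sketch below flattens nested records into dotted column names and writes them as CSV. It deliberately uses only the standard library rather than Pandas, so it is not the `data_management` implementation; the record fields (`repo`, `lint`, `errors`, `warnings`) are made up for the example.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Collapse nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_csv(records):
    """Flatten each record and return the rows as CSV text."""
    rows = [flatten(record) for record in records]
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()


raw = [
    {"repo": "a/b", "lint": {"errors": 2, "warnings": 5}},
    {"repo": "c/d", "lint": {"errors": 0}},
]
print(to_csv(raw))
```

With Pandas, `pandas.json_normalize` performs a comparable flattening step before a `DataFrame` is written out with `to_csv`.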
This module is responsible for mining repositories and extracting data:
- `git_mining`: Provides metadata mining through GitHub's API and Git process mining with Pydriller.
- `test_mining`: Uses an abstract syntax tree traversal module to mine unit testing data.
- `lint_mining`: Mines code quality data through Pylint.
This module contains JupyterLab Notebooks, which combine Python code, equations, documentation, and visualizations. These notebooks facilitate interactive experimentation and data analysis using libraries like NumPy, SciPy, Matplotlib, and Seaborn.
The main notebook is `thesis.ipynb`, which is used to replicate the study's results. It contains the data analysis and visualization code for the study.
This module contains utility functions and configurations for the PyCIRAS tool:
- `config`: Contains configurations for the mining process, such as directory structures, ignore directories, logging, and result formats.
- `logger_setup`: Configures the logging system for the tool.
- `ntfyer`: Sends notifications to the user using the NTFY library.
- `progress_bars`: Provides progress bars for the mining process.
- `utils`: Contains utility functions for the tool.
- `timer`: Decorator for timing function execution times.
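A timing decorator of the kind described for `timer` typically looks like the sketch below. It is a generic example of the pattern, not the code in `utility`; the `last_elapsed` attribute and the printed message format are choices made for this illustration.

```python
import functools
import time


def timer(func):
    """Measure how long each call to func takes and print the result."""

    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if func raises, so failed runs are timed as well.
            wrapper.last_elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {wrapper.last_elapsed:.4f} s")

    wrapper.last_elapsed = None
    return wrapper


@timer
def add(a, b):
    return a + b


result = add(2, 3)
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which suits measuring short durations.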
The reproduction package for this study can be cloned from https://github.com/Majistaten/PyCIRAS-reproduction-package. This package includes the datasets and additional materials required to reproduce the study's results.