A tool for mining software repositories and creating replication packages using a Neo4j Graph Database. This tool aims to do 3 things:
- Reduce scripting effort for drilling repository data.
- Provide a simple interface for querying drilled data.
- Allow easy creation of a replication package which can be distributed.
This project is built with Docker and must be run with Docker:

- Copy the environment file from the example:

```bash
cp .env.example .env
```

- Start the Docker containers:

```bash
docker compose up -d
```

- Change the permissions of the import directory. This is necessary so that the backend can access the import directory:

```bash
chmod -R 775 volumes/neo4j_import/
```

- Navigate to http://localhost:5173/
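If the frontend does not load, you can check that all of the containers started correctly:

```bash
# List the Compose services and their current status
docker compose ps

# Follow the logs of all services to spot startup errors
docker compose logs -f
```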
The system is split into separate Docker containers that communicate with one another.
RabbitMQ is used to distribute drill jobs evenly between multiple instances of the driller workers.
This application can be used in 3 different ways, catering to different use cases:

- For readers who want to interact with a replication package that was created by a line of research.
- For researchers who wish to validate the findings of a study.
- For researchers who wish to start a new study from scratch with NeoRepro. In this case, the NeoRepro repository must be cloned and used as a template for the new project.
With the drilling YAML configuration you can extract data from a list of Git repositories. In the tool you will find an example drill configuration. The schema is made up of two primary sections:

- `defaults`: Object containing default values which are used for each drill job.
- `repositories`: List of repositories to be drilled and the configuration for each drill. If a configuration is set in the `defaults` section but not in the individual configuration, then the default is applied.
Here is a snippet of a configuration from the Mining Cost awareness case study (https://github.com/AndrewRutherfoord/cloud-cost-awareness-NeoRepro-reproduction):
```yaml
defaults:
  delete_clone: false
  index_file_modifications: true
  pydriller:
    to: "2022-05-30"
    only_modifications_with_file_types:
      - '.tf'
      - '.tf.json'
  filters:
    commit:
      - field: 'msg'
        value:
          - cheap
          - expens
          - cost
          - efficient
          - bill
          - pay
        method: 'contains'
repositories:
  - name: iks_vpc_lab
    url: https://github.com/ibm-cloud-architecture/iks_vpc_lab.git
    delete_clone: true
    pydriller:
      to: "2023-05-30"
  - name: cloud-platform-terraform-monitoring
    url: https://github.com/ministryofjustice/cloud-platform-terraform-monitoring.git
  - name: terraform-google-nat-gateway
    url: https://github.com/GoogleCloudPlatform/terraform-google-nat-gateway.git
```
The `defaults` object contains the following fields:

- `pydriller`: Object containing configurations for the pydriller `Repository` class. This application uses pydriller under the hood to drill the repositories. All options are explained in the pydriller documentation (https://pydriller.readthedocs.io/).
  - `since`: Date from which to start drilling. Format: `YYYY-MM-DD`
  - `to`: Date up to which commits should be drilled. Format: `YYYY-MM-DD`
  - `from_commit`: A commit hash from which to start drilling.
  - `to_commit`: A commit hash up to which commits should be drilled.
  - `from_tag`: A tag from which to start drilling.
  - `to_tag`: A tag up to which commits should be drilled.
  - `only_in_branch`: Name of the branch to be drilled.
  - `only_no_merge`: Boolean. If true, only commits that are not merge commits will be included.
  - `only_authors`: List of strings. Only commits by these authors will be included.
  - `only_commits`: List of strings. Commit hashes of the commits to be included.
  - `only_release`: Boolean. Only commits that are tagged as a release will be included.
  - `filepath`: Only commits that modify this file will be included.
  - `only_modifications_with_file_types`: List of strings. Only commits that modify files of these types will be included.
- `filters`: Object containing string filters.
  - `commit`: List of filters. (Shown below)
- `delete_clone`: Boolean. Indicates whether to delete the cloned repository after the drilling is complete.
- `index_file_modifications`: Boolean. Indicates whether to drill the modified files. If false, only the commits will be drilled.
- `index_file_diff`: Boolean. Indicates whether the file diffs should be indexed. If false, diffs won't be added to the database.
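For example, a minimal `defaults` block that drills only the non-merge commits between two tags on the main branch might look like this (the tag and branch names are illustrative):

```yaml
defaults:
  delete_clone: true
  index_file_modifications: true
  index_file_diff: false
  pydriller:
    # Drill only commits between these two tags (illustrative values)
    from_tag: "v1.0.0"
    to_tag: "v2.0.0"
    only_in_branch: "main"
    # Skip merge commits
    only_no_merge: true
```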
A filter contains the following fields:

- `field`: The field to be checked by the filter.
- `value`: A string or list of strings. The value(s) to be checked for by the filter. If a list, it behaves as an `OR` (the filter matches if the field contains any of the values).
- `method`: Can be one of the following:
  - `contains`: The value is contained in the field.
  - `!contains`: The value is not contained in the field.
  - `exact`: The value is equal to the field.
  - `!exact`: The value is not equal to the field.
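As a sketch of how these fields combine, the filter below keeps commits whose message contains any of a few cost-related terms while dropping automated merge messages; note that treating multiple filters as an `AND` is an assumption here:

```yaml
filters:
  commit:
    # Keep commits whose message contains any of these terms (OR within the list)
    - field: 'msg'
      value:
        - cost
        - bill
      method: 'contains'
    # Drop automatically generated merge messages
    - field: 'msg'
      value: 'Merge pull request'
      method: '!contains'
```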
Each repository can contain all of the fields from `defaults`, but must also contain the following fields:

- `name`: Name of the repository.
- `url`: HTTPS URL of the repository, used to clone it in case it isn't already cloned.

If any values are not provided for a repository, the values from `defaults` will be used.
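A minimal repository entry therefore needs only these two fields (the repository below is hypothetical); everything else falls back to `defaults`:

```yaml
repositories:
  - name: my-project
    url: https://github.com/example/my-project.git
```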
Once the drill configuration file has been composed on the frontend of NeoRepro, the `Execute` button on the Drill Configuration page can be clicked to commence the drilling of the repositories based on the configuration file.
To view the status of the drill jobs, navigate to the job status page of the frontend. Here you can see which repositories are currently being drilled, which are pending, which have failed, and which are complete.
The time taken to drill will vary widely depending on the configuration set earlier and the number of Driller Workers that are running. By default NeoRepro has 3 Driller Workers running, which will drill 3 repositories simultaneously. If your computer can handle it, you can increase this by changing the number of `replicas` in the Docker Compose file.
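As a sketch, assuming the worker service is called `driller-worker` in the Compose file (the actual service name may differ), the change looks like this:

```yaml
services:
  driller-worker:
    # ... existing service configuration ...
    deploy:
      replicas: 5  # default is 3; increase if your machine can handle it
```

Alternatively, recent versions of Docker Compose can scale a service at startup without editing the file: `docker compose up -d --scale driller-worker=5`.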
Once the drilling is complete, all of the repository data will be contained in the Neo4j Graph Database. The Cypher query language (https://neo4j.com/docs/cypher-manual/current/introduction/) can be used to query the data. There are some example queries pre-populated in NeoRepro.
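As a rough illustration, a query of the following shape would count drilled commits per repository. The node labels, relationship type, and property names below are assumptions made for the sketch, not NeoRepro's actual schema; check the pre-populated example queries for the real one:

```cypher
// Count drilled commits per repository (schema names are assumptions)
MATCH (r:Repository)-[:HAS_COMMIT]->(c:Commit)
RETURN r.name AS repository, count(c) AS commits
ORDER BY commits DESC
```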
To query the dataset, navigate to http://localhost:5173/query. Saved queries can be found on the left side. You can compose a new query by writing in the query box at the top of the page. Hitting execute will send the query to Neo4j, and the response will be shown just below the query box.
You can download the result of the query in JSON format.
Queries can also be executed through the Neo4j Browser interface that comes with the Neo4j Docker container. This can be accessed at http://localhost:7474/
At any point you can take a backup of the current state of the database. This will produce a `.cypher` file which can be used to restore the database to its current state. This is done on the Manage DB page (http://localhost:5173/manage-database). You can also reload a snapshot of the database that you have saved previously.
Warning! Loading a database snapshot will delete the current data in the database.
On the Manage DB page you can also delete all data from the database if you would like.
The Manage DB page also shows some general information about the current state of the database.