High-Level Computer Vision (HLCV)

This repository contains implementations of various assignments and a project related to the High-Level Computer Vision course taught by Prof. Dr. Bernt Schiele at Saarland University during the Summer Semester 2024. All course materials and assignment outlines belong to the course's instructors. No guarantee is given regarding the correctness of the assignment solutions or of any other code in this repository.

Table of Contents

  • Assignments
  • Project: From Strings to Sequences — Acoustic Guitar Chords Recognition

Assignments

Each assignment is organized in separate folders as outlined below:

  • Assignment 1: Exercises and code related to the first assignment.
    • Classical Machine Learning: image representations and histogram distances (grayvalue, rgb, rg, dxdy), object/neighbour identification via distance metrics (Chi^2, L2), and performance evaluation with a PR curve (see the sketch after this list).
    • UCI ML hand-written digits recognition with an SVM classifier.
  • Assignment 2: Exercises and code related to the second assignment.
    • A simple two-layer neural network and an SGD training algorithm based on back-propagation, implemented using only basic matrix operations.
    • A multi-layer perceptron in PyTorch with different layer configurations, using BatchNorm, Dropout, and EarlyStopping as regularization techniques.
    • Both models are trained, finetuned using GridSearchCV, and evaluated on the CIFAR-10 dataset.
  • Assignment 3: Exercises and code related to the third assignment.
    • Convolutional Neural Networks (CNNs) in PyTorch for image classification on the CIFAR-10 dataset, with filter visualization and Top-1/Top-5 accuracy evaluation.
    • BatchNorm, Dropout, data augmentation, and EarlyStopping as regularization techniques to improve the model's generalization.
    • Fine-tuning the pre-trained VGG_11_bn model on the CIFAR-10 dataset under various configurations (fine-tuning only the classifier layers, fine-tuning the entire model with pre-loaded weights, and training without pre-loaded weights).
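As a pointer to what the Assignment 1 code implements, here is a minimal NumPy sketch of the two histogram distances mentioned above (function names and the epsilon guard are our own; see the assignment folder for the actual implementation):

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms (smaller = more similar)."""
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def l2_distance(h1, h2):
    """Euclidean (L2) distance between two histograms."""
    return np.sqrt(np.sum((h1 - h2) ** 2))
```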

Project: From Strings to Sequences — Acoustic Guitar Chords Recognition

See the Project/reports/ folder for the complete LaTeX report.

Summary: This system automates chord recognition from acoustic guitar videos by detecting and classifying chords based on video input. It leverages YOLO and Faster R-CNN for fretboard detection, allowing the system to identify the position of the hand and fingers on the guitar neck. For chord classification, it utilizes Vision Transformers and DINOv2, which process visual cues to distinguish between different chords. Additionally, hand pose estimation was explored as a potential method for chord recognition. Finally, we extend the work of [Kristian et al., 2024] by exploring the potential of state-of-the-art deep learning models and techniques, with an additional proposal for an audio generation module.

Introduction

Automatic Chord Recognition (ACR) is an information retrieval task that automatically recognizes the chords played in a musical piece, whether from an audio or a video recording. The ability to accurately recognize and identify chords is crucial for various downstream applications such as music analysis, music transcription, or even the restoration of corrupted musical performances.

Our work aims to improve ACR in the context of acoustic guitars. We base our work on [Kristian et al., 2024] and extend it by exploring the YOLO [Redmon et al., 2016] and Faster R-CNN [Ren et al., 2016] families for fretboard (the neck of the guitar) detection, alongside ViT [Dosovitskiy et al., 2020] and DINOv2 [Oquab et al.] architectures for chord recognition.

Datasets

We identified a significant gap in available datasets for the task of guitar chord recognition. Initially, we created our own by recording 90-second videos for each chord in three different environments, ensuring high quality by capturing them in 4K resolution at 60 fps. We extracted the frames from the videos and downsampled them to a resolution of 640 × 360 pixels. This process generated approximately 30,000 frames per chord. To increase the diversity of the dataset, we used two different sampling methods: simple random sampling and kNN-based sampling. In the former, we selected 1,000 frames at random; in the latter, we used kNN to choose the 1,000 frames that were most distinct from one another.
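The report does not spell out the exact kNN selection criterion; one plausible reading, shown here as a hypothetical sketch, is to rank frames by their mean distance to their k nearest neighbours and keep the most isolated ones:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_diverse_sample(features, n_select=1000, k=5):
    """Pick the frames that are most distinct from their neighbours.

    features: (n_frames, d) array, e.g. flattened downsampled frames or embeddings.
    Frames with the largest mean distance to their k nearest neighbours are kept.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dists, _ = nn.kneighbors(features)      # column 0 is the frame itself (distance 0)
    isolation = dists[:, 1:].mean(axis=1)   # mean distance to the k actual neighbours
    return np.argsort(isolation)[::-1][:n_select]
```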

t-SNE plot of chords dataset

Figure 1: The t-SNE plot of our dataset containing 14 chords. Each point represents a kNN-sampled frame, with the color indicating the corresponding chord label.
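For reference, a plot like Figure 1 can be produced with scikit-learn's t-SNE; the features, perplexity, and colour handling below are our own choices, not necessarily those used for the figure:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, out_path="tsne_chords.png"):
    """2-D t-SNE embedding of frame features, coloured by chord label."""
    labels = np.asarray(labels)
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    for chord in np.unique(labels):
        mask = labels == chord
        plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=str(chord))
    plt.legend(markerscale=3, fontsize=6)
    plt.savefig(out_path, dpi=200)
```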

Unfortunately, both sampling strategies resulted in an overly simplistic dataset that failed to capture the real-world complexity of chords, as shown in Figure 1 above. This led to poor model generalization. However, rather than abandoning our dataset, we used it as a test set to evaluate the generalizability of our models. In the end, we decided to use existing datasets [1, 2, 3, 4] publicly available on Roboflow for training the models, merging them to create a more complex dataset, which led to significantly better results.

This change in our approach necessitated a change in the scope of our chord recognition task. As a consequence of using existing datasets, we were limited to only seven chords in total—A, B, C, D, E, F, and G—down from the 14 chords originally planned.

For the fretboard detection task, we used versions of the models pre-trained on the COCO dataset, a large object detection benchmark in which approximately 200,000 labeled images are organized into 80 distinct categories [Lin et al.] (somewhat comparable to ImageNet, but with a different emphasis regarding the types of objects). This is explained further in the Methods section of the report. To finetune them, we used the following publicly available Roboflow dataset by Hubert Drapeau: Guitar necks detector.
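For concreteness, fine-tuning a COCO-pretrained YOLO checkpoint on such a Roboflow export can look like the sketch below (the dataset YAML path and training hyperparameters are placeholders, not the values used in the project):

```python
from ultralytics import YOLO

# Start from a COCO-pretrained checkpoint (here YOLOv9-C, the family of our best model).
model = YOLO("yolov9c.pt")

# "guitar_necks.yaml" is a placeholder pointing to the Roboflow export
# (train/val image folders plus the class names).
model.train(data="guitar_necks.yaml", epochs=50, imgsz=640)
metrics = model.val()  # Precision, Recall, mAP50, mAP50-95 as reported in Table 1
```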

Fretboard Detection

We experimented with the YOLOv8 (m), YOLOv9 (c) and YOLOv10 (l) models¹, and from the Faster R-CNN family [Ren et al., 2016], we tried a ResNet-50-FPN backbone and a MobileNetV3-Large FPN backbone². Furthermore, we tried two different finetuning methods: freezing every layer and adding a classifier head for our new fretboard class, whose output is concatenated with the existing final layer output (from now on, models with "(FB)" next to the name), and not freezing any layer, i.e., finetuning the whole model. The two methods are fundamentally different and serve different purposes. The first finetunes the model for a specific task while keeping the backbone as it is, which preserves the previously learned features and classes. The second finetunes the whole model for the new task, while potentially forgetting the previously learned features and classes; in our finetuning process, it had the effect of leaving only two classes in the end: the fretboard class and the background.
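For the Faster R-CNN models, the frozen-backbone (FB) variant can be sketched as follows: all pre-trained parameters are frozen and a small head for the new fretboard class is added, its logits concatenated with the frozen COCO head's output. This is a hypothetical illustration of the idea, not the project's exact code.

```python
import torch
from torch import nn
import torchvision

# Load the COCO-pretrained detector and freeze every parameter.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
for p in model.parameters():
    p.requires_grad = False

class ExtendedPredictor(nn.Module):
    """Frozen COCO box predictor plus trainable logits for one extra class."""

    def __init__(self, original, in_features):
        super().__init__()
        self.original = original                   # frozen 91-way COCO head
        self.new_cls = nn.Linear(in_features, 1)   # fretboard class logit
        self.new_bbox = nn.Linear(in_features, 4)  # fretboard box deltas

    def forward(self, x):
        x = x.flatten(start_dim=1)
        cls_logits, bbox_deltas = self.original(x)
        cls_logits = torch.cat([cls_logits, self.new_cls(x)], dim=1)
        bbox_deltas = torch.cat([bbox_deltas, self.new_bbox(x)], dim=1)
        return cls_logits, bbox_deltas

in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = ExtendedPredictor(model.roi_heads.box_predictor, in_features)
# Only the newly added layers are trainable. The full fine-tuning variant instead replaces
# the predictor with a fresh 2-class head (background + fretboard) and trains all parameters.
```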

Guitar Chord Classification

We used two different approaches for guitar chord classification: a Hand Pose Estimation + Classifier approach and a Classifier Only approach.

Hand Pose Estimation + Classifier

First, we wanted to try a simple yet interesting approach. For a given sample image, we utilized a hand pose estimation model to extract the hand shape from it, which was then used as the input to a classifier. We used MediaPipe to extract the hand shape followed by different classifiers — SVM, Random Forest, and a simple MLP — to classify the chords.
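A minimal sketch of this pipeline, assuming the MediaPipe Hands solution and the SVM hyperparameter reported in Table 3 (preprocessing details such as normalization and handling of missed detections are our own simplifications):

```python
import cv2
import numpy as np
import mediapipe as mp
from sklearn.svm import SVC

def hand_landmarks(image_bgr):
    """Return a flat (21 * 3,) landmark vector for the first detected hand, or None."""
    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    points = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in points]).flatten()

# The landmark vectors become the features of a classical classifier,
# e.g. the SVM from Table 3 (C = 300):
# X_train, y_train = ...  # landmark vectors and chord labels
# clf = SVC(C=300).fit(X_train, y_train)
```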

Hand Pose Estimation + Classifier pipeline

Figure 2: The pipeline of the Hand Pose Estimation + Classifier approach. First, the image is passed through a hand pose estimation model to extract the landmarks. Then the result is passed through a classifier to determine the chord being played.

Classifier only approach

Next, we wanted to explore the potential benefits of using more advanced architectures for chord classification. We decided to experiment with Vision Transformers (ViT) [Dosovitskiy et al., 2020], specifically ViT-B/16, ViT-B/32, ViT-L/16, and ViT-L/32, available on Hugging Face, to assess how different patch sizes and model sizes affect performance. Additionally, we were interested in evaluating the effectiveness of pre-trained self-supervised models on our task, so we also included DINOv2 [Oquab et al.] in our experiments. This allowed us to compare its performance against the ViT models and explore whether self-supervised learning offers advantages in this task.
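As an illustration, loading one of these ViT variants from Hugging Face with a 7-chord classification head can look like this (the checkpoint name and label handling are assumptions for the sketch):

```python
from transformers import ViTImageProcessor, ViTForImageClassification

chords = ["A", "B", "C", "D", "E", "F", "G"]
checkpoint = "google/vit-base-patch16-224-in21k"  # ViT-B/16; other variants swap the checkpoint

processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(chords),
    id2label={i: c for i, c in enumerate(chords)},
    label2id={c: i for i, c in enumerate(chords)},
)

# inputs = processor(images=pil_image, return_tensors="pt")
# pred = chords[model(**inputs).logits.argmax(-1).item()]
```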

Experimental Results and Analyses

The table below shows the performance metrics of the different models tested on the finetuning dataset (Guitar necks detector), and the following figure shows Recall vs. mAP@50 for the models tested and finetuned on the fretboard class, while also showcasing the number of parameters. Naturally, the models finetuned with a Frozen Backbone (FB) performed slightly worse than the models finetuned without one; this was expected, since the latter could adapt all layers to the new task, while the former only trained a smaller classifier head. Since we wanted to retain the ability to recognize the other 80 valuable classes from the COCO dataset, we chose a model from the (FB) list, YOLOv9 (FB), as the best model for our task. It obtained the highest precision among the (FB) models and, after re-evaluation on the COCO dataset plus the fretboard class, gave the best results in terms of confusion matrix and Precision-Recall curve.

Model P R mAP50-95 mAP50
YOLOv8 98.9% 93.0% 88.7% 98.2%
YOLOv9 96.4% 96.8% 85.3% 97.8%
YOLOv10 94.2% 87.0% 80.0% 94.4%
Faster-RCNN-Resnet50 80.8% 82.4% 77.5% 94.0%
Faster-RCNN-MobileNetv3 79.4% 81.6% 75.7% 94.9%
YOLOv8 (FB) 76.7% 85.1% 53.4% 87.8%
YOLOv9 (FB) 82.4% 74.7% 54.7% 87.0%
YOLOv10 (FB) 81.4% 84.0% 71.2% 89.9%
Faster-RCNN-Resnet50 (FB) 62.9% 66.3% 59.0% 93.4%
Faster-RCNN-MobileNetv3 (FB) 71.7% 73.6% 68.3% 93.0%

Table 1: Performance metrics of different models on the evaluation dataset, shown in percentages. Each column represents a specific metric: Precision, Recall, mAP50-95, and mAP50. (FB) denotes models fine-tuned with a Frozen Backbone.

Recall vs mAP@50

Figure 3: Recall vs. mAP@50 for the models tested and finetuned on the fretboard class.

Since our YOLOv9 model did not lose its capability to detect the original 80 classes from the COCO dataset, we decided to re-evaluate its performance on the whole COCO dataset to quantify how much the finetuning process affected the original pre-trained model's performance. The results are shown below, where positive values are desirable for diagonal entries (indicating correct classifications), and negative values are preferred for off-diagonal entries (indicating reduced misclassifications).

Confusion matrix of YOLOv9c on the COCO dataset (subset shown for better visualization of the fretboard class)
Difference between the finetuned YOLOv9c and the baseline on the COCO dataset (subset shown for better visualization of the fretboard class)

The previous illustrations show that the accuracy on the fretboard class is ~91%. Moreover, in the difference confusion matrix (finetuned model minus original model), there are not many red entries on the diagonal, and actually some green entries off the diagonal.

Confusion matrix position Positive Negative
Diagonal 1.41% 5.38%
Off-diagonal 4.10% 21.11%

Table 2: Absolute sums of values (as %) after taking the element-wise difference between the final confusion matrix obtained after fine-tuning the YOLOv9 model for our fretboard class and the original confusion matrix of the pre-trained version on the COCO dataset. These values indicate that, for the diagonal entries where the difference was positive, the model improved by 1.41%, while for the off-diagonal entries where the difference was negative, the model improved by 21.11%.
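The quantities in Table 2 can be computed as in the following sketch, assuming both confusion matrices are normalized so that their entries are fractions (the exact normalization used for the table is not stated):

```python
import numpy as np

def confusion_delta_summary(cm_finetuned, cm_baseline):
    """Absolute sums (as %) of positive/negative changes, split by diagonal position."""
    delta = cm_finetuned - cm_baseline
    diag = np.eye(delta.shape[0], dtype=bool)
    summary = {}
    for name, mask in [("diagonal", diag), ("off-diagonal", ~diag)]:
        summary[name] = {
            "positive": 100 * delta[mask & (delta > 0)].sum(),
            "negative": 100 * abs(delta[mask & (delta < 0)].sum()),
        }
    return summary
```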

Some qualitative results are shown below, comparing the original YOLOv9 prediction with the finetuned model and the model with a frozen backbone + classifier layer.

Qualitative results: original YOLOv9 prediction vs. full finetuning vs. frozen backbone + classifier layer, shown on an image from the COCO dataset and one from the Penn-Fudan dataset.

To evaluate our approach against those in the original paper, we implemented the InceptionResNetv2 model as described by the authors. After training the model using the hyperparameters provided by [Kristian et al., 2024] on our dataset, we obtained the results shown in Table 3, which provided us with a baseline to compare our models against. Surprisingly, this approach performed well, achieving good accuracy during validation and testing on two datasets. However, the model struggled to generalize to the third dataset, which was created by us. This outcome was anticipated, as the samples in our dataset were out of the training distribution, and the model lacked the complexity needed to generalize to such data.

Model GC GCT GCO
InceptionResNetv2 83.56% 68.63% 15.57%
SVM 95.27% 85.71% 18.61%
Random Forest 93.35% 52.41% 16.16%
MLP 89.44% 78.57% 14.39%

Table 3: Accuracy of the Hand Pose Estimation + Classifier in the test set of different datasets. The following parameters were used: SVM (C = 300), Random Forest (n_estimators = 200), and MLP (hidden_layer_sizes = (100, 256, 100)). Datasets used: GC: Guitar_Chords, GCT: Guitar_Chords_Tiny, GCO: Guitar_Chords_Ours.

Classifier only approach

To address this limitation of the previous approach, we decided to explore more complex models, such as Vision Transformers and DINOv2, both of which are available on Hugging Face. The results of our experiments are summarized below:

Model GC GCT GCO
InceptionResNetv2 83.56% 68.63% 15.57%
ViT-B/16 98.96% 85.29% 96.24%
ViT-B/32 93.07% 81.37% 95.83%
ViT-L/16 95.84% 81.37% 12.29%
ViT-L/32 77.03% 43.14% 13.43%
DINOv2-S 96.24% 88.24% 98.18%
DINOv2-L 96.44% 91.18% 97.92%

Table 4: Accuracy of the Classifier-only approach on the test set of different datasets.
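For the DINOv2 rows of Table 4, a minimal sketch of wrapping a Hugging Face DINOv2 backbone with a linear chord head (the checkpoint name and the use of the CLS token are our assumptions):

```python
import torch
from torch import nn
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small")
backbone = AutoModel.from_pretrained("facebook/dinov2-small")

class Dinov2ChordClassifier(nn.Module):
    """DINOv2 backbone + linear head over the CLS token."""

    def __init__(self, backbone, num_classes=7):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        cls_token = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return self.head(cls_token)

model = Dinov2ChordClassifier(backbone)
# inputs = processor(images=pil_image, return_tensors="pt")
# logits = model(inputs["pixel_values"])
```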

F1 Score plot

Figure 4: F1 score of the models tested on the chord classification task.

ViT models show varying performance across the datasets. The base models perform exceptionally well, with high accuracy on all datasets, but the larger models do not exhibit the same performance. We argue that this is because the available data is not sufficient to train the large versions of the models effectively. We can also observe that the patch-16 versions of the ViT models perform better than the patch-32 versions, likely because the smaller patches give a finer spatial resolution, which is important for accurately distinguishing between different hand positions.

Moreover, both DINOv2 variants demonstrated strong and consistent performance across all datasets. The DINOv2-L model, in particular, achieved the highest accuracy on the Guitar_Chords_Ours dataset, slightly outperforming the small variant. The superior performance of DINOv2 can be attributed to its self-supervised learning approach. Unlike models pre-trained on ImageNet, which does not contain a specific class for hands, DINOv2's self-supervised pre-training enables it to learn more generic and transferable representations, leading to better generalization in our task. This enhanced generalization is further supported by attention visualizations of the model applied to images from the Guitar_Chords_Ours dataset, where the model correctly focuses on the hand performing the fretting, as evidenced by Figure 5 and Figure 6.

Occlusion in untrained model

Occlusion in trained model

Figure 5: Occlusion-based attribution [Kokhlikyan et al., 2020] for model interpretability on a 74 × 389 input image using a stride of 8 and a sliding window of 30 × 30, using Captum. Top: Untrained DINOv2 model. Bottom: Our DINOv2 model.

Occlusion in trained model - full picture

Figure 6: Our DINOv2 model on a 360 × 640 input image using a stride of 20 and a sliding window of 60 × 60.
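The attribution maps in Figures 5 and 6 can be reproduced with Captum's Occlusion module roughly as follows (the channel-dimension window and stride are our assumption; the spatial values follow the captions):

```python
import torch
from captum.attr import Occlusion

def occlusion_map(model, image, target_class, window=30, stride=8):
    """Occlusion-based attribution for a (1, 3, H, W) image tensor, as in Figure 5."""
    model.eval()
    occlusion = Occlusion(model)
    return occlusion.attribute(
        image,
        target=target_class,
        sliding_window_shapes=(3, window, window),
        strides=(3, stride, stride),
    )

# Figure 6 settings: occlusion_map(model, image, target_class, window=60, stride=20)
```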

Overall, our proposed models outperformed the InceptionResNetv2 model, achieving higher accuracy across all datasets. This demonstrates the potential of using more advanced models for chord classification tasks.

Conclusion

Throughout our work, we have explored different models and techniques to improve the performance of guitar chord recognition. We showed that a pre-trained self-supervised model such as DINOv2 can provide better generalization than models pre-trained on ImageNet, thanks to its ability to learn more generic and transferable representations. In addition, using more complex classification models can also make the fretboard detection model obsolete, as shown in the occlusion-based attribution visualizations (Figure 5 and Figure 6), where the model learned to focus on the fretting hand despite being trained only on cropped images. However, this needs further investigation and more data to be confirmed, as the majority of the existing data consists of cropped views of the fretboard.

Unfortunately, we were unable to achieve satisfactory sound quality from our proposed pipeline. Our approach was rather simplistic and did not take into account the complexity of the sound generation process. Moving forward, we propose implementing more advanced audio processing techniques, such as those used in [Su et al.] or [Li et al.], while also improving the synchronization aspect. These improvements would enable the generation of a more true-to-life sound and therefore achieve our final goal of creating an end-to-end pipeline for recognizing and reconstructing the sound from a silent video of someone playing the guitar.

Footnotes

  1. To keep the parameter counts similar, we chose a different model size for each version: m (25.9M), c (25.3M), and l (24.4M).

  2. To finetune the Faster R-CNN models easily, we used the Faster R-CNN PyTorch Training Pipeline by Sovit Ranjan Rath.
