
Improve bounding box class performance #525 (Open)

Abdoulaye-Sayouti opened this issue Dec 22, 2024 · 8 comments

@Abdoulaye-Sayouti

RT-DETR v2 Training Issues with Custom Dataset

I'm currently training RT-DETR v2 (PyTorch implementation) on a custom dataset. The model localizes bounding boxes accurately, but its class identification performance is suboptimal.

Questions

1. Class Performance Emphasis

Is there a way to adjust the training process to put more emphasis on classification performance?

2. Separate Classification Model

I noticed there's a dedicated classification task in the codebase:

TASKS: Dict[str, BaseSolver] = {
    'classification': ClasSolver,
    'detection': DetSolver,
}

Would training a separate classification model improve the overall performance?

3. Performance Improvement

What are some recommended approaches to improve the model's class identification accuracy?

@lyuwenyu
Owner

lyuwenyu commented Dec 23, 2024

  1. You can modify the loss_vfl weight.

    weight_dict: {loss_vfl: 1, loss_bbox: 5, loss_giou: 2,}

  2. No, they are for two separate tasks.

  3. See point 1.
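
For example (a sketch; the RTDETRCriterionv2 key name and the exact file vary by model config, so verify against your own YAML), raising the classification term relative to the box terms could look like:

# criterion section of the rtdetrv2 config (sketch)
RTDETRCriterionv2:
  weight_dict: {loss_vfl: 2, loss_bbox: 5, loss_giou: 2}  # loss_vfl raised from 1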

@Abdoulaye-Sayouti
Author

Abdoulaye-Sayouti commented Dec 24, 2024

Thanks for your reply.

I'm thinking of using DINOv2 as the backbone. Would that be easy to do?
If so, which files would I need to modify?

Thanks again

@lyuwenyu
Owner

Yes, you just need to register the new backbone using @register()

see details https://github.com/lyuwenyu/RT-DETR/blob/main/rtdetrv2_pytorch/src/nn/backbone/hgnetv2.py#L272


Then replace the old one with your registered module name in the config:

https://github.com/lyuwenyu/RT-DETR/blob/main/rtdetrv2_pytorch/configs/rtdetrv2/include/rtdetrv2_r50vd.yml#L13
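
A minimal sketch of what such a registered backbone could look like (MyBackbone and its layers are hypothetical; the @register() decorator and the multi-scale output contract follow the hgnetv2.py pattern linked above, and the import path assumes the rtdetrv2_pytorch source layout):

import torch.nn as nn

from ...core import register  # same decorator hgnetv2.py registers with


@register()
class MyBackbone(nn.Module):  # hypothetical name; reference it in the YAML config
    """Toy backbone showing the contract HybridEncoder relies on:
    a list of feature maps at strides 8/16/32 whose channel widths
    match the encoder's in_channels."""

    def __init__(self, out_channels=(512, 1024, 2048)):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 3, stride=4, padding=1)   # stride 4
        chans = (64,) + tuple(out_channels)
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)     # halves resolution per stage
            for c_in, c_out in zip(chans[:-1], chans[1:])
        )

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # strides 8, 16, 32
        return feats

After registering, the YAML change is just swapping the backbone entry, e.g. backbone: MyBackbone.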

@Abdoulaye-Sayouti
Author

Great, thanks, I will try that.
In the meantime I tried the pre-defined TimmModel and HGNetv2 backbones, without success.

Implementation Issues with TimmModel and HGNetv2 Backbones

HGNetv2 Implementation

Configuration

# rtdetrv2_r50vd.yml
RTDETR:
  backbone: HGNetv2
  encoder: HybridEncoder
  decoder: RTDETRTransformerv2

# rtdetrv2_r18vd_120e_coco.yml
HGNetv2:
  name: L

Error

RuntimeError: Given groups=1, weight of size [256, 128, 1, 1], expected input[16, 512, 80, 80] 
to have 128 channels, but got 512 channels instead

Error location: hybrid_encoder.py, line 294

TimmModel Implementation

Configuration

# rtdetrv2_r50vd.yml
RTDETR:
  backbone: TimmModel
  encoder: HybridEncoder
  decoder: RTDETRTransformerv2

# rtdetrv2_r18vd_120e_coco.yml
TimmModel:
  name: resnet34
  return_layers: ['layer2', 'layer4']

Error

AssertionError: assert len(feats) == len(self.in_channels)

Error location: hybrid_encoder.py, line 293

The assertion suggests a mismatch between the number of feature maps the backbone returns and the number of entries in the encoder's in_channels.

Could you help me resolve these issues, particularly the TimmModel one?

@lyuwenyu
Owner

And this line should be adapted to the specific backbone.
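
Both errors above come from that contract: the backbone must return exactly as many feature maps as the encoder's in_channels has entries, with matching channel widths. A sketch for the TimmModel case (assuming timm's resnet34, whose layer2/layer3/layer4 widths are 128/256/512):

# rtdetrv2_r18vd_120e_coco.yml (sketch)
TimmModel:
  name: resnet34
  return_layers: ['layer2', 'layer3', 'layer4']

HybridEncoder:
  in_channels: [128, 256, 512]  # must equal the widths the backbone emits

The HGNetv2 RuntimeError is the same mismatch, in channel widths rather than in count.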

@Abdoulaye-Sayouti
Author

ViT and HybridEncoder Compatibility Analysis

Thanks, it finally worked. I then tried a Vision Transformer (ViT) backbone via TimmModel, but its output does not seem compatible with the HybridEncoder's expectations.

Here is the summary of what I understood:

HybridEncoder Expectations

  • It expects inputs in CNN format (batch_size, channels, height, width)
  • Default in_channels=[512, 1024, 2048] (typical ResNet feature map channels)
  • Input features should have decreasing spatial dimensions with feat_strides=[8, 16, 32]

ViT Last 3 Layers Output

  • Shape: (batch_size, N_patches, 768)
  • No explicit spatial structure
  • Constant channel dimension (768)
  • All layers have same dimensions

Mismatch Issues

1. Dimensional Structure

  • HybridEncoder expects 4D tensors (B,C,H,W)
  • ViT outputs 3D tensors (B,N,D)

2. Channel Progression

  • HybridEncoder expects increasing channels (512->1024->2048)
  • ViT has constant channels (768)

3. Spatial Resolution

  • HybridEncoder expects decreasing spatial dimensions
  • ViT maintains constant number of patches

I'm trying to adapt the ViT outputs, but I think the adaptation might not be optimal because:

1. ViT's strength lies in global attention

2. Forcing a spatial structure might lose the global relationship information

3. ResNet's feature hierarchy is fundamentally different from ViT's feature representation

Can you please confirm that? And is there a way to make them compatible?

Thanks a lot!

@lyuwenyu
Owner

lyuwenyu commented Dec 26, 2024

Yes, I think you are right.

One possible solution is to add an extra adaptation module. You can reference this paper.
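
For illustration, a minimal sketch of such an adaptation module (hypothetical, not from this repo): it assumes a square patch grid and builds a three-level pyramid from single-scale ViT tokens by resampling, then projects to the channel widths HybridEncoder expects.

import math
import torch
import torch.nn as nn


class ViTToPyramid(nn.Module):  # hypothetical adapter, not part of RT-DETR
    """Turns single-scale ViT tokens (B, N, D) into 3 CNN-style maps."""

    def __init__(self, embed_dim=768, out_channels=(512, 1024, 2048)):
        super().__init__()
        # one resampler per pyramid level: upsample x2, identity, downsample x2
        self.resample = nn.ModuleList([
            nn.ConvTranspose2d(embed_dim, embed_dim, 2, stride=2),
            nn.Identity(),
            nn.MaxPool2d(2),
        ])
        # 1x1 projections to the encoder's expected channel widths
        self.proj = nn.ModuleList(
            nn.Conv2d(embed_dim, c, 1) for c in out_channels
        )

    def forward(self, tokens):
        B, N, D = tokens.shape
        side = math.isqrt(N)                          # assumes a square patch grid
        x = tokens.transpose(1, 2).reshape(B, D, side, side)
        return [p(r(x)) for r, p in zip(self.resample, self.proj)]


# Quick shape check: a patch-16 ViT at 640x640 gives a 40x40 grid (1600 tokens)
feats = ViTToPyramid()(torch.randn(2, 1600, 768))
print([tuple(f.shape) for f in feats])  # (2,512,80,80), (2,1024,40,40), (2,2048,20,20)

Whether resampling like this loses too much of the global context is exactly the trade-off the adaptation-module approach tries to address.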

@Abdoulaye-Sayouti
Author

OK, thanks very much. I will check it.
