Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update datumaro #8923

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from
Open

update datumaro #8923

wants to merge 8 commits into from

Conversation

Eldies
Copy link
Contributor

@Eldies Eldies commented Jan 10, 2025

Motivation and context

depends on cvat-ai/datumaro#77

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)
  • I have increased versions of npm packages if it is necessary
    (cvat-canvas,
    cvat-core,
    cvat-data and
    cvat-ui)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

Summary by CodeRabbit

  • Dependency Updates

    • Updated datumaro package to a new commit
    • Added multiple new packages including filelock, huggingface-hub, openvino, and scikit-learn
    • Updated versions of orjson, protobuf, and other existing dependencies
  • Code Structure Changes

    • Refactored import statements across multiple dataset manager format files
    • Modified class inheritance in dataset extraction components
    • Moved ItemTransform and DatasetItem imports to different modules

Copy link
Contributor

coderabbitai bot commented Jan 10, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The pull request involves significant updates to the CVAT dataset management system, primarily focusing on restructuring import statements and class inheritance. The changes center around modifications in the datumaro library, with updates to how extractors, transformers, and dataset components are imported and utilized. The modifications span multiple files in the dataset manager formats, updating class hierarchies and import sources to align with new library structures. A corresponding update to the requirements files reflects these changes, introducing several new dependencies and updating existing package versions.

Changes

File Change Summary
cvat/apps/dataset_manager/bindings.py Updated class inheritance for CvatTaskOrJobDataExtractor and CVATProjectDataExtractor from SourceExtractor/Extractor to SubsetBase/DatasetBase
cvat/apps/dataset_manager/formats/cvat.py Modified CvatExtractor to inherit from DatasetBase and updated import statements
cvat/apps/dataset_manager/formats/*.py Standardized ItemTransform import from datumaro.components.transformer in multiple format files (ICDAR, Market1501, MOTS, Velodynepoint)
cvat/apps/dataset_manager/formats/yolo.py Updated DatasetItem import to datumaro.components.dataset_base
cvat/requirements/base.in Updated datumaro package commit hash
cvat/requirements/base.txt Updated datumaro commit hash, added multiple new packages, and adjusted package versions

Poem

🐰 Hopping through code with glee,
Datumaro's dance, a library spree!
Extractors transform, classes realign,
Dependencies bloom like a rabbit's design 🌱
Code evolution, smooth and bright! 🌟


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
cvat/apps/dataset_manager/formats/cvat.py (1)

Line range hint 1642-1642: Implement __iter__ and __len__ methods for CvatTaskOrJobDataExtractor.

Since CvatTaskOrJobDataExtractor now inherits from dm.SubsetBase, it should implement the __iter__ and __len__ methods to function properly as a dataset extractor. Currently, these methods are missing, which may lead to errors or unexpected behavior.

Apply this diff to add the missing methods:

+    def __iter__(self):
+        yield from self._items
+
+    def __len__(self):
+        return len(self._items)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 43cc1f3 and aa4d236.

📒 Files selected for processing (9)
  • cvat/apps/dataset_manager/bindings.py (4 hunks)
  • cvat/apps/dataset_manager/formats/cvat.py (2 hunks)
  • cvat/apps/dataset_manager/formats/icdar.py (1 hunks)
  • cvat/apps/dataset_manager/formats/market1501.py (1 hunks)
  • cvat/apps/dataset_manager/formats/mots.py (1 hunks)
  • cvat/apps/dataset_manager/formats/velodynepoint.py (1 hunks)
  • cvat/apps/dataset_manager/formats/yolo.py (1 hunks)
  • cvat/requirements/base.in (1 hunks)
  • cvat/requirements/base.txt (13 hunks)
✅ Files skipped from review due to trivial changes (5)
  • cvat/apps/dataset_manager/formats/mots.py
  • cvat/apps/dataset_manager/formats/velodynepoint.py
  • cvat/apps/dataset_manager/formats/market1501.py
  • cvat/apps/dataset_manager/formats/yolo.py
  • cvat/apps/dataset_manager/formats/icdar.py
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: build
  • GitHub Check: generate_github_pages
  • GitHub Check: testing
  • GitHub Check: Analyze (python)
  • GitHub Check: Linter
  • GitHub Check: Analyze (javascript)
🔇 Additional comments (10)
cvat/apps/dataset_manager/formats/cvat.py (5)

25-26: Verify the correctness of the new import statements.

The added import of DEFAULT_SUBSET_NAME and DatasetBase from datumaro.components.dataset_base should be checked to ensure they are necessary and correctly used in the code below.


54-54: Ensure compatibility with the new base class DatasetBase.

Changing the base class of CvatExtractor to DatasetBase may require implementing or overriding specific methods expected by DatasetBase. Please verify that all necessary methods are correctly implemented to prevent potential errors.


Line range hint 1653-1655: Confirm correct initialization of dm.SubsetBase.

Ensure that the parameters media_type and subset are correctly passed when initializing dm.SubsetBase. They should align with the expected values for different dimensions to avoid potential issues.


Line range hint 1751-1751: Ensure CVATProjectDataExtractor is compatible with dm.DatasetBase.

The class CVATProjectDataExtractor now inherits from dm.DatasetBase. Verify that all required methods and properties are implemented appropriately, and any differences from the previous base class dm.Extractor are properly handled.


Line range hint 1761-1762: Confirm correct initialization of dm.DatasetBase.

Ensure that the media_type parameter is correctly set when initializing dm.DatasetBase, matching the dimension specified (dm.Image for 2D data and dm.PointCloud for 3D data).

cvat/requirements/base.in (1)

15-15: Verify the updated datumaro dependency commit hash.

Ensure that the new commit hash dad90703c8db9e7d7e692250ea8caf8af233c9a0 for the datumaro package corresponds to the intended version and does not introduce any breaking changes or compatibility issues.

You can run the following script to check the commit message and changes associated with the new commit hash:

✅ Verification successful

The updated datumaro dependency commit is safe to use

The commit dad90703 only removes the tritonclient[all] dependency due to its unsafe transitive dependency on setuptools. This is a clean, well-documented change that doesn't introduce any breaking modifications to the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Retrieve the commit message and diff for the specified Datumaro commit.

# Fetch the commit message
git clone https://github.com/cvat-ai/datumaro.git temp_datumaro_repo
cd temp_datumaro_repo
git show --name-status dad90703c8db9e7d7e692250ea8caf8af233c9a0

# Clean up
cd ..
rm -rf temp_datumaro_repo

Length of output: 592


Script:

#!/bin/bash
# Get the detailed changes and check if this is the latest commit
git clone https://github.com/cvat-ai/datumaro.git temp_datumaro_repo
cd temp_datumaro_repo
# Show the actual changes in the file
git show dad90703c8db9e7d7e692250ea8caf8af233c9a0 -- requirements-core.txt
# Check if this is the latest commit
git rev-parse HEAD
cd ..
rm -rf temp_datumaro_repo

Length of output: 935

cvat/apps/dataset_manager/bindings.py (1)

1751-1751: CVATProjectDataExtractor changes look good.

The class CVATProjectDataExtractor correctly inherits from dm.DatasetBase, and necessary methods like __iter__, __len__, and categories are properly implemented.

cvat/requirements/base.txt (3)

155-156: Verify compatibility of new ML dependencies.

Multiple ML-related packages have been added (huggingface-hub, nltk, openvino, scikit-learn, tokenizers). These additions suggest significant new ML capabilities in datumaro.

Could you clarify:

  1. Are these new dependencies required for all CVAT installations or only for specific features?
  2. What is the impact on system requirements (memory, CPU, GPU)?

Consider:

  1. Making ML-related dependencies optional through extras_require in setup.py
  2. Documenting system requirements for ML features

Also applies to: 217-218, 221-224, 363-368, 387-388


155-156: Review security implications of new dependencies.

New packages with network capabilities (huggingface-hub, openvino, pyarrow) have been added. These could potentially impact the security posture of the application.

Consider:

  1. Documenting network access requirements for new dependencies
  2. Implementing appropriate network security controls
  3. Adding these packages to security scanning workflows

Run security check:

Also applies to: 221-224, 263-264

✅ Verification successful

Security check passed for new dependencies

The security scan shows no known vulnerabilities in the specified versions of huggingface-hub (0.27.1), openvino, and pyarrow (18.1.0). However, since these packages have network capabilities, the architectural recommendations about documenting network access requirements and implementing appropriate security controls remain valid.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for known vulnerabilities in new dependencies

# Check for security advisories for new packages
for pkg in "huggingface-hub" "openvino" "pyarrow"; do
  echo "Checking $pkg..."
  gh api graphql -f query="
  {
    securityVulnerabilities(first: 5, ecosystem: PIP, package: \"$pkg\") {
      nodes {
        advisory {
          summary
          severity
          publishedAt
        }
        vulnerableVersionRange
        firstPatchedVersion {
          identifier
        }
      }
    }
  }"
done

Length of output: 2139


Script:

#!/bin/bash
# Check pyarrow version in requirements
rg "^pyarrow==" cvat/requirements/

Length of output: 79


227-228: Verify intentional version downgrades.

The orjson package has been downgraded from 3.10.13 to 3.10.12. While protobuf has been upgraded, we should verify if the orjson downgrade is intentional.

Also applies to: 251-258

cvat/apps/dataset_manager/bindings.py Show resolved Hide resolved
# pyjwt
cycler==0.12.1
# via matplotlib
datumaro @ git+https://github.com/cvat-ai/datumaro.git@08e77b216080555a57e12c01625be8c8201e3131
datumaro @ git+https://github.com/cvat-ai/datumaro.git@dad90703c8db9e7d7e692250ea8caf8af233c9a0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Consider using a stable release version instead of a development branch commit

The specified commit dad90703 is from a development branch dl/sync-with-upstream-5 and not associated with any release tag. The repository has stable releases available up to v1.9.0. Using a commit from a development branch may lead to stability issues.

🔗 Analysis chain

Verify the referenced datumaro commit.

The datumaro package is being updated to a new version. Let's verify this commit exists and is stable.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the referenced datumaro commit exists and is the latest stable version

# Check if the commit exists in the repository
gh repo clone cvat-ai/datumaro /tmp/datumaro
cd /tmp/datumaro
if git cat-file -e dad90703c8db9e7d7e692250ea8caf8af233c9a0 2>/dev/null; then
    echo "Commit exists"
    # Show commit message and date to verify it's recent and stable
    git log -1 dad90703c8db9e7d7e692250ea8caf8af233c9a0 --pretty=format:"%h %ad %s"
else
    echo "Commit not found"
fi

Length of output: 2314


Script:

#!/bin/bash
# Get detailed information about the commit
cd /tmp/datumaro
echo "=== Commit Details ==="
git log -1 dad90703c8db9e7d7e692250ea8caf8af233c9a0 --pretty=format:'%h %ai %s'

echo -e "\n\n=== Latest Tag ==="
git describe --tags --abbrev=0

echo -e "\n\n=== Relationship to Latest Tag ==="
git describe --tags dad90703c8db9e7d7e692250ea8caf8af233c9a0

Length of output: 337


Script:

#!/bin/bash
cd /tmp/datumaro

echo "=== All Tags (sorted) ==="
git tag -l | sort -V

echo -e "\n=== Branch Information ==="
git branch -r --contains dad90703c8db9e7d7e692250ea8caf8af233c9a0

echo -e "\n=== Commit Context ==="
git log --oneline -n 5 dad90703c8db9e7d7e692250ea8caf8af233c9a0

Length of output: 775

@Eldies Eldies requested a review from azhavoro as a code owner January 10, 2025 16:53
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.83%. Comparing base (411f8e7) to head (42c05bd).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8923      +/-   ##
===========================================
- Coverage    73.84%   73.83%   -0.02%     
===========================================
  Files          412      412              
  Lines        44261    44262       +1     
  Branches      3993     3993              
===========================================
- Hits         32686    32680       -6     
- Misses       11575    11582       +7     
Components Coverage Δ
cvat-ui 78.38% <ø> (-0.04%) ⬇️
cvat-server 69.97% <100.00%> (+<0.01%) ⬆️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants