[GSoC] Project 4: Hyperparameter Optimization API in Katib for LLMs #2339

Open
3 of 6 tasks
Tracked by #2386
helenxie-bit opened this issue Jun 1, 2024 · 6 comments

helenxie-bit commented Jun 1, 2024

Motivation

The rapid advancements and growing popularity of Large Language Models (LLMs) have driven an increased need for effective LLMOps in Kubernetes environments. To address this, we developed a train API within the Training Python SDK, simplifying the process of fine-tuning LLMs using distributed PyTorchJob workers. However, hyperparameter optimization remains a crucial yet labor-intensive task for enhancing model performance.

Goal

This project aims to develop a high-level API for tuning hyperparameters of LLMs that automates the process of hyperparameter optimization in Kubernetes.

By leveraging Katib and the Training Operator, this API lets users either define a custom objective function or import pretrained models and datasets from external platforms such as HuggingFace and Amazon S3. Users also specify the objective metrics, optimization algorithm, optimization goal, resource configuration, and so on. The API then automates the creation and execution of the Experiment and its Trials to find the best hyperparameters. This abstraction of Kubernetes infrastructure complexities will enable data scientists to optimize hyperparameters efficiently and effectively.
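To make the Experiment and Trial idea concrete, here is a minimal in-process sketch of the loop such an API automates. It is not the Katib SDK and does not touch Kubernetes: `run_experiment`, `Trial`, and `toy_loss` are hypothetical names chosen for illustration, and plain random search stands in for whichever optimization algorithm a user would select.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trial:
    """One evaluation of the objective with a fixed set of hyperparameters."""
    params: Dict[str, float]
    metric: float


def run_experiment(
    objective: Callable[[Dict[str, float]], float],
    search_space: Dict[str, Tuple[float, float]],
    max_trial_count: int,
    goal: str = "minimize",
    seed: int = 0,
) -> Tuple[Trial, List[Trial]]:
    """Random search: sample every parameter uniformly and keep the best trial."""
    if goal not in ("minimize", "maximize"):
        raise ValueError("goal must be 'minimize' or 'maximize'")
    rng = random.Random(seed)
    trials: List[Trial] = []
    for _ in range(max_trial_count):
        # Draw one value per hyperparameter from its (low, high) range.
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in search_space.items()}
        trials.append(Trial(params=params, metric=objective(params)))
    pick = min if goal == "minimize" else max
    best = pick(trials, key=lambda t: t.metric)
    return best, trials


def toy_loss(params: Dict[str, float]) -> float:
    # Hypothetical stand-in for a fine-tuning run; lowest at learning_rate = 0.01.
    return (params["learning_rate"] - 0.01) ** 2


best, trials = run_experiment(
    objective=toy_loss,
    search_space={"learning_rate": (1e-4, 1e-1)},
    max_trial_count=20,
)
print(best.params, best.metric)
```

In the real setting each trial would be a distributed fine-tuning job rather than a local function call, but the inputs are the same ones listed above: an objective, a search space, a goal, and a trial budget.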

[Figure: design_tune_api_20240906]

What I Did in GSoC Project & Ongoing Works

  1. Prepare
  2. Development
  3. Wrapping up code and documentation
     • Create documentation for the API, including usage instructions, code examples, etc.
  4. Other PRs

What I Learned from This Project

This is my first experience with open source, and as a beginner with Docker and Kubernetes, I gained significant knowledge throughout this project. Beyond understanding containers, Kubernetes, API development, and CI/CD pipelines, I’ve learned valuable lessons that will benefit my future studies and work:

Think from the User's Perspective: One key lesson was the importance of considering the user’s needs. Discussing API design with my mentors taught me to focus on what functionalities users need and how they prefer to use them. Listening to users’ feedback is crucial for effective product design.

Don't Fear Bugs: I used to be flustered by bugs and unsure how to address them. My mentor guided me through the debugging process, showing me how to understand and trace bugs. The key is to approach debugging methodically and think through the problem.

Communication is Important: Communication is essential to collaboration, especially in open source projects. There are various channels for it, such as GitHub issues and PRs, Slack, and community meetings. I’m grateful to my mentor for discussing my challenges during our weekly meetings and providing invaluable guidance.

Every Contribution Counts: Initially, I thought contributing to open source was complex. I learned that every contribution, no matter how small, is valuable and appreciated. For example, contributing to documentation is crucial, especially for newcomers.

In The End

Thank you to Google for this invaluable opportunity. I’m deeply grateful to everyone who supported me throughout this project: @andreyvelich @johnugeorge @deepanker13 @tenzen-y @nsingl00 @Electronic-Waste. Your suggestions, advice, and help were essential to completing my work.

I also want to say a huge thanks to my mentor @andreyvelich. I'm impressed by your deep knowledge of the project and the industry, and by your willingness to help. Your encouragement during our first meeting, when you shared that you also found Kubernetes challenging at first, gave me great confidence. I appreciate the time and effort you invested in guiding me through this project, from the overall design of the API to the details of code formatting. I’ve learned a lot from your guidance.

I believe that anyone contributing to open source in their spare time has a passion for coding, and I’m glad to have worked with such a dedicated group. I will continue contributing and hope to support other beginners in the future.

@andreyvelich
Member

/area gsoc

helenxie-bit changed the title from "Tracking Issue: Implementation of Tuning API in Katib for LLMs" to "[GSoC] Project 4 Tracking Issue: Hyperparameter Optimization API in Katib for LLMs" on Jun 16, 2024
@helenxie-bit
Contributor Author

/assign

github-actions bot commented Dec 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Electronic-Waste
Member

/remove-label lifecycle/stale

@Electronic-Waste: The label(s) /remove-label lifecycle/stale cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label lifecycle/stale

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member

/remove-lifecycle stale
