
enhance: STaR Integration #1514

Merged
merged 13 commits into from
Jan 30, 2025

Conversation

Wendong-Fan
Member

Description

Enhancement based on review comment: #1478 (review).

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the examples folder)

Implemented Tasks

  • Subtask 1
  • Subtask 2
  • Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.


@Wendong-Fan Wendong-Fan requested a review from GitHoobar January 26, 2025 23:36
@Wendong-Fan Wendong-Fan marked this pull request as ready for review January 26, 2025 23:44
@Wendong-Fan Wendong-Fan self-assigned this Jan 26, 2025
@Wendong-Fan Wendong-Fan added the enhancement New feature or request label Jan 26, 2025
@Wendong-Fan Wendong-Fan changed the title Refactor: STaR Integration refactor: STaR Integration Jan 26, 2025
@Wendong-Fan Wendong-Fan changed the title refactor: STaR Integration enhance: STaR Integration Jan 26, 2025
@GitHoobar
Collaborator

Can we further add a method to validate the problem format?

@ZIYU-DEEP
Contributor

ZIYU-DEEP commented Jan 27, 2025

Thanks a lot for this PR! I just took a quick look, and here are a few comments:

  1. Deviation from STaR:
    This is a great implementation of an in-context, test-time self-improving reasoner! However, there are some deviations (or innovations!) from the original STaR method. For example, (i) this is an in-context method designed for test time, whereas STaR is a training-time method requiring reinforcement fine-tuning; (ii) there is no rationalization process as in STaR (i.e., given problem x and true solution y, generate the reasoning trace z), since this runs at test time; (iii) (minor) few-shot examples with reasoning traces are missing - we can probably add an entry somewhere to allow users to include them.
    I would suggest considering a rename for this feature, perhaps making it a more general self-improving reasoner?

  2. Other minor suggestions:

  • Solution generation: it looks like the current implementation generates one solution per iteration. If that is the case, we could extend it to generate multiple solutions in parallel per problem per iteration, then proceed with the highest-ranked one(s) in the next iteration - just like a tree, where the depth is the number of iterations and the width is the search budget for solutions per iteration. To improve diversity in solution generation, maybe we could allow the use of multiple different LLM APIs per generation.
  • Empirical results: it would be nice to add a README file in the examples folder for this implementation, to help users quickly understand the background and usage and to show some experimental results comparing this method with single-round prompting.
  • Add fine-tuning support in the future.

What do you think?

@Wendong-Fan
Member Author

Wendong-Fan commented Jan 29, 2025


Hey @ZIYU-DEEP, thanks for the review and happy Chinese New Year!

> (i) This is an in-context method designed for test time, whereas STaR is a training-time method requiring reinforcement fine-tuning.

I think our goal here is to incorporate only the data-generation aspect of STaR into this module. Model training can still be achieved by using the generated reasoning data in a fine-tuning pipeline, which can then be validated within our current pipeline. Including fine-tuning directly in this module would reduce modularity and increase complexity. WDYT?

> (ii) There is no rationalization process as in STaR (i.e., given problem x and true solution y, generate the reasoning trace z).

Agree and updated:

    def generate(self, rationalization: bool = False) -> List[Dict[str, Any]]:
        r"""Execute the STaR pipeline on all problems.

        Process problems and return results. If output_path is specified,
        also save results to file.

        Args:
            rationalization (bool, optional): Whether to use rationalization.
                (default: :obj:`False`)

        Returns:
            List[Dict[str, Any]]: List of processed results
        """

> (iii) Few-shot examples with reasoning traces are missing - we can probably add an entry somewhere to allow users to include them.

Agree and updated:

    def __init__(
        self,
        reason_agent: ChatAgent,
        evaluate_agent: ChatAgent,
        problems: List[Dict],
        max_iterations: int = 3,
        score_threshold: Union[float, Dict[str, float]] = 0.7,
        reward_model: Optional[BaseRewardModel] = None,
        output_path: Optional[str] = None,
        few_shot_examples: Optional[str] = None,
    ):
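For illustration, here is a hypothetical helper showing how the new few_shot_examples string could be threaded into the reasoning prompt. The function name and prompt wording are assumptions, not the actual implementation:

```python
from typing import Optional

def build_reasoning_prompt(problem: str, few_shot_examples: Optional[str] = None) -> str:
    """Assemble the reasoning prompt, prepending optional few-shot
    examples (problem/reasoning-trace pairs supplied by the user)."""
    parts = []
    if few_shot_examples:
        parts.append(
            "Here are some example solutions with reasoning traces:\n"
            + few_shot_examples
        )
    parts.append("Problem:\n" + problem)
    parts.append("Think step by step, then give the final answer.")
    return "\n\n".join(parts)
```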

> Generating multiple solutions in parallel per problem per iteration

I agree that generating multiple solutions at once could enhance diversity and improve quality within one iteration. However, the improvement it brings over iterative refinement after evaluation may be limited. Additionally, it would increase token consumption and runtime, so I think it may not be necessary for now. WDYT?
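For reference, the best-of-n variant being discussed could be sketched like this - a generic sketch with hypothetical callables, not part of this PR. Depth of the search tree would come from the outer iteration loop; this handles the width:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Generate n candidate solutions in parallel and keep the
    highest-scored one for the next iteration."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [problem] * n))
    scored = [(c, score(problem, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])
```

Diversity could then come from giving each worker a different model backend instead of the same `generate` callable.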

> Adding a README file in the examples folder to help users quickly understand the background and usage

Agree and updated.

> Adding fine-tuning support in the future

Fine-tuning will be an independent module that integrates with the current pipeline.

@Wendong-Fan
Member Author

> Can we further add a method to validate the problem format?

Thanks @GitHoobar, added validation.
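A problem-format check along these lines is what the request amounts to; this is a hypothetical sketch (the required/optional key sets are assumptions about the schema, not the PR's actual validator):

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"problem"}  # assumed schema; adjust to the pipeline's real fields
OPTIONAL_KEYS = {"id", "type", "solution"}

def validate_problem_format(problems: List[Dict[str, Any]]) -> None:
    """Raise ValueError on the first malformed problem entry."""
    if not isinstance(problems, list):
        raise ValueError("problems must be a list of dicts")
    for i, p in enumerate(problems):
        if not isinstance(p, dict):
            raise ValueError(f"problem {i} must be a dict, got {type(p).__name__}")
        missing = REQUIRED_KEYS - p.keys()
        if missing:
            raise ValueError(f"problem {i} is missing required keys: {sorted(missing)}")
        if not isinstance(p["problem"], str) or not p["problem"].strip():
            raise ValueError(f"problem {i}: 'problem' must be a non-empty string")
```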

@GitHoobar
Collaborator

Thanks for the addition!

@Wendong-Fan Wendong-Fan merged commit 44bea0d into feat/star-datagen Jan 30, 2025
@Wendong-Fan Wendong-Fan deleted the star_enhance_wd branch January 30, 2025 10:39