Update index.html
yihedeng9 authored Feb 9, 2024
1 parent b277458 commit fa43d03
Showing 1 changed file with 12 additions and 14 deletions.
26 changes: 12 additions & 14 deletions index.html
@@ -239,6 +239,17 @@ <h2 class="subtitle has-text-centered">
           At iteration 1, SPIN has already surpassed DPO training on the majority of datasets.
         </h2>
       </div>
+      <div class="item">
+        <!-- Your image here -->
+        <img src="static/images/tab2.png"/>
+        <h2 class="subtitle has-text-centered">
+          Test performance on other reasoning benchmark datasets for SPIN at different iterations
+          and zephyr-7b-sft-full. We report the average score for MT-Bench and the accuracy score for
+          Big Bench datasets under standard few-shot CoT evaluation. On OpenBookQA, we report acc_norm
+          with 1-shot example as similar to previous literature. As similar to Open LLM Leaderboard evaluation,
+          we observe a steady improvement in performance on the other benchmark tasks, with no significant degradation.
+        </h2>
+      </div>
     </div>
   </div>
 </div>
@@ -254,9 +265,7 @@ <h2 class="title is-3">Ablation Studies</h2>
           <p>
             We examine the effect of synthetic dataset size and training epochs within an iteration.
             Our analysis demonstrates the effectiveness of the synthetic data used by SPIN compared to
-            the SFT data, as well as the necessity of iterative training in SPIN. Furthermore, to comprehensively
-            assess the performance improvements of SPIN, we perform additional evaluations on benchmark
-            tasks distinct from those in the Open LLM leaderboard.
+            the SFT data, as well as the necessity of iterative training in SPIN.
           </p>
         </div>
       </div>
@@ -289,17 +298,6 @@ <h2 class="subtitle has-text-centered">
           pivotal as training for more epochs during iteration 0 reaches a limit and cannot surpass iteration 1.
         </h2>
       </div>
-      <div class="item">
-        <!-- Your image here -->
-        <img src="static/images/tab2.png"/>
-        <h2 class="subtitle has-text-centered">
-          Test performance on other reasoning benchmark datasets for SPIN at different iterations
-          and zephyr-7b-sft-full. We report the average score for MT-Bench and the accuracy score for
-          Big Bench datasets under standard few-shot CoT evaluation. On OpenBookQA, we report acc_norm
-          with 1-shot example as similar to previous literature. As similar to Open LLM Leaderboard evaluation,
-          we observe a steady improvement in performance on the other benchmark tasks, with no significant degradation.
-        </h2>
-      </div>
     </div>
   </div>
 </div>

0 comments on commit fa43d03