Users can follow `inspect_ai`'s official documentation to set up the corresponding API keys for the types of models they would like to evaluate (e.g., `OPENAI_API_KEY` for OpenAI models). In most cases, after setting up the key, users can directly start the SciCode evaluation with the following command:
```bash
inspect eval scicode.py --model <your_model> --temperature 0
```
However, there are some additional command-line arguments that may also be useful:
- `--max-connections`: Maximum number of concurrent API connections to the evaluated model.
- `--limit`: Limits the number of samples to evaluate from the SciCode dataset.
- `-T input_path=<another_input_json_file>`: Useful when the user wants to switch to another JSON dataset (e.g., the dev set).
- `-T output_dir=<your_output_dir>`: Changes the default output directory (`./tmp`).
- `-T h5py_file=<your_h5py_file>`: Used if your h5py file is not downloaded to the recommended directory.
- `-T with_background=True/False`: Whether to include the problem background.
- `-T mode=normal/gold/dummy`: Provides two additional modes for sanity checks:
  - `normal` mode is the standard mode for evaluating a model.
  - `gold` mode can only be used on the dev set; it loads the gold answers.
  - `dummy` mode does not call any real LLMs and generates dummy outputs.
For example, a user can run five samples on the dev set with background included as follows:
```bash
inspect eval scicode.py \
    --model openai/gpt-4o \
    --temperature 0 \
    --limit 5 \
    -T input_path=../data/problems_dev.jsonl \
    -T output_dir=./tmp/dev \
    -T with_background=True \
    -T mode=gold
```
Users can run the evaluation on DeepSeek-V3 using Together AI via the following command:
```bash
export TOGETHER_API_KEY=<YOUR_API_KEY>
inspect eval scicode.py \
    --model together/deepseek-ai/DeepSeek-V3 \
    --temperature 0 \
    --max-connections 2 \
    --max-tokens 32784 \
    -T output_dir=./tmp/deepseek-v3 \
    -T with_background=False
```
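The same evaluations can also be launched from Python rather than the CLI using `inspect_ai`'s `eval()` function. The sketch below mirrors the dev-set example above; treat the parameter names (in particular `task_args` as the Python counterpart of the `-T` options) as an assumption to be checked against the `inspect_ai` documentation.

```python
# Sketch of launching the SciCode task from Python instead of the CLI.
# Assumes inspect_ai's eval() accepts task parameters via task_args
# (the equivalent of the -T flags shown above).
from inspect_ai import eval

eval(
    "scicode.py",
    model="openai/gpt-4o",
    temperature=0,
    limit=5,
    task_args={
        "input_path": "../data/problems_dev.jsonl",
        "output_dir": "./tmp/dev",
        "with_background": True,
        "mode": "gold",
    },
)
```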
For more information regarding `inspect_ai`, we refer users to its official documentation.
During the evaluation, the sub-steps of each main problem in SciCode are passed in order to the evaluated LLM, together with the necessary prompts and the LLM's responses to previous sub-steps. The Python code generated by the LLM is parsed and saved to disk, then run against the test cases to determine pass or fail for each sub-step. A main problem is considered solved only if the LLM passes all of its sub-steps.
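Concretely, the flow for a single main problem can be pictured with the following sketch. The helper callables and dataset field names are illustrative placeholders, not the actual SciCode implementation.

```python
# Illustrative sketch of the per-problem evaluation loop described above.
# The callables (generate, build_prompt, extract_code, run_tests) and the
# dataset field names are hypothetical placeholders, not SciCode internals.
from pathlib import Path
from typing import Callable


def evaluate_problem(
    problem: dict,
    generate: Callable[[str], str],                  # queries the evaluated LLM
    build_prompt: Callable[[dict, list[str]], str],  # builds the sub-step prompt
    extract_code: Callable[[str], str],              # parses code from the response
    run_tests: Callable[[Path, list], bool],         # runs a sub-step's test cases
    output_dir: Path,
) -> bool:
    """Return True only if every sub-step of the main problem passes its tests."""
    previous_code: list[str] = []  # code generated for earlier sub-steps
    all_passed = True
    for step in problem["sub_steps"]:
        # Each sub-step prompt includes the code produced for earlier sub-steps.
        prompt = build_prompt(step, previous_code)
        response = generate(prompt)

        # Parse the Python code out of the model response and save it to disk.
        code = extract_code(response)
        step_file = output_dir / f"{problem['problem_id']}.{step['step_number']}.py"
        step_file.write_text(code)

        # Run the sub-step's test cases against the saved code.
        passed = run_tests(step_file, step["test_cases"])
        all_passed = all_passed and passed
        previous_code.append(code)

    # The main problem is solved only if all sub-steps passed.
    return all_passed
```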
We use the SciCode `inspect_ai` integration to evaluate OpenAI's GPT-4o and compare it with the original evaluation. The comparison between the two approaches is shown below.
[💡It should be noted that slightly different results are common due to the randomness of LLM generation.]
| Methods | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
|---|---|---|
| `inspect_ai` Evaluation | 3.1 (2/65) | 25.1 |
| Original Evaluation | 1.5 (1/65) | 25.0 |