SWE-bench: Add summarise utility script to view passed/failed task IDs (All-Hands-AI#2137)

* SWE-bench: Add summarise utility script to view passed/failed task IDs

* Fix typos

* Move file

* Prettify

* Use merged jsonl file
li-boxuan authored May 31, 2024
1 parent f188abd commit 4d14b44
Showing 2 changed files with 51 additions and 1 deletion.
13 changes: 12 additions & 1 deletion evaluation/swe_bench/README.md
@@ -116,9 +116,11 @@ selected_ids = ['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'scikit-learn__
Then only these tasks (rows whose `instance_id` is in the above list) will be evaluated.
In this case, the `eval_limit` option applies to tasks that are in the `selected_ids` list.

After running the inference, you will obtain an `output.jsonl` file (by default it will be saved to `evaluation/evaluation_outputs`).

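Each line of `output.jsonl` is a JSON record for one task instance. If you want a quick look at what a record contains before evaluating, a minimal sketch such as the following works; the path is a placeholder for wherever your run saved the file, and `instance_id` is the field the summariser script below keys on.

```python
import json

# Peek at the first record of the inference output (path is a placeholder).
with open('evaluation/evaluation_outputs/output.jsonl') as f:
    record = json.loads(f.readline())

print(record['instance_id'])   # the SWE-bench task id, e.g. 'sphinx-doc__sphinx-8721'
print(sorted(record.keys()))   # see which fields your run produced
```
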
## Evaluate Generated Patches

After running the inference described in the previous section, you will obtain a `output.jsonl` (by default it will save to `evaluation/evaluation_outputs`). Then you can run this one line script to evaluate generated patches, and produce a fine-grained report:
With the `output.jsonl` file, you can run `eval_infer.sh` to evaluate the generated patches and produce a fine-grained report.

If you want to evaluate existing results, you should first run this to clone existing outputs

@@ -185,6 +187,15 @@ It will contain an additional field `fine_grained_report` (see example below) c

Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about how to evaluate patches that are already generated (e.g., not by OpenDevin).

## View Result Summary

If you just want to know the resolve rate and/or see a summary of which tests passed and which failed, you can run:

```bash
poetry run python ./evaluation/swe_bench/scripts/summarise_results.py <path_to_output_merged_jsonl_file>
# e.g. poetry run python ./evaluation/swe_bench/scripts/summarise_results.py ./evaluation/evaluation_outputs/outputs/swe_bench_lite/CodeActSWEAgent/gpt-4o-2024-05-13_maxiter_50_N_v1.5-no-hint/output.merged.jsonl
```
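
The script prints the overall numbers followed by the raw lists of task IDs. The shape of the output looks like this; the counts and IDs below are made up for illustration:

```
Passed 2 tests, failed 1 tests, resolve rate = 0.6666666666666666
PASSED TESTS:
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774']
FAILED TESTS:
['scikit-learn__scikit-learn-10508']
```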

## Submit your evaluation results

You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
39 changes: 39 additions & 0 deletions evaluation/swe_bench/scripts/summarise_results.py
@@ -0,0 +1,39 @@
import json
import sys


def extract_test_results(json_file_path):
    """Split instance IDs from a merged output JSONL file into passed/failed lists."""
    passed_tests = []
    failed_tests = []
    with open(json_file_path, 'r') as file:
        for line in file:
            data = json.loads(line.strip())
            instance_id = data['instance_id']
            # The resolved flag lives under `fine_grained_report` when a
            # fine-grained report was produced, otherwise under `test_result`.
            if 'fine_grained_report' in data:
                resolved = data['fine_grained_report']['resolved']
            else:
                resolved = data['test_result']['result']['resolved']
            if resolved:
                passed_tests.append(instance_id)
            else:
                failed_tests.append(instance_id)
    return passed_tests, failed_tests


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print(
            'Usage: poetry run python summarise_results.py <path_to_output_merged_jsonl_file>'
        )
        sys.exit(1)
    json_file_path = sys.argv[1]
    passed_tests, failed_tests = extract_test_results(json_file_path)
    # Resolve rate = fraction of instances whose patch resolved the issue.
    succ_rate = len(passed_tests) / (len(passed_tests) + len(failed_tests))
    print(
        f'\nPassed {len(passed_tests)} tests, failed {len(failed_tests)} tests, resolve rate = {succ_rate}'
    )
    print('PASSED TESTS:')
    print(passed_tests)
    print('FAILED TESTS:')
    print(failed_tests)

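If you would rather re-run only the tasks that failed (via the `selected_ids` option shown earlier in the README), the same parsing logic can emit that list for you. The sketch below is not part of this commit; the script name `list_failed_ids.py` and helper `failed_instance_ids` are hypothetical, and the field layout is simply the one `summarise_results.py` reads above.

```python
import json
import sys


def failed_instance_ids(jsonl_path):
    """Collect instance IDs of tasks that were not resolved, using the same
    field layout that summarise_results.py reads."""
    failed = []
    with open(jsonl_path) as f:
        for line in f:
            data = json.loads(line.strip())
            if 'fine_grained_report' in data:
                resolved = data['fine_grained_report']['resolved']
            else:
                resolved = data['test_result']['result']['resolved']
            if not resolved:
                failed.append(data['instance_id'])
    return failed


if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('Usage: python list_failed_ids.py <path_to_output_merged_jsonl_file>')
    # Print a ready-to-paste `selected_ids = [...]` line for the failed tasks.
    print(f'selected_ids = {failed_instance_ids(sys.argv[1])}')
```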