Evaluates tasks from the attempter's perspective, rating model responses across several key dimensions.
3+
A GPT that validates GitHub PRs for inclusion in SWE-bench.
1+