HN

The authors had maintainers from scikit-learn, Sphinx, and pytest review 296 AI-generated PRs that passed SWE-bench’s automated grader and found that roughly half would not be merged. Maintainer merge rates are about 24 percentage points lower than the rates reported by the automated grader and show slower apparent improvement, suggesting that benchmarks can overestimate real-world usefulness when there is no human feedback or iteration.

benchmarks ai-code-generation software-engineering open-source
40 pts 2 comments