The authors asked maintainers of scikit-learn, Sphinx, and pytest to review 296 AI-generated PRs that had passed SWE-bench’s automated grader, and found that roughly half would not be merged. Maintainer merge rates are about 24 percentage points lower than the automated grader’s pass rates and appear to improve more slowly, suggesting that benchmark scores can overestimate real-world usefulness when there is no human feedback or iteration.