> For django-32517

Although I agree with your analysis and it doesn't look great for the authors, this issue (https://code.djangoproject.com/ticket/32517) arguably falls into their "Solution leak" category anyway, as the following text appears in the issue description (and so, I think, directly in `problem_statement` rather than `hints_text`):

> Currently, OrderedSet isn't reversible (i.e. allowed to be passed as an argument to Python's reversed()). This would be natural to support given that OrderedSet is ordered. This should be straightforward to add by adding a __reversed__() method to OrderedSet.

It isn't the exact code, though, so I suppose it could be argued instead that the issue is just extremely easy.

Anecdotal, but I was always shocked to see Claude 3.5 perform so poorly in the benchmarks, when it generates 80% of my code in Cursor (and in the cases where it fails, no other model succeeds).

Copilot is very different, though. FWIW, most people seem to find Copilot super valuable.

The discussion here is more around highly autonomous AI "coders" (Cursor, Cline/Roo Code, (Open)Devin, etc.).

Yep, anecdotally that's basically spot-on. It's also one of the reasons that I still find Copilot vastly more useful than highly autonomous AI tooling (Cursor, Roo Code, Avante, etc.).

To be totally fair, using a PhD as a barometer of anything without specifying a PhD in what is like claiming that LLMs have encyclopedic knowledge while meaning a children's encyclopedia.

This was also my first thought, but reading [1] again, what they did was labeling things like:

> Whether we consider the issue description to be underspecified and hence unfair to be testing on.

> Whether the FAIL_TO_PASS unit tests filter out valid solutions.

...and a bit more. This is pointed out in the linked paper too. The moral of the story, for me, is: don't just take the paid human annotators' word for it. You can (hopefully) still believe the PhD students doing these unpaid jobs as their research ;-)

[1] https://openai.com/index/introducing-swe-bench-verified/

Right, so that AI companies can freely throw this significantly more valuable training data into a model and then turn around and advocate for clamping down on the freedom of models.

> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Looking at the benchmark, https://www.swebench.com/, about half of the scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?

I would guess that reasoning models generalize better (i.e. have a smaller discrepancy between stuff in the training set and stuff outside it), but it would be very interesting to check.

Large ones do better than small ones, but still worse than I would have expected before I tested them. E.g. `o1` doesn't know things that are repeated several times on Wikipedia.

o1 is not that large, and its emphasis is on reasoning rather than memorization.

Try the largest Llama models, and phrase your prompt like a sentence to be completed rather than a question.
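
For concreteness, here's a toy sketch of the two phrasings; the model id is just a placeholder for whichever large Llama base model you can actually run, and everything else here is illustrative rather than a recommendation.

```python
from transformers import pipeline

# Placeholder id: substitute whatever large Llama *base* (non-instruct) model you have access to.
generate = pipeline("text-generation", model="meta-llama/Meta-Llama-3-70B")

# Question phrasing leans on instruction-following:
question = "In what year did the Eiffel Tower open?"

# Completion phrasing lets a base model simply continue a factual sentence,
# which tends to surface memorized facts more reliably:
completion = "The Eiffel Tower opened to the public in the year"

print(generate(completion, max_new_tokens=5)[0]["generated_text"])
```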

My own impression of SoTA models is that they're very useful for coding, yet they suck ass at solving unique problems (which is the case for every sufficiently large codebase).

I was wondering how long this would take to surface. You can tell a surprising amount just by carefully watching how the trainers answer interview questions, which is kinda meta really.

I found that this paper was submitted to ICLR but got rejected: https://openreview.net/forum?id=pwIGnH2LHJ

To me, the analysis of SWE-Bench is a solid contribution and informative. My guess is that to meet the conference's submission bar they had to come up with their own benchmark (SWE-Bench+), which wasn't thorough enough, and the paper got rejected mainly because of that.

> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Is this what Hofstadter means by a strange loop?

You should immediately publish a paper on arXiv with your revolutionary IEF brand, an improvement on transformer and Mamba architectures. Then, like Ilya, take $1B in funding the following week.

For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop." But if you look at the diff, that's clearly wrong: the try-except block and the running-loop check were already there before the patch. The human patch just indented them, which makes them show up in the diff as both - and + lines, while the AI patch left them alone. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.
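
To make the ordering difference concrete, here is a minimal, self-contained sketch of the two shapes as I read them from the diff; the function names and the stand-in exception are mine, and neither function is either patch verbatim.

```python
import asyncio
import os


class SynchronousOnlyOperation(Exception):
    """Stand-in for django.core.exceptions.SynchronousOnlyOperation."""


def check_env_first(message):
    # Roughly the human patch's shape: consult the env var first, so the
    # event-loop probe is skipped entirely when DJANGO_ALLOW_ASYNC_UNSAFE is set.
    if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            pass  # no running event loop, so the call is safe
        else:
            raise SynchronousOnlyOperation(message)


def check_loop_first(message):
    # Roughly the AI patch's shape: probe for a running event loop first, and
    # only consult the env var if one is found. The env lookup is skipped in
    # the common no-event-loop case.
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        pass  # no running event loop, so the call is safe
    else:
        if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
            raise SynchronousOnlyOperation(message)
```

Either way, the error is raised only when an event loop is running and the escape hatch isn't set, which is why I read the two patches as behaviorally equivalent.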
For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. Reversing a dict goes through `dict.__reversed__()`, which yields the keys in reverse insertion order, and reversing the `.keys()` view yields exactly the same sequence, so it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.
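
To see the equivalence, here is a toy, dict-backed OrderedSet in the same style as Django's (not the actual class) with the human-style `__reversed__`; the assertions show that the supposedly "entirely different" expressions yield the same keys.

```python
class OrderedSet:
    """Toy dict-backed ordered set, not Django's actual OrderedSet."""

    def __init__(self, iterable=()):
        self.dict = dict.fromkeys(iterable)

    def __iter__(self):
        return iter(self.dict)

    def __reversed__(self):
        # Human-patch style; the AI-patch style would be
        # `reversed(self.dict.keys())`, which yields the same keys
        # in the same (reverse insertion) order.
        return reversed(self.dict)


s = OrderedSet([1, 2, 3])
assert list(reversed(s)) == [3, 2, 1]
# The two expressions are interchangeable (Python 3.8+):
assert list(reversed(s.dict)) == list(reversed(s.dict.keys())) == [3, 2, 1]
```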
Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.