Function is easier than Fit

Articles Feb 28, 2025 (Apr 7, 2025)

When building the scripts and prompts for the AIvsTDD workshop, I wanted* to understand the relevant moving parts, and to abstract the ones that were generic.

One thing that became very clear was that asking for code is easy. Some LLMs respond with code that is fairly reliably syntactically correct. Code that runs first time felt like a triumph, for me and my butterfingers. A temporary triumph.

Getting code that passes tests was the aim of the magic loop. With a bunch of shell scripts, that goal was both achievable and interesting.
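As a rough illustration of the shape of such a loop (in Python rather than the workshop's shell scripts, and not the workshop's actual code), here is a minimal sketch. The names `magic_loop`, `generate` and `run_tests` are hypothetical: `generate` stands in for whatever calls the LLM, and `run_tests` for whatever runs the unit tests and reports failures.

```python
from typing import Callable, Optional


def magic_loop(
    generate: Callable[[str], str],
    run_tests: Callable[[str], Optional[str]],
    prompt: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask for code, test it, and feed failures back until the tests pass.

    `generate` is a stand-in for an LLM call (hypothetical).
    `run_tests` returns None on success, or a failure report as text
    (also hypothetical). Returns the passing code, or None if we give up.
    """
    feedback = ""
    for _ in range(max_rounds):
        code = generate(prompt + feedback)
        failure = run_tests(code)
        if failure is None:
            return code
        # Only the most recent failure is passed back into the next request.
        feedback = "\n\nThe previous attempt failed:\n" + failure
    return None


# Fakes so the sketch runs without an LLM: wrong answer first, then right.
attempts = iter(["return a - b", "return a + b"])


def fake_generate(_prompt: str) -> str:
    return next(attempts)


def fake_run_tests(code: str) -> Optional[str]:
    return None if "a + b" in code else "expected 3, got -1"


result = magic_loop(fake_generate, fake_run_tests, "Write add(a, b).")
print(result)
```

Note what the loop checks: function, not fit. It has no opinion about where the returned code belongs.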

However, getting code that fits is harder. When you give an LLM a huge great lump of code, and ask it for a change, how is it meant to tell you the several places that the change needs to be made? Remember that it's building its reply word-by-word, line-by-line. It doesn't 'know' what it's going to write, and doesn't look back over to see where it fits when it's done so.

Something needs to work out where it goes: maybe another prompt to the LLM (you could ask what needs to change and where before asking for the change), maybe your deterministic tools (it's common to ask for a format that looks like diff output), maybe you. In our workshop, we pulled a fast one, and skipped that problem**.
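To make the "deterministic tools" option concrete, here is a minimal sketch, assuming the LLM has been asked to reply with a pair of snippets: the text to find, and the text to put in its place. This is a simplified stand-in for a diff-like format, not any particular tool's format, and `apply_edit` is a hypothetical name.

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search` in `source`.

    Raises ValueError if the search text is missing or appears more than
    once, because in either case we cannot tell where the change belongs.
    """
    count = source.count(search)
    if count == 0:
        raise ValueError("search text not found: the edit does not fit")
    if count > 1:
        raise ValueError(f"search text found {count} times: placement is ambiguous")
    return source.replace(search, replace, 1)


original = "def area(w, h):\n    return w + h\n"
patched = apply_edit(original, "return w + h", "return w * h")
print(patched)
```

The two `ValueError` branches are the interesting part: the generated code may be perfectly good, and still have nowhere unambiguous to go.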

It's interesting to find, then, this experiment from Aider: GPT Code Editing Benchmarks – looking at the differences between editing a file based on (something like) a diff, and replacing the whole thing.

🤯

Aside: We started with the unit tests, and iterated several times. Aider start with a description, then run the tests and offer a second round based on the test results; the LLM doesn't see the unit tests. But, as they point out, these are all part of the training data, so in a way the LLM already knows what code works.

Aider's initial conclusions balanced between editing and replacement. In their follow-on and ongoing experiment LLM Leaderboards, they seem to be leaning more towards the editing approach. Either way, the firm conclusion is that while LLMs can produce runnable and maybe-satisfying code, it's harder to reliably put that code in the right place.

To follow: Patterns of failure in integrating generated code.

James Lyndsay

Getting better at software testing. Singing in Bulgarian. Staying in. Going out. Listening. Talking. Writing. Making.
