Function is easier than Fit
When building the scripts and prompts for the AIvsTDD workshop, I wanted* to understand the relevant moving parts, and to abstract the ones that were generic.
One thing that became very clear was that asking for code is easy. Some LLMs respond with code that is fairly reliably syntactically correct. Code that runs first time felt like a triumph, for me and my butterfingers. A temporary triumph.
Getting code that passes tests was the aim of the magic loop. With a bunch of shell scripts, that goal was both achievable and interesting.
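For readers who want the shape of that loop, here is a minimal sketch in Python rather than shell. It is not the workshop's scripts: `magic_loop`, `generate` and `run_tests` are hypothetical names, and `generate` stands in for whatever call reaches your LLM. The only idea it shows is ask, test, feed the failures back, repeat.

```python
from typing import Callable, Optional, Tuple


def magic_loop(
    generate: Callable[[str], str],                 # hypothetical: prompt in, code out
    run_tests: Callable[[str], Tuple[bool, str]],   # hypothetical: code in, (passed, report) out
    prompt: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask for code, test it, and feed failures back until the tests pass."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(prompt + feedback)
        passed, report = run_tests(code)
        if passed:
            return code
        # Next round, the prompt carries the failure report along with the request.
        feedback = "\n\nThe previous attempt failed these tests:\n" + report
    return None  # gave up: no attempt passed within max_rounds
```

Note what the loop never asks: where the code goes. It hands back a passing lump of code and stops.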
However, getting code that fits is harder. When you give an LLM a huge great lump of code, and ask it for a change, how is it meant to tell you the several places that the change needs to be made? Remember that it's building its reply word-by-word, line-by-line. It doesn't 'know' what it's going to write, and doesn't look back over what it has written to see where it fits.
Something needs to work out where it goes: maybe another prompt to the LLM (you could ask what needs to change and where before asking for the change), maybe your deterministic tools (it's common to ask for a format that looks like diff output), maybe you. In our workshop, we pulled a fast one, and skipped that problem**.
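As an illustration of the deterministic-tools route, here is a sketch of applying a diff-like reply. It assumes the LLM has been asked for a pair of text blocks: the existing code to find, and the code to put in its place. `apply_edit` is a hypothetical helper, not any particular tool's format. The point is that the tool should refuse to guess when the edit doesn't fit.

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit, refusing to guess where it fits."""
    count = source.count(search)
    if count == 0:
        # The model quoted code that isn't in the file (or not exactly).
        raise ValueError("search text not found: the edit does not fit this file")
    if count > 1:
        # The quoted code appears more than once, so the target is unclear.
        raise ValueError(f"search text matches {count} places: the edit is ambiguous")
    return source.replace(search, replace, 1)
```

Both failure cases are fit problems rather than function problems: the replacement code might be perfectly good, and still have nowhere unambiguous to go.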
It's interesting to find, then, this experiment from Aider: GPT Code Editing Benchmarks, looking at the differences between editing a file based on (something like) a diff, and replacing the whole thing.
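To make the second option concrete: if you ask for the whole file back, placement is no longer the LLM's problem, but you still want to know what it touched. A sketch using Python's standard `difflib` is below. `changed_regions` is a hypothetical helper of mine, and none of this is Aider's benchmark code.

```python
import difflib


def changed_regions(original: str, replacement: str, name: str = "file") -> str:
    """Return a unified diff showing what a whole-file replacement changed."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        replacement.splitlines(keepends=True),
        fromfile=f"{name} (original)",
        tofile=f"{name} (replacement)",
    )
    # An empty string means the replacement is identical to the original.
    return "".join(diff)
```

The diff shows every line that differs, including any the LLM changed without being asked to, which is one reason to look before accepting a whole-file reply.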
Aider's initial conclusions balanced between editing and replacement. In their follow-on and ongoing experiment LLM Leaderboards, they seem to be leaning more towards the editing approach. Either way, the firm conclusion is that while LLMs can produce runnable and maybe-satisfying code, it's harder to reliably put that code in the right place.
To follow: Patterns of failure in integrating generated code.
* I reckoned I wanted to work from scratch, but that is entirely illusory when your tools rest on the ingested content of the internet.
** The sharp-eyed in our workshop will notice that we worked that particular misdirection by asking for changes to one file only, and asking for the whole file.