Photo by Cash Macanaya / Unsplash

Guiding Hands-off AI using Hands-on TDD

Workshops and Talks · Jun 27, 2024 (updated Jun 28, 2024)

Bart Knaack and I will run this hands-on workshop at Agile Testing Days.

We'll update this page as we build the workshop. We'll share the ways that we have found to guide AI towards code that passes automated tests, and the stuff we've tried that hasn't worked for us. I also hope that we'll give you an insight into how we set up an interactive workshop.

Preparing for the Workshop

A series as we build stuff

My Magic Loop is working!

I wanted part of this workshop to seem magical – and while watching code appear that corresponds to what you've written is astonishing, it's no longer magic. Especially to testers, who see that the code is often just awful.

Copying and pasting code suggested by Copilot or Cody, then running the test suite, is repetitive and clerical. I want to automate that away.

The magic I'd imagined is that participants add an (automated, confirmatory) test, step back while something else builds code that passes that test, and step in to see how weird the built thing might be.

Here's the chunk of shell script at the heart of this magic loop:

llm -t rewrite_python_to_pass_tests -p code "$(< ./src/$1)" -p tests "$(< ./tests/test_$1)" -p test_results "$(pytest ./tests/test_$1)" '' > ./src/$1

Let's unpick:

That line starts with Simon Willison's llm – a command-line tool that acts as an interface to generative AIs.

The command takes a parameter ($1 above) which is the name of the source file to be changed. It uses that parameter to gather data in three lumps indicated with -p and named code, tests and test_results.

Each data gathering bit runs a tiny shell command: < to get the contents of a file and pytest to run the tests. So I'm labelling and sending in the code I expect to change, the tests I need the code to satisfy, and the results of running the tests through the code (remember I expect the tests to fail, and that the failure info is in some way helpful). I also expect the AI to make sense of these three.

llm uses all that labelled information to fill in a 'template' called rewrite_python_to_pass_tests, fires the filled-in template to a generative AI, and waits for the output.

llm's actions are set up in the following rewrite_python_to_pass_tests template:

model: claude-3.5-sonnet
system:  You are expert at Python. You can run an internal python interpreter. You can run pytest tests. You are methodical and able to explain your choices if asked. You write clean Python 3 paying attention to PEP 8 style. Your code is readable. When asked for ONLY code, you will output only the full Python code, omitting any precursors, headings, explanation, placeholders or ellipses. Output for ONLY code should start with a shebang – if you need to give me a message, make it a comment in the code.
prompt: 'Starting from Python code in $code, output code which has been changed to pass tests in $tests. Please note that the code currently fails the tests with message $test_results. Your output will be used to replace the whole of the input code, so please output ONLY code.'
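
For the curious: llm keeps templates as YAML files in a directory of its own, and has commands to show you where that is and to open a template in your editor:

llm templates path                                # where the YAML templates live
llm templates edit rewrite_python_to_pass_tests   # create or edit this template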

I've already set llm up with a plugin and key so it can talk to an AI – in this case, Anthropic's Claude because it was released on Monday and all the nerds are gushing. Also, I spent a fiver on tokens there.
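
If you want to reproduce that setup, it's roughly the following – I'm assuming the llm-claude-3 plugin here, which is what provides the claude-3.5-sonnet model name used in the template above:

llm install llm-claude-3                 # plugin that lets llm talk to Anthropic's models
llm keys set claude                      # paste in an Anthropic API key when prompted
llm -m claude-3.5-sonnet 'Say hello'     # quick smoke test that the plumbing works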

llm gives the AI a system prompt to tell it how to behave in general, and a prompt to pass that data and set a task. I've done it this way to separate character from task. It might also let me, later, build the script out so that it can carry on the conversation with the AI, keeping necessary context.

When the AI hands back what it's generated, llm hands it to the shell and the shell overwrites the source file with whatever llm spits out.

So that's the line. The line lives in a short shell script, which runs the tests again on the new code. If the tests run without failure, the script stops, with a message that the code is ready for inspection. If not, it iterates a few times, and reports the test results if the code still doesn't satisfy the tests after a few goes.
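
A minimal sketch of that wrapper – file layout, loop count and messages here are my guesses rather than the real thing – looks something like this:

#!/usr/bin/env bash
# magic_loop.sh – a sketch of the wrapper, not the real thing.
# Usage: ./magic_loop.sh calculator.py  (expects ./src/calculator.py and ./tests/test_calculator.py)
SRC="./src/$1"
TESTS="./tests/test_$1"

for attempt in 1 2 3; do
    # Ask the AI to rewrite the code so that it passes the tests, and overwrite the source file
    llm -t rewrite_python_to_pass_tests \
        -p code "$(< "$SRC")" \
        -p tests "$(< "$TESTS")" \
        -p test_results "$(pytest "$TESTS")" \
        '' > "$SRC"
    # Run the tests against the new code; stop as soon as they pass
    if pytest "$TESTS"; then
        echo "Tests pass – code is ready for inspection."
        exit 0
    fi
done
echo "Tests still failing after $attempt goes:"
pytest "$TESTS"
exit 1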

And that's the magic loop. You write tests, run a script, the code changes and the tests pass. Mostly. Takes 5-20 seconds.

Magic over: What's next?

Either the newly-working code is ready for inspection. Perhaps participants will fire up the system to explore, run a diff, write more tests to generate more code, or just commit and move on.

Or, the script's done and the code is bust. Maybe one just runs the script again to see what it does this time. Maybe one fixes the code directly. Maybe one looks at one's tests and realises that the tests are inconsistent. Maybe the AI has vandalised the code so much that you go get the last one out of change control.

In my limited playtime, I've been amused to see comments from the AI in the code to indicate that the tests are odd, but that the code has been adjusted to pass them anyway. That's how you pass as a thinking thing. Welcome to the team, Claude.

🫢
Yes, it overwrites the code. No, you can't get the code back with an undo. That's what change control is for. And besides: code is cheap.
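
If you want a cheap safety net anyway, a commit before each run means the previous version is only ever one git command away (script and file names below are hypothetical):

git add -A && git commit -m "before letting the AI loose"   # snapshot first
./magic_loop.sh calculator.py                               # run the loop
git diff                                                    # see what the AI actually changed
git restore src/calculator.py                               # roll back if it made a mess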

What we've tried, and might try

  • Working in the IDE
  • Working in a shell script
  • Cody
  • Copilot
  • OpenAI
  • Ollama
  • Claude

On my rough list of next steps: Moving to Python (me? shell??), trying different AIs, continuing the conversation with the AI, automated checkin, working in Replit, custom models in Replicate, changing prompts, building something odd, building something useful, multiple files / tests, different prompts for different purposes, mapping the (financial) costs, imagining just how many ways this can go wrong or is already wrong...


Workshop Stuff – video and abstract

In this hands-on workshop, you’ll write tests, and an AI will write the code.

We’ll give you a zero-install environment with a simple unit testing framework, and an AI that can parse that framework. You’ll add to the tests, run the harness to see that they fail, then ask the AI to write code to make them pass. You’ll look at the code, ask for changes if it seems necessary, incorporate that code and run the tests for real. You’ll explore to find unexpected behaviours, and add tests to characterise those failures – or to expand what your system does. As you add more tests, the AI will make more code. Maybe you’ll pause to refactor the code within your tests.

Bart and James are exploring the different technologies and approaches that make this possible. We’ll bring worked examples, different test approaches, and enough experience (we hope) to help you to work towards insights that are relevant to you. All you need to bring is a laptop (or competent tablet) and an enquiring mind. You’ll take away direct experience of co-building code with an AI, and of finding problems in AI-coded systems. We hope that you’ll learn the power and the pitfalls of working in this way – and you'll see how we worked together to find out for ourselves.


Making this workshop, and making it interactive

Interactive workshops – and most especially those that ask participants to work with technology rather than with each other – are risky. We've done plenty, and they always fail (and succeed) in unexpected ways. We learn loads – I'll share here some of what we're learning about how we deliver this workshop, which might also help us share what we've learned about delivering workshops overall.

General principles – unfinished...

Our general trick is to do as much of the infrastructural heavy lifting as we can, so that participants can get straight to testing work.

Workshops are a great way to learn. If participants are learning how to download and install a tool, then that's fine... but I'd prefer to get as much of the tool-sourcing out of the way so that we can get to the tool use and through that to the thinking.

We try to work within the constraints of a conference environment – short sessions, varied skills, random kit, flaky wifi. We've been known to bring laptops and routers and servers: now we configure tools that can run in browsers on tablets.

Aside: Conferences need places for people who work with technology to play with technology. We set up the TestLab at conferences because we recognised that testing conferences would be enhanced by having somewhere, and something, to test.

Bart and I find making interactive technical workshops a challenge. It's a privilege to do them, and it's painful to get to a point where we can do them. There is always a vale of shit that we need to pass through, where the whole premise seems misguided, where the workshop seems undeliverable, where we've lost our connection with each other, where we have no sense of the experience we'd like to deliver. And we'll try to work through or round all those things.

The way to work is often simpler than we'd imagined, and that simplicity is often invisible before we've done the work. Frequently, we've bought (I've bought) the complexity, and we need to let something go to find a way through. Knowing that we have to deliver something is a great way to focus on the good bits. Knowing why we're delivering something is a great way to stay on track, and that sense of purpose is something we've built over years of...
