AI for better BDD

We finally had to turn off Copilot because the suggestions were so bad. AI results are only as good as the training data and the prompt, and Copilot was trained on a lot of bad Gherkin examples.

Let me back up and explain…

I was recently helping a client adopt Behavior-Driven Development, or BDD, on their software teams. BDD is a collaboration practice that uses concrete examples as a way to build alignment and then as automated tests to help the team build the right thing.

When it comes time to turn examples into automated tests, most teams’ tool of choice is Cucumber (in this case Reqnroll, the Cucumber tool for .NET). And Cucumber’s language for specifying examples of system behavior is called Gherkin. Gherkin uses the Given-When-Then syntax for capturing examples—you’ve likely come across it even if you didn’t know what it was called.
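For instance, a minimal Given-When-Then scenario might look like this (the domain here is my own illustration, not the client's):

```gherkin
Feature: Account withdrawal

  Scenario: Withdrawal within the available balance
    Given Alice has a balance of $100
    When Alice withdraws $40
    Then her balance should be $60
```

Each step maps to automation code behind the scenes, but the scenario itself reads as a plain-language example of behavior.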

This particular team had Copilot autocompletion in their editors, and Copilot desperately wanted to suggest scenarios for the feature they were working on. Only, Copilot’s suggested scenarios were pure slop. The point of Gherkin is to capture examples in a way that’s both informative for humans and precise enough to act as an automated test. These might have been okay tests, but they didn’t express the desired behavior well for the human reader, missing the point of the tool.

It didn’t take long before we disabled Copilot and focused on writing good scenarios manually.

But in the back of my mind, I wondered, “Could we use AI to create better Gherkin? Better scenarios that would serve the team and its stakeholders more effectively?”

In my book on BDD with Cucumber, there’s a chapter on getting the language right in Gherkin scenarios. There’s nuance, but it comes down to 3 big things:

  1. Express your scenario as a concrete example in domain language. It shouldn’t be tautological: “the search should return the correct results.” Nor should it read like a manual test script: “click this, fill in that, click submit…”
  2. Remove noise and fillers in the language, focusing on just enough words to say what you mean.
  3. Be consistent in terminology and grammar so it’s easy to make sense of a collection of examples.

Here are a few examples to illustrate what I mean:

[Image: AI for BDD examples]
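As one sketch of what these guidelines look like in practice (the domain and wording are my own invention, not taken from the client or the book):

```gherkin
# Before: tautological outcome, steps read like a manual test script
Scenario: Search works
  Given I am on the home page
  When I type "wool" into the search box
  And I click the search button
  Then I should see the correct results

# After: a concrete example in domain language, with the noise removed
Scenario: Search matches products by material
  Given the catalog contains "Merino wool socks" and "Cotton crew socks"
  When a shopper searches for "wool"
  Then the only result is "Merino wool socks"
```

The second version states what "correct" actually means for this example, and a reader can judge the behavior without running anything.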

LLMs are good at making sense of language; this is a language problem.

So, I decided to see if I could create instructions for an LLM to do code reviews on Gherkin that would mostly say what I would say if I were doing the review.

I wrote up the ideas from the book and from my classes and presentations in an LLM-friendly format. I collected real Gherkin examples from my own archives and from open source projects on GitHub (ones that were using Cucumber for actual tests, not Cucumber demo projects). And I iterated with Claude until the instructions reliably produced good code reviews.

To my surprise, Claude was actually pretty good at taking its own advice from a code review and rewriting scenarios to be more expressive. Less good than at reviewing, but still passable.

I’ve shared the instructions file on GitHub. Give it a try. Give your favorite LLM these instructions and a Gherkin feature file, and ask it to do a code review using the instructions. (I put the instructions in a Claude project so they’d be persistent across chats.) Or use this as your Copilot code review instructions for *.feature files.

Let us know how it goes.

Of course, better Gherkin is just a small part of adopting BDD successfully. When BDD is done well, teams build the right thing, faster and with better quality. Visibility and predictability go up. Work becomes more collaborative and fun.

Interested in those outcomes? Contact us to discuss what this could look like for your team.
