How do you learn about the UX of AI? – designing tests for Drupal AI Assistants
In September 2024 Drupal unveiled AI Assistants as part of its new Drupal CMS product. Recognising the potential of this new feature to help and support University staff with web-related tasks, I collaborated with Jamie Abrahams, a Drupal AI expert, to learn how the Assistants worked and to design a UX research approach to test them.
Drupal is the open-source content management system underpinning several important University systems (including EdWeb). It is an exciting time in Drupal’s history with the launch of Drupal CMS, a new site-building and content-management product aimed at non-developers.
Read the product strategy for Drupal CMS (formerly called Drupal Starshot) on Drupal.org:
Introducing Drupal Starshot’s product strategy
Aligning AI development with UX work in Drupal
AI developments in Drupal have been rapidly gathering pace, and Drupal CMS has crystallised an opportunity for AI Assistants that can provide contextual help to empower and support people using the Drupal administrative interface. Freely Give is an agency specialising in AI development in Drupal, and Jamie Abrahams, co-founder of Freely Give, is the AI track lead for Drupal CMS.
The development of Drupal CMS is being guided by a UX mindset. Jamie and I were keen to align the AI track with the work of the UX track to ensure that the inclusion of AI in Drupal CMS remained user-centred, meeting and exceeding the needs and expectations of the non-technical target audience. In particular, we wanted to put the AI in front of representative people to see how they naturally used it and expected it to work, and to test how successfully it supported them through these interactions. We recognised that doing this would give us valuable data to drive the ongoing rapid iteration and improvement of the AI in an evidence-based way, ensuring it kept getting more useful, usable and attractive for the target audience.
Read more about Drupal AI developments on Drupal.org:
Read more about the UX work on Drupal CMS in my related blog post:
UX leading the newest developments in Drupal – a mindset shift for Drupal CMS
How Drupal AI Assistants work using AI Agents, a chatbot and LLMs
AI Assistants are powered by several Drupal AI modules and elements working together. These include the AI Chatbot, AI Agents, the AI Assistant API, AI Providers (a set of Large Language Models (LLMs)) and the AI Evaluations module.
Each AI Assistant is designed to work like a Generative Pre-trained Transformer (GPT): a person types a prompt into the AI Chatbot (rendered on the front end of the administrative interface through Blocks), and the Chatbot uses its configured functionality and Retrieval Augmented Generation (RAG) to analyse the query and call upon the functions of the relevant AI Agents and AI Providers in the back end. It then generates an appropriate natural language response, acting like a co-pilot to help the person in context. Through the AI Evaluations module, the Assistant offers the person the opportunity to rate the exchange, typically with a thumbs-up or a thumbs-down.
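To make that round trip more concrete, here is a minimal Python sketch of the flow from prompt to evaluation. It is purely illustrative: the class and function names are my own assumptions, not the actual APIs of the Drupal AI modules (which are implemented in PHP).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    """Illustrative stand-in for a Drupal AI Agent with a single capability."""
    name: str
    keywords: tuple
    action: Callable[[str], str]

    def matches(self, prompt: str) -> bool:
        # Very crude intent matching; the real modules use an LLM plus RAG for this.
        return any(word in prompt.lower() for word in self.keywords)


@dataclass
class Assistant:
    """Hypothetical Assistant: routes a prompt to a matching agent, then replies."""
    agents: list
    ratings: list = field(default_factory=list)

    def handle(self, prompt: str) -> str:
        agent = next((a for a in self.agents if a.matches(prompt)), None)
        if agent is None:
            outcome = "No agent matched; answering from general knowledge instead."
        else:
            outcome = agent.action(prompt)
        # In the real modules an AI Provider (LLM) would phrase this as natural language.
        return f"Assistant: {outcome}"

    def rate(self, thumbs_up: bool) -> None:
        # Mirrors the thumbs-up / thumbs-down capture of the AI Evaluations module.
        self.ratings.append(thumbs_up)


# Usage: a 'Field Type Agent'-style capability, as described below.
field_agent = Agent(
    name="field_type_agent",
    keywords=("field", "image"),
    action=lambda prompt: "Added an image field to the News content type.",
)
assistant = Assistant(agents=[field_agent])
print(assistant.handle("I need to add images to my news articles"))
assistant.rate(thumbs_up=True)
```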
Depending on what is required, the Assistants can be developed to have different specialities and can be made to support people in different ways. The purpose and identity of each Assistant are defined through fields in its configuration settings (including description, pre-prompt system role, assistant message and pre-action prompt). For example, Assistants can be configured to only draw upon certain LLMs (AI Providers), or to make use of specific AI Agents (for example, the ‘Module Enable Agent’, which can enable Drupal modules; the ‘Views Agent’, which can create Drupal Views; or the ‘Field Type Agent’, which can add, edit or remove field types on existing Drupal entity types). To ensure maximum flexibility in the Assistants’ capabilities, the AI Assistant API can be used to enable data sharing between them.
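As a rough illustration of what such a configuration involves, here is a hypothetical Python representation of one Assistant’s settings. The key names echo the fields described above, but this is an assumed, simplified structure rather than the module’s real configuration schema.

```python
# Hypothetical, simplified representation of one AI Assistant's configuration.
# Key names loosely echo the fields described above; this is an assumed structure,
# not the Drupal AI Assistant API's real configuration schema.
content_assistant_config = {
    "label": "Content modelling assistant",
    "description": "Helps site builders add and change content types and fields.",
    "pre_prompt_system_role": (
        "You are a helpful assistant embedded in the Drupal administrative "
        "interface. Only make changes the person has explicitly asked for."
    ),
    "assistant_message": "Hi! I can help you set up content types and fields.",
    "pre_action_prompt": "Summarise the change you are about to make before acting.",
    # Restrict which back-end pieces this Assistant may draw upon.
    "allowed_providers": ["openai"],  # AI Providers (LLMs) it may use
    "enabled_agents": [               # AI Agents it may delegate to
        "field_type_agent",           # add, edit or remove field types
        "views_agent",                # create Drupal Views
    ],
}

# Example: check at a glance which agents a given Assistant can delegate to.
print(content_assistant_config["enabled_agents"])
```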
We wanted people to test the AI Assistants to help us learn how to improve them
The AI Assistants had been proven to work in the initial iteration of Drupal CMS (released at DrupalCon Barcelona in September 2024), but they didn’t work reliably and they hadn’t been put in front of real people in the contexts in which they were envisaged to be used. There was a need to test them end-to-end, from the moment a person initiated an interaction with an Assistant with a prompt to help them complete a task, through to the person receiving the Assistant’s help and then evaluating it. Our overall goals for the testing were therefore as follows:
- To understand how people naturally expected to use the AI Assistants
- To understand how people evaluated the success of the AI Assistants
- To use the information gathered from testing to measure the success of the AI Assistants and to iterate and improve them.
What makes a good AI Assistant experience? How we designed the tests
Usability testing (observing participants completing tasks using the AI Assistants) combined with contextual enquiry (questioning participants as they interacted with the Assistants to uncover their perceptions, expectations and reactions) seemed like the best approach to address the testing goals.
I drew on conversational design books to shape a scenario and tasks
To help me work out an appropriate usability testing scenario and tasks, I referred to two books: ‘Conversations with Things: UX Design for Chat and Voice’ by Diana Deibel and Rebecca Evanhoe, and ‘Conversational Design’ by Erika Hall.
In ‘Conversations with Things’, recognising the potential for unlimited input when testing something like a voice or chat assistant, the authors pointed out the need to apply guidance and constraint when initiating testing of a conversational interface, and to carry out multiple rounds of testing, iterating and tweaking on each round. To assess the UX of a conversational interface, they suggested adopting several questions (p255):
- Can people navigate the experience successfully – and would they actually go through this experience at all?
- Which prompts need wording tweaks for clarity or high success?
- What utterances are failing, and should those be added to training data?
- Where are common failure points and can they be improved?
In ‘Conversational Design’, building on the work of the philosopher of language Paul Grice and the linguist Robin Lakoff, Erika Hall outlined five criteria representative of a successful conversational exchange (p28-30):
- Quantity – just enough information is provided, not too much and not too little
- Quality – no false information is provided, and no information that is not adequately evidenced
- Relation – information provided is appropriate and relevant to the context
- Manner – information provided is to the point and unambiguous
- Politeness – respect is shown and a good feeling is created
Taking the information from the two books together, I noted the value of doing quick successive rounds of testing, to learn and iterate on each round. I recognised that test scenarios and tasks needed to be straightforward enough for the participants to easily understand what they needed to do, so they could focus on interacting with the AI Assistant to help them. I also realised that the scenario and tasks needed to be specific, but with enough flexibility to ensure participants could express their needs and responses in their own natural language, and could choose multiple ways to complete each task.
Before starting the tests I thought about UX success indicators
Considering the potential elements of a typical conversational exchange, I anticipated that the tests would provide a very rich source of data. Thinking ahead to the data analysis, and considering that this needed to be done quickly to facilitate rapid iteration, I identified several areas to focus on to help me assess the UX of the interactions:
- Effort required by the participant to enter and engineer prompts to get their desired output
- Quality of output returned by the AI Assistant (measured according to how relevant, clear, actionable and useful the information provided was to achieve the task in question)
- How the participant rated the exchange (indicated by the evaluations given)
Building on this, I drew up a set of factors indicating a successful AI Assistant experience, from the perspective of the person (the test participant) interacting with the Assistant; a sketch of how these indicators and factors could be captured during analysis follows the list:
- Participant has a positive response to being presented with the AI Assistant
- Participant can communicate with the AI Assistant using language they would naturally use
- AI Assistant responds to participants’ prompts with relevant answers (no hallucinations)
- Participant can understand, respond to or act upon the response from the AI Assistant
- Participant completes task with the help of the AI Assistant
- Participant reflects on a positive experience and evaluates accordingly.
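As promised above, here is a minimal Python sketch of how the indicators and factors could be recorded for each observed exchange. The field names, the five-point quality scale and the success heuristic are my own illustrative assumptions, not a formal instrument used in the study.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ExchangeObservation:
    """One observed prompt/response exchange, scored against the indicators above."""
    task: str
    prompts_needed: int           # effort: prompts required to reach the desired output
    output_quality: int           # 1-5: how relevant, clear, actionable and useful (assumed scale)
    natural_language_used: bool   # participant could phrase prompts in their own words
    task_completed: bool          # participant completed the task with the Assistant's help
    thumbs_up: Optional[bool]     # AI Evaluations rating, if the participant gave one

    def looks_successful(self) -> bool:
        """Rough heuristic combining the success factors listed above."""
        return (
            self.task_completed
            and self.output_quality >= 4
            and self.prompts_needed <= 3
            and self.thumbs_up is not False
        )


# Usage: record a single exchange from a moderated session.
observation = ExchangeObservation(
    task="Add images to news articles",
    prompts_needed=2,
    output_quality=4,
    natural_language_used=True,
    task_completed=True,
    thumbs_up=True,
)
print(asdict(observation), observation.looks_successful())
```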
We created a starting scenario and tasks that fitted with existing functionality
With this background knowledge in mind, I worked with Jamie to understand what was achievable in a test environment, given the stage of AI Assistant development and the functionality within Drupal CMS at the time. In the Drupal CMS instance produced by Jamie’s company Freely Give, the AI Assistants could support the creation of content types and the amendment of fields within existing content types. However, since we wanted to test the use of the Assistants by non-technical users, it wasn’t appropriate to present test participants with tasks in these system-specific terms. Instead, we used this underlying information to build a scenario and tasks the participants could relate to. We came up with the following (a sketch of the script captured as reusable test data appears after the tasks):
Scenario: You run a community group and you’ve taken the first steps to set up a website to keep people informed about what you do.
Task 1: You can publish textual news content on your site but you now need to be able to add images to the news items you publish. How would you use the AI Assistant to make sure you can add images to future news articles you publish?
[Expected result: image field added to news content type]
Task 2: In addition to the news articles, you want to be able to publish longer-form pieces providing information about collaborative community projects you have worked on, to include things like logos and links to the partners you have worked with. How would you use the AI Assistants to help you do this?
[Expected result: new content type created for long-form articles]
Follow-up task question: How successful would you say the AI Assistant was in helping you achieve this task?
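The script above could also be kept as structured data, so the scenario, tasks and expected results can be tweaked and reused between testing rounds. The sketch below is a hypothetical Python encoding of it; the structure is my own, not something produced as part of the project.

```python
# Hypothetical encoding of the test script as reusable data, so the scenario,
# tasks and expected results can be tweaked and reused between testing rounds.
test_script = {
    "scenario": (
        "You run a community group and you've taken the first steps to set up "
        "a website to keep people informed about what you do."
    ),
    "tasks": [
        {
            "id": 1,
            "ask": "Use the AI Assistant to make sure you can add images to "
                   "future news articles you publish.",
            "expected_result": "Image field added to the News content type.",
        },
        {
            "id": 2,
            "ask": "Use the AI Assistants to publish longer-form pieces about "
                   "collaborative community projects, including partner logos and links.",
            "expected_result": "New content type created for long-form articles.",
        },
    ],
    "follow_up_question": (
        "How successful would you say the AI Assistant was in helping you "
        "achieve this task?"
    ),
}

# Example: print what the moderator should look for at the end of each task.
for task in test_script["tasks"]:
    print(f"Task {task['id']}: expected -> {task['expected_result']}")
```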
We built in capacity for multiple iterative testing rounds
From the guidance I had read, it was good practice to test conversational interfaces in multiple rounds, and considering our goals for the testing, we recognised the value of working iteratively to give us the greatest opportunity to learn. We therefore sketched out a plan for two or three broad phases of moderated testing with individual participants, followed by a later unmoderated phase open to all:
- An exploratory pilot test with a single participant, using the scenario and tasks to try out the end-to-end AI Assistant process, from prompt to evaluation
- A few iterated test rounds based on learnings from the pilot, being prepared to tweak the test environment, scenario and/or tasks
- A wider test available to anyone online, using scenarios and tasks of their choosing (adhering to any constraints dictated by functionality in the test environment)
Read more about the plans for testing in the meta issue on Drupal.org:
[Meta] Conduct evaluations of the AI Agents
I did a pilot test followed by two iterations of tests over two weeks
With the scenario, tasks, test environment and research plan in place, I was ready to recruit participants and schedule tests at the end of November 2024. Recognising the importance of observing how University staff naturally sought to interact with the AI Assistants without prior bias, I was careful not to describe the Assistants’ functionality when putting out a call for willing participants. Seven people were recruited. Working with Jamie and the team at Freely Give, we set up a test environment with anonymised access, and then scheduled an exploratory pilot test followed by iterated first and second round tests.
Testing was very successful – I’ve written other blog posts with full details
Overall, the participants responded positively to the AI Assistants: they liked the information the Assistants provided, were impressed with what they could do, and could see the value of the Assistants in supporting them both to complete tasks and to learn more about how the system worked. There were some areas for improvement, and we were able to work on some of these between testing rounds, making back-end and front-end tweaks to improve the functionality of the AI Assistants based on how we saw them perform for the participants in the tests.
More details of what we learned and the iterations made between tests are documented in separate blog posts:
Early adventures in UX AI research: Pilot testing Drupal AI Assistants
Iterated testing for rapid, rich learning: Researching the UX of Drupal AI Assistants
I’ve also written a blog post summarising what we learned from the tests and what is next:
Making AI useful and usable – learnings from UX research of Drupal AI Assistants