Towards a unified user experience
Source: rawpixel.com
Conversational AI is an application of LLMs that has triggered a lot of buzz and attention due to its scalability across many industries and use cases. While conversational systems have existed for decades, LLMs have brought the quality push that was needed for their large-scale adoption. In this article, we will use the mental model shown in Figure 1 to dissect conversational AI applications (cf. Building AI products with a holistic mental model for an introduction to the mental model). After considering the market opportunities and the business value of conversational AI systems, we will explain the additional “machinery” in terms of data, LLM fine-tuning, and conversational design that needs to be set up to make conversations not only possible but also useful and enjoyable.
Figure 1: Mental model of an AI system
1. Opportunity, value, and limitations
Traditional UX design is built around a multitude of artificial UX elements, swipes, taps, and clicks, requiring a learning curve for each new app. Using conversational AI, we can do away with this busyness, substituting it with the elegant experience of a naturally flowing conversation in which we can forget the transitions between different apps, windows, and devices. We use language, our universal and familiar protocol for communication, to interact with different virtual assistants (VAs) and accomplish our tasks.
Conversational UIs are not exactly the new hot stuff. Interactive voice response systems (IVRs) and chatbots have been around since the 1990s, and major advances in NLP have been closely followed by waves of hope and development for voice and chat interfaces. However, before the time of LLMs, most of the systems were implemented in the symbolic paradigm, relying on rules, keywords, and conversational patterns. They were also limited to a specific, pre-defined domain of “competence”, and users venturing outside of these would soon hit a dead end. All in all, these systems were mined with potential points of failure, and after a couple of frustrating attempts, many users never came back to them. The following figure illustrates an example dialogue. A user who wants to order tickets for a specific concert patiently goes through a detailed interrogation flow, only to find out at the end that the concert is sold out.
Figure 2: Example of a poor conversation flow
As an enabling technology, LLMs can take conversational interfaces to a new level of quality and user satisfaction. Conversational systems can now display much broader world knowledge, linguistic competence, and conversational ability. Leveraging pre-trained models, they can also be developed in much shorter timespans, since the tedious work of compiling rules, keywords, and dialogue flows is now replaced by the statistical knowledge of the LLM. Let’s look at two types of applications where conversational AI can provide value at scale:
Customer support and, more generally, applications that are used by a large number of users who often make similar requests. Here, the company providing the customer support has a clear information advantage over the user and can leverage this to create a more intuitive and enjoyable user experience. Consider the case of rebooking a flight. For myself, a rather frequent flyer, this is something that happens 1–2 times per year. In-between, I tend to forget the details of the process, not to speak of the user interface of a specific airline. By contrast, the customer support of the airline has rebooking requests at the front and center of their operations. Instead of exposing the rebooking process via a complex graphical interface, its logic can be “hidden” from customers who contact the support, and they can use language as a natural channel to make their rebooking. Of course, there still remains a “long tail” of requests that are less familiar. For example, imagine a spontaneous mood swing that pushes a business customer to add her beloved dog as excess baggage to a booked flight. These more individual requests can be either passed on to human agents or covered via an internal knowledge management system that is connected to the virtual assistant.Knowledge management which is grounded in a large quantity of data. For many modern companies, the internal knowledge they accumulate over years of operating, iterating, and learning is a core asset and differentiator — if it is stored, managed, and accessed in an efficient way. Sitting on a wealth of data that is hidden in collaboration tools, internal wikis, knowledge bases, etc., we often fail to transform it into actionable knowledge. As employees leave, new employees are onboarded, and you never come to finalize that documentation page you started three months ago, valuable knowledge falls victim to entropy. It becomes more and more difficult to find a way through the internal data labyrinth and get your hands on the bits of information required in a specific business situation. This leads to huge efficiency losses for knowledge workers. To address this issue, we can augment LLMs with semantic search on internal data sources. LLMs allow to use natural-language questions instead of complex formal queries to ask questions against this database. Users can thus focus on their information needs rather than on the structure of the knowledge base or the syntax of a query language such as SQL. Being text-based, these systems work with data in a rich semantic space, making meaningful connections “under the hood”.
Beyond these major application areas, there are numerous other applications such as telehealth, mental health assistants, and educational chatbots that allow to streamline UX and bring value to their users in a faster and more efficient way.
2. Data
LLMs are originally not trained to engage in fluent small talk or more substantial conversations. Rather, they learn to generate the following token at each inference step, eventually resulting in a coherent text. This low-level objective is different from the challenge of human conversation. Conversation is incredibly intuitive for humans, but it gets incredibly complex and nuanced when you want to teach a machine to do it. For example, let’s look at the fundamental notion of intents. When we use language, we do so for a specific purpose, which is our communicative intent — it could be to convey information, socialize, or ask someone to do something. While the first two are rather straightforward for an LLM (as long as it has seen the required information in the data), the latter is already more challenging. Not only does the LLM need to combine and structure the related information in a coherent way, but it also needs to set the right emotional tone in terms of soft criteria such as formality, creativity, humor, etc. This is a challenge for conversational design (cf. section 5), which is closely intertwined with the task of creating fine-tuning data.
Making the transition from classical language generation to recognizing and responding to specific communicative intents is an important step toward better usability and acceptance of conversational systems. As for all fine-tuning endeavors, this starts with the compilation of an appropriate dataset.
The fine-tuning data should come as close as possible to the (future) real-world data distribution. First, it should be conversational (dialogue) data. Second, if your virtual assistant will be specialized in a specific domain, you should try to assemble fine-tuning data that reflects the necessary domain knowledge. Third, if there are typical flows and requests that will be recurring frequently in your application, as in the case of customer support, try to incorporate varied examples of these in your training data. The following table shows a sample of conversational fine-tuning data from the 3K Conversations Dataset for ChatBot, which is freely available on Kaggle:
Table 1: Sample of conversational fine-tuning data from the 3K Conversations Dataset for ChatBot
Manually creating conversational data can become an expensive undertaking — crowdsourcing and using LLMs to help you generate data are two ways to scale up. Once the dialogue data is collected, the conversations need to be assessed and annotated. This allows you to show both positive and negative examples to your model and nudge it towards picking up the characteristics of the “right” conversations. The assessment can happen either with absolute scores or labels, or a ranking of different options between each other. The latter approach leads to more accurate fine-tuning data because humans are normally better at ranking multiple options than evaluating them in isolation.
With your data in place, you are ready to fine-tune your model and enrich it with additional capabilities. In the next section, we will look at fine-tuning, integrating additional information from memory and semantic search, and connecting agents to your conversational system to empower it to execute specific tasks.
3. Assembling the conversational system
A typical conversational system is built with a conversational agent that orchestrates and coordinates the components and capabilities of the system, such as the LLM, the memory, and external data sources. The development of conversational AI systems is a highly experimental and empirical task, and your developers will be in a constant back-and-forth between optimizing your data, improving the fine-tuning strategy, playing with additional components and enhancements, and testing the results. Non-technical team members, including product managers and UX designers, will also be continuously testing the product. Based on their customer discovery activities, they are in a great position to anticipate future users’ conversation style and content and should be actively contributing this knowledge.
3.1 Teaching conversation skills to your LLM
For fine-tuning, you need your fine-tuning data (cf. section 2) and a pre-trained LLM. LLMs already know a lot about language and the world, and our challenge is to teach them the principles of conversation. In fine-tuning, the target outputs are texts, and the model will be optimized to generate texts that are as similar as possible to the targets. For supervised fine-tuning, you first need to clearly define the conversational AI task you want the model to perform, gather the data, and run and iterate over the fine-tuning process.
With the hype around LLMs, a variety of fine-tuning methods have emerged. For a rather traditional example of fine-tuning for conversation, you can refer to the description of the LaMDA model.[1] LaMDA was fine-tuned in two steps. First, dialogue data is used to teach the model conversational skills (“generative” fine-tuning). Then, the labels produced by annotators during the assessment of the data are used to train classifiers that can assess the model’s outputs along desired attributes, which include sensibleness, specificity, interestingness, and safety (“discriminative” fine-tuning). These classifiers are then used to steer the behavior of the model towards these attributes.
Figure 3: LaMDA is fine-tuned in two steps
Additionally, factual groundedness — the ability to ground their outputs in credible external information — is an important attribute of LLMs. To ensure factual groundedness and minimize hallucination, LaMDA was fine-tuned with a dataset that involves calls to an external information retrieval system whenever external knowledge is required. Thus, the model learned to first retrieve factual information whenever the user made a query that required new knowledge.
Another popular fine-tuning technique is Reinforcement Learning from Human Feedback (RLHF)[2]. RLHF “redirects” the learning process of the LLM from the straightforward but artificial next-token prediction task towards learning human preferences in a given communicative situation. These human preferences are directly encoded in the training data. During the annotation process, humans are presented with prompts and either write the desired response or rank a series of existing responses. The behavior of the LLM is then optimized to reflect the human preference.
3.2 Adding external data and semantic search
Beyond compiling conversations for fine-tuning the model, you might want to enhance your system with specialized data that can be leveraged during the conversation. For example, your system might need access to external data, such as patents or scientific papers, or internal data, such as customer profiles or your technical documentation. This is normally done via semantic search (also known as retrieval-augmented generation, or RAG)[3]. The additional data is saved in a database in the form of semantic embeddings (cf. this article for an explanation of embeddings and further references). When the user request comes in, it is preprocessed and transformed into a semantic embedding. The semantic search then identifies the documents that are most relevant to the request and uses them as context for the prompt. By integrating additional data with semantic search, you can reduce hallucination and provide more useful, factually grounded responses. By continuously updating the embedding database, you can also keep the knowledge and responses of your system up-to-date without constantly rerunning your fine-tuning process.
3.3 Memory and context awareness
Imagine going to a party and meeting Peter, a lawyer. You get excited and start pitching the legal chatbot you are currently planning to build. Peter looks interested, leans towards you, uhms and nods. At some point, you want his opinion on whether he would like to use your app. Instead of an informative statement that would compensate for your eloquence, you hear: “Uhm… what was this app doing again?”
The unwritten contract of communication among humans presupposes that we are listening to our conversation partners and building our own speech acts on the context we are co-creating during the interaction. In social settings, the emergence of this joint understanding characterizes a fruitful, enriching conversation. In more mundane settings like reserving a restaurant table or buying a train ticket, it is an absolute necessity in order to accomplish the task and provide the expected value to the user. This requires your assistant to know the history of the current conversation, but also of past conversations — for example, it should not be asking for the name and other personal details of a user over and over whenever they initiate a conversation.
One of the challenges of maintaining context awareness is coreference resolution, i.e. understanding which objects are referred to by pronouns. Humans intuitively use a lot of contextual cues when they interpret language — for example, you can ask a young child, “Please get the green ball out of the red box and bring it to me,” and the child will know you mean the ball, not the box. For virtual assistants, this task can be rather challenging, as illustrated by the following dialogue:
Assistant: Thanks, I will now book your flight. Would you also like to order a meal for your flight?
User: Uhm… can I decide later whether I want it?
Assistant: Sorry, this flight cannot be changed or canceled later.
Here, the assistant fails to recognize that the pronoun it from the user refers not to the flight, but to the meal, thus requiring another iteration to fix this misunderstanding.
3.4 Additional guardrails
Every now and then, even the best LLM will misbehave and hallucinate. In many cases, hallucinations are plain accuracy issues — and, well, you need to accept that no AI is 100% accurate. Compared to other AI systems, the “distance” between the user and the AI is rather small between the user and the AI. A plain accuracy issue can quickly turn into something that is perceived as toxic, discriminative, or generally harmful. Additionally, since LLMs don’t have an inherent understanding of privacy, they can also reveal sensitive data such as personally identifiable information (PII). You can work against these behaviors by using additional guardrails. Tools such as Guardrails AI, Rebuff, NeMo Guardrails, and Microsoft Guidance allow you to de-risk your system by formulating additional requirements on LLM outputs and blocking undesired outputs.
Multiple architectures are possible in conversational AI. The following schema shows a simple example of how the fine-tuned LLM, external data, and memory can be integrated by a conversational agent which is also responsible for the prompt construction and the guardrails.
Figure 4: Schema of a conversational AI system including a fine-tuned LLM, a database for semantic search, and a memory component
4. User experience and conversational design
The charm of conversational interfaces lies in their simplicity and uniformity across different applications. If the future of user interfaces is that all apps look more or less the same, is the job of the UX designer doomed? Definitely not — conversation is an art to be taught to your LLM so it can conduct conversations that are helpful, natural, and comfortable for your users. Good conversational design emerges when we combine our knowledge of human psychology, linguistics, and UX design. In the following, we will first consider two basic choices when building a conversational system, namely whether you will use voice and/or chat, as well as the larger context of your system. Then, we will look at the conversations themselves, and see how you can design the personality of your assistant while teaching it to engage in helpful and cooperative conversations.
4.1 Voice versus chat
Conversational interfaces can be implemented using chat or voice. In a nutshell, voice is faster while chat allows users to stay private and to benefit from enriched UI functionality. Let’s dive a bit deeper into the two options since this is one of the first and most important decisions you will face when building a conversational app.
To pick between the two alternatives, start by considering the physical setting in which your app will be used. For example, why are almost all conversational systems in cars, such as those offered by Nuance Communications, based on voice? Because the hands of the driver are already busy and they cannot constantly switch between the steering wheel and a keyboard. This also applies to other activities like cooking, where users want to stay in the flow of their activity while using your app. Cars and kitchens are mostly private settings, so users can experience the joy of voice interaction without worrying about privacy or about bothering others. By contrast, if your app is to be used in a public setting like the office, a library, or a train station, voice might not be your first choice.
After understanding the physical setting, consider the emotional side. Voice can be used intentionally to transmit tone, mood, and personality — does this add value in your context? If you are building your app for leisure, voice might increase the fun factor, while an assistant for mental health could accommodate more empathy and allow a potentially troubled user a larger diapason of expression. By contrast, if your app will assist users in a professional setting like trading or customer service, a more anonymous, text-based interaction might contribute to more objective decisions and spare you the hassle of designing an overly emotional experience.
As a next step, think about the functionality. The text-based interface allows you to enrich the conversations with other media like images, as well as graphical UI elements such as buttons. For example, in an e-commerce assistant, an app that suggests products by posting their pictures and structured descriptions will be way more user-friendly than one that describes products via voice and potentially provides their identifiers.
Finally, let’s talk about the additional design and development challenges of building a voice UI:
There is an additional step of speech recognition that happens before user inputs can be processed with LLMs and Natural Language Processing (NLP).Voice is a more personal and emotional medium of communication — thus, the requirements for designing a consistent, appropriate, and enjoyable persona behind your virtual assistant are higher, and you will need to take into account additional factors of “voice design” such as timbre, stress, tone, and speaking speed.Users expect your voice conversation to proceed at the same speed as a human conversation. To offer a natural interaction via voice, you need a much shorter latency than for chat. In human conversations, the typical gap between turns is 200 milliseconds — This prompt response is possible because we start constructing our turns while listening to our partner’s speech. Your voice assistant will need to match up with this degree of fluency in the interaction. By contrast, for chatbots, you compete with time spans of seconds, and some developers even introduce an additional delay to make the conversation feel like a typed chat between humans.Communication via voice is a linear, one-off enterprise — if your user didn’t get what you said, you are in for a tedious, error-prone clarification loop. Thus, your turns need to be as concise, clear, and informative as possible.
If you go for the voice solution, make sure that you not only clearly understand the advantages as compared to chat, but also have the skills and resources to address these additional challenges.
4.2 Where will your conversational AI live?
Now, let’s consider the larger context in which you can integrate conversational AI. All of us are familiar with chatbots on company websites — those widgets on the right of your screen that pop up when we open the website of a business. Personally, more often than not, my intuitive reaction is to look for the Close button. Why is that? Through initial attempts to “converse” with these bots, I have learned that they cannot satisfy more specific information requirements, and in the end, I still need to comb through the website. The moral of the story? Don’t build a chatbot because it’s cool and trendy — rather, build it because you are sure it can create additional value for your users.
Beyond the controversial widget on a company website, there are several exciting contexts to integrate those more general chatbots that have become possible with LLMs:
Copilots: These assistants guide and advise you through specific processes and tasks, like GitHub CoPilot for programming. Normally, copilots are “tied” to a specific application (or a small suite of related applications).Synthetic humans (also digital humans): These creatures “emulate” real humans in the digital world. They look, act, and talk like humans and thus also need rich conversational abilities. Synthetic humans are often used in immersive applications such as gaming, and augmented and virtual reality.Digital twins: Digital twins are digital “copies” of real-world processes and objects, such as factories, cars, or engines. They are used to simulate, analyze, and optimize the design and behavior of the real object. Natural language interactions with digital twins allow for smoother and more versatile access to the data and models.Databases: Nowadays, data is available on any topic, be it investment recommendations, code snippets, or educational materials. What is often hard is to find the very specific data that users need in a specific situation. Graphical interfaces to databases are either too coarse-grained or covered with endless search and filter widgets. Versatile query languages such as SQL and GraphQL are only accessible to users with the corresponding skills. Conversational solutions allow users to query the data in natural language, while the LLM that processes the requests automatically converts them into the corresponding query language (cf. this article for an explanation of Text2SQL).
4.3 Imprinting a personality on your assistant
As humans, we are wired to anthropomorphize, i.e. to inflict additional human traits when we see something that vaguely resembles a human. Language is one of the most unique and fascinating characteristics of humankind, and conversational products will automatically be associated with humans. People will imagine a person behind their screen or device — and it is good practice to not leave this specific person to the chance of your users’ imaginations, but rather lend it a consistent personality that matches well with your product and brand. This process is called “persona design”.
The first step of persona design is understanding the character traits you would like your persona to display. Ideally, this is already done at the level of the training data — for example, when using RLHF, you can ask your annotators to rank the data according to traits like helpfulness, politeness, fun, etc., in order to bias the model towards the desired characteristics. These characteristics can be matched with your brand attributes to create a consistent image that continuously promotes your branding via the product experience.
Beyond general characteristics, you should also think about how your virtual assistant will deal with specific situations beyond the “happy path”. For example, how will it respond to user requests that are beyond its scope, reply to questions about itself, and deal with abusive or vulgar language?
It is important to develop explicit internal guidelines on your persona that can be used by data annotators and conversation designers. This will allow you to design your persona in a purposeful way and keep it consistent across your team and over time, as your application undergoes multiple iterations and refinements.
4.4 Making conversations helpful with the “principle of cooperation”
Have you ever had the impression of talking to a brick wall when you were actually speaking with a human? Sometimes, we find our conversation partners are just not interested in leading the conversation to success. Fortunately, in most cases, things are smoother, and humans will intuitively follow the “principle of cooperation” that was introduced by the language philosopher Paul Grice. According to this principle, humans who successfully communicate with each other follow four maxims, namely quantity, quality, relevance, and manner.
Maxim of quantity
The maxim of quantity asks the speaker to be informative and make their contribution as informative as required. On the side of the virtual assistant, this also means actively moving the conversation forward. For example, consider this snippet from an e-commerce fashion app:
Assistant: What kind of clothing items are you looking for?
User: I am looking for a dress in orange.
Assistant: Don’t: Sorry, we don’t have orange dresses at the moment.
Do: Sorry, we don’t have dresses in orange, but we have this great and very comfortable dress in yellow: …
The user hopes to leave your app with a suitable item. By stopping the conversation because you don’t have items that would fit the exact description, you kill off the possibility of success. However, if your app makes suggestions about alternative items, it will appear more helpful and leave the option of a successful interaction open.
Especially in voice interactions, it is important to find the right balance between providing all the information the user might need for success, while not overwhelming them with unnecessary information which might cloud the interaction.
Maxim of quality
The maxim of quality asks speakers to be truthful and not say things they believe are false, or for which they lack adequate evidence. There is a lot of subjectivity and personal belief involved here, thus, this maxim cannot be directly projected onto LLMs. As we have seen, LLMs tend to hallucinate, and this risk is especially high when the conversation drifts off into less familiar territories. Additional guardrails can be used in order to avoid misleading statements and bring the conversation back on track.
Maxim of relevance
The maxim of relevance asks us to be — well, relevant. As a precondition, the virtual assistant has to understand the true intent of the user. Here is an example of an airline assistant that violates this rule:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: OK, I found the following flights from Berlin to Zurich for tomorrow morning. Flight BZ36294 leaves Berlin at 6.30am and arrives in Zurich at 8am. Flight BZ96294 leaves Berlin at 8am and arrives at 9.30am. Flight BZ96294 leaves Berlin at 8am and arrives in Zurich at 9.30am. Flight BZ95614 leaves Berlin at 11am and arrives in Zurich at 12.30pm. There is also the possibility of a connecting flight, in which case you would leave Berlin at 11.45am and arrive in Zurich at 3pm.
User: Sorry… could you repeat this again?
The assistant is overwhelming the user with detailed information that might not be relevant. By breaking this up and guiding the user to limit their search space, the dialogue becomes much easier to digest:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: Ok, I found flights leaving at 6.30, 8, and 11am. Which time would you like to leave?
User: I need to be in Zurich before 9am.
Assistant: OK, so you can take the flight BZ36294. It leaves at 6.30 and arrives at 8am. Should I buy the ticket for you?
User: Yes, thanks.
Maxim of manner
Finally, the maxim of manner states that our speech acts should be clear, concise and orderly, avoiding ambiguity and obscurity of expression. Your virtual assistant should avoid technical or internal jargon, and favour simple, universally understandable formulations.
While Grice’s principles are valid for all conversations independently of a specific domain, LLMs that were not trained specifically for conversation will often fail to fulfill them. Thus, when compiling your training data, it is important to have enough dialogue samples that allow your model to learn these principles.
The domain of conversational design is developing rather quickly. Whether you are already building AI products or thinking about your career path in AI, I encourage you to dig deeper into this topic (cf. the excellent introductions in [5] and [6]). As AI is turning into a commodity, good design together with a defensible data strategy will become two important differentiators for AI products.
Summary
Let’s summarize the key takeaways from the article. Additionally, figure 6 shows a “cheatsheet” with the main points that you can download as a reference.
LLMs enhance conversational AI: Large Language Models (LLMs) have significantly improved the quality and scalability of conversational AI applications across various industries and use cases.Conversational AI can add a lot of value to applications with lots of similar user requests (e.g. customer service), or which need to access a large quantity of unstructured data (e.g. knowledge management).Data: Fine-tuning LLMs for conversational tasks requires high-quality conversational data that closely mirrors real-world interactions. Crowdsourcing and LLM-generated data can be valuable resources for scaling data collection.Putting the system together: Developing conversational AI systems is an iterative and experimental process, involving constant optimization of data, fine-tuning strategies, and component integration.Teaching conversation skills to LLMs: Fine-tuning LLMs involves training them to recognize and respond to specific communicative intents and situations.Adding external data with semantic search: Integrating external and internal data sources using semantic search enhances the AI’s responses by providing more contextually relevant information.Memory and context awareness: Effective conversational systems must maintain context awareness, including tracking the history of the current conversation and past interactions, to provide meaningful and coherent responses.Setting guardrails: To ensure responsible behavior, conversational AI systems should employ guardrails to prevent inaccuracies, hallucinations, and breaches of privacy.Persona design: Designing a consistent persona for your conversational assistant is essential to create a cohesive and branded user experience. Persona characteristics should align with your product and brand attributes.Voice vs. chat: Choosing between voice and chat interfaces depends on factors like the physical setting, emotional context, functionality, and design challenges. Consider these factors when deciding on the interface for your conversational AI.Integration in various contexts: Conversational AI can be integrated in different contexts, including copilots, synthetic humans, digital twins, and databases, each with specific use cases and requirements.Observing the Principle of Cooperation: Following the principles of quantity, quality, relevance, and manner in conversations can make interactions with conversational AI more helpful and user-friendly.Figure 6: Key takeaways and best practices for conversational AI
References
[1] Heng-Tze Chen et al. 2022. LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything.
[2] OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. Retrieved on January 13, 2022.
[3] Patrick Lewis et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
[4] Paul Grice. 1989. Studies in the Way of Words.
[5] Cathy Pearl. 2016. Designing Voice User Interfaces.
[6] Michael Cohen et al. 2004. Voice User Interface Design.
Note: All images are by the author, except noted otherwise.
Redefining Conversational AI with Large Language Models was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Heya i am for the first time here. I found this board and I to find It really helpful & it helped me out much. I’m hoping to present something again and help others like you helped me.