Synergy of LLM and GUI, Beyond the Chatbot

Use OpenAI GPT function calling to drive your mobile app

Image created using Midjourney

Introduction

We introduce a radical UX approach to optimally blend Conversational AI and Graphical User Interface (GUI) interaction in the form of a Natural Language Bar. It sits at the bottom of every screen and allows users to interact with your entire app from a single entry point. Users always have the choice between language and direct manipulation. They do not have to search where and how to accomplish tasks and can express their intentions in their own language, while the speed, compactness, and affordance of the GUI are fully preserved. Definitions of the screens of a GUI are sent along with the user’s request to the Large Language Model (LLM), letting the LLM navigate the GUI toward the user’s intention. This article builds upon a previous article and optimizes the concept presented there. An implemented Flutter sample app is available here for you to try for yourself, and the full Flutter code is available on GitHub, so you can explore the concept in your own context. This article is intended for product owners, UX designers, and mobile developers.

Background

Natural language interfaces and Graphical User Interfaces (GUIs) connect the human user to the abilities of the computer system. Natural language makes it possible for humans to communicate with each other about things outside the immediate situation, while pointing allows communication about concrete items in the world. Pointing requires less cognitive effort from your communicative counterpart than producing and processing natural language, and it also leaves less room for confusion. Natural language, however, can convey information about the entire world: concrete, abstract, past, present, future, and the meta-world, offering random access to everything.

With the rise of ChatGPT, the interpretation quality of NLP has reached a high level, and with function calling it is now feasible to build complete natural language interfaces to computer systems that make few misinterpretations. The current trend in the LLM community is to focus on chat interfaces as the main conversational user interface. This approach stems from chat being the primary form of written human-to-human interaction, preserving conversational history in a scrolling window. Many sorts of information, however, are better suited to graphical representation. A common approach is to weave GUI elements into the chat conversation. The cost of this is that the chat history becomes bulky, and the state management of GUI elements in a chat history is non-trivial. Also, by fully adopting the chat paradigm we lose the option of offering menu-driven interaction paths to the users, so they are left more in the dark with respect to the abilities of the app.

The approach taken here can be applied to a whole range of apps, such as banking, shopping, and travel apps. Mobile apps have their most important feature on the front screen, but features on other tabs or on screens buried in menus may be very difficult for users to find. When users can express their requests in their own language, they can naturally be taken to the screen that is most likely to satisfy their needs. And even when the most important feature is on the front screen, the number of options available for this core feature may be too overwhelming to present all at once as GUI elements. Natural language approaches this from the other end: the users have the initiative and express exactly what they want. Combining the two leads to an optimum, where both approaches complement each other and users can pick the option that best suits their task or subtask.

The Natural Language Bar

The Natural Language Bar (NLB) allows users to type or say what they want from the app. Along with their request, the definitions of all screens of the app are sent to the LLM using a technique coined ‘function calling’ by OpenAI. In our concept, we see a GUI screen as a function that can be called in our app, where the widgets for user input on the screen are regarded as parameters of that function.

We will take a banking app as an example to illustrate the concept. When the user issues a request in natural language, the LLM responds by telling the navigation component in our app which screen to open and which values to set. This is illustrated in the following figure:

Some interaction examples are given in the following images:

The following image shows a derived conclusion by the LLM, where it concludes that the best available way to help the user is by showing the banking offices near you:

The following example shows that even very shortened expressions may lead to the desired result for the user:

So free typing can also be a very fast interaction mode. The correct interpretation of such shorthand depends on how unambiguous the intention behind it is. In this case, the app has no screen other than transfers that the request could refer to, so the LLM could make an unambiguous decision.

Another bonus feature is that the interaction has a history, so you can continue to type to correct the previous intent:

So the LLM can combine several messages, one correcting or enhancing the other, to produce the desired function call. This can be very convenient for a trip-planning app where you initially just mention the origin and destination, and in subsequent messages refine it with extra requirements, like the date, the time, only direct connections, only first-class, etc.
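One way to make such follow-up corrections possible is to send the earlier turns along with every new request. The following is a minimal sketch of that idea, not the sample app's actual code: the class names `ChatTurn` and `ConversationHistory` and the `maxTurns` limit are assumptions made for illustration.

```dart
/// One turn in the conversation. The role names follow the common chat
/// convention ('user' / 'assistant'); this class is hypothetical.
class ChatTurn {
  const ChatTurn(this.role, this.content);

  final String role;
  final String content;

  Map<String, String> toJson() => {'role': role, 'content': content};
}

/// Keeps the most recent turns so a new request can refine an earlier one.
class ConversationHistory {
  ConversationHistory({this.maxTurns = 10});

  /// Assumed cap that keeps the prompt from growing without bound.
  final int maxTurns;
  final List<ChatTurn> _turns = [];

  void addUser(String text) => _add(ChatTurn('user', text));
  void addAssistant(String text) => _add(ChatTurn('assistant', text));

  void _add(ChatTurn turn) {
    _turns.add(turn);
    // Drop the oldest turns once the cap is exceeded.
    if (_turns.length > maxTurns) {
      _turns.removeRange(0, _turns.length - maxTurns);
    }
  }

  /// Messages to send with the next request, oldest first.
  List<Map<String, String>> toMessages() =>
      [for (final turn in _turns) turn.toJson()];
}
```

With this in place, a trip-planning request followed by "only direct connections" would reach the LLM as two user messages, so it can merge them into one function call.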

You can click here to try the sample app for yourself. Speech input works in a Chrome browser and natively on Android and iOS. The speech recognition provided by the platform is used, so there’s room for improvement if the quality is not sufficient for your purpose.

How it works

When the user asks a question in the Natural Language Bar, a JSON schema is added to the prompt to the LLM, which defines the structure and purpose of all screens and their input elements. The LLM attempts to map the user’s natural language expression onto one of these screen definitions and returns a JSON object, so your code can make a ‘function call’ to activate the applicable screen.

The correspondence between functions and screens is illustrated in the following figure:

A full function specification is available for your inspection here.
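To give a feel for what a single screen definition might look like once it is turned into a function definition, here is a minimal sketch in plain Dart. The classes `ScreenDefinition` and `ScreenParameter` and the helper `encodeFunctions` are hypothetical names used only for this illustration; they are not the sample app's classes. The output follows the general shape of an OpenAI function definition (a name, a description, and a JSON Schema object for the parameters), and the example data is taken from the credit card screen of the sample app.

```dart
import 'dart:convert';

/// Hypothetical description of one input widget on a screen.
class ScreenParameter {
  const ScreenParameter({
    required this.name,
    required this.description,
    this.type = 'string',
    this.enumeration,
  });

  final String name;
  final String description;
  final String type;
  final List<String>? enumeration;

  Map<String, dynamic> toJsonSchema() => {
        'type': type,
        'description': description,
        if (enumeration != null) 'enum': enumeration,
      };
}

/// Hypothetical description of one screen, seen as a callable function.
class ScreenDefinition {
  const ScreenDefinition({
    required this.name,
    required this.description,
    this.parameters = const [],
  });

  final String name;
  final String description;
  final List<ScreenParameter> parameters;

  /// Function definition: name, description, and a JSON Schema for the inputs.
  Map<String, dynamic> toFunctionSchema() => {
        'name': name,
        'description': description,
        'parameters': {
          'type': 'object',
          'properties': {
            for (final p in parameters) p.name: p.toJsonSchema(),
          },
        },
      };
}

/// Encodes all screens as the JSON that accompanies the user's request.
String encodeFunctions(List<ScreenDefinition> screens) =>
    jsonEncode([for (final s in screens) s.toFunctionSchema()]);

const creditCardScreen = ScreenDefinition(
  name: 'creditcard',
  description: 'Show your credit card and maybe perform an action on it',
  parameters: [
    ScreenParameter(
      name: 'limit',
      description: 'New limit for the card',
      type: 'integer',
    ),
    ScreenParameter(
      name: 'action',
      description: 'Action to perform on the card',
      enumeration: ['replace', 'cancel'],
    ),
  ],
);
```

Calling `encodeFunctions([creditCardScreen])` yields a JSON array with one function definition whose `properties` are `limit` and `action`.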

The Flutter implementation of the Natural Language Bar is based on LangChain Dart, the Dart version of the LangChain ecosystem. All prompt engineering happens client side. It turns out to make more sense to keep screens, navigation logic, and function templates together. In fact, the function templates are knit into the navigation structure since there is a one-to-one relationship. The following shows the code for activating and navigating to the credit card screen:

DocumentedGoRoute(
    name: 'creditcard',
    description: 'Show your credit card and maybe perform an action on it',
    parameters: [
      UIParameter(
        name: 'limit',
        description: 'New limit for the card',
        type: 'integer',
      ),
      UIParameter(
        name: 'action',
        description: 'Action to perform on the card',
        enumeration: ['replace', 'cancel'],
      ),
    ],
    pageBuilder: (context, state) {
      return MaterialPage(
          fullscreenDialog: true,
          child: LangBarWrapper(
              body: CreditCardScreen(
                  label: 'Credit Card',
                  action: ActionOnCard.fromString(
                      state.uri.queryParameters['action']),
                  limit: int.tryParse(
                      state.uri.queryParameters['limit'] ?? ''))));
    }),

At the top, we see that this is a route: a destination in the routing system of the app, that can be activated through a hyperlink. The description is the portion the LLM will use to match the screen to the user’s intent. The parameters below it (credit card limit and action to take) define the fields of the screen in natural language, so the LLM can extract them from the user’s question. Then the pageBuilder-item defines how the screen should be activated, using the query parameters of the deep link. You can recognize these in https://langbar-1d3b9.web.app/home: type: ‘credit card limit to 10000’ in the NLB, and the address bar of the browser will read: https://langbar-1d3b9.web.app/creditcard?limit=10000.
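The step from the LLM's answer to such a deep link can be sketched in a few lines of plain Dart. This is an illustration under assumptions, not the sample app's implementation: the helper name `deepLinkFor` is hypothetical, the function name is assumed to equal the route name, and the arguments are assumed to arrive as a JSON object string.

```dart
import 'dart:convert';

/// Turns a function call returned by the LLM into a deep-link location.
///
/// [functionName] is assumed to match the route name, and [argumentsJson]
/// is assumed to be a JSON object such as '{"limit": 10000}'.
String deepLinkFor(String functionName, String argumentsJson) {
  final decoded = jsonDecode(argumentsJson) as Map<String, dynamic>;

  // Query parameters are strings, so every non-null value is stringified.
  final query = {
    for (final entry in decoded.entries)
      if (entry.value != null) entry.key: entry.value.toString(),
  };

  return Uri(
    path: '/$functionName',
    // Passing null avoids a dangling '?' when there are no arguments.
    queryParameters: query.isEmpty ? null : query,
  ).toString();
}
```

For example, `deepLinkFor('creditcard', '{"limit": 10000}')` returns `/creditcard?limit=10000`, which the router can then open like any other link.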

A LangChain agent was used, which makes this approach independent of GPT, so it can also be applied using other LLMs like Llama, Gemini, Falcon, etc. Moreover, it makes it easy to add LLM-based assistance.

History Panel

The Natural Language Bar offers a collapsible interaction history panel, so the user can easily repeat previous statements. This way the interaction history is preserved, similarly to chat interfaces, but in a compacted, collapsible form, saving screen real estate and preventing clutter. Previous language statements by the user are shown using the language the user has used. System responses are incorporated as a hyperlink on that user statement, so they can be clicked on to reactivate the corresponding screen again:

When the LLM cannot fully determine the screen to activate, system responses are shown explicitly, in which case the history panel expands automatically. This can happen when the user has provided too little information, when the user’s request is outside of the scope of the app, or when an error occurs:
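The decision between navigating silently and showing an explicit system response could be sketched as follows. All names here (`LlmReply`, `ReplyAction`, `Outcome`, `handleReply`) and the fallback message are hypothetical and only serve to illustrate the behavior described above; the sample app may structure this differently.

```dart
/// Hypothetical, simplified view of what came back from the LLM:
/// either a function (screen) name, or a plain text answer.
class LlmReply {
  const LlmReply({this.functionName, this.text});

  final String? functionName;
  final String? text;
}

enum ReplyAction { navigate, showMessage }

/// What the UI should do next.
class Outcome {
  const Outcome(this.action, this.payload, {required this.expandHistory});

  final ReplyAction action;

  /// The screen name when navigating, otherwise the message to display.
  final String payload;

  /// Whether the history panel should open to show a system response.
  final bool expandHistory;
}

Outcome handleReply(LlmReply reply, Set<String> knownScreens) {
  final name = reply.functionName;
  if (name != null && knownScreens.contains(name)) {
    // A screen was identified: navigate and keep the panel collapsed.
    return Outcome(ReplyAction.navigate, name, expandHistory: false);
  }
  // Too little information, out of scope, or an error: show a message.
  final message =
      reply.text ?? 'Sorry, I could not find a screen for that request.';
  return Outcome(ReplyAction.showMessage, message, expandHistory: true);
}
```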

Future

The history panel is a nice place to offer customer support and context-sensitive help in chatbot form. At the time of writing, there is a lively discussion and evolution of RAG (Retrieval Augmented Generation) techniques that let chatbots answer user questions based on a large body of text content provided by your own organization. Besides that, the Natural Language Bar is a good starting point to imagine what more power and ease you can give to applications using natural language. Please leave your ideas in the comments. I’m really curious.

Customer Support

Besides your app, your organization also has a website with lots of information for your users. Maybe this website already has a chatbot. Maybe even your app already has a chatbot. The history panel of interactions is a good place to also have such customer-support conversations.

Context-sensitive Help

In the context described above, we maintain a history of linguistic interaction with our app. In the future, we may (invisibly) add a trace of direct user interaction with the GUI to this history sequence. Context-sensitive help could then be given by combining the history trace of user interaction with RAG on the help documentation of the app. User questions will then be answered more in the context of the current state of the app.

Beyond static assistance for Mobile Apps

The current proposal is an MVP. It offers a static template for the interpretation of a user’s linguistic requests in the context of an app. This technique opens a broad spectrum of future improvements:

When users pose a question while they are on a specific screen, we may be able to dynamically add more specific interpretation templates (functions) to the prompt that depend on the state of that screen, for questions like ‘Why is the submit button greyed out/disabled?’.

Function calling using a Natural Language Bar can be used as an assistant for creative applications, e.g. to execute procedures on selections like ‘make the same size’, or ‘turn into a reusable component’. Microsoft 365 Copilot is already using similar features. The approach taken in this article can also enable your organization to take advantage of such functions.
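The idea of screen-dependent templates could be sketched as follows: the static, app-wide function definitions are always sent, and extra definitions are appended only for the screen the user is currently on. The function `functionsForPrompt`, the `FunctionSchema` alias, and the map keyed by screen name are assumptions made for this illustration.

```dart
/// A function definition in JSON form (name, description, parameters).
typedef FunctionSchema = Map<String, dynamic>;

/// Combines the static function definitions with those that only make
/// sense on the screen the user is currently looking at.
List<FunctionSchema> functionsForPrompt({
  required List<FunctionSchema> staticFunctions,
  required Map<String, List<FunctionSchema>> screenSpecific,
  required String currentScreen,
}) =>
    [
      ...staticFunctions,
      // Null-aware spread: adds nothing when the screen has no extras.
      ...?screenSpecific[currentScreen],
    ];
```

Because the extra definitions are only included for the active screen, the prompt stays small even when many screens define their own helpers.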

Natural language interaction with every aspect of your system will rapidly become a major component of every UI. When using function calling, you have to include your system’s abilities in the prompt, but soon more economical and powerful methods will hit the market. For instance, OpenAI has recently opened up model fine-tuning with function calling, allowing you to create an LLM version with the abilities of your system baked in. Even when those abilities are very extensive, the load on the prompt remains limited.

Conclusion

LLMs can be used as a wonderful glue for interacting with GUI-based apps in natural language through ‘function calling’. A Natural Language Bar was introduced that enables users to type or speak their intentions and the system will respond by navigating to the right screen and prefilling the right values. The sample app allows you to actually feel what that is like and the available source code makes it possible to quickly apply this to your own app if you use Flutter. The Natural Language Bar is not for Flutter or mobile apps only but can be applied to any application with a GUI. Its greatest strength is that it opens up the entirety of the functionality of the app for the user from a single access point, without the user having to know how to do things, where to find them, or even having to know the jargon of the app. From an app development perspective, you can offer all this to the user by simply documenting the purpose of your screens and the input widgets on them.

Follow me on LinkedIn

Special thanks to David Miguel Lozano for helping me with LangChain Dart

Some interesting articles: multimodal dialog, google blog on GUIs and LLMs, interpreting GUI interaction as language, LLM Powered Assistants, Language and GUI, Chatbot and GUI

All images in this article, unless otherwise noted, are by the author

Synergy of LLM and GUI, Beyond the Chatbot was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
