
In the rapidly evolving landscape of AI, understanding user intent is crucial for creating truly helpful digital assistants. Whether you’re navigating a mobile app or browsing the web, these assistants need to anticipate your needs to provide seamless experiences. Our latest paper, “Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition,” presented at EMNLP 2025, introduces an approach that achieves this with small, on-device-friendly models. Let’s dive into the details and explore how this method works.
Understanding the Challenge
As AI technologies advance, agents need to become better at anticipating user needs. For experiences on mobile devices to be truly helpful, the underlying models need to understand what the user is doing (or trying to do) as they interact. Once current and previous tasks are understood, the model has more context with which to predict potential next actions. For example, if a user previously searched for music festivals across Europe and is now looking for a flight to London, the agent could offer to find festivals in London on those dates. Large multimodal LLMs are already quite good at understanding user intent from a user interface (UI) trajectory. But using such LLMs for this task typically requires sending information to a server, which can be slow and costly, and risks exposing sensitive information. Our recent paper addresses the question of how to use small multimodal LLMs (MLLMs) to understand sequences of user interactions on the web and on mobile devices, entirely on device.
The Decomposed Workflow
Our approach introduces a decomposed workflow for user intent understanding from user interactions. At inference time, the model performs two main steps.
Individual Screen Summaries
In the first step, each interaction (a single screen and the UI element the user acted on) is summarized independently. Given a sliding window of three screens (previous, current, next), the model is asked the following questions:
– What is the relevant screen context?
– Give a short list of salient details on the current screen.
– What did the user just do? Provide a list of actions that the user took in this interaction.
– Speculate. What is the user trying to accomplish with this interaction?
For each (screenshot, action) pair, we look at the surrounding screens and ask about the screen context, the user’s action, and a speculation about what the user is trying to do. The resulting LLM-generated summary, which answers these questions, serves as the input to the second stage of the decomposed workflow.
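The stage-one loop can be sketched as follows. The `Step` fields, prompt layout, and `mllm` callable are illustrative assumptions for this sketch, not the paper’s actual implementation; any multimodal model call could be dropped in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    screenshot: str  # placeholder for the screen image/representation
    action: str      # e.g. 'tapped "Search flights"'

# The four questions asked about each interaction.
QUESTIONS = [
    "What is the relevant screen context?",
    "Give a short list of salient details on the current screen.",
    "What did the user just do? List the actions taken in this interaction.",
    "Speculate: what is the user trying to accomplish with this interaction?",
]

def build_prompt(prev: Optional[Step], cur: Step, nxt: Optional[Step]) -> str:
    """Assemble a stage-one prompt from a sliding window of three screens."""
    parts = []
    if prev is not None:
        parts.append(f"[previous screen] {prev.screenshot}")
    parts.append(f"[current screen] {cur.screenshot}")
    parts.append(f"[user action] {cur.action}")
    if nxt is not None:
        parts.append(f"[next screen] {nxt.screenshot}")
    parts.extend(QUESTIONS)
    return "\n".join(parts)

def summarize_trajectory(steps: List[Step], mllm: Callable[[str], str]) -> List[str]:
    """Produce one independent summary per (screenshot, action) pair."""
    summaries = []
    for i, cur in enumerate(steps):
        prev = steps[i - 1] if i > 0 else None
        nxt = steps[i + 1] if i + 1 < len(steps) else None
        summaries.append(mllm(build_prompt(prev, cur, nxt)))
    return summaries
```

Because each summary depends only on its local window, the summaries can be generated independently as the trajectory unfolds.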
Intent Extraction from Summaries
In this stage, a fine-tuned small model extracts a single-sentence intent statement from the screen summaries. We find the following techniques helpful.
– Fine-tuning: Giving examples of what a “good” intent statement looks like helps the model focus on the important parts of the summaries and drop the non-useful ones. We use publicly available automation datasets for training data, since they have good examples that pair intent with sequences of actions.
– Label preparation: Because the summaries may be missing information, training with the full intents would inadvertently teach the model to fill in details that aren’t present (i.e., to hallucinate). To avoid this, we first remove from the training intents any information that doesn’t appear in the summaries (using a separate LLM call).
– Dropping speculations: Giving the model a specified place to output its speculations on what the user is trying to do helps create a more complete step summary in stage one, but can confuse the intent extractor in stage two. So we do not use the speculations during the second stage. While this may seem counterintuitive — asking for speculations in the first stage only to drop them in the second — we find this helps improve performance.
In short, the second stage uses a fine-tuned model that takes the stage-one summaries, with all speculations removed, as input and outputs a concise intent statement; during training, the labels are cleaned so that they don’t encourage hallucination.
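The two preparation steps, dropping speculations and cleaning labels, might look like the following sketch. The `Speculation:` field name and the `clean_label_fn` signature are assumptions for illustration; in the paper the label cleaning is done by a separate LLM call.

```python
from typing import Callable, Dict, List

def drop_speculation(summary: str) -> str:
    """Remove the speculation section from a stage-one summary
    (assumes the summary marks it with a line starting 'Speculation')."""
    kept = [line for line in summary.splitlines()
            if not line.lower().startswith("speculation")]
    return "\n".join(kept)

def build_training_example(summaries: List[str],
                           raw_intent: str,
                           clean_label_fn: Callable[[str, str], str]) -> Dict[str, str]:
    """Pair speculation-free summaries with a cleaned intent label.

    `clean_label_fn(raw_intent, context)` stands in for the separate LLM call
    that strips details from the intent that the summaries don't support.
    """
    context = "\n\n".join(drop_speculation(s) for s in summaries)
    label = clean_label_fn(raw_intent, context)
    return {"input": context, "target": label}
```

At inference time only `drop_speculation` is needed; `build_training_example` is used when preparing the fine-tuning data.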
Evaluation Approach
We use the Bi-Fact approach to evaluate the quality of a predicted intent against a reference intent. Bi-Fact decomposes both intents into sets of binary facts, statements that are either true or false, such as “the user is searching for a flight.” Precision is the fraction of predicted facts that are supported by the reference intent; recall is the fraction of reference facts that are covered by the predicted intent; the F1 score is the harmonic mean of the two. We find that our approach yields results comparable to much larger models, illustrating its potential for on-device applications.
Real-World Applications
The decomposed workflow has several real-world applications. It can improve the user experience on mobile devices by anticipating user needs and surfacing relevant suggestions, enhance accessibility features by providing assistance appropriate to what the user is trying to do, and help digital assistants respond more accurately by grounding their responses in the user’s intent.
Conclusion
In conclusion, our novel approach to intent extraction using small models shows promising results. By decomposing the task into two stages, we make it more tractable for small models and achieve results comparable to much larger models. This work builds on previous work from our team on user intent understanding and illustrates the potential for on-device applications. As AI technologies continue to advance, we believe that this approach will play a crucial role in creating truly helpful digital assistants.
FAQ
Q: Why use small models for intent extraction?
A: Small models are more efficient and can run on-device, reducing latency and the risk of exposing sensitive information. They also require fewer computational resources, making them more accessible.
Q: How does the decomposed workflow improve performance?
A: The decomposed workflow improves performance by breaking down the complex task of intent extraction into two simpler stages. This makes it easier for small models to understand user interactions and extract intent.
Q: What are the real-world applications of this approach?
A: The decomposed workflow can be used to improve the user experience on mobile devices, enhance accessibility features, and improve the performance of digital assistants by understanding user intent and providing more accurate responses.
Q: How does the evaluation approach work?
A: We use the Bi-Fact approach to evaluate the quality of a predicted intent against a reference intent. This approach compares the predicted intent to the reference intent using a set of binary facts and calculates the precision, recall, and F1 score.
Q: What are the limitations of this approach?
A: While the decomposed workflow shows promising results, it may still struggle with complex user interactions and ambiguous intents. Additionally, the quality of the summaries generated in the first stage can impact the performance of the intent extractor in the second stage.