Microsoft researchers propose framework for building data-backed LLM applications

Enriching large language models (LLMs) with information beyond training data is of significant interest, especially for enterprise applications.

The best-known way to incorporate domain- and customer-specific knowledge into LLMs is retrieval-augmented generation (RAG). However, simple RAG techniques are not sufficient in most cases.

Creating effective LLM applications enriched with data requires careful consideration of several factors. In a new paper, researchers at Microsoft propose a framework for categorizing different types of RAG tasks based on the type of external data they require and the complexity of the reasoning they involve.

“Data-enriched LLM applications are not a one-size-fits-all solution,” the researchers write. “The demands of the real world, especially those in specialized fields, are extremely complex and can vary significantly in their relationship to given data and the reasoning challenges they entail.”

To manage this complexity, the researchers propose a four-level categorization of user queries based on the type of external data required and the cognitive processing involved in generating accurate and relevant answers:

– Explicit facts: Queries that require retrieving explicitly stated facts from the data.

– Implicit facts: Queries that require inferring information not explicitly stated in the data, often involving basic reasoning or common sense.

– Interpretable rationales: Queries that require understanding and applying domain-specific rationales or rules explicitly provided in external sources.

– Hidden rationales: Queries that require uncovering and leveraging implicit domain-specific reasoning methods or strategies that are not explicitly stated in the data.

Each level of query presents unique challenges and requires tailored solutions to address them effectively.

Categories of data-enriched LLM applications

Explicit fact queries

Explicit fact queries are the simplest type, focusing on retrieving factual information directly specified in the provided data. “The defining feature of this level is clear and direct dependence on specific pieces of external data,” the researchers write.

The most common approach to addressing these queries is basic RAG, where the LLM retrieves relevant information from a knowledge base and uses it to generate an answer.
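
As a rough illustration, a minimal version of that loop might look like the sketch below. The `embed` and `llm_complete` helpers are placeholders (assumptions, not part of the paper) for whatever embedding model and LLM client an application actually uses.

```python
import numpy as np

# Placeholder helpers (assumptions): embed() maps text to a vector with an
# embedding model of your choice; llm_complete() sends a prompt to any LLM API.
def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in an embedding model")

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

def basic_rag(query: str, documents: list[str], top_k: int = 3) -> str:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)

    def score(doc: str) -> float:
        d = embed(doc)
        return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

    context = "\n\n".join(sorted(documents, key=score, reverse=True)[:top_k])
    # Ask the model to answer strictly from the retrieved context.
    return llm_complete(
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```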

However, even with explicit fact queries, RAG pipelines face challenges at every stage. For example, at the indexing stage, where the RAG system creates a store of data chunks that can later be retrieved as context, it may have to deal with large and unstructured datasets, possibly containing multimodal elements such as images and tables. This can be addressed with multimodal document parsing and multimodal embedding models that map the semantic content of both textual and non-textual elements into a shared embedding space.
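
At the indexing stage specifically, the usual first step is to split long documents into overlapping chunks before embedding them. A minimal sketch of that step (chunk sizes are illustrative, not recommendations from the paper):

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split raw text into overlapping character windows so retrieval can return
    focused passages instead of whole documents."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Each chunk would then be embedded (with a text-only or multimodal model) and
# stored in a vector index alongside its source document and position.
```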

During the retrieval phase, the system must ensure that the retrieved data is relevant to the user’s query. Here, developers can use techniques that improve the alignment of queries with document stores. For example, an LLM can generate a synthetic answer to the user’s query. The answer may not be correct, but its embedding can be used to retrieve documents that contain relevant information.
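
A rough sketch of that synthetic-answer trick, reusing the placeholder `embed` and `llm_complete` helpers from the earlier sketch; the `index.search` method is likewise an assumed interface to the vector store:

```python
def retrieve_with_synthetic_answer(query: str, index, top_k: int = 3) -> list[str]:
    # Generate a plausible (possibly wrong) answer; it tends to sit closer in
    # embedding space to the documents that actually contain the real answer.
    synthetic = llm_complete(f"Write a short, plausible answer to: {query}")
    # `index` is assumed to expose a search(vector, top_k) method over chunk embeddings.
    return index.search(embed(synthetic), top_k)
```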

In the answer generation phase, the model must determine whether the retrieved information is sufficient to answer the question and strike the right balance between the given context and its own internal knowledge. Specialized fine-tuning techniques can help the LLM learn to ignore irrelevant information retrieved from the knowledge base. Jointly training the retriever and the response generator can also lead to more consistent performance.
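
Fine-tuning is one route; a much lighter, prompt-level version of the same idea, checking whether the retrieved context is sufficient before answering, might look like this sketch (again using the placeholder `llm_complete`, and not a method from the paper):

```python
def answer_with_sufficiency_check(query: str, context: str) -> str:
    verdict = llm_complete(
        "Does the context below contain enough information to answer the question? "
        f"Reply YES or NO.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    if verdict.strip().upper().startswith("NO"):
        # Fall back to the model's internal knowledge, but flag the gap to the caller.
        return "[no supporting context found] " + llm_complete(query)
    return llm_complete(
        f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    )
```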

Implicit fact queries

Implicit fact queries require the LLM to go beyond simply retrieving explicitly stated information and perform some level of reasoning or inference to answer the question. “Queries at this level require collecting and processing information from multiple documents in the collection,” the researchers write.

For example, a user might ask, “How many products did company X sell in the last quarter?” or “What are the key differences between the strategies of company X and company Y?” Answering these queries requires combining information from multiple sources in the knowledge base. This is sometimes called “multi-hop question answering.”

Implicit fact queries present additional challenges, including the need to coordinate multiple context retrievals and effectively integrate reasoning and retrieval capabilities.

These queries require advanced RAG techniques. For example, methods such as Interleaving Retrieval with Chain-of-Thought (IRCoT) and Retrieval Augmented Thoughts (RAT) use chain-of-thought reasoning to guide the retrieval process based on information recalled in previous steps.
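
A loose sketch of that interleaving pattern, not the exact IRCoT or RAT algorithms, where each new reasoning step becomes the query for the next retrieval hop (`retrieve` and `llm_complete` are assumed placeholders):

```python
def interleaved_retrieve_and_reason(query: str, retrieve, max_steps: int = 4) -> str:
    # `retrieve(text) -> list[str]` is an assumed hook into the document index.
    thoughts: list[str] = []
    passages: list[str] = []
    for _ in range(max_steps):
        retrieved_text = "\n".join(passages)
        reasoning_text = "\n".join(thoughts)
        step = llm_complete(
            f"Question: {query}\n"
            f"Retrieved passages:\n{retrieved_text}\n"
            f"Reasoning so far:\n{reasoning_text}\n"
            "Write the next reasoning step, or 'ANSWER: <final answer>' once you can answer."
        )
        if step.strip().startswith("ANSWER:"):
            return step.strip()[len("ANSWER:"):].strip()
        thoughts.append(step)
        passages.extend(retrieve(step))  # the new thought drives the next retrieval hop
    all_passages = "\n".join(passages)
    return llm_complete(f"Context:\n{all_passages}\n\nQuestion: {query}")
```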

Another promising approach is combining knowledge graphs with LLMs. Knowledge graphs represent information in a structured format, making it easier to perform complex reasoning and connect disparate concepts. Graph RAG systems can turn the user’s query into chains that contain information from different nodes in a graph database.
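
A toy sketch of the idea, with the knowledge graph as a plain Python dictionary rather than a real graph database:

```python
# Toy knowledge graph: each entity maps to a list of (relation, target) edges.
graph = {
    "Company X": [("acquired", "Startup A"), ("competes_with", "Company Y")],
    "Startup A": [("founded_in", "2019"), ("builds", "inventory software")],
    "Company Y": [("partnered_with", "Supplier B")],
}

def neighborhood_facts(entity: str, hops: int = 2) -> list[str]:
    """Walk a few hops out from an entity and return the edges as text triples,
    which can then be handed to the LLM as structured context."""
    facts, frontier, seen = [], [entity], {entity}
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for relation, target in graph.get(node, []):
                facts.append(f"{node} --{relation}--> {target}")
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return facts

print(neighborhood_facts("Company X"))
```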

Interpretable rationale queries

Interpretable rationale queries require LLMs not only to understand the factual content but also to apply domain-specific rules. These rationales may not be present in the LLM’s pretraining data, but they are not hard to find in the knowledge corpus.

“Interpretable rationale queries represent a relatively simple category of applications that rely on external data to provide rationales,” the researchers write. “The supplementary data for such queries often includes clear explanations of the thought processes used to solve problems.”

For example, a customer service chatbot may need to integrate documented guidelines for handling returns or refunds with the context provided by the customer’s complaint.

One of the key challenges in addressing these queries is effectively integrating the provided rationales into the LLM and ensuring that it can follow them accurately. Prompt tuning techniques, such as those that use reinforcement learning and reward models, can improve the LLM’s ability to adhere to specific rationales.

LLMs can also be used to optimize their own prompts. For example, DeepMind’s OPRO technique uses multiple models to evaluate and optimize one another’s prompts.
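
The core loop behind optimization-by-prompting can be sketched roughly as follows. This is a simplification, not DeepMind’s implementation; `llm_complete` is the placeholder LLM call from the earlier sketches, and the labeled examples are assumed to be a small set of (question, answer) pairs.

```python
def score_prompt(prompt: str, examples: list[tuple[str, str]]) -> float:
    # Fraction of labeled (question, answer) pairs this prompt answers correctly.
    correct = sum(
        llm_complete(f"{prompt}\n\nQ: {q}\nA:").strip() == a for q, a in examples
    )
    return correct / len(examples)

def optimize_prompt(examples: list[tuple[str, str]], rounds: int = 5) -> str:
    seed = "Answer the question."
    history = [(seed, score_prompt(seed, examples))]
    for _ in range(rounds):
        # Show the optimizer model previous prompts and scores, ask for a better one.
        trajectory = "\n".join(f"{s:.2f}: {p}" for p, s in sorted(history, key=lambda x: x[1]))
        candidate = llm_complete(
            f"Below are prompts with their accuracy scores:\n{trajectory}\n"
            "Write a new prompt that is likely to score higher."
        )
        history.append((candidate, score_prompt(candidate, examples)))
    return max(history, key=lambda x: x[1])[0]
```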

Developers can also use LLMs’ chain-of-thought reasoning abilities to handle complex rationales. However, manually designing chain-of-thought prompts for interpretable rationales can be time-consuming. Techniques like Automate-CoT can help automate this process by using the LLM itself to generate chain-of-thought examples from a small labeled dataset.
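
A rough generate-and-filter sketch of that idea, keeping only reasoning chains that end at the known answer; the actual selection step in Automate-CoT is more sophisticated:

```python
def build_cot_demonstrations(labeled_pairs: list[tuple[str, str]]) -> list[str]:
    demos = []
    for question, answer in labeled_pairs:
        chain = llm_complete(
            f"Q: {question}\nThink step by step, then finish with 'Answer: <answer>'."
        )
        # Keep only reasoning chains that arrive at the known correct answer.
        if chain.strip().endswith(f"Answer: {answer}"):
            demos.append(f"Q: {question}\n{chain.strip()}")
    return demos
```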

Hidden rationale queries

Hidden rationale queries present the most significant challenge. These queries involve domain-specific reasoning methods that are not explicitly stated in the data. The LLM must uncover these hidden rationales and apply them to answer the question.

For example, the model might have access to historical data that implicitly contains the knowledge needed to solve a problem. The model needs to analyze this data, extract relevant patterns, and apply them to the current situation. This may involve adapting existing solutions to a new coding problem or using documents from previous legal cases to draw inferences about a new case.

“Navigating hidden rationale queries…requires sophisticated analytical techniques to decode and leverage the hidden wisdom embedded in disparate data sources,” the researchers write.

Challenges of hidden rationale queries include retrieving information that is logically or thematically related to the query even when it is not semantically similar. Additionally, the information needed to answer the question often has to be combined from multiple sources.

Some methods use the in-context learning capabilities of LLMs, teaching them how to select and extract relevant information from multiple sources and how to construct rationales. Other approaches focus on generating examples of rationales for multi-hop and multi-step questions.
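
In practice, the in-context route often amounts to packing a few past worked cases into the prompt so the model can infer the unstated reasoning pattern. A minimal sketch, again with the placeholder `llm_complete`:

```python
def solve_from_past_cases(new_case: str, past_cases: list[tuple[str, str]], k: int = 3) -> str:
    # Each past case is a (situation, resolution) pair; the rationale that connects
    # them is never stated, so the model must infer it from the examples.
    demos = "\n\n".join(f"Case: {c}\nResolution: {r}" for c, r in past_cases[:k])
    return llm_complete(f"{demos}\n\nCase: {new_case}\nResolution:")
```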

However, effectively handling hidden rationale queries often requires some form of fine-tuning, especially in complex domains. This fine-tuning is usually domain-specific and involves training the LLM on examples that enable it to reason about the query and determine what type of external information it needs.

Implications for building LLM applications

The survey and framework compiled by the Microsoft Research team show how far LLMs have come in using external data for practical applications. But they are also a reminder that many challenges have yet to be solved. Businesses can use this framework to make more informed decisions about the best techniques for integrating external knowledge into their LLM applications.

RAG techniques can go a long way toward overcoming many of the shortcomings of vanilla LLMs. However, developers must also be aware of the limitations of the techniques they use and know when to upgrade to more complex systems, or when to avoid using LLMs altogether.