How companies can apply the power of LLMs to automate workflows and gain cost efficiencies
Photo by Gerard Siderius on Unsplash
Introduction
In a recent collaboration with manufacturing executives at a biopharma company, we explored how generative AI, specifically large language models (LLMs), could be used to expedite quality investigations. Quality investigations are triggered when deviations are identified in the manufacturing or testing of products. Because of potential risks to patient health and regulatory requirements, the affected batch is put on hold and, depending on the impact, production itself might be suspended. It is therefore paramount to complete the root cause analysis and implement a corrective and preventive action plan as quickly as possible. Our objective was to accelerate this process using GenAI wherever possible.
As we started thinking about the minimum viable product (MVP), we faced several options for how GenAI could automate different stages of the process to improve cycle time and unfreeze batches as quickly as possible. The executives were experts in their field and had undergone GenAI training. However, it was necessary to delve deeper into LLM capabilities and various GenAI solution patterns to determine which stage of the quality investigation process to prioritize for the MVP, balancing short-term solution feasibility against the expected improvement in cycle time.
While our discussion focused on a specific process, the same solution pattern is being leveraged across industries and functions to extract cost efficiencies and accelerate outcomes. So, how can GenAI solutions help with such a process?
The Unique Capabilities of LLMs
Before the recent surge in GenAI’s popularity, automation solutions in the corporate world primarily targeted routine, rule-based tasks or relied on Robotic Process Automation (RPA). Machine learning applications primarily revolved around analytics, such as using regression models to predict outcomes like sales volumes. However, the latest LLMs stand out due to their remarkable properties:
Content Understanding: LLMs are able to “understand” the meaning of text.
On-the-fly Training: LLMs are able to perform new tasks that were not a part of their original training (aka zero-shot learning), guided by natural language instructions and optionally a few examples (few-shot learning).
Reasoning: LLMs are able to “think” and “reason out” potential actions to a degree (though with some limitations and risks).
In “conventional” machine learning, the process to build and use a model generally involved gathering data, defining a ‘target’ manually, and training the model to predict the ‘target’ given other properties. As a result, a model could perform one specific task or answer one specific type of question. In contrast, you can simply ask a pre-trained LLM to evaluate a customer review on specific aspects that are important to your business, even if the LLM has never seen those aspects before and the review does not mention them explicitly.
The Mechanics of LLM-based Solutions
Many LLM solutions in the industry center around designing and feeding detailed instructions to have LLMs perform specific tasks (this is known as prompt engineering). A potent way to amplify LLM impact is to give them automated access to a company’s proprietary information. Retrieval Augmented Generation (RAG) has emerged as one of the most common solution patterns to achieve this.
Overview — A 10,000 ft view
A High-Level Overview of GenAI Solution Workflow (source: Image by the author)
In simple terms, the solution has two stages:
Search: Retrieve company data relevant to the user’s request. For example, if the ask is to write a report in a particular format or style, the text of a previous report is fetched and sent to the LLM as an example.
Generate: Compile the instructions and examples (or any other relevant information) retrieved in the previous stage into a text prompt and send it to the LLM to generate the desired output. In the case of the report, the prompt could be:
Please compile the following information into a report, using the format and style of the provided example.
Here is the content: [report content…. ].
Here is the example: [ Previous Report Title
Section 1 …
Section 2 …
Conclusion ]
The RAG Workflow — 1,000 ft view
Let’s dive a little deeper into both the Search and Generate stage of the solution pattern.
A Closer Look at the Retrieval-Augmented Generation (RAG) Solution Workflow (source: Image by the author)
1. Word Embeddings, the basis for language “understanding”: To facilitate natural language understanding, text is run through an algorithm or an LLM to obtain a numerical representation (known as a vector embedding) that captures its meaning and context. The length of the vector is determined by the model used to create it: word2vec models typically use vectors of up to 300 dimensions, whereas the largest GPT-3 model uses vectors of length 12,288. For example:
Vector Embeddings (i.e. Numerical Representations) of Sample Sentences (source: Image by the author)
Here’s how these numerical representations map to a two-dimensional space. It is interesting to see that sentences about similar topics are mapped close to each other.
Sample Sentences Vector Embeddings Plotted in a Two-Dimensional Space (source: Image by the author, inspired by a demonstration in the course Large Language Models With Semantic Search at Deeplearning.ai)
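To make “mapped close to each other” a little more concrete, here is a minimal sketch of cosine similarity, a common way to measure how close two embeddings are. The three-dimensional vectors and the sample sentences are made up for illustration; real embeddings come from an embeddings model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings (illustrative values only).
sterility_1 = [0.9, 0.1, 0.0]   # "The batch failed the sterility test."
sterility_2 = [0.8, 0.2, 0.1]   # "Sterility testing flagged a deviation."
revenue     = [0.0, 0.2, 0.9]   # "Quarterly revenue grew."

similar = cosine_similarity(sterility_1, sterility_2)
unrelated = cosine_similarity(sterility_1, revenue)
print(f"same topic: {similar:.2f}, different topic: {unrelated:.2f}")
```

A score near 1 means the two vectors point in almost the same direction, which is what we expect for sentences about the same topic; a score near 0 means they have little in common.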
2. Creating the Knowledge Corpus: The search component of the solution takes an incoming question and runs a semantic search to find the most similar pieces of information available in the knowledge corpus. So, how is this knowledge corpus created? Files that need to be in the knowledge corpus are processed by an embeddings model, which creates numerical representations. These numerical representations are loaded into a specialized database — generally a vector database purpose-built for storing this type of information efficiently and retrieving it quickly when needed.
3. Searching Similar Information in Knowledge Corpus (i.e. Retrieval): When a user submits a question or a task to the solution, the solution uses an embeddings model to convert the question text into a numerical representation. This question vector is run against the knowledge corpus to find the most similar pieces of information. There can be one or more search results returned, which can be passed on to the next stage to generate a response or output.
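Steps 2 and 3 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: `toy_embed` is a hypothetical stand-in that simply counts words (a real solution would call an embeddings model), `KnowledgeCorpus` is a plain in-memory list rather than a vector database, and the sample texts are invented.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for an embeddings model: word counts as a sparse vector.
    Unlike real embeddings, this captures word overlap, not meaning."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class KnowledgeCorpus:
    """Hypothetical in-memory corpus: stores (text, vector) pairs."""

    def __init__(self):
        self.records = []

    def add(self, text):
        # Step 2: embed each piece of information once, at load time.
        self.records.append((text, toy_embed(text)))

    def search(self, question, top_k=2):
        # Step 3: embed the question and rank stored records by similarity.
        q = toy_embed(question)
        scored = [(cosine(q, vec), text) for text, vec in self.records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

corpus = KnowledgeCorpus()
corpus.add("Root cause of the deviation was a faulty filter seal.")
corpus.add("Corrective action: replace filter seals every six months.")
corpus.add("Cafeteria menu changes every Monday.")

results = corpus.search("What was the root cause of the deviation?", top_k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The important design point is that documents are embedded once when they are loaded, while the question is embedded at request time with the same model, so that both live in the same vector space.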
4. Generating Output Using LLM (i.e. Generation): Now that the solution has found relevant pieces of information that can help the LLM generate a meaningful output, the entire package, known as a “prompt”, can be sent to the LLM. This prompt includes one or more standard sets of instructions that guide the LLM, the actual user question, and finally the pieces of information retrieved in the Search stage. The resulting output from the LLM can be post-processed if necessary (for example, loaded into a specific format in a Word document) before being delivered back to the user.
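As a rough sketch of how such a prompt might be assembled, the function below combines the three ingredients named above. The function name, the “[Source n]” labels, and the sample strings are assumptions made for illustration; the actual call to the LLM is deliberately left out.

```python
def build_prompt(instructions, question, retrieved_chunks):
    """Combine standard instructions, retrieved information, and the user question."""
    context = "\n\n".join(
        f"[Source {i}]\n{chunk}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        f"{instructions}\n\n"
        f"Retrieved information:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    instructions="Answer using only the retrieved information. Say so if it is insufficient.",
    question="What was the root cause of the deviation?",
    retrieved_chunks=[
        "Chunk text returned by the search stage.",
        "Another retrieved chunk.",
    ],
)
print(prompt)
```

Labeling each retrieved chunk makes it easier to ask the LLM to cite which source it used, which in turn helps users verify the output.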
A Deeper Dive into Solution Components
Let’s dig one step deeper into the components of the solution.
Deep Dive into the Components of the RAG Solution Workflow (purple blocks display optional components). (source: Image by the author)
1. Create Knowledge Corpus: Loading relevant documents into a Knowledge Corpus has some nuances and involves several considerations.
Document Loading: Relevant documents in different formats (PDFs, Word files, online sources, etc.) might need to be imported into a data repository. Depending on the use case, only select portions of some documents might be relevant. For example, in a solution designed for financial analysts to query company 10-K reports, sections like the title page, table of contents, boilerplate compliance information, and some exhibits may be irrelevant for financial analysis. Hence, these sections can be omitted from the knowledge bank. It’s crucial to avoid redundant information in the knowledge bank to ensure diverse and high-quality responses from the LLM.
Document Splitting: Once the relevant document sections are identified for inclusion in the knowledge bank, the next step is to determine how to divide this information and load it into the vector database. The choice can vary depending on the use case. An effective approach is to split by paragraphs with some overlap. This entails setting a word (or ‘token,’ the units LLMs use for text processing) limit for retaining a paragraph as a whole. If a paragraph exceeds this limit, it should be divided into multiple records for the vector database. Typically, some word overlap is deliberately maintained between consecutive records to preserve context. For instance, one might use a limit of 1,000 words per record with a 40-word overlap.
Additional Metadata: Enhancing the information in the knowledge bank involves tagging each record with meaningful metadata. Basic metadata examples include the title of the original document from which the information was extracted and the section hierarchy. Additional metadata can further enhance the quality of search and retrieval. For instance, balance sheet data extracted from a 10-K report could be tagged with metadata like:
Original Document Title: Company XYZ 10-K
Year: 2022
Section: Financial Statements & Supplementary Data | Balance Sheet
Storage: There are various options for storing the information. A vector database such as Chroma, or the Faiss similarity-search library on top of Postgres / MySQL, can be used. However, SQL databases, NoSQL databases or document stores, and graph databases can also be used. Additionally, in-memory storage can reduce latency, and horizontal scaling can improve scalability, availability, load balancing, etc.
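The splitting and metadata steps above can be sketched as follows. This is a simplified illustration that counts words rather than tokens; the function names and the dictionary layout of a record are assumptions, and the synthetic 2,500-word paragraph exists only to exercise the code.

```python
def split_with_overlap(text, max_words=1000, overlap=40):
    """Split text into chunks of at most max_words words, where each
    chunk repeats the last `overlap` words of the previous one."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back to create the overlap
    return chunks

def make_records(text, metadata, max_words=1000, overlap=40):
    """Attach the same metadata (plus a chunk index) to every chunk."""
    return [
        {"text": chunk, "chunk_index": i, **metadata}
        for i, chunk in enumerate(split_with_overlap(text, max_words, overlap))
    ]

# Synthetic 2,500-word paragraph, used only to demonstrate the splitting.
long_paragraph = " ".join(f"w{i}" for i in range(2500))
records = make_records(
    long_paragraph,
    {
        "title": "Company XYZ 10-K",
        "year": 2022,
        "section": "Financial Statements & Supplementary Data | Balance Sheet",
    },
)
print(len(records), "records;", records[1]["text"].split()[0], "starts the second one")
```

A paragraph shorter than the limit comes back as a single record, so the same function can be applied to every paragraph regardless of length.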
2. Search Similar Information from Knowledge Corpus (Retrieval): For straightforward use cases, a retrieval approach based on searching for similar vectors within the knowledge bank, as discussed in the previous section, should suffice. A common two-stage approach balances search speed with accuracy:
Dense Retrieval: Initially, a rapid scan of the extensive knowledge corpus is conducted for the search query using fast approximate nearest neighbor search. This yields tens or hundreds of candidate results for further evaluation.
Reranking: Among the candidates fetched, more compute-intensive algorithms can be used to discern between more and less relevant results. Additional relevance scores can be calculated either by taking a second pass over the candidates fetched by the Dense Retrieval stage or by using additional features such as the number of links pointing to each search result (indicating credibility or topical authority), TF-IDF scores, or simply asking an LLM to review all candidates and rank them for relevance.
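Here is a minimal sketch of the two-stage idea. Several simplifications are assumed: the first stage scans every vector exhaustively instead of using an approximate nearest neighbor index, the two-dimensional vectors are invented, and `term_overlap` is a hypothetical, deliberately simple reranker standing in for a heavier model or an LLM.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def term_overlap(query, text):
    """Hypothetical reranker: share of query words that appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def retrieve_then_rerank(query, query_vec, records, candidates=3, final=1,
                         rerank=term_overlap):
    # Stage 1: cheap vector similarity over the whole corpus.
    shortlist = sorted(
        records, key=lambda r: cosine(query_vec, r["vector"]), reverse=True
    )[:candidates]
    # Stage 2: costlier scoring, applied only to the shortlist.
    return sorted(
        shortlist, key=lambda r: rerank(query, r["text"]), reverse=True
    )[:final]

# Illustrative records with made-up 2-dimensional vectors.
records = [
    {"text": "net profit rose in 2020", "vector": [0.90, 0.10]},
    {"text": "profit margins by region", "vector": [0.95, 0.05]},
    {"text": "net profit for company xyz in 2020", "vector": [0.85, 0.15]},
    {"text": "office relocation announcement", "vector": [0.10, 0.90]},
]
query = "net profit for company xyz in 2020"
best = retrieve_then_rerank(query, [1.0, 0.0], records)
print(best[0]["text"])
```

In this toy data the best match is only third in the first stage, and the reranker promotes it to the top. That is the trade-off in miniature: the first stage needs to be fast and merely good enough to keep the right answer in the shortlist.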
For advanced capabilities, such as improving the quality of search results by selecting diverse information or applying filters based on plain-language user prompts, a more sophisticated approach may be necessary. For example, in a financial query where a user asks, “What was the net profit for Company XYZ in 2020?” the solution must restrict the search to documents for Company XYZ and the year 2020. One possible solution involves using an LLM to split the request into two parts: a filtering component, which narrows down the semantic search targets using metadata (here, the company and the year 2020), and a search component. A semantic search is then performed to locate “net profit” within the filtered portion of the knowledge bank.
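The filter-then-search idea can be sketched as below. The `parsed` dictionary is a hand-written assumption of what an LLM might return when asked to split the user’s question; the record layout, the function names, and the word-overlap scorer (a stand-in for real semantic search) are likewise hypothetical.

```python
def matches(record, filters):
    """True if every filter key/value is present in the record's metadata."""
    return all(record["metadata"].get(k) == v for k, v in filters.items())

def word_overlap(query, text):
    """Stand-in for semantic similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def filtered_search(records, filters, search_text, score=word_overlap, top_k=1):
    # First narrow the pool with exact metadata filters, then rank what is left.
    pool = [r for r in records if matches(r, filters)]
    return sorted(pool, key=lambda r: score(search_text, r["text"]), reverse=True)[:top_k]

# Assumed output of an LLM asked to split
# "What was the net profit for Company XYZ in 2020?"
parsed = {
    "filters": {"company": "Company XYZ", "year": 2020},
    "search_text": "net profit",
}

records = [
    {"text": "Net profit is reported in the income statement.",
     "metadata": {"company": "Company XYZ", "year": 2020}},
    {"text": "Net profit is reported in the income statement.",
     "metadata": {"company": "Company XYZ", "year": 2021}},
    {"text": "Employee headcount at year end.",
     "metadata": {"company": "Company XYZ", "year": 2020}},
    {"text": "Net profit is reported in the income statement.",
     "metadata": {"company": "Company ABC", "year": 2020}},
]

hits = filtered_search(records, parsed["filters"], parsed["search_text"])
print(hits[0]["metadata"], "|", hits[0]["text"])
```

Applying the metadata filter first means that the 2021 filing and the other company’s filing can never be returned, no matter how similar their text is to the question.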
3. Generate Output using LLM (Generation): The final step in the process involves generating output using an LLM.
Direct Approach: The straightforward approach is to pass all the information retrieved from the Search stage to the LLM, along with the human prompt and instructions. However, there is a limit on how much information can be passed to an LLM. For instance, the base GPT-4 model offered through Azure OpenAI has a context size of 8,192 tokens, roughly 6,000 English words. Depending on the use case, workarounds for this limitation may be necessary.
Chains: To circumvent the context size limit, one approach is to successively provide pieces of information to the language model and instruct it to build and refine its answer with each iteration. The LangChain framework offers methods like “refine,” “map_reduce,” and “map_rerank” to facilitate the generation of multiple parts of answers using a language model and ultimately combine them using another LLM call.
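To illustrate the iterative “refine” idea without depending on any particular framework, here is a plain-Python sketch. It is not LangChain’s implementation: `llm` is assumed to be any function that takes a prompt string and returns a string, the prompt wording is invented, and `fake_llm` is a stub used only so the example runs without calling a model.

```python
def refine_answer(question, chunks, llm):
    """Feed chunks to the LLM one at a time, asking it to refine its answer."""
    answer = ""
    for i, chunk in enumerate(chunks):
        if i == 0:
            prompt = (
                f"Context:\n{chunk}\n\n"
                f"Question: {question}\n"
                "Answer using only the context."
            )
        else:
            prompt = (
                f"Question: {question}\n"
                f"Existing answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer if the new context helps; "
                "otherwise repeat it unchanged."
            )
        answer = llm(prompt)  # each call sees only one chunk plus the running answer
    return answer

# Stub standing in for a real LLM call, so the sketch is runnable offline.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"draft {len(calls)}"

final = refine_answer(
    "What caused the deviation?",
    ["chunk A", "chunk B", "chunk C"],
    fake_llm,
)
print(final, "after", len(calls), "LLM calls")
```

Each prompt stays small because it contains a single chunk and the running answer. The price is one LLM call per chunk, made sequentially, which increases latency and cost as the number of chunks grows.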
Conclusion
In an era of escalating data generation, GenAI, acting as a context-aware and trainable assistant, can have more impact than ever. The solution pattern outlined in this article addresses the challenge of automating data processing, freeing up people for more complex tasks. With the increasing commoditization of Large Language Models (LLMs) and the standardization of solution components, it’s foreseeable that such solutions will soon become ubiquitous.
Frequently Asked Questions (FAQs)
Will the generated content become a part of the LLM’s memory and affect future output? For example, will bad outputs generated by less-experienced users affect other users’ output quality?
No. In this solution approach, the LLM has no “memory” of what it generates; each request starts with a clean slate. Unless the LLM is fine-tuned (trained further) or the generated output is also added to the knowledge base, future outputs will not be impacted.
Is the LLM learning and becoming better with use?
Not automatically. The RAG solution pattern is not a reinforcement learning system. However, the solution can be designed so that users are able to provide feedback on the quality of output, which can then be used to fine-tune the model. Updates to the knowledge base or using upgraded LLMs can also improve the quality of the solution’s output.
Will the vector embeddings be saved in the source data warehouse?
Generally, no. While the vectors of document chunks can technically be stored in the source data warehouse, the purposes of the source data warehouse and the vector db (or, for that matter, a SQL db explicitly used to store vectors for the solution) are different. Adding vectors to the source database will likely create operational dependencies and overheads, which may be unnecessary and bring no benefit.
How will the solution be updated with new data?
The data loading pipeline (identifying documents to load, processing, chunking, vectorization, loading to vector db) will need to be run on new data as it becomes available. This can be a periodic batch process. The frequency of updates can be tailored to the use case.
How can we ensure that the sensitive information in documents stored in the knowledge base isn’t accessible to the public or to LLM vendors?
Enterprises can use Azure OpenAI service as a single-tenant solution with a private instance of OpenAI’s LLMs. This can ensure data privacy and security. Another solution is to deploy Hugging Face LLMs on the company’s private infrastructure so that no data leaves the company’s security perimeter (unlike when using a publicly hosted LLM).
Recommended Resources
Explore these resources to deepen your understanding of LLMs and their applications:
Generative AI Defined: How It Works, Benefits and Dangers (techrepublic.com)
What Business Leaders Should Know About Using LLMs Like ChatGPT (forbes.com)
For a closer look at the RAG solution pattern:
Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart | AWS Machine Learning Blog
Deeplearning.ai course: Large Language Models With Semantic Search
Deeplearning.ai course: LangChain: Chat with Your Data
How GenAI Solutions Revolutionize Business Automation: Unpacking LLM Applications for Executives was originally published in Towards Data Science on Medium.