Speculative RAG consists of two components: (1) a specialist RAG drafter and (2) a generalist RAG verifier. First, the base model’s knowledge retriever retrieves related documents from the knowledge base. Speculative RAG then offloads the computational burden to the specialist RAG drafter, a small LM that is specialized in answering questions from retrieved documents and is not expected to handle general problems. This smaller module excels at reasoning over retrieved documents and can rapidly produce responses together with their rationales. It serves as an efficient and robust RAG module for the generalist LM. With the specialist drafter in place, the generalist verifier can skip the detailed review of potentially repetitive documents and focus instead on validating the drafts and selecting the most accurate answer.
For example, when answering “Which actress or singer starred as Doralee Rhodes in the 1980 film, Nine to Five?”, we first retrieve a number of documents from the knowledge base. We then feed subsets of the retrieved documents into the RAG drafter and generate multiple answer drafts, each with a corresponding rationale, in parallel. Drafting in parallel keeps processing fast even when a large number of documents is retrieved.
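The drafting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: `make_subsets`, `toy_drafter`, and `draft_in_parallel` are hypothetical names, the round-robin split is only one simple way to form subsets, and the toy drafter stands in for the small specialist LM.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Draft = Tuple[str, str]  # (answer, rationale)


def make_subsets(docs: List[str], num_subsets: int) -> List[List[str]]:
    """Split docs round-robin into at most num_subsets non-empty groups.

    Assumption: the split strategy is illustrative only.
    """
    if not docs:
        return []
    k = max(1, min(num_subsets, len(docs)))
    return [docs[i::k] for i in range(k)]


def toy_drafter(question: str, subset: List[str]) -> Draft:
    """Hypothetical stand-in for the specialist drafter LM.

    It uses the first document of the subset as the rationale and
    reports how many documents it was given as the "answer".
    """
    rationale = subset[0]
    answer = f"draft from {len(subset)} document(s)"
    return answer, rationale


def draft_in_parallel(
    question: str,
    docs: List[str],
    drafter: Callable[[str, List[str]], Draft],
    num_subsets: int = 2,
) -> List[Draft]:
    """Run the drafter on each document subset concurrently.

    Results come back in the same order as the subsets.
    """
    subsets = make_subsets(docs, num_subsets)
    if not subsets:
        return []
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        return list(pool.map(lambda s: drafter(question, s), subsets))


if __name__ == "__main__":
    retrieved = ["movie doc A", "musical doc A", "movie doc B", "musical doc B"]
    for answer, rationale in draft_in_parallel(
        "Who starred as Doralee Rhodes?", retrieved, toy_drafter
    ):
        print(answer, "|", rationale)
```

Passing the drafter as a callable keeps the sketch independent of any particular model; with a real drafter, each call would return a generated answer and rationale for its subset.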
Because the knowledge retriever has limited capability, some of the retrieved documents are not relevant. In this example, the retrieved documents contain information about both the Nine to Five movie (1980) and the Nine to Five musical (2010). To determine the most accurate draft, the generalist RAG verifier, a general-purpose LLM, calculates the conditional generation probability of each answer draft together with its rationale and outputs it as a confidence score. Since answer drafts based on the Nine to Five musical would be inaccurate, the generalist RAG verifier assigns those drafts lower scores and filters them out. Finally, the generalist verifier selects the answer draft with the highest confidence score, here the one based on the Nine to Five movie, as the final answer.
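The selection step can be illustrated with a short sketch. It assumes the confidence score is the log of the conditional generation probability, computed as the sum of per-token log-probabilities of the draft given the question. The text above does not specify normalization or any additional score terms, so none are modeled here. All names are hypothetical, and `toy_token_log_probs` is a stand-in for querying the verifier LLM.

```python
from typing import Callable, List, Sequence, Tuple

Draft = Tuple[str, str]  # (answer, rationale)


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """log P(draft | question) = sum over tokens of log P(token | question, prefix)."""
    return sum(token_log_probs)


def toy_token_log_probs(question: str, draft: Draft) -> List[float]:
    """Hypothetical stand-in for the verifier LLM.

    Assumption for illustration: drafts whose rationale mentions 1980
    get high per-token log-probabilities, all others get low ones.
    A real verifier would return the LLM's own token log-probabilities.
    """
    answer, rationale = draft
    tokens = (answer + " " + rationale).split()
    per_token = -0.1 if "1980" in rationale else -1.0
    return [per_token] * len(tokens)


def select_best_draft(
    question: str,
    drafts: List[Draft],
    token_log_probs: Callable[[str, Draft], List[float]],
) -> Tuple[Draft, float]:
    """Score every draft and return the highest-scoring one with its score."""
    if not drafts:
        raise ValueError("at least one draft is required")
    scored = [(d, sequence_log_prob(token_log_probs(question, d))) for d in drafts]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    question = "Which actress or singer starred as Doralee Rhodes in the 1980 film, Nine to Five?"
    drafts = [
        ("a stage performer", "The 2010 musical adaptation has a different cast."),
        ("Dolly Parton", "The 1980 film cast Dolly Parton as Doralee Rhodes."),
    ]
    best, score = select_best_draft(question, drafts, toy_token_log_probs)
    print(best[0], score)
```

Working in log space avoids numerical underflow when many token probabilities are multiplied, and the ranking of drafts is the same as ranking by the probability itself.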