{"arxiv_id":"2409.05591","preview":"MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation\n=============================================================================================\n\nHongjin Qian ([0000-0003-4011-5673](https://orcid.org/0000-0003-4011-5673 \"ORCID identifier\")), Peking University, Beijing, China; Beijing Academy of Artificial Intelligence, Beijing, China, [chienqhj@gmail.com](mailto:chienqhj@gmail.com)\n\nZheng Liu ([0000-0001-7765-8466](https://orcid.org/0000-0001-7765-8466 \"ORCID identifier\")), Hong Kong Polytechnic University, Hong Kong, China, [zhengliu1026@gmail.com](mailto:zhengliu1026@gmail.com)\n\nPeitian Zhang ([0009-0007-1926-7433](https://orcid.org/0009-0007-1926-7433 \"ORCID identifier\")), Kelong Mao ([0000-0002-5648-568X](https://orcid.org/0000-0002-5648-568X \"ORCID identifier\")), Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China\n\nDefu Lian ([0000-0002-3507-9607](https://orcid.org/0000-0002-3507-9607 \"ORCID identifier\")), School of Computer Science and Technology, University of Science and Technology of China, Hefei, China, [liandefu@ustc.edu.cn](mailto:liandefu@ustc.edu.cn)\n\nZhicheng Dou ([0000-0002-9781-948X](https://orcid.org/0000-0002-9781-948X \"ORCID identifier\")), Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, [dou@ruc.edu.cn](mailto:dou@ruc.edu.cn)\n\nTiejun Huang, School of Computer Science, Peking University, Beijing, China, [tjhuang@pku.edu.cn](mailto:tjhuang@pku.edu.cn)\n\n(2025)\n\n###### Abstract.\n\nProcessing long contexts presents a significant challenge for large language models (LLMs). While recent advancements allow LLMs to handle much longer contexts than before (e.g., 32K or 128K tokens), it is computationally expensive and can still be insufficient for many applications. Retrieval-Augmented Generation (RAG) is considered a promising strategy to address this problem. 
However, conventional RAG methods face inherent limitations because of two underlying requirements: 1) explicitly stated queries, and 2) well-structured knowledge. These conditions do not hold in general long-context processing tasks.\n\nIn this work, we propose MemoRAG, a novel RAG framework empowered by global memory-augmented retrieval. MemoRAG features a dual-system architecture. First, it employs a light but long-range system to create a global memory of the long context. Once a task is presented, it generates draft answers, providing useful clues for the retrieval tools to locate relevant information within the long context. Second, it leverages an expensive but expressive system, which generates the final answer based on the retrieved information. Building upon this fundamental framework, we realize the memory module in the form of KV compression, and reinforce its memorization and cluing capacity from the Generation quality’s Feedback (a.k.a. RLGF). In our experiments, MemoRAG achieves superior performance across a variety of long-context evaluation tasks, including not only complex scenarios where traditional RAG methods struggle, but also simpler ones where RAG is typically applied. Our source code is available at [this repository](https://github.com/qhjqhj00/MemoRAG).\n\nKeywords: Retrieval-Augmented Generation, Long Context Processing\n\nJournal year: 2025. Copyright: ACM licensed. Conference: Proceedings of the ACM Web Conference 2025 (WWW ’25), April 28-May 2, 2025, Sydney, NSW, Australia. DOI: 10.1145/3696410.3714805. ISBN: 979-8-4007-1274-6/25/04. CCS: Computing methodologies → Natural language generation.\n\n1. Introduction\n----------------\n\n<img src='x1.png' alt='Refer to caption' title='' width='830' height='512' />\n\n*Figure 1. Comparison of MemoRAG with Standard RAG and human cognition of a long document. 
Figure (a) shows standard RAG, where retrieval and generation take place in a sequential pipeline. Figure (b) illustrates how humans tackle a task about the document: 1. going through the document and forming the memory, 2. thinking about the clues to the presented task (i.e., recalling) and checking the document for needed details (i.e., retrieving), 3. making a response to the task based on the memory-enhanced retrieval result. Inspired by the human cognition process, Figure (c) demonstrates MemoRAG, which creates a global memory of the long context, recalls useful clues based on the memory, and retrieves information based on the clues to generate a high-quality response.*\n\nLarge language models (LLMs) need to process long contexts in many real-world scenarios, such as long-document QA and summarization *(Bai et al., [2024]; Zhang et al., [2024a])*. While some recent LLMs can handle much longer contexts than before (e.g., Mistral-32K, Phi-128K) *(Jiang et al., [2023a]; Abdin et al., [2024])*, they can still be insufficient for certain applications. Meanwhile, it is computationally expensive to process long contexts directly due to the considerable costs in inference time and GPU memory *(Dong et al., [2023])*.\n\nRetrieval-Augmented Generation (RAG) is widely regarded as a promising strategy for addressing long-context processing challenges *(Izacard and Grave, [2021b]; Gao et al., [2024])*. RAG allows LLMs to complete tasks more cost-effectively by focusing only on the relevant parts retrieved from the long input context *(Xu et al., [2023]; Zhu et al., [2024])*.\nHowever, traditional RAG methods face inherent limitations when applied to general long-context tasks, due to two key constraints.\nFirst, the search intent must be explicitly expressed (or easily clarified through query rewriting) *(Chan et al., [2024]; Zhu et al., [2024])*. 
Second, the external dataset must be well-structured for effective encoding and indexing (e.g., Wikipedia passages) *(Nguyen et al., [2016]; Metzler et al., [2021])*. Unfortunately, neither of these conditions is typically met in general long-context tasks.\nOn one hand, there may be no clear search intent (e.g., summarizing the main characters in a book, or clarifying the relationships between characters) *(Edge et al., [2024]; Qian et al., [2024b])*. On the other hand, the input context is often unstructured (e.g., a 100-page text file, or multi-year financial reports), making it difficult to partition, encode, and index in a straightforward manner *(Ram et al., [2023]; Qian et al., [2024a]; Zhu et al., [2024])*.\n\nCompared with standard RAG, human cognition of a long document is significantly more effective (as shown in Figure [1]). When a person is presented with a long document, they first skim through it to form a global memory of its high-level information. When tasked with a document understanding question, such as “What are the mutual relationships between the main characters?”, the person recalls useful clues from their memory and uses these clues to locate specific details within the document. Based on the retrieved information, they can then generate a high-quality response to the task *(Adolphs, [1999])*.\n\nInspired by the human cognitive process, we propose MemoRAG, a novel framework for long-context processing built on global memory-enhanced retrieval augmentation. MemoRAG features a dual-system architecture: a light but long-range system to realize the memory module and a heavy but expressive system to generate the final answer. For each presented task, MemoRAG prompts its memory module to generate retrieval clues. These clues are essentially drafted answers based on the compact memory. 
While these clues may contain some inaccuracies or lack details, they effectively reveal the underlying information needs of the task and can be directly linked to the source information. By using these clues as queries, MemoRAG can effectively retrieve the necessary knowledge from the external knowledge base.\n\nThe memory module is the core of MemoRAG. It is expected to be 1) length-scalable: cost-effectively handling long contexts, 2) retentive: memorizing the crucial information within long contexts, and 3) instructive: generating useful clues for the presented task. Therefore, we introduce the following techniques to optimize its performance. First, we realize the memory module in the form of a KV-compressible LLM with configurable compression rates. This structure can flexibly support a wide range of context lengths and can be optimized in an end-to-end manner. Second, we design a novel algorithm that learns to reinforce the memory module’s memorization and cluing capacity from the generation quality’s feedback (a.k.a. RLGF). That is, 1) the generated clues are positively rewarded if they can support the generation of high-quality answers, and 2) the memory module is reinforced to generate the positively rewarded clues.\n\n<img src='x2.png' alt='Refer to caption' title='' width='665' height='304' />\n\n*Figure 2. Illustration of (a) task background, (b) framework comparison, and (c) application scenarios. When processing long inputs like the entire Harry Potter series, most LLMs struggle with million-token contexts. Standard RAG methods also face challenges with queries unsuitable for direct searching. MemoRAG overcomes these limitations by constructing a global memory that generates clues, guiding the retrieval of relevant evidence and enabling more accurate and comprehensive answers.*\n\nWe perform comprehensive experiments to evaluate MemoRAG. 
In our experiments, we leverage a variety of datasets from two popular long-context benchmarks: LongBench *(Bai et al., [2024])* and InfiniteBench *(Zhang et al., [2024a])*. The two benchmarks contain both QA-style tasks, e.g., HotPotQA and NarrativeQA, which are well-suited for traditional RAG methods, and non-QA tasks, like government report summarization, which are unfavorable to traditional RAG methods. We also curate a general long-document understanding benchmark, containing general tasks related to long documents from 20 diverse domains, such as law, finance, physics, and programming. Our experimental results lead to a series of critical insights. Firstly, M","is_truncated":true,"total_characters":83376,"preview_characters":10000}