A simple but effective approach to improve long-context understanding
Previous studies have mainly explored two major directions: input reduction and window extension. Input reduction shortens the input context (for example, by directly truncating the input) before feeding it to downstream LLMs. Retrieval-augmented generation (RAG) extends this direction by breaking the input into chunks and then retrieving only the chunks most relevant to the query based on embedding similarity. However, when retrieval accuracy is low, the LLM receives an incomplete context for solving the task, hurting performance. Window extension extends the context window of LLMs via fine-tuning, training the model to consume longer inputs; Gemini, for example, can directly process 2M tokens per input. However, when inputs grow even longer than these extended capacities, such LLMs still struggle to focus on the information needed to solve the task and suffer from ineffective context utilization. Long-context processing is further complicated by the fact that compute cost increases quadratically with input length, due to the self-attention design of the transformer architecture that underlies most LLMs.
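To make the retrieval-based input reduction concrete, here is a minimal sketch of the chunk-retrieve-answer pattern described above. It is illustrative only: the bag-of-words `embed`, the `call_llm` placeholder, and the chunk size and top-k values are assumptions standing in for a real embedding model and a real LLM call.

```python
# Sketch of RAG-style input reduction: chunk the document, keep only the chunks
# most similar to the query, and send that reduced context to the LLM.
from collections import Counter
import math


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the downstream LLM; replace with a real model call."""
    return f"<answer based on {len(prompt)} prompt characters>"


def rag_answer(document: str, query: str, chunk_size: int = 200, top_k: int = 3) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    q_vec = embed(query)
    # Keep only the chunks most similar to the query; everything else is dropped,
    # which is where incomplete context (and hence errors) can creep in.
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return call_llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```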
Motivated by these challenges, we designed CoA, drawing inspiration from the way people interleave reading and processing of long contexts under their own limited working-memory constraints. Whereas input reduction approaches must begin processing over a shortened input ("read-then-process"), CoA breaks the input into chunks and assigns worker agents to process each chunk sequentially, so processing starts before the whole input has been read ("interleaved read-process"). Further, in contrast to window extension, CoA leverages the capability of LLMs to communicate with one another as agents rather than trying to feed a very large number of tokens into a single LLM. CoA is also compute-efficient, improving significantly over full-context approaches by reducing time complexity from n² to nk, where n is the number of input tokens and k is the context limit of the LLM.
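The sketch below illustrates the interleaved read-process flow: worker agents read successive chunks, each passing a running communication unit to the next, and a manager agent produces the final answer. It is a minimal illustration of the idea rather than the actual CoA implementation; `call_llm`, the prompt wording, and the chunk size are assumed placeholders.

```python
# Sketch of Chain-of-Agents: workers process chunks in sequence, handing an evolving
# communication unit to the next worker; a manager answers from the final unit.


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call with a limited context window."""
    return f"<summary/answer derived from {len(prompt)} prompt characters>"


def chain_of_agents(document: str, query: str, k: int = 2000) -> str:
    words = document.split()
    # Size each chunk so that chunk + communication unit + query fit in one worker's window.
    chunks = [" ".join(words[i:i + k]) for i in range(0, len(words), k)]

    communication_unit = ""  # what each worker hands to the next
    for chunk in chunks:
        worker_prompt = (
            f"Previous findings: {communication_unit}\n\n"
            f"New text chunk: {chunk}\n\n"
            f"Question: {query}\n"
            "Update the findings with any evidence relevant to the question."
        )
        communication_unit = call_llm(worker_prompt)

    manager_prompt = (
        f"Findings gathered from the full document: {communication_unit}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return call_llm(manager_prompt)
```

The complexity claim follows from this structure: each worker attends over a prompt of roughly k tokens at cost on the order of k², and about n/k workers are needed, giving a total of roughly (n/k)·k² = nk, versus n² for feeding the entire input to a single model.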