TMLR 2025
LitLLM is a powerful AI toolkit that transforms how researchers write literature reviews, using advanced Retrieval-Augmented Generation (RAG) to create accurate, well-structured related work sections in seconds rather than days.
Writing comprehensive literature reviews is one of the most time-consuming aspects of academic research, particularly in rapidly evolving fields like machine learning. As the volume of scientific publications grows exponentially, researchers face increasing challenges in identifying, synthesizing, and contextualizing relevant prior work. In this work we explore the potential of Large Language Models (LLMs) to assist in this critical task, evaluating both their current capabilities and limitations.
Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write. This work explores the zero-shot abilities of recent Large Language Models (LLMs) for writing literature reviews based on an abstract. We decompose the task into two components: (1) Retrieving related works given a query abstract and (2) Writing a literature review based on the retrieved results.
This research sits at the intersection of scientific writing automation, retrieval-augmented generation (RAG), and multi-document summarization. We approach the challenge by decomposing the literature review task into two core components: Retrieval (finding relevant papers given a query abstract) and Generation (creating a coherent literature review based on the retrieved papers). This decomposition addresses a fundamental challenge in scientific writing: ensuring that LLM-generated text is factually accurate, relevant, and properly contextualized. Our work contributes to several important research areas including LLMs for scientific writing, RAG systems for grounding outputs in external knowledge, and evaluation protocols for assessing LLM performance on literature review tasks.
We developed a novel interactive pipeline for literature review generation with two main components: a Retrieval Module that finds relevant papers, and a Generation Module that synthesizes them into a coherent review. Here we describe our approach and key results.
Our retrieval process integrates multiple complementary techniques to maximize the discovery of relevant papers. We begin with keyword extraction, using LLMs to identify meaningful keywords from the query abstract. These keywords are then used to query external knowledge bases such as Google Scholar and Semantic Scholar. We complement this keyword-based search with embedding-based search, transforming abstracts into SPECTER embeddings for similarity comparison. Finally, our re-ranking system uses a prompting-based mechanism with attribution to prioritize the most relevant papers. We evaluated these approaches using precision (the proportion of retrieved papers that are relevant) and normalized recall (the proportion of relevant papers retrieved, normalized by the total number of relevant papers). Our results demonstrate that combining keyword-based and embedding-based search significantly improves performance.
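To make this concrete, here is a minimal sketch of such a hybrid retrieval step. It relies on assumptions not spelled out above: a generic `llm(prompt) -> str` helper standing in for whatever chat model is used, the public Semantic Scholar Graph API for keyword search, and SPECTER embeddings loaded through `sentence-transformers`. It illustrates the idea rather than reproducing the exact LitLLM implementation.

```python
# Hybrid retrieval sketch: LLM keyword extraction + keyword search + SPECTER re-ranking.
# `llm` is an assumed helper (any chat model); endpoints and model names are illustrative.
from typing import Callable

import numpy as np
import requests
from sentence_transformers import SentenceTransformer

specter = SentenceTransformer("sentence-transformers/allenai-specter")  # SPECTER paper embeddings


def extract_keywords(query_abstract: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM for a short keyword query summarizing the abstract."""
    prompt = (
        "Extract a concise search query (3-6 keywords) for finding papers "
        "related to this abstract:\n\n" + query_abstract
    )
    return llm(prompt).strip()


def keyword_search(keywords: str, limit: int = 20) -> list[dict]:
    """Keyword search against the Semantic Scholar Graph API."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": keywords, "limit": limit, "fields": "title,abstract"},
        timeout=30,
    )
    resp.raise_for_status()
    return [p for p in resp.json().get("data", []) if p.get("abstract")]


def embedding_rank(query_abstract: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Order candidates by cosine similarity of SPECTER embeddings to the query."""
    texts = [query_abstract] + [c["title"] + " " + c["abstract"] for c in candidates]
    emb = specter.encode(texts, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]  # cosine similarities to the query abstract
    order = np.argsort(-sims)[:top_k]
    return [candidates[i] for i in order]


def retrieve(query_abstract: str, llm: Callable[[str], str]) -> list[dict]:
    """Keyword-based search followed by embedding-based re-ranking."""
    return embedding_rank(query_abstract, keyword_search(extract_keywords(query_abstract, llm)))
```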
The verification step in the retrieval process plays an important role in balancing precision and recall: a stricter relevance filter removes off-topic candidates (raising precision) but risks discarding borderline-relevant papers (lowering recall).
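A sketch of how such a verification filter and the two retrieval metrics can be implemented follows; the YES/NO prompt is a hypothetical stand-in for the paper's actual prompts, and `llm` is the same assumed chat-model helper as above.

```python
# Verification filter plus the precision / normalized-recall metrics described above.
# The prompt wording is hypothetical; `llm` is an assumed chat-model helper.
from typing import Callable


def verify(query_abstract: str, candidates: list[dict], llm: Callable[[str], str]) -> list[dict]:
    """Keep only candidates the LLM judges relevant to the query abstract.

    A stricter prompt raises precision (fewer off-topic papers) but can lower
    recall by discarding borderline-relevant work.
    """
    kept = []
    for paper in candidates:
        prompt = (
            "Query abstract:\n" + query_abstract + "\n\n"
            "Candidate abstract:\n" + paper["abstract"] + "\n\n"
            "Should this paper be cited in the related-work section of the query? "
            "Answer YES or NO."
        )
        if llm(prompt).strip().upper().startswith("YES"):
            kept.append(paper)
    return kept


def precision(retrieved: set[str], relevant: set[str]) -> float:
    """Proportion of retrieved papers that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def normalized_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Proportion of relevant papers retrieved, normalized by the number of relevant papers."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0
```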
Once relevant papers are retrieved, our generation module synthesizes them into a coherent literature review using several innovative approaches. We developed a planning-based approach that follows a two-step process: first outlining a plan for the review structure, then executing steps in that plan to create the final text. We tested various prompting strategies including vanilla zero-shot prompting, plan-based prompting, sentence-by-sentence generation, and per-citation prompting to understand the optimal generation technique. To evaluate quality, we employed multiple metrics: ROUGE scores to measure text overlap with reference reviews, BERTScore for semantic similarity assessment, Llama-3-Eval for LLM-based quality evaluation, and human expert assessment to identify hallucinations and overall quality. Our findings show that plan-based approaches consistently outperform vanilla prompting strategies, with GPT-4 achieving the best results among all tested models.
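Before the quantitative results in Table 1, here is a minimal sketch of the two-step, plan-then-write strategy; the prompts are hypothetical paraphrases, and `llm` is again an assumed chat-model helper rather than the toolkit's actual interface.

```python
# Plan-based generation sketch: first outline a citation plan, then write from it.
# Prompts are hypothetical paraphrases; `llm` is an assumed chat-model helper.
from typing import Callable


def plan_based_review(query_abstract: str, papers: list[dict], llm: Callable[[str], str]) -> str:
    """Two-step generation: draft a sentence-level plan, then execute it."""
    refs = "\n".join(
        f"[{i + 1}] {p['title']}: {p['abstract']}" for i, p in enumerate(papers)
    )

    # Step 1: ask for a plan that maps each sentence to the references it should cite.
    plan = llm(
        "You are writing the related-work section for the abstract below.\n\n"
        f"Abstract:\n{query_abstract}\n\nReferences:\n{refs}\n\n"
        "Write a numbered plan, one line per sentence, listing which reference "
        "numbers each sentence should cite."
    )

    # Step 2: execute the plan, grounding every sentence in the listed references.
    return llm(
        f"Abstract:\n{query_abstract}\n\nReferences:\n{refs}\n\nPlan:\n{plan}\n\n"
        "Follow the plan exactly: write one sentence per plan line, citing the listed "
        "references as [n], and do not cite anything outside the reference list."
    )
```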
Model | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | BERTScore ↑ | Llama-3-Eval ↑ |
---|---|---|---|---|---|
CodeLlama 34B-Instruct | 22.608 | 5.032 | 12.553 | 82.418 | 66.898 |
CodeLlama 34B-Instruct (Plan) | 27.369 | 5.829 | 14.705 | 83.386 | 67.362 |
Llama 2-Chat 7B | 23.276 | 5.104 | 12.583 | 82.841 | 68.689 |
Llama 2-Chat 13B | 23.998 | 5.472 | 12.923 | 82.855 | 69.237 |
Llama 2-Chat 70B | 23.769 | 5.619 | 12.745 | 82.943 | 70.980 |
GPT-3.5-turbo (0-shot) | 25.112 | 6.118 | 13.171 | 83.352 | 72.434 |
GPT-4 (0-shot) | 29.289 | 6.479 | 15.048 | 84.208 | 72.951 |
Llama 2-Chat 70B (Plan) | 30.919 | 7.079 | 15.991 | 84.392 | 71.354 |
GPT-3.5-turbo (Plan) | 30.192 | 7.028 | 15.551 | 84.203 | 72.729 |
GPT-4 (Plan) | 33.044 | 7.352 | 17.624 | 85.151 | 75.240 |
Table 1: Zero-shot results on the proposed RollingEval-Aug dataset.
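The automatic metrics in Table 1 can be computed with standard off-the-shelf packages. A sketch, assuming the `rouge-score` and `bert-score` libraries; the LLM-judge metric (Llama-3-Eval) is omitted because it requires a separate judging prompt.

```python
# Automatic evaluation sketch using standard packages (`rouge-score`, `bert-score`).
# The LLM-judge metric (Llama-3-Eval) is not shown here.
from rouge_score import rouge_scorer
from bert_score import score as bert_score


def evaluate_review(generated: str, reference: str) -> dict[str, float]:
    """Compare a generated related-work section against the paper's actual one."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, generated)

    # BERTScore: semantic similarity via contextual embeddings.
    _, _, f1 = bert_score([generated], [reference], lang="en")

    return {
        "ROUGE-1": rouge["rouge1"].fmeasure,
        "ROUGE-2": rouge["rouge2"].fmeasure,
        "ROUGE-L": rouge["rougeL"].fmeasure,
        "BERTScore": f1.item(),
    }
```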
The human evaluation component provides crucial insights into the practical utility of these LLM-generated literature reviews. Expert annotators were given detailed instructions for assessing the generated reviews, and their judgments strongly favor plan-based approaches.
Our combined approach shows that plan-based generation with GPT-4 produces the highest quality literature reviews, while our novel retrieval pipeline with debate-based re-ranking significantly improves the discovery of relevant papers. This integration of optimized retrieval and structured generation enables the creation of more accurate, comprehensive literature reviews.
Despite the promising results, our research faces several key limitations. Retrieval challenges are significant, as current methods still miss many relevant citations, with substantial room for improvement in search strategies. Generation hallucinations remain problematic; while our planning approach reduces them, LLMs still sometimes generate false citations or misrepresent source materials. Evaluation complexity presents another obstacle, as assessing literature review quality is inherently subjective and difficult to quantify with automated metrics alone. Finally, domain specificity is a concern, as our approaches were primarily tested on machine learning papers, and performance may vary significantly across different scientific disciplines.
Our future research directions include improving embedding-based retrieval for better coverage of relevant work, developing more sophisticated querying strategies that better capture semantic relationships between papers, evaluating the system end-to-end with human researchers in the loop, and extending the approach to other scientific disciplines beyond machine learning.
The paper "LitLLMS, LLMs for Literature Review: Are we there yet?" provides a comprehensive analysis of how LLMs can assist in literature review writing. The research demonstrates that while LLMs show significant promise in this domain, they are not yet ready to fully automate the process. The most effective approach combines Hybrid retrieval methods integrating keyword and embedding-based search, Attribution-based re-ranking to prioritize relevant papers, and Plan-based generation to create structured, coherent literature reviews with fewer hallucinations.
The ultimate vision is not to replace human researchers but to create interactive systems where LLMs assist in the most time-consuming aspects of literature review, allowing researchers to focus on higher-level synthesis and insight generation. This work takes an important step toward that vision by identifying both the current capabilities and limitations of LLMs for literature review tasks.
LitLLM is developed by researchers at ServiceNow Research and Mila - Quebec AI Institute who are passionate about improving scientific workflows through advanced AI techniques.
If you find LitLLM useful in your research, please cite both of our papers:
@article{agarwal2024llms,
  title={LitLLMs, LLMs for Literature Review: Are we there yet?},
  author={Agarwal*, Shubham and Sahu*, Gaurav and Puri*, Abhay and Laradji, Issam H and Dvijotham, Krishnamurthy DJ and Stanley, Jason and Charlin, Laurent and Pal, Christopher},
  journal={arXiv preprint arXiv:2412.15249},
  year={2024}
}

@article{agarwal2024litllm,
  title={LitLLM: A toolkit for scientific literature review},
  author={Agarwal*, Shubham and Sahu*, Gaurav and Puri*, Abhay and Laradji, Issam H and Dvijotham, Krishnamurthy DJ and Stanley, Jason and Charlin, Laurent and Pal, Christopher},
  journal={arXiv preprint arXiv:2402.01788},
  year={2024}
}