REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs
Overview
• This research paper investigates whether large language models (LLMs) can automatically generate citations and references for sentences in a document or report.
• The researchers introduce a dataset called REASONS, which includes abstracts from the 12 most popular scientific research domains on arXiv, to evaluate the performance of state-of-the-art LLMs on this task.
• The paper explores two types of citation queries: direct queries, where the LLM is asked to provide the author names of a given research article, and indirect queries, where the LLM is asked to provide the title of a mentioned article when given a sentence from a different article.
Plain English Explanation
The ability to automatically generate citations and references for sentences in a document or report is crucial for intelligence analysts, cybersecurity professionals, news agencies, and education personnel. This research explores whether large language models, such as GPT-4 and GPT-3.5, can handle this task effectively.
The researchers created a dataset called REASONS, which includes abstracts from the 12 most popular scientific research domains on arXiv. They then tested the LLMs on two types of citation queries: direct queries, where the LLM is asked to provide the author names of a given research article, and indirect queries, where the LLM is asked to provide the title of a mentioned article when given a sentence from a different article.
The results showed that state-of-the-art LLMs, like GPT-4 and GPT-3.5, kept their "hallucination rate" (HR) down only at the cost of a high "pass percentage" (PP): they frequently declined to answer rather than risk an inaccurate citation. When the researchers augmented the LLMs with relevant metadata, the PP decreased and the HR reached its lowest value. Additionally, an advanced retrieval-augmented generation (RAG) pipeline built on the Mistral model demonstrated consistent and robust citation support on indirect queries, matching the performance of GPT-3.5 and GPT-4.
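To make the two metrics concrete, here is a minimal sketch of how PP and HR could be computed from a batch of model responses. The definitions are assumptions for illustration (PP as the share of queries the model passed on, HR as the share of all queries answered incorrectly); the paper's exact formulas may differ, and the `Outcome` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    passed: bool   # the model declined to answer
    correct: bool  # the answer matched the reference (ignored when passed)


def pass_percentage(outcomes):
    """Share of queries (in percent) on which the model passed. Assumed definition."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return 100.0 * sum(o.passed for o in outcomes) / len(outcomes)


def hallucination_rate(outcomes):
    """Share of queries (in percent) answered with a wrong citation. Assumed definition."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    wrong = sum((not o.passed) and (not o.correct) for o in outcomes)
    return 100.0 * wrong / len(outcomes)


# Toy batch: two passes, one correct answer, one wrong answer.
batch = [
    Outcome(passed=True, correct=False),
    Outcome(passed=True, correct=False),
    Outcome(passed=False, correct=True),
    Outcome(passed=False, correct=False),
]
print(pass_percentage(batch), hallucination_rate(batch))
```

Under these assumed definitions, a model can lower its HR simply by passing more often, which is the trade-off the results describe.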
Overall, the study provides valuable insights into the reliability of RAG for automated citation generation tasks, as well as the challenges that LLMs still face in understanding context and providing accurate citations, especially when tested with adversarial samples.
Technical Explanation
The researchers used the REASONS dataset, which includes abstracts from the 12 most popular scientific research domains on arXiv, to evaluate the performance of state-of-the-art LLMs on two types of citation queries:
- Direct Queries: The LLMs were asked to provide the author names of a given research article.
- Indirect Queries: The LLMs were asked to provide the title of a mentioned article when given a sentence from a different article.
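The two query types can be sketched as simple prompt templates. The wording below is hypothetical and is not taken from the paper; the option to reply "pass" is an assumption based on the pass percentage metric reported in the results.

```python
def direct_query(title: str) -> str:
    """Build a direct query: ask for the authors of a paper given its title."""
    return (
        f'Who are the authors of the research article titled "{title}"? '
        'Answer with the author names only, or reply "pass" if you do not know.'
    )


def indirect_query(sentence: str) -> str:
    """Build an indirect query: ask for the title of the paper a sentence refers to."""
    return (
        "The following sentence comes from a research article and mentions "
        "another paper. Give the title of the mentioned paper, or reply "
        f'"pass" if you do not know.\n\nSentence: {sentence}'
    )


# Placeholder inputs, for illustration only.
print(direct_query("An Example Paper Title"))
print(indirect_query("Earlier work introduced a benchmark for this task."))
```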
The key findings from the study include:
a) State-of-the-art LLMs such as GPT-4 and GPT-3.5, which the paper notes are often called anthropomorphic, showed a high "pass percentage" (PP) as the price of minimizing the "hallucination rate" (HR): they often declined to answer instead of providing a citation.
b) Augmenting the LLMs with relevant metadata lowered the PP and gave the lowest HR.
c) An advanced retrieval-augmented generation (RAG) pipeline using Mistral demonstrated consistent and robust citation support on indirect queries, matching the performance of GPT-3.5 and GPT-4. The HR across all domains and models decreased by an average of 41.93%, and the PP was reduced to 0% in most cases. In terms of generation quality, the average F1 Score and BLEU were 68.09% and 57.51%, respectively.
d) Testing with adversarial samples showed that LLMs, including Mistral, struggled to understand context, but the extent of this issue was smaller in Mistral and GPT-4-Preview.
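The summary does not say how the F1 Score is computed. One common choice for scoring generated text against a reference is token-overlap F1, sketched below as an assumption rather than as the paper's method; `token_f1` is a hypothetical helper name.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference string (case-insensitive)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# 2 shared tokens, 3 predicted, 5 in the reference:
# precision = 2/3, recall = 2/5, F1 = 0.5
print(token_f1("attention is needed", "Attention Is All You Need"))
```

A score like this rewards partially correct titles or author lists, which a strict exact-match check would count as failures.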
Critical Analysis
The research paper provides valuable insights into the current capabilities and limitations of LLMs in generating accurate citations and references. While augmenting queries with relevant metadata and using RAG pipelines, such as the one built on Mistral, have shown promise, the paper also highlights the challenges that LLMs face in understanding context and providing accurate citations, especially when tested with adversarial samples.
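The general idea behind retrieval and metadata augmentation can be illustrated with a toy example: retrieve candidate papers for a sentence, then place their metadata in the prompt. This is a sketch under strong assumptions, not the paper's pipeline. It uses word-overlap (Jaccard) scoring instead of a real retriever, makes no model call, and the corpus entries are invented placeholders.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def retrieve(query, corpus, k=2):
    """Return the k corpus entries with the highest Jaccard overlap with the query."""
    q = tokenize(query)

    def score(doc):
        d = tokenize(doc["title"] + " " + doc["abstract"])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def augmented_prompt(sentence, corpus, k=2):
    """Build a prompt that lists retrieved titles and authors before the sentence."""
    hits = retrieve(sentence, corpus, k)
    context = "\n".join(
        f"- {d['title']} ({', '.join(d['authors'])})" for d in hits
    )
    return (
        f"Candidate papers:\n{context}\n\n"
        f"Sentence: {sentence}\n"
        "Which candidate does the sentence refer to? Answer with its title."
    )


# Invented placeholder corpus, for illustration only.
corpus = [
    {
        "title": "Graph neural networks for molecule property prediction",
        "authors": ["A. Author", "B. Author"],
        "abstract": "We predict molecule properties with message passing.",
    },
    {
        "title": "Transformers for machine translation",
        "authors": ["C. Author"],
        "abstract": "We translate text with attention.",
    },
]
print(augmented_prompt("graph neural networks predict molecule properties", corpus, k=1))
```

Grounding the prompt in retrieved metadata gives the model something concrete to copy from, which is one plausible reason such augmentation reduces both passes and hallucinated citations.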
One potential limitation of the study is the use of a specific dataset, REASONS, which may not fully represent the diversity of citation styles and requirements across different domains and applications. The researchers acknowledge this and suggest that further research is needed to assess the performance of LLMs on a broader range of datasets and citation tasks.
Additionally, the paper does not delve into the potential biases or ethical implications of using LLMs for automated citation generation, which could be an important area for future research. For example, the reliance on LLMs could potentially lead to the perpetuation of existing biases in the research literature or the generation of inaccurate or misleading citations.
Overall, this research provides a valuable contribution to the understanding of LLM capabilities in the context of automated citation generation, and it encourages readers to think critically about the potential benefits and drawbacks of this technology.
Conclusion
This research paper investigates the ability of large language models (LLMs) to automatically generate citations and references for sentences in a document or report. The researchers introduce the REASONS dataset, which includes abstracts from the 12 most popular scientific research domains on arXiv, and evaluate the performance of state-of-the-art LLMs, such as GPT-4 and GPT-3.5, on both direct and indirect citation queries.
The key findings from the study suggest that while LLMs can struggle with providing accurate citations, particularly when faced with adversarial samples, retrieval-augmented generation (RAG), such as the advanced RAG pipeline built on Mistral, can significantly improve the reliability and robustness of citation support. The researchers also highlight the importance of augmenting LLMs with relevant metadata to enhance their citation generation capabilities.
This research provides valuable insights for developers and practitioners working on automated citation generation systems, as well as a foundation for further exploration of the potential and limitations of LLMs in this domain. As the use of LLMs becomes more widespread, it is essential to continue studying their capabilities and limitations, with a focus on ensuring the reliability and ethical use of these technologies.
This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
Related Papers
REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs
Deepa Tilwani, Yash Saxena, Ali Mohammadi, Edward Raff, Amit Sheth, Srinivasan Parthasarathy, Manas Gaur
Automatic citation generation for sentences in a document or report is paramount for intelligence analysts, cybersecurity, news agencies, and education personnel. In this research, we investigate whether large language models (LLMs) are capable of generating references based on two forms of sentence queries: (a) Direct Queries, LLMs are asked to provide author names of the given research article, and (b) Indirect Queries, LLMs are asked to provide the title of a mentioned article when given a sentence from a different article. To demonstrate where LLM stands in this task, we introduce a large dataset called REASONS comprising abstracts of the 12 most popular domains of scientific research on arXiv. From around 20K research articles, we make the following deductions on public and proprietary LLMs: (a) State-of-the-art, often called anthropomorphic GPT-4 and GPT-3.5, suffers from high pass percentage (PP) to minimize the hallucination rate (HR). When tested with Perplexity.ai (7B), they unexpectedly made more errors; (b) Augmenting relevant metadata lowered the PP and gave the lowest HR; (c) Advance retrieval-augmented generation (RAG) using Mistral demonstrates consistent and robust citation support on indirect queries and matched performance to GPT-3.5 and GPT-4. The HR across all domains and models decreased by an average of 41.93%, and the PP was reduced to 0% in most cases. In terms of generation quality, the average F1 Score and BLEU were 68.09% and 57.51%, respectively; (d) Testing with adversarial samples showed that LLMs, including the Advance RAG Mistral, struggle to understand context, but the extent of this issue was small in Mistral and GPT-4-Preview. Our study contributes valuable insights into the reliability of RAG for automated citation generation tasks.
5/10/2024
LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction
Aishik Nagar, Viktor Schlegel, Thanh-Tung Nguyen, Hao Li, Yuping Wu, Kuluhan Binici, Stefan Winkler
Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs' task knowledge and reasoning capabilities, their (parametric) domain knowledge, and addition of external knowledge. To this end we evaluate various open LLMs -- including BioMistral and Llama-2 models -- on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency based reasoning as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counter-intuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations in the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.
8/23/2024
Improving Retrieval Augmented Language Model with Self-Reasoning
Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, Haifeng Huang
The Retrieval-Augmented Language Model (RALM) has shown remarkable performance on knowledge-intensive tasks by incorporating external knowledge during inference, which mitigates the factual hallucinations inherent in large language models (LLMs). Despite these advancements, challenges persist in the implementation of RALMs, particularly concerning their reliability and traceability. To be specific, irrelevant document retrieval may result in unhelpful response generation or even deteriorate the performance of LLMs, while the lack of proper citations in generated outputs complicates efforts to verify the trustworthiness of the models. To this end, we propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs, whose core idea is to leverage reasoning trajectories generated by the LLM itself. The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process. We have evaluated our framework across four public datasets (two short-form QA datasets, one long-form QA dataset, and one fact verification dataset) to demonstrate the superiority of our method, which can outperform existing state-of-the-art models and can achieve comparable performance with GPT-4, while only using 2,000 training samples.
8/6/2024
How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
Bojana Bašaragin, Adela Ljajić, Darija Medvecki, Lorenzo Cassano, Miloš Košprdić, Nikola Milošević
Large language models (LLMs) have recently become the leading source of answers for users' questions online. Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge. This is especially true for sensitive domains such as biomedicine, where there is a higher need for factually correct answers. This paper introduces a biomedical retrieval-augmented generation (RAG) system designed to enhance the reliability of generated responses. The system is based on a fine-tuned LLM for the referenced question-answering, where retrieved relevant abstracts from PubMed are passed to LLM's context as input through a prompt. Its output is an answer based on PubMed abstracts, where each statement is referenced accordingly, allowing the users to verify the answer. Our retrieval system achieves an absolute improvement of 23% compared to the PubMed search engine. Based on the manual evaluation on a small sample, our fine-tuned LLM component achieves comparable results to GPT-4 Turbo in referencing relevant abstracts. We make the dataset used to fine-tune the models and the fine-tuned models based on Mistral-7B-instruct-v0.1 and v0.2 publicly available.
7/9/2024