How faithful is text summarisation?
What's this about?
GenAI and Large Language Models (LLMs) can be used to summarise reports, meeting notes, research papers and many other texts, including book-length works, provided the text is split into chunks. This requirement arises because most LLMs cannot currently ingest text beyond a certain number of words in a single pass.
QuillBot's summarizer is trusted by millions worldwide to condense long articles, papers, or documents into key summary paragraphs using state-of-the-art AI.
QuillBot website
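As a rough illustration of the chunking step mentioned above, the sketch below splits a long text into fixed-size word chunks, summarises each chunk, and then summarises the combined results. The function names, the 2,000-word limit and the placeholder LLM call are assumptions for illustration only, not part of the study or any particular product.

```python
# Minimal sketch of chunked ("hierarchical") summarisation, assuming a
# hypothetical `summarise` call to whichever LLM API is being used.

def chunk_text(text: str, max_words: int = 2000) -> list[str]:
    """Split text into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarise(text: str) -> str:
    """Placeholder for a call to an LLM summarisation endpoint."""
    raise NotImplementedError

def hierarchical_summary(book: str) -> str:
    # Summarise each chunk, then summarise the concatenated chunk summaries.
    chunk_summaries = [summarise(chunk) for chunk in chunk_text(book)]
    return summarise("\n\n".join(chunk_summaries))
```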
Measuring the faithfulness of GenAI text summarisation.
Researchers at the Allen Institute for AI and Princeton University set out to evaluate how faithfully various LLMs summarise books. They chose books published after the LLMs had been trained, to avoid data contamination. Human readers assessed how well each LLM had summarised the books by evaluating claims extracted from each of the summaries using ChatGPT, itself an LLM-based GenAI system. The researchers called the process Faithfulness Annotations in Book-Length Summarisation (FABLES).
Pipeline for collecting faithfulness annotations in book-length summarization (FABLES).
SOURCE: Evaluating faithfulness and content selection in book-length summarization, Yekyung Kim et al., arXiv:2404.01261v1 [cs.CL], 1 Apr 2024.
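At a high level, the FABLES process can be thought of as the loop sketched below: summarise the book with the LLM under test, decompose the summary into atomic claims, and have a human annotator judge each claim against the source text. The function and field names are illustrative assumptions rather than the authors' code; the faithfulness score here is simply the fraction of claims judged faithful.

```python
# Illustrative sketch of a FABLES-style evaluation loop (not the authors' code).
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClaimJudgement:
    claim: str
    faithful: bool   # human annotator's verdict against the source book
    evidence: str    # passage(s) the annotator cites as support or refutation

def evaluate_summariser(book_text: str,
                        summarise: Callable[[str], str],
                        extract_claims: Callable[[str], list[str]],
                        annotate: Callable[[str, str], ClaimJudgement]):
    summary = summarise(book_text)           # LLM under test
    claims = extract_claims(summary)         # e.g. by prompting ChatGPT (see below)
    judgements = [annotate(claim, book_text) for claim in claims]
    faithfulness = sum(j.faithful for j in judgements) / len(judgements)
    return summary, judgements, faithfulness
```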
Extracting claims for evaluation.
The long-form summaries of the fiction books were decomposed into “atomic claims” for human readers to evaluate. The claims were produced automatically by prompting ChatGPT-4 to generate claims that had to be fully understandable on their own and, as far as possible, “situated within its relevant temporal, locational, and causal context”.
Example summary of “Romantic Comedy” by Curtis Sittenfeld, output by Claude 3 Opus. Adapted from Figure 2 in Yekyung Kim et al., April 2024.
Extracted claims output by ChatGPT-4 from the example text summary produced by Claude 3 Opus. Claim numbers correspond to the annotated portions of the text summary shown above. Prompts were engineered to ensure the claims were understandable and situated within the relevant text. Adapted from Figure 2 in Yekyung Kim et al., April 2024.
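A claim-extraction step of this kind might look something like the sketch below, which prompts an OpenAI chat model to decompose a summary into atomic claims. The prompt wording, the `gpt-4` model name and the one-claim-per-line output format are assumptions made for illustration; the paper's exact prompts differ.

```python
# Illustrative sketch of decomposing a summary into atomic claims by prompting
# a chat model. Prompt text and model choice are assumptions, not the paper's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = (
    "Decompose the following summary into a list of atomic claims. "
    "Each claim must be fully understandable on its own and, where possible, "
    "situated within its relevant temporal, locational, and causal context. "
    "Return one claim per line.\n\nSummary:\n{summary}"
)

def extract_claims(summary: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(summary=summary)}],
    )
    text = response.choices[0].message.content
    # Assume the model returns one claim per line, possibly bulleted.
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]
```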
Human validation by the authors of a random sample of 100 extracted claims demonstrated 100% precision (i.e., each claim can be traced to the summary without any extra or incorrect information).
Yekyung Kim et al., Allen Institute for AI & Princeton.
How LLMs performed.
Some LLMs did better than others, with Claude 3 Opus performing best, but none was completely faithful to the text of the books it was summarising.
Faithfulness of various LLMs in summarising fiction books, as judged by human annotators evaluating the accuracy of claims extracted from the summaries by ChatGPT. Derived from data in Yekyung Kim et al., April 2024.
A qualitative analysis of FABLES reveals that the majority of claims marked as unfaithful are related to events or states of characters and relationships. Furthermore, most of these claims can only be invalidated via multi-hop reasoning over the evidence, highlighting the task's complexity and its difference from existing fact-verification settings.
Yekyung Kim et al., Allen Institute for AI & Princeton.
Types of error.
The researchers at the Allen Institute for AI and Princeton University produced a taxonomy of errors. The percentage of summaries displaying each type of error is shown in the table below, with omissions, factual errors and chronology being the most problematic.
Percentage of summaries per model identified with specific issues, shown in red boxes, based on annotator comments. The green boxes indicate categories where the models received compliments. Adapted from Table 6 in Yekyung Kim et al., 2024.
... omission of key information plagues all LLM summarizers.
Yekyung Kim et al., Allen Institute for AI & Princeton.
Trustworthy or not?
This evaluation provides an important assessment of whether current GenAI systems can be trusted to produce a faithful summary of the texts they have been fed. In particular, it is alarming that such a high percentage of summaries were noted as having factual errors, along with a similarly high percentage of omissions. This should cause us to pause and consider whether such systems should be relied upon at all in business, education and many other fields.
References
Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer. Evaluating faithfulness and content selection in book-length summarization. UMass Amherst, Adobe, Allen Institute for AI, Princeton University. arXiv:2404.01261, April 2024.
Impact on Human Values
Human Values Risk Analysis for text summarisation.
Truth & Reality: HIGH RISK. Significant percentage of factual errors; omissions.
Authentic Relationships: LOW RISK.
MEDIUM RISK. Replaces humans in producing summaries.
Privacy & Freedom: HIGH RISK. LLMs use copyright data.
Moral Autonomy: LOW RISK.
Cognition & Creativity: MEDIUM RISK. Can impact critical thinking and creativity.
Governance Pillars
Transparency
Companies are opaque about what data they trained their LLMs on, although most acknowledge that copyright data was used.
Independent “faithfulness” metrics should be published.
Justice
Copyright has clearly been infringed; lawsuits are currently the only redress.
Accountability
Companies should be held to account for infringement of copyright, and for the output of their LLMs where there is consequential loss.
Policy Recommendations
Organisations deploying a chatbot for use by the public or clients must be accountable for the output of the chatbot where there is consequential loss due to unfaithful summarisation of documents. Legislation may be needed to assign ‘product’ liability where the chatbot is the ‘product’.
Copyright protection should be enforced, with no exception made for AI companies. Chapter 8 of the House of Lords report cited deals with copyright in some detail and highlights various policy options and the limitations of different approaches, such as licensing and opt-in or opt-out of data crawling on websites.
Developers and companies should be required to make information available on what data their systems have been trained on and what accuracy can be expected based on independent tests.