On social media, one comes across much casual commentary on the outputs of AI tools like ChatGPT, of the type “Look, it wrote this beautiful poem” or “Look, it generated this amazing summary of the document”. Recently, I took a forensic look at examples of text summarization in order to provide a balanced view of the strengths of such tools and the directions in which they can improve.
The blog post is broken up into four parts, all released today so you can binge if you want.
Part 1: Exploring the length of summary, and selection of contents
Part 2: Exploring selection of contents, grammar, and word choices
Part 3: Exploring word choices, and hallucinations
Part 4: Exploring hallucinations
This post is Part 4 of the series.
***
HALLUCINATIONS
An acknowledged problem with LLM tools is the invention of untrue facts, usually referred to as "hallucinations". I found a few examples of mild hallucinations in the six summaries I reviewed.
Claude turned these lines:
Disney has begun reporting more detailed results from its ESPN sports network as it seeks strategic partners to invest in the flagship sports network’s future.
ESPN’s operating income for fiscal 2023 fell 1.7% to $2.8 billion, while revenue rose 2% to $16.4 billion. Disney owns 80% of ESPN through a joint venture with Hearst, and Iger has said the company is working to transform the network into a fully direct-to-consumer platform, with live sports and other sports content streamed to consumers outside the cable bundle.
Excluding ESPN, Disney’s traditional TV networks saw revenue fall 9.1% for the quarter to $2.62 billion. Operating income from the networks was flat at $805 million. A weeklong standoff in September between Disney and cable provider Charter Communications over carriage rates for Disney’s cable channels and whether or not Charter’s 15 million customers should get free access to Disney+ raised concerns about the future of the cable television model.
into
Traditional TV networks like ESPN saw declines in revenue and income as the cable TV model faces challenges. Iger aims to transform ESPN into a direct-to-consumer streaming platform.
This is the first mention of ESPN by Claude in the entire summary. The summary never explains that ESPN is a division of Disney, so the first sentence reads as if it were about the overall cable industry, with ESPN as just another key player independent of Disney. (ChatGPT suffers from a similar issue.)
Besides, in the original paragraph, ESPN’s operating income fell but revenue rose, so it is incorrect to say that both revenue and income declined.
Claude also dropped the word “fully” from the original’s “fully direct-to-consumer platform”, but that qualifier is crucial in this context. (Intriguingly, the one ChatGPT summary that mentions this point also omitted “fully”.)
***
In a summarization task, “hallucinations” are not just made-up facts absent from the source text; they also include true information that is not found in the text (the LLM may have been trained on other documents, giving it prior knowledge that it inserts into the summary).
A recurring summarization technique used by LLMs is to reduce a long list by dropping some elements. How do they decide which elements are important to retain? I suspect that the LLMs might appeal to prior knowledge (not present in the article); if not, they may be dropping elements arbitrarily.
Let’s look at this example:
Other bright spots included Disney’s Experiences segment, which includes theme parks, cruise ships, a family-adventure travel-guide business and merchandise licensing. The unit’s operating income rose 31% from the year-earlier quarter, to $1.76 billion. Disney has raised prices at its theme parks and invested heavily in its cruise ship business in the hopes of capitalizing on rising demand for in-person entertainment experiences.
All three platforms decided that the growth at the Experiences segment is an “essential” idea. In the WSJ article, this unit is described as comprising four things, apparently too many for an LLM’s liking.
Claude renamed the unit “theme parks and experiences segment/business”, and omitted cruise ships, the family-adventure travel-guide business, and merchandise licensing from its summaries.
ChatGPT renamed the unit “theme parks and related experiences” in one case, while omitting the other items. In the second example, ChatGPT kept the Experiences segment name but, in its description, included only theme parks and cruise ships, grouping the rest into “related”.
Mistral typically makes few word changes, so it’s not surprising that it retained the Experiences segment name. However, like ChatGPT, it also abridged the description, dropping the “family-adventure travel-guide business” completely.
What is driving the inclusion/exclusion decision? Could it be the order of the items? That might explain Claude and ChatGPT but not Mistral.
***
I ran a further test. I altered the text by reordering the list of items: from “theme parks, cruise ships, a family-adventure travel-guide business and merchandise licensing” to “a family-adventure travel-guide business, merchandise licensing, theme parks, and cruise ships”. The most frequently dropped item now appears first, while Disney’s best-known business (theme parks) sits third of four. How did the LLM tools react?
With the original order of items, in the two previous runs, ChatGPT always retained theme parks, always dropped travel guides and merchandise licensing, and sometimes included cruise ships. In the test using the new order of items, ChatGPT produced
“The company's Experiences segment, encompassing travel guides, merchandise, theme parks, and cruise ships”
so it kept all four items but reduced “family-adventure travel-guide business” to “travel guides” and “merchandise licensing” to “merchandise”. The latter contraction is troubling: “merchandise” implies Disney sells the goods itself, whereas “merchandise licensing” is a royalty business built on other companies’ products.
This suggests that the LLM tool treats items listed earlier as more important. It’s still not clear whether ChatGPT has learned from other training data that theme parks and cruise ships are major revenue sources for Disney, or learned it within this article, where those categories are mentioned elsewhere.
Mistral, on the other hand, responded with more or less the same output regardless of the order of those items.
“The company's Experiences segment, which includes theme parks, cruise ships, and merchandise licensing, saw operating income rise 31%...”
This suggests Mistral does not interpret the sequence of items within the list as informative. Weirdly, it reordered the items to show theme parks and cruise ships first. Mistral may be biased against long descriptors, or it may be using external information to determine how to order and cull longer lists.
Finally, Claude somehow decided not to mention the Experiences segment at all. This is a bit worrying since the only thing I changed in the text was the order of the items in this list, which should not affect the LLM’s opinion of whether the growth at the Experiences segment is a significant idea in the WSJ article!
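(For readers who want to try this kind of order-perturbation test on their own documents, here is a minimal sketch in Python using the openai client library. The model name, prompt wording, and helper functions are my own illustrative choices, not the exact setup behind the runs discussed in this series.)

# Sketch of an order-perturbation test: summarize the same article with a
# list of items in its original order and in a shuffled order, several times
# each, to separate run-to-run randomness from the effect of the reordering.
# Assumed details: model name, prompt wording, and use of the openai client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ORIGINAL_LIST = ("theme parks, cruise ships, a family-adventure travel-guide "
                 "business and merchandise licensing")
REORDERED_LIST = ("a family-adventure travel-guide business, merchandise "
                  "licensing, theme parks, and cruise ships")

def summarize(article_text: str) -> str:
    # Ask the model for a summary of the article text.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of model
        messages=[{"role": "user",
                   "content": "Summarize the following article:\n\n" + article_text}],
    )
    return response.choices[0].message.content

def perturbation_test(article_text: str, runs: int = 3) -> None:
    # Summarize the article with the list in both orders, several times each.
    for label, variant in [("original", ORIGINAL_LIST), ("reordered", REORDERED_LIST)]:
        perturbed = article_text.replace(ORIGINAL_LIST, variant)
        for i in range(runs):
            print(f"--- {label} order, run {i + 1} ---")
            print(summarize(perturbed))

One would then check, by eye or with simple string matching, which of the four items survive in each summary.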
CONCLUSION
LLM tools show promise in the text summarization task. They generally produce highly readable, grammatical summaries that cover most of the essential ideas. Numbers seem to survive intact even when the surrounding logical structure has changed. Each LLM platform has an internally consistent idea of how much to compress the text, although the summary lengths differ significantly across platforms. Severe hallucination is absent, but a milder form exists, leading to subtle changes of meaning through word changes, incorrect inference of relationships, or turning speculative statements into observed facts. More work is needed to understand whether these instances are caused by external knowledge resident in the LLMs, or by randomness built into their structures.
Given a fixed document, each LLM platform is moderately consistent across runs, but only about two-thirds of the ideas recur, while the remaining third appears almost at random. The final test, altering the order of four items presented as a list, suggests that the outputs from these LLM tools are not robust, in the sense that small changes to the input may lead to material changes in the output.