Reverse linguistics
People infer meaning from text and images quite differently.
While LLMs are equally fluent in generating both text and images, they tend to get away with more “bullshit” in text because humans are far less accurate and precise in their scrutiny of text.
How does this work?
With text, people are naturally inclined to infer meaning, fill in gaps, and overlook inconsistencies. Human working memory can only maintain a limited number of elements simultaneously. As a sentence unfolds, the brain compresses earlier details into a simplified representation of meaning. Exact wording, numerical values, and logical constraints are discarded once the gist has been extracted. That compression is what allows readers to follow long passages (e.g. a book) efficiently, but it reduces the likelihood that errors will be detected.
In contrast, in images, errors like misaligned elements, unrealistic proportions, or incorrect labels are immediately noticeable. The visual system evolved to rapidly detect opportunities or threats. The brain processes large portions of a visual scene in parallel rather than sequentially. Errors are immediately flagged.
Text processing relies heavily on top-down inference. Readers reconstruct meaning by combining the words with prior knowledge and expectations, within the constraints of the inferred narrative. Experiments show that readers frequently fail to detect missing words, duplicated words, or semantic contradictions if the sentence structure remains plausible.
An example is the sentence “Paris in the the spring”, where many readers miss the duplication because the brain predicts the phrase rather than verifying each word.
Another example is “the company’s revenue grew 30 percent from $100 million to $120 million”. It’s not 30 percent, it’s 20 percent, but rarely would a person notice.
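For anyone who wants to audit that claim rather than trust the gist, here is a minimal sketch of the arithmetic. The function name `percent_change` is just an illustrative helper, not anything from a library.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new, as a percentage of old."""
    # Multiply before dividing so round numbers stay exact in floating point.
    return (new - old) * 100.0 / old


# Figures from the sentence above, in millions of dollars.
claimed_growth = 30.0
actual_growth = percent_change(100.0, 120.0)

print(f"claimed: {claimed_growth:.0f} percent, actual: {actual_growth:.0f} percent")
```

The check takes one subtraction and one division, which is roughly the amount of effort most readers never spend on a sentence that sounds plausible.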
A final semantic example is the sentence above: “In contrast, in images, errors like misaligned elements, unrealistic proportions, or incorrect labels are immediately noticeable”. Noticeable means it could be noticed, whereas the point being made is that it is actually “noticed”, and is not just “noticeable”.
Also, unless you were really paying attention you probably didn’t even recall that sentence from four paragraphs prior. And some of you, as you read the last paragraph, just thought “which sentence?”
The purveyors of large language models exploit this tolerance. Their models generate text sequences with strong statistical coherence at the sentence level, which produces the subjective experience of competent text. If the narrative holds together, readers rarely audit each element. The result is a high rate of unnoticed hallucination: semantic meaning can be inferred even if, logically, it isn’t present.
In contrast, LLMs are fairly useless at creating accurate and precise images. The irony is that they are just as useless at creating accurate and precise text: we just don’t notice.
Politicians, business leaders, and just about every other human who wants more than their fair share also exploit this tolerance in their fellow fuzzy thinkers.
And get this: it’s even worse in speech. Listeners must construct meaning in real time while retaining only fragments of the stream in working memory. Prosody, cadence, and confidence substitute for logical structure. If the narrative sounds coherent, the brain accepts it.
Read this blog entry carefully – it explains how LLMs work.