Thoughts
Semantic fidelity loss is related to hallucination, bias, synonym smoothing, and context drift. All of them share the same root cause: input text is mapped into vector space and then mapped back into language. Humans don’t process language this way, so the mismatch between the two approaches shows up as what we perceive as errors.
Language is already a second-order encoding: a representation of perception and experience. What an LLM does is build yet another layer on top of that, mapping language into vectors of statistical relationships. Calling this meta-perception captures the idea: an LLM is grounded in derivatives of the patterns in how humans have described the world.
For example, a language model tends to substitute the deictic “this” for “that” because both sit close together in statistical space. To the model they are interchangeable. To us, they are not. That/This is the essence of semantic fidelity loss.
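To make "close together in statistical space" concrete, here is a minimal sketch using cosine similarity. The vectors are made-up toy numbers, not taken from any real model, and the names TOY_EMBEDDINGS and nearest_neighbor are hypothetical. It only illustrates how two words with nearly identical vectors become hard to tell apart.

```python
import math

# Hypothetical 4-dimensional embeddings. The numbers are invented for
# illustration and do not come from any real language model.
TOY_EMBEDDINGS = {
    "this": [0.82, 0.10, 0.55, 0.05],
    "that": [0.80, 0.14, 0.52, 0.09],
    "river": [0.05, 0.91, 0.10, 0.40],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


if __name__ == "__main__":
    for other in ("that", "river"):
        sim = cosine_similarity(TOY_EMBEDDINGS["this"], TOY_EMBEDDINGS[other])
        print(f"this vs {other}: {sim:.3f}")
    print("nearest to 'this':", nearest_neighbor("this", TOY_EMBEDDINGS))
```

In this toy setup "this" and "that" point in almost the same direction, so anything that works purely from vector proximity has little basis for preferring one over the other.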
One hypothetical solution to this, and in fact to all current error modes in LLMs, is to reduce the training set to a canonical set of English words via the LDV. All LLM processing would be done in LDV space, and we’d need a translator to get back to everyday English word soup.
LDV is the Longman Defining Vocabulary, a core set of about 2,200 words used to write the definitions in the Longman Dictionary of Contemporary English.
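As a rough sketch of what the "reduce to LDV space" step might look like, the snippet below rewrites text so that it uses only words from a defining vocabulary. Everything here is an assumption for illustration: DEFINING_VOCAB is a tiny stand-in, not the real LDV list, and GLOSSES is a hypothetical hand-written table of rewrites. The translator back to everyday English is not covered.

```python
import re

# Tiny stand-in for a defining vocabulary. This is NOT the real LDV list.
DEFINING_VOCAB = {
    "a", "the", "very", "big", "dog", "ran", "fast", "animal", "water",
}

# Hypothetical glosses: each out-of-vocabulary word is rewritten using
# only words that are in DEFINING_VOCAB.
GLOSSES = {
    "enormous": "very big",
    "sprinted": "ran very fast",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def to_restricted(text, vocab, glosses):
    """Rewrite text into the restricted vocabulary.

    Returns (tokens, unresolved): the rewritten token list, plus any words
    that were neither in the vocabulary nor covered by a gloss. Unresolved
    words are kept in the output so nothing is silently dropped.
    """
    tokens, unresolved = [], []
    for tok in tokenize(text):
        if tok in vocab:
            tokens.append(tok)
        elif tok in glosses:
            tokens.extend(glosses[tok].split())
        else:
            tokens.append(tok)
            unresolved.append(tok)
    return tokens, unresolved


if __name__ == "__main__":
    tokens, unresolved = to_restricted(
        "The enormous dog sprinted", DEFINING_VOCAB, GLOSSES
    )
    print(" ".join(tokens))
    print("unresolved:", unresolved)
```

The unresolved list is the interesting part of the sketch: it shows where a restricted vocabulary would need a gloss before the text could live entirely in LDV space.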