But what ARE embeddings, really?
Think of embeddings as coordinates in meaning-space. Every word, sentence, or document gets converted into a list of numbers (a vector) that captures its semantic essence.
“cat” → [0.2, 0.8, 0.1, 0.4, …] (hundreds or thousands of numbers)
“kitten” → [0.19, 0.79, 0.09, 0.41, …] (very close!)
“car” → [0.7, 0.1, 0.9, 0.2, …] (far away)
Words with similar meanings get similar coordinates. It’s like GPS, but for concepts.
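To make "similar coordinates" concrete, here is a minimal sketch that measures closeness with cosine similarity, one common choice for comparing embeddings. It uses only the first four numbers of the toy vectors above, which are illustrative values and not output from any real model. The function name cosine_similarity is our own.

```python
import math

# Toy 4-number "embeddings" taken from the example above.
# Real embeddings have hundreds or thousands of numbers.
cat = [0.2, 0.8, 0.1, 0.4]
kitten = [0.19, 0.79, 0.09, 0.41]
car = [0.7, 0.1, 0.9, 0.2]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors.

    Values near 1.0 mean the vectors point the same way;
    lower values mean they point in different directions.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print("cat vs kitten:", round(cosine_similarity(cat, kitten), 3))
print("cat vs car:   ", round(cosine_similarity(cat, car), 3))
```

With these toy numbers, "cat" and "kitten" score very close to 1, while "cat" and "car" score far lower.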
OpenAI’s text-embedding-3-large model, for example, outputs 3,072 dimensions by default. That is, each piece of text gets 3,072 numbers in its GPS coordinates.
Traditional keyword search is dumb: it matches strings, not meanings. If you search for “automobile” and the document only says “car,” a pure keyword match finds nothing.
Embeddings are smart. They know “automobile” and “car” point to nearly the same spot in meaning-space. So when you search, you find what you meant, not just what you said.
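Here is a rough sketch of that difference. Everything in it is hypothetical: the two documents, the hand-written 4-number vectors in TOY_EMBEDDINGS, and the function names. A real system would get its vectors from an embedding model instead of a lookup table.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# A real system would compute these with an embedding model.
TOY_EMBEDDINGS = {
    "automobile": [0.69, 0.12, 0.88, 0.21],
    "The car would not start this morning.": [0.7, 0.1, 0.9, 0.2],
    "My kitten sleeps all day.": [0.19, 0.79, 0.09, 0.41],
}

DOCS = [
    "The car would not start this morning.",
    "My kitten sleeps all day.",
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (near 1.0 = very similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_search(query, docs):
    """Return the documents that literally contain the query string."""
    return [d for d in docs if query.lower() in d.lower()]


def semantic_search(query, docs):
    """Return the document whose toy embedding is closest to the query's."""
    query_vec = TOY_EMBEDDINGS[query]
    return max(docs, key=lambda d: cosine_similarity(query_vec, TOY_EMBEDDINGS[d]))


print(keyword_search("automobile", DOCS))   # no literal match
print(semantic_search("automobile", DOCS))  # closest in meaning-space
```

The keyword search comes back empty because no document contains the string "automobile", while the embedding lookup picks the car sentence because its vector sits closest to the query's.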
This is the basis of how LLMs see words: they measure meaning instead of spelling.