The Surprising Truth About LLM Generation

Every day, millions of people type prompts into systems like ChatGPT, Claude, or Grok and receive responses that feel almost human. What most people don’t realise is that the model has no idea what it is about to say — not the full sentence, not even the next word.

Large language models generate text one token at a time. Each token is selected as a probabilistic guess from a vocabulary that can exceed 100,000 possible options. There is no hidden script or preplanned response — only a sequence of decisions made step by step.

The Five-Step Generation Pipeline

When you send a prompt to an LLM, five core steps occur:

Step 1: Tokenization

LLMs do not read words; they read tokens. Tokenizers split text into efficient pieces based on patterns learned from large text corpora. Common words often map to a single token, while rarer or longer words are broken into subword units.

Token limits are therefore not word limits. A context window of 4,096 tokens typically corresponds to roughly 3,000 English words.
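To see this in practice, here is a small sketch using the open-source tiktoken library (one tokenizer among many; other models ship their own tokenizers and vocabularies). It encodes a short sentence and prints the resulting pieces:

```python
# Inspect tokenization with the open-source tiktoken library.
# This is one example tokenizer; other models split text differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits uncommon words into subword pieces."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
# Decode each id individually to see exactly how the text was split.
print([enc.decode([tid]) for tid in token_ids])
```

The exact split depends on the tokenizer, which is one reason token counts for the same text vary between models.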

Step 2: Embeddings and Meaning Space

Tokens themselves are just integers. To carry meaning, each token is mapped to a high-dimensional vector known as an embedding. These vectors position tokens within a semantic space where related concepts cluster together.

This is why the model places “Python” and “JavaScript” near each other as programming languages, while context later in the network pushes “Python” the snake toward animal-related concepts. These relationships are learned implicitly from data, not hard-coded as rules.
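As a toy illustration, the sketch below uses hand-made three-dimensional vectors in place of learned embeddings (real models use hundreds or thousands of dimensions, learned during training) and measures closeness with cosine similarity:

```python
# A toy semantic space: hand-made vectors stand in for learned embeddings.
import numpy as np

embeddings = {
    "python_lang": np.array([0.9, 0.8, 0.1]),
    "javascript":  np.array([0.8, 0.9, 0.2]),
    "snake":       np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means pointing in the same direction, 0.0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["python_lang"], embeddings["javascript"]))  # high
print(cosine_similarity(embeddings["python_lang"], embeddings["snake"]))       # low
```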

Step 3: The Transformer and Attention

Embedding vectors flow into the transformer network. The key mechanism here is attention. Attention allows the model to decide which parts of the input are most relevant when processing each token.

For example, in the sentence “The cat sat on the mat because it was tired,” attention correctly links “it” to “the cat,” not “the mat.” This behaviour emerges from learned patterns across millions of examples.
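The sketch below shows the core arithmetic, scaled dot-product attention, with random vectors standing in for token representations. Real transformers add learned query, key, and value projections, multiple attention heads, and many stacked layers:

```python
# A minimal sketch of scaled dot-product attention, the core operation
# inside a transformer layer. Values are random and purely illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of every token to every other token
    weights = softmax(scores, axis=-1)   # each row is a distribution that sums to one
    return weights @ V, weights          # weighted mix of value vectors

# Four tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
output, weights = attention(x, x, x)
print(weights.round(2))  # each row shows where one token "looks"
```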

Step 4: Probability Distributions

After processing the context, the model assigns a score to every possible next token in its vocabulary. These scores are converted into probabilities using a softmax function, producing a distribution that sums to one.

The model does not choose what to say. It produces a probability landscape. The final output is one path through this enormous space of possibilities.
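Here is a minimal sketch of that conversion, using a made-up four-word vocabulary and hypothetical logits (real vocabularies run to 100,000+ entries):

```python
# Turn raw next-token scores (logits) into a probability distribution
# with softmax. The vocabulary and scores here are made up for illustration.
import numpy as np

vocab = ["mat", "sofa", "roof", "moon"]
logits = np.array([3.2, 2.1, 0.4, -1.5])   # hypothetical scores for the next token

probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token:>5}: {p:.3f}")
print("sum =", probs.sum())  # one, up to floating-point rounding
```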

Step 5: Sampling and the Autoregressive Loop

Sampling selects one token from the probability distribution. Parameters such as temperature and top-p control how conservative or exploratory this selection is.

Once a token is selected, it is appended to the input, and the entire process repeats. This autoregressive loop continues until an end-of-sequence token is produced or a length limit is reached.
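Here is a sketch of how temperature and top-p interact during sampling, applied to the same toy distribution as above. The function and its cutoff logic are illustrative, not any particular library's API:

```python
# A sketch of temperature and top-p (nucleus) sampling over a toy distribution.
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng()

    # Temperature rescales the logits: <1 sharpens the distribution, >1 flattens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p keeps the smallest set of tokens whose combined probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]

    kept_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=kept_probs)

vocab = ["mat", "sofa", "roof", "moon"]
logits = np.array([3.2, 2.1, 0.4, -1.5])
print(vocab[sample_next_token(logits, temperature=0.7, top_p=0.9)])
```

Lower temperatures concentrate probability on the most likely tokens, while top-p discards the unlikely tail entirely. In a real system the chosen token is appended to the context and the whole pipeline runs again to pick the next one.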

Three Practical Insights

LLMs are not magic. They are mechanisms, and three practical points fall out of the pipeline above. First, token limits are not word limits, so budget your context in tokens rather than words. Second, output is sampled from a probability distribution, which is why the same prompt can produce different responses on different runs. Third, sampling parameters such as temperature and top-p are the main levers for trading consistency against variety. Understanding those mechanisms makes you a better builder.