Machine learning systems based on LLMs (large language models) are everywhere! Many of us use systems like OpenAI’s ChatGPT, Anthropic’s Claude, Meta’s Llama, Google’s Gemini, or Microsoft’s Copilot directly every day. Even more of us encounter them indirectly through services or systems that we work with – whether that is using a search engine, composing an email, or dealing with customer service. And some of us are working on building or integrating them into new applications.
But most of us know very little about what is going on behind the curtain or inside the black box. Interacting with them sometimes feels like interacting with a human, but they also have some very non-human behaviours and quirks. Understanding even a little about how they work helps us to make sense of their surprising behaviours and enables us to use these systems more effectively.
In this article, I’m going to slide back the lid of the black box to shed some light on what’s inside. I will try to be clear and make this accessible for anyone who wants to learn about how LLMs work. There won’t be any math, calculus, or complex algorithms here. If you are looking for a technical discussion, this is not the place! I will also be focusing on LLMs that deal with language only, rather than multi-modal systems that can work with audio, images, or video. I will also leave an explanation of how these models are trained for another day.
Over the course of a few articles, we’re going to look at:
- The basic loop an LLM uses to create an answer.
- The output text and how it is chosen.
- The prompt and how it is often structured.
- The differences between words, tokens, and embeddings.
This should help us to understand why they give different responses to the same prompt, why the response is highly sensitive to the first words generated, why long answers take more time to generate, and how a chatbot remembers your discussion. Along the way, we’ll pick up some tips for making better prompts.
The Basic Loop
All current LLMs and chatbots work in the same way. They receive a prompt. For the moment, think of this as some instructions combined with a question. They produce a sequence of words in response. We read those words and interpret them as an answer to our question or as the next step in a longer conversation. Most of the time, the words that come back are sensible and the sort of thing that a well-informed person might say. This reinforces the impression that they are truly intelligent.
There are several simplifications in this account! We’re going to unpack four of them in this article. I’m going to save the biggest for the end – the input and output aren’t really words. But for the moment, let’s pretend that they are.
One of the remarkable and most non-human aspects of an LLM is that it is not an understanding machine but a prediction engine. It predicts the next word in its response. At each step, its input has two parts – the prompt and the words that form its response so far. At the beginning, the response is empty. You can imagine how this goes. Suppose the prompt is “Space, the final frontier.” On the first iteration, the LLM might predict “These” as a highly likely next word. On the next iteration, the input would be “Space, the final frontier. These”. In response to this, it might predict “are” to be highly likely – followed by “the”, “voyages”, “of”, “the”, “Starship”, and so on.
In fact, the LLM doesn’t just predict the most likely next word given the input. It predicts a likelihood for every word that it knows of being the next word. So, “Enterprise” will be predicted with a high likelihood, “banana” will be much less likely, and “gastroenterologist” even less so.
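To make this concrete, here is a tiny sketch in Python. The function name `predict_next_word_probabilities` and all of the numbers are invented for illustration; a real model computes these likelihoods itself, over a vocabulary of tens of thousands of entries.

```python
# A toy stand-in for the real model. Everything here is made up for illustration:
# a genuine LLM computes these likelihoods itself, over its whole vocabulary.
def predict_next_word_probabilities(prompt: str, response_so_far: str) -> dict[str, float]:
    """Return a likelihood for every word the 'model' knows of coming next."""
    # Pretend the input ends "...the voyages of the Starship".
    return {
        "Enterprise": 0.91,             # very likely
        "Voyager": 0.05,
        "banana": 0.0004,               # much less likely
        "gastroenterologist": 0.00001,  # and even less so
        # ...plus a likelihood for every other word in the vocabulary
    }

probs = predict_next_word_probabilities(
    "Space, the final frontier.",
    "These are the voyages of the Starship",
)
```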
It is worth noting that the core LLM is deterministic. That is, for a given prompt and partial response, it will predict exactly the same likelihoods for the next word every time. There is no creativity going on here; the predictions are totally fixed!
There is a separate component that looks at the likelihoods and selects the next word to add to the response. It could just pick the most likely next word, but that would be super boring. Imagine if every high school student who had ChatGPT write an essay for them turned in the exact same one.
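Using the toy `probs` dictionary from the sketch above, picking the most likely word is a one-liner, and it gives the same answer every single time:

```python
# Greedy selection: always take the single most likely word.
# Run it a thousand times and you get "Enterprise" a thousand times.
next_word = max(probs, key=probs.get)  # -> "Enterprise"
```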
To make the responses more diverse, LLM implementations have some dials, called hyper-parameters, that influence how varied and surprising a response is. Three commonly used ones (illustrated in the sketch after this list) are:
- Temperature. If the temperature is set to zero, the most likely word is always picked. A setting of one uses the likelihoods exactly as predicted. So, if the first word has a likelihood of 25%, the second 15%, and the third 10%, then the second word will be selected 15% of the time. A setting of more than one flattens the distribution, so the most likely words are selected less often and the less likely ones are chosen (relatively) more often. Cranking this dial up too high can give some nutty or malformed responses.
- Top P. This limits the choices to a portion of the distribution. If the value is 1.0, then all of the words are included. If the value is 0.5, then only the most likely words whose likelihoods add up to 50% of the distribution are considered.
You can see how these two work nicely together. Using a lower value for Top P and a higher value for Temperature allows lots of creative choice within a limited (hopefully sensible) subset of the possible words.
- Repetition penalty. This makes a word less likely to be chosen again once it has already appeared in the response. It helps to reduce redundancy and matches our expectations for good writing.
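Here is one way those three dials could work, as a simplified sketch rather than the code any particular system actually uses. Real implementations usually apply these adjustments to the model’s raw scores before they are turned into probabilities; working directly on the probabilities, as below, keeps the idea visible without the extra machinery.

```python
import random

def sample_next_word(probs: dict[str, float],
                     temperature: float = 1.0,
                     top_p: float = 1.0,
                     used_words: frozenset[str] = frozenset(),
                     repetition_penalty: float = 1.0) -> str:
    """Pick one word from a likelihood distribution (a simplified sketch)."""
    # Repetition penalty: scale down any word that has already been used.
    adjusted = {w: (p / repetition_penalty if w in used_words else p)
                for w, p in probs.items()}

    # Temperature 0 means "no randomness at all": just take the top word.
    if temperature == 0:
        return max(adjusted, key=adjusted.get)

    # Temperature: below 1 exaggerates the gap between likely and unlikely
    # words; above 1 flattens it, so unlikely words get picked more often.
    adjusted = {w: p ** (1.0 / temperature) for w, p in adjusted.items()}
    total = sum(adjusted.values())
    adjusted = {w: p / total for w, p in adjusted.items()}

    # Top P: keep only the most likely words whose shares add up to top_p.
    kept, running_total = {}, 0.0
    for word, p in sorted(adjusted.items(), key=lambda item: item[1], reverse=True):
        kept[word] = p
        running_total += p
        if running_total >= top_p:
            break

    # Finally, draw one word at random, weighted by what is left.
    words, weights = zip(*kept.items())
    return random.choices(words, weights=weights, k=1)[0]

# Example with made-up likelihoods: quite adventurous (high temperature),
# but only among the most plausible words (low Top P).
toy_probs = {"Enterprise": 0.45, "Voyager": 0.30, "Defiant": 0.20, "banana": 0.0004}
print(sample_next_word(toy_probs, temperature=1.3, top_p=0.5))
```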
To summarise: the output of an LLM system is composed over many iterations. Each time through the loop, the LLM predicts the likelihood of every possible next word, and a separate component picks one and adds it to the partial response. The prompt and partial response are then fed back into the LLM.
The loop stops either because the next word is a special one that signals the response is complete, or because the maximum length has been reached. We’ll come back to special tokens like this later.
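Putting the pieces together, the whole outer loop can be sketched in a few lines, reusing the two hypothetical functions from the earlier sketches. The special “<end-of-response>” word is also an invented name; every real system has its own version of it.

```python
def generate_response(prompt: str, max_length: int = 200) -> str:
    """Sketch of the outer loop: predict, pick a word, append, repeat."""
    response_words: list[str] = []
    for _ in range(max_length):  # stop at the maximum length...
        # The deterministic part: a likelihood for every known next word.
        probs = predict_next_word_probabilities(prompt, " ".join(response_words))
        # The part with the dials: pick one word from that distribution.
        word = sample_next_word(probs, temperature=0.8, top_p=0.9,
                                used_words=frozenset(response_words))
        if word == "<end-of-response>":  # ...or when the model signals it is done
            break
        response_words.append(word)
    return " ".join(response_words)
```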
So now we know quite a bit about the inner loop of an LLM. We also know about some of the dials that we can use to influence its output. There is one additional thing to mention. You may have seen some LLM implementations offer several different responses. They do this by generating several responses at once. Remember – the predictions are completely determined by the input, so it is easy to keep a list of partial responses and switch between them as you like. They will have made some different choices early on, so they become increasingly different as they go along. Suppose the prompt is “Once upon a time a”. The next word might be “princess”, “dragon”, or “boy” – and each would give rise to a very different continuation.
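A simple way to get this effect (the bookkeeping in real systems is cleverer, but the principle is the same) is to let the word-picker make different random choices on each run of the same loop:

```python
# Three runs of the same loop with the same prompt. The model's predictions are
# identical each time; only the random word choices differ, so the stories diverge.
stories = [generate_response("Once upon a time a") for _ in range(3)]
```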
If you’ve enjoyed this article, please let me know in the comments. If you have any questions or topics that you would like us to write up, let us know. And if there is anything that we could do to make this clearer or easier to understand, that would be especially helpful.
In the next article in this series, we’re going to look at tokens and embeddings.