A mixed salad of names in the AI world
So You Want to Know How Big an AI Is? Good Luck With That.
A brief, chaotic history of how the AI industry kept changing the ruler
There is a long and proud tradition in technology of inventing new words for things that already have words. AI researchers, never ones to be outdone, took this tradition and absolutely ran with it — except instead of running in one direction, they ran in six directions simultaneously, dropped the baton twice, and then argued about whether “baton” was even the right unit of measurement.
This is the story of how we went from symbols to parameters to tokens, and why none of it makes as much sense as anyone would like you to believe.
Act I: Symbols (or, “We Shall Name Every Thought”)
It starts, as most overconfident projects do, in the 1950s.
Early AI researchers — brilliant people wearing very sensible sweaters — believed that intelligence was fundamentally about symbols. Not the emoji kind. The logical kind. The idea was that if you just gave a machine enough symbolic rules about the world, it would eventually think. Like a very patient librarian who has read every book and can therefore answer any question.
The unit of this era was the symbol: a discrete, meaningful atom of knowledge. A thing. A concept. A node in a vast web of logic.
It was elegant. It was clean. It was completely wrong about how intelligence works.
But it did give us chess computers, so we kept some of it.
Act II: Words (or, “Let’s Just Count the Words, How Hard Can It Be”)
By the 1980s and 1990s, a different crowd had taken over. These were the statisticians, and they had a refreshingly humble proposal: forget meaning. Just count words.
The unit became the word. Then the n-gram (a sequence of n words, because why use one word when you can use a cluster of words and slap a Greek-flavored suffix on it). You trained a model by feeding it text and telling it: predict the next word. Do this enough times and eventually something useful happens.
This era gave us autocomplete, spam filters, and the unsettling feeling that maybe language is just statistics in a trench coat.
The measurement of a model’s power? Usually just: how much text did you feed it, and how surprised does it get by new sentences. (Technically called perplexity, which is also what your non-technical friends feel when you explain any of this to them.)
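For the curious, perplexity is less mystical than it sounds: take the model's average surprise (negative log probability) per predicted word, then exponentiate it. A minimal sketch with made-up numbers:

```python
import math

# Probabilities a toy language model assigned to each word it was asked to predict.
# (Hypothetical numbers, purely to show the mechanics.)
next_word_probs = [0.20, 0.05, 0.50, 0.10]

avg_log_prob = sum(math.log(p) for p in next_word_probs) / len(next_word_probs)
perplexity = math.exp(-avg_log_prob)

print(f"perplexity ≈ {perplexity:.1f}")  # about 6.7 here; lower means less surprised
```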
Act III: Parameters (or, “My Number Is Bigger Than Your Number”)
Then came deep learning, and with it, the glorious era of the parameter race.
A parameter is a number inside a neural network — one of millions, then billions, of tiny numerical dials that get tuned during training. More parameters, the thinking went, meant a smarter model. And so the announcements began.
“Our model has 1 billion parameters.”
“Oh yeah? Ours has 10 billion.”
“We have 175 billion.” (That was GPT-3 in 2020, said with the energy of someone dropping a microphone they definitely rehearsed dropping.)
“We have 540 billion.” (That was Google’s PaLM, said with the energy of someone who watched the first person drop the microphone and went home to practice.)
The parameter count became the AI industry’s version of horsepower — a number that sounds impressive, that nobody fully understands in context, and that manufacturers love to put in headlines. It was the chest-puffing unit of the 2017–2022 era, and every lab leaned into it hard.
The only problem: more parameters didn’t always mean better. A smaller model trained on better data could humiliate a larger one. The ruler was wrong. Again.
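For the record, those headline numbers are just a tally of every weight and bias in the network. A toy sketch, assuming PyTorch purely for illustration:

```python
import torch.nn as nn

# A tiny two-layer network, just to show where a "parameter count" comes from.
model = nn.Sequential(
    nn.Linear(1024, 4096),  # weights: 1024 * 4096, biases: 4096
    nn.ReLU(),
    nn.Linear(4096, 1024),  # weights: 4096 * 1024, biases: 1024
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # 8,393,728, a rounding error by 2020s standards
```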
Interlude: FLOPs (or, “Actually, What Matters Is How Much You Spent”)
While the parameter race was raging, a more austere faction of researchers insisted that the real measure of an AI was FLOPs — floating-point operations — a measure of how much computation was spent training it.
FLOPs are what you use when you want to sound serious at a conference. Nobody at a dinner party has ever said “oh that model? barely two exaFLOPs.” And yet, in the right rooms, this was the number that mattered.
FLOPs remain the preferred unit of researchers who find parameters too gauche and tokens too trendy. There are three of them. They are correct, and they are not having fun.
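If you want to play along at home, there is a widely used rule of thumb that training costs roughly six FLOPs per parameter per token. A back-of-the-envelope sketch, not gospel:

```python
# Rough training compute, using the common "6 FLOPs per parameter per token" rule of thumb.
params = 175e9   # a GPT-3-scale model
tokens = 300e9   # roughly GPT-3's reported training set size
flops = 6 * params * tokens

print(f"~{flops:.2e} FLOPs")  # ~3.15e+23, the kind of number you only say out loud at conferences
```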
Interlude II: WordPiece and BPE (or, “A Word Is Too Big, Let’s Cut It Up”)
Somewhere in the mid-2010s, researchers quietly introduced a problem and its solution in the same paper.
The problem: words are inconsistent units. “run”, “running”, “runner” are related but treated as totally separate things. Languages with long compound words (German says hello) make word-based models weep. And what about typos? Emojis? Made-up words?
The solution: don’t use whole words. Instead, slice words into sub-word pieces using algorithms called WordPiece (Google’s invention, used in BERT) and BPE — Byte Pair Encoding (originally a data compression trick that someone brilliantly repurposed for language).
Under these schemes, “running” might become “run” + “##ning”. “tokenization” might become “token” + “##ization”. A very long German word becomes a small pile of syllable-shaped confetti.
This was genuinely clever and also, in hindsight, a clear sign that everyone was building up to just calling the pieces tokens and being done with it.
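Before we get there, here is the trick in miniature: a toy sketch of the byte-pair merging idea, which repeatedly glues together the most frequent adjacent pair of symbols. It is nothing like a production tokenizer, but the spirit is right:

```python
from collections import Counter

# Toy corpus: each word split into characters, plus an end-of-word marker.
words = ["run", "running", "runner", "runs"]
corpus = [list(w) + ["</w>"] for w in words]

def most_common_pair(corpus):
    """Count every adjacent pair of symbols and return the most frequent one."""
    pairs = Counter()
    for symbols in corpus:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of the winning pair with a single fused symbol."""
    merged = []
    for symbols in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Two merges are enough for "run" to congeal into a single reusable piece.
for _ in range(2):
    corpus = merge(corpus, most_common_pair(corpus))

print(corpus)
# [['run', '</w>'], ['run', 'n', 'i', 'n', 'g', '</w>'],
#  ['run', 'n', 'e', 'r', '</w>'], ['run', 's', '</w>']]
```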
Act IV: Tokens (or, “We Give Up Naming Things, Here Is a Token”)
And then OpenAI said: tokens.
Not words. Not symbols. Not BPE units. Tokens. A deliberately vague, deliberately neutral term for “whatever chunk the model processes.” Could be a word. Could be part of a word. Could be a punctuation mark. Could be a space. The model doesn’t care, and frankly neither does the marketing department.
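To see what that vagueness looks like in practice, here is a sketch using OpenAI's open-source tiktoken tokenizer (assuming it is installed; the exact splits depend on which encoding you pick):

```python
# Assumes the tiktoken package is installed: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-3.5/4-era models
token_ids = enc.encode("Tokenization is perplexing, isn't it?")
print([enc.decode([t]) for t in token_ids])
# Something like: ['Token', 'ization', ' is', ' perplex', 'ing', ',', " isn't", ' it', '?']
# Whole words, word fragments, punctuation, leading spaces: all just "tokens" to the model.
```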
GPT-3 launched in 2020 and the framing shifted overnight. Suddenly the flex wasn’t how many parameters your model had. It was how many tokens you trained it on. Then the Chinchilla paper (2022) dropped the bombshell that smaller models trained on more tokens beat larger models trained on fewer — and the token count became the new arms race.
“Trained on 1 trillion tokens.”
“Trained on 15 trillion tokens.”
The number goes up. The unit changes. The vibe remains the same.
Where We Are Now
Here is the honest state of things:
Symbols still live inside knowledge graphs and some reasoning systems. The sweater-wearing logicians weren’t entirely wrong.
Words are still how humans think about language, even if machines quietly disagree.
Parameters are still reported, still matter, and still get cited in every press release — but everyone now knows they don’t tell the whole story.
FLOPs are still the unit of choice for people who want to feel rigorous. Respect.
Tokens are the current winner. They appear on pricing pages, in capability benchmarks, in headlines. “Context window of 200,000 tokens.” “Costs $3 per million tokens.” The token has won, for now.
Until it hasn’t, and someone invents a new unit. Probably something like “cognitive cycles” or “semantic quanta” or, god help us, “intelligence points.”
The Moral of the Story
Every era of AI invented a measurement unit that captured something real but not everything. Symbols captured structure. Words captured language. Parameters captured scale. FLOPs captured cost. Tokens captured data.
None of them captured intelligence — which is, depending on who you ask, either the whole point or the whole problem.
The industry will keep changing the ruler, because the thing being measured keeps changing too. And somewhere, right now, a researcher is writing a paper introducing a new unit, convinced this time they’ve finally found the right one.
They have not. But the paper will be very well cited.
If you made it this far: congratulations, you now know more about AI measurement units than most people who write about AI. Use this power responsibly, or at least entertainingly.
