Tokens
There’s a lot of discussion about AI cost today, and much of it centers on tokens. With so many cost drivers in the AI industry, how did they all get rolled into one measure? Are tokens accepted as the logical unit for measuring AI simply because every leading AI company charges for tokens on a $/M (dollars per million tokens) basis? Let’s dig into what tokens are and see how they hold up as a meaningful proxy for GenAI’s work.
Image credit: Shubham Dhage via free license from Unsplash
For LLMs, a “token” simply represents a chunk of text. As AI companies hoovered up the Internet, vast quantities of text were subdivided into vocabularies of roughly 100-200k unique chunks (each distinct group of characters is one specific token). LLMs perform what is called “next token prediction”: given a sequence of tokens, they predict which token will follow. In a simple LLM implementation, “tokens in” and “tokens out” represent the work of the transformer model that does the prediction. In the early days of text-based chatbots, these token in/out measures made sense as a proxy for the work of the machine and as an understood transaction between AI company sellers and enterprise chatbot buyers. In those days, AI was usually sold via SaaS-style per-seat fixed pricing, and other than for API use, tokens were mainly used to set account usage thresholds or caps.
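To make the $/M arithmetic concrete, here is a minimal sketch of how a bill might be derived from “tokens in” and “tokens out”. The function name and the rates used are hypothetical, chosen only for illustration; they are not any provider’s published prices.

```python
def token_cost_usd(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Estimate the cost of a request from token counts and $/M rates.

    All rates are hypothetical inputs, expressed in dollars per
    one million tokens.
    """
    total = tokens_in * usd_per_m_in + tokens_out * usd_per_m_out
    return total / 1_000_000


# Illustrative month of usage: 2M input tokens and 0.5M output tokens,
# at assumed rates of $3/M (input) and $15/M (output).
monthly_cost = token_cost_usd(2_000_000, 500_000, 3.0, 15.0)
print(f"Estimated cost: ${monthly_cost:.2f}")
```

Note that output tokens are priced separately from input tokens in this sketch, so the same total token count can produce very different bills depending on the mix.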
As the AI industry matures, some problems are emerging with a token-centric paradigm.
First, there has never been a standard definition of a token. The Linux Foundation just initiated the Tokenomics foundation this month to work on such standards. AI companies can arbitrarily generate variable numbers of tokens in responses. Companies can redefine the meaning and basis of a “token” at will, even as inflexible enterprise contracts depend on tokens as the basis for performance and pricing.
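The lack of a standard definition can be illustrated with a toy example. The two tokenizers below are hypothetical and far simpler than anything used in production (real vocabularies are learned from data), but they show how the same sentence yields different token counts depending on how a “token” is defined.

```python
import re


def word_tokens(text):
    """Toy scheme A: each word and each punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_tokens(text, size=4):
    """Toy scheme B: every fixed-size run of characters is one token."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Tokens are not a standard unit."
count_a = len(word_tokens(text))
count_b = len(chunk_tokens(text))

# Same text, two definitions of "token", two different billable counts.
print(f"scheme A: {count_a} tokens, scheme B: {count_b} tokens")
```

If pricing is quoted per token, a change in the tokenization scheme alone changes the bill, even though the text and the work requested are identical.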
Second, even if people were able to agree on how to define a token, it’s not clear that the components of AI resource use or value can be represented in a single unit. Today’s AI models aren’t simple LLMs. Tokens do not represent any consistent unit of work, production, resource utilization or output from one model to another. They are arbitrary in how they are constructed, counted, and valued. Another parameter the industry calls “effort” (perhaps analogous to intensity and duration) influences token consumption for a given prompt. In addition, will tokens be made to account for different security levels or quality assurance levels? If an AI model runs inside a complex harness with access to built-in tools and licensed resources, how will its operating costs be measured in token-compatible units? The next few years will see AI companies releasing thousands of embedded tools and services that complement the work of their AI models. How these subsystems will affect token costs is not clear.
Third, how models are defined and how all the components operate together is controlled opaquely by the companies, and can change from day to day. Some things have always been fairly clear to customers. For example, we know that smaller, faster models use less compute than larger, slower ones. Published rates and public benchmarks allow end users to deploy the right model for the complexity of the work and their budget. But then there are “thinking models”. So-called thinking models burn internal tokens that are difficult or impossible to verify. Since model thinking is not reliably auditable, users bear the risk when, for example, a model’s thinking goes haywire.
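A rough sketch of why internal thinking tokens matter for cost follows. It assumes, purely for illustration, that hidden thinking tokens are billed at the same rate as visible output tokens; the function name, token counts, and rate are all hypothetical.

```python
def thinking_bill(visible_tokens, thinking_tokens, usd_per_m_out):
    """Return (cost in USD, share of billed output the user never sees).

    Assumption: thinking tokens are billed like ordinary output tokens.
    """
    billed = visible_tokens + thinking_tokens
    cost = billed * usd_per_m_out / 1_000_000
    hidden_share = thinking_tokens / billed if billed else 0.0
    return cost, hidden_share


# Illustrative request: a 400-token answer preceded by 3,600 hidden
# thinking tokens, at an assumed output rate of $15/M.
cost, hidden_share = thinking_bill(400, 3_600, 15.0)
print(f"cost=${cost:.4f}, hidden share of billed output={hidden_share:.0%}")
```

Under these assumed numbers, most of the billed output is text the customer cannot inspect, which is the auditing problem described above.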
It’s easy to lose track of how early we are in the AI era. It will be interesting to see how things evolve as we leave the per-seat era for the usage era of GenAI. Hopefully the AI industry will find a way to represent usage through reliable measures. For now, one thing is clear: enterprises need to measure and learn what works and what doesn’t, and manage their AI expenses closely.