The next time you use a tool like ChatGPT or Perplexity, stop and count the total words generated to fulfill your request. Each word results from a process called inference, the revenue-generating mechanism of AI systems, and each one can be analyzed using basic financial and economic principles. The goal of this economic analysis is to ensure that the AI systems we design and deploy into production are capable of sustainable, positive outcomes for a business.
The Economics of AI Inference
The goal of performing economic analysis on AI systems is to ensure that production deployments are capable of sustained positive financial outcomes. Since today’s most popular mainstream applications are built on text-generation models, we adopt the token as our core unit of measure. Tokens are the small units of text (words or subwords) that models convert into numerical representations; language models consume input sequences of tokens and produce output tokens to formulate responses.
When you ask an AI chatbot, “What are traditional home remedies for the flu?” that phrase is first split into tokens and converted into vector representations that are passed through a trained model. As these vectors flow through the system, millions of parallel matrix computations extract meaning and context to determine the most likely combination of output tokens for an effective response.
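As a rough illustration, here is a minimal sketch of that first conversion step using the tiktoken library and its cl100k_base encoding (an assumption; exact token boundaries and counts vary by tokenizer):

```python
# Minimal sketch: converting a user request into tokens and counting them.
# Assumes the `tiktoken` library is installed; token counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
prompt = "What are traditional home remedies for the flu?"

token_ids = enc.encode(prompt)       # text -> integer token IDs
print(len(token_ids))                # number of input tokens processed
print(enc.decode(token_ids[:3]))     # first few tokens decoded back to text
```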
We can think about token processing as an assembly line in an automobile factory. The factory’s effectiveness is measured by how efficiently it produces vehicles per hour. This efficiency makes or breaks the manufacturer’s bottom line, so measuring, optimizing, and balancing it with other factors is paramount to business success.
Price-Performance vs. Total Cost of Ownership
For AI systems, particularly large language models, we measure the effectiveness of these “token factories” through price-performance analysis. Price-performance differs from total cost of ownership (TCO) because it’s an operationally optimizable measure that varies across workloads, configurations, and applications, whereas TCO represents the cost to own and operate a system.
In AI systems, TCO primarily consists of compute costs—typically GPU cluster lease or ownership costs per hour. However, TCO analysis often omits the significant engineering cost of maintaining service level agreements (SLAs), including debugging, patching, and system augmentation over time. Tracking engineering time remains challenging even for mature organizations, which is why it’s typically excluded from TCO calculations.
As with any production system, focusing on optimizable parameters provides the greatest value. Price-performance or power-performance metrics enable us to measure system efficiency, evaluate different configurations, and establish efficiency baselines over time. The two most common price-performance metrics for language model systems are cost efficiency (tokens per dollar) and energy efficiency (tokens per watt).
Tokens per Dollar: Cost Efficiency
Tokens per dollar (tok/$) expresses how many tokens you can process for each unit of currency spent, integrating your model’s throughput with compute costs:
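$$\text{tokens per dollar} \;=\; \frac{\text{tokens/s}}{\$/\text{second of compute}}$$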
Where tokens/s is your measured throughput, and $/second of compute is your effective cost of running the model per second (e.g., GPU-hour price divided by 3,600).
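As a minimal sketch of this calculation in code, assuming an illustrative throughput of 3,000 tokens/s and a hypothetical GPU price of $4.00/hour (both placeholders, not benchmarks):

```python
# Minimal sketch of the tokens-per-dollar calculation.
# The throughput and GPU-hour price are illustrative assumptions;
# substitute your own measurements and contract pricing.
throughput_tok_per_s = 3_000           # measured tokens/s
gpu_hour_price_usd = 4.00              # hypothetical $/GPU-hour

cost_per_second = gpu_hour_price_usd / 3_600        # $/second of compute
tokens_per_dollar = throughput_tok_per_s / cost_per_second

print(f"{tokens_per_dollar:,.0f} tokens per dollar")  # ~2,700,000 tok/$
```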
Here are some key factors that determine cost efficiency:
Model size: Larger models, despite generally having better language modeling performance, require much more compute per token, directly impacting cost efficiency.
Model architecture: In dense architectures (traditional LLMs), compute per token grows linearly or superlinearly with model depth and layer size. Mixture-of-experts architectures (newer sparse LLMs) decouple per-token compute from parameter count by activating only select parts of the model during inference, making them arguably more efficient.
Compute cost: TCO varies significantly between public cloud leasing and private data center construction, depending on system costs and contract terms.
Software stack: Significant optimization opportunities exist here—selecting an optimal inference framework, tuning distributed inference settings, and applying kernel optimizations can dramatically improve efficiency. Open source frameworks like vLLM, SGLang, and TensorRT-LLM provide regular efficiency improvements and state-of-the-art features.
Use-case requirements: Customer service chat applications typically process fewer than a few hundred tokens per complete request. Deep research or complex code-generation tasks often process tens of thousands of tokens, driving costs significantly higher. This is why services cap daily token usage or restrict deep research tools, even on paid plans.
To further refine cost efficiency analysis, it’s practical to separate the compute resources consumed for the input (context) processing phase and the output (decode) generation phase. Each phase can have distinct time, memory, and hardware requirements, affecting overall throughput and efficiency. Measuring cost per token for each phase individually enables targeted optimization—such as kernel tuning for fast context ingestion or memory/cache improvements for efficient generation—making operation cost models more actionable for both engineering and capacity planning.
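A minimal sketch of that phase-separated accounting, with hypothetical phase throughputs, prices, and request sizes standing in for real measurements, might look like this:

```python
# Minimal sketch: splitting cost per token into prefill (context) and decode
# (generation) phases. All numbers are hypothetical placeholders.
def cost_per_token(throughput_tok_per_s: float, dollars_per_hour: float) -> float:
    """Cost of one token for a phase, given its throughput and compute price."""
    dollars_per_second = dollars_per_hour / 3_600
    return dollars_per_second / throughput_tok_per_s

prefill_cost = cost_per_token(throughput_tok_per_s=20_000, dollars_per_hour=4.00)
decode_cost = cost_per_token(throughput_tok_per_s=3_000, dollars_per_hour=4.00)

# A request with a long context but a short answer is dominated by prefill cost;
# a long-form answer is dominated by decode cost.
request_cost = 8_000 * prefill_cost + 300 * decode_cost
print(f"prefill: ${prefill_cost:.2e}/tok, decode: ${decode_cost:.2e}/tok, "
      f"request: ${request_cost:.4f}")
```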
Tokens per Watt: Energy Efficiency
As AI adoption accelerates, grid power has emerged as a chief operational constraint for data centers worldwide. Many facilities now rely on gas-powered generators for near-term reliability, while multigigawatt nuclear projects are underway to meet long-term demand. Power shortages, grid congestion, and energy cost inflation are directly impacting feasibility and profitability, making energy efficiency analysis a critical component of AI economics.
In this environment, tokens per watt-second (TPW), or equivalently tokens per joule, becomes a critical metric for capturing how infrastructure and software convert energy into useful inference output. TPW not only shapes TCO but increasingly governs the environmental footprint and growth ceiling of production deployments. Maximizing TPW means more value per joule of energy, making it a key optimizable parameter for achieving scale. We can calculate TPW using the following equation:
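$$\text{TPW} \;=\; \frac{\text{tokens/s}}{\text{average power draw (watts)}} \;=\; \text{tokens per joule}$$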
Let’s consider an ecommerce customer service bot, focusing on its energy consumption during production deployment. Suppose its measured operational behavior is:
Tokens generated per second: 3,000 tokens/s
Average power draw of serving hardware (GPU plus server): 1,000 watts
Total operational time for 10,000 customer requests: 1 hour (3,600 seconds)
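Plugging these numbers into the TPW equation:

$$\text{TPW} = \frac{3{,}000\ \text{tokens/s}}{1{,}000\ \text{W}} = 3\ \text{tokens per watt-second (3 tokens per joule)}$$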
Optionally, scale to tokens per kilowatt-hour (kWh) by multiplying by 3.6 million joules/kWh.
In this example, each kWh delivers over 10 million tokens to customers. If we use a national average electricity price of $0.17/kWh, the energy cost per token is roughly $0.000000017—so even modest efficiency gains through algorithmic optimization, model compression, or server cooling upgrades can produce meaningful operational cost savings and improve overall system sustainability.
Power Measurement Considerations
Manufacturers define thermal design power (TDP) as the maximum power limit under load, but actual power draw varies. For energy efficiency analysis, always use measured power draw rather than TDP specifications in TPW calculations. Table 1 below outlines some of the most common methods for measuring power draw.
| Power measurement method | Description | Fidelity to LLM inference |
| --- | --- | --- |
| GPU power draw | Direct GPU power measurement capturing context and generation phases | Highest: directly reflects GPU power during inference phases, but still misses the full picture since it omits CPU power for tokenization or KV cache offload |
| Server-level aggregate power | Total server power including CPU, GPU, memory, and peripherals | High: accurate for inference but problematic for virtualized servers with mixed workloads; useful for cloud service providers' per-server economic analysis |
| External power meters | Physical measurement at the rack/PSU level, including infrastructure overhead | Low: can yield inaccurate inference-specific energy statistics when mixed workloads (training and inference) run on the cluster; useful for broad data center economic analysis |

Table 1. Comparison of common power measurement methods and their accuracy for LLM inference cost analysis
Power draw should be measured for scenarios close to the P90 of your load distribution. Applications with irregular load require measurement across broad configuration sweeps, particularly those with dynamic model selection or varying sequence lengths.
The context (prefill) processing component of inference is typically short but compute-bound, as highly parallel computations saturate the cores. Output sequence generation is more memory-bound but lasts longer (except for single-token classification). Therefore, applications receiving large inputs or entire documents can show significant power draw during the extended context/prefill phase.
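For GPU-level measurement, a minimal sketch using NVIDIA's pynvml bindings might look like the following; the sampling loop, duration, and generate_batch function are illustrative assumptions to adapt to your serving stack:

```python
# Minimal sketch: sampling GPU power draw while serving, then converting to
# tokens per joule. Assumes NVIDIA hardware with the `pynvml` bindings
# installed; generate_batch() is a placeholder for your real serving call.
import time
import pynvml

def generate_batch() -> int:
    """Placeholder for a real inference call; returns tokens generated."""
    time.sleep(0.5)   # stand-in for actual generation work
    return 1_500      # pretend the batch produced 1,500 tokens

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

power_samples_w, tokens_generated = [], 0
start = time.time()
while time.time() - start < 60:                    # measure for ~60 seconds
    tokens_generated += generate_batch()
    mw = pynvml.nvmlDeviceGetPowerUsage(handle)    # instantaneous draw, milliwatts
    power_samples_w.append(mw / 1_000)

elapsed_s = time.time() - start
avg_power_w = sum(power_samples_w) / len(power_samples_w)
tokens_per_joule = tokens_generated / (avg_power_w * elapsed_s)
print(f"avg power: {avg_power_w:.0f} W, tokens/joule: {tokens_per_joule:.2f}")

pynvml.nvmlShutdown()
```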
Cost per Meaningful Response
While cost per token is useful, cost per meaningful unit of value—cost per summary, translation, research query, or API call—may be more important for business decisions.
Depending on the use case, meaningful response costs may include quality- or error-driven “reruns” as well as pre/postprocessing components like embeddings for retrieval-augmented generation (RAG) and guardrailing LLMs:
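$$\text{cost per meaningful response} \;=\; A_A \times \left(E_t \times C_t\right) \;+\; \left(P_t \times C_p\right)$$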
where:
E_t is the average number of tokens generated per response, excluding input tokens. For reasoning models, reasoning tokens should be included in this figure.
A_A is the average number of attempts per meaningful response.
C_t is your cost per token (from earlier).
P_t is the average number of pre/postprocessing tokens.
C_p is the cost per pre/postprocessing token, which should be much lower than C_t.
Let’s expand our previous example to consider an ecommerce customer service bot’s cost per meaningful response, with the following measured operational behavior and characteristics:
Average response: 100 reasoning tokens + 50 standard output tokens (150 total)
Average attempts per successful response: 1.2
Cost per token: $0.00015
Guardrail processing: 150 tokens at $0.000002 per token
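Plugging these values into the formula above:

$$1.2 \times (150 \times \$0.00015) \;+\; (150 \times \$0.000002) \;=\; \$0.0270 + \$0.0003 \;=\; \$0.0273$$

or roughly 2.7 cents per meaningful response.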
This calculation, combined with other business factors, determines sustainable pricing that optimizes service profitability. A similar analysis can determine power efficiency by replacing the cost-per-token metric with a joules-per-token measure. In the end, each organization must decide which metrics capture bottom-line impact and how to optimize them.
Beyond Token Cost and Power
The tokens per dollar and tokens per watt metrics we’ve analyzed provide the foundational building blocks for AI economics, but production systems operate within far more complex optimization landscapes. Real deployments face scaling trade-offs where diminishing returns, opportunity costs, and utility functions intersect with practical constraints around throughput, demand patterns, and infrastructure capacity. These economic realities extend well beyond simple efficiency calculations.
The true cost structure of AI systems spans multiple interconnected layers—from individual token processing through compute architecture to data center design and deployment strategy. Each architectural choice cascades through the entire economic stack, creating optimization opportunities that pure price-performance metrics cannot reveal. Understanding these layered relationships is essential for building AI systems that remain economically viable as they scale from prototype to production.