AI FinOps on Azure: How to Measure and Optimize the Cost of Models, Tokens, and Agents
Abstract
Generative AI changes the unit economics of cloud applications. A traditional web request usually maps to a predictable set of compute, storage, and database operations. A single AI request can trigger prompt construction, retrieval, embeddings, multiple model calls, tool execution, retries, content safety checks, evaluation, and heavy observability. Looking only at token cost hides most of the system.
This article explains a practical operating model for AI FinOps on Azure, aimed at Azure OpenAI and Microsoft Foundry workloads. It covers cost drivers, telemetry, Azure Cost Management, Azure Monitor, request-level allocation metadata, budgets, Prometheus metrics, and a working FastAPI reference implementation. The main thesis is simple: the most important AI FinOps metric is not cost per token. It is cost per successfully completed business task.
The repository that accompanies this article is available at github.com/SBajonczak/finops.
1. Introduction
Traditional cloud cost management is good at answering resource-oriented questions: which subscription, resource group, service, or meter generated cost? That is still necessary, but it is not enough for generative AI systems.
One user action in an enterprise AI application may trigger:
- several model calls,
- retrieval operations against Azure AI Search or a database,
- embedding generation,
- tool calls into internal APIs,
- retries after rate limits or transient errors,
- agent planning steps,
- evaluation calls,
- content safety checks,
- logging, tracing, and audit events,
- and supporting compute, storage, networking, and gateway costs.
If an employee clicks “summarize this case”, the invoice might show Azure OpenAI, Azure AI Search, Azure Container Apps, Azure API Management, Log Analytics, and Storage. The product owner wants to know something else: did the case summary complete successfully, did it save time, and what did one useful outcome cost?
That is why AI FinOps has to go beyond Azure OpenAI token pricing. Token prices matter, but reducing token price alone does not guarantee lower business cost. A cheaper model that fails more often can increase retries, manual correction, latency, and support effort. A smaller context window can reduce token spend while lowering task success. A verbose agent can look inexpensive per call and still become expensive because it loops.
The most important AI FinOps metric is therefore not cost per token. It is cost per successfully completed business task.
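As a minimal sketch of that metric, the following divides the cost of all attempts, including failed ones, by the number of successful tasks. The TaskRecord type, its field names, and the sample values are assumptions made for illustration and are not taken from the accompanying repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRecord:
    """One business task attempt; field names are illustrative."""
    task_id: str
    cost: float       # total attributed cost of the attempt, in any one currency
    succeeded: bool   # whether the task reached a useful outcome


def cost_per_successful_task(records: List[TaskRecord]) -> Optional[float]:
    """Total cost of all attempts divided by the number of successes.

    Failed attempts still cost money, so their cost stays in the numerator.
    Returns None when nothing succeeded, because the ratio is undefined.
    """
    total_cost = sum(r.cost for r in records)
    successes = sum(1 for r in records if r.succeeded)
    return total_cost / successes if successes else None


# Made-up sample data: three attempts, one of which failed.
records = [
    TaskRecord("t1", 0.04, True),
    TaskRecord("t2", 0.06, False),
    TaskRecord("t3", 0.05, True),
]
print(cost_per_successful_task(records))
```

Keeping failed attempts in the numerator is the point of the metric: a cheaper model with a lower success rate can come out more expensive per useful outcome.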
2. What AI FinOps Means
AI FinOps is the combination of financial accountability, technical telemetry, workload ownership, cost allocation, optimization, governance, and business-value measurement for AI systems.
Traditional FinOps often starts with:
- Which Azure resource generated the cost?
AI FinOps must additionally answer:
- Which application generated the request?
- Which use case generated the cost?
- Which AI agent was involved?
- Which user group or tenant initiated the process?
- Which model deployment was used?
- How many agent steps were executed?
- How many retries occurred?
- Was the task completed successfully?
- What was the business value of the result?
That requires joining two worlds that do not naturally line up. Azure Cost Management provides financial truth at billing and resource scopes. Azure Monitor and application telemetry provide technical usage. The application must provide business context: success, tenant, cost center, agent name, task type, and correlation ID.
The important design choice is to treat cost allocation as an application architecture concern, not only a finance report. If three products share one Azure OpenAI account, Azure tags on the account cannot tell you which product consumed which model capacity. The application must emit request-level metadata.
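One way to emit request-level metadata is to attach an allocation context to every model call and log it as a structured usage event. The sketch below assumes a hypothetical AllocationContext type and usage_event helper; the field names mirror the business context listed above but are not a fixed schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class AllocationContext:
    """Business context attached to each AI request (illustrative fields)."""
    application: str
    use_case: str
    tenant: str
    cost_center: str
    agent_name: str
    model_deployment: str
    # One correlation ID ties together all calls that belong to one task.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def usage_event(ctx: AllocationContext, input_tokens: int,
                output_tokens: int, succeeded: bool) -> str:
    """Serialize one model call as a JSON log line for later allocation."""
    event = asdict(ctx)
    event.update(
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        succeeded=succeeded,
    )
    return json.dumps(event, sort_keys=True)


ctx = AllocationContext(
    application="case-portal",
    use_case="summarize-case",
    tenant="tenant-a",
    cost_center="cc-1001",
    agent_name="summary-agent",
    model_deployment="example-chat-deployment",
)
print(usage_event(ctx, input_tokens=1200, output_tokens=300, succeeded=True))
```

Because the event is plain JSON, it can be written to any structured logging sink and joined with cost data on the deployment name and time window.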
3. Where AI Costs Actually Occur on Azure
The model is often only part of the total workload cost.
| Cost category | Azure services | Typical cost driver |
|---|---|---|
| Model inference | Azure OpenAI, Microsoft Foundry | Tokens, requests, deployment type |
| Agent execution | Foundry Agent Service, custom agents | Agent steps, loops, retries |
| Retrieval | Azure AI Search, databases | Queries, indexes, replicas, partitions, capacity |
| Embeddings | Azure OpenAI embedding models | Input volume and refresh frequency |
| Compute | Azure Functions, Container Apps, App Service | CPU, memory, execution time, replicas |
| API gateway | Azure API Management | Requests, capacity, policy execution |
| Observability | Application Insights, Log Analytics | Data ingestion, retention, sampling |
| Storage | Blob Storage, Azure SQL, Cosmos DB | Data volume, transactions, throughput |
| Network | Azure networking | Data transfer, private endpoints, NAT, firewall |
| Evaluation | AI evaluation pipelines | Model calls, compute, dataset size |
| Safety | Azure AI Content Safety or app-level checks | Requests, model calls, policy evaluation |
| Supporting operations | Key Vault, Container Registry, CI/CD | Transactions, build minutes, image storage |
A retrieval-augmented generation workflow may spend money before the chat model is called. Documents are stored, indexed, chunked, embedded, refreshed, and queried. An agent workflow may spend money after the first response because it plans, calls tools, evaluates tool output, retries failed steps, and writes traces.
This is why “Azure AI token costs” should be treated as one signal, not the whole FinOps model.
4. The Three Sources of Cost Truth
AI FinOps on Azure needs three sources of truth.
List price
List price is the publicly documented price of a model or Azure service. It is useful for estimation and architecture trade-offs. It is not the same as your invoice. Prices vary by model, model version, Azure region, deployment type, input tokens, output tokens, cached tokens, provisioned throughput, serverless deployment, pricing date, and commercial agreement. Do not hard-code model prices into application logic. Link to the official Azure pricing pages or retrieve pricing dynamically where possible.
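A minimal sketch of keeping prices out of application logic: the price table is loaded from configuration and the estimator fails loudly for unknown deployments. The deployment name and the prices in PRICE_CONFIG are placeholders, not real Azure prices, and the config shape is an assumption.

```python
import json
from decimal import Decimal
from typing import Dict

# Placeholder configuration; in practice this would come from a file or a
# pricing source. The numbers are NOT real Azure prices.
PRICE_CONFIG = """
{
  "example-chat-deployment": {"input_per_1k": "0.0100", "output_per_1k": "0.0300"}
}
"""


def load_prices(raw: str) -> Dict[str, Dict[str, Decimal]]:
    """Parse the price table, keeping amounts as Decimal to avoid float drift."""
    parsed = json.loads(raw)
    return {
        deployment: {key: Decimal(value) for key, value in entry.items()}
        for deployment, entry in parsed.items()
    }


def estimate_cost(prices: Dict[str, Dict[str, Decimal]], deployment: str,
                  input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated cost from list price; raises KeyError for unknown deployments."""
    entry = prices[deployment]
    return (Decimal(input_tokens) / 1000 * entry["input_per_1k"]
            + Decimal(output_tokens) / 1000 * entry["output_per_1k"])


prices = load_prices(PRICE_CONFIG)
print(estimate_cost(prices, "example-chat-deployment", 2000, 500))
```

The result is an estimate only. It should be labeled as such in reports and never presented as invoiced cost.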
Technical usage
Technical usage is operational telemetry:
- input tokens,
- output tokens,
- total tokens,
- model requests,
- latency,
- errors,
- retries,
- tool calls,
- cache usage,
- agent steps.
Azure Monitor exposes metrics for Azure resources, but the available metric names vary by resource and service. Azure OpenAI resources are Azure AI services accounts (resource type Microsoft.CognitiveServices/accounts), and Microsoft documents their supported metrics through Azure Monitor. In practice, a collector should not assume every resource exposes exactly InputTokens, OutputTokens, or ModelRequests. Query a configurable list and handle missing metrics gracefully.
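The following sketch shows the "configurable list, tolerate missing metrics" pattern without depending on a specific SDK. The fetch callable is a hypothetical stand-in for whatever Azure Monitor query the collector performs; the assumption here is that it raises LookupError when a metric is not available for the resource.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: (resource_id, metric_name) -> value.
# Assumed to raise LookupError when the metric is not exposed.
MetricFetcher = Callable[[str, str], float]


def collect_metrics(resource_id: str, metric_names: List[str],
                    fetch: MetricFetcher) -> Dict[str, Optional[float]]:
    """Query each configured metric; record None instead of failing."""
    results: Dict[str, Optional[float]] = {}
    for name in metric_names:
        try:
            results[name] = fetch(resource_id, name)
        except LookupError:
            # Metric not exposed by this resource: keep going and report the gap.
            results[name] = None
    return results


def fake_fetch(resource_id: str, metric_name: str) -> float:
    """In-memory stand-in for a real metrics query, for demonstration only."""
    available = {"InputTokens": 1200.0, "OutputTokens": 300.0}
    return available[metric_name]  # raises KeyError (a LookupError) if missing


configured = ["InputTokens", "OutputTokens", "ModelRequests"]
print(collect_metrics("example-resource-id", configured, fake_fetch))
```

Recording None for a missing metric keeps the gap visible in the snapshot instead of silently reporting zero usage.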
Financial cost
Financial cost comes from Azure Cost Management, cost exports, invoices, actual cost, and amortized cost. This data may not match technical telemetry immediately because of delayed cost processing, different aggregation windows, time zone differences, billing adjustments, reservations, negotiated pricing, credits, shared resources, and rounding.
For reporting, distinguish clearly between:
- list price: published price,
- estimated cost: modeled projection,
- observed technical usage: tokens, requests, retries, steps,
- amortized Azure cost: reservation and savings plan purchases spread across the period in which they are used,
- actual invoiced cost: the amount that lands in billing.
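Because these figures rarely agree exactly, a report can show the deviation between them explicitly instead of hiding it. The sketch below compares an estimated cost with an actual cost for the same period; the function name and the 10 percent default tolerance are illustrative assumptions, not recommended thresholds.

```python
from typing import Dict, Optional, Union


def reconcile(estimated: float, actual: float,
              tolerance: float = 0.10) -> Dict[str, Union[Optional[float], bool]]:
    """Relative deviation of the estimate from actual cost for one period.

    deviation is (estimated - actual) / actual, or None when actual is zero.
    within_tolerance is True when the absolute deviation is at most tolerance.
    """
    if actual == 0:
        # No billed cost yet (for example, delayed processing): the ratio is
        # undefined, so only an estimate of zero counts as matching.
        return {"deviation": None, "within_tolerance": estimated == 0}
    deviation = (estimated - actual) / actual
    return {"deviation": deviation, "within_tolerance": abs(deviation) <= tolerance}


print(reconcile(estimated=105.0, actual=100.0))
print(reconcile(estimated=150.0, actual=100.0))
```

A deviation outside the tolerance is a prompt to investigate, for example missing telemetry, discounts, or a mismatch in aggregation windows, rather than proof that either number is wrong.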
5. Azure AI FinOps Reference Architecture
flowchart LR
User[User or Application] --> Gateway[Azure API Management]
Gateway --> AI[Azure OpenAI or Microsoft Foundry]
AI --> Agent[AI Agent]
Agent --> Tools[Tools and APIs]
Agent --> Search[Azure AI Search]
Search --> Data[Storage and Databases]
Gateway --> AppInsights[Application Insights]
AI --> Monitor[Azure Monitor]
Agent --> AppInsights
CostManagement[Azure Cost Management] --> Collector[AI FinOps Collector]
Monitor --> Collector
AppInsights --> Collector
Collector --> Dashboard[Grafana, Power BI or Azure Workbook]
Collector --> Alerts[Budgets and Alerts]
The client application or user initiates the business task. Azure API Management can act as an AI gateway for authentication, rate limits, quotas, header normalization, and central logging. Azure OpenAI or Microsoft Foundry provides model inference. AI agents coordinate model calls, tool use, retrieval, and task completion. Tools and APIs represent internal systems such as CRM, ERP, ticketing, document management, or workflow systems. Azure AI Search and databases support retrieval-augmented generation.
Azure Monitor captures platform metrics. Application Insights captures application telemetry, traces, exceptions, dependency calls, and custom events. Log Analytics stores queryable observability data, which is useful but also has ingestion and retention cost. Azure Cost Management provides financial cost data. The AI FinOps collector joins cost and technical telemetry into snapshots and Prometheus metrics. Dashboards in Grafana, Power BI, or Azure Workbooks make the data visible. Budgets and alerts create governance feedback loops.
6. The AI FinOps Operating Process
Phase 1: Discover
Identify AI workloads, model deployments, Azure resources, owners, environments, cost centers, business use cases, agent tools, and supporting infrastructure.
Deliverables:
- resource inventory,
- ownership matrix,
- architecture diagram,
- cost-driver list.
The discovery phase should include shared resources. A single Azure OpenAI account, Azure AI Search service, or Log Analytics workspace may support several products.
Phase 2: Measure
Collect financial and technical telemetry.
Financial data includes actual cost, amortized cost, forecast, budgets, and cost exports. Technical data includes input tokens, output tokens, model requests, agent steps, retries, tool calls, latency, error rate, and successful outcomes.
The reference implementation exposes /api/costs, /api/token-usage, /api/snapshot, and /metrics to make this measurable.
Phase 3: Allocate
Allocate costs by application, product, team, environment, tenant, use case, cost center, AI agent, and model deployment.
Showback reports cost to teams without directly charging them. Chargeback moves the cost into the consuming team’s budget. Start with showback if the organization does not yet trust the allocation model. Move to chargeback only when metadata quality and ownership are good enough.
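A simple showback calculation splits the cost of a shared resource in proportion to each consumer's measured usage. The sketch below uses total tokens as the allocation key, which is one possible choice among several; the function name and the "unallocated" bucket are assumptions for illustration.

```python
from typing import Dict


def allocate_shared_cost(total_cost: float,
                         usage_by_owner: Dict[str, int]) -> Dict[str, float]:
    """Split total_cost proportionally to usage (for example, total tokens).

    When no usage was recorded, the cost is reported as 'unallocated'
    rather than being spread arbitrarily across owners.
    """
    total_usage = sum(usage_by_owner.values())
    if total_usage == 0:
        return {"unallocated": total_cost}
    return {
        owner: total_cost * usage / total_usage
        for owner, usage in usage_by_owner.items()
    }


# Made-up example: one shared account billed 100.0, used by two products.
print(allocate_shared_cost(100.0, {"product-a": 3000, "product-b": 1000}))
```

An explicit unallocated bucket is useful during showback because it makes missing metadata visible, which is exactly the quality signal needed before moving to chargeback.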
Phase 4: Control
Use budgets, forecast alerts, anomaly alerts, rate limits, maximum token limits, maximum output lengths, context limits, agent step limits, retry limits, concurrency limits, model allowlists, and deployment policies.
A critical limitation: Azure budgets generate notifications and can trigger actions. They do not automatically stop Azure resources or model deployments. If you need a hard control, implement it in the gateway, application, deployment policy, or automation layer.
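A hard control at the application layer can be as small as a per-task budget object that the agent loop must consult before continuing. The following is a sketch under assumed names (TaskBudget, LimitExceeded) and made-up limits; it is not an Azure feature and not the reference implementation's API.

```python
from dataclasses import dataclass


class LimitExceeded(RuntimeError):
    """Raised when a task crosses one of its configured limits."""


@dataclass
class TaskBudget:
    """Per-task limits enforced by the application, not by Azure budgets."""
    max_steps: int
    max_retries: int
    max_total_tokens: int
    steps: int = 0
    retries: int = 0
    tokens: int = 0

    def record_step(self, tokens_used: int) -> None:
        """Count one agent step and its tokens; abort if a limit is crossed."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise LimitExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_total_tokens:
            raise LimitExceeded(f"token limit {self.max_total_tokens} exceeded")

    def record_retry(self) -> None:
        """Count one retry; abort if the retry limit is crossed."""
        self.retries += 1
        if self.retries > self.max_retries:
            raise LimitExceeded(f"retry limit {self.max_retries} exceeded")


budget = TaskBudget(max_steps=3, max_retries=1, max_total_tokens=5000)
try:
    for _ in range(10):              # simulated runaway agent loop
        budget.record_step(tokens_used=1000)
except LimitExceeded as exc:
    print(f"task stopped: {exc}")
```

Raising an exception makes the stop unconditional: the task ends and can be reported as failed, which also keeps the success metric honest.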
Phase 5: Optimize
Prioritize optimization in this order:
1. Remove unnecessary requests.
2. Prevent agent loops.
3. Reduce retries.
4. Reduce unnecessary context.
5. Limit output length.
6. Route tasks to appropriate models.
7. Use caching.
8. Improve retrieval quality.
9. Use batch processing where appropriate.
10. Compare pay-as-you-go and provisioned throughput.
11. Optimize telemetry retention and sampling.
This order matters. A 20 percent cheaper model does not help much if the agent calls it five unnecessary times.
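The arithmetic behind that statement can be made explicit. The sketch below uses a normalized price of 1.0 per call and assumes the task needs one call but the agent makes five additional unnecessary ones; all numbers are illustrative.

```python
def task_cost(price_per_call: float, calls: int) -> float:
    """Cost of one task as price per call times number of model calls."""
    return price_per_call * calls


NEEDED_CALLS = 1
UNNECESSARY_CALLS = 5

baseline = task_cost(1.0, NEEDED_CALLS + UNNECESSARY_CALLS)       # current state
cheaper_model = task_cost(0.8, NEEDED_CALLS + UNNECESSARY_CALLS)  # 20 percent cheaper model
fewer_calls = task_cost(1.0, NEEDED_CALLS)                        # loop removed, same model

for label, cost in [("cheaper model", cheaper_model), ("fewer calls", fewer_calls)]:
    saving = (baseline - cost) / baseline
    print(f"{label}: cost {cost:.2f}, saving {saving:.0%}")
```

Under these assumptions, switching models saves a fifth of the task cost, while removing the unnecessary calls saves about five sixths of it.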
Phase 6: Govern and Repeat
AI FinOps is continuous. Define regular cost reviews, ownership, budget approval, model onboarding rules, quality thresholds, security requirements, and business KPI reviews. New models, new agents, and new retrieval pipelines change cost behavior. The process must repeat.