Nobody Knows What Their System Costs

There is a conversation that happens in engineering organisations with reliable regularity. The cloud bill went up. Leadership wants to know why. Engineering is asked to explain it. Engineering produces a general account of where the money is going, some hypotheses about what caused the increase, and a vague plan to investigate. Two weeks later the investigation has produced some findings. Some resources are cleaned up. The bill comes down slightly. The conversation ends.

Next quarter the bill goes up again.

The reason this cycle repeats is not that engineering teams are careless or that cloud providers are opaque, though cloud pricing is genuinely complex. The reason is that most engineering teams have never set up the infrastructure required to actually understand their costs in real time, at the level of individual services and features, before the bill arrives rather than after.

They are flying with a monthly receipt instead of a dashboard. The receipt tells you what you spent. It does not tell you what you got for it, which part of the system spent it, or whether the spending was justified by the value it produced.

The attribution problem

The fundamental difficulty with cloud cost management is attribution. A cloud account receives a bill. The bill is divided into categories: compute, storage, networking, data transfer, managed services. These categories tell you what kind of resource was used. They do not tell you which product feature used it, which team is responsible for it, or whether the usage was expected.

A company with five engineering teams sharing a cloud account has no native way to know that one team’s new data pipeline is responsible for forty percent of this month’s increase. The bill shows compute costs went up. The data pipeline is compute. So are the other things those five teams are running. The connection requires instrumentation that was not set up when the pipeline was built.

This is the attribution problem and it is the source of almost every cloud cost mystery. The mystery is not that costs increased. Costs increase as systems grow. The mystery is that the increase cannot be explained precisely because the costs have never been tracked at the level of granularity required to explain them.

The solution is not complicated but it requires doing something at build time that teams routinely defer: tagging every resource with enough information to attribute its cost to a team, a service, and an environment.

# A tagging standard applied at infrastructure creation time.
# Not optional. Not aspirational.
# Every resource that incurs cost gets these tags or it does not get created.

from dataclasses import dataclass
from typing import Optional
import boto3


@dataclass
class ResourceTags:
    team: str               # Which team owns this resource
    service: str            # Which service it belongs to
    environment: str        # production, staging, development
    component: str          # What it does: api, worker, cache, database
    cost_center: str        # For financial attribution
    created_by: str         # Engineer or pipeline that created it
    feature: Optional[str] = None  # Feature flag or product area if applicable

    def to_dict(self) -> dict[str, str]:
        tags = {
            "team": self.team,
            "service": self.service,
            "environment": self.environment,
            "component": self.component,
            "cost_center": self.cost_center,
            "created_by": self.created_by,
        }
        if self.feature:
            tags["feature"] = self.feature
        return tags

    def to_aws_format(self) -> list[dict]:
        return [
            {"Key": k, "Value": v}
            for k, v in self.to_dict().items()
        ]


def create_tagged_instance(
    instance_type: str,
    tags: ResourceTags,
    ami_id: str,
    subnet_id: str,
) -> dict:
    ec2 = boto3.client("ec2")

    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,
        TagSpecifications=[
            {
                "ResourceType": "instance",
                "Tags": tags.to_aws_format(),
            }
        ],
    )

    instance_id = response["Instances"][0]["InstanceId"]

    return {
        "instance_id": instance_id,
        "tags": tags.to_dict(),
        "instance_type": instance_type,
    }


# Enforced at the CI level: resources without required tags fail deployment.
def validate_tags(tags: ResourceTags) -> list[str]:
    """Return a list of validation errors; an empty list means the tags pass."""
    errors = []
    if not tags.team:
        errors.append("team tag is required")
    if not tags.service:
        errors.append("service tag is required")
    if tags.environment not in ("production", "staging", "development"):
        errors.append("environment must be production, staging, or development")
    if not tags.component:
        errors.append("component tag is required")
    if not tags.cost_center:
        errors.append("cost_center tag is required for financial attribution")
    if not tags.created_by:
        errors.append("created_by tag is required")
    return errors

The tagging policy only works if it is enforced. A policy that is aspirational produces a system where some resources are tagged and others are not, which is actually worse than no tagging because it creates false confidence in the attribution data. The tagged resources look like the whole picture. The untagged ones are invisible.
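One way to keep that false confidence in check is to measure how much of the spend is actually attributed. The sketch below is a minimal illustration under stated assumptions: the line items, their field names, and the `cost_by_team` and `attribution_coverage` helpers are hypothetical stand-ins for whatever your billing export provides.

```python
from collections import defaultdict

# Hypothetical billing line items; a real billing export has different fields.
LINE_ITEMS = [
    {"resource_id": "i-aaa", "cost_usd": 400.0, "tags": {"team": "search"}},
    {"resource_id": "i-bbb", "cost_usd": 250.0, "tags": {"team": "payments"}},
    {"resource_id": "vol-ccc", "cost_usd": 350.0, "tags": {}},
]


def cost_by_team(items: list[dict]) -> dict[str, float]:
    """Sum cost per team tag, putting untagged spend in its own bucket."""
    totals: dict[str, float] = defaultdict(float)
    for item in items:
        team = item["tags"].get("team") or "UNATTRIBUTED"
        totals[team] += item["cost_usd"]
    return dict(totals)


def attribution_coverage(items: list[dict]) -> float:
    """Fraction of total spend that carries a team tag (0.0 to 1.0)."""
    total = sum(item["cost_usd"] for item in items)
    if total == 0:
        return 1.0
    untagged = cost_by_team(items).get("UNATTRIBUTED", 0.0)
    return (total - untagged) / total


totals = cost_by_team(LINE_ITEMS)
coverage = attribution_coverage(LINE_ITEMS)
print(totals)
print(f"attributed share of spend: {coverage:.0%}")
```

Reporting the unattributed bucket next to the per-team totals keeps the gap visible. A dashboard that shows only the tagged teams hides it.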

Enforcement happens at the infrastructure provisioning layer. Resources without complete tags do not get created. This is the only approach that produces a complete picture.
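A gate of this kind can be sketched in a few lines. This is an illustration of the idea, not a real provisioning API: `provision`, `missing_tags`, `MissingTagsError`, and the `create_resource` callable are hypothetical names, and the required tag set is assumed from the standard described earlier.

```python
REQUIRED_TAGS = ("team", "service", "environment", "cost_center")
ALLOWED_ENVIRONMENTS = ("production", "staging", "development")


class MissingTagsError(Exception):
    """Raised when a resource request lacks required attribution tags."""


def missing_tags(tags: dict[str, str]) -> list[str]:
    """List the required tags that are absent, empty, or invalid."""
    problems = [key for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append("environment (invalid value)")
    return problems


def provision(tags: dict[str, str], create_resource):
    """Call create_resource only when the tag set is complete."""
    problems = missing_tags(tags)
    if problems:
        # The creation call is never reached, so nothing untagged exists.
        raise MissingTagsError(f"refusing to create resource: {problems}")
    return create_resource(tags)


if __name__ == "__main__":
    incomplete = {"team": "search", "environment": "production"}
    try:
        provision(incomplete, lambda t: "created")
    except MissingTagsError as exc:
        print(exc)
```

The important property is the ordering: validation runs before the creation call, so an incomplete request fails loudly instead of producing a resource that has to be found and tagged later.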

The cost that scales with AI

Before 2023, cloud cost management was mostly a compute and storage problem. Both scale predictably with usage. A service that handles twice the traffic roughly doubles its compute cost. A database that stores twice the data roughly doubles its storage cost. The relationships are not exact but they are close enough to reason about.

AI workloads have changed this in ways that most teams have not fully accounted for. GPU compute does not scale linearly with usage. It scales with model size, with batch efficiency, with the specific patterns of when requests arrive, and with decisions made about serving infrastructure that most teams made quickly and have not revisited.
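The arithmetic behind this is simple enough to sketch. A dedicated GPU instance costs the same per hour whether it is saturated or mostly idle, so the cost per request depends on how much of its capacity is actually used. The hourly price, peak throughput, and utilisation figures below are hypothetical, chosen only to show the shape of the relationship.

```python
def gpu_cost_per_request(
    hourly_price_usd: float,
    peak_requests_per_hour: float,
    utilisation: float,
) -> float:
    """Cost per served request for a GPU that is billed by the hour.

    utilisation is the fraction of peak throughput actually served (0 to 1].
    """
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    served_per_hour = peak_requests_per_hour * utilisation
    return hourly_price_usd / served_per_hour


# Hypothetical figures, not real pricing.
HOURLY_PRICE_USD = 4.00
PEAK_REQUESTS_PER_HOUR = 10_000

busy = gpu_cost_per_request(HOURLY_PRICE_USD, PEAK_REQUESTS_PER_HOUR, 0.80)
quiet = gpu_cost_per_request(HOURLY_PRICE_USD, PEAK_REQUESTS_PER_HOUR, 0.04)

print(f"busy:  ${busy:.4f} per request")
print(f"quiet: ${quiet:.4f} per request")
```

The hardware and the hourly bill are identical in both cases. Only the traffic pattern changed, and the unit cost changed with it.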

A team that deployed an LLM-powered feature and did not instrument its cost separately from the rest of the system has no idea what that feature is costing per request, per user, or per month. They know the GPU line on the bill went up. They do not know whether the feature is generating value proportional to that cost.

This is the specific version of the attribution problem that is causing the most expensive surprises in 2026. AI features are expensive to run. The expense is often not visible until the bill arrives. By then the feature is in production, users depend on it, and removing it is not simple.

# Cost tracking for AI inference that makes the expense visible
# in real time rather than at billing time.

import time
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
from prometheus_client import Counter, Histogram, Gauge

log = logging.getLogger(__name__)

# Metrics that make AI costs visible before the bill arrives

AI_INFERENCE_COST_ESTIMATE = Counter(
    "ai_inference_estimated_cost_usd_total",
    "Estimated cost of AI inference in USD",
    ["model", "feature", "team"],
)

AI_INFERENCE_TOKENS = Counter(
    "ai_inference_tokens_total",
    "Total tokens processed",
    ["model", "feature", "token_type"],  # token_type: input, output
)

AI_INFERENCE_LATENCY = Histogram(
    "ai_inference_duration_seconds",
    "Time spent on AI inference",
    ["model", "feature"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

AI_COST_PER_HOUR = Gauge(
    "ai_inference_cost_per_hour_usd",
    "Rolling hourly AI inference cost estimate",
    ["feature"],
)


# Approximate cost per token for common models, in USD.
# These figures are illustrative: check each provider's current price list
# and update them whenever pricing changes.
MODEL_COSTS_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015},
    "claude-haiku-4-5": {"input": 0.0000008, "output": 0.000004},
    "local/qwen2.5-coder-32b": {"input": 0.0, "output": 0.0},
}


@dataclass
class InferenceResult:
    content: str
    input_tokens: int
    output_tokens: int
    model: str
    latency_seconds: float
    estimated_cost_usd: float


class TrackedInferenceClient:
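    """
    Wraps an async inference client and tracks cost metrics automatically.
    Assumes an Anthropic-style interface: `client.messages.create(...)` and
    token counts on `response.usage`. Adapt `complete` for other clients.
    The cost estimate is visible in dashboards before the bill arrives.
    """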

    def __init__(self, client, model: str, feature: str, team: str):
        self.client = client
        self.model = model
        self.feature = feature
        self.team = team
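        self.costs = MODEL_COSTS_PER_TOKEN.get(model)
        if self.costs is None:
            # Without this warning an unpriced model is silently recorded as free.
            log.warning("No pricing configured for model %s; cost will be 0", model)
            self.costs = {"input": 0.0, "output": 0.0}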

    async def complete(self, messages: list[dict], **kwargs) -> InferenceResult:
        start = time.perf_counter()

        response = await self.client.messages.create(
            model=self.model,
            messages=messages,
            **kwargs,
        )

        latency = time.perf_counter() - start

        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        cost = (
            input_tokens * self.costs["input"]
            + output_tokens * self.costs["output"]
        )

        # Record metrics
        AI_INFERENCE_TOKENS.labels(
            model=self.model,
            feature=self.feature,
            token_type="input",
        ).inc(input_tokens)

        AI_INFERENCE_TOKENS.labels(
            model=self.model,
            feature=self.feature,
            token_type="output",
        ).inc(output_tokens)

        AI_INFERENCE_COST_ESTIMATE.labels(
            model=self.model,
            feature=self.feature,
            team=self.team,
        ).inc(cost)

        AI_INFERENCE_LATENCY.labels(
            model=self.model,
            feature=self.feature,
        ).observe(latency)

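        # Stdlib logging rejects arbitrary keyword arguments,
        # so structured fields are passed through `extra`.
        log.info(
            "ai.inference.completed",
            extra={
                "model": self.model,
                "feature": self.feature,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "estimated_cost_usd": round(cost, 6),
                "latency_seconds": round(latency, 3),
            },
        )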

        return InferenceResult(
            content=response.content[0].text,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            model=self.model,
            latency_seconds=latency,
            estimated_cost_usd=cost,
        )

With this wrapper in place, you can answer the question “what did the document summarisation feature cost yesterday” in thirty seconds. You can set alerts when per-feature cost crosses a threshold. You can see in the dashboard whether the new AI feature you shipped last Tuesday is trending toward a hundred dollars a month or a hundred thousand.
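The alerting half of that can be sketched without any monitoring stack. The class below keeps a trailing one-hour total of estimated cost and compares it to a threshold. `RollingHourlyCost`, `over_threshold`, and the threshold value are hypothetical names and numbers for illustration; in a real deployment an alerting rule over the exported metrics would more likely do this job.

```python
from collections import deque


class RollingHourlyCost:
    """Tracks estimated spend over a trailing time window (default: one hour)."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp_seconds, cost_usd), oldest first
        self._total = 0.0

    def record(self, timestamp: float, cost_usd: float) -> None:
        """Add one inference call's estimated cost at the given time."""
        self._events.append((timestamp, cost_usd))
        self._total += cost_usd
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        """Drop events that have fallen out of the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_cost = self._events.popleft()
            self._total -= old_cost

    def current(self, now: float) -> float:
        """Estimated spend within the window ending at `now`."""
        self._evict(now)
        return max(self._total, 0.0)


def over_threshold(hourly_cost_usd: float, threshold_usd: float) -> bool:
    return hourly_cost_usd > threshold_usd


tracker = RollingHourlyCost()
tracker.record(timestamp=0.0, cost_usd=0.5)
tracker.record(timestamp=1800.0, cost_usd=0.25)
print(tracker.current(now=1800.0), over_threshold(tracker.current(now=1800.0), 0.6))
```

The value returned by `current` is the kind of number the `AI_COST_PER_HOUR` gauge is meant to hold, for example via `AI_COST_PER_HOUR.labels(feature=...).set(value)`.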

Without it, you find out at the end of the month what you spent on AI in aggregate, with no breakdown by feature, and no ability to connect the cost to the value it produced.

The orphaned resource problem

Every cloud account accumulates orphaned resources. Development environments that were spun up for a project and never torn down. Snapshots taken before a migration that are still being stored six months after the migration completed. Load balancers pointed at target groups that have no instances. Elastic IPs allocated but not attached to anything.

Each of these is small. Together, across an organisation that has been running in the cloud for several years, they can represent ten to thirty percent of the monthly bill. The organisation is paying for nothing. The nothing has been there long enough that nobody remembers creating it.

Finding them requires regular automated scanning. Not annual audits by a consultant. Automated daily scanning that produces a report of resources that match patterns suggesting they are orphaned.

import boto3
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta
from typing import Optional
import logging

log = logging.getLogger(__name__)


@dataclass
class OrphanedResource:
    resource_id: str
    resource_type: str
    region: str
    estimated_monthly_cost: float
    age_days: int
    reason: str
    last_activity: Optional[str] = None
    tags: Optional[dict] = None


class OrphanedResourceScanner:
    """
    Finds resources that are likely orphaned.
    Run daily. Review weekly. Clean up monthly.
    The discomfort of reviewing the report is less than
    the cost of not reviewing it.
    """

    def __init__(self, regions: list[str]):
        self.regions = regions

    def scan_unattached_ebs_volumes(self, region: str) -> list[OrphanedResource]:
        ec2 = boto3.client("ec2", region_name=region)
        orphaned = []

        response = ec2.describe_volumes(
            Filters=[{"Name": "status", "Values": ["available"]}]
        )

        for volume in response["Volumes"]:
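            # CreateTime is when the volume was created, not when it was
            # detached, so age is an upper bound on the time spent unattached.
            create_time = volume["CreateTime"]
            age = (datetime.now(timezone.utc) - create_time).days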

            if age < 7:
                continue

            size_gb = volume["Size"]
            # Approximate: gp3 costs $0.08/GB/month
            monthly_cost = size_gb * 0.08

            tags = {t["Key"]: t["Value"] for t in volume.get("Tags", [])}

            orphaned.append(OrphanedResource(
                resource_id=volume["VolumeId"],
                resource_type="ebs_volume",
                region=region,
                estimated_monthly_cost=monthly_cost,
                age_days=age,
                reason=f"Volume unattached; created {age} days ago",
                tags=tags,
            ))

        return orphaned

    def scan_idle_load_balancers(self, region: str) -> list[OrphanedResource]:
        elbv2 = boto3.client("elbv2", region_name=region)
        cw = boto3.client("cloudwatch", region_name=region)
        orphaned = []

        lbs = elbv2.describe_load_balancers()["LoadBalancers"]

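        for lb in lbs:
            # RequestCount in AWS/ApplicationELB only exists for ALBs. Skip
            # network and gateway load balancers to avoid false positives.
            if lb.get("Type") != "application":
                continue
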
            end = datetime.now(timezone.utc)
            start = end - timedelta(days=7)

            metrics = cw.get_metric_statistics(
                Namespace="AWS/ApplicationELB",
                MetricName="RequestCount",
                Dimensions=[
                    {
                        "Name": "LoadBalancer",
                        "Value": lb["LoadBalancerArn"].split("loadbalancer/")[1],
                    }
                ],
                StartTime=start,
                EndTime=end,
                Period=604800,
                Statistics=["Sum"],
            )

            total_requests = sum(
                p["Sum"] for p in metrics.get("Datapoints", [])
            )

            if total_requests == 0:
                # ALB costs roughly $16/month minimum plus LCU charges
                orphaned.append(OrphanedResource(
                    resource_id=lb["LoadBalancerArn"],
                    resource_type="application_load_balancer",
                    region=region,
                    estimated_monthly_cost=16.0,
                    age_days=(
                        datetime.now(timezone.utc) - lb["CreatedTime"]
                    ).days,
                    reason="Zero requests in past 7 days",
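                    # describe_load_balancers does not return tags;
                    # they have to be fetched with describe_tags.
                    tags={
                        t["Key"]: t["Value"]
                        for desc in elbv2.describe_tags(
                            ResourceArns=[lb["LoadBalancerArn"]]
                        )["TagDescriptions"]
                        for t in desc["Tags"]
                    },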
                ))

        return orphaned

    def scan_all(self) -> list[OrphanedResource]:
        all_orphaned = []
        for region in self.regions:
            all_orphaned.extend(self.scan_unattached_ebs_volumes(region))
            all_orphaned.extend(self.scan_idle_load_balancers(region))

        all_orphaned.sort(key=lambda r: r.estimated_monthly_cost, reverse=True)

        total_monthly_waste = sum(r.estimated_monthly_cost for r in all_orphaned)
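        # Stdlib logging rejects arbitrary keyword arguments,
        # so structured fields are passed through `extra`.
        log.info(
            "orphan.scan.completed",
            extra={
                "orphaned_resource_count": len(all_orphaned),
                "estimated_monthly_waste_usd": round(total_monthly_waste, 2),
                "estimated_annual_waste_usd": round(total_monthly_waste * 12, 2),
            },
        )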

        return all_orphaned

The scan output, sorted by estimated monthly cost, is the starting point for a cleanup conversation. The resources at the top of the list are the ones worth investigating first. Some of them will turn out to be in use in ways the scanner did not detect. Most of them will not be.
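A cleanup conversation goes faster when each finding lands in front of the team that can answer for it. The sketch below groups findings by their `team` tag and totals the estimated cost per owner. The `Finding` dataclass is a simplified, hypothetical stand-in for the scanner's output, and the sample data is invented.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Finding:
    resource_id: str
    estimated_monthly_cost: float
    tags: dict = field(default_factory=dict)


def group_by_owner(findings: list[Finding]) -> dict[str, dict]:
    """Group findings by team tag; untagged resources go under 'unowned'."""
    grouped: dict[str, list[Finding]] = defaultdict(list)
    for finding in findings:
        grouped[finding.tags.get("team", "unowned")].append(finding)

    summary = {}
    for team, items in grouped.items():
        # Most expensive first, so each team sees its biggest item on top.
        items.sort(key=lambda f: f.estimated_monthly_cost, reverse=True)
        summary[team] = {
            "total_monthly_usd": round(
                sum(f.estimated_monthly_cost for f in items), 2
            ),
            "resource_ids": [f.resource_id for f in items],
        }
    return summary


# Invented sample findings for illustration.
findings = [
    Finding("vol-c", 8.5, {"team": "search"}),
    Finding("alb-b", 16.0),
    Finding("vol-a", 40.0, {"team": "search"}),
]
report = group_by_owner(findings)
for team, entry in report.items():
    print(team, entry)
```

The `unowned` bucket is usually the most interesting one, because it is where the tagging policy and the orphan problem meet.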

The cost conversation that needs to happen

Cloud cost management is treated in most engineering organisations as either a finance problem or an infrastructure problem. The finance team wants the number to go down. The infrastructure team manages the infrastructure. The conversation between them is reactive: the number goes up, someone asks questions, the infrastructure team explains, some things get cleaned up.

The more useful conversation is prospective, happens before costs increase, and involves the engineers making product and architecture decisions alongside the people who care about costs.

A new feature that requires running a model for every user request: what does that cost at current user volume? What does it cost at ten times current user volume? Is the value the feature produces proportional to that cost? Are there cheaper ways to achieve the same outcome?

These questions are not hostile to engineering. They are the questions that make engineering decisions better. An engineer who knows that their new feature will cost four thousand dollars a month at current scale makes different decisions than one who does not know. They might choose a smaller model. They might cache results more aggressively. They might scope the feature more narrowly. The cost awareness does not prevent the feature from being built. It changes how it is built in ways that are usually improvements.

# A cost estimation tool for new features before they are built.
# Not for finance. For engineers making architecture decisions.

from dataclasses import dataclass


@dataclass
class FeatureCostEstimate:
    feature_name: str
    monthly_active_users: int
    requests_per_user_per_day: float
    description: str

    def estimate_ai_cost(
        self,
        model: str,
        avg_input_tokens: int,
        avg_output_tokens: int,
        input_cost_per_token: float,
        output_cost_per_token: float,
    ) -> dict:
        requests_per_month = (
            self.monthly_active_users
            * self.requests_per_user_per_day
            * 30
        )
        cost_per_request = (
            avg_input_tokens * input_cost_per_token
            + avg_output_tokens * output_cost_per_token
        )
        monthly_cost = requests_per_month * cost_per_request

        return {
            "feature": self.feature_name,
            "model": model,
            "requests_per_month": int(requests_per_month),
            "cost_per_request_usd": round(cost_per_request, 6),
            "monthly_cost_usd": round(monthly_cost, 2),
            "annual_cost_usd": round(monthly_cost * 12, 2),
            "cost_per_mau_per_month": round(
                monthly_cost / self.monthly_active_users, 4
            ) if self.monthly_active_users > 0 else 0,
        }


# Before building: what does this feature cost?
document_summary = FeatureCostEstimate(
    feature_name="document_summarisation",
    monthly_active_users=10_000,
    requests_per_user_per_day=3.0,
    description="Summarise uploaded documents on the dashboard",
)

# With a frontier model:
frontier_estimate = document_summary.estimate_ai_cost(
    model="claude-sonnet-4-5",
    avg_input_tokens=8000,
    avg_output_tokens=500,
    input_cost_per_token=0.000003,
    output_cost_per_token=0.000015,
)

# With a smaller model:
efficient_estimate = document_summary.estimate_ai_cost(
    model="claude-haiku-4-5",
    avg_input_tokens=8000,
    avg_output_tokens=500,
    input_cost_per_token=0.0000008,
    output_cost_per_token=0.000004,
)

# frontier: ~$28,350/month
# smaller model: ~$7,560/month
# Same feature. Nearly four times cheaper.
# The engineer who runs this before building makes a different choice
# than the one who runs it after the first billing cycle.
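The same arithmetic answers the other two questions: what happens at ten times the volume, and how much caching would help. The sketch below is a rough model under stated assumptions. `monthly_cost` is a hypothetical helper, the 50 percent cache hit rate is invented, and a cache hit is assumed to cost nothing, which ignores the cost of running the cache itself.

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
) -> float:
    """Token cost of one request in USD."""
    return input_tokens * input_price + output_tokens * output_price


def monthly_cost(
    monthly_active_users: int,
    requests_per_user_per_day: float,
    per_request_usd: float,
    cache_hit_rate: float = 0.0,
    days_per_month: int = 30,
) -> float:
    """Monthly cost, assuming cache hits are free (a simplification)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    requests = monthly_active_users * requests_per_user_per_day * days_per_month
    billable_requests = requests * (1.0 - cache_hit_rate)
    return billable_requests * per_request_usd


# Same token counts and per-token prices as the frontier estimate above.
per_request = cost_per_request(8000, 500, 0.000003, 0.000015)

current = monthly_cost(10_000, 3.0, per_request)
at_10x = monthly_cost(100_000, 3.0, per_request)
at_10x_cached = monthly_cost(100_000, 3.0, per_request, cache_hit_rate=0.5)

for label, value in [
    ("current volume", current),
    ("10x volume", at_10x),
    ("10x volume, 50% cache hits", at_10x_cached),
]:
    print(f"{label}: ${value:,.2f}/month")
```

Token cost is linear in request volume, so ten times the users means ten times the bill unless something in the design changes. Caching is one of the few levers that changes the slope and not just the starting point.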

What good looks like

Engineering organisations that manage cloud costs well are not necessarily spending less than their peers. They are spending with intention rather than with inertia.

They know what each service costs, tracked in real time against a budget that was set when the service was built. They know what each AI feature costs per user and per request. They have automated scans that surface orphaned resources weekly rather than discovering them annually. New resources are tagged at creation or they do not get created. The cost dashboard is reviewed in architecture conversations alongside the performance dashboard, because cost is a product constraint like any other.

The monthly bill, when it arrives, is not a mystery. It is a confirmation of what the dashboards have been showing all month. If the number is higher than expected, the reason is already visible in the cost metrics. If it is lower, the cleanup effort from last month has done its job.
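Getting to "not a mystery" mostly means comparing month-to-date spend against the budget while there is still time to act. A minimal sketch, assuming a straight-line projection of the daily run rate so far; `project_month_end` and `budget_status` are hypothetical helpers, and the spend and budget figures are invented.

```python
import calendar
from datetime import date


def project_month_end(spend_to_date_usd: float, today: date) -> float:
    """Linear projection: assume the average daily spend so far continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spend_to_date_usd / today.day
    return daily_rate * days_in_month


def budget_status(
    spend_to_date_usd: float,
    monthly_budget_usd: float,
    today: date,
) -> dict:
    """Compare the projected month-end spend with the service's budget."""
    projected = project_month_end(spend_to_date_usd, today)
    return {
        "projected_usd": round(projected, 2),
        "over_budget": projected > monthly_budget_usd,
        "projected_overrun_usd": round(max(projected - monthly_budget_usd, 0.0), 2),
    }


# Invented figures: $1,500 spent by 10 June against a $4,000 monthly budget.
status = budget_status(1500.0, 4000.0, date(2025, 6, 10))
print(status)
```

A straight-line projection is crude, and it will mislead for workloads with strong weekly or end-of-month patterns. It is still enough to turn a surprise at the end of the month into a question on the tenth.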

This state is achievable without specialised FinOps tooling or dedicated cost engineering teams, though both help. It requires tagging discipline, cost metric instrumentation, and the organisational habit of treating cost as something to understand continuously rather than something to react to quarterly.

The bill is not the information. The bill is the consequence of decisions made before it arrived. Understanding those decisions as they happen is the only way to manage the consequence before it is too late to change it.

The cloud knows exactly where every cent went. The question is whether you have built the infrastructure to ask.