Cutting AI costs with budgets

The Situation: Baseline Measurement

Picture an engineering team of 22 developers using Claude Code, Cursor, and the Anthropic API for about 8 months. They know roughly what they spend — about $1,450/month on average across all tools — but have no idea where that money is going. No attribution by developer, project, or task type. No visibility into whether the right models are used for the right tasks. No budgets.

The first step is to install FORG and connect all three adapters. Within 48 hours of data collection, the picture would be unsettling.

What We Found: The Waste Patterns

FORG's cost intelligence surface surfaced five distinct waste patterns immediately:

1. Idle sessions ($580/month)

The single biggest waste category. Idle sessions are LLM sessions that were opened, made a few calls, and then left active without being explicitly terminated. Because Claude Code holds context across a session, each idle session was continuing to incur costs even when the developer had moved on.

In this scenario, FORG surfaces 847 sessions in the first week of data where the session was active for more than 2 hours but had a gap of more than 45 minutes between calls. Those sessions were keeping a large context window warm for no reason.

Fix: session review alert for inactivity after 30 minutes.

alert:
  name: "idle-session-review"
  condition: "inactive_for > 30m"
  notify: "engineering-ops"

2. Oversized models for simple tasks ($420/month)

About 31% of our API calls were using claude-opus orgpt-4o for tasks that were clearly simple: short questions, boilerplate generation, docstring writing. FORG's model analysis showed that these calls had median output of 87 tokens — well within the capability of a much cheaper model.

We reviewed model usage by environment, reserved the most expensive models for our CI pipeline and architecture review tasks, and updated our adapter defaults to Sonnet/3.5 for everything else.

budget_alert:
  name: "expensive-model-usage"
  scope: global
  match:
    models:
      - "claude-opus*"
      - "gpt-4o"
  threshold_usd_daily: 25.00
  notify: "engineering-ops"

3. No prompt caching ($290/month)

We discovered that none of our Anthropic API calls were using prompt caching. For Claude Code specifically, this is significant — the system prompt is large and nearly identical across sessions. Enabling cache_control on the system prompt alone saved $290/month.

This wasn't a FORG budget — it was a configuration change in our adapter setup. But FORG was the tool that surfaced the gap by showing that ourcache_read token count was zero for 100% of API calls.

4. Duplicate API calls ($180/month)

FORG's session analysis detected 1,200 API calls per week that were semantically identical to a call made in the same session within the last 5 minutes, on the same file, with nearly identical context length. These were likely developer habits: re-running the same prompt, closing and reopening context, triggering the same autocomplete multiple times.

We implemented soft deduplication at the adapter level — a 60-second window where identical calls return cached results — and FORG anomaly alerts that warn developers when their duplicate call rate exceeds 15%.

5. No budgets = no accountability ($140/month)

This is the soft cost: without budgets, there's no incentive for developers to think about model choice or session hygiene. Once we set per-developer monthly budgets ($75/month, reasonable for a senior dev doing AI-intensive work), spending dropped simply because people became aware of it. This is the Hawthorne effect applied to AI costs.

budget:
  name: "dev-monthly-budget"
  scope: user
  limit_usd: 75.00
  period: monthly
  alert_at_percent: 80
  mode: "alert"
  gateway_hard_block: "opt-in"
  message: "You've reached your monthly AI budget.
            Contact #engineering-ops to request an increase."

Implementation: 3 Days of Work

The actual implementation took about 3 days:

Day 1: FORG setup, all three adapters connected, two weeks of historical data imported from local adapter logs.
Day 2: Data review, waste pattern analysis, budget and alert settings reviewed with team leads.
Day 3: Budgets and alerts deployed in notify-only mode first, webhook notifications configured, team briefed.

We ran notify-only mode for 5 days before enabling opt-in gateway hard-blocks for budget overruns. This gave developers time to adjust behavior without a sudden productivity impact. The transition was smooth — only 3 developers hit the block action in the first week, and each case was a legitimate exception that got a budget increase.

30-Day Results (Modeled)

Across 8 weeks in this illustrative model (2 baseline, 6 with budgets and alerts active):

Average weekly spend drops from $1,460 to $832 — a 43% reduction in this scenario
No modeled productivity impact (PR throughput unchanged)
Developer satisfaction with AI tools improves (less context loss from idle session termination)
No budget-limit friction when limits are set appropriately with good messaging

The annualized savings are around $33,000. FORG Team pricing starts at the Team plan. The ROI calculus isn't complicated.

What We'd Do Differently

Start with warn-only mode for everything. We were impatient on the idle-session alert and treated it like a hard stop on day 2. One developer had a long-running session interrupted mid-task and was (rightfully) annoyed. Alert first, measure behavior change, then enforce.

Also: segment your budgets by role. Our $75 flat budget was fine for most developers but too low for our three developers who were doing AI-intensive architecture work. Role-based budgets would have been more appropriate from day one.

The tools to do this are all in FORG. We just didn't use them as carefully as we should have on the first pass.

How a 22-Developer Team Could Cut AI Costs ~43% in 30 Days (Illustrative Model)

Weekly AI Spend: Before vs. After FORG Budgets