Your Claude Bill Just Hit $874. Here's How I Cut Mine to $40 — And the Output Got Better
I opened my Anthropic invoice last Tuesday and stared at it for a full minute. $874.23. For one month. Of one product. From one company. Claude Max 20x ($200) plus API overage on agent runs. Three Claude Routines firing daily. Two Claude Code sessions a day, sometimes four. A handful of one-off Opus 4.7 calls at xhigh effort that I now know cost $4 each. Add Cursor Pro ($40), GitHub Copilot Enterprise ($39), Perplexity Pro ($20) on the side, and ChatGPT Plus ($20) for the team account — total monthly AI spend: $993.

Two weeks earlier I'd written an edition about replacing Claude Pro with open-source for general tasks: writing, summarization, research. That edition stuck with me because I lied to you slightly. Not on purpose. I just hadn't tested the part that actually matters. The part that costs real money isn't chat. It's agents. The autonomous, tool-using, multi-step, file-editing, long-running agents. Those are the ones burning $4 per Opus call. Those are the ones that turn $200/month subscriptions into $874 invoices.

So I ran the experiment again, this time on the expensive part. For two weeks I rebuilt my entire agentic stack on open-source models. GLM-5.1. Kimi K2.6. DeepSeek V4. All running through a free terminal agent called OpenCode, with selective fallback to a single paid Claude Pro tier for the cases where reasoning genuinely matters. By the end of week two, my projected monthly AI spend was $41. The output didn't get worse. On three out of five workflows, it got measurably better. This is what I learned, what I cancelled, what I kept, and the exact stack I'm running now.

Let me lay the pricing out clearly, because it's the part most "open source vs. Claude" pieces skip:

Claude Opus 4.7 costs $5 per million input tokens and $25 per million output tokens. With the new tokenizer, the same input now tokenizes to 1.0–1.35× as many tokens as it did on Opus 4.6, so the effective cost on the same workload runs 30–40% higher than the rate card suggests.

DeepSeek V4 Pro Max — the model that scores within 0.2 points of Opus 4.6 on SWE-bench Verified — costs $0.27 per million input tokens and $1.10 per million output tokens. Same workload, different invoice: roughly 18× cheaper on input, 22× cheaper on output.

GLM-5.1 — released April 7th by Z.ai, leads SWE-bench Pro at 58.4%, beats Claude Opus 4.6 and GPT-5.4 on that benchmark, MIT licensed, open weights — costs $0 to self-host, or pennies on Z.ai's API.

Kimi K2.6 — released April 20th by Moonshot AI — sustained over 4,000 tool calls in a 13-hour uninterrupted agentic session in published benchmarks. That's a stability ceiling Claude Opus 4.7 doesn't currently match in publicly verified runs.

A year ago, the gap between the best open-source coding model and Claude Opus was 25 points on SWE-bench. Today it's 0.2 points. The gap closed by 99.2% in twelve months. And yet 89% of operators are still paying full Anthropic prices for workloads where the gap is statistically zero.
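To make those rate cards concrete, here's a back-of-the-envelope sketch of what one agentic coding session costs on each. One loud assumption, flagged in the code: an agent loop re-sends its context on every turn, so a session bills several times the tokens it appears to consume. The 600k-input / 60k-output totals are my illustrative guess, not measured numbers.

```python
# Back-of-the-envelope cost for one agentic coding session, using the
# per-million-token rate cards quoted above. The billed-token totals
# are assumptions: an agent loop re-sends context on every turn, so a
# session bills far more than the tokens it appears to consume.

RATES = {  # model: ($ per 1M input tokens, $ per 1M output tokens)
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4 Pro": (0.27, 1.10),
}

BILLED_INPUT = 600_000   # assumed input tokens billed across all turns
BILLED_OUTPUT = 60_000   # assumed output tokens billed across all turns

for model, (rate_in, rate_out) in RATES.items():
    cost = BILLED_INPUT / 1e6 * rate_in + BILLED_OUTPUT / 1e6 * rate_out
    print(f"{model}: ${cost:.2f} per session")

# Claude Opus 4.7: $4.50 per session (in line with the $4-7 in my logs)
# DeepSeek V4 Pro: $0.23 per session (roughly 20x cheaper, same workload)
```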
I went through 30 days of API logs line by line. Here's where the $874 went:

$320 — Agentic coding sessions. Long-running Claude Code sessions on Opus 4.7. Refactors, debugging, building features. Average session: 90 minutes, ~180k tokens consumed, $4–7 per session. Multiplied across the month.

$215 — Claude Routines. Six routines running on schedule, most on Opus 4.7 xhigh, because I was lazy about model selection and defaulted everything to the flagship.

$140 — One-off "I need this fast" Opus calls. Long-context reads, document analysis, the kind of work I used to do in chat. Premium prices for what Sonnet would have done at one-fifth the cost.

$199 — The Max 20x subscription itself. It includes a chunk of usage but also caps how many parallel sessions I can run. Once you hit the cap, overages bill at API rates anyway.

The two patterns I missed for months: I was paying Opus prices for Sonnet work, and I was paying flagship-API prices for work that open-weight models now do at 0.2-point benchmark deltas. That's not a Claude problem. It's an operator problem. I was lazy. The default is the flagship. The flagship is expensive. So the default is expensive.

The experiment had a strict structure. I split my workloads into five categories, then ran two weeks of A/B testing — same task, two outputs, blind comparison at the end.

Category 1 — Agentic coding (the big one). Refactors, multi-file edits, debugging sessions, feature builds. Tested: Claude Code on Opus 4.7 vs. OpenCode on GLM-5.1 vs. OpenCode on Kimi K2.6 vs. OpenCode on DeepSeek V4 Pro.

Category 2 — Long-running autonomous agents. Routines that fire overnight, agents that run for 4+ hours unattended. Tested: Claude Routines vs. a self-hosted Bernstein orchestrator with Kimi K2.6.

Category 3 — Code review on PRs. Triggered on every PR, posts comments, flags issues. Tested: Claude Code GitHub-triggered routine vs. OpenCode with GLM-5.1.

Category 4 — Heavy reasoning / one-off complex tasks. Architecture decisions, complex algorithm design, the genuinely hard stuff. Tested: Opus 4.7 xhigh vs. DeepSeek V4 Pro Max vs. GLM-5.1 in thinking mode.

Category 5 — High-volume routine work. Linting, summarizing, formatting, simple edits. Tested: Sonnet 4.6 vs. Qwen 3.6 Plus vs. a local Qwen 2.5 Coder 32B on Ollama.

Every comparison: same input, same prompt, blind ranking after the fact (the harness is sketched below). I'll give you the honest summary before the details.

Where open-source has caught up or surpassed Claude:

Code review on standard PRs (under 500 lines): GLM-5.1 produces more useful feedback than Opus 4.7. Specifically, it flags actual bugs more often and writes fewer "consider whether..." style non-comments.

Long-running agentic loops: Kimi K2.6 sustained an 8-hour refactor session that Opus 4.7 abandoned at hour 3 due to context drift.

Multi-file refactors under 50 files: DeepSeek V4 Pro is statistically tied with Opus 4.7. Output quality was indistinguishable in blind review.

High-volume routine work: indistinguishable. Anyone paying Sonnet prices for linting in 2026 is overpaying.

Where Claude still wins:

The hardest 5% of architectural reasoning. Genuinely novel problems with no clear precedent. Opus 4.7 xhigh still has an edge here, and the edge is worth paying for.

Polish in autonomous agentic behavior. Claude Code's CLAUDE.md system, Agent Teams, and Routines orchestration are more mature than any open-source equivalent. The gap is functional, not philosophical.

Anything where "voice" matters in the output. Open-source models write code well but produce flatter prose in their explanations. If your agent talks to humans, Claude still wins on tone.

Where it gets interesting: the 80/20 rule from the previous edition still holds, but the split has shifted. 80% of agentic work — not just chat — now runs better on a hybrid open-source stack than on pure Claude. The remaining 20% justifies one paid tier, not five.
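A note on method before the framework. The blind ranking above is worth scripting so you can't peek at which model wrote which output. A minimal sketch, assuming two saved outputs per task; the CSV log format and the task/model labels are my own illustrative choices, not a real tool:

```python
# Minimal blind A/B harness: same task, two model outputs, ranked
# without knowing which model produced which. The CSV log format and
# the task/model labels are illustrative assumptions.
import csv
import random

def blind_rank(task_id: str, outputs: dict[str, str], log_path: str) -> None:
    pair = list(outputs.items())  # [(model_name, output_text), ...]
    random.shuffle(pair)          # hide which model is which

    for i, (_, text) in enumerate(pair, start=1):
        print(f"\n----- Output {i} -----\n{text}")

    pick = input("\nBetter output? [1/2/tie]: ").strip()
    winner = {"1": pair[0][0], "2": pair[1][0]}.get(pick, "tie")

    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([task_id, *outputs.keys(), winner])

# Example (hypothetical names):
# blind_rank("refactor-042",
#            {"opus-4.7": opus_out, "glm-5.1": glm_out},
#            "rankings.csv")
```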
Last edition I called it "the 80/20 split" — 80% of tasks don't need a $200/month model. That edition was about chat. For agents, the right framework is different. Call it the Three-Tier Stack (a routing sketch follows below):

Tier 1 — Self-hosted open weights for high-volume, context-bounded work. GLM-5.1 or Qwen 2.5 Coder running on a beefy GPU or rented VRAM. This is your linting, your code review, your routine refactors. Cost per task: pennies. Privacy: total. Latency: fast.

Tier 2 — Cheap hosted open-weight APIs for sustained agentic runs. DeepSeek V4 Pro at $0.27/$1.10 per million tokens. Kimi K2.6 via Moonshot's API. Z.ai for GLM-5.1. This is where your autonomous agents live — long-running, multi-step, expensive on Opus, nearly free here.

Tier 3 — One paid Claude Pro tier ($20/month) for the genuinely hard 5%. Architectural reasoning, novel debugging, the times when Opus 4.7's edge actually matters. Used surgically, not by default.

The mistake I made for six months: defaulting Tier 1 and Tier 2 work to Tier 3 because "Claude is the best." Yes, Claude is the best at some things. The cost of using the best for everything is an $874 invoice and a vague feeling…
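Since the whole framework boils down to routing, here is what that dispatch discipline can look like in code. A minimal sketch, assuming tasks can be bucketed up front; the Task shape, the category strings, and the model labels mirror the stack above but are illustrative, not any real tool's API:

```python
# Hypothetical dispatcher for the Three-Tier Stack: send each task to
# the cheapest tier that can handle it. The Task shape, category
# strings, and model labels are illustrative assumptions.
from dataclasses import dataclass

TIERS = {
    1: "glm-5.1 (self-hosted)",               # high-volume, context-bounded
    2: "deepseek-v4-pro (hosted API)",        # sustained agentic runs
    3: "opus-4.7 (the one paid Claude tier)", # the genuinely hard 5%
}

@dataclass
class Task:
    kind: str                   # "lint", "code-review", "refactor", "architecture"
    long_running: bool = False  # multi-hour autonomous run
    novel: bool = False         # no clear precedent: flagship territory

def route(task: Task) -> str:
    if task.novel or task.kind == "architecture":
        return TIERS[3]  # used surgically, never by default
    if task.long_running:
        return TIERS[2]  # long loops are nearly free on hosted open weights
    return TIERS[1]      # everything routine stays local

print(route(Task("lint")))                         # tier 1
print(route(Task("refactor", long_running=True)))  # tier 2
print(route(Task("debugging", novel=True)))        # tier 3
```

The specifics don't matter. What matters is that the flagship becomes an explicit branch you opt into, not the default you fall back on.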