
Hemant Bhatt
Practical Model Evaluation for OpenCode and OpenRouter

Introduction
Here is the thing about agentic coding: you are not buying magic, and you are not choosing a permanent teammate. You are renting a level of intelligence for the task in front of you.
That makes model choice closer to renting transport than buying a trophy. Some days you need the truck because the job is heavy. Some days you need the scooter because the trip is short. Treating both trips the same is how a practical tool quietly becomes an expensive habit.
The common trap is assuming the smartest model should also be the default model. It feels safe because bigger models do fail less often, but most coding sessions are not one long act of deep reasoning. They are made of small test updates, renames, file reads, log checks, config fixes, and the quiet maintenance work that keeps a project moving.
So welcome to this evaluation of open-source models for agentic coding. The goal is practical: find which models are affordable, which models are genuinely intelligent, and which ones give you the best value for money when the work moves from a prompt box into a real codebase.
For the main field, I am using popular OpenRouter models that people reach for in OpenCode-style coding workflows. Those are the open models in the comparison. Claude Opus stays in the article as the closed-source frontier ceiling, giving us a top marker so we can see where each open model sits on the intelligence and cost map.
The Rule:
The goal is not to use the cheapest model. The goal is to use the cheapest model that can finish the job cleanly.
How to Judge Intelligence
For agentic coding, intelligence is not just about writing a neat patch. Most real software work falls into three connected buckets: writing code, deploying it, and maintaining it inside environments that rarely behave as cleanly as the docs promised.
That is why we will use the following benchmark mix. SWE-bench helps us understand a model's coding capability. Terminal-Bench tells us whether the model can deal with the environment around that patch: commands, tests, setup, failures, and all the small frictions that make agentic coding real. Together, these scores tell us where a model stands on intelligence.
SWE-bench Pro
Harder professional software engineering tasks.
The closest thing here to a serious repo-work signal.
SWE-bench Verified
A cleaner subset of real GitHub issues.
Still useful, but too common and too clean to trust alone.
Terminal-Bench 2.0
Real command-line workflows inside a terminal.
The test that asks whether the agent can actually operate.
I am weighting SWE-bench Pro the highest because harder repo work is the best proxy for serious software engineering. SWE-bench Verified still helps as a familiar comparison point. Terminal-Bench earns a large share too because an agent that cannot operate the shell is not really an agentic coding partner. It is just a very confident autocomplete box with extra steps.
How to Judge Cost
Input tokens and output tokens are usually priced differently for each model, so comparing only one column gives you a crooked picture. A coding agent also reads much more than it writes, which makes input cost especially important. To keep the comparison practical, I use a blended cost: 80% input tokens and 20% output tokens.
Example Cost Calculation
Take DeepSeek V4 Flash. Its input price is $0.112 per 1M tokens, and its output price is $0.224 per 1M tokens.
80% input + 20% output = (0.8 x $0.112) + (0.2 x $0.224) = $0.1344 per 1M blended tokens
That blended number is what I use for the cost comparison, because it better matches how coding agents actually spend tokens.
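As a quick check, here is a minimal sketch of that blended-cost calculation in Python. The 80/20 split and the per-1M-token prices come straight from the worked example above; the function name is just illustrative.

```python
# Minimal sketch of the blended-cost formula above. Prices are per 1M tokens;
# the 80/20 split is the article's assumption about how coding agents spend
# tokens (they read far more than they write).

def blended_cost(input_price: float, output_price: float,
                 input_share: float = 0.8) -> float:
    return input_share * input_price + (1.0 - input_share) * output_price

# DeepSeek V4 Flash, from the worked example:
print(round(blended_cost(0.112, 0.224), 4))  # 0.1344
```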
Cost vs Intelligence
Cost alone only tells you how painful a model is to run. Intelligence tells you how often it can finish the work without needing rescue. The useful comparison is what happens when you hold both ideas together.
Cost Ranking
| Rank | Model | Input / 1M | Output / 1M | Blended Cost | Cost Multiple |
|---|---|---|---|---|---|
| 1 | Tencent Hy3 Preview | $0.066 | $0.26 | $0.1048 | 1.00X |
| 2 | DeepSeek V4 Flash | $0.112 | $0.224 | $0.1344 | 1.28X |
| 3 | Qwen3.6 Plus | $0.325 | $1.95 | $0.6500 | 6.20X |
| 4 | Kimi K2.6 | $0.73 | $3.49 | $1.2820 | 12.23X |
| 5 | Claude Opus 4.7 | $5 | $25 | $9.0000 | 85.88X |
The cost multiple gives the table a ruler: you can immediately see how expensive each model is relative to the others. Tencent Hy3 sits at 1.00X because it is the cheapest model in the list, Kimi is roughly 7 times cheaper than Opus, and Tencent Hy3 is nearly 86 times cheaper than Opus.
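Here is a small sketch of how that ruler falls out of the blended figures, using the values from the table (the model keys are labels for this example, not API identifiers).

```python
# Turn blended costs (per 1M tokens) into the cost-multiple ruler from the table.
blended = {
    "Tencent Hy3 Preview": 0.1048,
    "DeepSeek V4 Flash": 0.1344,
    "Qwen3.6 Plus": 0.6500,
    "Kimi K2.6": 1.2820,
    "Claude Opus 4.7": 9.0000,
}
cheapest = min(blended.values())
for model, cost in sorted(blended.items(), key=lambda kv: kv[1]):
    print(f"{model}: {cost / cheapest:.2f}X")
```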
Intelligence Ranking
| Rank | Model | SWE-Pro | SWE-Verified | Terminal | Composite |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 64.3 | 87.6 | 69.4 | 72.57 |
| 2 | Kimi K2.6 | 58.6 | 80.2 | 66.7 | 67.11 |
| 3 | Qwen3.6 Plus | 56.6 | 78.8 | 61.6 | 64.51 |
| 4 | DeepSeek V4 Flash | 52.6 | 79.0 | 56.9 | 61.60 |
| 5 | Tencent Hy3 Preview | 46.0 | 74.4 | 54.4 | 56.62 |
The composite score is calculated from the three benchmarks with this weighting: 45% SWE-bench Pro, 30% SWE-bench Verified, and 25% Terminal-Bench 2.0. SWE-bench Pro gets the largest weight because harder repo work is the closest signal for real software engineering. Terminal-Bench still gets a serious share because agentic coding also means running commands, reading failures, and working inside an environment.
A single composite intelligence score makes this easier to judge. Instead of staring at three benchmark columns separately, it gives you one ruler for raw coding-agent ability. It is not perfect, but it is much better than comparing models by whichever benchmark happens to look nicest.
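For reference, a minimal sketch of that weighting, checked against the Qwen3.6 Plus row from the table (the function name is illustrative):

```python
# Composite intelligence score with the stated weights:
# 45% SWE-bench Pro, 30% SWE-bench Verified, 25% Terminal-Bench 2.0.

def composite(swe_pro: float, swe_verified: float, terminal: float) -> float:
    return 0.45 * swe_pro + 0.30 * swe_verified + 0.25 * terminal

# Qwen3.6 Plus, from the table: 56.6 / 78.8 / 61.6 -> 64.51
print(round(composite(56.6, 78.8, 61.6), 2))
```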
Intelligence vs Cost
| Model | Intelligence | Cost |
|---|---|---|
| Claude Opus 4.7 | 72.57 | 85.88X |
| Kimi K2.6 | 67.11 | 12.23X |
| Qwen3.6 Plus | 64.51 | 6.20X |
| DeepSeek V4 Flash | 61.60 | 1.28X |
| Tencent Hy3 Preview | 56.62 | 1.00X |
Model-by-Model Judgment
Tencent Hy3 Preview
Role
Ultra-cheap bulk worker
Cost
1.00X baseline
Score
56.62 composite
Tencent is the price anchor for the whole comparison. It shows you what the cheapest usable worker looks like, and it gives you a model you can run for low-risk work without making every prompt feel like a purchase order.
Best For
Simple explanations, log summaries, small edits, retry-heavy cleanup, and low-risk bulk work.
Watch Out
It has the weakest full benchmark profile here, and the 262K context window starts to matter when the repo or session history gets large.
Move When
Move up when the task needs repo-scale reasoning, repeated terminal correction, or more confidence than a cheap first pass can give.
DeepSeek V4 Flash
Role
Best default coding agent
Cost
1.28X Tencent
Score
61.60 composite
DeepSeek is the practical default because the first upgrade is unusually efficient. For only a small cost increase over Tencent, you get stronger scores across the board and a 1M context window.
Best For
Default OpenCode sessions, multi-file edits, cheap automated debugging, first-pass PRs, and long-context repo work.
Watch Out
It is still not a premium fixer. If it starts making confident but wrong turns, repeated retries can become more expensive than switching models.
Move When
Move to Qwen when the task needs stronger general reasoning and terminal behavior. Move to Kimi when execution quality matters more than context size.
Qwen3.6 Plus
Role
Mid-cost upgrade
Cost
6.20X Tencent
Score
64.51 composite
Qwen is the sensible paid upgrade. You pay a real jump over DeepSeek, but the trade is clear: stronger repo work, better terminal-agent behavior, and the same 1M context advantage.
Best For
Harder repo tasks, tool-heavy debugging, larger codebases, and sessions where DeepSeek almost gets there but keeps missing the last turn.
Watch Out
It is no longer ultra-cheap, so it should be chosen with intent rather than used as a nervous default.
Move When
Move to Kimi when the work becomes more about repeated execution than broad context. Move to Opus when correctness matters more than the bill.
Kimi K2.6
Role
Strongest non-Claude executor
Cost
12.23X Tencent
Score
67.11 composite
Kimi is the serious non-Claude executor. Its Terminal-Bench score is the headline because a lot of agentic coding happens in the loop between command output, edits, tests, and another attempt.
Best For
Test-fix loops, command-heavy debugging, complicated local setup, and coding sessions where terminal judgment matters most.
Watch Out
The context window is 262K, and the price is meaningfully above Qwen. It is strongest when you are paying for execution quality, not background assistance.
Move When
Use Opus if Kimi still cannot land the fix or if the change needs final high-confidence review.
Claude Opus 4.7
Role
Frontier escalation model
Cost
85.88X Tencent
Score
72.57 composite
Opus is the model for work where being wrong is expensive. It is not the cheapest path through the task, but it gives you the strongest available shot when the problem is hard, ambiguous, or important.
Best For
Critical bugs, migrations, final architecture review, and messy debugging after cheaper models have already tried.
Watch Out
The price curve is steep. If you use Opus for chores, the model is not the problem. The workflow is.
Move When
Move back down when the work becomes routine: renames, small test updates, file explanations, log summaries, or other chores that do not need frontier reasoning.
The Practical Model Picker
The simplest way to pick is to ask what failure costs. Not all mistakes have the same price.
A bad answer on a README summary is annoying. A bad answer on a database migration is a calendar event. The same model can be a good choice in one situation and a bad default in another because the stakes changed.
Use this picker when you are sitting in OpenCode and deciding what to run before the next chunk of work. It is intentionally practical: start with the task, then pick the model.
Tencent Hy3
Use when the work is safe and repetitive.
Renames, summaries, small cleanup, low-risk retries.
Move up when context or confidence starts to matter.
DeepSeek V4 Flash
Start here for most real coding-agent sessions.
Multi-file edits, first-pass debugging, repo reading, PR drafts.
Move up only when the task proves it needs more.
Qwen3.6 Plus
Use when DeepSeek is close, but not quite landing it.
Harder repo tasks, longer reasoning, larger context-heavy work.
Move to Kimi if the terminal loop becomes the real problem.
Kimi K2.6
Use when the agent needs to execute, inspect, fix, and repeat.
Failing tests, setup weirdness, command-heavy debugging.
Move to Opus when the fix still needs high-confidence review.
Claude Opus 4.7
Use when being wrong is more expensive than the model.
Production bugs, architecture review, migrations, risky fixes.
Move back down as soon as the task becomes chores again.
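To make the escalation logic concrete, here is a minimal sketch of the picker as a function. The task labels, thresholds, and model identifiers are assumptions made for this example rather than an OpenCode or OpenRouter API; the real decision still sits with whoever is running the session.

```python
# Illustrative sketch of the picker above. Task labels and model names are
# assumptions for the example, not real OpenCode or OpenRouter identifiers.

def pick_model(task: str, failed_attempts: int = 0, high_stakes: bool = False) -> str:
    """Map a rough task description to a model tier, mirroring the picker."""
    if high_stakes:                                    # production bugs, migrations, risky fixes
        return "claude-opus-4.7"
    if task in {"rename", "summary", "cleanup"}:       # safe, repetitive work
        return "tencent-hy3"
    if task in {"test-fix-loop", "setup-debugging"}:   # terminal-heavy execution
        return "kimi-k2.6"
    if failed_attempts >= 2:                           # the default keeps missing the last turn
        return "qwen3.6-plus"
    return "deepseek-v4-flash"                         # default for most coding-agent sessions

# Example: a multi-file edit with no failures yet stays on the default.
print(pick_model("multi-file-edit"))  # deepseek-v4-flash
```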
Conclusion
The model market is starting to look less like a ladder and more like a toolbox. That is good, because agentic coding is not one task. It is reading, planning, editing, running, failing, checking, deploying, and maintaining.
That is why the best model is not always the model with the highest score. The best model is the one that gives you enough reliability for the task without making every prompt feel expensive.
- Tencent Hy3: cheapest bulk worker.
- DeepSeek V4 Flash: best starter coding-agent model.
- Qwen3.6 Plus: best mid-cost upgrade with 1M context.
- Kimi K2.6: strongest non-Claude terminal-heavy agent.
- Claude Opus 4.7: best raw intelligence and final escalation model.
References
The rankings above are based on public benchmark reports, model announcements, and OpenRouter pricing pages. SWE-bench Pro, SWE-bench Verified, and Terminal-Bench provide the benchmark frame.
Scale Labs
SWE-bench Pro
Used for the harder software-engineering benchmark in the intelligence score.
SWE-bench
SWE-bench Verified
Used for the official SWE-bench Verified benchmark definition and task set.
Terminal-Bench
Terminal-Bench
Used for the terminal-agent benchmark that tests command-line execution.
Tencent
Tencent Hy3
Used for Hy3 model details and reported coding-agent benchmark data.
DeepSeek
DeepSeek V4 Flash
Used for DeepSeek V4 Flash release, pricing, context, and benchmark data.
OpenRouter
Qwen3.6 Plus
Used for Qwen3.6 Plus pricing, context window, release date, and benchmark listing.
Kimi
Kimi K2.6
Used for Kimi K2.6 model details and reported coding-agent benchmark data.
Anthropic
Claude Opus 4.7
Used as the closed-source frontier ceiling for capability and cost comparison.