
Hemant Bhatt
Practical Model Evaluation for OpenCode and OpenRouter

Introduction
Here is the thing about agentic coding: you are not buying magic, and you are not choosing a permanent teammate. You are renting a level of intelligence for the task in front of you.
That makes model choice closer to renting transport than buying a trophy. Some days you need the truck because the job is heavy. Some days you need the scooter because the trip is short. Treating both trips the same is how a practical tool quietly becomes an expensive habit.
The common trap is assuming the smartest model should also be the default model. It feels safe because bigger models do fail less often, but most coding sessions are not one long act of deep reasoning. They are made of small test updates, renames, file reads, log checks, config fixes, and the quiet maintenance work that keeps a project moving.
So welcome to this evaluation of open-source models for agentic coding. The goal is practical: find which models are affordable, which models are genuinely intelligent, and which ones give you the best value for money when the work moves from a prompt box into a real codebase.
For the main field, I am using popular OpenRouter models that people reach for in OpenCode-style coding workflows. Those are the open models in the comparison. Claude Opus stays in the article as the closed-source frontier ceiling, giving us a top marker so we can see where each open model sits on the intelligence and cost map.
The Rule:
The goal is not to use the cheapest model. The goal is to use the cheapest model that can finish the job cleanly.
How to Judge Intelligence
For agentic coding, intelligence is not just about writing a neat patch. Most real software work falls into three connected buckets: writing code, deploying it, and maintaining it inside environments that rarely behave as cleanly as the docs promised.
That is why we will use the following benchmark mix. SWE-bench helps us understand a model's coding capability. Terminal-Bench tells us whether the model can deal with the environment around that patch: commands, tests, setup, failures, and all the small frictions that make agentic coding real. Together, these scores tell us where a model stands on intelligence.
SWE-bench Pro
Harder professional software engineering tasks.
The closest thing here to a serious repo-work signal.
SWE-bench Verified
A cleaner subset of real GitHub issues.
Still useful, but too common and too clean to trust alone.
Terminal-Bench 2.0
Real command-line workflows inside a terminal.
The test that asks whether the agent can actually operate.
I am weighting SWE-bench Pro the highest because harder repo work is the best proxy for serious software engineering. SWE-bench Verified still helps as a familiar comparison point. Terminal-Bench earns a large share too because an agent that cannot operate the shell is not really an agentic coding partner. It is just a very confident autocomplete box with extra steps.
How to Judge Cost
Input tokens and output tokens are usually priced differently for each model, so comparing only one column gives you a crooked picture. A coding agent also reads much more than it writes, which makes input cost especially important. To keep the comparison practical, I use a blended cost: 80% input tokens and 20% output tokens.
Example Cost Calculation
Take DeepSeek V4 Flash. Its input price is $0.112 per 1M tokens, and its output price is $0.224 per 1M tokens.
80% input + 20% output = (0.8 x $0.112) + (0.2 x $0.224) = $0.1344 per 1M blended tokens
That blended number is what I use for the cost comparison, because it better matches how coding agents actually spend tokens.
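As a quick check, here is a minimal sketch of that blended-cost calculation in Python. The 80/20 split and the per-1M-token prices come straight from the worked example above; the function name is just illustrative.

```python
# Minimal sketch of the blended-cost formula above. Prices are per 1M tokens;
# the 80/20 split is the article's assumption about how coding agents spend
# tokens (they read far more than they write).

def blended_cost(input_price: float, output_price: float,
                 input_share: float = 0.8) -> float:
    return input_share * input_price + (1.0 - input_share) * output_price

# DeepSeek V4 Flash, from the worked example:
print(round(blended_cost(0.112, 0.224), 4))  # 0.1344
```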
Cost vs Intelligence
Cost alone only tells you how painful a model is to run. Intelligence tells you how often it can finish the work without needing rescue. The useful comparison is what happens when you hold both ideas together.
Cost Ranking
| Rank | Model | Input / 1M | Output / 1M | Blended Cost | Cost Multiple |
|---|---|---|---|---|---|
| 1 | Tencent Hy3 Preview | $0.066 | $0.26 | $0.1048 | 1.00X |
| 2 | DeepSeek V4 Flash | $0.112 | $0.224 | $0.1344 | 1.28X |
| 3 | Qwen3.6 Plus | $0.325 | $1.95 | $0.6500 | 6.20X |
| 4 | Kimi K2.6 | $0.73 | $3.49 | $1.2820 | 12.23X |
| 5 | Claude Opus 4.7 | $5 | $25 | $9.0000 | 85.88X |
The cost multiple gives the table a ruler: you can immediately see how expensive each model is relative to the others. Tencent Hy3 sits at 1.00X because it is the cheapest model in the list, Kimi is roughly 7 times cheaper than Opus, and Tencent Hy3 is nearly 86 times cheaper than Opus.
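Here is a small sketch of how that ruler falls out of the blended figures, using the values from the table (the model keys are labels for this example, not API identifiers).

```python
# Turn blended costs (per 1M tokens) into the cost-multiple ruler from the table.
blended = {
    "Tencent Hy3 Preview": 0.1048,
    "DeepSeek V4 Flash": 0.1344,
    "Qwen3.6 Plus": 0.6500,
    "Kimi K2.6": 1.2820,
    "Claude Opus 4.7": 9.0000,
}
cheapest = min(blended.values())
for model, cost in sorted(blended.items(), key=lambda kv: kv[1]):
    print(f"{model}: {cost / cheapest:.2f}X")
```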
Intelligence Ranking
| Rank | Model | SWE-Pro | SWE-Verified | Terminal | Composite |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 64.3 | 87.6 | 69.4 | 72.57 |
| 2 | Kimi K2.6 | 58.6 | 80.2 | 66.7 | 67.11 |
| 3 | Qwen3.6 Plus | 56.6 | 78.8 | 61.6 | 64.51 |
| 4 | DeepSeek V4 Flash | 52.6 | 79.0 | 56.9 | 61.60 |
| 5 | Tencent Hy3 Preview | 46.0 | 74.4 | 54.4 | 56.62 |
The composite score is calculated from the three benchmarks with this weighting: 45% SWE-bench Pro, 30% SWE-bench Verified, and 25% Terminal-Bench 2.0. SWE-bench Pro gets the largest weight because harder repo work is the closest signal for real software engineering. Terminal-Bench still gets a serious share because agentic coding also means running commands, reading failures, and working inside an environment.
A single composite intelligence score makes this easier to judge. Instead of staring at three benchmark columns separately, it gives you one ruler for raw coding-agent ability. It is not perfect, but it is much better than comparing models by whichever benchmark happens to look nicest.
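For reference, a minimal sketch of that weighting, checked against the Qwen3.6 Plus row from the table (the function name is illustrative):

```python
# Composite intelligence score with the stated weights:
# 45% SWE-bench Pro, 30% SWE-bench Verified, 25% Terminal-Bench 2.0.

def composite(swe_pro: float, swe_verified: float, terminal: float) -> float:
    return 0.45 * swe_pro + 0.30 * swe_verified + 0.25 * terminal

# Qwen3.6 Plus, from the table: 56.6 / 78.8 / 61.6 -> 64.51
print(round(composite(56.6, 78.8, 61.6), 2))
```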
Intelligence vs Cost
| Model | Intelligence | Cost |
|---|---|---|
| Claude Opus 4.7 | 72.57 | 85.88X |
| Kimi K2.6 | 67.11 | 12.23X |
| Qwen3.6 Plus | 64.51 | 6.20X |
| DeepSeek V4 Flash | 61.60 | 1.28X |
| Tencent Hy3 Preview | 56.62 | 1.00X |
Model-by-Model Judgment
Tencent Hy3 Preview
Role
Ultra-cheap bulk worker
Cost
1.00X baseline
Score
56.62 composite
Tencent is the price anchor for the whole comparison. It shows you what the cheapest usable worker looks like, and it gives you a model you can run for low-risk work without making every prompt feel like a purchase order.
Best For
Simple explanations, log summaries, small edits, retry-heavy cleanup, and low-risk bulk work.
Watch Out
It has the weakest full benchmark profile here, and the 262K context window starts to matter when the repo or session history gets large.
Move When
Move up when the task needs repo-scale reasoning, repeated terminal correction, or more confidence than a cheap first pass can give.
DeepSeek V4 Flash
Role
Best default coding agent
Cost
1.28X Tencent
Score
61.60 composite
DeepSeek is the practical default because the first upgrade is unusually efficient. For only a small cost increase over Tencent, you get stronger scores across the board and a 1M context window.
Best For
Default OpenCode sessions, multi-file edits, cheap automated debugging, first-pass PRs, and long-context repo work.
Watch Out
It is still not a premium fixer. If it starts making confident but wrong turns, repeated retries can become more expensive than switching models.
Move When
Move to Qwen when the task needs stronger general reasoning and terminal behavior. Move to Kimi when execution quality matters more than context size.
Qwen3.6 Plus
Role
Mid-cost upgrade
Cost
6.20X Tencent
Score
64.51 composite
Qwen is the sensible paid upgrade. You pay a real jump over DeepSeek, but the trade is clear: stronger repo work, better terminal-agent behavior, and the same 1M context advantage.
Best For
Harder repo tasks, tool-heavy debugging, larger codebases, and sessions where DeepSeek almost gets there but keeps missing the last turn.
Watch Out
It is no longer ultra-cheap, so it should be chosen with intent rather than used as a nervous default.
Move When
Move to Kimi when the work becomes more about repeated execution than broad context. Move to Opus when correctness matters more than the bill.
Kimi K2.6
Role
Strongest non-Claude executor
Cost
12.23X Tencent
Score
67.11 composite
Kimi is the serious non-Claude executor. Its Terminal-Bench score is the headline because a lot of agentic coding happens in the loop between command output, edits, tests, and another attempt.
Best For
Test-fix loops, command-heavy debugging, complicated local setup, and coding sessions where terminal judgment matters most.
Watch Out
The context window is 262K, and the price is meaningfully above Qwen. It is strongest when you are paying for execution quality, not background assistance.
Move When
Use Opus if Kimi still cannot land the fix or if the change needs final high-confidence review.
Claude Opus 4.7
Role
Frontier escalation model
Cost
85.88X Tencent
Score
72.57 composite
Opus is the model for work where being wrong is expensive. It is not the cheapest path through the task, but it gives you the strongest available shot when the problem is hard, ambiguous, or important.
Best For
Critical bugs, migrations, final architecture review, and messy debugging after cheaper models have already tried.
Watch Out
The price curve is steep. If you use Opus for chores, the model is not the problem. The workflow is.
Move When
Move back down when the work becomes routine: renames, small test updates, file explanations, log summaries, or other chores that do not need frontier reasoning.
The Practical Model Picker
The simplest way to pick is to ask what failure costs. Not all mistakes have the same price.
A bad answer on a README summary is annoying. A bad answer on a database migration is a calendar event. The same model can be a good choice in one situation and a bad default in another because the stakes changed.
Use this picker when you are sitting in OpenCode and deciding what to run before the next chunk of work. It is intentionally practical: start with the task, then pick the model.
Tencent Hy3
Use when the work is safe and repetitive.
Renames, summaries, small cleanup, low-risk retries.
Move up when context or confidence starts to matter.
DeepSeek V4 Flash
Start here for most real coding-agent sessions.
Multi-file edits, first-pass debugging, repo reading, PR drafts.
Move up only when the task proves it needs more.
Qwen3.6 Plus
Use when DeepSeek is close, but not quite landing it.
Harder repo tasks, longer reasoning, larger context-heavy work.
Move to Kimi if the terminal loop becomes the real problem.
Kimi K2.6
Use when the agent needs to execute, inspect, fix, and repeat.
Failing tests, setup weirdness, command-heavy debugging.
Move to Opus when the fix still needs high-confidence review.
Claude Opus 4.7
Use when being wrong is more expensive than the model.
Production bugs, architecture review, migrations, risky fixes.
Move back down as soon as the task becomes chores again.
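To make the escalation logic concrete, here is a minimal sketch of the picker as a function. The task labels, thresholds, and model identifiers are assumptions made for this example rather than an OpenCode or OpenRouter API; the real decision still sits with whoever is running the session.

```python
# Illustrative sketch of the picker above. Task labels and model names are
# assumptions for the example, not real OpenCode or OpenRouter identifiers.

def pick_model(task: str, failed_attempts: int = 0, high_stakes: bool = False) -> str:
    """Map a rough task description to a model tier, mirroring the picker."""
    if high_stakes:                                    # production bugs, migrations, risky fixes
        return "claude-opus-4.7"
    if task in {"rename", "summary", "cleanup"}:       # safe, repetitive work
        return "tencent-hy3"
    if task in {"test-fix-loop", "setup-debugging"}:   # terminal-heavy execution
        return "kimi-k2.6"
    if failed_attempts >= 2:                           # the default keeps missing the last turn
        return "qwen3.6-plus"
    return "deepseek-v4-flash"                         # default for most coding-agent sessions

# Example: a multi-file edit with no failures yet stays on the default.
print(pick_model("multi-file-edit"))  # deepseek-v4-flash
```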
Conclusion
The model market is starting to look less like a ladder and more like a toolbox. That is good, because agentic coding is not one task. It is reading, planning, editing, running, failing, checking, deploying, and maintaining.
That is why the best model is not always the model with the highest score. The best model is the one that gives you enough reliability for the task without making every prompt feel expensive.
- Tencent Hy3: cheapest bulk worker.
- DeepSeek V4 Flash: best starter coding-agent model.
- Qwen3.6 Plus: best mid-cost upgrade with 1M context.
- Kimi K2.6: strongest non-Claude terminal-heavy agent.
- Claude Opus 4.7: best raw intelligence and final escalation model.
References
The rankings above are based on public benchmark reports, model announcements, and OpenRouter pricing pages. SWE-bench Pro, SWE-bench Verified, and Terminal-Bench provide the benchmark frame.
Scale Labs
SWE-bench Pro
Used for the harder software-engineering benchmark in the intelligence score.
SWE-bench
SWE-bench Verified
Used for the official SWE-bench Verified benchmark definition and task set.
Terminal-Bench
Terminal-Bench
Used for the terminal-agent benchmark that tests command-line execution.
Tencent
Tencent Hy3
Used for Hy3 model details and reported coding-agent benchmark data.
DeepSeek
DeepSeek V4 Flash
Used for DeepSeek V4 Flash release, pricing, context, and benchmark data.
OpenRouter
Qwen3.6 Plus
Used for Qwen3.6 Plus pricing, context window, release date, and benchmark listing.
Kimi
Kimi K2.6
Used for Kimi K2.6 model details and reported coding-agent benchmark data.
Anthropic
Claude Opus 4.7
Used as the closed-source frontier ceiling for capability and cost comparison.