By Hemant Bhatt

Practical Model Evaluation for OpenCode and OpenRouter


Introduction

Here is the thing about agentic coding: you are not buying magic, and you are not choosing a permanent teammate. You are renting a level of intelligence for the task in front of you.

That makes model choice closer to renting transport than buying a trophy. Some days you need the truck because the job is heavy. Some days you need the scooter because the trip is short. Treating both trips the same is how a practical tool quietly becomes an expensive habit.

The common trap is assuming the smartest model should also be the default model. It feels safe because bigger models do fail less often, but most coding sessions are not one long act of deep reasoning. They are made of small test updates, renames, file reads, log checks, config fixes, and the quiet maintenance work that keeps a project moving.

So welcome to this evaluation of open-source models for agentic coding. The goal is practical: find which models are affordable, which models are genuinely intelligent, and which ones give you the best value for money when the work moves from a prompt box into a real codebase.

For the main field, I am using popular OpenRouter models that people reach for in OpenCode-style coding workflows. Those are the open models in the comparison. Claude Opus stays in the article as the closed-source frontier ceiling, giving us a top marker so we can see where each open model sits on the intelligence and cost map.

The Rule:

The goal is not to use the cheapest model. The goal is to use the cheapest model that can finish the job cleanly.

How to Judge Intelligence

For agentic coding, intelligence is not just about writing a neat patch. Most real software work falls into three connected buckets: writing code, deploying it, and maintaining it inside environments that rarely behave as cleanly as the docs promised.

That is why we will use the following benchmark mix. SWE-bench tells us about the model's raw coding capability. Terminal-Bench tells us whether the model can deal with the environment around that patch: commands, tests, setup, failures, and all the small frictions that make agentic coding real. Together, these scores tell us where a model stands in intelligence.

I am weighting SWE-bench Pro the highest because harder repo work is the best proxy for serious software engineering. SWE-bench Verified still helps as a familiar comparison point. Terminal-Bench earns a large share too because an agent that cannot operate the shell is not really an agentic coding partner. It is just a very confident autocomplete box with extra steps.

How to Judge Cost

Input tokens and output tokens are usually priced differently for each model, so comparing only one column gives you a crooked picture. A coding agent also reads much more than it writes, which makes input cost especially important. To keep the comparison practical, I use a blended cost: 80% input tokens and 20% output tokens.

Example Cost Calculation

Take DeepSeek V4 Flash. Its input price is $0.112 per 1M tokens, and its output price is $0.224 per 1M tokens.

80% input + 20% output = (0.8 x $0.112) + (0.2 x $0.224) = $0.1344 per 1M blended tokens

That blended number is what I use for the cost comparison, because it better matches how coding agents actually spend tokens.
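As a quick sanity check, here is a minimal Python sketch of that blended-cost calculation. The prices are the ones quoted above, and the 80/20 weighting is the one used throughout this article.

# Blended cost: 80% input price + 20% output price, both per 1M tokens.
def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
    return 0.8 * input_per_1m + 0.2 * output_per_1m

# DeepSeek V4 Flash example from the text.
print(blended_cost(0.112, 0.224))  # 0.1344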

Cost vs Intelligence

Cost alone only tells you how painful a model is to run. Intelligence tells you how often it can finish the work without needing rescue. The useful comparison is what happens when you hold both ideas together.

Cost Ranking

Rank  Model                Input / 1M   Output / 1M   Blended Cost   Cost Multiple
1     Tencent Hy3 Preview  $0.066       $0.26         $0.1048        1.00X
2     DeepSeek V4 Flash    $0.112       $0.224        $0.1344        1.28X
3     Qwen3.6 Plus         $0.325       $1.95         $0.6500        6.20X
4     Kimi K2.6            $0.73        $3.49         $1.2820        12.23X
5     Claude Opus 4.7      $5           $25           $9.0000        85.88X

The cost multiple makes the table easier to judge because it gives you a ruler. You can immediately see how expensive the models are relative to each other. Kimi is roughly 7 times cheaper than Opus, while Tencent Hy3 is nearly 86 times cheaper than Opus. Tencent Hy3 is given 1X because it is the cheapest model in the list.
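The cost multiple is just each blended cost divided by the cheapest one. A minimal sketch, reusing the blended costs from the table above:

# Cost multiple: each model's blended cost relative to the cheapest model.
blended = {
    "Tencent Hy3 Preview": 0.1048,
    "DeepSeek V4 Flash": 0.1344,
    "Qwen3.6 Plus": 0.6500,
    "Kimi K2.6": 1.2820,
    "Claude Opus 4.7": 9.0000,
}
cheapest = min(blended.values())
for model, cost in blended.items():
    print(f"{model}: {cost / cheapest:.2f}X")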

Intelligence Ranking

Rank  Model                SWE-Pro   SWE-Verified   Terminal   Composite
1     Claude Opus 4.7      64.3      87.6           69.4       72.57
2     Kimi K2.6            58.6      80.2           66.7       67.11
3     Qwen3.6 Plus         56.6      78.8           61.6       64.51
4     DeepSeek V4 Flash    52.6      79.0           56.9       61.60
5     Tencent Hy3 Preview  46.0      74.4           54.4       56.62

The composite score is calculated from the three benchmarks with this weighting: 45% SWE-bench Pro, 30% SWE-bench Verified, and 25% Terminal-Bench 2.0. SWE-bench Pro gets the largest weight because harder repo work is the closest signal for real software engineering. Terminal-Bench still gets a serious share because agentic coding also means running commands, reading failures, and working inside an environment.

A normalized intelligence score makes this easier to judge. Instead of staring at three benchmark columns separately, the composite gives you one ruler for raw coding-agent ability. It is not perfect, but it is much better than comparing models by whichever benchmark happens to look nicest.
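For readers who want to reproduce the composite, here is a minimal sketch using the weights above (45/30/25), with the Kimi K2.6 row as the worked example.

# Composite intelligence: weighted mix of the three benchmark scores.
WEIGHTS = {"swe_pro": 0.45, "swe_verified": 0.30, "terminal": 0.25}

def composite(swe_pro: float, swe_verified: float, terminal: float) -> float:
    return (WEIGHTS["swe_pro"] * swe_pro
            + WEIGHTS["swe_verified"] * swe_verified
            + WEIGHTS["terminal"] * terminal)

# Kimi K2.6 scores from the table above.
print(round(composite(58.6, 80.2, 66.7), 2))  # roughly 67.11, matching the table up to float rounding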

Intelligence vs Cost

Model                Intelligence   Cost
Claude Opus 4.7      72.57          85.88X
Kimi K2.6            67.11          12.23X
Qwen3.6 Plus         64.51          6.20X
DeepSeek V4 Flash    61.60          1.28X
Tencent Hy3 Preview  56.62          1.00X
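If you prefer to see the map rather than the table, here is a minimal plotting sketch (assuming matplotlib is installed). Cost goes on a log scale so the 1X-to-86X spread stays readable.

import matplotlib.pyplot as plt

# Composite intelligence and cost multiple from the two tables above.
models = {
    "Claude Opus 4.7": (85.88, 72.57),
    "Kimi K2.6": (12.23, 67.11),
    "Qwen3.6 Plus": (6.20, 64.51),
    "DeepSeek V4 Flash": (1.28, 61.60),
    "Tencent Hy3 Preview": (1.00, 56.62),
}

fig, ax = plt.subplots()
for name, (cost_multiple, intelligence) in models.items():
    ax.scatter(cost_multiple, intelligence)
    ax.annotate(name, (cost_multiple, intelligence))
ax.set_xscale("log")  # the cost spread is huge; log keeps the cheap models visible
ax.set_xlabel("Cost multiple (vs cheapest)")
ax.set_ylabel("Composite intelligence")
plt.show()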

Model-by-Model Judgment

Tencent Hy3 Preview

Role

Ultra-cheap bulk worker

Cost

1.00X baseline

Score

56.62 composite

Tencent is the price anchor for the whole comparison. It shows you what the cheapest usable worker looks like, and it gives you a model you can run for low-risk work without making every prompt feel like a purchase order.

Best For

Simple explanations, log summaries, small edits, retry-heavy cleanup, and low-risk bulk work.

Watch Out

It has the weakest full benchmark profile here, and the 262K context window starts to matter when the repo or session history gets large.

Move When

Move up when the task needs repo-scale reasoning, repeated terminal correction, or more confidence than a cheap first pass can give.

DeepSeek V4 Flash

Role

Best default coding agent

Cost

1.28X Tencent

Score

61.60 composite

DeepSeek is the practical default because the first upgrade is unusually efficient. For only a small cost increase over Tencent, you get stronger scores across the board and a 1M context window.

Best For

Default OpenCode sessions, multi-file edits, cheap automated debugging, first-pass PRs, and long-context repo work.

Watch Out

It is still not a premium fixer. If it starts making confident but wrong turns, repeated retries can become more expensive than switching models.

Move When

Move to Qwen when the task needs stronger general reasoning and terminal behavior. Move to Kimi when execution quality matters more than context size.

Qwen3.6 Plus

Role

Mid-cost upgrade

Cost

6.20X Tencent

Score

64.51 composite

Qwen is the sensible paid upgrade. You pay a real jump over DeepSeek, but the trade is clear: stronger repo work, better terminal-agent behavior, and the same 1M context advantage.

Best For

Harder repo tasks, tool-heavy debugging, larger codebases, and sessions where DeepSeek almost gets there but keeps missing the last turn.

Watch Out

It is no longer ultra-cheap, so it should be chosen with intent rather than used as a nervous default.

Move When

Move to Kimi when the work becomes more about repeated execution than broad context. Move to Opus when correctness matters more than the bill.

Kimi K2.6

Role

Strongest non-Claude executor

Cost

12.23X Tencent

Score

67.11 composite

Kimi is the serious non-Claude executor. Its Terminal-Bench score is the headline because a lot of agentic coding happens in the loop between command output, edits, tests, and another attempt.

Best For

Test-fix loops, command-heavy debugging, complicated local setup, and coding sessions where terminal judgment matters most.

Watch Out

The context window is 262K, and the price is meaningfully above Qwen. It is strongest when you are paying for execution quality, not background assistance.

Move When

Use Opus if Kimi still cannot land the fix or if the change needs final high-confidence review.

Claude Opus 4.7

Role

Frontier escalation model

Cost

85.88X Tencent

Score

72.57 composite

Opus is the model for work where being wrong is expensive. It is not the cheapest path through the task, but it gives you the strongest available shot when the problem is hard, ambiguous, or important.

Best For

Critical bugs, migrations, final architecture review, and messy debugging after cheaper models have already tried.

Watch Out

The price curve is steep. If you use Opus for chores, the model is not the problem. The workflow is.

Move When

Move back down when the work becomes routine: renames, small test updates, file explanations, log summaries, or other chores that do not need frontier reasoning.

The Practical Model Picker

The simplest way to pick is to ask what failure costs. Not all mistakes have the same price.

A bad answer on a README summary is annoying. A bad answer on a database migration is a calendar event. The same model can be a good choice in one situation and a bad default in another because the stakes changed.

Use this picker when you are sitting in OpenCode and deciding what to run before the next chunk of work. It is intentionally practical: start with the task, then pick the model.

Cheap Chores

Tencent Hy3

Use when the work is safe and repetitive.

Renames, summaries, small cleanup, low-risk retries.

Move up when context or confidence starts to matter.

Default Worker

DeepSeek V4 Flash

Start here for most real coding-agent sessions.

Multi-file edits, first-pass debugging, repo reading, PR drafts.

Move up only when the task proves it needs more.

More Judgment

Qwen3.6 Plus

Use when DeepSeek is close, but not quite landing it.

Harder repo tasks, longer reasoning, larger context-heavy work.

Move to Kimi if the terminal loop becomes the real problem.

Terminal Grind

Kimi K2.6

Use when the agent needs to execute, inspect, fix, and repeat.

Failing tests, setup weirdness, command-heavy debugging.

Move to Opus when the fix still needs high-confidence review.

Final Escalation

Claude Opus 4.7

Use when being wrong is more expensive than the model.

Production bugs, architecture review, migrations, risky fixes.

Move back down as soon as the task becomes chores again.
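If you want to encode this picker rather than keep it in your head, here is a minimal sketch. The task categories mirror the picker above, and the model names are the article's labels, not exact OpenRouter slugs, so treat the values as placeholders you would swap for real model IDs.

# Task-to-model routing that mirrors the picker above.
# Model names are placeholders for whatever OpenRouter IDs you actually use.
PICKER = {
    "cheap_chore": "Tencent Hy3 Preview",     # renames, summaries, low-risk retries
    "default": "DeepSeek V4 Flash",           # most real coding-agent sessions
    "more_judgment": "Qwen3.6 Plus",          # harder repo tasks, longer reasoning
    "terminal_grind": "Kimi K2.6",            # execute, inspect, fix, repeat
    "final_escalation": "Claude Opus 4.7",    # being wrong costs more than the model
}

def pick_model(task_kind: str) -> str:
    # Fall back to the default worker when the task does not fit a bucket.
    return PICKER.get(task_kind, PICKER["default"])

print(pick_model("terminal_grind"))  # Kimi K2.6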

Conclusion

The model market is starting to look less like a ladder and more like a toolbox. That is good, because agentic coding is not one task. It is reading, planning, editing, running, failing, checking, deploying, and maintaining.

That is why the best model is not always the model with the highest score. The best model is the one that gives you enough reliability for the task without making every prompt feel expensive.

  • Tencent Hy3: cheapest bulk worker.
  • DeepSeek V4 Flash: best starter coding-agent model.
  • Qwen3.6 Plus: best mid-cost upgrade with 1M context.
  • Kimi K2.6: strongest non-Claude terminal-heavy agent.
  • Claude Opus 4.7: best raw intelligence and final escalation model.
