Cohere open-sources coding agent that runs on single H100

Cohere has released North Mini Code, a 30-billion-parameter open-source coding agent that runs on a single H100 GPU, giving engineering teams an alternative to proprietary models such as Claude Fable 5. The model is purpose-built for agentic software engineering tasks, including code review, architecture mapping, and terminal work. In testing, however, it generated roughly three times as many output tokens as comparable models, a verbosity cost that matters in high-volume production scenarios.
TL;DR
- North Mini Code is a 30B-parameter mixture-of-experts model with 3B active parameters per token, available under the Apache 2.0 license on Hugging Face
- Supports a 256,000-token context window and a 64,000-token maximum generation length; trained specifically for agentic coding tasks across three agent scaffolds
- Achieves 2.8x higher output throughput and a 30% inter-token latency advantage over Mistral Devstral Small 2 in internal testing, but generated 75 million output tokens versus a 25 million median on Artificial Analysis benchmarks
- Can run on a Mac Studio with 20GB of RAM via MLX, making it accessible for local development alongside cloud deployment options
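The figures above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the bits-per-weight values are assumptions (the article does not say which precision or quantization the 20GB figure refers to), the helper name is hypothetical, and the estimate covers weights only, ignoring KV cache and activations. It also assumes, as is typical for mixture-of-experts inference, that all 30B parameters stay resident in memory even though only 3B are active per token.

```python
# Rough sizing arithmetic for the TL;DR figures. Illustrative sketch only.

TOTAL_PARAMS_B = 30.0    # total parameters in billions (from the article)
ACTIVE_PARAMS_B = 3.0    # active parameters per token in billions (from the article)
MODEL_TOKENS = 75e6      # output tokens generated on the benchmark (from the article)
MEDIAN_TOKENS = 25e6     # median output tokens on the same benchmark (from the article)


def weight_memory_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bytes = total_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Share of the parameters that does work on any single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# How many times more output tokens the model produced than the median.
verbosity_ratio = MODEL_TOKENS / MEDIAN_TOKENS

print(f"active fraction per token: {active_fraction:.0%}")
print(f"verbosity ratio: {verbosity_ratio:.1f}x")

# ASSUMPTION: these precisions are examples, not published deployment settings.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS_B, bits):.0f} GB")
```

Under these assumptions, 16-bit weights come to roughly 60 GB, which would fit on an 80 GB H100, while only a quantized variant at around 4 bits per weight (roughly 15 GB) would plausibly fit within 20GB of RAM.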
Why It Matters
Open-source coding agents reduce vendor lock-in and lower deployment costs for teams building agentic software engineering pipelines. North Mini Code's efficiency on modest hardware (a single H100, or even a Mac Studio) broadens access to capable coding agents, which were previously limited to managed API services. However, the verbosity tradeoff means teams must evaluate token costs carefully in production workloads.
Business Impact
Engineering teams can now deploy a capable coding agent without relying on Claude Fable 5 or other proprietary models, which can reduce per-token costs and improve data privacy. The model's ability to run on a single H100 or on local hardware makes it economically viable for organizations of different sizes, though its higher token output means actual inference costs must be calculated carefully against alternatives.
Key Implications
- Open-source coding agents are becoming competitive with proprietary offerings on speed and capability, shifting the competitive landscape away from pure API-based models
- The verbosity penalty (3x output tokens) suggests that efficiency gains from sparse MoE architecture come with a tradeoff in model conciseness, requiring teams to model total cost of ownership rather than parameter count alone
- Multi-harness training across different agent scaffolds (SWE-Agent, Mini-SWE-Agent, OpenCode) appears to improve generalization, potentially becoming a standard practice for agentic model development
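The total-cost-of-ownership point can be made concrete with a small break-even calculation. This is a minimal sketch, not a pricing model: the function names and the 3.00 per-million-token baseline price are hypothetical, only output tokens are counted, and the 25 million figure is a benchmark median rather than the output of any specific competing model.

```python
# Break-even sketch for the verbosity penalty. All prices are hypothetical.

def output_cost(output_tokens: float, price_per_million: float) -> float:
    """Cost of generating `output_tokens` at a given price per million tokens."""
    return output_tokens / 1e6 * price_per_million


def break_even_price(baseline_price_per_million: float, verbosity_ratio: float) -> float:
    """Highest per-million price at which the more verbose model is no more
    expensive than the baseline for the same workload."""
    return baseline_price_per_million / verbosity_ratio


VERBOSITY_RATIO = 75e6 / 25e6   # 3x, from the article's benchmark figures
BASELINE_PRICE = 3.00           # ASSUMPTION: hypothetical price per million output tokens

threshold = break_even_price(BASELINE_PRICE, VERBOSITY_RATIO)
print(f"break-even price per million tokens: {threshold:.2f}")

# At the break-even price, both options cost the same for the benchmark workload.
print(f"baseline cost: {output_cost(25e6, BASELINE_PRICE):.2f}")
print(f"verbose model cost: {output_cost(75e6, threshold):.2f}")
```

In other words, with a 3x output multiplier the effective per-token cost of running the model (hardware, power, operations) needs to be below one third of the alternative's per-token price before the switch saves money on output tokens alone.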
What to Watch
Monitor whether the verbosity penalty becomes a limiting factor in production deployments and whether other vendors respond with similarly efficient open-source alternatives. Track adoption patterns across different deployment scenarios (local, single GPU, cloud) to understand where North Mini Code gains traction. Watch for improvements in token efficiency in future versions, as the 3x output token multiplier is a significant cost driver in high-volume agentic pipelines.
