Overview
In the rapidly evolving landscape of Large Language Models, developers need reliable data to choose the right model for software engineering tasks. This analysis compares Anthropic: Claude Opus 4.6 and Google: Gemini 3.1 Pro Preview on the Coding Performance benchmark, scored by 10 evaluators. Using PeerLM's comparative evaluation framework, we provide a transparent look at how each model handles complex coding prompts and instruction following.
Benchmark Results
The benchmarking process involved 10 independent evaluators assessing the quality, accuracy, and adherence to constraints for both models. The results indicate a significant performance gap in specialized coding tasks.
| Model | Overall Score | Accuracy | Instruction Following |
|---|---|---|---|
| Anthropic: Claude Opus 4.6 | 8.16 | 8.16 | 8.16 |
| Google: Gemini 3.1 Pro Preview | 1.84 | 1.84 | 1.84 |
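To show how a panel of evaluators can roll up into a single aggregate score, here is a minimal sketch of one plausible aggregation scheme: each evaluator submits a 0-10 rating per criterion, and the per-criterion and overall scores are simple means. The individual ratings and the equal-weight averaging are illustrative assumptions, not PeerLM's published methodology.

```python
from statistics import mean

# Hypothetical per-evaluator ratings (0-10) for one model on each criterion.
# Equal-weight averaging is an illustrative assumption; the actual PeerLM
# rubric and weighting may differ.
accuracy_ratings = [8, 9, 8, 7, 9, 8, 8, 9, 7, 8.6]        # 10 evaluators
instruction_ratings = [8, 8, 9, 8, 7, 9, 8, 8, 9, 7.6]     # 10 evaluators

criterion_scores = {
    "accuracy": mean(accuracy_ratings),
    "instruction_following": mean(instruction_ratings),
}

# Overall score as the unweighted mean of the criterion scores.
overall = mean(criterion_scores.values())

print(criterion_scores, round(overall, 2))
```

With these placeholder ratings, both criteria average to 8.16 and so does the overall score, which is the pattern visible in the table above when evaluators rate a model consistently across criteria.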
Criteria Breakdown
The evaluation was centered on two primary pillars: Accuracy and Instruction Following. In coding contexts, accuracy is paramount—the model must generate syntactically correct and logically sound code. Instruction following ensures the model adheres to specific constraints, such as using particular libraries, maintaining style guides, or integrating with existing boilerplate.
Anthropic: Claude Opus 4.6 demonstrated a strong grasp of these requirements, consistently producing high-quality outputs that satisfied the evaluators. Conversely, Google: Gemini 3.1 Pro Preview struggled to meet the high bar set by the benchmark, resulting in a score differential of 6.32 points.
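To make the two criteria concrete, the sketch below runs two mechanical checks on a generated snippet: a syntax check (a rough proxy for accuracy) and a check that a mandated library is actually imported (a rough proxy for instruction following). The snippet, the required library, and the helper function are hypothetical; real evaluators also judge logic, style, and integration quality by hand.

```python
import ast

def quick_checks(code: str, required_import: str) -> dict:
    """Minimal mechanical checks on model-generated Python code.

    Illustrative only: syntactic validity stands in for 'accuracy', and the
    presence of a mandated import stands in for 'instruction following'.
    """
    result = {"parses": False, "uses_required_import": False}
    try:
        tree = ast.parse(code)
        result["parses"] = True
    except SyntaxError:
        return result

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if required_import in names:
            result["uses_required_import"] = True
    return result

# Hypothetical output for a prompt that required a 'requests'-based solution.
generated = "import requests\n\ndef fetch(url):\n    return requests.get(url).status_code\n"
print(quick_checks(generated, "requests"))
```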
Cost & Latency
Efficiency is a critical factor for enterprise-scale coding assistants. Below is the breakdown of the investment required to run these models based on our evaluation dataset.
| Model | Total Cost (USD) | Avg Completion Tokens |
|---|---|---|
| Anthropic: Claude Opus 4.6 | $0.040785 | 360 |
| Google: Gemini 3.1 Pro Preview | $0.079106 | 1612 |
Claude Opus 4.6 was the more cost-effective model over the evaluation dataset, coming in at roughly half of Gemini's total cost. Google: Gemini 3.1 Pro Preview generated more than four times as many completion tokens on average (1,612 vs. 360), and that volume was a major contributor to its higher total cost despite its weaker performance outcomes.
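To make the link between completion-token volume and cost concrete, the sketch below estimates per-request cost from token counts and per-million-token prices. The prices and prompt sizes are placeholders, not the rates used in this evaluation; only the average completion-token counts come from the table above.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate the USD cost of one request from token counts and per-million-token prices."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000

# Placeholder rates and prompt size (NOT the actual provider pricing).
INPUT_PRICE = 3.0    # USD per 1M prompt tokens
OUTPUT_PRICE = 15.0  # USD per 1M completion tokens
PROMPT_TOKENS = 500  # assumed identical prompt for both models

# Average completion-token counts taken from the table above.
for label, completion in [("low-volume response", 360), ("high-volume response", 1612)]:
    cost = request_cost(PROMPT_TOKENS, completion, INPUT_PRICE, OUTPUT_PRICE)
    print(f"{label}: ~${cost:.6f} per request")
```

At identical per-token rates, the higher-volume response is several times more expensive per request; actual provider pricing differs between the two models, which is why the observed gap in the table is narrower.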
Use Cases
Given the performance disparity observed in this coding benchmark:
- Anthropic: Claude Opus 4.6 is recommended for production-grade coding tasks, architectural planning, and complex refactoring where precision is non-negotiable.
- Google: Gemini 3.1 Pro Preview may be better suited for exploratory tasks or scenarios where large-scale text generation is preferred over strict code logic, though it currently lags behind in this specific coding benchmark.
Verdict
The comparative analysis clearly favors Anthropic: Claude Opus 4.6 for coding-heavy applications. Its ability to maintain high accuracy and instruction adherence makes it the superior choice for developers looking for reliability in their AI-assisted coding workflows.