Overview
In the rapidly evolving landscape of Large Language Models, choosing the right tool for software development is critical. Our latest evaluation, Coding Performance with 10 Evaluators, provides a side-by-side comparison of two industry leaders: Anthropic: Claude Opus 4.6 and xAI: Grok 4. This benchmark focuses on the ability of these models to interpret complex technical instructions and produce accurate, executable code.
Benchmark Results
The comparative evaluation reveals a significant performance gap between the two models when subjected to rigorous coding scenarios. PeerLM's 10-evaluator panel assessed both models on their ability to handle real-world programming challenges.
| Model | Overall Score | Accuracy | Instruction Following |
|---|---|---|---|
| Anthropic: Claude Opus 4.6 | 8.95 | 8.95 | 8.95 |
| xAI: Grok 4 | 1.05 | 1.05 | 1.05 |
Criteria Breakdown
The evaluation was conducted using a comparative ranking methodology, which emphasizes the consistency and reliability of the model's output rather than simple keyword matching. Anthropic: Claude Opus 4.6 demonstrated superior capability in interpreting multifaceted coding prompts, maintaining high structural integrity in its responses. Conversely, xAI: Grok 4 struggled to meet the specific constraints provided by the 10 evaluators, resulting in a lower overall ranking.
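PeerLM's exact aggregation method is not described here, so as a minimal sketch only: the complementary scores (8.95 and 1.05, summing to 10) are consistent with averaging per-evaluator preference weights across the 10-member panel and scaling to a 0-10 range. The function and the sample preference values below are illustrative assumptions, not PeerLM's actual methodology.

```python
# Hypothetical sketch of a comparative-ranking aggregation (assumed, not
# PeerLM's actual method): each of 10 evaluators expresses a preference
# weight in [0, 1] for model A over model B; the mean preference is
# scaled to a 0-10 score, with model B receiving the complement.
def aggregate(preferences, scale=10.0):
    mean = sum(preferences) / len(preferences)
    return mean * scale, (1.0 - mean) * scale

# Illustrative panel: nine evaluators at 0.90, one at 0.85.
prefs = [0.90] * 9 + [0.85]
score_a, score_b = aggregate(prefs)
```

Under this assumed scheme, a panel mean of 0.895 reproduces the 8.95 / 1.05 split reported above.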
Cost & Latency
Understanding the economic trade-off is essential for developers integrating these models into production pipelines. Below is the cost breakdown for the evaluated requests:
- Anthropic: Claude Opus 4.6: Total cost of $0.040785 for 4 response cycles, with an average of 360 completion tokens per request.
- xAI: Grok 4: Total cost of $0.092487 for 4 response cycles, with an average of 1363 completion tokens per request.
While Grok 4 generated significantly more tokens on average, its performance on the coding tasks did not scale with the increased output volume. Claude Opus 4.6 delivered both the higher scores and the lower total cost, giving it a clearly better cost-per-task profile.
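The per-request figures above can be checked with simple division; the short script below derives cost per response cycle and cost per overall score point from the reported totals (the dictionary keys are illustrative labels, and the numbers are taken directly from this evaluation).

```python
# Back-of-the-envelope cost breakdown from the figures reported above.
models = {
    "Claude Opus 4.6": {"total_cost": 0.040785, "cycles": 4, "score": 8.95},
    "Grok 4":          {"total_cost": 0.092487, "cycles": 4, "score": 1.05},
}

for name, m in models.items():
    per_request = m["total_cost"] / m["cycles"]        # cost per response cycle
    per_point = m["total_cost"] / m["score"]           # cost per overall score point
    print(f"{name}: ${per_request:.6f}/request, ${per_point:.6f}/score point")
```

On these numbers, Claude Opus 4.6 costs roughly $0.0102 per request versus roughly $0.0231 for Grok 4, and the gap widens sharply when cost is normalized by score.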
Use Cases
Anthropic: Claude Opus 4.6 is currently recommended for complex software engineering tasks, including refactoring legacy codebases, writing unit tests, and designing architectural patterns where precision and instruction adherence are paramount. xAI: Grok 4, in its current state within this specific coding suite, may face challenges with highly structured technical prompts that require strict constraint satisfaction.
Verdict
The evaluation clearly identifies Anthropic: Claude Opus 4.6 as the superior performer for coding-related tasks. With a commanding score spread of 7.90 points (8.95 vs. 1.05), Claude Opus 4.6 provides more reliable and accurate code generation, proving to be the more cost-effective choice for developers prioritizing high-quality output.