Software engineering benchmarks where the winning diff is verified: tests, static analysis, and SWE-bench-style tasks, with full provenance on why a patch was accepted.

SWE-bench research

Bench: SWE-bench

Overview

SWE-bench (and Arkivist extensions) measure whether coding agents ship verified diffs. Multiple proposals compete, and tests and symbolic checks run before merge: the best-grounded patch wins, not the most eloquent explanation.
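To make the selection rule concrete, here is a minimal sketch of picking a winner among competing proposals. The `Proposal` type, its fields, and the tie-breaking rule are assumptions made for illustration, not the Arkivist implementation. The sketch only shows the idea stated above: verification results decide the winner, and the explanation text plays no part in ranking.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Proposal:
    """One candidate patch plus its verification results (hypothetical fields)."""
    patch_id: str
    explanation: str        # free-text rationale; deliberately ignored when ranking
    tests_passed: int
    tests_total: int
    static_checks_ok: bool  # stand-in for the "symbolic checks" in the text


def is_verified(p: Proposal) -> bool:
    """A proposal counts as verified only if every test and static check passes."""
    return p.static_checks_ok and p.tests_total > 0 and p.tests_passed == p.tests_total


def select_winner(proposals: Sequence[Proposal]) -> Optional[Proposal]:
    """Return the verified proposal that exercised the most tests, or None.

    Assumption: more passing tests means better grounding. Ties are broken
    by patch_id so the result is deterministic.
    """
    verified = [p for p in proposals if is_verified(p)]
    if not verified:
        return None
    return min(verified, key=lambda p: (-p.tests_total, p.patch_id))


candidates = [
    Proposal("a", "A long, persuasive essay on why this fix is right.", 9, 10, True),
    Proposal("b", "Fixes off-by-one.", 10, 10, True),
    Proposal("c", "", 10, 10, False),
]
winner = select_winner(candidates)
print(winner.patch_id if winner else "no verified patch")  # prints "b"
```

Proposal "a" has the longest explanation but fails a test, and "c" fails a static check, so only "b" is eligible.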

Why it matters

ArkivistCode customers cannot afford vibe-based commits. SWE-bench integration shows that our agents improve on real repositories, using the same claim networks that run in production code review.

Methodology

  • Standard SWE-bench Verified tasks plus Exercism-style micro-domains for fast iteration
  • hallucination_detection and citation_validation domains run on proposed changes
  • Bench-graded loops, with a human merge when the Lagrangian falls below threshold
  • Artifact viewer captures reasoning chains linked to graph entities
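The routing described in the list above could be sketched as follows. This is one reading under stated assumptions, not the production logic: the page does not define how the Lagrangian is computed, so it appears here as an opaque `score` supplied by the grader. The `GradedPatch` type, the `Decision` values, and the `REJECT` outcome for a failed domain check are all hypothetical. Only the two domain names come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict

# Domain names taken from the methodology list; how they are evaluated is not specified.
REQUIRED_DOMAINS = ("hallucination_detection", "citation_validation")


class Decision(Enum):
    AUTO_MERGE = "auto_merge"
    HUMAN_MERGE = "human_merge"
    REJECT = "reject"  # assumption: a failed domain check blocks the patch outright


@dataclass
class GradedPatch:
    """A proposed change after bench grading (hypothetical structure)."""
    patch_id: str
    domain_results: Dict[str, bool]  # domain name -> passed?
    score: float                     # opaque stand-in for the "Lagrangian"


def decide(patch: GradedPatch, threshold: float) -> Decision:
    """Route a graded patch: reject, hand to a human, or merge automatically."""
    # A missing domain result is treated as a failure.
    if not all(patch.domain_results.get(d, False) for d in REQUIRED_DOMAINS):
        return Decision.REJECT
    if patch.score < threshold:
        return Decision.HUMAN_MERGE
    return Decision.AUTO_MERGE


ok = {"hallucination_detection": True, "citation_validation": True}
print(decide(GradedPatch("p1", ok, 0.92), threshold=0.8).value)  # auto_merge
print(decide(GradedPatch("p2", ok, 0.55), threshold=0.8).value)  # human_merge
```

Keeping the threshold as a parameter rather than a constant makes it straightforward to tune per repository or per tenant.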

Results & next steps

We track resolve rate, regression introduction, and calibration of "patch will pass CI" predictions. Public summaries inform roadmap priorities; enterprise pilots wire private repos under tenant isolation.
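As an illustration of the three tracked quantities, the sketch below computes them from per-task outcomes. The `TaskOutcome` record is hypothetical. The page does not say which calibration measure is used, so the Brier score (mean squared error between the predicted probability and the 0/1 outcome) stands in here as one common choice; lower is better.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskOutcome:
    """Result of one benchmark task (hypothetical record layout)."""
    resolved: bool               # did the patch resolve the task?
    introduced_regression: bool  # did it break a previously passing test?
    predicted_pass_prob: float   # agent's stated probability that the patch passes CI
    passed_ci: bool              # what actually happened in CI


def _require_nonempty(outcomes: Sequence[TaskOutcome]) -> None:
    if not outcomes:
        raise ValueError("at least one outcome is required")


def resolve_rate(outcomes: Sequence[TaskOutcome]) -> float:
    """Fraction of tasks whose patch resolved the issue."""
    _require_nonempty(outcomes)
    return sum(o.resolved for o in outcomes) / len(outcomes)


def regression_rate(outcomes: Sequence[TaskOutcome]) -> float:
    """Fraction of tasks whose patch introduced a regression."""
    _require_nonempty(outcomes)
    return sum(o.introduced_regression for o in outcomes) / len(outcomes)


def brier_score(outcomes: Sequence[TaskOutcome]) -> float:
    """Mean squared gap between predicted pass probability and the CI result."""
    _require_nonempty(outcomes)
    return sum(
        (o.predicted_pass_prob - float(o.passed_ci)) ** 2 for o in outcomes
    ) / len(outcomes)


# Made-up outcomes, for illustration only; not benchmark results.
sample = [
    TaskOutcome(True, False, 0.9, True),
    TaskOutcome(True, True, 0.8, False),
    TaskOutcome(False, False, 0.2, False),
    TaskOutcome(True, False, 0.6, True),
]
print(resolve_rate(sample), regression_rate(sample), round(brier_score(sample), 4))
```

An agent that always predicts 0.5 scores 0.25 on the Brier measure, which gives a simple baseline to compare against.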

Arkivist Research

Updated February 1, 2026
