Software engineering benchmarks where the winning diff is verified — tests, static analysis, and SWE-bench-style tasks with full provenance on why a patch was accepted.

Bench: SWE-bench
Overview
SWE-bench (and Arkivist extensions) measure whether coding agents ship verified diffs. Multiple proposals compete; tests and symbolic checks run before merge — the best grounded patch wins, not the most eloquent explanation.
Why it matters
ArkivistCode customers cannot afford vibe-based commits. SWE-bench integration proves our agents improve on real repositories with the same claim networks used in production code review.
Methodology
- Standard SWE-bench Verified tasks plus Exercism-style micro-domains for fast iteration
hallucination_detectionandcitation_validationdomains on proposed changes- Bench-graded loops with human merge when Lagrangian falls below threshold
- Artifact viewer captures reasoning chains linked to graph entities
Results & next steps
We track resolve rate, regression introduction, and calibration on "patch will pass CI." Public summaries inform roadmap priorities; enterprise pilots wire private repos under tenant isolation.
Arkivist Research
Updated February 1, 2026





