SWE-bench

Overview

SWE-bench (and Arkivist extensions) measure whether coding agents ship verified diffs. Multiple proposals compete; tests and symbolic checks run before merge — the best grounded patch wins, not the most eloquent explanation.

Why it matters

ArkivistCode customers cannot afford vibe-based commits. SWE-bench integration proves our agents improve on real repositories with the same claim networks used in production code review.

Methodology

Standard SWE-bench Verified tasks plus Exercism-style micro-domains for fast iteration
hallucination_detection and citation_validation domains on proposed changes
Bench-graded loops with human merge when Lagrangian falls below threshold
Artifact viewer captures reasoning chains linked to graph entities

Results & next steps

We track resolve rate, regression introduction, and calibration on "patch will pass CI." Public summaries inform roadmap priorities; enterprise pilots wire private repos under tenant isolation.

Overview

Why it matters

Methodology

Results & next steps

Ready to pilot verifiable intelligence?