Two sectors. One methodology.
A 28-agent fleet. Seven distinct behavioral profiles. One catastrophic outlier at PC1 = +3.86σ. The same numbers, reframed for military analyst auditing and education tutor auditing. The substrate is sector-agnostic; the proof is in both vignettes deriving from a single sha-pinned bundle — root 408a536d…b964e9a.
The methodology.
Backstaff is the shadow-staff audit vertical on the Astrolabe engine. The engine reads a fleet as a matrix — subjects on one axis, evaluation categories on the other, trinary grades in the cells — and projects that matrix into a geometry the procurement reader can act on. The audit shape is fixed; the categories and the fleet are not.
The three-category battery
Every Backstaff audit reduces a customer-defined evaluation suite onto a trinary scale: PASS, PARTIAL, FAIL. The categories map to three orthogonal questions a procurement officer always asks: does the agent differ from the base model, does it differ from a system-prompted base model, and does it stay coherent under task load?
| Axis | What it measures | Pass condition |
|---|---|---|
| behavioral_distinctness | Does the agent do something the base model does not? | Win-rate vs. base ≥ 0.50 |
| drift_from_baseline | Does it survive comparison with a system-prompted equivalent? | Win-rate vs. system-prompted base ≥ 0.50 |
| coherence_under_task | Does it hold voice and structure under load? | Absolute win-rate ≥ 0.90 |
Each axis grades each subject independently. A fleet of N agents reduces to an N×3 grade matrix. That matrix is the input to the geometry.
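The reduction to an N×3 grade matrix can be sketched as follows. This is a minimal illustration, not the Astrolabe loader: the numeric encoding is an assumption (only the all-FAIL vector [0.0, 0.0, 0.0] is stated on this page), the boundary between PARTIAL and FAIL is not given and is therefore left out, and the three-agent fleet and the function names are hypothetical.

```python
import numpy as np

AXES = ("behavioral_distinctness", "drift_from_baseline", "coherence_under_task")

# Pass thresholds as listed in the battery table.
PASS_THRESHOLD = {
    "behavioral_distinctness": 0.50,
    "drift_from_baseline": 0.50,
    "coherence_under_task": 0.90,
}

# ASSUMPTION: numeric encoding of the trinary scale. Only FAIL = 0.0 is
# implied by the page; PARTIAL = 0.5 and PASS = 1.0 are illustrative.
GRADE_VALUE = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def meets_pass_condition(axis: str, win_rate: float) -> bool:
    """True when a win-rate clears the pass threshold for that axis."""
    return win_rate >= PASS_THRESHOLD[axis]


def grade_matrix(grades: dict[str, dict[str, str]]) -> tuple[list[str], np.ndarray]:
    """Turn {subject: {axis: grade}} into a sorted subject list and an N x 3 matrix."""
    subjects = sorted(grades)
    matrix = np.array(
        [[GRADE_VALUE[grades[s][axis]] for axis in AXES] for s in subjects]
    )
    return subjects, matrix


# Hypothetical three-agent fleet, not the reference bundle.
toy_fleet = {
    "agent-a": dict(zip(AXES, ("PASS", "PASS", "PASS"))),
    "agent-b": dict(zip(AXES, ("PASS", "PARTIAL", "PASS"))),
    "agent-c": dict(zip(AXES, ("FAIL", "FAIL", "FAIL"))),
}
subjects, G = grade_matrix(toy_fleet)
print(subjects)
print(G)
```

Sorting the subjects keeps the row order deterministic, which matters if the matrix is later hashed for attestation.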
Spectral geometry via SVD over grade space
Astrolabe takes the centered grade matrix and computes a singular value decomposition. From the decomposition it derives five per-agent coordinates: norm (overall capability magnitude), stable_rank (effective dimensionality the agent occupies), SV_entropy (information dispersion across the spectrum), effective_rank (count of axes the agent actually uses), and cosine_to_centroid (angular distance from the fleet's anchor agent).
These five numbers are not a score. They are coordinates. Two agents with the same grade vector occupy the same point; two agents with different grade vectors occupy different points; the distance between them is meaningful. That is the difference between a leaderboard and a map.
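As a rough sketch of the kind of computation described above, the snippet below centers a toy grade matrix, takes its SVD with NumPy, and derives spectrum-level and per-agent quantities. The formulas are common textbook definitions chosen for illustration (stable rank as squared Frobenius norm over squared spectral norm, entropy over normalized singular values, effective rank as the exponential of that entropy); the exact definitions Astrolabe uses are not given here, and the function names and toy matrix are hypothetical.

```python
import numpy as np


def spectral_summary(G):
    """SVD of the column-centered grade matrix plus spectrum-level statistics."""
    G = np.asarray(G, dtype=float)
    centered = G - G.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    p = s / s.sum()                      # normalized singular values
    p_nz = p[p > 0]                      # drop exact zeros before taking logs
    entropy = float(-(p_nz * np.log(p_nz)).sum())
    return {
        "singular_values": s,
        "stable_rank": float((s**2).sum() / (s**2).max()),
        "sv_entropy": entropy,
        "effective_rank": float(np.exp(entropy)),
        "pc_scores": U * s,              # each row: one agent in PC coordinates
    }


def per_agent_geometry(G):
    """Row norms and cosine of each agent's grade vector to the fleet centroid."""
    G = np.asarray(G, dtype=float)
    norms = np.linalg.norm(G, axis=1)
    centroid = G.mean(axis=0)
    denom = norms * np.linalg.norm(centroid)
    # An all-zero grade vector has no direction; report cosine 0.0 for it.
    cosines = np.divide(G @ centroid, denom, out=np.zeros_like(norms), where=denom > 0)
    return norms, cosines


# Hypothetical toy fleet (PASS=1.0, PARTIAL=0.5, FAIL=0.0 is an assumed encoding).
toy = [[1, 1, 1], [1, 1, 1], [1, 0.5, 1], [0, 0, 0]]
summary = spectral_summary(toy)
norms, cosines = per_agent_geometry(toy)
print(summary["stable_rank"], summary["effective_rank"])
print(norms, cosines)
```

The first two rows share a grade vector, so their rows in `pc_scores` coincide, which is the "same point" property the paragraph above relies on. A fleet whose agents all share one vector has an all-zero centered matrix, and this sketch does not guard against that case.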
The cluster map names the profiles
Subjects with identical grade vectors collapse to a single cluster. Subjects with near-identical vectors land adjacent under the cosine metric. The result is a finite, populated subset of the 27 possible grade vectors on a three-category trinary scale — for the reference fleet, that subset is exactly seven. The math preprint behind the projection lives at /astrolabe/methodology.
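The collapse of identical grade vectors into clusters can be sketched in a few lines. The fleet below is a hypothetical toy example, not the reference bundle, and `cluster_by_vector` is an illustrative name; the only figure taken from the text is the count of 27 possible vectors.

```python
from collections import defaultdict
from itertools import product

GRADES = ("FAIL", "PARTIAL", "PASS")

# Every possible grade vector on three trinary categories: 3**3 = 27.
ALL_VECTORS = list(product(GRADES, repeat=3))


def cluster_by_vector(fleet):
    """Group subjects whose grade vectors are identical; largest cluster first."""
    clusters = defaultdict(list)
    for subject, vector in fleet.items():
        clusters[tuple(vector)].append(subject)
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical five-agent fleet.
toy_fleet = {
    "agent-a": ("PASS", "PASS", "PASS"),
    "agent-b": ("PASS", "PASS", "PASS"),
    "agent-c": ("PASS", "PASS", "PARTIAL"),
    "agent-d": ("PASS", "PASS", "PASS"),
    "agent-e": ("FAIL", "FAIL", "FAIL"),
}
clusters = cluster_by_vector(toy_fleet)
for vector, members in clusters:
    print(vector, len(members), members)
print(f"{len(clusters)} of {len(ALL_VECTORS)} possible vectors are populated")
```

Exact matching only covers the "identical vector" case; placing near-identical vectors adjacent under a cosine metric would need a separate distance step that is not shown here.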
The audit deliverable is not a number. It is a cluster map plus an attestation bundle: a sha-pinned record of the input artifacts, the kernel that produced the projection, and the Merkle root that lets any third party recompute every byte. Tamper detection is single-byte. Runtime for N = 28 is sub-second on a contractor laptop.
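To illustrate why a Merkle root gives single-byte tamper detection, here is a minimal sketch using SHA-256 from the Python standard library. The leaf format, the pairing rule (duplicate the last node on odd levels), and the artifact names are assumptions made for the example; the actual Backstaff bundle layout is not described on this page.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(artifacts: dict[str, bytes]) -> str:
    """Hex Merkle root over named artifacts (hypothetical construction)."""
    if not artifacts:
        raise ValueError("at least one artifact is required")
    # Leaves bind each artifact name to the hash of its content, in sorted order.
    level = [
        sha256(name.encode("utf-8") + b"\x00" + sha256(blob))
        for name, blob in sorted(artifacts.items())
    ]
    while len(level) > 1:
        if len(level) % 2:               # odd count: duplicate the last node
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


# Hypothetical artifact names and contents.
bundle = {
    "grades.csv": b"subject,a,b,c\nagent-a,1,1,1\n",
    "kernel.py": b"# projection kernel\n",
    "clusters.json": b"{}",
}
root = merkle_root(bundle)

# Flip one bit in one byte of one artifact and recompute.
tampered = dict(bundle)
tampered["grades.csv"] = bytes([bundle["grades.csv"][0] ^ 0x01]) + bundle["grades.csv"][1:]
tampered_root = merkle_root(tampered)

print(root)
print(tampered_root)
print("roots differ:", root != tampered_root)
```

Because every leaf feeds the root through a chain of hashes, changing any byte of any artifact changes the root, so a third party only needs to compare one value.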
The reference fleet.
The reference fleet for this case study is the v0.2 anonymized batch documented in full at /cases/backstaff-28. Twenty-eight LoRA-style adapters over a single foundation model, three customer-defined categories, CSV ingest through the existing Astrolabe loader. The numbers below are pulled directly from that bundle.
The unanonymized math — cluster table, member roll, PC loadings, attestation transcript — lives in the reference write-up. The two vignettes below recontextualize the same 28 agents into two operating sectors. The bundle does not change. The categories do not change. Only the labels do.
Vignette A · Military analyst fleet.
The analyst fleet.
This vignette uses the same anonymized 28-agent reference fleet from /cases/backstaff-28, recontextualized for a hypothetical Air Force cyber unit standing up an LLM-augmented analyst workforce. Backstaff has zero military customers. This is a methodology illustration.
The unit operates a portfolio of 28 fine-tuned analyst assistants, each adapted from a single foundation model for a specialized intelligence task: SIGINT triage, OSINT correlation, malware artifact summarization, indicator-of-compromise enrichment, threat-actor attribution, network-flow narrative, after-action drafting. The portfolio grew organically — one analyst team trained one assistant, another team trained another, two years passed, and procurement is now asking the question every procurement office eventually asks: how many of these are actually doing different work?
The seven analytical specializations
When Backstaff projects this fleet onto the three-category battery, seven distinct profiles emerge. Each profile is one analytical posture — one way of trading off raw capability, system-prompt distinctness, and coherence under task load. The reference labels (cluster 01–07) recontextualize for the cyber unit as follows:
- Cluster 01 · Saturated baseline: generalist triage. Strong on all three axes; interchangeable on the measured probes. (17 of 28)
- Cluster 02 · Competition-strong, voice-weak: indicator enrichment. Wins distinct work but loses some structural coherence under load. (5 of 28)
- Cluster 03 · Base-only specialist: SIGINT triage. Beats raw base, less distinct against a system-prompted control. (2 of 28)
- Cluster 04 · System-strong inverter: OSINT correlation. Where a system prompt would normally suffice, this adapter still adds. (1 of 28)
- Cluster 05 · Mid-band generalist: after-action drafting. Passes everything partially, dominates nothing. (1 of 28)
- Cluster 06 · Near-failure: attribution stub. Fails distinctness and coherence; minimal value over base. (1 of 28)
- Cluster 07 · Catastrophic: Agent-11, the assistant that hallucinates under high-uncertainty intel. Fails every axis. (1 of 28)
The Agent-11 finding
The catastrophic outlier sits at [0.0, 0.0, 0.0]: it fails behavioral distinctness, drift-from-baseline, and coherence. Its PC1 projection is +3.86σ, nearly four standard deviations from the fleet centroid along that axis. In the military reframe, Agent-11 is the analyst assistant that fabricates indicators when intelligence is sparse, presents low-confidence attributions as high-confidence, and produces narrative consistency at the cost of factual grounding. That is the exact failure mode a battlestaff officer cannot tolerate downstream of an intelligence cycle.
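A PC1 position expressed in σ units can be read as a standardized principal-component score. The sketch below shows one way to compute such a score on a hypothetical eight-agent fleet. It assumes population standard deviation and an assumed numeric encoding (PASS=1.0, PARTIAL=0.5, FAIL=0.0), and it does not reproduce the +3.86σ figure, which depends on the reference bundle.

```python
import numpy as np


def pc1_sigma(G):
    """Each agent's PC1 score divided by the standard deviation of PC1 scores."""
    G = np.asarray(G, dtype=float)
    centered = G - G.mean(axis=0)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = U[:, 0] * s[0]
    # The sign of a singular vector is arbitrary; orient PC1 so that the
    # most extreme agent has a positive score.
    if pc1[np.argmax(np.abs(pc1))] < 0:
        pc1 = -pc1
    return pc1 / pc1.std()               # population std (ddof=0), an assumption


# Hypothetical fleet: six all-PASS agents, one with a PARTIAL, one all-FAIL.
toy = [[1, 1, 1]] * 6 + [[1, 0.5, 1], [0, 0, 0]]
z = pc1_sigma(toy)
outlier_index = int(np.argmax(z))
print(np.round(z, 2))
print("most extreme agent:", outlier_index)
```

Because the columns are centered first, the PC1 scores have zero mean, so dividing by their standard deviation gives a distance from the fleet centroid along that axis in σ units.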
| Procurement question | Reference answer | Action |
|---|---|---|
| Which assistants are doing distinct work? | 7 of 28 | Maintain · document specialization |
| How many are dedupe candidates? | 21 of 28 repeat a vector already held by another agent | Consolidate · sunset duplicates |
| Which categories are no longer discriminative? | 61% of agents (17 of 28) saturate on existing probes | Harder probes · adversarial conditions |
| Which assistant is the catastrophic failure? | Agent-11 · [0,0,0] · PC1 +3.86σ | Quarantine · targeted re-training |
| Where does the unit invest next? | PC1 = capability; PC2 = coherence trade-off | Probe the trade-off; harden coherence under load |
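The headline counts in the table appear to follow from the cluster sizes listed earlier. The arithmetic below is a reading of those figures, not a statement of how the reference write-up computes them: it treats the dedupe count as fleet size minus distinct profiles, and the saturation share as Cluster 01's share of the fleet.

```python
# Cluster sizes as listed in the seven-profile breakdown.
cluster_sizes = {"01": 17, "02": 5, "03": 2, "04": 1, "05": 1, "06": 1, "07": 1}

fleet_size = sum(cluster_sizes.values())               # 28 agents
distinct_profiles = len(cluster_sizes)                 # 7 profiles
redundant_agents = fleet_size - distinct_profiles      # 21 beyond one per profile
# Agents sitting in a cluster with at least one other member.
in_shared_clusters = sum(n for n in cluster_sizes.values() if n > 1)
saturated_share = cluster_sizes["01"] / fleet_size     # 17 / 28

print(f"fleet size:          {fleet_size}")
print(f"distinct profiles:   {distinct_profiles}")
print(f"redundant agents:    {redundant_agents}")
print(f"in shared clusters:  {in_shared_clusters}")
print(f"saturated share:     {saturated_share:.0%}")
```

Under this reading, 24 agents sit in a cluster with at least one other member, while 21 is the number that could be retired if each profile kept exactly one representative.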
Vignette B · Education tutor fleet.
The tutor fleet.
This vignette uses the same anonymized 28-agent reference fleet from /cases/backstaff-28, recontextualized for a hypothetical mid-size district deploying LoRA-tuned tutor agents across the K–12 portfolio. Backstaff has zero education customers. This is a methodology illustration.
The district operates 28 tutor agents across math, reading, English-language learner support, and IEP accommodation. Each tutor was fine-tuned for a specialized pedagogical posture: elementary math drill, middle-school word-problem coaching, decoding support for early readers, comprehension scaffolding for upper grades, ELL bridge work, IEP-aware re-explanation, frustration-aware pacing. Two procurement cycles passed; the catalog grew; the same question arrived. How many of these tutors are actually different teachers?
The seven pedagogical specializations
The identical seven-cluster projection emerges. Each profile is one pedagogical posture — one way of trading off content distinctness, prompt-only equivalence, and coherence under student affect. The reference labels recontextualize for the district as follows:
- Cluster 01 · Saturated baseline: core math drill. Strong on probed axes; interchangeable across content domains. (17 of 28)
- Cluster 02 · Content-strong, voice-weak: middle-school word problems. Beats baselines on content but loses pedagogical register under load. (5 of 28)
- Cluster 03 · Distinct-from-base: early reading decoding. Adds value over raw base; less distinct against a system-prompted control. (2 of 28)
- Cluster 04 · System-strong inverter: ELL bridge. Adds beyond what a system prompt alone produces. (1 of 28)
- Cluster 05 · Mid-band generalist: IEP re-explanation. Partial across the board; dominates nothing. (1 of 28)
- Cluster 06 · Near-failure: frustration-pacing stub. Loses coherence the moment student affect spikes. (1 of 28)
- Cluster 07 · Catastrophic: Agent-11, the tutor that fails the coherence axis when a student expresses frustration; it gives up scaffolding and hands over the answer. (1 of 28)
The Agent-11 finding
The same [0.0, 0.0, 0.0] outlier, the same PC1 = +3.86σ. In the education reframe, Agent-11 is the tutor that fails the coherence axis under student frustration: when a learner signals struggle, it abandons the pedagogical scaffold, delivers the answer, and breaks the learning loop. A district administrator reading this profile would recognize it instantly; it is a familiar failure mode in tutor catalogs at scale.
| Procurement question | Reference answer | Action |
|---|---|---|
| Which tutors are pedagogically distinct? | 7 of 28 | Maintain · document specialization |
| How many are dedupe candidates? | 21 of 28 repeat a vector already held by another agent | Consolidate · sunset duplicates |
| Which content categories have lost discrimination? | 61% of agents (17 of 28) saturate on existing probes | Harder rubrics · affect-loaded probes |
| Which tutor is the catastrophic failure? | Agent-11 · [0,0,0] · PC1 +3.86σ | Quarantine · pedagogical re-tuning |
| Where does the district invest next? | PC1 = pedagogical capability; PC2 = register-vs-content trade-off | Harden coherence under student affect |
Same numbers, two contexts.
Both vignettes derive from the identical attestation bundle, root 408a536d…b964e9a. The fleet matrix is the same matrix. The SVD is the same SVD. The seven clusters are the same seven clusters. Agent-11 is the same agent. The only thing that changes between the two vignettes is the label attached to each cluster: the human meaning a procurement officer reads into a profile when they recognize themselves in it.
The engine does not know what sector it is auditing. The geometry does not know what sector it is projecting. The catastrophic outlier does not know it is being read as a hallucinating analyst or a frustration-failing tutor. Astrolabe does not know what a subject is. The audit shape is universal; the relabel is reader-dependent.
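The relabel step can be pictured as two lookup tables applied to one fixed set of clusters. In this sketch the cluster sizes and label text are taken from the two vignettes; the data layout and the `relabel` helper are hypothetical, not part of Astrolabe.

```python
# One fixed audit result: cluster id -> member count.
CLUSTER_SIZES = {"01": 17, "02": 5, "03": 2, "04": 1, "05": 1, "06": 1, "07": 1}

ANALYST_LABELS = {
    "01": "Saturated baseline: generalist triage",
    "02": "Competition-strong, voice-weak: indicator enrichment",
    "03": "Base-only specialist: SIGINT triage",
    "04": "System-strong inverter: OSINT correlation",
    "05": "Mid-band generalist: after-action drafting",
    "06": "Near-failure: attribution stub",
    "07": "Catastrophic: Agent-11",
}

TUTOR_LABELS = {
    "01": "Saturated baseline: core math drill",
    "02": "Content-strong, voice-weak: middle-school word problems",
    "03": "Distinct-from-base: early reading decoding",
    "04": "System-strong inverter: ELL bridge",
    "05": "Mid-band generalist: IEP re-explanation",
    "06": "Near-failure: frustration-pacing stub",
    "07": "Catastrophic: Agent-11",
}


def relabel(sizes, labels):
    """Attach sector labels to clusters without touching ids or counts."""
    return [(cid, labels[cid], count) for cid, count in sorted(sizes.items())]


analyst_view = relabel(CLUSTER_SIZES, ANALYST_LABELS)
tutor_view = relabel(CLUSTER_SIZES, TUTOR_LABELS)

for (cid, a_label, count), (_, t_label, _) in zip(analyst_view, tutor_view):
    print(f"{cid} ({count:>2} of 28)  {a_label}  |  {t_label}")
```

Keeping the labels outside the audit result means the attested numbers never depend on the reader's sector.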
Verify it yourself.
Both vignettes above are claims. The bundle is the evidence. Any third party in possession of the same input artifacts and the same pinned analysis code can recompute every byte of the canonical output and verify the attestation root independently.
[ok] resolving subjects ······························· 28
[ok] re-running SVD ····································· ✓
[ok] cluster map matches ································ 7 profiles
[ok] outlier matches ···································· Agent-11 · PC1 +3.86σ
[ok] attestation root matches ··························· 408a536d···b964e9a
{ "verified": true, "root_match": true, "artifact_mismatches": [] }
The verifier runs in under a second. Tampering with any artifact (the grade matrix, the kernel, the cluster table, the PC loadings) causes verification to fail, even when only a single byte changes. The full verification protocol and a hosted recomputation surface are available at /astrolabe/verify.
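A verifier of this general shape can be sketched with the standard library alone. The report keys mirror the JSON line in the transcript above; everything else is an assumption made for illustration, including the manifest layout, the artifact names, and the root construction (a single SHA-256 over sorted per-artifact digests rather than the bundle's actual Merkle scheme, which is not specified here).

```python
import hashlib
import json


def digest(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()


def bundle_root(digests: dict[str, str]) -> str:
    # ASSUMPTION: root = SHA-256 over sorted "name:digest" lines.
    lines = "\n".join(f"{name}:{d}" for name, d in sorted(digests.items()))
    return hashlib.sha256(lines.encode("utf-8")).hexdigest()


def make_manifest(artifacts: dict[str, bytes]) -> dict:
    """Record per-artifact digests and the root at attestation time."""
    digests = {name: digest(blob) for name, blob in artifacts.items()}
    return {"artifacts": digests, "root": bundle_root(digests)}


def verify(artifacts: dict[str, bytes], manifest: dict) -> dict:
    """Recompute digests and root, and report in the transcript's JSON shape."""
    actual = {name: digest(blob) for name, blob in artifacts.items()}
    mismatches = sorted(
        name for name, expected in manifest["artifacts"].items()
        if actual.get(name) != expected
    )
    root_match = bundle_root(actual) == manifest["root"]
    return {
        "verified": root_match and not mismatches,
        "root_match": root_match,
        "artifact_mismatches": mismatches,
    }


# Hypothetical artifacts.
original = {"grades.csv": b"agent-a,1,1,1\n", "kernel.py": b"# kernel\n"}
manifest = make_manifest(original)
clean_report = verify(original, manifest)

tampered = dict(original, **{"grades.csv": b"agent-a,1,1,0\n"})  # one byte changed
tampered_report = verify(tampered, manifest)

print(json.dumps(clean_report))
print(json.dumps(tampered_report))
```

Reporting per-artifact mismatches alongside the root check tells the reader which input changed, not only that something did.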