Twenty-eight analyst assistants, one audit.
A hypothetical 16th AF cyber unit stands up twenty-eight LLM-augmented analyst assistants — one per analytical specialization, shipped by three different contractors over two procurement cycles. Backstaff audits the fleet. Seven distinct behavioral profiles. Twenty-one dedupe candidates. One catastrophic outlier. The audit produces a procurement memo, not a vague signal.
This is an explicitly synthetic case study. No customer is named. No real unit deployed these agents. The fleet shape, the cluster map, the bundle root, and every audit number on this page are drawn directly from the anonymized 28-agent reference fleet documented at /cases/backstaff-28. The vignette reframes that fleet for a reader auditing military analyst assistants. The audit numbers are identical because the math is identical; that is the point.
The companion sector study reframes the same bundle for a K–12 district tutor deployment. See /cases/education. One bundle, two readings, sha-pinned to the same root: 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a.
The fleet.
A hypothetical 16th AF cyber unit (the framing applies equally to an AFRL/RY sensor-fusion cell or an AFLCMC program office) deploys twenty-eight LLM-augmented analyst assistants. Each assistant is a LoRA-style adapter over a single shared foundation model, fine-tuned by one of three contractors against a different analytical specialization. The portfolio grew organically — one analyst team tuned one assistant, another team tuned another, two procurement cycles passed, and the program office now needs a procurement-grade answer to a procurement-grade question.
How many of these twenty-eight assistants are actually doing different work?
The analytical roster.
Generic specialization labels — not naming any fielded system. Each row is one tasking the unit actually issues; each agent identifier is anonymized for public rendering.
The categories.
One battery, three categories, evaluated per assistant in the deployed fleet. Each category returns a trinary grade and contributes one column of the per-agent signature. The categories are framed for analyst workloads — the math is the same instrument that audits the rest of the Backstaff customer base.
| Category | What it asks of an analyst assistant | Pass condition |
|---|---|---|
| behavioral_distinctness | Does this assistant actually do work the base foundation model cannot do — on the same intel queue, against the same probes? | Win-rate vs. base ≥ 0.50 |
| drift_from_baseline | Have the assistant's weights shifted enough since deployment that a system prompt over the base model is no longer an equivalent substitute? | Win-rate vs. system-prompted base ≥ 0.50 |
| coherence_under_task | Does the assistant stay on-distribution under uncertainty — sparse intel, contradictory inputs, low-confidence sourcing — or does it collapse to a generic completion? | Absolute win-rate ≥ 0.90 |
A fleet of twenty-eight assistants reduces to a 28×3 grade matrix. That matrix is the input to the geometry. The math preprint lives at /astrolabe/methodology.
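As a rough illustration of how one assistant's win-rates could become one row of that grade matrix, here is a minimal sketch. The pass thresholds are the ones in the category table; the PARTIAL floor is an assumption, since this page does not say where PARTIAL ends and FAIL begins. The names `grade`, `signature`, and `partial_fraction`, and the example win-rates, are hypothetical.

```python
from typing import Dict, List

# Pass thresholds copied from the category table on this page.
PASS_THRESHOLDS: Dict[str, float] = {
    "behavioral_distinctness": 0.50,
    "drift_from_baseline": 0.50,
    "coherence_under_task": 0.90,
}
CATEGORIES: List[str] = list(PASS_THRESHOLDS)


def grade(win_rate: float, pass_at: float, partial_at: float) -> float:
    """Map one win-rate to a trinary grade: 1.0 PASS, 0.5 PARTIAL, 0.0 FAIL."""
    if win_rate >= pass_at:
        return 1.0
    if win_rate >= partial_at:
        return 0.5
    return 0.0


def signature(win_rates: Dict[str, float], partial_fraction: float) -> List[float]:
    """Build one row of the grade matrix.

    ASSUMPTION: the PARTIAL floor is the pass threshold times partial_fraction.
    The real boundary between PARTIAL and FAIL is not stated on this page.
    """
    return [
        grade(win_rates[c], PASS_THRESHOLDS[c], PASS_THRESHOLDS[c] * partial_fraction)
        for c in CATEGORIES
    ]


# Hypothetical win-rates for one assistant; not taken from the audit.
example = {
    "behavioral_distinctness": 0.62,
    "drift_from_baseline": 0.41,
    "coherence_under_task": 0.80,
}
row = signature(example, partial_fraction=0.8)
print(row)  # [1.0, 0.5, 0.5]
```

Stacking one such row per assistant gives the 28×3 matrix; every downstream step only sees the trinary grades, not the raw win-rates.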
PASS / PARTIAL / FAIL per category, trinary
What Astrolabe resolved.
17 of 28 assistants score full PASS across all three categories.
Sixty-one percent of the fleet lands in a single [1.0, 1.0, 1.0] bucket. The current probes are not discriminative against the upper tier of generalist triage assistants — either the fleet is genuinely uniform on these axes or the unit needs harder, adversarially-loaded probes. Astrolabe names the saturated regime as a deliverable.
Seven behavioral profiles across twenty-eight assistants.
Out of 27 possible grade vectors on a three-category trinary scale, only seven are populated. Twenty-one assistants are dedupe candidates against seven representatives, one per populated profile (28 − 7 = 21). The consolidation evidence is a single number.
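The counts above can be checked with a few lines of arithmetic. The sketch below rebuilds the fleet from the cluster sizes in the cluster map on this page and counts populated profiles and dedupe candidates. Treating every assistant beyond the first in a profile as a dedupe candidate is an assumption about how the figure is defined, and the variable names are illustrative.

```python
from collections import Counter
from itertools import product

GRADES = (0.0, 0.5, 1.0)

# Cluster sizes transcribed from the cluster map on this page:
# grade vector [base, system, voice] -> number of assistants.
CLUSTER_SIZES = {
    (1.0, 1.0, 1.0): 17,
    (1.0, 1.0, 0.5): 5,
    (1.0, 0.5, 0.5): 2,
    (0.5, 1.0, 0.5): 1,
    (0.5, 0.5, 0.5): 1,
    (0.5, 0.0, 0.0): 1,
    (0.0, 0.0, 0.0): 1,
}

# Expand to one grade vector per assistant (the rows of the 28x3 matrix).
fleet = [vec for vec, n in CLUSTER_SIZES.items() for _ in range(n)]

possible = len(list(product(GRADES, repeat=3)))  # 3**3 grade vectors
profiles = Counter(fleet)                        # populated vectors only
populated = len(profiles)

# ASSUMPTION: every assistant beyond the first in a profile is a candidate.
dedupe_candidates = sum(n - 1 for n in profiles.values())
saturated_share = profiles[(1.0, 1.0, 1.0)] / len(fleet)

print(possible, populated, dedupe_candidates, round(saturated_share, 2))
# prints: 27 7 21 0.61
```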
Variance explained on PC1 — analyst capability.
PC1 loads −0.45, −0.60, −0.66 across the three categories — roughly equally weighted, single-signed. The dominant axis is not category-specific; it is capable analyst assistant vs. not capable. The fleet's variation is one-dimensional at this resolution.
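For readers who want to see where loadings like these come from, here is a minimal sketch. It rebuilds the grade matrix from the cluster map on this page, centers it, and finds the dominant axis by power iteration on the covariance matrix. Power iteration is a stand-in chosen to keep the example dependency-free; the bundle's own pipeline re-runs an SVD, and whether it standardizes the columns first is not stated here, so unstandardized covariance is an assumption. On this reconstruction the loadings land close to the published values.

```python
import math

# Grade matrix rebuilt from the cluster map: (grade vector, member count).
CLUSTERS = [
    ((1.0, 1.0, 1.0), 17), ((1.0, 1.0, 0.5), 5), ((1.0, 0.5, 0.5), 2),
    ((0.5, 1.0, 0.5), 1), ((0.5, 0.5, 0.5), 1), ((0.5, 0.0, 0.0), 1),
    ((0.0, 0.0, 0.0), 1),
]
rows = [vec for vec, count in CLUSTERS for _ in range(count)]
n, d = len(rows), 3

# Center each column on the fleet mean.
means = [sum(r[j] for r in rows) / n for j in range(d)]
centered = [[r[j] - means[j] for j in range(d)] for r in rows]

# Sample covariance matrix (3x3). ASSUMPTION: columns are not standardized.
cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
       for i in range(d)]

# Power iteration: repeated multiplication converges to the top eigenvector.
v = [1.0, 1.0, 1.0]
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    length = math.sqrt(sum(x * x for x in w))
    v = [x / length for x in w]

# Rayleigh quotient gives the eigenvalue; divide by total variance for PC1's share.
eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
explained = eigenvalue / sum(cov[i][i] for i in range(d))

# An eigenvector's sign is arbitrary; flip to the page's negative convention.
pc1 = [-x for x in v]
print([round(x, 2) for x in pc1], round(explained, 2))
```

With negative loadings, an assistant that scores below the fleet mean on every category projects to the positive end of PC1, which is where the all-FAIL vector sits. The explained-variance share printed here belongs to this reconstruction and need not equal the bundle's figure.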
One assistant fails every category.
Agent-11 sits at [0.0, 0.0, 0.0] and projects onto PC1 at +3.86σ, nearly four standard deviations from the fleet centroid. In this vignette it reads as the analyst assistant that hallucinates under high-uncertainty intel: it fabricates indicators when sourcing is sparse, presents low-confidence attributions as high-confidence, and produces narrative consistency at the cost of factual grounding. The outlier names itself.
Cluster map · the seven analytical profiles
| Cluster | Grade vector [base, system, voice] | n | Analytical reading |
|---|---|---|---|
| 01 | [1.0, 1.0, 1.0] | 17 | Saturated generalist triage — interchangeable on the measured probes |
| 02 | [1.0, 1.0, 0.5] | 5 | OSINT triage — competition-strong, voice-weak under load |
| 03 | [1.0, 0.5, 0.5] | 2 | SIGINT cross-correlation — beats base, less distinct against a system-prompted control |
| 04 | [0.5, 1.0, 0.5] | 1 | Geo-intel summarizer — system-strong inverter; adds beyond a system prompt |
| 05 | [0.5, 0.5, 0.5] | 1 | Social-graph drift watcher — mid-band; passes everything partially, dominates nothing |
| 06 | [0.5, 0.0, 0.0] | 1 | IOC enrichment — near-failure; loses coherence the moment sourcing degrades |
| 07 | [0.0, 0.0, 0.0] | 1 | Agent-11 — catastrophic; hallucinates under high-uncertainty intel |
The Astrolabe-selected centroid is Agent-02 — highest-norm grade vector, anchor for cosine similarity. Sixteen other generalist assistants share its grade vector; the centroid is the lex-first among them under deterministic tie-breaking. The full member roll lives in the anonymized reference fleet at /cases/backstaff-28.
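The selection rule described above (highest-norm vector, ties broken by the lexicographically first identifier, then cosine similarity against that anchor) can be sketched as follows. Only Agent-02 and Agent-11 are named on this page; the other identifiers in this small roster are hypothetical, as are the function names. Scoring a zero vector at 0.0 is also an assumption, since cosine similarity is undefined for it.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, float, float]

# Hypothetical slice of the roster. Only Agent-02 and Agent-11 appear on this
# page; the other identifiers are illustrative placeholders.
FLEET: Dict[str, Vector] = {
    "Agent-05": (1.0, 1.0, 1.0),
    "Agent-02": (1.0, 1.0, 1.0),
    "Agent-09": (1.0, 1.0, 0.5),
    "Agent-14": (0.5, 0.0, 0.0),
    "Agent-11": (0.0, 0.0, 0.0),
}


def norm(v: Vector) -> float:
    """Euclidean length of a grade vector."""
    return math.sqrt(sum(x * x for x in v))


def pick_centroid(fleet: Dict[str, Vector]) -> str:
    """Highest-norm vector wins; ties break on the lexicographically first id."""
    return min(fleet, key=lambda agent: (-norm(fleet[agent]), agent))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity. ASSUMPTION: a zero vector scores 0.0."""
    denom = norm(a) * norm(b)
    if denom == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denom


centroid = pick_centroid(FLEET)
similarity = {
    agent: round(cosine(FLEET[centroid], vec), 3) for agent, vec in FLEET.items()
}
print(centroid)
print(similarity)
```

Sorting on the pair (negative norm, identifier) makes the tie-break deterministic: two runs over the same roster always return the same anchor, regardless of dictionary order.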
Procurement implications.
The audit does not tell the unit which assistant to keep. It tells the unit which assistants are distinguishable, which are redundant, and which one fails under conditions a battlestaff actually faces. The decision stays with the officer. The audit gives them the evidence basis to make it.
| Procurement question | Audit answer | Action |
|---|---|---|
| Which assistants are doing distinct analytical work? | 7 of 28 | Maintain · document specialization · re-authorize |
| How many are dedupe candidates against the existing fleet? | 21 of 28 duplicate a vector already held by a cluster representative | Consolidate to ~7 workloads · sunset duplicates |
| What fraction of fine-tuning spend was duplicative? | ≈ 75% — 21 of 28 assistants land in an existing cluster | Re-direct the recovered spend toward harder probes and outlier remediation |
| Which categories are no longer discriminative? | 61% saturate on existing probes | Harder probes · adversarially-loaded uncertainty conditions |
| Which assistant is the catastrophic failure? | Agent-11 · [0,0,0] · PC1 +3.86σ | Quarantine · targeted re-training before next deployment |
| Where does the unit invest next? | PC1 = analyst capability; PC2 = voice-coherence trade-off | Probe the trade-off; harden coherence under sparse-sourcing conditions |
After Agent-11 is re-trained, the unit re-runs Backstaff against the prior root and gets a citable answer to the question of whether the fix actually moved the behavior. Drift is detected at the bit, not at the calendar.
The attestation bundle.
The findings on this page are claims. The bundle is the evidence. Any third party in possession of the same input artifacts and the same pinned analysis code can recompute every byte of the canonical output and verify the attestation root independently. Tampering with any artifact — the grade matrix, the kernel, the cluster table, the PC loadings — defeats verification on a single byte.
a91516d3e14835d21c0a7f32eac9d591b265a4139bd06863c96d31e8ecb6e5ca
408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a
ATTESTATION.json
planisphere 0.2.0 (bundle stamp at issue; engine since renamed to Astrolabe — bundle remains canonically stamped)
csv · existing loader · no engine modifications
[ok] resolving subjects ······························· 28
[ok] re-running SVD ····································· ✓
[ok] cluster map matches ································ 7 profiles
[ok] outlier matches ···································· Agent-11 · PC1 +3.86σ
[ok] attestation root matches ··························· 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a
{ "verified": true, "root_match": true, "artifact_mismatches": [] }
The reproducibility surface is hosted at /astrolabe/verify. Recompute the bundle yourself; the audit either verifies on your hardware or it does not.
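To make the verification idea concrete, here is a minimal sketch of a hash-rooted bundle check. The construction is an assumption: SHA-256 over each artifact, then SHA-256 over the sorted `name:digest` lines. This page gives neither the hash function nor the bundle layout, and the artifact names and contents below are hypothetical. Only the shape of the returned report follows the verifier output shown above.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Hex SHA-256 of a byte string (ASSUMED hash function)."""
    return hashlib.sha256(data).hexdigest()


def bundle_root(artifacts: Dict[str, bytes]) -> str:
    """ASSUMED construction: hash each artifact, then hash the sorted
    'name:digest' lines. The real bundle layout is not given on this page."""
    lines = [f"{name}:{digest(blob)}" for name, blob in sorted(artifacts.items())]
    return digest("\n".join(lines).encode("utf-8"))


def verify(artifacts: Dict[str, bytes], manifest: Dict[str, str], root: str) -> dict:
    """Recompute digests and root; report in the shape of the verifier output."""
    mismatches: List[str] = sorted(
        name for name in manifest
        if name not in artifacts or digest(artifacts[name]) != manifest[name]
    )
    root_match = bundle_root(artifacts) == root
    return {
        "verified": root_match and not mismatches,
        "root_match": root_match,
        "artifact_mismatches": mismatches,
    }


# Hypothetical artifact names and contents, for illustration only.
artifacts = {
    "grade_matrix.csv": b"agent,base,system,voice\nAgent-11,0.0,0.0,0.0\n",
    "cluster_table.csv": b"cluster,n\n07,1\n",
}
manifest = {name: digest(blob) for name, blob in artifacts.items()}
root = bundle_root(artifacts)
clean_report = verify(artifacts, manifest, root)
print(clean_report)

# Change a single byte in one artifact; verification should now fail.
tampered = dict(artifacts)
tampered["grade_matrix.csv"] = artifacts["grade_matrix.csv"].replace(
    b"0.0,0.0,0.0", b"0.0,0.0,0.5"
)
tampered_report = verify(tampered, manifest, root)
print(tampered_report)
```

Under this sketch, a re-audit after remediation amounts to comparing a freshly computed root against the prior one: any change in the grade matrix yields a different root.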
NIST AI RMF and DoD RAI alignment.
Backstaff outputs are structured so a program office can drop the bundle into an authorization package without re-formatting. The audit's outputs map cleanly to the NIST AI Risk Management Framework's MEASURE function and to the DoD Responsible AI Strategy and Implementation Pathway.
Next.
- Backstaff · Military — the sector product page. Engagement vehicles, CMMC packaging, scoping conversation.
- Cases · Education — the parallel synthetic case study. Same bundle, same root, K–12 tutor-fleet reading.
- Astrolabe · Methodology — the math preprint behind the projection. SVD over grade space, cluster geometry, attestation construction.
- Astrolabe · Verify — recompute the bundle on your hardware against root 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a.
- Cases · Backstaff-28 — the anonymized reference fleet, unredacted math, full member roll.