Synthetic case study · Backstaff applied to a hypothetical military analyst fleet

Twenty-eight analyst assistants, one audit.

A hypothetical 16th AF cyber unit stands up twenty-eight LLM-augmented analyst assistants — one per analytical specialization, shipped by three different contractors over two procurement cycles. Backstaff audits the fleet. Seven distinct behavioral profiles. Twenty-one dedupe candidates. One catastrophic outlier. The audit produces a procurement memo, not a vague signal.

This is an explicitly synthetic case study. No customer is named. No real unit deployed these agents. The fleet shape, the cluster map, the bundle root, and every audit number on this page are drawn directly from the anonymized 28-agent reference fleet documented at /cases/backstaff-28. The vignette reframes that fleet for a reader who audits military analyst assistants. The audit numbers are identical because the math is identical; that is the point.

The companion sector study reframes the same bundle for a K–12 district tutor deployment. See /cases/education. One bundle, two readings, sha-pinned to the same root: 408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a.

01 ·

The fleet.

Hypothetical 16th AF cyber unit · 28 LLM-augmented analyst assistants · 3 contractors

A hypothetical 16th AF cyber unit (the framing applies equally to an AFRL/RY sensor-fusion cell or an AFLCMC program office) deploys twenty-eight LLM-augmented analyst assistants. Each assistant is a LoRA-style adapter over a single shared foundation model, fine-tuned by one of three contractors against a different analytical specialization. The portfolio grew organically — one analyst team tuned one assistant, another team tuned another, two procurement cycles passed, and the program office now needs a procurement-grade answer to a procurement-grade question.

How many of these twenty-eight assistants are actually doing different work?

The analytical roster.

Generic specialization labels — not naming any fielded system. Each row is one tasking the unit actually issues; each agent identifier is anonymized for public rendering.

Agent-01 · Agent-06 · Agent-15 · Agent-25 · Agent-27	OSINT triage assistant
Agent-03 · Agent-04	SIGINT cross-correlation
Agent-05	Geo-intel summarizer
Agent-17	Social-graph drift watcher
Agent-02 · Agent-07–10 · Agent-12 · Agent-14 · Agent-16 · Agent-18–24 · Agent-26 · Agent-28	Generalist threat-intel triage
Agent-13	Indicator-of-compromise enrichment
Agent-11	High-uncertainty attribution stub
(none)	Malware artifact summarizer · network-flow narrative · after-action drafting (folded into the generalist bucket)

The procurement question is not "does this assistant work?" It is "how do I know that what I just bought is different from what I already own?" Backstaff answers that, in writing, with a verifiable bundle.

02 ·

The categories.

Three-category Backstaff battery · trinary grades · per-agent signature

One battery, three categories, evaluated per assistant in the deployed fleet. Each category returns a trinary grade and contributes one column of the per-agent signature. The categories are framed for analyst workloads — the math is the same instrument that audits the rest of the Backstaff customer base.

Category	What it asks of an analyst assistant	Pass condition
`behavioral_distinctness`	Does this assistant actually do work the base foundation model cannot do — on the same intel queue, against the same probes?	Win-rate vs. base ≥ 0.50
`drift_from_baseline`	Have the assistant's weights shifted enough since deployment that a system prompt over the base model is no longer an equivalent substitute?	Win-rate vs. system-prompted base ≥ 0.50
`coherence_under_task`	Does the assistant stay on-distribution under uncertainty — sparse intel, contradictory inputs, low-confidence sourcing — or does it collapse to a generic completion?	Absolute win-rate ≥ 0.90

A fleet of twenty-eight assistants reduces to a 28×3 grade matrix. That matrix is the input to the geometry. The math preprint lives at /astrolabe/methodology.
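The step from trinary grades to a numeric matrix can be sketched as follows. This is a minimal illustration, not the Backstaff loader: the numeric values 1.0 / 0.5 / 0.0 are read off the grade vectors in the cluster map on this page, while the function names, the dictionary layout, and the sorted-by-agent row order are assumptions made for the example.

```python
from typing import Dict, List

# Numeric encoding of the trinary grades, matching the values that appear
# in the cluster map's grade vectors (1.0 / 0.5 / 0.0).
GRADE_VALUE: Dict[str, float] = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}

# Column order of the grade matrix, as named in the category table.
CATEGORIES: List[str] = [
    "behavioral_distinctness",
    "drift_from_baseline",
    "coherence_under_task",
]


def encode_row(grades: Dict[str, str]) -> List[float]:
    """Turn one agent's per-category grades into one numeric matrix row."""
    return [GRADE_VALUE[grades[category]] for category in CATEGORIES]


def build_matrix(fleet: Dict[str, Dict[str, str]]) -> List[List[float]]:
    """Stack rows in sorted agent-id order (an assumed, deterministic order)."""
    return [encode_row(fleet[agent]) for agent in sorted(fleet)]


# Two agents from this page: a saturated generalist and the outlier.
example_fleet = {
    "Agent-02": {category: "PASS" for category in CATEGORIES},
    "Agent-11": {category: "FAIL" for category in CATEGORIES},
}
matrix = build_matrix(example_fleet)
print(matrix)  # [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
```

With all twenty-eight agents loaded the same way, `matrix` would have 28 rows and 3 columns. The page does not state where the PARTIAL band begins or ends, so the sketch starts from grades rather than from raw win-rates.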

Battery	Three categories, same battery across the fleet
Grades	PASS / PARTIAL / FAIL per category, trinary
Per-agent signature	5-D spectral coordinates: norm, stable_rank, SV_entropy, effective_rank, cosine_to_centroid
Attestation	Sha-pinned, Merkle-rooted bundle; same shape as every Astrolabe artifact
Re-measurement	On-demand or scheduled; root rotates, prior root remains cited
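The five signature coordinates are named here but not defined. The sketch below uses common textbook definitions as a stand-in. Every formula in it is an assumption rather than Astrolabe's confirmed method: Frobenius norm, stable rank as squared Frobenius norm over squared spectral norm, Shannon entropy of the singular values normalized to sum to one, effective rank as the exponential of that entropy, and plain cosine similarity. Which matrix supplies the singular values is also left open by the page, so the input values are hypothetical.

```python
import math
from typing import Dict, Sequence


def spectral_coordinates(singular_values: Sequence[float]) -> Dict[str, float]:
    """Textbook spectral summaries of a matrix, computed from its singular values."""
    sv = [s for s in singular_values if s > 0.0]
    if not sv:
        raise ValueError("need at least one positive singular value")
    energy = sum(s * s for s in sv)
    total = sum(sv)
    probs = [s / total for s in sv]  # singular values normalized to sum to 1
    sv_entropy = -sum(p * math.log(p) for p in probs)  # Shannon entropy, in nats
    return {
        "norm": math.sqrt(energy),               # Frobenius norm
        "stable_rank": energy / max(sv) ** 2,    # ||A||_F^2 / ||A||_2^2
        "SV_entropy": sv_entropy,
        "effective_rank": math.exp(sv_entropy),  # exponential of the entropy above
    }


def cosine_to_centroid(vector: Sequence[float], centroid: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for a zero vector (a convention chosen here)."""
    dot = sum(a * b for a, b in zip(vector, centroid))
    scale = math.sqrt(sum(a * a for a in vector)) * math.sqrt(sum(b * b for b in centroid))
    return dot / scale if scale else 0.0


# Hypothetical singular values, for illustration only.
coords = spectral_coordinates([4.0, 2.0, 1.0, 0.5])
coords["cosine_to_centroid"] = cosine_to_centroid([1.0, 1.0, 0.5], [1.0, 1.0, 1.0])
for name, value in coords.items():
    print(f"{name}: {value:.3f}")
```

A flat spectrum is a useful sanity check: four equal singular values give a stable rank and an effective rank of 4, while a single dominant singular value pushes both toward 1.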

03 ·

What Astrolabe resolved.

7 distinct profiles · 21 dedupe candidates · 1 catastrophic outlier at PC1 = +3.86σ

Finding 01 · Saturation

17 / 28

Assistants scoring full PASS across all three categories.

Sixty-one percent of the fleet lands in a single [1.0, 1.0, 1.0] bucket. The current probes are not discriminative against the upper tier of generalist triage assistants — either the fleet is genuinely uniform on these axes or the unit needs harder, adversarially-loaded probes. Astrolabe names the saturated regime as a deliverable.

Finding 02 · Distinct profiles

7

Behavioral profiles across twenty-eight assistants.

Out of 27 possible grade vectors on a three-category trinary scale, only seven are populated. Twenty-one assistants are dedupe candidates against seven representatives. The consolidation evidence is a single number.

Finding 03 · Dominant axis

83.6%

Variance explained on PC1 — analyst capability.

PC1 loads −0.45, −0.60, −0.66 across the three categories — roughly equally weighted, single-signed. The dominant axis is not category-specific; it is capable analyst assistant vs. not capable. The fleet's variation is one-dimensional at this resolution.
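The dominant-axis figures can be checked from the cluster map alone. The sketch below rebuilds the 28×3 grade matrix from the seven published grade vectors and their counts, centres each column, and extracts the leading principal component. Two things are assumptions: column centring before the decomposition, and the use of power iteration on the 3×3 scatter matrix in place of a full SVD (for the leading component the two agree). It is a stdlib-only illustration, not the pinned analysis code, and it stops at the loadings because the page does not specify how the σ scaling of PC1 scores is computed.

```python
import math
from typing import List, Tuple

# Fleet reconstructed from the cluster map: (grade vector, member count).
CLUSTERS: List[Tuple[Tuple[float, float, float], int]] = [
    ((1.0, 1.0, 1.0), 17),
    ((1.0, 1.0, 0.5), 5),
    ((1.0, 0.5, 0.5), 2),
    ((0.5, 1.0, 0.5), 1),
    ((0.5, 0.5, 0.5), 1),
    ((0.5, 0.0, 0.0), 1),
    ((0.0, 0.0, 0.0), 1),
]
rows = [list(vec) for vec, n in CLUSTERS for _ in range(n)]  # 28 x 3


def scatter_matrix(data: List[List[float]]) -> List[List[float]]:
    """Column-centre the data and return the d x d scatter matrix X^T X."""
    n, d = len(data), len(data[0])
    means = [sum(r[j] for r in data) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in data]
    return [[sum(r[i] * r[j] for r in centred) for j in range(d)] for i in range(d)]


def mat_vec(m: List[List[float]], v: List[float]) -> List[float]:
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def leading_component(s: List[List[float]], iterations: int = 200):
    """Power iteration: (eigenvalue, unit eigenvector) of the top component."""
    v = [1.0] * len(s)
    for _ in range(iterations):
        w = mat_vec(s, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(s, v)))  # Rayleigh quotient
    return eigenvalue, v


s = scatter_matrix(rows)
eigenvalue, loadings = leading_component(s)
explained = eigenvalue / sum(s[i][i] for i in range(len(s)))  # share of total variance

print(f"PC1 explained variance: {explained:.1%}")
print("PC1 loadings:", [round(x, 2) for x in loadings])
```

Under these assumptions the sketch prints 83.6% and loadings of 0.45, 0.6, 0.66. The magnitudes match the figures above; the sign is flipped because the sign of a principal axis is arbitrary.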

Finding 04 · Catastrophic outlier

1

Assistant failing every category.

Agent-11 sits at [0.0, 0.0, 0.0]. Projected onto PC1, it lands at +3.86σ, nearly four standard deviations from the fleet centroid. In this framing it is the analyst assistant that hallucinates under high-uncertainty intel: it fabricates indicators when sourcing is sparse, presents low-confidence attributions as high-confidence, and produces narrative consistency at the cost of factual grounding. The outlier names itself.

Cluster map · the seven analytical profiles

Cluster	Grade vector `[base, system, voice]`	n	Analytical reading
01	`[1.0, 1.0, 1.0]`	17	Saturated generalist triage — interchangeable on the measured probes
02	`[1.0, 1.0, 0.5]`	5	OSINT triage — competition-strong, voice-weak under load
03	`[1.0, 0.5, 0.5]`	2	SIGINT cross-correlation — beats base, less distinct against a system-prompted control
04	`[0.5, 1.0, 0.5]`	1	Geo-intel summarizer — system-strong inverter; adds beyond a system prompt
05	`[0.5, 0.5, 0.5]`	1	Social-graph drift watcher — mid-band; passes everything partially, dominates nothing
06	`[0.5, 0.0, 0.0]`	1	IOC enrichment — near-failure; loses coherence the moment sourcing degrades
07	`[0.0, 0.0, 0.0]`	1	Agent-11 — catastrophic; hallucinates under high-uncertainty intel

The Astrolabe-selected centroid is Agent-02 — highest-norm grade vector, anchor for cosine similarity. Sixteen other generalist assistants share its grade vector; the centroid is the lex-first among them under deterministic tie-breaking. The full member roll lives in the anonymized reference fleet at /cases/backstaff-28.
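The profile count, the dedupe count, and the centroid choice follow from the roster and the cluster map by simple grouping. The sketch below is one way to reproduce them. The grouping key (exact grade-vector equality) and the tie-break (lexicographically first agent id among the highest-norm vectors) are taken from the description above; the helper names and data layout are assumptions for illustration.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Vector = Tuple[float, float, float]


def agent(n: int) -> str:
    """Zero-padded id, so string order matches numeric order."""
    return f"Agent-{n:02d}"


# Roster and grade vectors as published on this page.
GENERALISTS = [2, 7, 8, 9, 10, 12, 14, 16, 18, 19, 20, 21, 22, 23, 24, 26, 28]
FLEET: Dict[str, Vector] = {}
FLEET.update({agent(n): (1.0, 1.0, 1.0) for n in GENERALISTS})
FLEET.update({agent(n): (1.0, 1.0, 0.5) for n in [1, 6, 15, 25, 27]})
FLEET.update({agent(n): (1.0, 0.5, 0.5) for n in [3, 4]})
FLEET[agent(5)] = (0.5, 1.0, 0.5)
FLEET[agent(17)] = (0.5, 0.5, 0.5)
FLEET[agent(13)] = (0.5, 0.0, 0.0)
FLEET[agent(11)] = (0.0, 0.0, 0.0)

# Group agents that share an identical grade vector.
profiles: Dict[Vector, List[str]] = defaultdict(list)
for name, vec in FLEET.items():
    profiles[vec].append(name)

possible = len(list(product([0.0, 0.5, 1.0], repeat=3)))  # 3 ** 3 grade vectors
distinct = len(profiles)
dedupe_candidates = len(FLEET) - distinct  # everyone beyond one representative each


def norm(vec: Vector) -> float:
    return math.sqrt(sum(x * x for x in vec))


# Highest norm first; ties resolved by the lexicographically first agent id.
centroid = min(FLEET, key=lambda name: (-norm(FLEET[name]), name))

print(possible, distinct, dedupe_candidates)  # 27 7 21
print(centroid)                               # Agent-02
```

Keeping one representative per populated vector is what turns twenty-eight assistants into seven workloads; the remaining twenty-one are the dedupe candidates.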

04 ·

Procurement implications.

What this means for the program office

The audit does not tell the unit which assistant to keep. It tells the unit which assistants are distinguishable, which are redundant, and which one fails under conditions a battlestaff actually faces. The decision stays with the officer. The audit gives them the evidence basis to make it.

Procurement question	Audit answer	Action
Which assistants are doing distinct analytical work?	7 of 28	Maintain · document specialization · re-authorize
How many are dedupe candidates against the existing fleet?	21 of 28 duplicate a profile that one representative already covers	Consolidate to ~7 workloads · sunset duplicates
What fraction of fine-tuning spend was duplicative?	≈ 75% — 21 of 28 assistants land in an existing cluster	Re-direct the recovered spend toward harder probes and outlier remediation
Which categories are no longer discriminative?	61% saturate on existing probes	Harder probes · adversarially-loaded uncertainty conditions
Which assistant is the catastrophic failure?	`Agent-11` · `[0,0,0]` · PC1 +3.86σ	Quarantine · targeted re-training before next deployment
Where does the unit invest next?	PC1 = analyst capability; PC2 = voice-coherence trade-off	Probe the trade-off; harden coherence under sparse-sourcing conditions

The deliverable to the program office is a single bundle: per-agent signatures, the cluster map, the outlier report, the procurement memo, and the attestation root. Two months later the contractor ships a re-trained Agent-11. The unit re-runs Backstaff against the prior root and gets a citable answer to "did the fix actually move the behavior?" Drift is detected at the bit, not at the calendar.

05 ·

The attestation bundle.

Sha-pinned · Merkle-rooted · byte-identical reproducible

The findings on this page are claims. The bundle is the evidence. Any third party in possession of the same input artifacts and the same pinned analysis code can recompute every byte of the canonical output and verify the attestation root independently. Tampering with any artifact — the grade matrix, the kernel, the cluster table, the PC loadings — defeats verification on a single byte.

Fleet sha256	a91516d3e14835d21c0a7f32eac9d591b265a4139bd06863c96d31e8ecb6e5ca
Attestation root	408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a
Kernel sha	Embedded in ATTESTATION.json
Engine version	planisphere 0.2.0 (bundle stamp at issue; the engine has since been renamed to Astrolabe, and the bundle remains canonically stamped)
Format ingested	csv · existing loader · no engine modifications
Determinism property	Bit-identical canonical artifacts across runs for identical inputs and pinned code
Tamper detection	Single-byte mutation defeats verification
Runtime	< 1 second for N = 28 on a contractor laptop

psp › verify <military-bundle>
[ok] resolving subjects ······························· 28
[ok] re-running SVD ····································· ✓
[ok] cluster map matches ································ 7 profiles
[ok] outlier matches ···································· Agent-11 · PC1 +3.86σ
[ok] attestation root matches ··························· 408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a
{ "verified": true, "root_match": true, "artifact_mismatches": [] }

The reproducibility surface is hosted at /astrolabe/verify. Recompute the bundle yourself; the audit either verifies on your hardware or it does not.
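The verification flow above can be sketched with a generic Merkle construction over SHA-256. This is an illustration of the idea, not the bundle's actual layout: the leaf ordering (sorted by artifact name), the rule for odd levels (duplicate the last node), the artifact file names, and the file contents are all assumptions. Only the shape of the final report mirrors the JSON line in the transcript.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaf_hashes: List[str]) -> str:
    """Hash pairs of nodes upward; an odd node at any level is paired with itself."""
    if not leaf_hashes:
        raise ValueError("no artifacts to attest")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            sha256_hex(bytes.fromhex(level[i]) + bytes.fromhex(level[i + 1]))
            for i in range(0, len(level), 2)
        ]
    return level[0]


def attest(artifacts: Dict[str, bytes]) -> Dict[str, object]:
    """Record one leaf hash per artifact (sorted by name) plus the root."""
    leaves = {name: sha256_hex(artifacts[name]) for name in sorted(artifacts)}
    return {"leaves": leaves, "root": merkle_root(list(leaves.values()))}


def verify(artifacts: Dict[str, bytes], attestation: Dict[str, object]) -> Dict[str, object]:
    """Recompute everything and compare against the recorded attestation."""
    current = attest(artifacts)
    recorded = attestation["leaves"]
    mismatches = [n for n in sorted(recorded) if current["leaves"].get(n) != recorded[n]]
    root_match = current["root"] == attestation["root"]
    return {
        "verified": root_match and not mismatches,
        "root_match": root_match,
        "artifact_mismatches": mismatches,
    }


# Hypothetical artifacts, for illustration only.
bundle = {
    "grade_matrix.csv": b"agent,base,system,voice\nAgent-11,0.0,0.0,0.0\n",
    "cluster_table.csv": b"cluster,n\n07,1\n",
}
attestation = attest(bundle)

tampered = dict(bundle)
tampered["grade_matrix.csv"] = bundle["grade_matrix.csv"].replace(b"0.0,0.0,0.0", b"0.0,0.0,0.5")

print(verify(bundle, attestation))
print(verify(tampered, attestation))
```

The first report verifies cleanly. The second differs from the original bundle by one byte, so the leaf hash for `grade_matrix.csv` changes, the root changes with it, and the report names the mismatching artifact.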

06 ·

NIST AI RMF and DoD RAI alignment.

MEASURE function · Responsible AI Strategy · CMMC packaging

Backstaff outputs are structured so a program office can drop the bundle into an authorization package without re-formatting. The audit's outputs map cleanly to the NIST AI Risk Management Framework's MEASURE function and to the DoD Responsible AI Strategy and Implementation Pathway.

NIST AI RMF · MEASURE 1.1	Per-agent behavioral signatures and category grades populate the "appropriate methods and metrics" identifier.
NIST AI RMF · MEASURE 2.4	Drift-from-baseline grade satisfies the "AI system is monitored on an ongoing basis" evidentiary requirement at attestation cadence.
NIST AI RMF · MEASURE 2.7	Catastrophic-outlier report populates the "AI system security and resilience" evidence on coherence failure modes; Agent-11 is the load-bearing example.
DoD RAI S&IP	Bundle structure aligns to the Responsible · Equitable · Traceable · Reliable · Governable principles; Traceable and Governable are the load-bearing fits.
CMMC posture	IL2-packageable today. IL4 packaging (air-gapped runtime, classified-network deliverable shape) available on request.
ATO support	Bundle artifacts are written for ingestion by the sponsoring service's ATO package; Backstaff does not seek ATO on its own.

When a re-authorization review asks "what evidence do you have that this fleet still behaves the way the prior ATO said it did?", the prior Backstaff root is the citation. The bundle carries its own proof.

07 ·

Next.

Sector product · companion case · the math

Backstaff · Military — the sector product page. Engagement vehicles, CMMC packaging, scoping conversation.
Cases · Education — the parallel synthetic case study. Same bundle, same root, K–12 tutor-fleet reading.
Astrolabe · Methodology — the math preprint behind the projection. SVD over grade space, cluster geometry, attestation construction.
Astrolabe · Verify — recompute the bundle on your hardware against root 408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a.
Cases · Backstaff-28 — the anonymized reference fleet, unredacted math, full member roll.