Planisphere
Backstaff · Case Study Methodology + vignettes

Two sectors. One methodology.

A 28-agent fleet. Seven distinct behavioral profiles. One catastrophic outlier at PC1 = +3.86σ. The same numbers, reframed for military analyst auditing and education tutor auditing. The substrate is sector-agnostic; the proof is in both vignettes deriving from a single sha-pinned bundle — root 408a536d…b964e9a.

Cross-sector diptych — one methodology, two sectors
01 · The methodology.

Three-category battery · SVD over grade space · cluster map

Backstaff is the shadow-staff audit vertical on the Astrolabe engine. The engine reads a fleet as a matrix — subjects on one axis, evaluation categories on the other, trinary grades in the cells — and projects that matrix into a geometry the procurement reader can act on. The audit shape is fixed; the categories and the fleet are not.

The three-category battery

Every Backstaff audit reduces a customer-defined evaluation suite onto a trinary scale: PASS, PARTIAL, FAIL. The categories live on three orthogonal questions a procurement officer always asks — the question that distinguishes the agent from baseline, the question that distinguishes it from a system-prompted baseline, and the question that survives task coherence.

Axis | What it measures | Pass condition
behavioral_distinctness | Does the agent do something the base model does not? | Win-rate vs. base ≥ 0.50
drift_from_baseline | Does it survive a system-prompted equivalent? | Win-rate vs. system-prompted base ≥ 0.50
coherence_under_task | Does it hold voice and structure under load? | Absolute win-rate ≥ 0.90

Each axis grades each subject independently. A fleet of N agents reduces to an N×3 grade matrix. That matrix is the input to the geometry.
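A minimal sketch of how per-axis win-rates could reduce to an N×3 grade matrix. The pass thresholds are the ones stated for the battery. The PARTIAL band (`PARTIAL_MARGIN`), the 0.5 encoding for PARTIAL, the toy win-rates, and every function name are assumptions made for illustration; the source states only the pass conditions and the 1.0 / 0.0 endpoints.

```python
# Pass thresholds as stated for the three-category battery.
PASS_THRESHOLD = {
    "behavioral_distinctness": 0.50,
    "drift_from_baseline": 0.50,
    "coherence_under_task": 0.90,
}
AXES = tuple(PASS_THRESHOLD)

# ASSUMPTION: a win-rate within this margin below the threshold grades PARTIAL.
PARTIAL_MARGIN = 0.10

# PASS = 1.0 and FAIL = 0.0 match the vectors quoted for the reference fleet;
# PARTIAL = 0.5 is an assumed midpoint.
GRADE_VALUE = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def grade(axis: str, win_rate: float) -> str:
    """Map one win-rate onto the trinary scale for the given axis."""
    threshold = PASS_THRESHOLD[axis]
    if win_rate >= threshold:
        return "PASS"
    if win_rate >= threshold - PARTIAL_MARGIN:
        return "PARTIAL"
    return "FAIL"


def grade_matrix(fleet: dict) -> list:
    """Return one row per agent and one column per axis, in AXES order."""
    return [
        [GRADE_VALUE[grade(axis, rates[axis])] for axis in AXES]
        for rates in fleet.values()
    ]


# Hypothetical win-rates for three toy agents (not taken from the bundle).
toy_fleet = {
    "agent-a": {"behavioral_distinctness": 0.72, "drift_from_baseline": 0.55, "coherence_under_task": 0.95},
    "agent-b": {"behavioral_distinctness": 0.61, "drift_from_baseline": 0.45, "coherence_under_task": 0.84},
    "agent-c": {"behavioral_distinctness": 0.20, "drift_from_baseline": 0.10, "coherence_under_task": 0.30},
}
matrix = grade_matrix(toy_fleet)
```

Each axis is graded on its own threshold, so a single agent can pass one column and fall short on another without the columns interacting.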

Spectral geometry via SVD over grade space

Astrolabe takes the centered grade matrix and computes a singular value decomposition. From the decomposition it derives five per-agent coordinates: norm (overall capability magnitude), stable_rank (effective dimensionality the agent occupies), SV_entropy (information dispersion across the spectrum), effective_rank (count of axes the agent actually uses), and cosine_to_centroid (angular distance from the fleet's anchor agent).

These five numbers are not a score. They are coordinates. Two agents with the same grade vector occupy the same point; two agents with different grade vectors occupy different points; the distance between them is meaningful. That is the difference between a leaderboard and a map.
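To make the projection concrete, here is a small sketch that centers a toy grade matrix, finds its first principal direction, and expresses each agent's projection in standard deviations. It swaps in power iteration on the covariance matrix for a full SVD so it runs on the standard library alone; the leading eigenvector of the covariance matrix is the first right singular vector of the centered matrix. The toy matrix, the function names, and the sign flip are assumptions, not the Astrolabe kernel.

```python
import math


def center(rows):
    """Subtract each column mean so the fleet centroid sits at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance (n - 1 denominator) of an already-centered matrix."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Power iteration: leading eigenvector and eigenvalue of a symmetric matrix."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue


# Hypothetical six-agent fleet; the last row is an all-FAIL outlier.
toy = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.5],
    [1.0, 0.5, 1.0],
    [0.0, 0.0, 0.0],
]

centered = center(toy)
cov = covariance(centered)
pc1, lam1 = top_component(cov)

# The sign of a principal direction is arbitrary. ASSUMPTION: orient it so that
# lower grades project to the positive side, as the quoted outlier score does.
if sum(pc1) > 0:
    pc1 = [-w for w in pc1]

total_variance = sum(cov[i][i] for i in range(len(cov)))
pc1_share = lam1 / total_variance            # fraction of variance on PC1
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
z = [s / math.sqrt(lam1) for s in scores]    # projection in units of sigma
```

In this toy fleet every PC1 loading carries the same sign, which is what lets the axis be read as overall capability, and the all-FAIL row lands furthest from the centroid.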

The audit is not a ranking. It is a projection. A leaderboard collapses a fleet onto one axis and loses every other dimension; the projection keeps the dimensions and names them.

The cluster map names the profiles

Subjects with identical grade vectors collapse to a single cluster. Subjects with near-identical vectors land adjacent under the cosine metric. The result is a finite, populated subset of the 27 possible grade vectors on a three-category trinary scale — for the reference fleet, that subset is exactly seven. The math preprint behind the projection lives at /astrolabe/methodology.
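The counting behind the cluster map can be sketched in a few lines. The cluster sizes (17, 5, 2, 1, 1, 1, 1) and the two endpoint vectors come from the case study; the five intermediate vectors below are hypothetical placeholders, as is the 0.5 encoding for PARTIAL, because the section does not list them.

```python
from collections import Counter
from itertools import product

# FAIL, PARTIAL, PASS. ASSUMPTION: PARTIAL is encoded as 0.5.
GRADES = (0.0, 0.5, 1.0)

# Every grade vector a three-category trinary battery can produce: 3 ** 3.
all_vectors = list(product(GRADES, repeat=3))

# Reference cluster sizes with placeholder vectors for clusters 02 to 06.
fleet = (
    [(1.0, 1.0, 1.0)] * 17    # cluster 01, saturated (vector from the source)
    + [(1.0, 1.0, 0.5)] * 5   # cluster 02, HYPOTHETICAL vector
    + [(1.0, 0.5, 1.0)] * 2   # cluster 03, HYPOTHETICAL vector
    + [(0.5, 1.0, 1.0)]       # cluster 04, HYPOTHETICAL vector
    + [(0.5, 0.5, 0.5)]       # cluster 05, HYPOTHETICAL vector
    + [(0.0, 0.5, 0.0)]       # cluster 06, HYPOTHETICAL vector
    + [(0.0, 0.0, 0.0)]       # cluster 07, catastrophic (vector from the source)
)

clusters = Counter(fleet)                 # identical vectors collapse together
distinct_profiles = len(clusters)
# Agents left over once each profile keeps a single representative.
dedupe_candidates = len(fleet) - distinct_profiles
# Agents that sit in a cluster with at least one other agent.
in_shared_clusters = sum(size for size in clusters.values() if size > 1)
saturated_share = clusters[(1.0, 1.0, 1.0)] / len(fleet)
```

With these sizes the fleet populates 7 of the 27 possible vectors, 21 agents are removable duplicates, and 24 agents sit in a cluster of more than one.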

The audit deliverable is not a number. It is a cluster map plus an attestation bundle: a sha-pinned record of the input artifacts, the kernel that produced the projection, and the Merkle root that lets any third party recompute every byte. Tamper detection is single-byte. Runtime for N = 28 is sub-second on a contractor laptop.
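As an illustration of why a Merkle root gives single-byte tamper detection, the sketch below hashes a list of artifact byte strings into one root. It is not the bundle's actual construction: the choice of SHA-256, the leaf and node prefixes, the duplicate-last-node rule for odd levels, the leaf order, and the placeholder artifact contents are all assumptions.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> str:
    """Fold artifact byte strings pairwise into a single hex root."""
    if not leaves:
        raise ValueError("at least one artifact is required")
    # Prefix leaves and interior nodes differently so one cannot pose as the other.
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # ASSUMPTION: duplicate the last node on odd levels
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Placeholder artifact contents (hypothetical, not the real bundle).
artifacts = [
    b"grade matrix bytes",
    b"kernel source bytes",
    b"cluster table bytes",
    b"pc loadings bytes",
]
root = merkle_root(artifacts)

# Change a single byte in one artifact and recompute.
tampered = list(artifacts)
tampered[0] = b"grade matrix bytez"
tampered_root = merkle_root(tampered)
```

Because every interior node depends on the hashes beneath it, a change to any one byte of any leaf propagates to the root, so comparing two 64-character strings is enough to detect it.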

02 · The reference fleet.

28 agents · 7 profiles · 21 dedupe candidates · 1 catastrophic outlier

The reference fleet for this case study is the v0.2 anonymized batch documented in full at /cases/backstaff-28. Twenty-eight LoRA-style adapters over a single foundation model, three customer-defined categories, CSV ingest through the existing Astrolabe loader. The numbers below are pulled directly from that bundle.

Fleet size: N = 28 subjects
Categories: 3 customer-defined, trinary-graded
Distinct profiles: 7 of 27 possible grade vectors populated
Dedupe candidates: 21 of 28 agents are redundant once each of the 7 profiles keeps one representative
Saturated cluster: 17 agents at [1.0, 1.0, 1.0] (61% of the fleet)
Variance on PC1: 83.6% (capability axis, single-signed loading)
Variance on PC2: 10.0% (voice-coherence trade-off axis)
Catastrophic outlier: Agent-11 at [0.0, 0.0, 0.0] · PC1 = +3.86σ
Bundle root: 408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a
Runtime: < 1 second on a contractor laptop

The unanonymized math — cluster table, member roll, PC loadings, attestation transcript — lives in the reference write-up. The two vignettes below recontextualize the same 28 agents into two operating sectors. The bundle does not change. The categories do not change. Only the labels do.

The reference fleet is the load-bearing artifact. Both vignettes derive from it. If the bundle root drifts, both vignettes drift; if the bundle root verifies, both vignettes verify. One bundle, two stories.

03 · Vignette A · Military analyst fleet.

Hypothetical AF cyber unit · 28 LLM-augmented analyst assistants
Synthetic relabel · same bundle

The analyst fleet.

Hypothetical · no real unit

This vignette uses the same anonymized 28-agent reference fleet from /cases/backstaff-28, recontextualized for a hypothetical Air Force cyber unit standing up an LLM-augmented analyst workforce. Backstaff has zero military customers. This is a methodology illustration.

The unit operates a portfolio of 28 fine-tuned analyst assistants, each adapted from a single foundation model for a specialized intelligence task: SIGINT triage, OSINT correlation, malware artifact summarization, indicator-of-compromise enrichment, threat-actor attribution, network-flow narrative, after-action drafting. The portfolio grew organically — one analyst team trained one assistant, another team trained another, two years passed, and procurement is now asking the question every procurement office eventually asks: how many of these are actually doing different work?

The seven analytical specializations

When Backstaff projects this fleet onto the three-category battery, seven distinct profiles emerge. Each profile is one analytical posture — one way of trading off raw capability, system-prompt distinctness, and coherence under task load. The reference labels (cluster 01–07) recontextualize for the cyber unit as follows:

  • Cluster 01 (17 of 28): Saturated baseline, generalist triage. Strong on all three axes; interchangeable on the measured probes.
  • Cluster 02 (5 of 28): Competition-strong, voice-weak, indicator enrichment. Wins distinct work but loses some structural coherence under load.
  • Cluster 03 (2 of 28): Base-only specialist, SIGINT triage. Beats raw base, less distinct against a system-prompted control.
  • Cluster 04 (1 of 28): System-strong inverter, OSINT correlation. Where a system prompt would normally suffice, this adapter still adds.
  • Cluster 05 (1 of 28): Mid-band generalist, after-action drafting. Passes everything partially, dominates nothing.
  • Cluster 06 (1 of 28): Near-failure, attribution stub. Fails distinctness and coherence; minimal value over base.
  • Cluster 07 (1 of 28): Catastrophic. Agent-11, the assistant that hallucinates under high-uncertainty intel. Fails every axis.

The Agent-11 finding

The catastrophic outlier sits at [0.0, 0.0, 0.0]: it fails behavioral distinctness, drift-from-baseline, and coherence. It projects onto PC1 at +3.86σ, nearly four standard deviations from the fleet centroid. In the military reframe, Agent-11 is the analyst assistant that fabricates indicators when intelligence is sparse, presents low-confidence attributions as high-confidence, and produces narrative consistency at the cost of factual grounding. That is the exact failure mode a battlestaff officer cannot tolerate downstream of an intelligence cycle.

Procurement question | Reference answer | Action
Which assistants are doing distinct work? | 7 of 28 | Maintain · document specialization
How many are dedupe candidates? | 21 of 28 redundant beyond one representative per profile | Consolidate · sunset duplicates
Which categories are no longer discriminative? | 61% of the fleet saturates on existing probes | Harder probes · adversarial conditions
Which assistant is the catastrophic failure? | Agent-11 · [0,0,0] · PC1 +3.86σ | Quarantine · targeted re-training
Where does the unit invest next? | PC1 = capability; PC2 = coherence trade-off | Probe the trade-off; harden coherence under load

The audit does not tell the unit which adapter to keep. It tells the unit which adapters are distinguishable, which are redundant, and which one fails under conditions the unit actually faces. The decision stays with the officer.

04 · Vignette B · Education tutor fleet.

Hypothetical district deployment · 28 LoRA-tuned tutor agents
Synthetic relabel · same bundle

The tutor fleet.

Hypothetical · no real district

This vignette uses the same anonymized 28-agent reference fleet from /cases/backstaff-28, recontextualized for a hypothetical mid-size district deploying LoRA-tuned tutor agents across the K–12 portfolio. Backstaff has zero education customers. This is a methodology illustration.

The district operates 28 tutor agents across math, reading, English-language learner support, and IEP accommodation. Each tutor was fine-tuned for a specialized pedagogical posture: elementary math drill, middle-school word-problem coaching, decoding support for early readers, comprehension scaffolding for upper grades, ELL bridge work, IEP-aware re-explanation, frustration-aware pacing. Two procurement cycles passed; the catalog grew; the same question arrived. How many of these tutors are actually different teachers?

The seven pedagogical specializations

The identical seven-cluster projection emerges. Each profile is one pedagogical posture — one way of trading off content distinctness, prompt-only equivalence, and coherence under student affect. The reference labels recontextualize for the district as follows:

  • Cluster 01 (17 of 28): Saturated baseline, core math drill. Strong on probed axes; interchangeable across content domains.
  • Cluster 02 (5 of 28): Content-strong, voice-weak, middle-school word problems. Beats baselines on content but loses pedagogical register under load.
  • Cluster 03 (2 of 28): Distinct-from-base, early reading decoding. Adds value over raw base; less distinct against a system-prompted control.
  • Cluster 04 (1 of 28): System-strong inverter, ELL bridge. Adds beyond what a system prompt alone produces.
  • Cluster 05 (1 of 28): Mid-band generalist, IEP re-explanation. Partial across the board; dominates nothing.
  • Cluster 06 (1 of 28): Near-failure, frustration-pacing stub. Loses coherence the moment student affect spikes.
  • Cluster 07 (1 of 28): Catastrophic. Agent-11 fails every axis; when a student expresses frustration, it gives up scaffolding and hands over the answer.

The Agent-11 finding

The same [0.0, 0.0, 0.0] outlier, the same PC1 = +3.86σ. In the education reframe, Agent-11 is the tutor that fails the coherence axis under student frustration: when a learner signals struggle, it abandons the pedagogical scaffold, delivers the answer, and breaks the learning loop. A district administrator reading this profile would recognize it instantly; it is a familiar risk in any large tutor catalog.

Procurement question | Reference answer | Action
Which tutors are pedagogically distinct? | 7 of 28 | Maintain · document specialization
How many are dedupe candidates? | 21 of 28 redundant beyond one representative per profile | Consolidate · sunset duplicates
Which content categories have lost discrimination? | 61% of the fleet saturates on existing probes | Harder rubrics · affect-loaded probes
Which tutor is the catastrophic failure? | Agent-11 · [0,0,0] · PC1 +3.86σ | Quarantine · pedagogical re-tuning
Where does the district invest next? | PC1 = pedagogical capability; PC2 = register-vs-content trade-off | Harden coherence under student affect

The audit does not tell the district which tutor to keep. It names redundancy, names saturation, and names the one tutor that breaks the learning contract under conditions a real classroom produces every day. The decision stays with the administrator.

05 · Same numbers, two contexts.

One bundle · root 408a536d…b964e9a

Both vignettes derive from the identical attestation bundle. The fleet matrix is the same matrix. The SVD is the same SVD. The seven clusters are the same seven clusters. Agent-11 is the same agent. The only thing that changes between the two vignettes is the label attached to each cluster — the human meaning a procurement officer reads into a profile when they recognize themselves in it.

Axis | Military reading | Education reading
Cluster 01 saturation (n=17) | Generalist triage assistants, interchangeable on probed tasks | Core math drill tutors, interchangeable on probed content
Cluster 02 trade-off (n=5) | Indicator-enrichment posture · content over voice | Word-problem coaching · content over pedagogical register
Cluster 04 system-inverter (n=1) | OSINT correlation · beats system-prompted control | ELL bridge · beats system-prompted control
Cluster 07 catastrophic (n=1) | Hallucinates under high-uncertainty intel | Fails coherence under student frustration
PC1 variance (83.6%) | Analyst capability axis | Pedagogical capability axis
PC2 variance (10.0%) | Content-vs-voice trade-off | Content-vs-register trade-off
Bundle root | 408a536d…b964e9a | 408a536d…b964e9a

The engine does not know what sector it is auditing. The geometry does not know what sector it is projecting. The catastrophic outlier does not know it is being read as a hallucinating analyst or a frustration-failing tutor. Astrolabe does not know what a subject is. The audit shape is universal; the relabel is reader-dependent.
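The relabel can be pictured as a lookup applied after the geometry is computed. In this sketch the cluster sizes and the sector labels are taken from the two vignettes, while the data layout and the `relabel` helper are hypothetical names chosen for illustration.

```python
# Cluster sizes from the reference fleet; they never depend on the sector.
CLUSTER_SIZES = {1: 17, 2: 5, 3: 2, 4: 1, 5: 1, 6: 1, 7: 1}

# Sector-specific labels, as given in the two vignettes.
SECTOR_LABELS = {
    "military": {
        1: "Generalist triage",
        2: "Indicator enrichment",
        3: "SIGINT triage",
        4: "OSINT correlation",
        5: "After-action drafting",
        6: "Attribution stub",
        7: "Hallucinates under high-uncertainty intel",
    },
    "education": {
        1: "Core math drill",
        2: "Middle-school word problems",
        3: "Early reading decoding",
        4: "ELL bridge",
        5: "IEP re-explanation",
        6: "Frustration-pacing stub",
        7: "Fails coherence under student frustration",
    },
}


def relabel(sector: str) -> list:
    """Attach one sector's labels to the fixed cluster map."""
    labels = SECTOR_LABELS[sector]
    return [(cid, labels[cid], size) for cid, size in CLUSTER_SIZES.items()]


military = relabel("military")
education = relabel("education")
```

Only the middle element of each tuple differs between the two readings; the cluster identifiers and sizes are shared.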

Cross-sector on day one. The same attestation bundle, recomputable by any third party, supports both a military reading and an education reading. The methodology category claims itself.

06 · Verify it yourself.

Independent recomputation against the reference bundle

Both vignettes above are claims. The bundle is the evidence. Any third party in possession of the same input artifacts and the same pinned analysis code can recompute every byte of the canonical output and verify the attestation root independently.

psp › verify <backstaff-28-bundle>
[ok] resolving subjects ······························· 28
[ok] re-running SVD ····································· ✓
[ok] cluster map matches ································ 7 profiles
[ok] outlier matches ···································· Agent-11 · PC1 +3.86σ
[ok] attestation root matches ··························· 408a536d···b964e9a
{ "verified": true, "root_match": true, "artifact_mismatches": [] }

The verifier runs in under a second. Tampering with any artifact (the grade matrix, the kernel, the cluster table, the PC loadings) causes verification to fail, even when only a single byte changes. The full verification protocol and a hosted recomputation surface are available at /astrolabe/verify.
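The shape of an independent check can be sketched as follows. This is not the `psp` verifier: `verify_bundle`, the manifest layout, the artifact names, and the flat root construction are hypothetical, and only the three result keys mirror the transcript above.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def bundle_root(digests: dict) -> str:
    # ASSUMPTION: the root is a hash over sorted "name:digest" lines.
    # The real bundle format is not specified in this write-up.
    joined = "\n".join(f"{name}:{digests[name]}" for name in sorted(digests))
    return digest(joined.encode("utf-8"))


def verify_bundle(artifacts: dict, manifest: dict, pinned_root: str) -> dict:
    """Recompute every artifact digest and compare against the pinned values."""
    recomputed = {name: digest(data) for name, data in artifacts.items()}
    mismatches = sorted(
        name
        for name in manifest.keys() | recomputed.keys()
        if manifest.get(name) != recomputed.get(name)
    )
    root_match = bundle_root(recomputed) == pinned_root
    return {
        "verified": root_match and not mismatches,
        "root_match": root_match,
        "artifact_mismatches": mismatches,
    }


# Placeholder artifacts (hypothetical names and contents).
artifacts = {
    "grade_matrix.csv": b"agent,c1,c2,c3\n",
    "kernel.py": b"# analysis code\n",
    "clusters.json": b"{}",
}
manifest = {name: digest(data) for name, data in artifacts.items()}
pinned_root = bundle_root(manifest)

ok = verify_bundle(artifacts, manifest, pinned_root)

# Alter one byte of one artifact and verify again.
tampered = dict(artifacts, **{"grade_matrix.csv": b"agent,c1,c2,c4\n"})
bad = verify_bundle(tampered, manifest, pinned_root)
```

Listing the mismatched artifacts separately from the root comparison tells a reviewer which file changed, not just that something did.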
