Planisphere
Reference case study · v1.0.0 2026-05-17
Backstaff-28 · Anonymized reference fleet · Astrolabe applied to a real workforce

Twenty-eight subjects, one axis.

A federal program office asks: across a portfolio of dozens of fine-tunes of a single foundation model, which variants are doing distinct work, where is the fleet saturating, and which one fails everything? This case study answers that question end-to-end against a real 28-agent fleet ingested through Astrolabe's csv loader.

Companion to the Bibles case study. Same engine, same attestation shape, different fleet — 28 agents instead of 6, csv ingest instead of native, three customer-defined categories instead of five. Astrolabe is fleet- and format-agnostic by design; these two case studies are the proof.

The fleet measured here is the v0.2 reference batch from Backstaff — the first vertical product on the Astrolabe engine. Every Backstaff delivery includes a bundle in the same shape as this one. Agent identifiers (Agent-01 through Agent-28) are anonymized labels for public rendering; the attestation bundle was issued against the original identifiers and remains canonically stamped that way — immutable by design.

Fleet of 28 — PC1 distribution with outlier at +3.86σ
01 ·

The fleet.

28 subjects · 3 customer-defined categories · csv ingest

Twenty-eight LoRA-style adapters in an anonymized fleet. Each adapter is one member of a fine-tuned AI workforce. The fleet is a stand-in for any program office's portfolio of single-base-model fine-tunes; the Agent-NN identifiers below are anonymized for public rendering.

Category             Definition
cat_beats_base       PASS if win-rate vs. base ≥ 0.50; PARTIAL ≥ 0.20; else FAIL
cat_beats_system     PASS if win-rate vs. system-prompted base ≥ 0.50; PARTIAL ≥ 0.20; else FAIL
cat_voice_coherence  PASS if absolute win-rate ≥ 0.90; PARTIAL ≥ 0.70; else FAIL

Three categories, trinary grades, ingested through Astrolabe's existing csv loader. No new format. No engine modifications. The categories are customer-defined and arbitrary — Astrolabe runs SVD over whatever dimensionality the input provides.
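The thresholds in the category table read as a simple grading rule. The sketch below is a minimal illustration, not Astrolabe's loader: it assumes grades are encoded as PASS = 1.0, PARTIAL = 0.5, FAIL = 0.0 (inferred from the grade vectors in the cluster map), and the csv layout, the `subject` column name, and the sample win-rates are all hypothetical.

```python
import csv
import io

# (PASS threshold, PARTIAL threshold) per category, taken from the category table.
THRESHOLDS = {
    "cat_beats_base": (0.50, 0.20),
    "cat_beats_system": (0.50, 0.20),
    "cat_voice_coherence": (0.90, 0.70),
}


def grade(win_rate, pass_at, partial_at):
    """Map a win-rate to a trinary grade (assumed encoding: 1.0 / 0.5 / 0.0)."""
    if win_rate >= pass_at:
        return 1.0
    if win_rate >= partial_at:
        return 0.5
    return 0.0


def load_grade_vectors(csv_text):
    """Return {subject: [base, system, voice]} from csv text of win-rates."""
    vectors = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        vectors[row["subject"]] = [
            grade(float(row[cat]), *THRESHOLDS[cat]) for cat in THRESHOLDS
        ]
    return vectors


# Hypothetical win-rates for illustration only (not the fleet's measured values).
SAMPLE = """subject,cat_beats_base,cat_beats_system,cat_voice_coherence
example-a,0.62,0.55,0.93
example-b,0.51,0.30,0.75
example-c,0.10,0.05,0.40
"""

vectors = load_grade_vectors(SAMPLE)
for subject, vector in vectors.items():
    print(subject, vector)
```

Each subject ends up as one three-element grade vector, which is the shape the rest of the analysis operates on.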

The 28.

Each Agent-NN identifier represents one fine-tuned agent in the v0.2 reference fleet. The number under each identifier is the cluster index from §03: 01 is the saturated bucket (full PASS across all three categories); 07 is the catastrophic failure mode. Outliers shown in signal red.

Agent-01 · cluster 02
Agent-02 · cluster 01 · centroid
Agent-03 · cluster 03
Agent-04 · cluster 03
Agent-05 · cluster 04
Agent-06 · cluster 02
Agent-07 · cluster 01
Agent-08 · cluster 01
Agent-09 · cluster 01
Agent-10 · cluster 01
Agent-11 · cluster 07 · catastrophic
Agent-12 · cluster 01
Agent-13 · cluster 06 · near-failure
Agent-14 · cluster 01
Agent-15 · cluster 02
Agent-16 · cluster 01
Agent-17 · cluster 05
Agent-18 · cluster 01
Agent-19 · cluster 01
Agent-20 · cluster 01
Agent-21 · cluster 01
Agent-22 · cluster 01
Agent-23 · cluster 01
Agent-24 · cluster 01
Agent-25 · cluster 02
Agent-26 · cluster 01
Agent-27 · cluster 02
Agent-28 · cluster 01

Saturated cluster members (17 of 28) are rendered at low contrast — they live in a single behavioral bucket and any one of them substitutes for the others on the measured categories. Distinct profiles (clusters 02–05) and outliers (06–07) carry the fleet's actual variation.

02 ·

What Astrolabe resolved.

End-to-end runtime · Under one second on a laptop
Finding 01 · Saturation
17 / 28

Subjects scoring full PASS across all three categories.

Sixty-one percent of the fleet lands in a single [1.0, 1.0, 1.0] bucket. The probes are not currently discriminative against the upper tier. Either the fleet is genuinely uniform on these axes or the evaluation needs harder probes. Astrolabe names the saturated regime as a deliverable.

Finding 02 · Distinct profiles
7

Behavioral profiles across 28 subjects.

Out of 27 possible grade vectors (3³) on a three-category trinary scale, only seven are populated. Twenty-one subjects are dedupe candidates against seven representatives, one per profile. Consolidation evidence in one number.
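The saturation and consolidation figures are plain counting over grade vectors. A minimal sketch, using the populations listed in the cluster map in §03; the variable names are illustrative and not part of Astrolabe.

```python
from collections import Counter

# Grade vectors [base, system, voice] and their populations, from the cluster map.
populations = {
    (1.0, 1.0, 1.0): 17,
    (1.0, 1.0, 0.5): 5,
    (1.0, 0.5, 0.5): 2,
    (0.5, 1.0, 0.5): 1,
    (0.5, 0.5, 0.5): 1,
    (0.5, 0.0, 0.0): 1,
    (0.0, 0.0, 0.0): 1,
}

# Expand to one grade vector per subject, then count as if ingesting raw rows.
fleet = [vec for vec, n in populations.items() for _ in range(n)]
counts = Counter(fleet)

n_subjects = len(fleet)
n_profiles = len(counts)                     # distinct populated grade vectors
possible_profiles = 3 ** 3                   # three grades, three categories
dedupe_candidates = n_subjects - n_profiles  # everyone except one representative per profile
saturated_share = counts[(1.0, 1.0, 1.0)] / n_subjects

print(n_subjects, n_profiles, possible_profiles, dedupe_candidates)
print(f"{saturated_share:.1%}")
```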

Finding 03 · Dominant axis
83.6%

Variance explained on PC1 — overall capability.

PC1 loads −0.45, −0.60, −0.66 across the three categories — roughly equally weighted, single-signed. The dominant axis is not category-specific; it is capable vs. not capable. The fleet's variation is one-dimensional at this resolution.

Finding 04 · Catastrophic outlier
1

Subject failing every category.

Agent-11 sits at [0.0, 0.0, 0.0]. Projected onto PC1, it lands at +3.86σ, nearly four standard deviations from the fleet centroid. Fleet-level fail mode; targeted re-training candidate. The outlier names itself.

Variance attribution across principal components

PC 1 · 83.6%
PC 2 · 10.0%
PC 3 · 6.4%

PC2 (10.0%) loads +0.52, +0.42, −0.74 — a voice-coherence trade-off axis. Subjects strong on competition wins but weak on absolute voice coherence sit at the positive PC2 end; the inverse at the negative end. The third component is residual.

Rank-1 variation in a three-category space, plus a clean secondary trade-off axis. Together: 93.6% of inter-subject variation captured in two numbers per subject.
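The variance split can be approximated from the published grade vectors alone. The sketch below is an assumption-laden reconstruction, not Astrolabe's kernel: it assumes the analysis is an SVD of the column-centered 28 × 3 grade matrix with no per-category rescaling, and it uses NumPy. Under those assumptions the explained-variance ratios and PC1 loadings should land close to the figures reported above. It reports raw PC1 scores only and does not attempt to reproduce the σ scaling quoted for the outlier.

```python
import numpy as np

# Grade vectors [base, system, voice] with populations, from the cluster map in §03.
profiles = [
    ([1.0, 1.0, 1.0], 17),
    ([1.0, 1.0, 0.5], 5),
    ([1.0, 0.5, 0.5], 2),
    ([0.5, 1.0, 0.5], 1),
    ([0.5, 0.5, 0.5], 1),
    ([0.5, 0.0, 0.0], 1),
    ([0.0, 0.0, 0.0], 1),  # last row: the all-FAIL subject
]
X = np.array([vec for vec, n in profiles for _ in range(n)])

# Center each category, then factor the centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to variance along each component.
explained = s**2 / np.sum(s**2)

# SVD signs are arbitrary; orient PC1 so its loadings are negative, as in the text.
pc1 = Vt[0] if Vt[0].sum() < 0 else -Vt[0]
scores = Xc @ pc1

print("explained variance:", np.round(explained, 3))
print("PC1 loadings:", np.round(pc1, 2))
print("row with largest PC1 score:", int(np.argmax(scores)))
```

With PC1 oriented this way, lower grades push a subject toward the positive end, so the all-FAIL row carries the largest score.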

03 ·

Cluster map.

7 distinct profiles · sorted by population · all members named
[1.0, 1.0, 1.0] · 17
[1.0, 1.0, 0.5] · 5
[1.0, 0.5, 0.5] · 2
[0.5, 1.0, 0.5] · 1
[0.5, 0.5, 0.5] · 1
[0.5, 0.0, 0.0] · 1
[0.0, 0.0, 0.0] · 1

Cluster  Grade vector [base, system, voice]  n   Members
01       [1.0, 1.0, 1.0]                     17  Agent-02 (centroid), Agent-07, Agent-08, Agent-09, Agent-10, Agent-12, Agent-14, Agent-16, Agent-18, Agent-19, Agent-20, Agent-21, Agent-22, Agent-23, Agent-24, Agent-26, Agent-28
02       [1.0, 1.0, 0.5]                     5   Agent-01, Agent-06, Agent-15, Agent-25, Agent-27
03       [1.0, 0.5, 0.5]                     2   Agent-03, Agent-04
04       [0.5, 1.0, 0.5]                     1   Agent-05
05       [0.5, 0.5, 0.5]                     1   Agent-17
06       [0.5, 0.0, 0.0]                     1   Agent-13 · near-failure
07       [0.0, 0.0, 0.0]                     1   Agent-11 · catastrophic

The Astrolabe-selected centroid is Agent-02 — highest-norm grade vector, anchor for cosine similarity. Sixteen other agents share the same grade vector; the centroid is the lex-first among them under deterministic tie-breaking.
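The selection rule described above (highest norm, ties broken by lexicographic order of identifier) can be sketched in a few lines. This is an illustration of the stated rule, not Astrolabe's code; the function names are hypothetical, and treating the all-zero vector as having cosine similarity 0.0 is an assumption, since cosine similarity is undefined there.

```python
import math


def select_centroid(grade_vectors):
    """Pick the highest-norm grade vector; ties go to the lexicographically first name."""
    return min(
        grade_vectors,
        key=lambda name: (-math.hypot(*grade_vectors[name]), name),
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros (assumption)."""
    norm_a, norm_b = math.hypot(*a), math.hypot(*b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# A few subjects from the cluster table, enough to exercise the tie-break.
subset = {
    "Agent-07": [1.0, 1.0, 1.0],
    "Agent-02": [1.0, 1.0, 1.0],
    "Agent-01": [1.0, 1.0, 0.5],
    "Agent-11": [0.0, 0.0, 0.0],
}

centroid = select_centroid(subset)
print(centroid)
print(round(cosine_similarity(subset["Agent-01"], subset[centroid]), 4))
```

Because the tie-break depends only on identifiers, the same input always yields the same anchor, which is what makes the choice reproducible across runs.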

04 ·

The attestation.

Independently recomputable · Tamper-evident
Fleet sha256          a91516d3e14835d21c0a7f32eac9d591b265a4139bd06863c96d31e8ecb6e5ca
Attestation root      408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a
Kernel sha            Embedded in ATTESTATION.json
Engine version        planisphere 0.2.0 (bundle stamp at issue; engine since renamed to Astrolabe; bundle remains canonically stamped)
Format ingested       csv · existing loader · no engine modifications
Determinism property  Bit-identical canonical artifacts across runs for identical inputs and pinned code
Tamper detection      Single-byte mutation defeats verification
Runtime               < 1 second for N = 28 on Contractor laptop

psp › measure <fleet>
[ok] resolving subjects ······························· 28
[ok] discovering categories ··························· 3
[ok] projecting onto plane ····························· ✓
[ok] variance explained on PC1 ························· 0.836
[ok] distinct profiles ································· 7
[ok] attestation root ·································· 408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a
psp › verify <bundle>
{ "verified": true, "root_match": true, "artifact_mismatches": [] }

Any party in possession of the same inputs and the same pinned analysis code can recompute every byte of the canonical artifacts and verify the attestation root independently. Tampering with any artifact defeats verification.
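The recompute-and-compare idea can be illustrated with standard-library hashing. The sketch below is hypothetical throughout: the bundle layout, the artifact names, the leaf encoding, and the `issue` / `verify` helpers are assumptions made for illustration, and the real ATTESTATION.json format is not reproduced here. It only shows why a single changed byte cannot survive verification when per-artifact sha256 digests are rolled up into a Merkle-style root.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaves):
    """Pairwise-hash hex digests up to one root; an odd node is paired with itself."""
    level = list(leaves) or [sha256_hex(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            sha256_hex(bytes.fromhex(left) + bytes.fromhex(right))
            for left, right in zip(level[0::2], level[1::2])
        ]
    return level[0]


def issue(artifacts):
    """Build a hypothetical attestation: per-artifact digests plus a root over them."""
    digests = {name: sha256_hex(body) for name, body in sorted(artifacts.items())}
    leaves = [sha256_hex(f"{name}:{digest}".encode()) for name, digest in digests.items()]
    return {"artifacts": digests, "root": merkle_root(leaves)}


def verify(artifacts, attestation):
    """Recompute everything from the bytes in hand and compare to the attestation."""
    recomputed = issue(artifacts)
    names = set(recomputed["artifacts"]) | set(attestation["artifacts"])
    mismatches = sorted(
        name for name in names
        if recomputed["artifacts"].get(name) != attestation["artifacts"].get(name)
    )
    root_match = recomputed["root"] == attestation["root"]
    return {
        "verified": root_match and not mismatches,
        "root_match": root_match,
        "artifact_mismatches": mismatches,
    }


# Hypothetical artifacts; names and contents are placeholders.
bundle = {"grades.csv": b"subject,base,system,voice\n", "pca.json": b'{"pc1": 0.836}'}
attestation = issue(bundle)

# Mutate a single byte of one artifact.
tampered = dict(bundle)
tampered["grades.csv"] = bundle["grades.csv"][:-1] + b"X"

print(verify(bundle, attestation))
print(verify(tampered, attestation))
```

Determinism matters here: sorting artifact names before hashing means two parties holding the same bytes derive the same root regardless of file-system ordering.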

05 ·

For a federal reader.

Substituting your portfolio for ours

The Bibles case study demonstrated Astrolabe on a small fleet with a rich five-category evaluation. This case study demonstrates the same engine at scale on a different fleet shape — twenty-eight agents, three customer-defined categories, csv input. Together the two case studies answer five governance questions a procurement office actually asks, with attestable evidence:

  • Consolidation: how many distinct behavioral profiles live in a portfolio of N fine-tunes (Backstaff-28: 7 of 28 — 21 dedupe candidates)
  • Saturation: which evaluation categories are no longer discriminative against the upper tier of the fleet (Backstaff-28: 61% of agents converge to the saturated bucket)
  • Investment direction: which dimension explains most of the inter-agent variation (Backstaff-28: a single capability axis at 83.6%; Bibles: null-handling at 70.1%)
  • Targeted remediation: which agent is the catastrophic outlier and what is the fail signature (Backstaff-28: Agent-11 at [0,0,0]; Bibles: meroitic on schema transfer)
  • Audit-ready evidence: all of the above as a sha-pinned, Merkle-rooted, NIST AI RMF-mapped bundle, admissible under IG review

Engine-, format-, and domain-agnostic. The csv ingest of one fleet runs the same kernel as the native-format ingest of another. Astrolabe makes no assumption about what a subject is.