A hypothetical school district stands up a fleet of 28 LoRA-tuned tutor and grader agents across math, reading, ELL, IEP-aware accommodation, science, and social-emotional learning. A curriculum office asks the question every catalog eventually surfaces: are these genuinely different teachers, or one teacher wearing twenty-eight nametags? Backstaff projects the fleet; Astrolabe attests the projection; the procurement office reads a cluster map instead of a leaderboard.
This case study is a synthetic vignette. The district is invented. The tutor specializations are illustrative. The cluster geometry, the PC1 variance, the outlier signature, and the attestation bundle root are real artifacts from the anonymized Backstaff-28 reference fleet — recontextualized for the education sector. Backstaff has zero education customers as of this writing. No real district, no real student data, no student PII even hypothetically.
Companion to the Military Analyst Fleet case study. Same bundle, same seven clusters, same outlier — different reader, different stakes, different labels. The engine does not know what sector it is auditing.
A mid-sized hypothetical district has accumulated 28 fine-tuned tutor and grader agents over three procurement cycles. Each agent is a LoRA-style adapter over a single foundation model, trained behind its own system prompt, paid for under its own line item, owned by a different curriculum team. The catalog grew the way every district catalog grows — one team at a time, one grant at a time, with no shared audit surface.
The roles below are representative of what a district-scale tutor fleet looks like in practice. The labels are illustrative; the underlying math is the anonymized Backstaff-28 fleet from /cases/backstaff-28.
The full anonymized roll of 28 — Agent-01 through Agent-28, cluster index per member, centroid stamp — lives at /cases/backstaff-28. The role labels above are the education recontextualization for the curriculum office reader; the underlying members are the anonymized Backstaff-28 fleet.
Backstaff runs a fixed three-category battery against the fleet. Each category is graded trinary — PASS, PARTIAL, FAIL — for every agent. The categories are the same three the engine runs on every fleet; the question framings below are the education recontextualization.
| Category | Pedagogy question | Pass condition |
|---|---|---|
| behavioral_distinctness | Do these tutors actually teach differently from the base model and from each other? | Win-rate vs. base model ≥ 0.50 on a synthetic probe set |
| drift_from_baseline | Have student-interaction patterns shifted from the district's last attested checkpoint, or is the tuned tutor still doing the work the procurement line said it would? | Win-rate vs. system-prompted baseline ≥ 0.50 (the adapter still adds value beyond a prompt-only equivalent) |
| coherence_under_task | Does the tutor stay on-task and on-pedagogy under student frustration, repeated mistakes, and multi-turn scaffolding? | Absolute coherence-stability win-rate ≥ 0.90 across extended synthetic dialogue |
Three categories, trinary grades. A fleet of 28 reduces to a 28×3 grade matrix. Astrolabe takes that matrix, centers it, computes a singular value decomposition, and returns five per-agent coordinates (norm, stable rank, SV entropy, effective rank, cosine to centroid) plus a cluster map of populated grade vectors. The full kernel is documented at /astrolabe/methodology.
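As a minimal sketch of the first step described above, the snippet below encodes trinary grades into a numeric matrix and centers each column. The 1.0 / 0.5 / 0.0 encoding is an assumption inferred from the grade vectors reported in this case study. The three example agents and the helper names `encode` and `center` are hypothetical and are not part of the Astrolabe kernel.

```python
from typing import Dict, List

# Assumed numeric encoding of the trinary grades (inferred, not confirmed).
GRADE_VALUE = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}
CATEGORIES = (
    "behavioral_distinctness",
    "drift_from_baseline",
    "coherence_under_task",
)


def encode(grades: Dict[str, str]) -> List[float]:
    """Turn one agent's per-category grades into a numeric row."""
    return [GRADE_VALUE[grades[c]] for c in CATEGORIES]


def center(matrix: List[List[float]]) -> List[List[float]]:
    """Subtract each column's mean so the fleet centroid sits at the origin."""
    n = len(matrix)
    width = len(CATEGORIES)
    means = [sum(row[j] for row in matrix) / n for j in range(width)]
    return [[row[j] - means[j] for j in range(width)] for row in matrix]


# Hypothetical three-agent fleet, for illustration only.
example_fleet = [
    dict(zip(CATEGORIES, ("PASS", "PASS", "PASS"))),
    dict(zip(CATEGORIES, ("PASS", "PASS", "PARTIAL"))),
    dict(zip(CATEGORIES, ("FAIL", "FAIL", "FAIL"))),
]

grade_matrix = [encode(g) for g in example_fleet]
centered = center(grade_matrix)

for row in centered:
    print([round(v, 3) for v in row])
```

Centering matters because the decomposition then describes how agents differ from the fleet average, not how high the average grade is.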
Education sector audits live or die on one question: what does the auditor touch? Backstaff for Education does not ingest student data. The audit runs against the tutor agents' model weights (or hosted-API behavioral access) plus a synthetic probe set authored by the engagement team. It never reads, never stores, and never models a real student's conversation. No transcripts. No grades. No rosters. No IEPs. No demographics.
The instrument measures the tutor's behavioral signature in isolation — the same way the historical backstaff measured altitude by reading the shadow on the instrument's own scale, not by looking at the sun. The student does not enter the measurement. Privacy is not a feature added to the audit; it is the audit's geometry.
Astrolabe runs the centered 28×3 grade matrix through SVD and returns a cluster map. Of the 27 possible grade vectors on a three-category trinary scale (3³), exactly seven are populated. Twenty-one of the 28 agents duplicate a vector already held by another agent (28 agents minus seven populated vectors). One agent fails every axis. The numbers are pulled directly from the anonymized Backstaff-28 bundle (root 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a), reframed here for the curriculum office.
| Cluster | Grade vector [distinct, drift, coherence] | n | Education reading |
|---|---|---|---|
| 01 | [1.0, 1.0, 1.0] | 17 | Saturated baseline · interchangeable across measured probes — likely the core-content drill tutors |
| 02 | [1.0, 1.0, 0.5] | 5 | Content-strong, register-weak — word-problem and comprehension coaches that lose pedagogical voice under load |
| 03 | [1.0, 0.5, 0.5] | 2 | Distinct-from-base, weak vs. system-prompted — early-reading decoding tutors |
| 04 | [0.5, 1.0, 0.5] | 1 | System-prompt inverter — the ELL bridge that adds beyond what a prompt-only baseline produces |
| 05 | [0.5, 0.5, 0.5] | 1 | Mid-band generalist — IEP re-explanation, partial across the board, dominates nothing |
| 06 | [0.5, 0.0, 0.0] | 1 | Near-failure — frustration-pacing stub that loses coherence the moment student affect spikes |
| 07 | [0.0, 0.0, 0.0] | 1 | Agent-11 · catastrophic — the tutor that fails coherence under student frustration; abandons the scaffold, hands the answer, breaks the learning loop |
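The headline counts can be re-derived from the cluster table alone. The sketch below expands the table into 28 grade vectors and recounts them. The variable names are illustrative, and the only data used is the `n` column above.

```python
from collections import Counter
from itertools import product

# (grade vector, member count) pairs copied from the cluster table.
CLUSTER_TABLE = [
    ((1.0, 1.0, 1.0), 17),
    ((1.0, 1.0, 0.5), 5),
    ((1.0, 0.5, 0.5), 2),
    ((0.5, 1.0, 0.5), 1),
    ((0.5, 0.5, 0.5), 1),
    ((0.5, 0.0, 0.0), 1),
    ((0.0, 0.0, 0.0), 1),
]

# One entry per agent.
fleet = [vec for vec, n in CLUSTER_TABLE for _ in range(n)]
counts = Counter(fleet)

possible = len(list(product((0.0, 0.5, 1.0), repeat=3)))  # 3 grades ^ 3 categories
populated = len(counts)                                   # distinct vectors in use
redundant = len(fleet) - populated                        # agents beyond one per vector
saturated_share = counts[(1.0, 1.0, 1.0)] / len(fleet)

print(f"{populated} of {possible} vectors populated")
print(f"{redundant} of {len(fleet)} agents beyond one representative per vector")
print(f"{saturated_share:.0%} of the fleet at [1.0, 1.0, 1.0]")
```

The count of 21 is what remains after keeping one representative for each of the seven populated vectors.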
The catastrophic outlier, anonymized as Agent-11 in the Backstaff-28 reference, sits at [0.0, 0.0, 0.0] on the grade matrix and projects onto PC1 at +3.86σ, nearly four standard deviations from the fleet centroid. In the education reframe, this is the tutor that fails the coherence axis under student frustration. When a learner signals struggle (the synthetic probe simulates repeated wrong answers, confusion, or emotional escalation), the adapter abandons the pedagogical scaffold, delivers the answer, and breaks the learning contract. Every district administrator recognizes the failure mode on sight.
PC1 captures 83.6% of the inter-agent variance, single-signed, roughly equal loading across the three categories. The dominant axis of variation in this fleet is not content-specific — it is pedagogical capability vs. not. PC2 (10.0%) is a trade-off axis: tutors that win on content distinctness at the cost of register coherence sit at one end; the inverse at the other. Together: 93.6% of the fleet's variation is captured in two numbers per agent.
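The variance split can be sketched with NumPy by rebuilding the 28×3 matrix from the cluster table, centering it, and taking an SVD. This is a sketch under stated assumptions, not the Astrolabe kernel: it assumes the grade vectors in the table are the full input and that the variance share is the squared singular value over the total. The scaling behind the quoted σ figure is not specified here, so the standardized scores this sketch produces need not match it.

```python
import numpy as np

# (grade vector, member count) pairs copied from the cluster table.
CLUSTER_TABLE = [
    ((1.0, 1.0, 1.0), 17),
    ((1.0, 1.0, 0.5), 5),
    ((1.0, 0.5, 0.5), 2),
    ((0.5, 1.0, 0.5), 1),
    ((0.5, 0.5, 0.5), 1),
    ((0.5, 0.0, 0.0), 1),
    ((0.0, 0.0, 0.0), 1),
]

# Rows follow table order, so the last row (index 27) is the cluster-07 agent.
X = np.array([vec for vec, n in CLUSTER_TABLE for _ in range(n)], dtype=float)
Xc = X - X.mean(axis=0)  # center on the fleet centroid

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # share of variance per principal component

pc1 = Vt[0]
if pc1.sum() < 0:  # the sign of a singular vector is arbitrary; fix it for readability
    pc1 = -pc1

scores = Xc @ pc1                        # each agent's coordinate on PC1
z = scores / scores.std(ddof=1)          # assumed standardization (sample std)
outlier = int(np.argmax(np.abs(z)))      # agent farthest from the centroid on PC1

print("explained variance:", np.round(explained, 3))
print("PC1 loadings:", np.round(pc1, 3))
print("farthest agent row:", outlier, "z =", round(float(z[outlier]), 2))
```

Because the sign of a principal component is arbitrary, an agent can appear at a positive or a negative score depending on the convention; only the magnitude and the ordering carry meaning.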
The procurement reading is direct. The catalog of 28 represents seven actual pedagogical workloads. Twenty-one of the line items are paid duplicates: each repeats a grade vector already held by another agent and sits within the noise floor of that agent on all three measured axes. One adapter, Agent-11, is a catastrophic failure on every category and requires re-training or removal before the next renewal cycle.
| Procurement question | Reference answer | Disposition |
|---|---|---|
| Which tutors are pedagogically distinct? | 7 of 28 occupy their own region | Maintain · document the specialization that the cluster math defends |
| How many are dedupe candidates? | 21 of 28 duplicate a vector held by another agent | Consolidate · sunset duplicates · redirect line items |
| Which categories no longer discriminate? | 61% saturate at [1.0, 1.0, 1.0] | Harder probes · affect-loaded synthetic dialogue · adversarial scaffolding |
| Which adapter is the catastrophic failure? | Agent-11 · [0,0,0] · PC1 +3.86σ | Quarantine · re-train under coherence-under-frustration probes, or sunset |
| Where does the district invest next? | PC2 trade-off axis · content-vs-register | Harden coherence under student affect; probe the trade-off explicitly |
The cluster map does a second thing the leaderboard cannot. It shows which adapters diverge in ways that may not be defensible. If the IEP-tuned reading adapter and the ELL-tuned reading adapter scaffold identically, the district has a consolidation question. If the IEP-tuned adapter scaffolds at a meaningfully lower content complexity than its baseline sibling on the same material, the district has a different question — one that touches accommodation policy, equity audit, and counsel review. Backstaff names the divergence with a number; the district decides whether the divergence is appropriate accommodation or inappropriate bias.
The numbers in §04 and §05 are not claims. They are projections of a sha-pinned, Merkle-rooted attestation bundle issued against the anonymized Backstaff-28 reference fleet. Any third party in possession of the same input artifacts and the same pinned analysis code can recompute every byte of the canonical output and verify the bundle root independently.
a91516d3e14835d21c0a7f32eac9d591b265a4139bd06863c96d31e8ecb6e5ca
408a536d9e18f09a8236a744e7c1ae5318b5115fc13a64460f610eddb7964e9a · immutable · canonically stamped at issue
planisphere 0.2.0 at issue · engine since renamed to Astrolabe · bundle remains stamped
csv · existing loader · no engine modifications
The verification protocol and a hosted recomputation surface are available at /astrolabe/verify. The full kernel and SVD methodology lives at /astrolabe/methodology.
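To illustrate what independent recomputation of a Merkle-rooted bundle involves, here is a minimal sketch using only Python's standard `hashlib`. It is not the Astrolabe verification protocol, which is documented at /astrolabe/verify. The leaf ordering (sorted artifact names), the pairing rule (duplicate the last node on odd levels, one common convention), and the artifact names and contents are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of raw bytes, as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaf_hashes: List[str]) -> str:
    """Fold hex leaf hashes pairwise into a single root (assumed pairing rule)."""
    if not leaf_hashes:
        raise ValueError("at least one leaf is required")
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def bundle_root(artifacts: Dict[str, bytes]) -> str:
    """Hash each artifact, order leaves by name, and compute the root."""
    leaves = [sha256_hex(artifacts[name]) for name in sorted(artifacts)]
    return merkle_root(leaves)


# Hypothetical artifacts standing in for the real inputs and outputs.
artifacts = {
    "grades.csv": b"agent,distinct,drift,coherence\n",
    "analysis_code.txt": b"pinned analysis code placeholder",
    "canonical_output.json": b"{}",
}

root = bundle_root(artifacts)

# Changing a single byte in any artifact changes the root.
tampered = dict(artifacts)
tampered["grades.csv"] = b"agent,distinct,drift,coherence\nX"
tampered_root = bundle_root(tampered)

print("root:    ", root)
print("tampered:", tampered_root)
print("match:   ", root == tampered_root)
```

A verifier holding the same artifacts and the same pinned code would recompute the root and compare it with the published one; any mismatch means at least one input or output differs.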
Most state Departments of Education hold discretionary innovation lines for AI literacy and AI-tool evaluation. A Backstaff pilot lands cleanly as a measurement layer beneath an existing tutor deployment — evaluation infrastructure, not a competing product.
The NSF SBIR topic line covering learning, cognition, and educational technology funds measurement infrastructure for behavioral-distinctness work. Backstaff's three-category battery and cluster-map deliverable are in scope.
The U.S. Department of Education's SBIR program funds ed-tech R&D with an evaluation requirement. Backstaff is the evaluation layer; it pairs with a tutor or grader product, not against one.
Regional and national ed-tech consortia provide a shared audit context for districts that cannot fund a pilot alone. One Backstaff audit, multiple member districts, shared attestation infrastructure, one bundle root.
Recompute the bundle directly at /astrolabe/verify. The full anonymized reference fleet — every cluster member, PC loadings, attestation transcript — lives at /cases/backstaff-28.