Education · Synthetic district vignette · Backstaff-28 reference fleet reframed for tutor and grader auditing

Twenty-eight tutors, one pedagogy axis.

A hypothetical school district stands up a fleet of 28 LoRA-tuned tutor and grader agents across math, reading, ELL, IEP-aware accommodation, science, and social-emotional learning. A curriculum office asks the question every catalog eventually surfaces: are these genuinely different teachers, or one teacher wearing twenty-eight nametags? Backstaff projects the fleet; Astrolabe attests the projection; the procurement office reads a cluster map instead of a leaderboard.

This case study is a synthetic vignette. The district is invented. The tutor specializations are illustrative. The cluster geometry, the PC1 variance, the outlier signature, and the attestation bundle root are real artifacts from the anonymized Backstaff-28 reference fleet — recontextualized for the education sector. Backstaff has zero education customers as of this writing. No real district, no real student data, no student PII even hypothetically.

Companion to the Military Analyst Fleet case study. Same bundle, same seven clusters, same outlier — different reader, different stakes, different labels. The engine does not know what sector it is auditing.

Paired tutor and grader backstaves on shared horizon

01 ·

The fleet.

28 LoRA-tuned tutor and grader agents · K–12 portfolio · three budget cycles

A mid-sized hypothetical district has accumulated 28 fine-tuned tutor and grader agents over three procurement cycles. Each agent is a LoRA-style adapter over a single foundation model, trained behind its own system prompt, paid for under its own line item, owned by a different curriculum team. The catalog grew the way every district catalog grows — one team at a time, one grant at a time, with no shared audit surface.

The roles below are representative of what a district-scale tutor fleet looks like in practice. The labels are illustrative; the underlying math is the anonymized Backstaff-28 fleet from /cases/backstaff-28.

Agent-A · Math · Algebra I scaffolding · multi-step problem decomposition, worked-example pacing

Agent-B · Math · Geometry proofs · diagram-aware reasoning, statement-reason structure

Agent-C · Reading · Elementary decoding · phonics-aware re-prompting, syllable scaffolding

Agent-D · Reading · Secondary comprehension · inference scaffolding, evidence-citation drill

Agent-E · ELL · Spanish bridge · L1-aware translation, cognate scaffolding, register control

Agent-F · ELL · Mandarin bridge · character-aware decoding, tone-marked vocabulary

Agent-G · IEP · Dyslexia accommodation · text-chunked delivery, multi-modal cues

Agent-H · IEP · ADHD pacing · short-burst scaffolding, attention-anchored re-prompts

Agent-I · Science · Biology · diagram-grounded explanation, vocabulary-staged delivery

Agent-J · Social-emotional · Frustration response · affect-aware pacing, scaffold-preservation under stress

The full anonymized roll of 28 — Agent-01 through Agent-28, cluster index per member, centroid stamp — lives at /cases/backstaff-28. The role labels above are the education recontextualization for the curriculum office reader; the underlying members are the anonymized Backstaff-28 fleet.

Each adapter was procured as a specialization. Each adapter is paid for as a specialization. The audit question is whether each adapter behaves as a specialization — whether the procurement-line story holds up against the model weights.

02 ·

The categories.

Three-category Backstaff battery · in pedagogy terms

Backstaff runs a fixed three-category battery against the fleet. Each category is graded trinary — PASS, PARTIAL, FAIL — for every agent. The categories are the same three the engine runs on every fleet; the question framings below are the education recontextualization.

Category	Pedagogy question	Pass condition
`behavioral_distinctness`	Do these tutors actually teach differently from the base model and from each other?	Win-rate vs. base model ≥ 0.50 on a synthetic probe set
`drift_from_baseline`	Have student-interaction patterns shifted from the district's last attested checkpoint, or is the tuned tutor still doing the work the procurement line said it would?	Win-rate vs. system-prompted baseline ≥ 0.50 — the adapter still adds value beyond a prompt-only equivalent
`coherence_under_task`	Does the tutor stay on-task and on-pedagogy under student frustration, repeated mistakes, and multi-turn scaffolding?	Absolute coherence-stability win-rate ≥ 0.90 across extended synthetic dialogue
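The pass conditions in the table can be written down as a small lookup. This is a minimal sketch, not the engine's code: the thresholds and category keys are copied from the table, while the helper name `meets_pass_condition` is hypothetical. The table does not say where PARTIAL ends and FAIL begins, so the sketch only answers the PASS question.

```python
# Pass thresholds copied from the category table; the function name is a
# hypothetical helper, not part of any published Backstaff interface.
PASS_THRESHOLDS = {
    "behavioral_distinctness": 0.50,  # win-rate vs. base model
    "drift_from_baseline": 0.50,      # win-rate vs. system-prompted baseline
    "coherence_under_task": 0.90,     # absolute coherence-stability win-rate
}


def meets_pass_condition(category: str, win_rate: float) -> bool:
    """Return True when a measured win-rate clears the category's pass bar."""
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError(f"win-rate must be in [0, 1], got {win_rate}")
    return win_rate >= PASS_THRESHOLDS[category]
```

A win-rate of 0.89 on `coherence_under_task` would not pass under this reading, while 0.50 on either of the other two categories would.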

Three categories, trinary grades. A fleet of 28 reduces to a 28×3 grade matrix. Astrolabe takes that matrix, centers it, computes a singular value decomposition, and returns five per-agent coordinates (norm, stable rank, SV entropy, effective rank, cosine to centroid) plus a cluster map of populated grade vectors. The full kernel is documented at /astrolabe/methodology.

The categories are not opinions. They are the three orthogonal questions a procurement officer always asks: distinctness, drift, durability. Backstaff names them once; every audit answers them the same way.
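The reduction from trinary grades to a centered matrix can be sketched in a few lines. The numeric encoding (PASS as 1.0, PARTIAL as 0.5, FAIL as 0.0) is inferred from the grade vectors in the §04 cluster table; the helper names and the three sample agents are hypothetical and stand in for the 28-row fleet.

```python
# Numeric encoding inferred from the grade vectors in the §04 cluster table;
# helper names and the three sample agents are hypothetical.
GRADE_VALUE = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}
CATEGORIES = (
    "behavioral_distinctness",
    "drift_from_baseline",
    "coherence_under_task",
)


def encode(grades: dict) -> list:
    """Turn one agent's trinary grades into a numeric row, in category order."""
    return [GRADE_VALUE[grades[c]] for c in CATEGORIES]


def center(matrix: list) -> tuple:
    """Subtract each column mean so the fleet centroid sits at the origin."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(CATEGORIES))]
    centered = [[x - m for x, m in zip(row, means)] for row in matrix]
    return centered, means


sample_grades = [
    {"behavioral_distinctness": "PASS", "drift_from_baseline": "PASS", "coherence_under_task": "PASS"},
    {"behavioral_distinctness": "PASS", "drift_from_baseline": "PASS", "coherence_under_task": "PARTIAL"},
    {"behavioral_distinctness": "FAIL", "drift_from_baseline": "FAIL", "coherence_under_task": "FAIL"},
]
matrix = [encode(g) for g in sample_grades]
centered, centroid = center(matrix)
```

Centering matters because the decomposition then describes how agents differ from the fleet average rather than how large their grades are in absolute terms.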

03 ·

Privacy-first by construction.

No student PII · the audit cannot see classrooms · a design property, not a policy promise

Education sector audits live or die on one question: what does the auditor touch? Backstaff for Education does not ingest student data. The audit runs against the tutor agents' model weights (or hosted-API behavioral access) plus a synthetic probe set authored by the engagement team. It never reads, never stores, and never models a real student's conversation. No transcripts. No grades. No rosters. No IEPs. No demographics.

The instrument measures the tutor's behavioral signature in isolation — the same way the historical backstaff measured altitude by reading the shadow on the instrument's own scale, not by looking at the sun. The student does not enter the measurement. Privacy is not a feature added to the audit; it is the audit's geometry.

FERPA: No education records are accessed, processed, or stored. The audit is outside the regulation's scope by construction; there is no covered record to handle.

COPPA: No data is collected from children. The probe set is synthetic and authored under engagement; no minor's input is ever touched.

State DoE privacy frameworks: Audit artifacts contain no PII and no derived classroom signal. Compatible with state-by-state student data privacy laws on a no-collection basis (CA SOPIPA, NY Ed Law 2-d, IL SOPPA, and parallel statutes).

IDEA / Section 504: IEP-tuned adapters can be audited without touching any IEP record. The audit reads the adapter's pedagogical posture, not the student it was tuned for.

Audit surface: Model weights (or hosted-API behavioral access) + synthetic probe set. That is the entire input to the audit.

Audit artifact: Behavioral signatures, cluster geometry, attestation bundle. No student-identifiable content can be reconstructed from any artifact in the bundle.

If the district cannot show student data to a vendor, Backstaff is the audit that does not need to ask. Privacy-preserving is the feature, not the constraint.

04 ·

What Astrolabe resolved.

7 distinct profiles · 21 dedupe candidates · 1 catastrophic outlier at PC1 = +3.86σ

Astrolabe runs the centered 28×3 grade matrix through SVD and returns a cluster map. Out of 27 possible grade vectors on a three-category trinary scale (3³ = 27), exactly seven are populated. Keeping one representative per populated vector leaves 21 of the 28 agents as duplicates of a vector another agent already covers. One agent fails every axis. The numbers are pulled directly from the anonymized Backstaff-28 bundle (root 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a) and reframed here for the curriculum office.

Cluster	Grade vector `[distinct, drift, coherence]`	n	Education reading
01	`[1.0, 1.0, 1.0]`	17	Saturated baseline · interchangeable across measured probes — likely the core-content drill tutors
02	`[1.0, 1.0, 0.5]`	5	Content-strong, register-weak — word-problem and comprehension coaches that lose pedagogical voice under load
03	`[1.0, 0.5, 0.5]`	2	Distinct-from-base, weak vs. system-prompted — early-reading decoding tutors
04	`[0.5, 1.0, 0.5]`	1	System-prompt inverter — the ELL bridge that adds beyond what a prompt-only baseline produces
05	`[0.5, 0.5, 0.5]`	1	Mid-band generalist — IEP re-explanation, partial across the board, dominates nothing
06	`[0.5, 0.0, 0.0]`	1	Near-failure — frustration-pacing stub that loses coherence the moment student affect spikes
07	`[0.0, 0.0, 0.0]`	1	Agent-11 · catastrophic — the tutor that fails coherence under student frustration; abandons the scaffold, hands the answer, breaks the learning loop
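The headline counts follow directly from the cluster table. The sketch below rebuilds the fleet from the table's vectors and member counts, then tallies populated profiles and dedupe candidates. It assumes a dedupe candidate is any agent beyond one representative per populated vector, which is the reading that reproduces the 21 quoted here; the variable names are illustrative.

```python
from collections import Counter

# (grade vector, member count) pairs copied from the cluster table.
CLUSTERS = [
    ((1.0, 1.0, 1.0), 17),
    ((1.0, 1.0, 0.5), 5),
    ((1.0, 0.5, 0.5), 2),
    ((0.5, 1.0, 0.5), 1),
    ((0.5, 0.5, 0.5), 1),
    ((0.5, 0.0, 0.0), 1),
    ((0.0, 0.0, 0.0), 1),
]

# Expand to one row per agent, then count agents per distinct vector.
fleet = [vec for vec, n in CLUSTERS for _ in range(n)]
profiles = Counter(fleet)

possible_vectors = 3 ** 3                       # trinary grade, three categories
populated = len(profiles)                       # distinct profiles in use
dedupe_candidates = len(fleet) - populated      # assumption: keep one per profile
saturated_share = profiles[(1.0, 1.0, 1.0)] / len(fleet)

print(len(fleet), possible_vectors, populated, dedupe_candidates)
print(f"{saturated_share:.0%} of the fleet sits at [1.0, 1.0, 1.0]")
```

Under these assumptions the tallies come out to 28 agents, 27 possible vectors, 7 populated profiles, and 21 dedupe candidates, with roughly 61% of the fleet in the saturated bucket.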

Agent-11 in the education frame

The catastrophic outlier, anonymized as Agent-11 in the Backstaff-28 reference, sits at [0.0, 0.0, 0.0] on the grade matrix and projects onto PC1 at +3.86σ, nearly four standard deviations from the fleet centroid. In the education reframe, this is the tutor that fails the coherence axis under student frustration. When a learner signals struggle (when the synthetic probe simulates repeated wrong answers, confusion, or emotional escalation), the adapter abandons the pedagogical scaffold, delivers the answer, and breaks the learning contract. Every district administrator recognizes the failure mode on sight.

The PC1 / PC2 geometry

PC1 captures 83.6% of the inter-agent variance, single-signed, roughly equal loading across the three categories. The dominant axis of variation in this fleet is not content-specific — it is pedagogical capability vs. not. PC2 (10.0%) is a trade-off axis: tutors that win on content distinctness at the cost of register coherence sit at one end; the inverse at the other. Together: 93.6% of the fleet's variation is captured in two numbers per agent.

The audit names the saturation (17 of 28 in one bucket), names the dedupe candidates (21 of 28 share a vector), names the outlier (Agent-11, at the geometric edge), and names the trade-off (PC2: content vs. register). The curriculum office reads a map, not a leaderboard.
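The variance shares can be checked from the cluster table alone. The sketch below rebuilds the 28×3 matrix, centers it, and extracts the top two eigenvalues of the scatter matrix by power iteration with deflation, a standard-library stand-in for a full SVD (the eigenvalues of XᵀX are the squared singular values of X). It is not the Astrolabe kernel, and it does not attempt the σ scaling behind the outlier figure, which the text here does not specify.

```python
import math

# (grade vector, member count) pairs copied from the cluster table.
CLUSTERS = [
    ((1.0, 1.0, 1.0), 17), ((1.0, 1.0, 0.5), 5), ((1.0, 0.5, 0.5), 2),
    ((0.5, 1.0, 0.5), 1), ((0.5, 0.5, 0.5), 1), ((0.5, 0.0, 0.0), 1),
    ((0.0, 0.0, 0.0), 1),
]
rows = [list(vec) for vec, n in CLUSTERS for _ in range(n)]
n, d = len(rows), 3

# Center each column on the fleet mean.
means = [sum(r[j] for r in rows) / n for j in range(d)]
centered = [[r[j] - means[j] for j in range(d)] for r in rows]

# 3x3 scatter matrix X^T X of the centered grades.
scatter = [[sum(r[i] * r[j] for r in centered) for j in range(d)] for i in range(d)]


def top_eigenpair(m, iters=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    size = len(m)
    v = [1.0] * size
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(size)) for i in range(size)]
    return sum(a * b for a, b in zip(v, mv)), v


def deflate(m, lam, v):
    """Remove an extracted component so the next power iteration finds the following one."""
    size = len(m)
    return [[m[i][j] - lam * v[i] * v[j] for j in range(size)] for i in range(size)]


total = sum(scatter[i][i] for i in range(d))
lam1, pc1 = top_eigenpair(scatter)
lam2, pc2 = top_eigenpair(deflate(scatter, lam1, pc1))
pc1_share, pc2_share = lam1 / total, lam2 / total

print(f"PC1 {pc1_share:.1%}, PC2 {pc2_share:.1%}")
print("PC1 loadings:", [round(x, 2) for x in pc1])
```

Run against the table's counts, this lands on the same PC1 and PC2 shares quoted above, with PC1 loadings that share one sign across all three categories.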

05 ·

Procurement and equity implications.

21 of 28 consolidate · 1 retrain · the cluster map surfaces what a leaderboard hides

The procurement reading is direct. The catalog of 28 represents seven actual pedagogical workloads. Twenty-one of the line items are paid duplicates — agents that share a grade vector with another agent and sit within the noise floor of that agent on all three measured axes. One adapter, Agent-11, is a catastrophic failure on every category and requires re-training or removal before the next renewal cycle.

Procurement question	Reference answer	Disposition
Which tutors are pedagogically distinct?	7 distinct profiles across 28 agents	Maintain · document the specialization that the cluster math defends
How many are dedupe candidates?	21 of 28 duplicate a profile another agent already covers	Consolidate · sunset duplicates · redirect line items
Which categories no longer discriminate?	61% saturate at `[1.0, 1.0, 1.0]`	Harder probes · affect-loaded synthetic dialogue · adversarial scaffolding
Which adapter is the catastrophic failure?	Agent-11 · `[0,0,0]` · PC1 +3.86σ	Quarantine · re-train under coherence-under-frustration probes, or sunset
Where does the district invest next?	PC2 trade-off axis · content-vs-register	Harden coherence under student affect; probe the trade-off explicitly

Equity surfaces — surfaced, not adjudicated

The cluster map does a second thing the leaderboard cannot. It shows which adapters diverge in ways that may not be defensible. If the IEP-tuned reading adapter and the ELL-tuned reading adapter scaffold identically, the district has a consolidation question. If the IEP-tuned adapter scaffolds at a meaningfully lower content complexity than its baseline sibling on the same material, the district has a different question — one that touches accommodation policy, equity audit, and counsel review. Backstaff names the divergence with a number; the district decides whether the divergence is appropriate accommodation or inappropriate bias.

The instrument identifies the distinctness. Mitigation is the district's call — made with curriculum staff, equity officers, and counsel. The audit gives them a measured object to argue from instead of an impression.

06 ·

The attestation bundle.

Independently recomputable · tamper-evident · single-byte detection

The numbers in §04 and §05 are not claims. They are projections of a sha-pinned, Merkle-rooted attestation bundle issued against the anonymized Backstaff-28 reference fleet. Any third party in possession of the same input artifacts and the same pinned analysis code can recompute every byte of the canonical output and verify the bundle root independently.

Reference fleet: Backstaff-28 · v0.2 anonymized batch · /cases/backstaff-28

Fleet sha256: a91516d3e14835d21c0a7f32eac9d591b265a4139bd06863c96d31e8ecb6e5ca

Attestation root: 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a · immutable · canonically stamped at issue

Engine version: planisphere 0.2.0 at issue · engine since renamed to Astrolabe · bundle remains stamped

Format ingested: csv · existing loader · no engine modifications

NIST AI RMF mapping: Govern · Measure · Manage map embedded in bundle manifest

Determinism property: Bit-identical canonical artifacts across runs for identical inputs and pinned code

Tamper detection: Single-byte mutation defeats verification

Runtime: < 1 second for N = 28 on a contractor laptop

psp › verify <backstaff-28-bundle>
[ok] resolving subjects ······························· 28
[ok] re-running SVD ····································· ✓
[ok] cluster map matches ································ 7 profiles
[ok] outlier matches ···································· Agent-11 · PC1 +3.86σ
[ok] attestation root matches ··························· 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a
{ "verified": true, "root_match": true, "artifact_mismatches": [] }
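The tamper-evidence property can be illustrated with a generic Merkle root over artifact bytes. This is a sketch under stated assumptions, not the bundle's actual scheme: the leaf and node prefixes, the duplicate-last-node rule for odd levels, and the placeholder artifact contents are all hypothetical, since the bundle layout is not described here.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> str:
    """Hex Merkle root over raw artifact bytes (generic construction, assumed)."""
    if not leaves:
        raise ValueError("at least one artifact is required")
    # Prefix bytes keep leaf hashes and interior-node hashes in separate domains.
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Placeholder artifact contents; real bundles would hash the canonical files.
artifacts = [b"grade matrix bytes", b"cluster map bytes", b"manifest bytes"]
issued_root = merkle_root(artifacts)

# Flip a single byte in one artifact and recompute.
tampered = list(artifacts)
tampered[1] = tampered[1][:-1] + b"X"

print("recomputed root matches:", merkle_root(artifacts) == issued_root)
print("tampered root matches:  ", merkle_root(tampered) == issued_root)
```

Because every interior hash depends on its children, changing one byte in any artifact changes the root, which is why a verifier only needs to compare a single 64-character value.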

The verification protocol and a hosted recomputation surface are available at /astrolabe/verify. The full kernel and SVD methodology live at /astrolabe/methodology.

07 ·

Engagement vehicles.

Where a district or state pilot fits in the shape of education funding

State DoE innovation grants

Most state Departments of Education hold discretionary innovation lines for AI literacy and AI-tool evaluation. A Backstaff pilot lands cleanly as a measurement layer beneath an existing tutor deployment — evaluation infrastructure, not a competing product.

NSF SBIR — Learning & Cognition

The NSF SBIR topic line covering learning, cognition, and educational technology funds measurement infrastructure for behavioral-distinctness work. Backstaff's three-category battery and cluster-map deliverable are in scope.

ED-SBIR

The U.S. Department of Education's SBIR program funds ed-tech R&D with an evaluation requirement. Backstaff is the evaluation layer; it pairs with a tutor or grader product, not against one.

Ed-tech consortium membership

Regional and national ed-tech consortia provide a shared audit context for districts that cannot fund a pilot alone. One Backstaff audit, multiple member districts, shared attestation infrastructure, one bundle root.

08 ·

Next.

Sector offer · companion case · the unanonymized math

Offer · Education

Backstaff for Education.

The sector offer in full. District pilot scope, probe set authoring, attestation timeline, parental-trust posture.

Case · Military

Analyst fleet auditing.

The companion case study. Same bundle, same seven clusters, same outlier — read as an Air Force cyber analyst portfolio.

Engine · Math

Astrolabe methodology.

The kernel beneath every Backstaff audit. SVD over centered grade matrix, five-coordinate projection, cluster map, attestation shape.

Recompute the bundle directly at /astrolabe/verify. The full anonymized reference fleet — every cluster member, PC loadings, attestation transcript — lives at /cases/backstaff-28.