Backstaff · Education vertical

Audit your tutor and grader fleet.

The historical backstaff read the sun by its shadow, not by direct gaze. Backstaff for Education reads tutor behavior by what students experience — the outputs the fleet casts — not by reading prompts or instrumenting classrooms. The audit runs against model weights and a synthetic probe set; no student record is touched.

Schedule a district pilot · Read the case study

[Image: paired tutor and grader backstaves on a shared horizon]

01 · The fleet problem.

Thirty tutors, or one tutor wearing thirty nametags?

A district stands up a math tutor agent. Then a reading tutor. Then an ELL tutor. Then an IEP-aware variant for each. Each one is LoRA-tuned on different data, behind a different system prompt, procured under a different line item. Within a year the catalog has thirty entries and no one in the curriculum office can answer the procurement question: are these genuinely different teachers, or one teacher wearing thirty nametags?

There is no standardized way to audit whether a tuned tutor is pedagogically distinct from its siblings, or whether its behavior has drifted from the district baseline since the last refresh. Vendors ship benchmark scores; benchmarks do not measure whether the tutor's style of explanation actually differs from the other tutor the district already pays for.

Backstaff answers the consolidation question with a number. Each agent in the fleet gets a behavioral signature; the cluster map shows which signatures collapse into the same profile and which stand alone. The output is evidence the procurement officer can cite.

02 · What Backstaff measures.

Three categories · in pedagogy terms

Category	What it asks	How the fleet is read
Behavioral distinctness	Does this tutor actually teach differently from the others in the fleet?	Spectral fingerprint over response style, scaffolding pattern, error-correction strategy, and worked-example structure on a fixed synthetic probe set.
Drift from baseline	Has this tutor's behavior moved since the district's last attested checkpoint?	Signature delta against the prior Merkle-rooted attestation. Drift is reported as a vector, not a verdict.
Coherence under interaction	Does the tutor hold a stable pedagogy across a multi-turn session, or does it degrade?	Signature stability across extended synthetic dialogues. A tutor that contradicts its earlier scaffolding fails this axis.
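To make "a vector, not a verdict" concrete, here is a minimal sketch of a signature delta. It assumes a signature can be flattened into a list of numbers, one per measured feature; the four-component layout and the values are hypothetical, and the page does not specify how Backstaff encodes signatures internally.

```python
import math

def drift_vector(baseline, current):
    """Component-wise difference between two equal-length signatures."""
    if len(baseline) != len(current):
        raise ValueError("signatures must have the same length")
    return [c - b for b, c in zip(baseline, current)]

def drift_magnitude(delta):
    """Euclidean length of the drift vector (one possible summary number)."""
    return math.sqrt(sum(d * d for d in delta))

# Hypothetical components: response style, scaffolding pattern,
# error-correction strategy, worked-example structure.
baseline = [0.5, 0.25, 0.75, 0.5]
current = [0.5, 0.5, 0.75, 0.25]

delta = drift_vector(baseline, current)
print(delta)  # [0.0, 0.25, 0.0, -0.25]
print(drift_magnitude(delta))
```

The vector keeps the direction of change visible: here scaffolding moved up and worked-example structure moved down, which a single magnitude would hide.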

Grading: PASS / PARTIAL / FAIL per category, per agent

Deliverable: Per-agent signature, cluster map, outlier report, Astrolabe attestation bundle

Attestation format: SHA-pinned, Merkle-rooted, NIST AI RMF–mapped; same shape as Astrolabe

Re-measurement: On-demand or scheduled; deltas cite the prior root
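For readers unfamiliar with the term, a Merkle root is a single hash that commits to every item in a bundle. The sketch below shows the general technique using SHA-256 from the Python standard library. The leaf encoding, the `merkle_root` helper, and the rule of duplicating the last node on odd-sized levels are illustrative assumptions, not the Astrolabe bundle format.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> str:
    """Hash each leaf, then hash pairs upward until one node remains."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: duplicate the last node
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()

# Hypothetical leaves: one serialized signature per agent.
leaves = [b"agent-01:signature", b"agent-02:signature", b"agent-03:signature"]
root = merkle_root(leaves)
print(root)
```

Changing any one leaf changes the root, which is why a later re-measurement can cite the prior root as a fixed reference point.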

03 · Privacy-first by construction.

No student PII · ever · the audit cannot see classrooms

Backstaff for Education does not ingest student data. The audit runs against the tutor agents' model weights and a synthetic probe set authored by the engagement team. It never reads, never stores, and never models a real student's conversation.

This is a design property, not a policy promise. The instrument measures the tutor's behavioral signature in isolation — the same way the historical backstaff measured altitude by the shadow on the instrument's own scale, not by looking at the sun. The student does not enter the measurement.

FERPA: No education records are accessed, processed, or stored. Outside the scope of the regulation by construction.

COPPA: No data is collected from children. The probe set is synthetic and authored under engagement.

State DoE privacy frameworks: Audit artifacts contain no PII and no derived classroom signal. Compatible with state-by-state student data privacy laws on a no-collection basis.

Audit surface: Model weights (or hosted-API behavioral access) + synthetic probe set. That is the entire input.

Audit artifact: Behavioral signatures, cluster geometry, attestation bundle. No student-identifiable content can be reconstructed from the output.

If the district cannot show student data to a vendor, Backstaff is the audit that does not need to ask.

04 · A synthetic vignette.

Hypothetical district · 28 LoRA-tuned tutors · the reference fleet recontextualized

The following is a hypothetical applied to the actual Backstaff-28 reference fleet — same math, education framing. The district itself is invented. The cluster geometry is not.

A mid-sized district has procured 28 LoRA-tuned tutor agents over three budget cycles: nine for mathematics across grade bands, seven for reading, six for English-language learners, six tuned for IEP-aware accommodation. The curriculum office requests a Backstaff audit before the next renewal.

Audit finding	Count	Disposition
Pedagogically distinct profiles	7	Keep. Each occupies its own region of the cluster map.
Consolidation candidates	21	Collapse into the 7 distinct profiles. Each candidate is within the noise floor of an existing tutor on all three axes.
Catastrophic outlier	1 (Agent-11)	Fails coherence axis at `-3.86σ`. Recommend retraining or removal before renewal.

The procurement officer now has a defensible answer: the catalog of 28 represents 7 actual teachers and 21 paid duplicates. One of those 28, Agent-11, is also flagged because its behavior degrades under multi-turn use. The attestation bundle root is 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a; the full case math lives at /cases/backstaff-28.
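The consolidation count follows from a simple rule: an agent is a duplicate when it sits within the noise floor of an existing profile on every axis. The sketch below illustrates that rule on a toy five-agent fleet, plus a z-score check of the kind that could flag a coherence outlier. The agent names, scores, noise floor, greedy grouping, and z-score definition are all hypothetical; the actual Backstaff-28 math lives in the case study.

```python
from statistics import mean, pstdev

def within_noise_floor(a, b, floor):
    """True when two signatures differ by at most `floor` on every axis."""
    return all(abs(x - y) <= floor for x, y in zip(a, b))

def consolidate(signatures, floor):
    """Greedy grouping: attach each agent to the first profile it matches."""
    profiles = {}  # representative agent -> members of that profile
    for agent, sig in signatures.items():
        for rep in profiles:
            if within_noise_floor(sig, signatures[rep], floor):
                profiles[rep].append(agent)
                break
        else:
            profiles[agent] = [agent]  # no match: a distinct profile
    return profiles

def z_scores(scores):
    """Standard deviations from the fleet mean, per agent."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {agent: (s - mu) / sigma for agent, s in scores.items()}

# Hypothetical three-axis scores: (distinctness, drift, coherence).
signatures = {
    "math-a": (0.10, 0.20, 0.90),
    "math-b": (0.11, 0.21, 0.89),
    "read-a": (0.60, 0.70, 0.85),
    "read-b": (0.61, 0.69, 0.86),
    "ell-a": (0.30, 0.95, 0.40),
}

profiles = consolidate(signatures, floor=0.05)
print(len(profiles), "distinct profiles,", len(signatures) - len(profiles), "duplicates")
# 3 distinct profiles, 2 duplicates

z = z_scores({agent: sig[2] for agent, sig in signatures.items()})
print(min(z, key=z.get))  # ell-a
```

In the vignette the same bookkeeping gives 28 agents, 7 profiles, and 28 - 7 = 21 consolidation candidates.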

05 · Fairness · surfaced, not resolved.

The instrument reveals; mitigation is the district's call

The cluster map does two things at once. It shows which tutors converge — the consolidation signal — and it shows which tutors diverge in ways that may not be defensible. If the IEP-tuned reading tutor and the ELL-tuned reading tutor scaffold identically, the district has a consolidation question. If the IEP-tuned tutor scaffolds at a meaningfully lower complexity than its baseline sibling on the same content, the district has a different question, one with a name.

Backstaff identifies the distinctness. It does not adjudicate whether a given divergence is appropriate accommodation or inappropriate bias. That call is the district's, made with curriculum staff, equity officers, and counsel. The audit gives them a measured object to argue from instead of an impression.

The instrument names the gap. The customer names the response.

06 · Engagement vehicles.

Where a pilot fits in district and state funding structures

State DoE innovation grants

Many state Departments of Education hold discretionary innovation budget lines for AI literacy and AI-tool evaluation. A Backstaff pilot fits cleanly as a measurement layer beneath an existing tutor deployment.

NSF SBIR — Learning & Cognition

The NSF SBIR topic line covering learning, cognition, and educational technology funds measurement infrastructure, which puts Backstaff's behavioral-distinctness work in scope.

ED-SBIR

The U.S. Department of Education's SBIR program funds ed-tech R&D with an evaluation requirement. Backstaff is the evaluation layer; it pairs with a tutor or grader product rather than competing against one.

Ed-tech consortium membership

Regional and national ed-tech consortia provide a shared audit context for districts that cannot fund a pilot alone. One Backstaff audit, multiple member districts, shared attestation infrastructure.

07 · Schedule a district pilot.

One conversation · scope · probe set · attestation timeline

A district pilot is a single engagement that returns one attestation bundle for the fleet as it stands today. The conversation covers fleet scope, hosting topology, the synthetic probe set, and the attestation timeline. No student data changes hands at any stage.
