Backstaff · Military vertical

Audit your military AI workforce.

The historical backstaff measured the sun by its shadow — the navigator turned their back to the source and read altitude from what the instrument cast. The product reads the same way. It audits an AI-augmented battlestaff by what the fleet casts — outputs, decisions, behavioral signatures — and returns a sha-pinned, Merkle-rooted attestation bundle a program office can cite. The measurement is the deliverable.


01 ·

The fleet problem.

Battlestaff · Intel cells · Mission planning · Decision-support

AI-augmented analyst fleets are proliferating inside the services. A 16th AF cyber unit stands up LLM assistants for threat triage. AFRL/RY pilots sensor-fusion copilots. An AFLCMC program office inherits a stack of LoRA fine-tunes shipped by three different contractors. AFWERX Phase II vendors deliver "specialized" analyst agents per task. None of these are audited against each other on a common behavioral instrument.

The program manager's question is concrete: are these 30 analyst assistants actually distinct, or am I paying for the same agent thirty ways? Today there is no procurement-grade answer. Vendor self-reports are not evidence. Eval leaderboards measure generic capability, not whether Agent-07 and Agent-19 diverge in the field on the workload they were tuned for. Backstaff is the instrument that closes that gap.

The procurement question is not "does this model work?" It is "how do I know what I just bought is different from what I already own?" Backstaff answers that, in writing, with a verifiable bundle.

02 ·

What Backstaff measures.

Three-category battery · Trinary grading · Per-agent signature

One battery, three categories, evaluated per agent in the deployed fleet. Each category returns a trinary grade and a numerical signature. The categories are framed for defense workloads, but the underlying instrument is the same one that audits the rest of the Backstaff customer base.

Category	What it asks	Defense framing
Behavioral distinctness	Do agents diverge on the workload they were fine-tuned for, or are they convergent under different wrappers?	Two LoRA tunes shipped by two vendors should produce measurably different outputs on the same threat-intel queue. If they do not, one of them is redundant.
Drift from baseline	Has the agent's behavioral signature moved since its last attestation?	Re-training, prompt updates, and base-model swaps all move the signature. Drift detection lets a program office re-authorize on evidence, not on calendar.
Coherence under task	Does the agent stay on-distribution under the task profile it was contracted for, or does it collapse to a generic completion?	Catastrophic outliers — agents that fail coherently in unsafe directions — are surfaced as their own report, not buried in an average.
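
As a rough illustration of how a drift-from-baseline check could produce a trinary grade, the sketch below compares two 5-D signatures by Euclidean distance. The function name grade_drift, the choice of distance metric, and both thresholds are assumptions made for this example; they are not Backstaff's published grading rule.

```python
import math

def grade_drift(baseline, current, partial_at=0.5, fail_at=1.5):
    """Grade drift between two signatures.

    partial_at and fail_at are hypothetical thresholds chosen for illustration.
    """
    if len(baseline) != len(current):
        raise ValueError("signatures must have the same dimension")
    distance = math.dist(baseline, current)  # Euclidean distance
    if distance < partial_at:
        grade = "PASS"
    elif distance < fail_at:
        grade = "PARTIAL"
    else:
        grade = "FAIL"
    return grade, distance

# Hypothetical signatures: only the fifth component moved after re-training.
baseline = [0.10, -0.40, 0.75, 0.00, 0.30]
retrained = [0.10, -0.40, 0.75, 0.00, 1.30]
grade, distance = grade_drift(baseline, retrained)
print(grade, round(distance, 2))
```

With these made-up numbers the signature moved by a distance of 1.0, which falls between the two assumed thresholds and grades as PARTIAL.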

Battery: Three categories, same battery across the fleet

Grades: PASS / PARTIAL / FAIL per category, trinary

Per-agent signature: 5-D spectral fingerprint, exportable for downstream cluster analysis

Attestation: SHA-pinned, Merkle-rooted bundle (same format as Astrolabe)

Verification: verify <bundle> runs on customer infrastructure, no phone-home

Re-measurement: On-demand or scheduled; root rotates, prior root remains cited
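
To make "SHA-pinned, Merkle-rooted" concrete, here is a minimal sketch of how a bundle root could be computed and re-checked offline. The artifact names, the leaf encoding, and the convention of duplicating the last node on odd levels are all assumptions for illustration; the actual Backstaff bundle format and the internals of its verify command are not described on this page.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root (hex string)."""
    if not leaves:
        raise ValueError("bundle has no artifacts")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed convention: duplicate the odd node
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()

def bundle_root(artifacts):
    """artifacts maps a file name to its bytes; leaves are ordered by name."""
    leaves = [sha256(name.encode() + b"\x00" + artifacts[name])
              for name in sorted(artifacts)]
    return merkle_root(leaves)

def verify(artifacts, cited_root):
    """Recompute the root locally and compare it with the cited one."""
    return bundle_root(artifacts) == cited_root

# Hypothetical bundle contents.
bundle = {
    "signatures.json": b'{"Agent-01": [0.1, 0.2, 0.3, 0.4, 0.5]}',
    "cluster_map.json": b'{"Agent-01": 0}',
    "outlier_report.txt": b"Agent-11: coherence-under-task FAIL",
}
root = bundle_root(bundle)
tampered = dict(bundle, **{"outlier_report.txt": b"Agent-11: PASS"})
print(verify(bundle, root), verify(tampered, root))
```

Because every artifact hash feeds the root, changing any single file changes the root, which is what lets a prior root serve as a citation for exactly one set of measurements.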

03 ·

Synthetic vignette.

A hypothetical 16th AF cyber unit · 28 LLM-augmented threat-intel analysts

The following is an explicitly synthetic vignette. No customer is named. The fleet shape and numbers are recontextualized from Backstaff's anonymized reference fleet (root 408a536dc8b1e29f4f5d2a07e6b3c41928e7a9d05f6c8b3e2a1d97b0ab964e9a), which is documented in full at the Backstaff-28 case study.

A hypothetical 16th AF cyber unit deploys 28 LLM-augmented threat-intel analyst agents — Agent-01 through Agent-28. Each was fine-tuned by one of three contractors against a different analyst job aid. The program office wants to know what it actually bought before re-competing the follow-on.

7 distinct profiles

Behavioral clustering on the 5-D signature collapses the fleet to seven distinct profiles. Inside each cluster, agents behave within measurement noise of each other on the contracted workload.

21 dedupe candidates

21 agents are dedupe candidates against the 7 cluster centroids. They are not "wrong" — they are functionally equivalent to an agent the unit already pays for. The program office now has an evidence basis to consolidate.
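
The dedupe arithmetic can be sketched with a very simple clustering pass. The example below uses greedy "leader" clustering with a fixed radius, which is a stand-in technique chosen for brevity and not necessarily the clustering Backstaff runs; the toy signatures, the radius, and the function names are hypothetical.

```python
import math

def leader_clusters(signatures, radius):
    """Assign each agent to the first cluster leader within `radius`.

    signatures maps an agent name to its signature vector.
    Returns a dict mapping each leader to the list of its members.
    """
    clusters = {}
    for agent in sorted(signatures):
        for leader in clusters:
            if math.dist(signatures[agent], signatures[leader]) <= radius:
                clusters[leader].append(agent)
                break
        else:
            clusters[agent] = [agent]  # no leader is close enough: new profile
    return clusters

def dedupe_candidates(clusters):
    """Every member that is not its cluster's leader is a dedupe candidate."""
    return sorted(a for leader, members in clusters.items()
                  for a in members if a != leader)

# Toy six-agent fleet with three underlying profiles.
fleet = {
    "Agent-01": [0.00, 0.00, 0.0, 0.0, 0.0],
    "Agent-02": [0.05, 0.00, 0.0, 0.0, 0.0],
    "Agent-03": [2.00, 2.00, 0.0, 0.0, 0.0],
    "Agent-04": [2.00, 2.05, 0.0, 0.0, 0.0],
    "Agent-05": [0.00, 0.05, 0.0, 0.0, 0.0],
    "Agent-06": [-3.00, 0.00, 1.0, 0.0, 0.0],
}
clusters = leader_clusters(fleet, radius=0.2)
print(len(clusters), dedupe_candidates(clusters))
```

Six agents collapsing to three profiles leaves three candidates; the same subtraction in the vignette is 28 agents minus 7 profiles, giving 21.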

1 catastrophic outlier

Agent-11 projects at PC1 = +3.86σ from the fleet centroid. Its coherence-under-task grade is FAIL. The outlier report names it, isolates the failure mode, and recommends targeted re-training before re-authorization.
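
A figure such as "PC1 = +3.86σ" can be read as a standardized projection onto the fleet's first principal component. The sketch below shows one way to compute such a score in plain Python, assuming PC1 is the dominant eigenvector of the signature covariance matrix and σ is the standard deviation of the PC1 projections. The synthetic fleet, the 3σ flagging threshold, and the function name are illustrative assumptions, and the sign of a principal component is arbitrary, so the example flags on absolute value.

```python
import math

def pc1_zscores(signatures, iterations=200):
    """Return {agent: z-score of its projection onto the first principal component}."""
    agents = sorted(signatures)
    n = len(agents)
    dim = len(signatures[agents[0]])
    mean = [sum(signatures[a][k] for a in agents) / n for k in range(dim)]
    centered = [[signatures[a][k] - mean[k] for k in range(dim)] for a in agents]
    # Population covariance matrix of the centered signatures.
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(dim)]
           for i in range(dim)]
    # Power iteration converges to the dominant eigenvector (PC1), provided
    # the starting vector is not orthogonal to it.
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(row[k] * v[k] for k in range(dim)) for row in centered]
    sigma = math.sqrt(sum(s * s for s in scores) / n)  # scores have zero mean
    return {a: s / sigma for a, s in zip(agents, scores)}

# Synthetic 28-agent fleet with one planted outlier (not the reference fleet).
fleet = {
    f"Agent-{i:02d}": [0.1 * (i % 5 - 2), 0.1 * (i % 3 - 1), 0.0, 0.0, 0.0]
    for i in range(1, 29)
}
fleet["Agent-11"] = [3.0, 0.0, 0.0, 0.0, 0.0]

zs = pc1_zscores(fleet)
flagged = [a for a, z in zs.items() if abs(z) > 3.0]  # assumed 3-sigma threshold
print(flagged)
```

Reporting the outlier separately matters because a single agent this far out barely moves a fleet-wide average, yet dominates the first principal component.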

The deliverable to the program office is a single bundle: per-agent signatures, the cluster map, the outlier report, and the attestation root. Two months later the contractor ships a re-trained Agent-11. The unit re-runs Backstaff against the prior root and gets a citable answer to "did the fix actually move the behavior?" The math for the worked example lives in the case study.

04 ·

NIST AI RMF and DoD AI assurance.

Mapping · Packaging · Authorization posture

Backstaff outputs are mapped to the NIST AI Risk Management Framework's MEASURE function and to the DoD Responsible AI Strategy & Implementation Pathway. The bundle is structured so a program office can drop it into an authorization package without re-formatting.

NIST AI RMF · MEASURE 1.1: Per-agent behavioral signatures and category grades populate the "appropriate methods and metrics" identifier.

NIST AI RMF · MEASURE 2.5: Drift-from-baseline grade satisfies the "AI system is monitored on an ongoing basis" evidentiary requirement at attestation cadence.

NIST AI RMF · MEASURE 2.7: Catastrophic-outlier report populates the "AI system security and resilience" evidence on coherence failure modes.

DoD RAI S&IP: Bundle structure aligns to the five DoD AI ethical principles (Responsible · Equitable · Traceable · Reliable · Governable); Traceable and Governable are the load-bearing fits.

CMMC posture: IL2-packageable today. IL4 packaging (air-gapped runtime, classified-network deliverable shape) available on request.

ATO support: Bundle artifacts are written for ingestion by the sponsoring service's ATO package; Backstaff does not seek ATO on its own.

The bundle carries its own proof. When a re-authorization review asks "what evidence do you have that this fleet still behaves the way the prior ATO said it did?", the prior Backstaff root is the citation.

05 ·

Engagement vehicles.

How a program office actually buys this

Backstaff is structured as a measurement deliverable, not a hosted service, which makes it portable across the standard defense vehicles. Each vehicle below is one we are prepared to engage on. Pricing is per-fleet and scoped at engagement; we do not publish a rate card.

AFWERX SBIR Phase I: Feasibility for a named unit's AI-analyst fleet. Single-fleet Backstaff measurement and bundle, scoped to the Phase I deliverable shape.

DIU Commercial Solutions Opening: CSO response on AI assurance / fleet audit problem statements. Backstaff is non-developmental and commercially available.

CDAO Tradewind: Tradewind Solutions Marketplace engagement for AI assurance and test & evaluation work.

GSA MAS: Available through GSA Multiple Award Schedule via partner; direct schedule application in progress.

Prime subcontract: Subcontract under an existing prime delivering the AI fleet. Backstaff measures what the prime ships, on the prime's contract vehicle.

Direct purchase: Commercial PO for unclassified pilots and Phase 0 scoping work.

For the full cross-sector methodology — how the same instrument lands inside education, research, and commercial customers — see the case-study methodology page.

06 ·

Next step.

One conversation · One scoping call

The first engagement is a scoping conversation: fleet size, contractor mix, vehicle, classification posture, target attestation date. From there we agree on a Phase 0 measurement scope and a bundle delivery date. No vendor-management overhead until the scope is signed.
