For SANS Find Evil 2026 judges

Sift Sentinel is our entry to the SANS Find Evil 2026 hackathon. It is an autonomous DFIR agent that examines a Windows disk image and reports persistence mechanisms with no human in the loop. This page is your map into the live submission. Everything below the fold links into a working artifact, not a screenshot.

Live since 2026-05-03 · SANS Find Evil 2026 submission deadline: 2026-06-15

Walk-through (3 minutes)

If you only have time for four clicks, here is the order that tells the whole story.

Required submission components

The 8 components Devpost requires, with their current status and where to find each one. A status of live means the artifact is judge-readable now; staged means the source content exists and the assembly step is pending.

1. Code Repository
Public GitHub repository with an MIT or Apache 2.0 license. Source for the entire pipeline, MCP server, viewer, site, and synthetic-workstation harness.
Status: repo live, license flip pending

2. Demo Video
A 5-minute screencast that includes at least one self-correction sequence on real case data: a planted artifact triggers a Critic rule, the agent retries, and the second pass succeeds.
Status: pending Slice 8

3. Architecture Diagram
Identifies the architectural pattern (hybrid Custom MCP Server + LangGraph workflow) and distinguishes prompt-based guardrails from architectural guardrails (Docker boundary, MCP read-only tools, capability tokens, dual-channel handler). A minimal sketch of the Critic retry loop follows this list.
Status: live

4. Written Project Description
Devpost story: What we built, How we built it, Challenges, What we learned, What is next. Specific on tradeoffs and autonomous-execution qualities.
Status: staged

5. Dataset Documentation
What we tested against (six static Windows disk images plus the daily synthetic-workstation loop), source provenance, and the per-case findings table.
Status: staged

6. Accuracy Report
Two tracks. Track A is one-shot evaluation on six static disk images (a mix of public answer keys and project-owner annotation). Track B is the continuous daily synthetic-workstation loop. Includes a dedicated evidence-integrity section: read-only mounts, attempted-write probes through the MCP boundary, and how the architecture prevents modification of the original data.
Status: live

7. Try-It-Out Instructions
Two paths. The recommended path runs the agent on your own machine against your own evidence; the lower-friction path runs the agent on our infrastructure against a scenario you describe in plain English.
Path A (recommended): clone the repo and run against your own disk image. Bring your own Windows E01 or .raw image. The pipeline runs in two Docker containers on a SIFT Workstation (or any Linux/macOS/WSL host with Docker). You see every tool call, every plan, and every Critic decision. You provide an OpenRouter API key.
README quick-start · full Docker runbook · CLI: run_case.py --case <name> --e01 <path>
Path B (no install): submit a scenario and we run it on the daily synthetic workstation. Describe an attack in plain English. Our research agent translates it into a manifest, the synthetic-workstation builder plants the artifacts, the pipeline runs, and you get a scorecard. No API key required.
Status: both live

8. Agent Execution Logs
Iteration-over-iteration traces showing how the agent's approach changed across Critic retries. Per-case bundles: Langfuse export, integrity ledger, and the Critic-disagreement log.
Status: live
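The guarded-execution loop behind components 2, 3, and 8, in miniature. This is a hedged sketch, not the repository's code: the node names, state shape, stub rule, and retry budget value are assumptions; only the overall pattern (a LangGraph StateGraph routing between an Investigator and a deterministic Critic) comes from the architecture description above.

```python
# Sketch of the guarded-execution loop: an Investigator proposes findings,
# a deterministic Critic either commits them or routes back for another
# pass, within a bounded retry budget. All identifiers are illustrative.
from typing import Callable, TypedDict
from langgraph.graph import END, StateGraph

MAX_RETRIES = 3  # bounded retry budget; the actual value is an assumption

class CaseState(TypedDict):
    findings: list[dict]        # candidate findings from the Investigator
    critic_failures: list[str]  # rule IDs that rejected the last pass
    retries: int

def propose_findings(state: CaseState) -> list[dict]:
    # Stub: the real Investigator calls the model through the read-only
    # MCP tool boundary and cites tool output for every claim.
    return [{"claim": "Run-key persistence", "excerpt": "...", "tool_call_id": "tc_1"}]

# Deterministic Critic rules: each maps the findings list to pass/fail.
CRITIC_RULES: dict[str, Callable[[list[dict]], bool]] = {
    "R_01": lambda fs: all(f.get("excerpt") for f in fs),  # stub rule
}

def investigate(state: CaseState) -> dict:
    return {"findings": propose_findings(state)}

def critic(state: CaseState) -> dict:
    failures = [rid for rid, rule in CRITIC_RULES.items()
                if not rule(state["findings"])]
    return {"critic_failures": failures,
            "retries": state["retries"] + (1 if failures else 0)}

def route(state: CaseState) -> str:
    if not state["critic_failures"]:
        return "commit"     # every deterministic rule passed
    if state["retries"] >= MAX_RETRIES:
        return "fail_fast"  # budget exhausted: pause for human review
    return "retry"          # back to the Investigator

graph = StateGraph(CaseState)
graph.add_node("investigate", investigate)
graph.add_node("critic", critic)
graph.set_entry_point("investigate")
graph.add_edge("investigate", "critic")
graph.add_conditional_edges("critic", route,
                            {"retry": "investigate",
                             "commit": END,
                             "fail_fast": END})
pipeline = graph.compile()
result = pipeline.invoke({"findings": [], "critic_failures": [], "retries": 0})
```

The design point this illustrates: the retry decision is deterministic graph routing, not a prompt asking the model to double-check itself. That is the prompt-based versus architectural guardrail distinction the architecture diagram draws.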

Accuracy at a glance

23 cases, 71 findings, zero hallucinations. The pipeline has been run end-to-end against 23 distinct Windows hosts across three channels (disk-only, dual-channel disk + memory, and memory-only), with the Critic catching every claim before commit.

23 · cases run end-to-end
71 · findings surfaced
0 · hallucinations across the 7 runs scored in the accuracy report (rule R_05 rejects any finding citing text not literally in tool output; sketched below)
1.00 / 1.00 · precision / recall on the externally validated case (DFIR Madness 001)
34 · high-confidence findings (committable without retry)
23 · memory-channel findings (process injection, C2 beacons, fileless persistence)
5 · memory-channel runs human-reviewed and approved
4 · hosts where the cross-host masquerading-service signature recurred
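Rule R_05 from the zero-hallucinations stat, sketched as the deterministic check it is described as. The field names (excerpt, tool_call_id) mirror the cited-evidence fields the run viewer exposes, but the function name and record shape here are assumptions for illustration.

```python
def r05_literal_citation(finding: dict, tool_outputs: dict[str, str]) -> bool:
    """R_05 (sketch): reject any finding whose cited excerpt is not
    literally present in the output of the tool call it cites.

    `tool_outputs` maps tool_call_id -> the raw text that tool returned.
    """
    raw = tool_outputs.get(finding.get("tool_call_id", ""), "")
    excerpt = finding.get("excerpt", "")
    # Exact substring containment: no fuzzy matching, no paraphrase.
    # That is what makes the rule deterministic and self-auditable.
    return bool(excerpt) and excerpt in raw

# A verbatim excerpt passes; a paraphrased or invented one fails.
outputs = {"tc_42": r"Run key HKLM\Software\...\Run -> C:\Users\Public\svch0st.exe"}
assert r05_literal_citation(
    {"excerpt": r"C:\Users\Public\svch0st.exe", "tool_call_id": "tc_42"}, outputs)
assert not r05_literal_citation(
    {"excerpt": "a suspicious run key was set", "tool_call_id": "tc_42"}, outputs)
```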

Cases by ground-truth strength

Highest authority first. The single externally-validated case carries the strongest precision claim; owner-annotated cases are informed by community write-ups; memory-channel approvals are individually reviewed by the project owner. Disk-only auto-commits land when the Critic's deterministic rules accept every finding without retry.

case · channel · status · findings · high-conf · memory
dfirmadness-001-desktop
externally validated · DFIR Madness public answer key
disk · committed · 2 · 2 · 0
srl-2018-wkstn-05
owner annotated · community write-ups
disk · committed · 2 · 2 · 0
srl-2018-wkstn-05
owner annotated · dual channel
disk + mem · committed · 5 · 3 · 3
srl-2018-base-rd-05-memonly
memory channel · approved 2026-05-03
memory · approved · 4 · 3 · 2
srl-2018-base-wkstn-03-memonly
memory channel · approved 2026-05-03
memory · approved · 7 · 0 · 3
srl-2018-base-wkstn-06-memonly
memory channel · approved 2026-05-03
memory · approved · 3 · 0 · 1
srl-2018-base-rd-03-memonly
memory channel · approved 2026-05-03 (LARIAT framework + masquerading services)
memory · approved · 3 · 0 · 0
srl-2018-base-wkstn-04-memonly
memory channel · approved 2026-05-03
memory · approved · 4 · 0 · 2
srl-2018-base-rd-01-dual
cross-channel correlation
disk + mem · committed · 4 · 2 · 2
srl-2018-base-rd-02-dual
cross-channel correlation
disk + mem · committed · 3 · 3 · 1
srl-2018-base-file-dual
cross-channel correlation
disk + mem · committed · 4 · 4 · 2
+ 12 more cases (disk-only and dual-channel) ending in SUCCESS, including base-dc (negative control, no positives to predict) and the rest of the SRL workstation set. Full per-case breakdown in the run viewer.
What the numbers mean, and what they do not. The 1.00 / 1.00 precision and recall come from the one case with a third-party answer key (DFIR Madness 001). The other 22 cases carry weaker ground-truth claims: owner-annotated against community write-ups, individually approved after human review, or auto-committed when every finding cleared the Critic's rules without retry. The "zero hallucinations" claim is verified only across the 7 runs scored in the accuracy report; the remaining 16 runs have not yet been independently audited for hallucination rate. Every committed finding in the run viewer surfaces its cited evidence excerpt with a tool_call_id, so judges can self-audit any claim by clicking into a case.

A second accuracy track ships continuously: the daily synthetic-workstation loop. Every night Haiku reads recent threat-intel news and writes a manifest, the synthetic-workstation builder plants synthetic versions of that tradecraft on a baseline image, and the pipeline scores its findings against the planted manifest. That data lands on the Today's run page and accumulates without further engineering effort.
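A sketch of what the nightly scorecard computation could look like, assuming planted artifacts and pipeline findings can be matched on a shared identifier. The function name and manifest shape are illustrative assumptions, not the harness's actual schema.

```python
def score_run(planted: set[str], reported: set[str]) -> dict[str, float]:
    """Score pipeline findings against the synthetic-workstation manifest.

    `planted` holds identifiers for artifacts the builder planted;
    `reported` holds identifiers the pipeline attributed findings to.
    Matching on a shared identifier is an assumption of this sketch.
    """
    true_pos = planted & reported
    precision = len(true_pos) / len(reported) if reported else 1.0
    recall = len(true_pos) / len(planted) if planted else 1.0
    return {"precision": precision, "recall": recall,
            "missed": len(planted - reported),
            "unexpected": len(reported - planted)}

# e.g. three planted artifacts, two found, one claim outside the manifest:
# precision 2/3, recall 2/3, one miss, one unexpected claim.
print(score_run({"runkey_a", "task_b", "service_c"},
                {"runkey_a", "task_b", "extra_d"}))
```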

What is the autonomy story?

The pipeline ships at L2 (Guarded Execution): the agent self-corrects via deterministic Critic rules and a bounded retry budget, with humans gating the initial plan and final findings. Submission target is L3 (Exception-Based Autonomy): only Low-confidence findings or fail-fast events pause for human review.
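The L2-to-L3 difference as a gating predicate. A minimal sketch under the description above: the Confidence enum, function name, and single-boolean gate are assumptions; the real pipeline's gates are richer than this.

```python
from enum import Enum

class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

def needs_human(confidence: Confidence, fail_fast: bool,
                autonomy_level: int) -> bool:
    """Decide whether a run pauses for human review.

    L2 (Guarded Execution): humans gate the plan and final findings,
    so every run pauses regardless of confidence.
    L3 (Exception-Based Autonomy): only Low-confidence findings or
    fail-fast events pause; everything else commits unattended.
    """
    if autonomy_level <= 2:
        return True
    return fail_fast or confidence is Confidence.LOW

# At L3, a High-confidence finding with no fail-fast commits unattended:
assert needs_human(Confidence.HIGH, fail_fast=False, autonomy_level=3) is False
assert needs_human(Confidence.LOW, fail_fast=False, autonomy_level=3) is True
```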

The architecture page shows the autonomy climb in the chip row at the top: L1 assisted (shipped), L2 guarded (shipped, the current state), L3 exception (the goal for the submission cut on the bounded reference dataset).