For SANS Find Evil 2026 judges

Sift Sentinel is our entry to the SANS Find Evil 2026 hackathon. It is an autonomous DFIR agent that examines a Windows disk image and reports persistence mechanisms with no human in the loop. This page is your map into the live submission. Everything below the fold links into a working artifact, not a screenshot.

Live since 2026-05-03 · SANS Find Evil 2026 submission deadline: 2026-06-15

Walk-through (3 minutes)

If you only have time for four clicks, here is the order that tells the whole story.

Required submission components

The 8 components Devpost requires, with their current status and where to find each one. A status of live means the artifact is judge-readable now; staged means the source content exists and the assembly step is pending.

1. Code Repository
Public GitHub repository with an MIT or Apache 2.0 license. Source for the entire pipeline, MCP server, viewer, site, and synthetic-workstation harness.
Status: repo live, license flip pending

2. Demo Video
A 5-minute screencast that includes at least one self-correction sequence on real case data: a planted artifact triggers a Critic rule, the agent retries, and the second pass succeeds.
Status: pending Slice 8

3. Architecture Diagram
Identifies the architectural pattern (hybrid Custom MCP Server + LangGraph workflow) and distinguishes prompt-based guardrails from architectural guardrails (Docker boundary, MCP read-only tools, capability tokens, dual-channel handler). A minimal sketch of the Critic retry loop follows this list.
Status: live

4. Written Project Description
Devpost story: What we built, How we built it, Challenges, What we learned, What is next. Specific on tradeoffs and autonomous-execution qualities.
Status: staged

5. Dataset Documentation
What we tested against (six static Windows disk images plus the daily synthetic-workstation loop), source provenance, and the per-case findings table.
Status: staged

6. Accuracy Report
Two tracks. Track A is one-shot evaluation on six static disk images (a mix of public answer keys and project-owner annotation). Track B is the continuous daily synthetic-workstation loop. Includes a dedicated evidence-integrity section: read-only mounts, attempted-write probes through the MCP boundary, and how the architecture prevents modification of the original data.
Status: live

7. Try-It-Out Instructions
Two paths. The recommended path runs the agent on your own machine against your own evidence; the lower-friction path runs the agent on our infrastructure against a scenario you describe in plain English.
Path A (recommended): clone the repo and run against your own disk image. Bring your own Windows E01 or .raw image. The pipeline runs in two Docker containers on a SIFT Workstation (or any Linux/macOS/WSL host with Docker). You see every tool call, every plan, and every Critic decision. You provide an OpenRouter API key.
README quick-start · full Docker runbook · CLI: run_case.py --case <name> --e01 <path>
Path B (no install): submit a scenario and we run it on the daily synthetic workstation. Describe an attack in plain English. Our research agent translates it into a manifest, the synthetic-workstation builder plants the artifacts, the pipeline runs, and you get a scorecard. No API key required.
Status: both live

8. Agent Execution Logs
Iteration-over-iteration traces showing how the agent's approach changed across Critic retries. Per-case bundles: Langfuse export, integrity ledger, and the Critic-disagreement log.
Status: live
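The guarded-execution loop behind components 2, 3, and 8, in miniature. This is a hedged sketch, not the repository's code: the node names, state shape, stub rule, and retry budget value are assumptions; only the overall pattern (a LangGraph StateGraph routing between an Investigator and a deterministic Critic) comes from the architecture description above.

```python
# Sketch of the guarded-execution loop: an Investigator proposes findings,
# a deterministic Critic either commits them or routes back for another
# pass, within a bounded retry budget. All identifiers are illustrative.
from typing import Callable, TypedDict
from langgraph.graph import END, StateGraph

MAX_RETRIES = 3  # bounded retry budget; the actual value is an assumption

class CaseState(TypedDict):
    findings: list[dict]        # candidate findings from the Investigator
    critic_failures: list[str]  # rule IDs that rejected the last pass
    retries: int

def propose_findings(state: CaseState) -> list[dict]:
    # Stub: the real Investigator calls the model through the read-only
    # MCP tool boundary and cites tool output for every claim.
    return [{"claim": "Run-key persistence", "excerpt": "...", "tool_call_id": "tc_1"}]

# Deterministic Critic rules: each maps the findings list to pass/fail.
CRITIC_RULES: dict[str, Callable[[list[dict]], bool]] = {
    "R_01": lambda fs: all(f.get("excerpt") for f in fs),  # stub rule
}

def investigate(state: CaseState) -> dict:
    return {"findings": propose_findings(state)}

def critic(state: CaseState) -> dict:
    failures = [rid for rid, rule in CRITIC_RULES.items()
                if not rule(state["findings"])]
    return {"critic_failures": failures,
            "retries": state["retries"] + (1 if failures else 0)}

def route(state: CaseState) -> str:
    if not state["critic_failures"]:
        return "commit"     # every deterministic rule passed
    if state["retries"] >= MAX_RETRIES:
        return "fail_fast"  # budget exhausted: pause for human review
    return "retry"          # back to the Investigator

graph = StateGraph(CaseState)
graph.add_node("investigate", investigate)
graph.add_node("critic", critic)
graph.set_entry_point("investigate")
graph.add_edge("investigate", "critic")
graph.add_conditional_edges("critic", route,
                            {"retry": "investigate",
                             "commit": END,
                             "fail_fast": END})
pipeline = graph.compile()
result = pipeline.invoke({"findings": [], "critic_failures": [], "retries": 0})
```

The design point this illustrates: the retry decision is deterministic graph routing, not a prompt asking the model to double-check itself. That is the prompt-based versus architectural guardrail distinction the architecture diagram draws.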

Accuracy at a glance

23 cases, 71 findings, zero hallucinations. The pipeline has been run end-to-end against 23 distinct Windows hosts across three channels (disk-only, dual-channel disk + memory, and memory-only), with the Critic catching every claim before commit.

23 · cases run end-to-end
71 · findings surfaced
0 · hallucinations across the 7 runs scored in the accuracy report (rule R_05 rejects any finding citing text not literally in tool output; sketched below)
1.00 / 1.00 · precision / recall on the externally validated case (DFIR Madness 001)
34 · high-confidence findings (committable without retry)
23 · memory-channel findings (process injection, C2 beacons, fileless persistence)
5 · memory-channel runs human-reviewed and approved
4 · hosts where the cross-host masquerading-service signature recurred
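Rule R_05 from the zero-hallucinations stat, sketched as the deterministic check it is described as. The field names (excerpt, tool_call_id) mirror the cited-evidence fields the run viewer exposes, but the function name and record shape here are assumptions for illustration.

```python
def r05_literal_citation(finding: dict, tool_outputs: dict[str, str]) -> bool:
    """R_05 (sketch): reject any finding whose cited excerpt is not
    literally present in the output of the tool call it cites.

    `tool_outputs` maps tool_call_id -> the raw text that tool returned.
    """
    raw = tool_outputs.get(finding.get("tool_call_id", ""), "")
    excerpt = finding.get("excerpt", "")
    # Exact substring containment: no fuzzy matching, no paraphrase.
    # That is what makes the rule deterministic and self-auditable.
    return bool(excerpt) and excerpt in raw

# A verbatim excerpt passes; a paraphrased or invented one fails.
outputs = {"tc_42": r"Run key HKLM\Software\...\Run -> C:\Users\Public\svch0st.exe"}
assert r05_literal_citation(
    {"excerpt": r"C:\Users\Public\svch0st.exe", "tool_call_id": "tc_42"}, outputs)
assert not r05_literal_citation(
    {"excerpt": "a suspicious run key was set", "tool_call_id": "tc_42"}, outputs)
```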

Cases by ground-truth strength

Highest authority first. The single externally-validated case carries the strongest precision claim; owner-annotated cases are informed by community write-ups; memory-channel approvals are individually reviewed by the project owner. Disk-only auto-commits land when the Critic's deterministic rules accept every finding without retry.

case · channel · status · findings · high-conf · memory
dfirmadness-001-desktop
externally validated · DFIR Madness public answer key
disk · committed · 2 · 2 · 0
srl-2018-wkstn-05
owner annotated · community write-ups
disk · committed · 2 · 2 · 0
srl-2018-wkstn-05
owner annotated · dual channel
disk + mem · committed · 5 · 3 · 3
srl-2018-base-rd-05-memonly
memory channel · approved 2026-05-03
memory · approved · 4 · 3 · 2
srl-2018-base-wkstn-03-memonly
memory channel · approved 2026-05-03
memory · approved · 7 · 0 · 3
srl-2018-base-wkstn-06-memonly
memory channel · approved 2026-05-03
memory · approved · 3 · 0 · 1
srl-2018-base-rd-03-memonly
memory channel · approved 2026-05-03 (LARIAT framework + masquerading services)
memory · approved · 3 · 0 · 0
srl-2018-base-wkstn-04-memonly
memory channel · approved 2026-05-03
memory · approved · 4 · 0 · 2
srl-2018-base-rd-01-dual
cross-channel correlation
disk + mem · committed · 4 · 2 · 2
srl-2018-base-rd-02-dual
cross-channel correlation
disk + mem · committed · 3 · 3 · 1
srl-2018-base-file-dual
cross-channel correlation
disk + mem · committed · 4 · 4 · 2
+ 12 more cases (disk-only and dual-channel) ending in SUCCESS, including base-dc (negative control, no positives to predict) and the rest of the SRL workstation set. Full per-case breakdown in the run viewer.
What the numbers mean, and what they do not. The 1.00 / 1.00 precision and recall come from the one case with a third-party answer key (DFIR Madness 001). The other 22 cases carry weaker ground-truth claims: owner-annotated against community write-ups, individually approved after human review, or auto-committed when every finding cleared the Critic's rules without retry. The "zero hallucinations" claim is verified only across the 7 runs scored in the accuracy report; the remaining 16 runs have not yet been independently audited for hallucination rate. Every committed finding in the run viewer surfaces its cited evidence excerpt with a tool_call_id, so judges can self-audit any claim by clicking into a case.

A second accuracy track ships continuously: the daily synthetic-workstation loop. Every night Haiku reads recent threat-intel news and writes a manifest, the synthetic-workstation builder plants synthetic versions of that tradecraft on a baseline image, and the pipeline scores its findings against the planted manifest. That data lands on the Today's run page and accumulates without further engineering effort.
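A sketch of what the nightly scorecard computation could look like, assuming planted artifacts and pipeline findings can be matched on a shared identifier. The function name and manifest shape are illustrative assumptions, not the harness's actual schema.

```python
def score_run(planted: set[str], reported: set[str]) -> dict[str, float]:
    """Score pipeline findings against the synthetic-workstation manifest.

    `planted` holds identifiers for artifacts the builder planted;
    `reported` holds identifiers the pipeline attributed findings to.
    Matching on a shared identifier is an assumption of this sketch.
    """
    true_pos = planted & reported
    precision = len(true_pos) / len(reported) if reported else 1.0
    recall = len(true_pos) / len(planted) if planted else 1.0
    return {"precision": precision, "recall": recall,
            "missed": len(planted - reported),
            "unexpected": len(reported - planted)}

# e.g. three planted artifacts, two found, one claim outside the manifest:
# precision 2/3, recall 2/3, one miss, one unexpected claim.
print(score_run({"runkey_a", "task_b", "service_c"},
                {"runkey_a", "task_b", "extra_d"}))
```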

What is the autonomy story?

The pipeline ships at L2 (Guarded Execution): the agent self-corrects via deterministic Critic rules and a bounded retry budget, with humans gating the initial plan and final findings. Submission target is L3 (Exception-Based Autonomy): only Low-confidence findings or fail-fast events pause for human review.
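The L2-to-L3 difference as a gating predicate. A minimal sketch under the description above: the Confidence enum, function name, and single-boolean gate are assumptions; the real pipeline's gates are richer than this.

```python
from enum import Enum

class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

def needs_human(confidence: Confidence, fail_fast: bool,
                autonomy_level: int) -> bool:
    """Decide whether a run pauses for human review.

    L2 (Guarded Execution): humans gate the plan and final findings,
    so every run pauses regardless of confidence.
    L3 (Exception-Based Autonomy): only Low-confidence findings or
    fail-fast events pause; everything else commits unattended.
    """
    if autonomy_level <= 2:
        return True
    return fail_fast or confidence is Confidence.LOW

# At L3, a High-confidence finding with no fail-fast commits unattended:
assert needs_human(Confidence.HIGH, fail_fast=False, autonomy_level=3) is False
assert needs_human(Confidence.LOW, fail_fast=False, autonomy_level=3) is True
```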

The architecture page shows the autonomy climb in the chip row at the top: L1 assisted (shipped), L2 guarded (shipped, the current state), L3 exception (the goal for the submission cut on the bounded reference dataset).