Benchmarks

Sibyl ranks #2 overall on LongMemEval Oracle at 95.6%, tied with Chronos (PwC), and is the only file-based system in the top tier. It uses no vector database and no embeddings, and runs on a single 4 vCPU / 16 GB box.

LongMemEval Oracle: the architecture

LongMemEval (ICLR 2025, University of Michigan) is a 500-question test of long-horizon memory. Sibyl's architecture scored 95.6% with Claude Opus 4.6 and 93.6% with Sonnet 4.6, placing #2 overall. The only system above sits at 96.2%; everything else, including Mastra, MemMachine, Hindsight, Mem0, Supermemory, Zep, and the Oracle baseline, sits below.

Category (Opus 4.6)	Score
single-session-user	100%
single-session-assistant	100%
temporal-reasoning	96.2%
single-session-preference	93.3%
multi-session	93.2%
knowledge-update	92.3%
overall	95.6%
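As a rough consistency check, the per-category scores can be rolled up into the overall figure. The sketch below assumes the category sizes of the public LongMemEval question set (70, 56, 133, 30, 133, 78), which are not stated on this page. The per-category correct counts are inferred by rounding, not taken from the report, and the helper names are hypothetical.

```python
# Category sizes are an ASSUMPTION (public LongMemEval split), not from this page.
CATEGORY_SIZES = {
    "single-session-user": 70,
    "single-session-assistant": 56,
    "temporal-reasoning": 133,
    "single-session-preference": 30,
    "multi-session": 133,
    "knowledge-update": 78,
}

# Scores in percent, copied from the Opus 4.6 table above.
PUBLISHED_SCORES = {
    "single-session-user": 100.0,
    "single-session-assistant": 100.0,
    "temporal-reasoning": 96.2,
    "single-session-preference": 93.3,
    "multi-session": 93.2,
    "knowledge-update": 92.3,
}


def implied_correct(sizes, scores):
    """Infer correct-answer counts by rounding score * size to a whole question."""
    return {name: round(scores[name] / 100 * sizes[name]) for name in sizes}


def overall_percent(sizes, scores):
    """Roll the inferred per-category counts up into one overall percentage."""
    correct = implied_correct(sizes, scores)
    return 100 * sum(correct.values()) / sum(sizes.values())


if __name__ == "__main__":
    counts = implied_correct(CATEGORY_SIZES, PUBLISHED_SCORES)
    print(sum(counts.values()), "of", sum(CATEGORY_SIZES.values()))
    print(f"{overall_percent(CATEGORY_SIZES, PUBLISHED_SCORES):.1f}%")
```

Under these assumptions the rollup comes to 478 of 500, which matches the published 95.6%.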

Methodology

"We did not optimize for the benchmark. We optimized for production efficiency. The benchmark improvement was a side effect." The full report and per-category methodology are at blog.sibylcap.com/longmemeval-v2.

The plugin matches the architecture

The published number above comes from the file-based architecture. The productized plugin, the same one you install from PyPI, was run against the same 500 questions and scored 95.1% on Sonnet 4.5 (447/470 with the preference category excluded; 90.6% raw across all 500; zero errored records). That lands within half a point of the architectural ceiling: tool-mediated access through sibyl_search / sibyl_recall / sibyl_list holds up against raw file reads, because the HOT tier preserves verbatim text and the WARM tier adds extracted cross-references.

Full plugin report: blog.sibylcap.com/plugin-longmemeval.
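The plugin figures quoted in this section can be reconciled with a few lines of arithmetic. This is a sketch only: the variable names are hypothetical, and the raw correct count is inferred from the 90.6% figure by rounding, not taken from the report.

```python
# Figures quoted in this section.
TOTAL_QUESTIONS = 500
EX_PREFERENCE_CORRECT = 447
EX_PREFERENCE_TOTAL = 470
RAW_PERCENT = 90.6            # raw score across all 500 questions
ARCHITECTURE_PERCENT = 95.6   # published file-based architecture score


def percent(correct, total):
    """Return correct/total as a percentage."""
    return 100 * correct / total


# Score with the preference category excluded.
ex_preference_percent = percent(EX_PREFERENCE_CORRECT, EX_PREFERENCE_TOTAL)

# ASSUMPTION: the raw percentage corresponds to a whole number of questions.
raw_correct = round(RAW_PERCENT / 100 * TOTAL_QUESTIONS)

# Number of questions left out of the ex-preference figure.
excluded_questions = TOTAL_QUESTIONS - EX_PREFERENCE_TOTAL

# Distance between the plugin's ex-preference score and the architecture score.
gap_points = ARCHITECTURE_PERCENT - ex_preference_percent

if __name__ == "__main__":
    print(f"ex-preference: {ex_preference_percent:.1f}%")
    print(f"raw correct (inferred): {raw_correct} of {TOTAL_QUESTIONS}")
    print(f"excluded questions: {excluded_questions}")
    print(f"gap to architecture score: {gap_points:.2f} points")
```

The gap comes out just under half a point, which is the comparison the paragraph above makes. Note that it sets the plugin's ex-preference score against the architecture's overall score.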

Independent beta testing

These runs were done by independent closed-beta testers on their own machines against the real shipped packages. They are independent-tester results, not a Sibyl-published benchmark like LongMemEval, and not vendor-reported numbers from the other engines.

Four engines, one corpus

A tester ran the same deterministic corpus (500 companies, 1,500 people, 365 simulated days, 191k records; 350 questions) against four memory engines. Retrieval was isolated from the model, then Claude Sonnet 4.6 answered from the retrieved context.

Engine	Retrieval /350	Answered /350	Cost (all 350)
Sibyl	350 (100%)	344	$0.64
Hindsight	152	152	$18.68
Mem0	92	105	$2.76
Mnemosyne	5	55	$2.78

Sibyl retrieved 50/50 in every category, reading about two rows and ~228 tokens per query with zero write-time LLM calls. Report and a PII-scrubbed replication kit: blog.sibylcap.com/beta-analysis-v2.
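For a like-for-like view of the table, the totals can be converted into a cost per answered question. This is a minimal sketch: the cost-per-answer metric is derived here and does not appear in the tester's report, and the `EngineResult` type and helper names are hypothetical.

```python
from dataclasses import dataclass

TOTAL_QUESTIONS = 350


@dataclass(frozen=True)
class EngineResult:
    name: str
    retrieved: int    # "Retrieval /350" column
    answered: int     # "Answered /350" column
    cost_usd: float   # "Cost (all 350)" column


# Rows copied from the table above.
RESULTS = [
    EngineResult("Sibyl", 350, 344, 0.64),
    EngineResult("Hindsight", 152, 152, 18.68),
    EngineResult("Mem0", 92, 105, 2.76),
    EngineResult("Mnemosyne", 5, 55, 2.78),
]


def cost_per_answer(result):
    """Total run cost divided by the number of answered questions (derived metric)."""
    return result.cost_usd / result.answered


def answer_rate(result, total=TOTAL_QUESTIONS):
    """Share of the question set that was answered, in percent."""
    return 100 * result.answered / total


def ranked_by_cost(results):
    """Return the engines ordered from cheapest to most expensive per answer."""
    return sorted(results, key=cost_per_answer)


if __name__ == "__main__":
    for r in ranked_by_cost(RESULTS):
        print(f"{r.name:<10} {answer_rate(r):5.1f}%  ${cost_per_answer(r):.4f} per answer")
```

Dividing by answered questions rather than by all 350 keeps an engine that answers few questions from looking cheap.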

Head to head: Sibyl vs Honcho

A separate one-to-one comparison on a 42k-record corpus (250 questions, Sonnet 4.6), run with a verified-fair, matched configuration:

Metric	Sibyl	Honcho
Retrieval contained answer	97.2%	87.6%
Answer correct	97.2%	85.6%
Avg context / query	291 tok	1,313 tok

Report: blog.sibylcap.com/beta-analysis.

Reproduce it

The methodology and scorers are open. Start with the LongMemEval report, then install the plugin and run your own.
