Benchmarks
Sibyl ranks #2 overall on LongMemEval Oracle at 95.6%, tied with Chronos (PwC), and is the only file-based system in the top tier. It uses no vector database and no embeddings, and runs on a single 4 vCPU / 16 GB box.
LongMemEval Oracle: the architecture
LongMemEval (ICLR 2025, University of Michigan) is a 500-question test of long-horizon memory. Sibyl's architecture scored 95.6% with Claude Opus 4.6 and 93.6% with Sonnet 4.6, placing #2 overall. The only system above sits at 96.2%; everything else, including Mastra, MemMachine, Hindsight, Mem0, Supermemory, Zep, and the Oracle baseline, sits below.
| Category (Opus 4.6) | Score |
|---|---|
| single-session-user | 100% |
| single-session-assistant | 100% |
| temporal-reasoning | 96.2% |
| single-session-preference | 93.3% |
| multi-session | 93.2% |
| knowledge-update | 92.3% |
| overall | 95.6% |
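As a rough sanity check, the per-category scores in the table above can be rolled up into the overall figure. The category sizes below are an assumption: they come from the public LongMemEval question set, not from this page, and the helper names are illustrative only.

```python
# Category sizes are an ASSUMPTION taken from the public LongMemEval
# question set (500 questions total); this page does not state them.
CATEGORY_SIZES = {
    "single-session-user": 70,
    "single-session-assistant": 56,
    "temporal-reasoning": 133,
    "single-session-preference": 30,
    "multi-session": 133,
    "knowledge-update": 78,
}

# Per-category accuracy (percent) as reported in the table above.
REPORTED_PCT = {
    "single-session-user": 100.0,
    "single-session-assistant": 100.0,
    "temporal-reasoning": 96.2,
    "single-session-preference": 93.3,
    "multi-session": 93.2,
    "knowledge-update": 92.3,
}


def implied_correct(pct: float, n: int) -> int:
    """Whole number of correct answers closest to a rounded percentage."""
    return round(pct / 100 * n)


def overall_accuracy(sizes: dict, reported: dict) -> tuple:
    """Sum implied correct answers across categories and compute the overall rate."""
    correct = sum(implied_correct(reported[name], n) for name, n in sizes.items())
    total = sum(sizes.values())
    return correct, total, round(100 * correct / total, 1)


correct, total, pct = overall_accuracy(CATEGORY_SIZES, REPORTED_PCT)
print(f"{correct}/{total} = {pct}%")  # prints: 478/500 = 95.6%
```

Under those assumed sizes, the rounded category scores are consistent with the reported 95.6% overall.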
"We did not optimize for the benchmark. We optimized for production efficiency. The benchmark improvement was a side effect." Full report and per-category methodology at blog.sibylcap.com/longmemeval-v2.
The plugin matches the architecture
The published number above comes from the file-based architecture. The productized plugin, the same one you install from PyPI, was run against the same 500 questions and scored 95.1% on Sonnet 4.5 (447/470 ex-preference, 90.6% raw across all 500, zero errored records). That lands within half a point of the architectural ceiling: tool-mediated access through sibyl_search / sibyl_recall / sibyl_list holds up against raw file reads, because the HOT tier preserves verbatim text and the WARM tier adds extracted cross-references.
Full plugin report: blog.sibylcap.com/plugin-longmemeval.
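The plugin figures quoted above use two different denominators, so a short worked calculation may help when reading them. This is only a sketch of the arithmetic: the variable names are illustrative, and the split of the raw score assumes the 30 questions left out of the 470 are the preference category.

```python
TOTAL_QUESTIONS = 500
EX_PREFERENCE_CORRECT = 447   # from "447/470 ex-preference"
EX_PREFERENCE_TOTAL = 470
RAW_PCT = 90.6                # "90.6% raw across all 500"
ARCHITECTURE_PCT = 95.6       # file-based architecture score (Opus 4.6)

# Headline plugin score: correct answers over the ex-preference subset.
ex_pref_pct = round(100 * EX_PREFERENCE_CORRECT / EX_PREFERENCE_TOTAL, 1)

# Raw score converted back to a whole number of correct answers.
raw_correct = round(RAW_PCT / 100 * TOTAL_QUESTIONS)

# ASSUMPTION: the excluded questions are exactly the preference category.
excluded_total = TOTAL_QUESTIONS - EX_PREFERENCE_TOTAL
excluded_correct = raw_correct - EX_PREFERENCE_CORRECT

# Gap between the architecture score and the plugin's ex-preference score.
gap_points = round(ARCHITECTURE_PCT - ex_pref_pct, 1)

print(f"ex-preference: {ex_pref_pct}%")
print(f"raw: {raw_correct}/{TOTAL_QUESTIONS}")
print(f"excluded subset: {excluded_correct}/{excluded_total}")
print(f"gap to architecture: {gap_points} points")
```

Note that the half-point gap compares the plugin's ex-preference score on Sonnet 4.5 with the architecture's all-category score on Opus 4.6, so the two numbers are not measured on identical terms.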
Independent beta testing
These runs were done by independent closed-beta testers on their own machines against the real shipped packages. They are independent-tester results, not a Sibyl-published benchmark like LongMemEval, and not vendor-reported numbers from the other engines.
Four engines, one corpus
A tester ran the same deterministic corpus (500 companies, 1,500 people, 365 simulated days, 191k records; 350 questions) against four memory engines. Retrieval was isolated from the model, then Claude Sonnet 4.6 answered from the retrieved context.
| Engine | Retrieval /350 | Answered /350 | Cost (all 350) |
|---|---|---|---|
| Sibyl | 350 (100%) | 344 | $0.64 |
| Hindsight | 152 | 152 | $18.68 |
| Mem0 | 92 | 105 | $2.76 |
| Mnemosyne | 5 | 55 | $2.78 |
Sibyl retrieved 50/50 in every category, reading about two rows and ~228 tokens per query with zero write-time LLM calls. Report and a PII-scrubbed replication kit: blog.sibylcap.com/beta-analysis-v2.
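To compare the four engines on a common scale, the table rows can be turned into rates and a cost per answered question. The sketch below only re-derives numbers from the table; the class and field names are hypothetical and do not come from any of the engines' APIs.

```python
from dataclasses import dataclass

QUESTIONS = 350  # questions in the tester's corpus


@dataclass(frozen=True)
class EngineResult:
    name: str
    retrieved: int    # "Retrieval /350" column
    answered: int     # "Answered /350" column
    cost_usd: float   # "Cost (all 350)" column


RESULTS = [
    EngineResult("Sibyl", 350, 344, 0.64),
    EngineResult("Hindsight", 152, 152, 18.68),
    EngineResult("Mem0", 92, 105, 2.76),
    EngineResult("Mnemosyne", 5, 55, 2.78),
]


def summarize(result: EngineResult) -> dict:
    """Derive percentage rates and cost per answered question for one engine."""
    return {
        "engine": result.name,
        "retrieval_pct": round(100 * result.retrieved / QUESTIONS, 1),
        "answered_pct": round(100 * result.answered / QUESTIONS, 1),
        "usd_per_answered": round(result.cost_usd / result.answered, 4),
    }


for row in map(summarize, RESULTS):
    print(row)
```

On these figures, Sibyl works out to roughly $0.002 per answered question against about $0.12 for Hindsight.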
Head to head: Sibyl vs Honcho
A separate one-to-one comparison on a 42k-record corpus (250 questions, Sonnet 4.6), run with a verified-fair, matched configuration:
| Metric | Sibyl | Honcho |
|---|---|---|
| Retrieval contained answer | 97.2% | 87.6% |
| Answer correct | 97.2% | 85.6% |
| Avg context / query | 291 tok | 1,313 tok |
Report: blog.sibylcap.com/beta-analysis.
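The head-to-head table mixes percentages and token counts, so the differences are easier to see as point gaps and a context-size ratio. This is a small illustrative calculation over the table values only; the dictionary keys are made up for the example.

```python
QUESTIONS = 250  # questions in the head-to-head run

SIBYL = {"retrieval_pct": 97.2, "answer_pct": 97.2, "avg_context_tokens": 291}
HONCHO = {"retrieval_pct": 87.6, "answer_pct": 85.6, "avg_context_tokens": 1313}

# Accuracy differences in percentage points.
retrieval_gap = round(SIBYL["retrieval_pct"] - HONCHO["retrieval_pct"], 1)
answer_gap = round(SIBYL["answer_pct"] - HONCHO["answer_pct"], 1)

# How much more context Honcho passed to the model per query, on average.
context_ratio = round(HONCHO["avg_context_tokens"] / SIBYL["avg_context_tokens"], 1)

# Approximate total context tokens across the run (average times question count).
sibyl_total_tokens = SIBYL["avg_context_tokens"] * QUESTIONS
honcho_total_tokens = HONCHO["avg_context_tokens"] * QUESTIONS

print(f"retrieval gap: {retrieval_gap} points")
print(f"answer gap: {answer_gap} points")
print(f"context ratio: {context_ratio}x")
print(f"total context: {sibyl_total_tokens} vs {honcho_total_tokens} tokens")
```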
The methodology and scorers are open. Start with the LongMemEval report, then install the plugin and run your own.