
Benchmarks

#2 overall on LongMemEval Oracle at 95.6%, tied with Chronos (PwC), and the only file-based system in the top tier. No vector database, no embeddings, on a single 4 vCPU / 16 GB box.

LongMemEval Oracle: the architecture

LongMemEval (ICLR 2025, University of Michigan) is a 500-question test of long-horizon memory. Sibyl's architecture scored 95.6% with Claude Opus 4.6 and 93.6% with Sonnet 4.6, placing #2 overall. The only system above sits at 96.2%; everything else, including Mastra, MemMachine, Hindsight, Mem0, Supermemory, Zep, and the Oracle baseline, sits below.

| Category (Opus 4.6) | Score |
| --- | --- |
| single-session-user | 100% |
| single-session-assistant | 100% |
| temporal-reasoning | 96.2% |
| single-session-preference | 93.3% |
| multi-session | 93.2% |
| knowledge-update | 92.3% |
| overall | 95.6% |

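As a cross-check, the overall figure can be recomputed from the per-category scores. The sketch below is illustrative only: the per-category question counts are an assumption taken from the public LongMemEval dataset (they are not stated on this page), and the function name is hypothetical.

```python
# Question counts per category. ASSUMPTION: taken from the public
# LongMemEval dataset, not from this page. They sum to 500.
CATEGORY_SIZES = {
    "single-session-user": 70,
    "single-session-assistant": 56,
    "temporal-reasoning": 133,
    "single-session-preference": 30,
    "multi-session": 133,
    "knowledge-update": 78,
}

# Per-category accuracy in percent, copied from the Opus 4.6 table.
CATEGORY_SCORES = {
    "single-session-user": 100.0,
    "single-session-assistant": 100.0,
    "temporal-reasoning": 96.2,
    "single-session-preference": 93.3,
    "multi-session": 93.2,
    "knowledge-update": 92.3,
}


def overall_accuracy(sizes, scores):
    """Recover whole-question counts per category and pool them."""
    total = sum(sizes.values())
    # Each published percentage is rounded, so round back to whole questions.
    correct = sum(round(sizes[name] * scores[name] / 100) for name in sizes)
    return correct, total, 100 * correct / total


correct, total, pct = overall_accuracy(CATEGORY_SIZES, CATEGORY_SCORES)
print(f"{correct}/{total} = {pct:.1f}%")
```

Under those assumed counts, the pooled result agrees with the 95.6% overall row, which suggests the overall score is a question-weighted average and not a simple mean of the six category percentages.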
Methodology

"We did not optimize for the benchmark. We optimized for production efficiency. The benchmark improvement was a side effect." Full report and per-category methodology at blog.sibylcap.com/longmemeval-v2.

The plugin matches the architecture

The published number above comes from the file-based architecture. The productized plugin, the same one you install from PyPI, was run against the same 500 questions and scored 95.1% on Sonnet 4.5 (447/470 when the 30 single-session-preference questions are excluded; 90.6% raw across all 500, with zero errored records). That lands within half a point of the architectural ceiling: tool-mediated access through sibyl_search / sibyl_recall / sibyl_list holds up against raw file reads, because the HOT tier preserves verbatim text and the WARM tier adds extracted cross-references.

Full plugin report: blog.sibylcap.com/plugin-longmemeval.
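The two plugin figures (95.1% ex-preference and 90.6% raw) can be reconciled with a little arithmetic. This is a sketch that uses only the numbers quoted above; the implied preference-category count is derived by subtraction and is not a figure reported in the text.

```python
TOTAL_QUESTIONS = 500          # full LongMemEval question set
EX_PREFERENCE_CORRECT = 447    # correct answers outside the preference category
EX_PREFERENCE_TOTAL = 470      # questions outside the preference category
RAW_ACCURACY_PCT = 90.6        # raw accuracy across all 500 questions

# Headline score: preference questions excluded.
ex_pref_pct = 100 * EX_PREFERENCE_CORRECT / EX_PREFERENCE_TOTAL

# Raw score converted back to a whole number of correct answers.
raw_correct = round(TOTAL_QUESTIONS * RAW_ACCURACY_PCT / 100)

# DERIVED, not reported in the text: what the two figures imply for the
# excluded preference questions.
preference_total = TOTAL_QUESTIONS - EX_PREFERENCE_TOTAL
preference_correct = raw_correct - EX_PREFERENCE_CORRECT

print(f"ex-preference: {ex_pref_pct:.1f}%")
print(f"raw: {raw_correct}/{TOTAL_QUESTIONS}")
print(f"implied preference: {preference_correct}/{preference_total}")
```

Read this way, the gap between the two figures comes entirely from the excluded preference questions; the linked plugin report is the place to check how that category was scored.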

Independent beta testing

These runs were done by independent closed-beta testers on their own machines against the real shipped packages. They are independent-tester results, not a Sibyl-published benchmark like LongMemEval, and not vendor-reported numbers from the other engines.

Four engines, one corpus

A tester ran the same deterministic corpus (500 companies, 1,500 people, 365 simulated days, 191k records; 350 questions) against four memory engines. Retrieval was isolated from the model, then Claude Sonnet 4.6 answered from the retrieved context.

| Engine | Retrieval /350 | Answered /350 | Cost (all 350) |
| --- | --- | --- | --- |
| Sibyl | 350 (100%) | 344 | $0.64 |
| Hindsight | 152 | 152 | $18.68 |
| Mem0 | 92 | 105 | $2.76 |
| Mnemosyne | 5 | 55 | $2.78 |

Sibyl retrieved 50/50 in every category, reading about two rows and ~228 tokens per query with zero write-time LLM calls. Report and a PII-scrubbed replication kit: blog.sibylcap.com/beta-analysis-v2.
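The two-stage setup described above, with retrieval scored separately from answering, can be sketched roughly as follows. Everything here is hypothetical: the `Question` type, the substring-match scoring rule, and the toy `toy_retrieve` / `toy_answer` stand-ins are illustrative assumptions, not the tester's actual harness or any engine's API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Question:
    text: str  # question posed to the memory engine
    gold: str  # reference answer string


def contains(haystack: str, needle: str) -> bool:
    # ASSUMED scoring rule: case-insensitive substring match.
    return needle.lower() in haystack.lower()


def evaluate(
    questions: Iterable[Question],
    retrieve: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> Tuple[int, int]:
    """Return (retrieval_hits, answer_hits), scored independently."""
    retrieval_hits = 0
    answer_hits = 0
    for q in questions:
        context = retrieve(q.text)        # stage 1: engine only, no model
        if contains(context, q.gold):
            retrieval_hits += 1
        reply = answer(q.text, context)   # stage 2: model sees only the context
        if contains(reply, q.gold):
            answer_hits += 1
    return retrieval_hits, answer_hits


# Toy stand-ins so the sketch runs without any engine or model.
STORE = {
    "Who runs Acme?": "Acme's CEO is Dana Lee.",
    "Where is Bolt based?": "Bolt sells tools.",
}


def toy_retrieve(query: str) -> str:
    return STORE.get(query, "")


def toy_answer(query: str, context: str) -> str:
    return context if context else "unknown"


QUESTIONS = [
    Question("Who runs Acme?", "Dana Lee"),
    Question("Where is Bolt based?", "Austin"),
]

retrieval_hits, answer_hits = evaluate(QUESTIONS, toy_retrieve, toy_answer)
print(f"retrieval {retrieval_hits}/{len(QUESTIONS)}, answered {answer_hits}/{len(QUESTIONS)}")
```

Keeping the two counters separate is what lets a results table report retrieval and answered columns that differ: a model can miss an answer that was retrieved, or recover one from partial context.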

Head to head: Sibyl vs Honcho

A separate one-on-one comparison on a 42k-record corpus (250 questions, Sonnet 4.6), run on a verified-fair, matched configuration:

| Metric | Sibyl | Honcho |
| --- | --- | --- |
| Retrieval contained answer | 97.2% | 87.6% |
| Answer correct | 97.2% | 85.6% |
| Avg context / query | 291 tok | 1,313 tok |

Report: blog.sibylcap.com/beta-analysis.
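To put the head-to-head percentages in absolute terms, the short sketch below converts them back to question counts out of 250 and compares the average context sizes. It uses only the figures from the comparison above; the variable names are illustrative.

```python
QUESTIONS = 250

# Percentages and token counts copied from the head-to-head comparison.
ANSWER_CORRECT_PCT = {"Sibyl": 97.2, "Honcho": 85.6}
AVG_CONTEXT_TOKENS = {"Sibyl": 291, "Honcho": 1313}

# Convert each percentage back to a whole number of questions.
correct_counts = {
    engine: round(QUESTIONS * pct / 100)
    for engine, pct in ANSWER_CORRECT_PCT.items()
}

# How much more context the larger-context engine sends per query.
context_ratio = AVG_CONTEXT_TOKENS["Honcho"] / AVG_CONTEXT_TOKENS["Sibyl"]

for engine, count in correct_counts.items():
    print(f"{engine}: {count}/{QUESTIONS} correct")
print(f"context ratio: {context_ratio:.1f}x")
```

The context ratio is a rough proxy for per-query input cost, since answer-model pricing typically scales with the number of tokens sent.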

Reproduce it

The methodology and scorers are open. Start with the LongMemEval report, then install the plugin and run your own.