Benchmarks
As we build out more of our Kernel, we will benchmark each component and update this page.
Our goal is to help builders understand its capabilities and limitations as it evolves.
LongMemEval
LongMemEval is a benchmark for evaluating an agent's ability to remember and evolve through conversation. It measures how well a model maintains consistency and personalization over time.
You can find the arXiv paper here.
What Good Performance Means for You
This is the difference between building a generic chatbot and a true AI assistant.
A high score means your agent can build lasting relationships with users, delivering experiences that feel intelligent and personal. It also means the agent retains temporal information as well as the facts a user would expect it to know.
Praxos shows significant improvement in every category except knowledge updates compared to the full-context gpt-4o baseline:
| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| single-session-preference | 80.00% | 20.00% | +300% |
Agents learn user preferences within a single interaction. If a user says mid-conversation that they prefer concise summaries, your agent remembers and adapts its behavior instantly and consistently.
| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| temporal-reasoning | 62.40% | 45.10% | +38.36% |
Agents become aware of time. Your agent can understand "what did we discuss last Tuesday?" or "remind me two weeks from now," so you can build more sophisticated schedule-based actions, or simply improve the conversation experience.
| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| multi-session | 59.30% | 44.30% | +33.86% |
Agents preserve and share context across interactions that are days or weeks apart, reducing the need to ask users the same questions over and over.
| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| single-session-user | 88.60% | 81.40% | +8.83% |
High-fidelity recall. This applies to data extracted from documents and held in memory, as well as specific facts provided by a user during a conversation.
| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| single-session-assistant | 96.40% | 94.60% | +1.90% |
Agents maintain conversational consistency. The agent remembers its own responses, preventing self-contradiction.
| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| Weighted Aggregate | 71.80% | 60.60% | +17.49% |
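The Delta (%) column is the relative improvement of Praxos over the full-context baseline. A minimal sketch of that computation (the function name is ours; rounding and per-category weighting may differ slightly from the published figures, particularly for the aggregate row):

```python
def delta_pct(praxos: float, baseline: float) -> float:
    """Relative improvement of Praxos over the full-context baseline, in percent."""
    return (praxos - baseline) / baseline * 100

# single-session-preference: (80.00 - 20.00) / 20.00 * 100
print(f"{delta_pct(80.0, 20.0):+.2f}%")   # +300.00%
# temporal-reasoning: (62.40 - 45.10) / 45.10 * 100
print(f"{delta_pct(62.4, 45.1):+.2f}%")   # +38.36%
```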
Across the board, Praxos delivers a superior, more reliable memory capability than using a large context window with a leading model.
Our advanced memory architecture achieves these benchmark-beating results while reducing context window usage by more than 90%.
What This Means for You as a Builder
Massive Cost Savings: Drastically lower your token costs for every interaction. Your operational expenses will plummet, making it feasible to deploy sophisticated agents at scale.
Lower Latency: Smaller context windows mean faster processing. You can build agents that respond more quickly, leading to a much better user experience.
Greater Complexity: By offloading memory management from the context window, you free up valuable space for more complex instructions, tools, and real-time data, allowing you to build more powerful and capable agents.
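To make the cost impact of a 90% context reduction concrete, here is a back-of-the-envelope sketch. The per-token price and baseline prompt size below are hypothetical placeholders for illustration, not measured values or a real price quote:

```python
# All numbers are hypothetical, for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.0025  # assumed USD price per 1K input tokens
BASELINE_CONTEXT_TOKENS = 100_000   # assumed full-context prompt size
REDUCTION = 0.90                    # context usage reduced by 90%

baseline_cost = BASELINE_CONTEXT_TOKENS / 1000 * PRICE_PER_1K_INPUT_TOKENS
reduced_cost = baseline_cost * (1 - REDUCTION)
print(f"full context: ${baseline_cost:.2f}/request")   # full context: $0.25/request
print(f"with Praxos:  ${reduced_cost:.3f}/request")    # with Praxos:  $0.025/request
```

The same arithmetic applies to latency: fewer prompt tokens means less time spent on prefill for every request.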