Benchmarks

As we build out more of our Kernel, we benchmark the performance of each component and update this page with the results.

Our goal is to help builders understand its capabilities and limitations as it evolves.

LongMemEval

LongMemEval is a benchmark for evaluating an agent's ability to remember and evolve through conversation. It measures how well a model maintains consistency and personalization over time.

The benchmark methodology is described in full in the LongMemEval arXiv paper.
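To make the setup concrete, here is a minimal sketch of what a LongMemEval-style evaluation loop might look like. The field names (`question_type`, `sessions`, `question`, `expected_answer`) and the exact-match scoring are illustrative assumptions, not the dataset's actual schema or grading method:

```python
from dataclasses import dataclass

@dataclass
class MemoryEvalItem:
    """One hypothetical evaluation item: prior chat history plus a probe question."""
    question_type: str   # e.g. "temporal-reasoning" or "single-session-preference"
    sessions: list       # chat histories the agent saw in earlier interactions
    question: str        # probe asked after those sessions
    expected_answer: str # reference answer

def accuracy(items, answer_fn):
    """Fraction of items where the agent's answer matches the reference exactly."""
    correct = sum(
        answer_fn(item.sessions, item.question) == item.expected_answer
        for item in items
    )
    return correct / len(items)
```

An agent under test would supply `answer_fn`, which receives the stored sessions and the probe question; per-category accuracies like the ones reported below are then just `accuracy` computed over the items of each `question_type`.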

What Good Performance Means for You

This is the difference between building a generic chatbot and a true AI assistant.

A high score means your agent can build lasting relationships with users, delivering experiences that feel intelligent and personal. It also means your agent retains temporal information, not just the facts a user would expect it to know.

Compared to the gpt-4o full-context baseline, Praxos shows significant improvement in every category except knowledge updates:

| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| single-session-preference | 80.00% | 20.00% | +300% |

Agents that learn user preferences within a single interaction. If a user says they prefer concise summaries during a conversation, your agent remembers and adapts its behavior instantly and consistently.

| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| temporal-reasoning | 62.40% | 45.10% | +38.36% |

Agents that are aware of time. Your agent can understand "what did we discuss last Tuesday?" or "remind me two weeks from now," so you can do more sophisticated schedule-based actions, or even just improve the conversation experience.

| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| multi-session | 59.30% | 44.30% | +33.86% |

Agents that preserve and share context across interactions days or weeks apart. This reduces the need to ask the same questions over and over.

| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| single-session-user | 88.60% | 81.40% | +8.83% |

High-fidelity recall. This applies to data extracted from documents and held in memory, as well as specific facts provided by a user during a conversation.

| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| single-session-assistant | 96.40% | 94.60% | +1.90% |

Agents that maintain conversational consistency. The agent remembers its own responses, preventing self-contradiction.

| Question Type | Praxos | Full-context (gpt-4o) | Delta (%) |
| --- | --- | --- | --- |
| Weighted Aggregate | 71.80% | 60.60% | +17.49% |

Across the board, Praxos delivers a superior, more reliable memory capability than using a large context window with a leading model.
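The Delta column above is the relative improvement over the baseline: (Praxos − baseline) / baseline × 100. A quick sketch, using two of the reported scores (note that small differences from the published figures can arise from rounding of the underlying per-category scores):

```python
def relative_delta(praxos: float, baseline: float) -> float:
    """Relative improvement of Praxos over the full-context baseline, in percent."""
    return round((praxos - baseline) / baseline * 100, 2)

# Accuracy scores (%) copied from the tables above
print(relative_delta(80.00, 20.00))  # single-session-preference -> 300.0
print(relative_delta(62.40, 45.10))  # temporal-reasoning -> 38.36
```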

Our advanced memory architecture achieves these benchmark-beating results while reducing context window usage by more than 90%.

What This Means for You as a Builder

  • Massive Cost Savings: Drastically lower your token costs for every interaction. Your operational expenses will plummet, making it feasible to deploy sophisticated agents at scale.

  • Lower Latency: Smaller context windows mean faster processing. You can build agents that respond more quickly, leading to a much better user experience.

  • Greater Complexity: By offloading memory management from the context window, you free up valuable space for more complex instructions, tools, and real-time data, allowing you to build more powerful and capable agents.