LMEB

Long-horizon Memory Embedding Benchmark

22 Datasets · 193 Retrieval Tasks · 4 Memory Types · 15 Models Evaluated

Introduction

A comprehensive evaluation benchmark for long-horizon memory retrieval

Memory embeddings are crucial for memory-augmented systems such as OpenClaw, yet their evaluation remains underexplored: current text embedding benchmarks focus narrowly on traditional passage retrieval and do not assess models' ability to handle long-horizon memory retrieval, where the relevant information is fragmented, context-dependent, and temporally distant.

To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks.

LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in abstraction level and temporal dependency, capturing distinct aspects of memory retrieval that reflect diverse real-world challenges.

We evaluate 15 widely used embedding models, ranging from a few hundred million to over ten billion parameters. The results reveal that (1) LMEB offers a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB are largely orthogonal. This suggests that the field has yet to converge on a universal model that excels across all memory retrieval tasks, and that strong performance in traditional passage retrieval may not generalize to long-horizon memory retrieval.

In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval.

Memory Taxonomy

LMEB covers four categories of memory, characterized by abstraction level and temporal dependency

Episodic Memory

Retrieval of past events linked to temporal cues, entities, content, and spatial context. Critical for adaptability, decision-making, and temporal reasoning.

2 Datasets · 69 Tasks · 5.8K Queries · 29.9K Corpus · 16.3K Qrels · 2.81 Avg. D/Q

Dialogue Memory

Maintaining context across multi-turn interactions, recalling previous dialogue turns and user preferences for coherent, personalized conversations.

6 Datasets · 42 Tasks · 21.2K Queries · 1.69M Corpus · 125.0K Qrels · 5.91 Avg. D/Q

Semantic Memory

Recalling general knowledge and concepts independent of time or specific context. Stable, generalizable, and foundational for memory-augmented reasoning.

8 Datasets · 15 Tasks · 7.5K Queries · 200.4K Corpus · 12.0K Qrels · 1.61 Avg. D/Q

Procedural Memory

Retrieval of learned skills and action sequences, essential for problem-solving and multi-step reasoning in tool-augmented and RL systems.

6 Datasets · 67 Tasks · 124.6K Queries · 157.4K Corpus · 127.0K Qrels · 1.02 Avg. D/Q

Two-Dimensional Memory Characterization

Abstraction Level vs. Temporal Dependency

[Figure: the four memory types arranged along two axes, abstraction level (low to high) and temporal dependency (weak to strong). Episodic Memory: specific events, time-stamped. Dialogue Memory: interaction history, order-sensitive. Semantic Memory: general knowledge, context-insensitive. Procedural Memory: action know-how, execution-oriented.]

Statistics of datasets in the LMEB benchmark

LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types


Memory Type | Dataset | Granularity | #Tasks | Query Src | Corpus Src | #Query | #Corpus | #Qrels | Avg. D/Q | Avg. Query Len | Avg. Doc Len
Episodic EPBench Event 54 AI AI 3,644 2,838 11,458 3.14 24.1 410.7
KnowMeBench Event 15 AI Human 2,162 27,062 4,840 2.24 31.8 58.7
Total Event 69 AI Hybrid 5,806 29,900 16,298 2.81 - -
Dialogue LoCoMo Turn 5 AI AI & Human 1,976 5,882 2,801 1.42 10.4 38.7
LongMemEval Session 6 AI & Human AI 500 237,655 948 1.90 15.9 243.0
REALTALK Turn 3 Human Human 679 8,944 1,553 2.29 9.0 34.2
TMD Turn 12 AI AI 2,134 7,463 76,287 35.7 15.7 45.9
MemBench Round 10 AI AI 10,000 929,115 29,592 2.96 10.4 42.9
ConvoMem Turn 6 AI AI 5,867 500,221 13,779 2.35 23.2 27.3
Total Multi 42 Hybrid Hybrid 21,156 1,689,280 124,960 5.91 - -
Semantic QASPER Paragraph 1 Human Human 1,335 65,300 2,319 1.74 7.9 77.5
NovelQA Paragraph 7 Human Human 1,541 79,286 2,506 1.63 19.8 139.5
PeerQA Sentence 1 Human Human 136 18,593 389 2.86 15.7 24.7
Covid-QA Paragraph 1 Human Human 1,111 3,351 1,111 1.0 9.5 110.9
ESG-Reports Paragraph 1 Human Human 36 2,407 75 2.08 9.4 129.4
MLDR Paragraph 1 AI Human 100 1,536 100 1.0 11.3 112.9
LooGLE Paragraph 2 AI & Human Human 3,052 28,190 5,176 1.70 13.7 164.8
SciFact Sentence 1 Human Human 188 1,748 366 1.95 12.9 39.9
Total Multi 15 Hybrid Human 7,499 200,411 12,042 1.61 - -
Procedural Gorilla Tool 3 AI Human 598 1,005 598 1.0 22.4 146.8
ToolBench Tool 1 AI Human 1,100 13,862 2,629 2.39 46.1 87.9
ReMe Experience 9 AI AI 1,217 914 1,217 1.0 13.5 47.4
Proced_mem_bench Trajectory 3 Human AI 40 336 529 13.2 8.2 362.9
MemGovern Experience 48 Human AI 121,475 121,475 121,475 1.0 18.6 104.0
DeepPlanning Item 3 AI Human 120 19,839 515 4.29 161.6 127.4
Total Multi 67 Hybrid Hybrid 124,550 157,431 126,963 1.02 - -

🏆 Leaderboard

Comprehensive evaluation of embedding models on LMEB. We report NDCG@10 and Recall@10 under two settings: w/o instruction and w/ instruction.


# Model Size Dim Episodic Dialogue Semantic Procedural Mean (Dataset) Mean (Type)
(each of the last six columns reports N@10 followed by R@10)
Embedding Models > 1B Parameters
1 KaLM-Embedding-Gemma3 12B 12B 3840 67.01 74.56 50.89 61.86 47.81 65.82 60.70 71.08 53.91 66.97 56.60 68.33
2 bge-multilingual-gemma2 9B 9B 3584 54.77 62.46 40.68 51.62 40.67 59.84 52.20 62.86 45.10 58.66 47.08 59.19
3 Qwen3-Embedding-8B 8B 8B 4096 61.03 68.73 48.99 59.61 55.47 72.00 57.00 68.58 54.63 67.39 55.62 67.23
4 NV-Embed-v2 7B 7B 4096 70.44 77.40 56.47 67.35 59.12 73.70 60.40 70.70 59.78 71.49 61.61 72.29
5 e5-mistral-7b-instruct 7B 7B 4096 57.21 66.64 48.50 59.97 45.99 63.67 53.03 65.23 49.61 63.36 51.18 63.88
6 Qwen3-Embedding-4B 4B 4B 2560 58.26 67.44 41.20 51.07 53.74 70.37 56.32 67.78 51.44 64.13 52.38 64.16
Embedding Models < 1B Parameters
7 jina-v5-text-small 596M 596M 1024 61.03 69.06 51.06 61.43 51.80 67.24 56.79 66.50 53.80 65.62 55.17 66.05
8 Qwen3-Embedding-0.6B 596M 596M 1024 57.66 65.73 50.41 61.13 52.01 67.84 57.14 67.12 53.49 65.62 54.30 65.46
9 multilingual-e5-large-instruct 560M 560M 1024 57.49 66.68 48.13 60.10 48.36 65.38 51.02 62.73 49.85 63.34 51.25 63.72
10 bge-m3 (Dense) 560M 560M 1024 67.00 73.60 55.61 66.10 56.29 71.26 55.37 64.68 56.83 68.27 58.57 68.91
11 KaLM-Embedding-V2.5 494M 494M 896 61.86 69.63 48.95 59.94 52.00 68.15 52.97 64.23 52.33 64.98 53.94 65.49
12 KaLM-Embedding-V1 494M 494M 896 56.60 66.33 41.32 51.64 47.49 64.16 54.85 65.39 48.64 61.28 50.07 61.88
13 bge-large-en-v1.5 335M 335M 1024 55.85 65.49 53.12 63.85 52.55 67.94 52.62 62.89 53.02 65.22 53.54 65.04
14 EmbeddingGemma-300M 307M 307M 768 68.19 75.29 53.94 65.00 53.58 69.33 57.32 67.42 56.03 68.17 58.26 69.26
15 jina-v5-text-nano 239M 239M 768 56.73 64.58 43.22 53.18 49.49 65.01 50.48 59.64 48.71 60.28 49.98 60.60

Key Findings

Insights from comprehensive experiments on LMEB across embedding models ranging from several hundred million to over ten billion parameters

01 LMEB Benchmark Offers a Reasonable Level of Difficulty

The best-performing model, bge-multilingual-gemma2, achieves a Mean (Dataset) NDCG@10 of 61.41 under the w/ instruction setting, suggesting that the tasks are sufficiently challenging without being excessively difficult.

02 Larger Embedding Models Do Not Always Perform Better

The experimental results show that larger embedding models, such as KaLM-Embedding-Gemma3, bge-multilingual-gemma2, and Qwen3-Embedding-8B, do not always outperform smaller models.

03 The Impact of Task Instructions on Model Performance Varies

The effect of task instructions on performance varies across models: for some, instructions bring a clear improvement; for others, they make little to no difference; and in a few cases, they even hurt performance. A sketch of the two settings is shown below.
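
To make the two settings concrete, here is a minimal sketch of how an instruction-tuned embedder is typically queried with and without a task instruction. The prompt template follows the convention documented for multilingual-e5-large-instruct; the example query and task description are illustrative assumptions, not taken from LMEB.

# Illustrative only: the query and task description below are made up, and the
# "Instruct: ... Query: ..." template is the one documented for this model family.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

query = "Which restaurant did the user say they visited last month?"
task = "Given a question about a past conversation, retrieve the relevant dialogue turns"

emb_wo_inst = model.encode(query)                               # w/o instruction
emb_w_inst = model.encode(f"Instruct: {task}\nQuery: {query}")  # w/ instruction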

04 LMEB and MTEB Exhibit Orthogonality in Evaluation Domains

The correlation analysis between LMEB and the MTEB (eng, v2) retrieval subset shows low Pearson and Spearman correlation coefficients of -0.115 and -0.130, respectively, indicating that the two benchmarks evaluate largely orthogonal domains.

05 MTEB Performance Does Not Generalize Well to LMEB-Episodic/Dialogue

Models that perform well on MTEB struggle to transfer to LMEB-Episodic and LMEB-Dialogue, especially LMEB-Dialogue: the Pearson and Spearman correlation coefficients between LMEB-Dialogue and the MTEB (eng, v2) retrieval subset are -0.496 and -0.364, respectively.

06 MTEB Shows Partial Generalization to LMEB-Semantic/Procedural

MTEB shows some, but not substantial, generalization to LMEB-Semantic and LMEB-Procedural. The Pearson and Spearman correlation coefficients between LMEB-Semantic and MTEB are 0.103 and 0.061, respectively, while for LMEB-Procedural, they are 0.291 and 0.429.
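
The correlation figures above come from comparing per-model scores across the two benchmarks. A minimal sketch of that computation is shown below; the score lists are placeholder values for illustration only, not the paper's data.

# Placeholder scores for illustration; substitute each model's actual MTEB retrieval
# mean and LMEB NDCG@10 to reproduce the reported correlation analysis.
from scipy.stats import pearsonr, spearmanr

mteb_retrieval = [55.2, 60.1, 58.4, 62.3, 57.0]   # one entry per model (made up)
lmeb_scores    = [50.9, 40.7, 49.0, 56.5, 48.5]   # same models on LMEB (made up)

print("Pearson:  %.3f" % pearsonr(mteb_retrieval, lmeb_scores)[0])
print("Spearman: %.3f" % spearmanr(mteb_retrieval, lmeb_scores)[0])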

Evaluation Protocol

Built on MTEB v2 framework with standardized IR-style data format

Zero-shot

Models evaluated without task-specific fine-tuning

NDCG@10

Main metric capturing both ranking quality and graded relevance

Extensible

New models via model wrappers; new datasets via corpus/queries/qrels/candidates format

Two Settings

w/o instruction and w/ instruction to measure instruction sensitivity
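
As a reference point, the sketch below shows how NDCG@10 can be computed for a single query with binary relevance judgments. It is a simplified stand-in for the metric as implemented in the MTEB v2 pipeline; the toy ranking and qrels are assumptions for illustration.

# Simplified NDCG@10 for one query; LMEB itself relies on the MTEB v2 implementation.
import numpy as np

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    # qrels maps doc_id -> graded relevance (binary in this toy example)
    gains = np.array([qrels.get(d, 0) for d in ranked_doc_ids[:k]], dtype=float)
    dcg = float((gains / np.log2(np.arange(2, len(gains) + 2))).sum())
    ideal = np.sort(np.array(list(qrels.values()), dtype=float))[::-1][:k]
    idcg = float((ideal / np.log2(np.arange(2, len(ideal) + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: two relevant documents, one retrieved within the top 10.
print(ndcg_at_k(["d3", "d7", "d1"], {"d1": 1, "d5": 1}))  # ~0.307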

# queries.jsonl
{"id": "query_id", "text": "query text"}

# corpus.jsonl
{"id": "corpus_id", "text": "corpus text", "title": "title"}

# qrels.tsv
query_id  corpus_id  1

# candidates.jsonl (optional)
{"scene_id": "scene_id", "candidate_doc_ids": ["id1", ...]}

Citation

If you find LMEB useful in your research, please consider citing our paper

@misc{zhao2026lmeb,
      title={LMEB: Long-horizon Memory Embedding Benchmark}, 
      author={Xinping Zhao and Xinshuo Hu and Jiaxin Xu and Danyu Tang and Xin Zhang and Mengjia Zhou and Yan Zhong and Yao Zhou and Zifei Shan and Meishan Zhang and Baotian Hu and Min Zhang},
      year={2026},
      eprint={2603.12572},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.12572}, 
}