LMEB

Long-horizon Memory Embedding Benchmark

22 Datasets · 193 Retrieval Tasks · 4 Memory Types · 15 Models Evaluated

Introduction

A comprehensive evaluation benchmark for long-horizon memory retrieval

Memory embeddings are crucial for memory-augmented systems such as OpenClaw, yet their evaluation remains underexplored: current text embedding benchmarks focus narrowly on traditional passage retrieval and do not assess models' ability to handle long-horizon memory retrieval, where the relevant information is fragmented, context-dependent, and temporally distant.

To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks.

LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in abstraction level and temporal dependency, capturing distinct aspects of memory retrieval that reflect diverse real-world challenges.

We evaluate 15 widely used embedding models, ranging from a few hundred million to over ten billion parameters. The results reveal that (1) LMEB offers a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB are largely orthogonal. This suggests that the field has yet to converge on a universal model that excels across all memory retrieval tasks, and that strong performance in traditional passage retrieval may not generalize to long-horizon memory retrieval.

In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval.

Memory Taxonomy

LMEB covers four categories of memory, characterized by abstraction level and temporal dependency

Episodic Memory

Retrieval of past events linked to temporal cues, entities, content, and spatial context. Critical for adaptability, decision-making, and temporal reasoning.

2 Datasets · 69 Tasks · 5.8K Queries · 29.9K Corpus · 16.3K Qrels · 2.81 Avg. D/Q

Dialogue Memory

Maintaining context across multi-turn interactions, recalling previous dialogue turns and user preferences for coherent, personalized conversations.

6 Datasets · 42 Tasks · 21.2K Queries · 1.69M Corpus · 125.0K Qrels · 5.91 Avg. D/Q

Semantic Memory

Recalling general knowledge and concepts independent of time or specific context. Stable, generalizable, and foundational for memory-augmented reasoning.

8 Datasets · 15 Tasks · 7.5K Queries · 200.4K Corpus · 12.0K Qrels · 1.61 Avg. D/Q

Procedural Memory

Retrieval of learned skills and action sequences, essential for problem-solving and multi-step reasoning in tool-augmented and RL systems.

6 Datasets · 67 Tasks · 124.6K Queries · 157.4K Corpus · 127.0K Qrels · 1.02 Avg. D/Q

Two-Dimensional Memory Characterization

Abstraction Level vs. Temporal Dependency

[Figure: the four memory types arranged along two axes, abstraction level (low to high) and temporal dependency (weak to strong). Episodic Memory: specific events, time-stamped. Dialogue Memory: interaction history, order-sensitive. Semantic Memory: general knowledge, context-insensitive. Procedural Memory: action know-how, execution-oriented.]

Statistics of datasets in the LMEB benchmark

LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types


Memory Type | Dataset | Granularity | #Tasks | Query Src | Corpus Src | #Query | #Corpus | #Qrels | Avg. D/Q | Avg. Query Len | Avg. Doc Len
Episodic EPBench Event 54 AI AI 3,644 2,838 11,458 3.14 24.1 410.7
KnowMeBench Event 15 AI Human 2,162 27,062 4,840 2.24 31.8 58.7
Total Event 69 AI Hybrid 5,806 29,900 16,298 2.81 - -
Dialogue LoCoMo Turn 5 AI AI & Human 1,976 5,882 2,801 1.42 10.4 38.7
LongMemEval Session 6 AI & Human AI 500 237,655 948 1.90 15.9 243.0
REALTALK Turn 3 Human Human 679 8,944 1,553 2.29 9.0 34.2
TMD Turn 12 AI AI 2,134 7,463 76,287 35.7 15.7 45.9
MemBench Round 10 AI AI 10,000 929,115 29,592 2.96 10.4 42.9
ConvoMem Turn 6 AI AI 5,867 500,221 13,779 2.35 23.2 27.3
Total Multi 42 Hybrid Hybrid 21,156 1,689,280 124,960 5.91 - -
Semantic QASPER Paragraph 1 Human Human 1,335 65,300 2,319 1.74 7.9 77.5
NovelQA Paragraph 7 Human Human 1,541 79,286 2,506 1.63 19.8 139.5
PeerQA Sentence 1 Human Human 136 18,593 389 2.86 15.7 24.7
Covid-QA Paragraph 1 Human Human 1,111 3,351 1,111 1.0 9.5 110.9
ESG-Reports Paragraph 1 Human Human 36 2,407 75 2.08 9.4 129.4
MLDR Paragraph 1 AI Human 100 1,536 100 1.0 11.3 112.9
LooGLE Paragraph 2 AI & Human Human 3,052 28,190 5,176 1.70 13.7 164.8
SciFact Sentence 1 Human Human 188 1,748 366 1.95 12.9 39.9
Total Multi 15 Hybrid Human 7,499 200,411 12,042 1.61 - -
Procedural Gorilla Tool 3 AI Human 598 1,005 598 1.0 22.4 146.8
ToolBench Tool 1 AI Human 1,100 13,862 2,629 2.39 46.1 87.9
ReMe Experience 9 AI AI 1,217 914 1,217 1.0 13.5 47.4
Proced_mem_bench Trajectory 3 Human AI 40 336 529 13.2 8.2 362.9
MemGovern Experience 48 Human AI 121,475 121,475 121,475 1.0 18.6 104.0
DeepPlanning Item 3 AI Human 120 19,839 515 4.29 161.6 127.4
Total Multi 67 Hybrid Hybrid 124,550 157,431 126,963 1.02 - -

🏆 Leaderboard

Comprehensive evaluation of embedding models on LMEB. We report NDCG@10 and Recall@10 under two settings: w/o instruction and w/ instruction.


# Model Size Dim Episodic Dialogue Semantic Procedural Mean (Dataset) Mean (Type)
(each of the last six columns reports N@10 followed by R@10)
Embedding Models > 1B Parameters
1 KaLM-Embedding-Gemma3 12B 12B 3840 67.01 74.56 50.89 61.86 47.81 65.82 60.70 71.08 53.91 66.97 56.60 68.33
2 bge-multilingual-gemma2 9B 9B 3584 54.77 62.46 40.68 51.62 40.67 59.84 52.20 62.86 45.10 58.66 47.08 59.19
3 Qwen3-Embedding-8B 8B 8B 4096 61.03 68.73 48.99 59.61 55.47 72.00 57.00 68.58 54.63 67.39 55.62 67.23
4 NV-Embed-v2 7B 7B 4096 70.44 77.40 56.47 67.35 59.12 73.70 60.40 70.70 59.78 71.49 61.61 72.29
5 e5-mistral-7b-instruct 7B 7B 4096 57.21 66.64 48.50 59.97 45.99 63.67 53.03 65.23 49.61 63.36 51.18 63.88
6 Qwen3-Embedding-4B 4B 4B 2560 58.26 67.44 41.20 51.07 53.74 70.37 56.32 67.78 51.44 64.13 52.38 64.16
Embedding Models < 1B Parameters
7 jina-v5-text-small 596M 596M 1024 61.03 69.06 51.06 61.43 51.80 67.24 56.79 66.50 53.80 65.62 55.17 66.05
8 Qwen3-Embedding-0.6B 596M 596M 1024 57.66 65.73 50.41 61.13 52.01 67.84 57.14 67.12 53.49 65.62 54.30 65.46
9 multilingual-e5-large-instruct 560M 560M 1024 57.49 66.68 48.13 60.10 48.36 65.38 51.02 62.73 49.85 63.34 51.25 63.72
10 bge-m3 (Dense) 560M 560M 1024 67.00 73.60 55.61 66.10 56.29 71.26 55.37 64.68 56.83 68.27 58.57 68.91
11 KaLM-Embedding-V2.5 494M 494M 896 61.86 69.63 48.95 59.94 52.00 68.15 52.97 64.23 52.33 64.98 53.94 65.49
12 KaLM-Embedding-V1 494M 494M 896 56.60 66.33 41.32 51.64 47.49 64.16 54.85 65.39 48.64 61.28 50.07 61.88
13 bge-large-en-v1.5 335M 335M 1024 55.85 65.49 53.12 63.85 52.55 67.94 52.62 62.89 53.02 65.22 53.54 65.04
14 EmbeddingGemma-300M 307M 307M 768 68.19 75.29 53.94 65.00 53.58 69.33 57.32 67.42 56.03 68.17 58.26 69.26
15 jina-v5-text-nano 239M 239M 768 56.73 64.58 43.22 53.18 49.49 65.01 50.48 59.64 48.71 60.28 49.98 60.60

Key Findings

Insights from comprehensive experiments on LMEB across embedding models ranging from several hundred million to over ten billion parameters

01 LMEB Benchmark Offers a Reasonable Level of Difficulty

The best-performing model, bge-multilingual-gemma2, achieves a Mean (Dataset) NDCG@10 of 61.41 under the w/ instruction setting, suggesting that the tasks are sufficiently challenging without being excessively difficult.

02 Larger Embedding Models Do Not Always Perform Better

The experimental results show that larger embedding models, such as KaLM-Embedding-Gemma3, bge-multilingual-gemma2, and Qwen3-Embedding-8B, do not always outperform smaller models.

03 The Impact of Task Instructions on Model Performance Varies

The effect of task instructions on performance varies across models: for some, instructions bring a clear improvement; for others, they make little to no difference; and in a few cases, they even hurt performance. A sketch of the two settings is shown below.
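
To make the two settings concrete, here is a minimal sketch of how an instruction-tuned embedder is typically queried with and without a task instruction. The prompt template follows the convention documented for multilingual-e5-large-instruct; the example query and task description are illustrative assumptions, not taken from LMEB.

# Illustrative only: the query and task description below are made up, and the
# "Instruct: ... Query: ..." template is the one documented for this model family.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

query = "Which restaurant did the user say they visited last month?"
task = "Given a question about a past conversation, retrieve the relevant dialogue turns"

emb_wo_inst = model.encode(query)                               # w/o instruction
emb_w_inst = model.encode(f"Instruct: {task}\nQuery: {query}")  # w/ instruction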

04 LMEB and MTEB Exhibit Orthogonality in Evaluation Domains

The correlation analysis between LMEB and the MTEB (eng, v2) retrieval subset shows low Pearson and Spearman correlation coefficients of -0.115 and -0.130, respectively, indicating that the two benchmarks evaluate largely orthogonal domains.

05 MTEB Performance Does Not Generalize Well to LMEB-Episodic/Dialogue

Models that perform well on MTEB struggle to transfer to LMEB-Episodic and LMEB-Dialogue, especially LMEB-Dialogue: the Pearson and Spearman correlation coefficients between LMEB-Dialogue and the MTEB (eng, v2) retrieval subset are -0.496 and -0.364, respectively.

06 MTEB Shows Partial Generalization to LMEB-Semantic/Procedural

MTEB shows some, but not substantial, generalization to LMEB-Semantic and LMEB-Procedural. The Pearson and Spearman correlation coefficients between LMEB-Semantic and MTEB are 0.103 and 0.061, respectively, while for LMEB-Procedural, they are 0.291 and 0.429.
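
The correlation figures above come from comparing per-model scores across the two benchmarks. A minimal sketch of that computation is shown below; the score lists are placeholder values for illustration only, not the paper's data.

# Placeholder scores for illustration; substitute each model's actual MTEB retrieval
# mean and LMEB NDCG@10 to reproduce the reported correlation analysis.
from scipy.stats import pearsonr, spearmanr

mteb_retrieval = [55.2, 60.1, 58.4, 62.3, 57.0]   # one entry per model (made up)
lmeb_scores    = [50.9, 40.7, 49.0, 56.5, 48.5]   # same models on LMEB (made up)

print("Pearson:  %.3f" % pearsonr(mteb_retrieval, lmeb_scores)[0])
print("Spearman: %.3f" % spearmanr(mteb_retrieval, lmeb_scores)[0])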

Evaluation Protocol

Built on MTEB v2 framework with standardized IR-style data format

Zero-shot

Models evaluated without task-specific fine-tuning

NDCG@10

Main metric capturing both ranking quality and graded relevance

Extensible

New models via model wrappers; new datasets via corpus/queries/qrels/candidates format

Two Settings

w/o instruction and w/ instruction to measure instruction sensitivity
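
As a reference point, the sketch below shows how NDCG@10 can be computed for a single query with binary relevance judgments. It is a simplified stand-in for the metric as implemented in the MTEB v2 pipeline; the toy ranking and qrels are assumptions for illustration.

# Simplified NDCG@10 for one query; LMEB itself relies on the MTEB v2 implementation.
import numpy as np

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    # qrels maps doc_id -> graded relevance (binary in this toy example)
    gains = np.array([qrels.get(d, 0) for d in ranked_doc_ids[:k]], dtype=float)
    dcg = float((gains / np.log2(np.arange(2, len(gains) + 2))).sum())
    ideal = np.sort(np.array(list(qrels.values()), dtype=float))[::-1][:k]
    idcg = float((ideal / np.log2(np.arange(2, len(ideal) + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: two relevant documents, one retrieved within the top 10.
print(ndcg_at_k(["d3", "d7", "d1"], {"d1": 1, "d5": 1}))  # ~0.307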

# queries.jsonl
{"id": "query_id", "text": "query text"}

# corpus.jsonl
{"id": "corpus_id", "text": "corpus text", "title": "title"}

# qrels.tsv
query_id  corpus_id  1

# candidates.jsonl (optional)
{"scene_id": "scene_id", "candidate_doc_ids": ["id1", ...]}

Citation

If you find LMEB useful in your research, please consider citing our paper

@misc{zhao2026lmeb,
      title={LMEB: Long-horizon Memory Embedding Benchmark}, 
      author={Xinping Zhao and Xinshuo Hu and Jiaxin Xu and Danyu Tang and Xin Zhang and Mengjia Zhou and Yan Zhong and Yao Zhou and Zifei Shan and Meishan Zhang and Baotian Hu and Min Zhang},
      year={2026},
      eprint={2603.12572},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.12572}, 
}