
CloneMem

Benchmarking Long-Term Memory for AI Clones



📖 Overview

CloneMem is a comprehensive benchmark for evaluating long-term memory capabilities of AI Clones. Unlike existing memory benchmarks that primarily rely on user–agent conversational histories, CloneMem tests whether an AI Clone can integrate non-conversational digital traces drawn from everyday life and use them to consistently track an individual's experiences, emotional changes, and evolving opinions over time.

Figure 1: Illustrative application scenarios of an AI Clone grounded in long-term digital traces, including delegated communication and proactive memory-driven assistance.


🎯 Key Features

  • Non-Conversational Digital Traces: Grounded in diaries, social media posts, direct messages, and emails spanning 1-3 years
  • Top-Down Data Construction: Hierarchical generation framework ensuring longitudinal coherence from persona to micro-level events
  • Multi-Dimensional Evaluation: Assesses tracking of experiences, emotions, and opinions over time
  • Diverse Task Types: 8 reasoning categories including factual recall, temporal reasoning, causal/counterfactual reasoning, and unanswerable detection
  • Bilingual Support: English and Chinese datasets

📊 Dataset Statistics

| Statistic | Value |
|---|---|
| # Personas | 10 |
| # Questions | 1,183 |
| Languages | English, Chinese |
| Context Length | 3 short (~100k tokens), 7 long (>500k tokens) |
| Question Types | 8 task categories |
| Time Span | 1-3 years per persona |

πŸ” Task Examples

Figure 3: Illustrative examples of CloneMem tasks. The left panel shows non-conversational digital traces and their associated ground-truth evidence; the right panel shows example questions and answers for three task types.

Evaluation Tasks

| Level | Task Type | Description |
|---|---|---|
| Factual Recall | Single-Point Factual | Retrieve explicit factual information at a given time point |
| Temporal Reasoning | Comparative | Contrast experiences/emotions/opinions between two time points |
| Temporal Reasoning | Trajectory Analysis | Characterize how aspects evolve over extended periods |
| Temporal Reasoning | Pattern Identification | Recognize recurring behaviors across different life events |
| Higher-Level Reasoning | Causal Reasoning | Trace chains of events explaining why changes occur |
| Higher-Level Reasoning | Counterfactual Reasoning | Consider how alternative decisions could lead to different outcomes |
| Higher-Level Reasoning | Inferential Reasoning | Form higher-level judgments from scattered information |
| Higher-Level Reasoning | Unanswerable Questions | Recognize when evidence is insufficient to answer |
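To make the task taxonomy concrete, a benchmark question could be represented as a record like the one below. This is a hypothetical sketch: the field names (`persona_id`, `task_type`, `evidence_ids`, etc.) are illustrative assumptions, and the authoritative schema is documented under data/releases/.

```python
# Hypothetical example of a CloneMem-style question record.
# Field names are illustrative assumptions, NOT the released schema
# (see data/releases/README.md for the authoritative format).
example_question = {
    "persona_id": "persona_03",
    "task_type": "comparative",            # one of the 8 task categories
    "question": "How did her attitude toward remote work change "
                "between March 2022 and March 2023?",
    "evidence_ids": ["diary_2022_03_14", "email_2023_03_02"],  # gold traces
    "answerable": True,                    # False for unanswerable items
}

# The eight task categories from the table above, grouped by level.
TASK_CATEGORIES = {
    "factual_recall": ["single_point_factual"],
    "temporal_reasoning": ["comparative", "trajectory_analysis",
                           "pattern_identification"],
    "higher_level_reasoning": ["causal", "counterfactual",
                               "inferential", "unanswerable"],
}

# Sanity check: 1 + 3 + 4 = 8 task categories in total.
assert sum(len(v) for v in TASK_CATEGORIES.values()) == 8
```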

🚀 Quick Start

Installation

git clone https://github.com/AvatarMemory/CloneMemBench.git
cd CloneMemBench
pip install -e .

Dataset

The dataset is included in this repository under data/releases/. After cloning, you can directly access the benchmark data. See the Data Format documentation for detailed schema information.
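Since the benchmark data ships as files in the repository, a minimal loader might look like the sketch below. It assumes the release files are JSON Lines (one record per line); the exact layout and format are specified in the Data Format documentation, so treat the paths and format here as assumptions. The demo writes a throwaway file so the snippet is self-contained.

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read one JSON record per non-empty line (JSON Lines)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo on a throwaway file standing in for a release file such as
# data/releases/.../questions.jsonl (the JSONL layout is an assumption;
# see data/releases/README.md for the real schema).
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "questions.jsonl"
    p.write_text('{"question": "Q1"}\n{"question": "Q2"}\n', encoding="utf-8")
    records = load_jsonl(p)

print(len(records))  # 2
```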


πŸ“ Repository Structure

CloneMem/
├── README.md                    # This file
├── configs/                     # Configuration files
├── data/
│   ├── big_five/               # Big Five personality data
│   ├── releases/               # 📦 Released benchmark dataset
│   │   └── README.md           # Data format documentation
│   └── runs/                   # Pipeline run outputs
├── docs/
│   └── README.md               # General documentation
├── outputs/                     # Generated outputs
├── src/
│   ├── clonemem/               # Data generation pipeline
│   │   ├── build/
│   │   │   ├── config/         # Build configurations
│   │   │   ├── core/           # Core data structures
│   │   │   ├── generators/     # LLM-based generators
│   │   │   ├── postprocess/    # Post-processing utilities
│   │   │   ├── prompting/      # Prompt templates
│   │   │   ├── runners/        # Pipeline runners
│   │   │   └── workflows/      # Workflow orchestration
│   │   ├── common/             # Shared utilities
│   │   ├── cli.py              # Command-line interface
│   │   └── README.md           # Data generation guide
│   └── clonemem-eval/          # Evaluation framework
│       ├── eval/
│       │   ├── analysis/       # Metric computation scripts
│       │   ├── eval_amem.py    # A-Mem evaluation
│       │   ├── eval_flat.py    # Flat retriever evaluation
│       │   ├── eval_mem0.py    # Mem0 evaluation
│       │   ├── eval_oracle.py  # Oracle baseline
│       │   ├── run_eval.sh     # Evaluation runner script
│       │   └── run_generation.py
│       └── README.md           # Evaluation guide
├── .env                         # Environment variables
├── .gitignore
└── LICENSE                      # Apache 2.0 License

📚 Documentation

| Document | Description |
|---|---|
| Data Format | Detailed documentation of the data schema, fields, and structure |
| Data Generation | Guide to reproducing the data generation pipeline |
| Evaluation | Instructions for running evaluations and baselines |

📈 Main Results

Our experiments reveal that current memory systems face significant challenges in AI Clone scenarios:

  • Simple flat retrieval often outperforms complex abstractive memory systems (A-Mem, Mem0)
  • Abstraction helps search but hurts cloning: Summarization and fact extraction act as lossy compression
  • Models fall back to narrative priors when evidence is underspecified
  • Event logs cannot represent "no decision yet": activity ≠ state

| Method | Recall@10 | QA Consistency | Choice Accuracy |
|---|---|---|---|
| Oracle | - | 0.83 | 89.65 |
| Flat Retriever | 0.22 | 0.72 | 88.50 |
| A-Mem | 0.21 | 0.70 | 87.48 |
| Mem0 | 0.13 | 0.65 | 85.28 |

Results with a GPT-4o-mini backbone at k=10.
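For reference, Recall@k in retrieval evaluations of this kind is commonly defined as the fraction of gold evidence items that appear among the top-k retrieved items. The sketch below uses that common definition as an assumption; the metric scripts under src/clonemem-eval/eval/analysis/ are authoritative for how the paper's numbers are computed.

```python
def recall_at_k(retrieved, gold, k=10):
    """Fraction of gold evidence items found among the top-k retrieved items.

    A common Recall@k definition, assumed here for illustration; the
    repository's analysis scripts define the official metric.
    """
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(gold)) / len(gold)

# Toy example: 2 of the 3 gold traces are retrieved within the top 10.
score = recall_at_k(["t1", "t2", "t9", "t4"], ["t1", "t9", "t7"], k=10)
print(round(score, 3))  # 0.667
```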


🔗 Citation

If you find CloneMem useful for your research, please cite our paper:

@misc{hu2026clonemembenchmarkinglongtermmemory,
      title={CloneMem: Benchmarking Long-Term Memory for AI Clones}, 
      author={Sen Hu and Zhiyu Zhang and Yuxiang Wei and Xueran Han and Zhenheng Tang and Huacan Wang and Ronghao Chen},
      year={2026},
      eprint={2601.07023},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.07023}, 
}

📄 License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
