Memory system design for long-running agent systems
For a long-running agent system, a well-designed memory architecture is essential to maintain context, preserve experience, and continuously improve behaviour across interactions.
This article presents a systematic way to think about agent memory, clarifies its relationship with context, and provides a practical framework for storing, retrieving, and composing memory into each model invocation.
This article also draws on lessons learned from OpenClaw (previously called Clawdbot and Moltbot).
To make the ideas described in this article practical and easy to adopt, we have developed MemoryAgent, an open-source framework that implements this tiered memory architecture as a drop-in memory layer for agent systems.
Github link: https://github.com/jia-wei-zheng/MemoryAgent
Before discussing specific memory mechanisms, it is crucial to distinguish two closely related but fundamentally different concepts: context and memory.
Context refers to everything the model can directly “see” for a single request.
Context has the following features:
- Short-lived: Exists only for this single request
- Bounded: Limited by the model’s context window (e.g., 200k tokens)
- Costly: Larger context means higher latency and API cost
- Transient: Discarded after the request completes.
Memory is persistent information, typically stored on disk, either in files or in a database.
Memory has the following features:
- Persistent: Can last days, months, or years
- Unbounded: No inherent context window limits
- Cheap to store: Disk and database are orders of magnitude cheaper than context tokens.
- Searchable: Can be indexed, embedded for semantic retrieval
Memory exists to selectively rebuild context when needed.
Memory types
Based on purpose, lifespan, and usage patterns, agent memory can be categorized into four primary types:
Working memory
Working memory represents the agent’s current workspace, closely aligned with the context.
Characteristics:
- Very short-lived
- Frequently updated
- Automatically cleared or overwritten
- Directly influences reasoning and tool use
Examples:
- current conversation turns
- Intermediate reasoning states
- Tool call outputs
- Temporary variables (e.g., current task = booking flight)
In practice, working memory is usually implemented as context assembly logic, not long-term storage.
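Since working memory is context assembly rather than long-term storage, it can be sketched as a small in-process structure. The class and field names below are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """Hypothetical working-memory container: context assembly, not storage."""
    system_prompt: str
    turns: list = field(default_factory=list)  # (role, text) pairs
    max_turns: int = 20                        # sliding window size

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        # Older turns silently fall out of the window -- they are gone
        # unless consolidated into episodic/semantic memory elsewhere.
        self.turns = self.turns[-self.max_turns:]

    def assemble_context(self) -> str:
        """Rebuild the prompt for a single request from current state."""
        lines = [self.system_prompt]
        lines += [f"{role}: {text}" for role, text in self.turns]
        return "\n".join(lines)

wm = WorkingMemory(system_prompt="You are a helpful assistant.")
wm.add_turn("user", "Book me a flight to Paris.")
wm.add_turn("assistant", "Searching flights...")
print(wm.assemble_context())
```

Note that nothing here persists: when the process ends, this state is discarded, which is exactly why consolidation into episodic and semantic memory matters.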
Episodic memory
Episodic memory stores time-stamped experiences: what happened, when it happened, and under what circumstances.
Characteristics:
- Strongly tied to time
- Append-only or slowly growing
- Gradually loses relevance over time
Examples:
- “2026-03-12 10:30: User asked to book a flight using browser automation”
- “Yesterday: user complained about poor sleep quality”
Episodic memory enables the agent to recall past interactions, detect patterns over time, and provide continuity (“last time you mentioned…”).
Semantic memory
Semantic memory stores abstracted knowledge extracted from experience: facts, rules, preferences, and generalized conclusions.
Characteristics:
- Time-independent (or weakly time-dependent)
- Highly reusable
- Actively influences decision-making
- Low volume but high value
Examples:
- “User prefers aisle seats when flying”
- “If user sleeps < 6h for multiple days, fatigue increases”
- “Booking flights usually requires passport verification”
Semantic memory often emerges through consolidation from episodic and working memory.
Perceptual memory
Perceptual memory stores raw or minimally processed sensory and environmental data.
Characteristics:
- High volume
- Often noisy
- Rarely retrieved directly
- Used to derive high-level features or summaries
Examples:
- Wearable sensor data (heart rate, steps, sleep cycles)
- Audio, images, screenshots
- Environment logs
Perceptual memory typically feeds into episodic or semantic memory through preprocessing and aggregation.
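As a sketch of that preprocessing step, the toy aggregator below (hypothetical sample format and field names) reduces raw wearable signals to a daily summary suitable for an episodic entry:

```python
from statistics import mean

# Hypothetical raw wearable samples: (hour, heart_rate, asleep?)
raw_samples = [
    (0, 52, True), (1, 50, True), (2, 51, True), (3, 54, True),
    (4, 58, False), (9, 72, False), (14, 88, False), (22, 60, False),
]

def aggregate_day(samples):
    """Reduce noisy raw signals to a daily summary that can be written
    into episodic memory; the raw samples stay in perceptual storage."""
    sleep_hours = sum(1 for _, _, asleep in samples if asleep)
    resting_hr = mean(hr for _, hr, asleep in samples if asleep)
    return {"sleep_hours": sleep_hours, "resting_hr": round(resting_hr, 1)}

summary = aggregate_day(raw_samples)
# -> can become the episodic entry "2026-01-26: slept 4h, resting HR 51.8"
print(summary)
```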
Example Case: Personal assistant agent
| Memory Type | Example Usage |
|---|---|
| Working | Current conversation, tool results, active task |
| Episodic | Daily interaction logs, notable events |
| Semantic | User preferences, learned habits, rules |
| Perceptual | Health sensor data, raw activity logs |
Example Case: OpenClaw
OpenClaw (previously called Clawdbot and Moltbot) is a recently popular agent project that became one of the most popular AI products on GitHub. In Clawdbot, working memory is equivalent to context, which contains:
[0] System Prompt (static + conditional instructions)
[1] Project Context (bootstrap files: AGENTS.md, SOUL.md, etc.)
[2] Conversation History (messages, tool calls, compaction summaries)
[3] Current Message
The project context consists of user-editable Markdown files that are injected into working memory. These include:
| File | Purpose |
|---|---|
| AGENTS.md | Agent instructions, including memory guidelines |
| SOUL.md | Personality and tone |
| USER.md | Information about the user |
| TOOLS.md | Usage guidance for external tools |
Clawdbot’s episodic and semantic memory is built on plain Markdown files in the agent workspace, organized as a two-layer memory system.
Memory lives in the agent’s workspace (default: ~/clawd/):
~/clawd/
├── MEMORY.md - Layer 2: Long-term curated knowledge
└── memory/
├── 2026-01-26.md - Layer 1: Today's notes
├── 2026-01-25.md - Yesterday's notes
├── 2026-01-24.md - ...and so on
└── ...
Layer 1 is episodic memory, recording daily logs. These are append-only daily notes that the agent writes throughout the day, either when it wants to remember something or when explicitly told to remember something.
# 2026-01-26
## 10:30 AM - API Discussion
Discussed REST vs GraphQL with user. Decision: use REST for simplicity.
Key endpoints: /users, /auth, /projects.
## 2:15 PM - Deployment
Deployed v2.3.0 to production. No issues.
## 4:00 PM - User Preference
User mentioned they prefer TypeScript over JavaScript.
Layer 2 is long-term memory, what we call semantic memory: curated, persistent knowledge. The agent writes to this layer to record significant events, thoughts, decisions, opinions, and lessons learned.
# Long-term Memory
## User Preferences
- Prefers TypeScript over JavaScript
- Likes concise explanations
- Working on project "Acme Dashboard"
## Important Decisions
- 2026-01-15: Chose PostgreSQL for database
- 2026-01-20: Adopted REST over GraphQL
- 2026-01-26: Using Tailwind CSS for styling
## Key Contacts
- Alice (alice@acme.com) - Design lead
- Bob (bob@acme.com) - Backend engineer
Hot and cold memory
LLMs operate with a bounded context window and a bounded latency budget. Memory, in contrast, is unbounded and grows continuously. Without hot/cold separation, the system faces an impossible trade-off:
- either retrieve too much and overflow the context window and latency budget,
- or retrieve too little and hallucinate or forget.
Hot/cold memory is the mechanism that resolves this tension.
In a long-running agent, each memory type can be split into:
- Hot memory (hot tier): optimized for low-latency retrieval and frequent access
- Cold memory (cold tier): optimized for low-cost storage and long retention
- Archive index (hot index over cold): a lightweight, searchable representation of cold items, so the system never scans cold storage directly
Rule of thumb: the hot tier stores what is needed for interactive reasoning; the cold tier stores what might be needed “eventually”.
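That rule of thumb can be sketched as a placement policy; the 30-day TTL and the 0.8 importance cutoff below are illustrative assumptions, not recommended values:

```python
import time

HOT_TTL_DAYS = 30  # illustrative recency window for the hot tier

def assign_tier(item_ts: float, importance: float, now: float) -> str:
    """Toy placement policy: recent or important items stay hot; everything
    else is demoted to cold (but stays findable via the archive index)."""
    age_days = (now - item_ts) / 86400
    if age_days <= HOT_TTL_DAYS or importance >= 0.8:
        return "hot"
    return "cold"

now = time.time()
print(assign_tier(now - 5 * 86400, importance=0.1, now=now))   # recent -> "hot"
print(assign_tier(now - 90 * 86400, importance=0.1, now=now))  # stale, low value -> "cold"
```

A production policy would typically also track access frequency, so that frequently retrieved old items are promoted back to hot.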
Working memory
Hot working memory:
- Stored in process state
- Contains:
- current conversation window (last N turns)
- current task state (goal, plan, pending actions)
- recent tool outputs and scratchpad
- Retention: time-to-live (TTL) from minutes to hours
Cold working memory:
- Usually not stored as “working memory”
- Instead, a subset is consolidated into episodic/semantic memory (see the consolidation section below)
Episodic memory
Hot episodic memory:
- Store: recent and/or frequently used episodes
- Typical hot content:
- last 7/30/90 days of events
- high-importance episodes
- Index:
- vector index on event text/summary, used for semantic retrieval
- time index, used for range filtering
- tag/entity filters
Cold episodic memory:
- Store: full event details for long-term history
- Typical cold content: full text, attachment pointers, tool traces, structured fields
- Partitioning: by user/agent, by time (day/week/month) for time-range retrieval
Archive index for episodic
- Stored in hot tier (fast)
- Contains:
- event summary
- summary embedding
- time range
- tags/entities
- pointer to cold object
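A minimal sketch of such an archive index entry; the field names follow the list above, and the object-store pointer format is hypothetical. Searching the index returns pointers only, never the cold objects themselves:

```python
from dataclasses import dataclass

@dataclass
class ArchiveEntry:
    """Hot-tier record describing one cold episodic object."""
    summary: str
    embedding: list      # summary embedding (stand-in values below)
    start: str           # time range
    end: str
    tags: list
    cold_pointer: str    # hypothetical object-store key

index = [
    ArchiveEntry(
        summary="Booked flight to Paris via browser automation",
        embedding=[0.12, -0.34, 0.56],
        start="2026-01-22T10:30", end="2026-01-22T10:45",
        tags=["travel", "flight"],
        cold_pointer="s3://episodic/2026/01/22/ev-17.json",
    ),
]

def find_candidates(tag: str) -> list:
    """Search the lightweight index; return pointers into cold storage.
    Cold objects themselves are only fetched later, during rehydration."""
    return [e.cold_pointer for e in index if tag in e.tags]

print(find_candidates("travel"))
```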
Semantic memory
Hot semantic memory:
- Store: active knowledge that affects decisions
- Recommended structure:
- Graph DB (nodes: concepts/rules/preferences; edges: relations)
- Optional vector index over node description
- Contains:
- high-centrality knowledge
- most-referenced user preferences
- latest versions of rules/policies
Cold semantic memory:
- Store: rarely-used knowledge
- Best practice: archive at subgraph / concept-cluster granularity
- Contains:
- serialised subgraph snapshot
- version metadata
- deprecation reasons
Archive index for semantic memory
- Store: cluster summaries rather than raw nodes
- Contains:
- cluster_id + version
- summary + embedding
- key concepts list
- pointer to cold snapshot
Perceptual memory
Hot perceptual memory:
- Store: recent raw window and derived features
- Contains:
- last N days raw signals
- features (daily aggregates, anomalies)
- evidence snippets for explanation
- Index:
- time-series index (by timestamp)
- anomaly index (events that matter)
- vector index for media embeddings (images/audio)
Cold perceptual memory:
- Store: full raw data indefinitely (cheap storage)
- Partitioning: user_id / date / modality
- Compression: store raw as parquet/jsonl/zip
- Retrieval: fetch a specific time range or referenced segment
Archive index for perceptual
- Stores:
- time ranges + stats + anomaly markers
- thumbnails/low-res previews (optional)
- embeddings for media summaries (optional)
- pointer to raw partitions
How is different memory created?
Memory consolidation does not happen continuously. It is triggered by events or conditions indicating that information in working memory has long-term value.
Primary consolidation triggers
| Trigger type | Examples |
|---|---|
| Task boundary | Task completed, failed, or abandoned |
| User signal | Explicit feedback, correction, or preference |
| Repetition | Similar working-memory patterns observed repeatedly |
| Decision point | Irreversible or high-impact action taken |
| Anomaly | Unexpected outcome or error |
| Time boundary | End of conversation or daily summary cycle |
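The trigger table can be sketched as a simple predicate over a working-memory event; the field names and the repetition threshold of 3 are illustrative assumptions:

```python
def should_consolidate(event: dict) -> bool:
    """Return True when working memory should be consolidated into
    long-term memory, mirroring the trigger table above."""
    return any([
        event.get("task_status") in {"completed", "failed", "abandoned"},  # task boundary
        event.get("user_feedback") is not None,                            # user signal
        event.get("repetition_count", 0) >= 3,                             # repetition
        event.get("irreversible_action", False),                           # decision point
        event.get("error", False),                                         # anomaly
        event.get("end_of_day", False),                                    # time boundary
    ])

print(should_consolidate({"task_status": "completed"}))    # True
print(should_consolidate({"task_status": "in_progress"}))  # False
```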
How is episodic memory created?
Typical candidates from working memory that can be consolidated into episodic memory include:
- completed tasks
- notable user requests
- important tool interactions
- errors and recoveries
- user-visible outcomes
Episodic memory consolidation pipeline
1. Event segmentation: working memory is segmented into discrete events. A conversation becomes multiple atomic events:
   - “User requested flight booking”
   - “Browser automation executed”
   - “Flight booked successfully”
2. Event summarization: each event is summarised into a concise description suitable for retrieval. Example: “2026-01-22 10:30 — Booked a flight to Paris using browser automation.”
3. Temporal anchoring: the event is assigned a start_time and an end_time.
4. Metadata extraction: entities (people, places, objects), tags, and importance signals are extracted. These become filterable fields in episodic memory.
5. Storage and indexing: the summarized event is written to hot episodic memory, a semantic embedding is generated and indexed, and the canonical metadata record is created.
At this point, an episodic memory is created, and can be retrieved via semantic or temporal queries.
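The pipeline can be sketched end to end as follows. `hot_store` and `embed` are hypothetical stand-ins for a hot-tier database and an embedding model, and the keyword tagging is a toy substitute for real metadata extraction:

```python
from datetime import datetime

def consolidate_event(raw_turns, start, end, hot_store, embed):
    """Sketch of the five-step pipeline: segment -> summarize -> anchor
    -> extract metadata -> store and index."""
    # 1-2. Segmentation + summarization (a real system would use an LLM here)
    summary = "; ".join(raw_turns)[:200]
    # 3. Temporal anchoring
    record = {"summary": summary, "start_time": start, "end_time": end}
    # 4. Metadata extraction (toy keyword tagging)
    record["tags"] = [w for w in ("flight", "booking", "error") if w in summary.lower()]
    # 5. Storage and indexing
    record["embedding"] = embed(summary)
    hot_store.append(record)
    return record

store = []
ev = consolidate_event(
    ["User requested flight booking", "Flight booked successfully"],
    datetime(2026, 1, 22, 10, 30), datetime(2026, 1, 22, 10, 45),
    store, embed=lambda s: [float(len(s))],  # stand-in embedding function
)
print(ev["tags"])  # keywords found in the summary
```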
Example: how does Clawdbot create memory from working memory?
In Clawdbot, you can directly edit the Markdown files to create memories. In addition, Clawdbot supports an automatic memory flush.
Context compaction is a lossy process: important information may be summarised away and lost. To counter that, Clawdbot uses a pre-compaction memory flush.
┌─────────────────────────────────────────────────────────────┐
│ Context Approaching Limit │
│ │
│ ████████████████████████████░░░░░░░░ 75% of context │
│ ↑ │
│ Soft threshold crossed │
│ (contextWindow - reserve - softThreshold)│
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Silent Memory Flush Turn │
│ │
│ System: "Pre-compaction memory flush. Store durable │
│ memories now (use memory/YYYY-MM-DD.md). │
│ If nothing to store, reply with NO_REPLY." │
│ │
│ Agent: reviews conversation for important info │
│ writes key decisions/facts to memory files │
│ -> NO_REPLY (user sees nothing) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Compaction Proceeds Safely │
│ │
│ Important information is now on disk │
│ Compaction can proceed without losing knowledge │
└─────────────────────────────────────────────────────────────┘
How is semantic memory created?
Semantic memory can be created either from user-created files or through consolidation from working and episodic memory.
Users can specify their preferences, rules, and habits directly in a file.
Semantic consolidation is triggered when:
- similar episodic events recur,
- the agent detects a stable pattern,
- the user confirms or corrects behavior,
- a decision has long-term implications.
Example: Multiple episodic events show the user prefers morning flights.
Semantic memory consolidation pipeline:
- Pattern detection: The system analyzes episodic memory over time: clustering similar events, detecting repeated choices or outcomes.
- Hypothesis generation: A candidate semantic statement is proposed. At this stage, the statement is tentative.
- Validation and confidence assignment: Confidence is adjusted based on: frequency, consistency, explicit user confirmation.
- Graph integration: the validated semantic memory is inserted into the semantic memory graph.
- Versioning and stability: the validated statement is versioned so it can be revised or deprecated as new evidence arrives.
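Steps 1-3 of this pipeline can be sketched with a toy frequency-based detector; the function name, minimum support, and confidence formula are illustrative assumptions:

```python
from collections import Counter

def propose_preference(episodes, min_support=2):
    """Toy pattern detection + hypothesis generation: if the same choice
    recurs across episodes, propose a tentative semantic statement with a
    frequency-based confidence (to be validated and adjusted later)."""
    choices = Counter(e["choice"] for e in episodes)
    choice, count = choices.most_common(1)[0]
    if count < min_support:
        return None  # not enough evidence to form a hypothesis
    return {"statement": f"User prefers {choice}",
            "confidence": round(count / len(episodes), 2)}

episodes = [
    {"choice": "morning flight"}, {"choice": "morning flight"},
    {"choice": "morning flight"}, {"choice": "evening flight"},
]
print(propose_preference(episodes))
# {'statement': 'User prefers morning flight', 'confidence': 0.75}
```

Explicit user confirmation would then raise the confidence, and contradicting episodes would lower it or trigger deprecation.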
How does perceptual memory relate to other memory types?
Perceptual memory provides raw evidence that supports or contradicts other memories.
Perceptual data contributes to episodic memory by enriching event descriptions and providing objective measurements.
For example, from raw sensor data we can derive that the user slept 4.2 hours last night. This can be merged into episodic memory.
Semantic memory may emerge from aggregated perceptual patterns.
For example, weeks of sleep data plus repeated fatigue episodes can yield a semantic rule.
A simple example
Working memory
User books multiple flights over time, repeatedly choosing morning departures.
Episodic memory
“2026-01-22 — Booked morning flight to Paris”
“2026-02-10 — Booked morning flight to Berlin”
Semantic memory
Preference: “User prefers morning flights” (confidence 0.85)
Perceptual memory
Not involved in this example.
How is memory retrieved?
Memory retrieval is not a single database query. Instead, it is a multi-stage control process that integrates query understanding, memory-type routing, confidence evaluation, archive escalation, and controlled rehydration.
The detailed process is described below:
1. Query analysis and retrieval planning: the system analyzes the query, task type, explicit or implicit time references, and required entities or concepts, then constructs a retrieval plan that specifies:
   - which memory types to query
   - preferred retrieval order
   - time ranges to apply
   - escalation thresholds and limits
2. Hot memory retrieval
   - First, retrieve from working memory.
   - For episodic memory, the system embeds the query, performs vector similarity search against the hot episodic index, and filters by owner, time range, and memory tier. The resulting candidates are reranked using semantic similarity, recency decay, and importance and value scores.
   - For semantic memory, identify seed concepts and traverse the semantic graph. The output is formatted as a subgraph containing relevant concepts, rules, and their relationships.
   - For perceptual memory, the system queries hot perceptual stores for recent aggregates, detected anomalies, and metrics relevant to the query.
3. Confidence evaluation
   - Confidence can be assessed across several dimensions:
     - semantic relevance
     - coverage
     - temporal fit
     - authority: reliability of semantic knowledge
     - consistency: agreement across memory types
   - These metrics are combined into an overall confidence score. If the score exceeds a pre-defined threshold, retrieval stops and the system proceeds to context assembly.
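The combination step can be sketched as a weighted sum over these dimensions; the weights and threshold below are illustrative assumptions, not values from any specific system:

```python
WEIGHTS = {  # assumed weights; tune per application
    "relevance": 0.35, "coverage": 0.25, "temporal_fit": 0.15,
    "authority": 0.15, "consistency": 0.10,
}
THRESHOLD = 0.7  # assumed escalation threshold

def overall_confidence(scores: dict) -> float:
    """Weighted combination of per-dimension scores in [0, 1]."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

def should_escalate(scores: dict) -> bool:
    """True when hot-memory results are not confident enough and the
    system should fall back to the archive indices."""
    return overall_confidence(scores) < THRESHOLD

scores = {"relevance": 0.9, "coverage": 0.8, "temporal_fit": 0.6,
          "authority": 0.7, "consistency": 0.5}
print(overall_confidence(scores), should_escalate(scores))
```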
4. Escalation to archive indices
   - If confidence is insufficient, the system escalates to archive indices.
   - Importantly, escalation does not query cold storage directly. Instead, it searches lightweight archive indices that represent archived memory.
   - The output of this stage is a list of candidate pointers into cold storage.
5. Cold memory rehydration
   - Once archive candidates are identified, the system selectively fetches cold memory using the archive pointers. Two rehydration strategies can be applied:
     - Lazy hydration: cold memory is loaded only for the current response and is not reinserted into hot storage.
     - Warm restoration: if a cold memory is accessed frequently or deemed high value, it is restored to hot storage and reindexed.
6. Merging and re-evaluating results
   - Hot and cold results are merged and reranked according to the retrieval plan. Confidence evaluation is re-run on the merged result set. If confidence remains low, the system acknowledges uncertainty or requests clarification.
7. Context packaging
   - Before passing retrieved memory to the language model, the system formats it into compact, structured blocks, including:
     - working context
     - semantic facts and rules
     - episodic evidence
     - perceptual summaries
   - Each block is truncated or summarised to fit token budgets, annotated with timestamps and confidence cues, and ordered by relevance.
The diagram below illustrates this process:
┌──────────────────────────────────────────────────────────┐
│ USER QUERY │
└──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ QUERY ANALYSIS & RETRIEVAL PLANNING │
│ - intent detection │
│ - entity extraction │
│ - time range inference │
│ - memory type routing │
└──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ HOT MEMORY RETRIEVAL │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Working │ │ Episodic │ │ Semantic │ │
│ │ Memory │ │ (Vector DB) │ │ (Graph DB) │ │
│ │ (Session) │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └─────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Perceptual │ │
│ │ (Features) │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ CONFIDENCE EVALUATION │
│ │
│ - semantic relevance │
│ - coverage (entities / aspects) │
│ - temporal fit │
│ - authority & stability │
│ - cross-memory consistency │
│ │
│ confidence ≥ threshold ? │
└──────────────────────────────────────────────────────────┘
│ YES │ NO
▼ ▼
┌──────────────────────────────┐ ┌───────────────────────────────┐
│ CONTEXT PACKAGING │ │ ARCHIVE INDEX RETRIEVAL │
│ (Hot memory only) │ │ │
│ │ │ ┌──────────────┐ │
│ - compress │ │ │ Episodic │ │
│ - structure │ │ │ Archive IDX │ │
│ - order by relevance │ │ └──────────────┘ │
│ │ │ │
│ │ │ ┌──────────────┐ │
│ │ │ │ Semantic │ │
│ │ │ │ Archive IDX │ │
│ │ │ └──────────────┘ │
│ │ │ │
│ │ │ ┌──────────────┐ │
│ │ │ │ Perceptual │ │
│ │ │ │ Archive IDX │ │
│ │ │ └──────────────┘ │
└──────────────────────────────┘ └───────────────────────────────┘
│ │
▼ ▼
┌──────────────────────────────┐ ┌────────────────────────────────┐
│ LLM CONTEXT │ │ COLD MEMORY FETCH │
│ (Ready for inference) │ │ │
└──────────────────────────────┘ │ ┌──────────────────────────┐ │
│ │ Cold Object Storage │ │
| │ (JSON / Parquet / Media) │ │
│ └──────────────────────────┘ │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ REHYDRATION & MERGE │
│ │
│ - lazy hydrate │
│ - optional warm restore │
│ - rerank merged results │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ CONFIDENCE RE-EVALUATION │
│ │
│ sufficient? │
└────────────────────────────────┘
│ YES │ NO
▼ ▼
┌──────────────────────┐ ┌───────────────────┐
│ CONTEXT PACKAGING │ │ EXPLICIT │
│ (Hot + Cold memory) │ │ UNCERTAINTY / │
│ │ │ CLARIFICATION │
└──────────────────────┘ └───────────────────┘
│
▼
┌──────────────────────┐
│ LLM INFERENCE │
└──────────────────────┘
Example: how does memory in Clawdbot get indexed?
When you save a memory file, here’s what happens behind the scenes:
┌─────────────────────────────────────────────────────────────┐
│ 1. File Saved │
│ ~/clawd/memory/2026-01-26.md │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. File Watcher Detects Change │
│ Chokidar monitors MEMORY.md + memory/**/*.md │
│ Debounced 1.5 seconds to batch rapid writes │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Chunking │
│ Split into ~400 token chunks with 80 token overlap │
│ │
│ ┌────────────────┐ │
│ │ Chunk 1 │ │
│ │ Lines 1-15 │──────┐ │
│ └────────────────┘ │ │
│ ┌────────────────┐ │ (80 token overlap) │
│ │ Chunk 2 │◄─────┘ │
│ │ Lines 12-28 │──────┐ │
│ └────────────────┘ │ │
│ ┌────────────────┐ │ │
│ │ Chunk 3 │◄─────┘ │
│ │ Lines 25-40 │ │
│ └────────────────┘ │
│ │
│ Why 400/80? Balances semantic coherence vs granularity. │
│ Overlap ensures facts spanning chunk boundaries are │
│ captured in both. Both values are configurable. │
└─────────────────────────────────────────────────────────────┘
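The chunking step can be sketched with whitespace tokens standing in for model tokens (a real implementation would count tokens with the embedding model’s tokenizer):

```python
def chunk_text(text: str, chunk_size=400, overlap=80):
    """Sliding-window chunking: each chunk holds up to `chunk_size` tokens
    and shares `overlap` tokens with its predecessor, so facts spanning a
    boundary appear in both chunks."""
    tokens = text.split()  # whitespace tokens as a stand-in for model tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```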
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Embedding │
│ Each chunk -> embedding provider -> vector │
│ │
│ "Discussed REST vs GraphQL" -> │
│ OpenAI/Gemini/Local -> │
│ [0.12, -0.34, 0.56, ...] (1536 dimensions) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Storage │
│ ~/.clawdbot/memory/<agentId>.sqlite │
│ │
│ Tables: │
│ - chunks (id, path, start_line, end_line, text, hash) │
│ - chunks_vec (id, embedding) -> sqlite-vec │
│ - chunks_fts (text) -> FTS5 full-text │
│ - embedding_cache (hash, vector) -> avoid re-embedding │
└─────────────────────────────────────────────────────────────┘
sqlite-vec is a SQLite extension that enables vector similarity search directly in SQLite, with no external vector database required.
FTS5 is SQLite’s built-in full-text search engine that powers the BM25 keyword matching. Together, they allow Clawdbot to run hybrid search (semantic + keyword) from a single lightweight database file.
When you search memory, Clawdbot runs two search strategies in parallel: vector search (semantic) finds content that means the same thing, while BM25 search (keyword) finds content containing the exact tokens.
The results are combined with weighted scoring:
$$\text{finalScore} = 0.7 \times \text{vectorScore} + 0.3 \times \text{textScore}$$
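A minimal sketch of this weighted merge, using the 0.7/0.3 weights above; the chunk IDs and scores are illustrative:

```python
VECTOR_WEIGHT, TEXT_WEIGHT = 0.7, 0.3  # weights from the formula above

def hybrid_rank(vector_scores: dict, text_scores: dict, top_k=3):
    """Merge the two parallel search strategies: a chunk found by only one
    strategy gets 0 from the other, then everything is ranked by the
    weighted sum."""
    ids = set(vector_scores) | set(text_scores)
    combined = {
        cid: VECTOR_WEIGHT * vector_scores.get(cid, 0.0)
             + TEXT_WEIGHT * text_scores.get(cid, 0.0)
        for cid in ids
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

vec = {"chunk-a": 0.9, "chunk-b": 0.4}   # semantic similarity scores
bm25 = {"chunk-b": 1.0, "chunk-c": 0.8}  # normalized keyword scores
print(hybrid_rank(vec, bm25))
```

This assumes both score sets are normalized to [0, 1] before merging; raw BM25 scores would need scaling first.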
Summary (TL;DR)
This article presents a systematic memory architecture for long-running agent systems, designed to address the fundamental mismatch between bounded model context and unbounded experiential memory.
We distinguish context (short-lived, costly, and bounded) from memory (persistent, unbounded, and searchable). To manage memory at scale, we organize it into four functional types:
- Working memory for immediate reasoning state
- Episodic memory for time-stamped experiences
- Semantic memory for abstracted knowledge, rules, and preferences
- Perceptual memory for raw sensory data and derived features
Each memory type is further divided into hot and cold tiers. Hot memory contains high-utility information optimized for low-latency retrieval, while cold memory preserves long-term knowledge at low cost. An archive index bridges the two, enabling semantic discovery of cold memory without loading it into context.
Retrieval follows a confidence-gated, tiered pipeline:
- Retrieve from hot memory first
- Evaluate confidence across relevance, coverage, time, authority, and consistency
- Escalate to archive indices only when needed
- Rehydrate cold memory selectively and safely
Memory creation is governed by consolidation pipelines:
- Working -> Episodic (events and outcomes)
- Episodic -> Semantic (patterns and generalizations)
- Perceptual -> Episodic/Semantic (evidence-driven summaries and rules)
This architecture ensures agents can operate indefinitely, maintaining coherence, efficiency, and explainability while avoiding context overflow, hallucination, and uncontrolled memory growth.
The design is intentionally modular and can be packaged as a reusable memory middleware for agent frameworks, offering measurable benefits in retrieval accuracy, cost efficiency, and long-term reliability.