Memory system design for long-running agent systems
For a long-running agent system, a well-designed memory architecture is essential to maintain context, preserve experience, and continuously improve behaviour across interactions.
This article presents a systematic way to think about agent memory, clarifies its relationship with context, and provides a practical framework for storing, retrieving, and composing memory into each model invocation.
This article also draws on lessons learned from OpenClaw (previously called Clawdbot and Moltbot).
To make the ideas described in this article practical and easy to adopt, we have developed MemoryAgent, an open-source framework that implements this tiered memory architecture as a drop-in memory layer for agent systems.
Github link: https://github.com/jia-wei-zheng/MemoryAgent
Before discussing specific memory mechanisms, it is crucial to distinguish two closely related but fundamentally different concepts: context and memory.
Context refers to everything the model can directly “see” for a single request.
Context has the following features:
- Short-lived: Exists only for this single request
- Bounded: Limited by the model’s context window (e.g., 200k tokens)
- Costly: Larger context means higher latency and API cost
- Transient: Discarded after the request completes.
Memory is persistent information, typically stored on disk, either in files or in a database.
Memory has the following features:
- Persistent: Can last days, months, or years
- Unbounded: No inherent context window limits
- Cheap to store: Disk and database are orders of magnitude cheaper than context tokens.
- Searchable: Can be indexed, embedded for semantic retrieval
Memory exists to selectively rebuild context when needed.
Memory types
Based on purpose, lifespan, and usage patterns, agent memory can be categorized into four primary types:
Working memory
Working memory represents the agent’s current workspace, closely aligned with the context.
Characteristics:
- Very short-lived
- Frequently updated
- Automatically cleared or overwritten
- Directly influences reasoning and tool use
Examples:
- current conversation turns
- Intermediate reasoning states
- Tool call outputs
- Temporary variables (e.g., current task = booking flight)
In practice, working memory is usually implemented as context assembly logic, not long-term storage.
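Since working memory is context assembly rather than long-term storage, it can be sketched as a small in-process structure. The class and field names below are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """Hypothetical working-memory container: context assembly, not storage."""
    system_prompt: str
    turns: list = field(default_factory=list)  # (role, text) pairs
    max_turns: int = 20                        # sliding window size

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        # Older turns silently fall out of the window -- they are gone
        # unless consolidated into episodic/semantic memory elsewhere.
        self.turns = self.turns[-self.max_turns:]

    def assemble_context(self) -> str:
        """Rebuild the prompt for a single request from current state."""
        lines = [self.system_prompt]
        lines += [f"{role}: {text}" for role, text in self.turns]
        return "\n".join(lines)

wm = WorkingMemory(system_prompt="You are a helpful assistant.")
wm.add_turn("user", "Book me a flight to Paris.")
wm.add_turn("assistant", "Searching flights...")
print(wm.assemble_context())
```

Note that nothing here persists: when the process ends, this state is discarded, which is exactly why consolidation into episodic and semantic memory matters.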
Episodic memory
Episodic memory stores time-stamped experiences: what happened, when it happened, and under what circumstances.
Characteristics:
- Strongly tied to time
- Append-only or slowly growing
- Gradually loses relevance over time
Examples:
- “2026-03-12 10:30: User asked to book a flight using browser automation”
- “Yesterday: user complained about poor sleep quality”
Episodic memory enables the agent to recall past interactions, detect patterns over time, and provide continuity (“last time you mentioned…”).
Semantic memory
Semantic memory stores abstracted knowledge extracted from experience: facts, rules, preferences, and generalized conclusions.
Characteristics:
- Time-independent (or weakly time-dependent)
- Highly reusable
- Actively influences decision-making
- Low volume but high value
Examples:
- “User prefers aisle seats when flying”
- “If user sleeps < 6h for multiple days, fatigue increases”
- “Booking flights usually requires passport verification”
Semantic memory often emerges through consolidation from episodic and working memory.
Perceptual memory
Perceptual memory stores raw or minimally processed sensory and environmental data.
Characteristics:
- High volume
- Often noisy
- Rarely retrieved directly
- Used to derive high-level features or summaries
Examples:
- Wearable sensor data (heart rate, steps, sleep cycles)
- Audio, images, screenshots
- Environment logs
Perceptual memory typically feeds into episodic or semantic memory through preprocessing and aggregation.
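As a sketch of that preprocessing step, the toy aggregator below (hypothetical sample format and field names) reduces raw wearable signals to a daily summary suitable for an episodic entry:

```python
from statistics import mean

# Hypothetical raw wearable samples: (hour, heart_rate, asleep?)
raw_samples = [
    (0, 52, True), (1, 50, True), (2, 51, True), (3, 54, True),
    (4, 58, False), (9, 72, False), (14, 88, False), (22, 60, False),
]

def aggregate_day(samples):
    """Reduce noisy raw signals to a daily summary that can be written
    into episodic memory; the raw samples stay in perceptual storage."""
    sleep_hours = sum(1 for _, _, asleep in samples if asleep)
    resting_hr = mean(hr for _, hr, asleep in samples if asleep)
    return {"sleep_hours": sleep_hours, "resting_hr": round(resting_hr, 1)}

summary = aggregate_day(raw_samples)
# -> can become the episodic entry "2026-01-26: slept 4h, resting HR 51.8"
print(summary)
```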
Example Case: Personal assistant agent
| Memory Type | Example Usage |
|---|---|
| Working | Current conversation, tool results, active task |
| Episodic | Daily interaction logs, notable events |
| Semantic | User preferences, learned habits, rules |
| Perceptual | Health sensor data, raw activity logs |
Example Case: OpenClaw
OpenClaw (previously called Clawdbot and Moltbot) is a recently popular agent project that became one of the most popular AI products on GitHub. In Clawdbot, working memory is equivalent to context, which contains:
[0] System Prompt (static + conditional instructions)
[1] Project Context (bootstrap files: AGENTS.md, SOUL.md, etc.)
[2] Conversation History (messages, tool calls, compaction summaries)
[3] Current Message
The project context consists of user-editable Markdown files that are injected into working memory. These include:
| File | Purpose |
|---|---|
| AGENTS.md | Agent instructions, including memory guidelines |
| SOUL.md | Personality and tone |
| USER.md | Information about the user |
| TOOLS.md | Usage guidance for external tools |
Clawdbot’s episodic and semantic memory is built on plain Markdown files in the agent workspace, organized as a two-layer memory system.
Memory lives in the agent’s workspace (default: ~/clawd/):
~/clawd/
├── MEMORY.md - Layer 2: Long-term curated knowledge
└── memory/
├── 2026-01-26.md - Layer 1: Today's notes
├── 2026-01-25.md - Yesterday's notes
├── 2026-01-24.md - ...and so on
└── ...
Layer 1 is episodic memory, recording daily logs. These are append-only daily notes that the agent writes throughout the day, either when it wants to remember something or when explicitly told to remember something.
# 2026-01-26
## 10:30 AM - API Discussion
Discussed REST vs GraphQL with user. Decision: use REST for simplicity.
Key endpoints: /users, /auth, /projects.
## 2:15 PM - Deployment
Deployed v2.3.0 to production. No issues.
## 4:00 PM - User Preference
User mentioned they prefer TypeScript over JavaScript.
Layer 2 is long-term memory, what we call semantic memory: curated, persistent knowledge. The agent writes to this layer to record significant events, thoughts, decisions, opinions, and lessons learned.
# Long-term Memory
## User Preferences
- Prefers TypeScript over JavaScript
- Likes concise explanations
- Working on project "Acme Dashboard"
## Important Decisions
- 2026-01-15: Chose PostgreSQL for database
- 2026-01-20: Adopted REST over GraphQL
- 2026-01-26: Using Tailwind CSS for styling
## Key Contacts
- Alice (alice@acme.com) - Design lead
- Bob (bob@acme.com) - Backend engineer
Hot and cold memory
LLMs operate with a bounded context window and a bounded latency budget. Memory, in contrast, is unbounded and grows continuously. Without hot/cold separation, the system faces an impossible trade-off:
- either retrieve too much and overflow the context window and latency budget,
- or retrieve too little and hallucinate or forget.
Hot/cold memory is the mechanism that resolves this tension.
In a long-running agent, each memory type can be split into:
- Hot memory (hot tier): optimized for low-latency retrieval and frequent access
- Cold memory (cold tier): optimized for low-cost storage and long retention
- Archive index (hot index over cold): a lightweight, searchable representation of cold items, so the system never scans cold storage directly
Rule of thumb: the hot tier stores what is needed for interactive reasoning; the cold tier stores what might be needed “eventually”.
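That rule of thumb can be sketched as a placement policy; the 30-day TTL and the 0.8 importance cutoff below are illustrative assumptions, not recommended values:

```python
import time

HOT_TTL_DAYS = 30  # illustrative recency window for the hot tier

def assign_tier(item_ts: float, importance: float, now: float) -> str:
    """Toy placement policy: recent or important items stay hot; everything
    else is demoted to cold (but stays findable via the archive index)."""
    age_days = (now - item_ts) / 86400
    if age_days <= HOT_TTL_DAYS or importance >= 0.8:
        return "hot"
    return "cold"

now = time.time()
print(assign_tier(now - 5 * 86400, importance=0.1, now=now))   # recent -> "hot"
print(assign_tier(now - 90 * 86400, importance=0.1, now=now))  # stale, low value -> "cold"
```

A production policy would typically also track access frequency, so that frequently retrieved old items are promoted back to hot.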
Working memory
Hot working memory:
- Stored in process state
- Contains:
- current conversation window (last N turns)
- current task state (goal, plan, pending actions)
- recent tool outputs and scratchpad
- Retention: time-to-live (TTL) from minutes to hours
Cold working memory:
- Usually not stored as “working memory”
- Instead, a subset is consolidated into episodic/semantic memory (see the consolidation section below)
Episodic memory
Hot episodic memory:
- Store: recent and/or frequently used episodes
- Typical hot content:
- last 7/30/90 days of events
- high-importance episodes
- Index:
- vector index on event text/summary, used for semantic retrieval
- time index, used for range filtering
- tag/entity filters
Cold episodic memory:
- Store: full event details for long-term history
- Typical cold content: full text, attachment pointers, tool traces, structured fields
- Partitioning: by user/agent, by time (day/week/month) for time-range retrieval
Archive index for episodic
- Stored in hot tier (fast)
- Contains:
- event summary
- summary embedding
- time range
- tags/entities
- pointer to cold object
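A minimal sketch of such an archive index entry; the field names follow the list above, and the object-store pointer format is hypothetical. Searching the index returns pointers only, never the cold objects themselves:

```python
from dataclasses import dataclass

@dataclass
class ArchiveEntry:
    """Hot-tier record describing one cold episodic object."""
    summary: str
    embedding: list      # summary embedding (stand-in values below)
    start: str           # time range
    end: str
    tags: list
    cold_pointer: str    # hypothetical object-store key

index = [
    ArchiveEntry(
        summary="Booked flight to Paris via browser automation",
        embedding=[0.12, -0.34, 0.56],
        start="2026-01-22T10:30", end="2026-01-22T10:45",
        tags=["travel", "flight"],
        cold_pointer="s3://episodic/2026/01/22/ev-17.json",
    ),
]

def find_candidates(tag: str) -> list:
    """Search the lightweight index; return pointers into cold storage.
    Cold objects themselves are only fetched later, during rehydration."""
    return [e.cold_pointer for e in index if tag in e.tags]

print(find_candidates("travel"))
```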
Semantic memory
Hot semantic memory:
- Store: active knowledge that affects decisions
- Recommended structure:
- Graph DB (nodes: concepts/rules/preferences; edges: relations)
- Optional vector index over node description
- Contains:
- high-centrality knowledge
- most-referenced user preferences
- latest versions of rules/policies
Cold semantic memory:
- Store: rarely-used knowledge
- Best practice: archive at subgraph / concept-cluster granularity
- Contains:
- serialised subgraph snapshot
- version metadata
- deprecation reasons
Archive index for semantic memory
- Store: cluster summaries rather than raw nodes
- Contains:
- cluster_id + version
- summary + embedding
- key concepts list
- pointer to cold snapshot
Perceptual memory
Hot perceptual memory:
- Store: recent raw window and derived features
- Contains:
- last N days raw signals
- features (daily aggregates, anomalies)
- evidence snippets for explanation
- Index:
- time-series index (by timestamp)
- anomaly index (events that matter)
- vector index for media embeddings (images/audio)
Cold perceptual memory:
- Store: full raw data indefinitely (cheap storage)
- Partitioning: user_id / date / modality
- Compression: store raw as parquet/jsonl/zip
- Retrieval: fetch a specific time range or referenced segment
Archive index for perceptual
- Stores:
- time ranges + stats + anomaly markers
- thumbnails/low-res previews (optional)
- embeddings for media summaries (optional)
- pointer to raw partitions
How is different memory created?
Memory consolidation does not happen continuously. It is triggered by events or conditions indicating that information in working memory has long-term value.
Primary consolidation triggers
| Trigger type | Examples |
|---|---|
| Task boundary | Task completed, failed, or abandoned |
| User signal | Explicit feedback, correction, or preference |
| Repetition | Similar working-memory patterns observed repeatedly |
| Decision point | Irreversible or high-impact action taken |
| Anomaly | Unexpected outcome or error |
| Time boundary | End of conversation or daily summary cycle |
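The trigger table can be sketched as a simple predicate over a working-memory event; the field names and the repetition threshold of 3 are illustrative assumptions:

```python
def should_consolidate(event: dict) -> bool:
    """Return True when working memory should be consolidated into
    long-term memory, mirroring the trigger table above."""
    return any([
        event.get("task_status") in {"completed", "failed", "abandoned"},  # task boundary
        event.get("user_feedback") is not None,                            # user signal
        event.get("repetition_count", 0) >= 3,                             # repetition
        event.get("irreversible_action", False),                           # decision point
        event.get("error", False),                                         # anomaly
        event.get("end_of_day", False),                                    # time boundary
    ])

print(should_consolidate({"task_status": "completed"}))    # True
print(should_consolidate({"task_status": "in_progress"}))  # False
```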
How is episodic memory created?
Typical candidates from working memory that can be consolidated into episodic memory include:
- completed tasks
- notable user requests
- important tool interactions
- errors and recoveries
- user-visible outcomes
Episodic memory consolidation pipeline
1. Event segmentation: working memory is segmented into discrete events. A conversation becomes multiple atomic events:
   - “User requested flight booking”
   - “Browser automation executed”
   - “Flight booked successfully”
2. Event summarization: each event is summarised into a concise description suitable for retrieval. Example: “2026-01-22 10:30 — Booked a flight to Paris using browser automation.”
3. Temporal anchoring: the event is assigned a start_time and an end_time.
4. Metadata extraction: entities (people, places, objects), tags, and importance signals are extracted. These become filterable fields in episodic memory.
5. Storage and indexing: the summarized event is written to hot episodic memory, a semantic embedding is generated and indexed, and the canonical metadata record is created.
At this point, an episodic memory is created, and can be retrieved via semantic or temporal queries.
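The pipeline can be sketched end to end as follows. `hot_store` and `embed` are hypothetical stand-ins for a hot-tier database and an embedding model, and the keyword tagging is a toy substitute for real metadata extraction:

```python
from datetime import datetime

def consolidate_event(raw_turns, start, end, hot_store, embed):
    """Sketch of the five-step pipeline: segment -> summarize -> anchor
    -> extract metadata -> store and index."""
    # 1-2. Segmentation + summarization (a real system would use an LLM here)
    summary = "; ".join(raw_turns)[:200]
    # 3. Temporal anchoring
    record = {"summary": summary, "start_time": start, "end_time": end}
    # 4. Metadata extraction (toy keyword tagging)
    record["tags"] = [w for w in ("flight", "booking", "error") if w in summary.lower()]
    # 5. Storage and indexing
    record["embedding"] = embed(summary)
    hot_store.append(record)
    return record

store = []
ev = consolidate_event(
    ["User requested flight booking", "Flight booked successfully"],
    datetime(2026, 1, 22, 10, 30), datetime(2026, 1, 22, 10, 45),
    store, embed=lambda s: [float(len(s))],  # stand-in embedding function
)
print(ev["tags"])  # keywords found in the summary
```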
Example: how does Clawdbot create memory from working memory?
In Clawdbot, you can directly edit the Markdown files to create memories. In addition, Clawdbot supports an automatic memory flush.
Context compaction is a lossy process: important information may be summarised away and lost. To counter that, Clawdbot uses a pre-compaction memory flush.
┌─────────────────────────────────────────────────────────────┐
│ Context Approaching Limit │
│ │
│ ████████████████████████████░░░░░░░░ 75% of context │
│ ↑ │
│ Soft threshold crossed │
│ (contextWindow - reserve - softThreshold)│
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Silent Memory Flush Turn │
│ │
│ System: "Pre-compaction memory flush. Store durable │
│ memories now (use memory/YYYY-MM-DD.md). │
│ If nothing to store, reply with NO_REPLY." │
│ │
│ Agent: reviews conversation for important info │
│ writes key decisions/facts to memory files │
│ -> NO_REPLY (user sees nothing) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Compaction Proceeds Safely │
│ │
│ Important information is now on disk │
│ Compaction can proceed without losing knowledge │
└─────────────────────────────────────────────────────────────┘
How is semantic memory created?
Semantic memory can be created either from user-created files or through consolidation from working and episodic memory.
Users can specify their preferences, rules, and habits directly in a file.
Semantic consolidation is triggered when:
- similar episodic events recur,
- the agent detects a stable pattern,
- the user confirms or corrects behavior,
- a decision has long-term implications.
Example: Multiple episodic events show the user prefers morning flights.
Semantic memory consolidation pipeline:
- Pattern detection: The system analyzes episodic memory over time: clustering similar events, detecting repeated choices or outcomes.
- Hypothesis generation: A candidate semantic statement is proposed. At this stage, the statement is tentative.
- Validation and confidence assignment: Confidence is adjusted based on: frequency, consistency, explicit user confirmation.
- Graph integration: the validated semantic memory is inserted into the semantic memory graph.
- Versioning and stability: the validated statement is versioned so it can be revised or deprecated as new evidence arrives.
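Steps 1-3 of this pipeline can be sketched with a toy frequency-based detector; the function name, minimum support, and confidence formula are illustrative assumptions:

```python
from collections import Counter

def propose_preference(episodes, min_support=2):
    """Toy pattern detection + hypothesis generation: if the same choice
    recurs across episodes, propose a tentative semantic statement with a
    frequency-based confidence (to be validated and adjusted later)."""
    choices = Counter(e["choice"] for e in episodes)
    choice, count = choices.most_common(1)[0]
    if count < min_support:
        return None  # not enough evidence to form a hypothesis
    return {"statement": f"User prefers {choice}",
            "confidence": round(count / len(episodes), 2)}

episodes = [
    {"choice": "morning flight"}, {"choice": "morning flight"},
    {"choice": "morning flight"}, {"choice": "evening flight"},
]
print(propose_preference(episodes))
# {'statement': 'User prefers morning flight', 'confidence': 0.75}
```

Explicit user confirmation would then raise the confidence, and contradicting episodes would lower it or trigger deprecation.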
How does perceptual memory relate to other memory types?
Perceptual memory provides raw evidence that supports or contradicts other memories.
Perceptual data contributes to episodic memory by enriching event descriptions and providing objective measurements.
For example, from raw sensor data we can derive that the user slept 4.2 hours last night. This can be merged into episodic memory.
Semantic memory may emerge from aggregated perceptual patterns.
For example, weeks of sleep data plus repeated fatigue episodes can yield a semantic rule.
A simple example
Working memory
User books multiple flights over time, repeatedly choosing morning departures.
Episodic memory
“2026-01-22 — Booked morning flight to Paris”
“2026-02-10 — Booked morning flight to Berlin”
Semantic memory
Preference: “User prefers morning flights” (confidence 0.85)
Perceptual memory
Not involved in this example.
How is memory retrieved?
Memory retrieval is not a single database query. Instead, it is a multi-stage control process that integrates query understanding, memory-type routing, confidence evaluation, archive escalation, and controlled rehydration.
The detailed process is described below:
1. Query analysis and retrieval planning: the system analyzes the query, task type, explicit or implicit time references, and required entities or concepts, then constructs a retrieval plan that specifies:
   - which memory types to query
   - preferred retrieval order
   - time ranges to apply
   - escalation thresholds and limits
2. Hot memory retrieval
   - First, retrieve from working memory.
   - For episodic memory, the system embeds the query, performs vector similarity search against the hot episodic index, and filters by owner, time range, and memory tier. The resulting candidates are reranked using semantic similarity, recency decay, and importance and value scores.
   - For semantic memory, identify seed concepts and traverse the semantic graph. The output is formatted as a subgraph containing relevant concepts, rules, and their relationships.
   - For perceptual memory, the system queries hot perceptual stores for recent aggregates, detected anomalies, and metrics relevant to the query.
3. Confidence evaluation
   - Confidence can be assessed across several dimensions:
     - semantic relevance
     - coverage
     - temporal fit
     - authority: reliability of semantic knowledge
     - consistency: agreement across memory types
   - These metrics are combined into an overall confidence score. If the score exceeds a pre-defined threshold, retrieval stops and the system proceeds to context assembly.
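The combination step can be sketched as a weighted sum over these dimensions; the weights and threshold below are illustrative assumptions, not values from any specific system:

```python
WEIGHTS = {  # assumed weights; tune per application
    "relevance": 0.35, "coverage": 0.25, "temporal_fit": 0.15,
    "authority": 0.15, "consistency": 0.10,
}
THRESHOLD = 0.7  # assumed escalation threshold

def overall_confidence(scores: dict) -> float:
    """Weighted combination of per-dimension scores in [0, 1]."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

def should_escalate(scores: dict) -> bool:
    """True when hot-memory results are not confident enough and the
    system should fall back to the archive indices."""
    return overall_confidence(scores) < THRESHOLD

scores = {"relevance": 0.9, "coverage": 0.8, "temporal_fit": 0.6,
          "authority": 0.7, "consistency": 0.5}
print(overall_confidence(scores), should_escalate(scores))
```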
4. Escalation to archive indices
   - If confidence is insufficient, the system escalates to archive indices.
   - Importantly, escalation does not query cold storage directly. Instead, it searches lightweight archive indices that represent archived memory.
   - The output of this stage is a list of candidate pointers into cold storage.
5. Cold memory rehydration
   - Once archive candidates are identified, the system selectively fetches cold memory using the archive pointers. Two rehydration strategies can be applied:
     - Lazy hydration: cold memory is loaded only for the current response and is not reinserted into hot storage.
     - Warm restoration: if a cold memory is accessed frequently or deemed high value, it is restored to hot storage and reindexed.
6. Merging and re-evaluating results
   - Hot and cold results are merged and reranked according to the retrieval plan. Confidence evaluation is re-run on the merged result set. If confidence remains low, the system acknowledges uncertainty or requests clarification.
7. Context packaging
   - Before passing retrieved memory to the language model, the system formats it into compact, structured blocks, including:
     - working context
     - semantic facts and rules
     - episodic evidence
     - perceptual summaries
   - Each block is truncated or summarised to fit token budgets, annotated with timestamps and confidence cues, and ordered by relevance.
The diagram below illustrates this process:
┌──────────────────────────────────────────────────────────┐
│ USER QUERY │
└──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ QUERY ANALYSIS & RETRIEVAL PLANNING │
│ - intent detection │
│ - entity extraction │
│ - time range inference │
│ - memory type routing │
└──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ HOT MEMORY RETRIEVAL │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Working │ │ Episodic │ │ Semantic │ │
│ │ Memory │ │ (Vector DB) │ │ (Graph DB) │ │
│ │ (Session) │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └─────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Perceptual │ │
│ │ (Features) │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ CONFIDENCE EVALUATION │
│ │
│ - semantic relevance │
│ - coverage (entities / aspects) │
│ - temporal fit │
│ - authority & stability │
│ - cross-memory consistency │
│ │
│ confidence ≥ threshold ? │
└──────────────────────────────────────────────────────────┘
│ YES │ NO
▼ ▼
┌──────────────────────────────┐ ┌───────────────────────────────┐
│ CONTEXT PACKAGING │ │ ARCHIVE INDEX RETRIEVAL │
│ (Hot memory only) │ │ │
│ │ │ ┌──────────────┐ │
│ - compress │ │ │ Episodic │ │
│ - structure │ │ │ Archive IDX │ │
│ - order by relevance │ │ └──────────────┘ │
│ │ │ │
│ │ │ ┌──────────────┐ │
│ │ │ │ Semantic │ │
│ │ │ │ Archive IDX │ │
│ │ │ └──────────────┘ │
│ │ │ │
│ │ │ ┌──────────────┐ │
│ │ │ │ Perceptual │ │
│ │ │ │ Archive IDX │ │
│ │ │ └──────────────┘ │
└──────────────────────────────┘ └───────────────────────────────┘
│ │
▼ ▼
┌──────────────────────────────┐ ┌────────────────────────────────┐
│ LLM CONTEXT │ │ COLD MEMORY FETCH │
│ (Ready for inference) │ │ │
└──────────────────────────────┘ │ ┌──────────────────────────┐ │
│ │ Cold Object Storage │ │
| │ (JSON / Parquet / Media) │ │
│ └──────────────────────────┘ │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ REHYDRATION & MERGE │
│ │
│ - lazy hydrate │
│ - optional warm restore │
│ - rerank merged results │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ CONFIDENCE RE-EVALUATION │
│ │
│ sufficient? │
└────────────────────────────────┘
│ YES │ NO
▼ ▼
┌──────────────────────┐ ┌───────────────────┐
│ CONTEXT PACKAGING │ │ EXPLICIT │
│ (Hot + Cold memory) │ │ UNCERTAINTY / │
│ │ │ CLARIFICATION │
└──────────────────────┘ └───────────────────┘
│
▼
┌──────────────────────┐
│ LLM INFERENCE │
└──────────────────────┘
Example: how does memory in Clawdbot get indexed?
When you save a memory file, here’s what happens behind the scenes:
┌─────────────────────────────────────────────────────────────┐
│ 1. File Saved │
│ ~/clawd/memory/2026-01-26.md │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. File Watcher Detects Change │
│ Chokidar monitors MEMORY.md + memory/**/*.md │
│ Debounced 1.5 seconds to batch rapid writes │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Chunking │
│ Split into ~400 token chunks with 80 token overlap │
│ │
│ ┌────────────────┐ │
│ │ Chunk 1 │ │
│ │ Lines 1-15 │──────┐ │
│ └────────────────┘ │ │
│ ┌────────────────┐ │ (80 token overlap) │
│ │ Chunk 2 │◄─────┘ │
│ │ Lines 12-28 │──────┐ │
│ └────────────────┘ │ │
│ ┌────────────────┐ │ │
│ │ Chunk 3 │◄─────┘ │
│ │ Lines 25-40 │ │
│ └────────────────┘ │
│ │
│ Why 400/80? Balances semantic coherence vs granularity. │
│ Overlap ensures facts spanning chunk boundaries are │
│ captured in both. Both values are configurable. │
└─────────────────────────────────────────────────────────────┘
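The chunking step can be sketched with whitespace tokens standing in for model tokens (a real implementation would count tokens with the embedding model’s tokenizer):

```python
def chunk_text(text: str, chunk_size=400, overlap=80):
    """Sliding-window chunking: each chunk holds up to `chunk_size` tokens
    and shares `overlap` tokens with its predecessor, so facts spanning a
    boundary appear in both chunks."""
    tokens = text.split()  # whitespace tokens as a stand-in for model tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```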
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Embedding │
│ Each chunk -> embedding provider -> vector │
│ │
│ "Discussed REST vs GraphQL" -> │
│ OpenAI/Gemini/Local -> │
│ [0.12, -0.34, 0.56, ...] (1536 dimensions) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Storage │
│ ~/.clawdbot/memory/<agentId>.sqlite │
│ │
│ Tables: │
│ - chunks (id, path, start_line, end_line, text, hash) │
│ - chunks_vec (id, embedding) -> sqlite-vec │
│ - chunks_fts (text) -> FTS5 full-text │
│ - embedding_cache (hash, vector) -> avoid re-embedding │
└─────────────────────────────────────────────────────────────┘
sqlite-vec is a SQLite extension that enables vector similarity search directly in SQLite, with no external vector database required.
FTS5 is SQLite’s built-in full-text search engine that powers the BM25 keyword matching. Together, they allow Clawdbot to run hybrid search (semantic + keyword) from a single lightweight database file.
When you search memory, Clawdbot runs two search strategies in parallel: vector search (semantic) finds content that means the same thing, while BM25 search (keyword) finds content containing the exact tokens.
The results are combined with weighted scoring:
$$\text{finalScore} = 0.7 \times \text{vectorScore} + 0.3 \times \text{textScore}$$
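A minimal sketch of this weighted merge, using the 0.7/0.3 weights above; the chunk IDs and scores are illustrative:

```python
VECTOR_WEIGHT, TEXT_WEIGHT = 0.7, 0.3  # weights from the formula above

def hybrid_rank(vector_scores: dict, text_scores: dict, top_k=3):
    """Merge the two parallel search strategies: a chunk found by only one
    strategy gets 0 from the other, then everything is ranked by the
    weighted sum."""
    ids = set(vector_scores) | set(text_scores)
    combined = {
        cid: VECTOR_WEIGHT * vector_scores.get(cid, 0.0)
             + TEXT_WEIGHT * text_scores.get(cid, 0.0)
        for cid in ids
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

vec = {"chunk-a": 0.9, "chunk-b": 0.4}   # semantic similarity scores
bm25 = {"chunk-b": 1.0, "chunk-c": 0.8}  # normalized keyword scores
print(hybrid_rank(vec, bm25))
```

This assumes both score sets are normalized to [0, 1] before merging; raw BM25 scores would need scaling first.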
Summary (TL;DR)
This article presents a systematic memory architecture for long-running agent systems, designed to address the fundamental mismatch between bounded model context and unbounded experiential memory.
We distinguish context (short-lived, costly, and bounded) from memory (persistent, unbounded, and searchable). To manage memory at scale, we organize it into four functional types:
- Working memory for immediate reasoning state
- Episodic memory for time-stamped experiences
- Semantic memory for abstracted knowledge, rules, and preferences
- Perceptual memory for raw sensory data and derived features
Each memory type is further divided into hot and cold tiers. Hot memory contains high-utility information optimized for low-latency retrieval, while cold memory preserves long-term knowledge at low cost. An archive index bridges the two, enabling semantic discovery of cold memory without loading it into context.
Retrieval follows a confidence-gated, tiered pipeline:
- Retrieve from hot memory first
- Evaluate confidence across relevance, coverage, time, authority, and consistency
- Escalate to archive indices only when needed
- Rehydrate cold memory selectively and safely
Memory creation is governed by consolidation pipelines:
- Working -> Episodic (events and outcomes)
- Episodic -> Semantic (patterns and generalizations)
- Perceptual -> Episodic/Semantic (evidence-driven summaries and rules)
This architecture ensures agents can operate indefinitely, maintaining coherence, efficiency, and explainability while avoiding context overflow, hallucination, and uncontrolled memory growth.
The design is intentionally modular and can be packaged as a reusable memory middleware for agent frameworks, offering measurable benefits in retrieval accuracy, cost efficiency, and long-term reliability.