Dear Artist — Notes by Haixiang Yuan

Building devbrief: A Local History Browser for Claude Code Terminal Sessions

Tue, 28 Apr 2026 10:00:00 GMT

Engineering Note · Open Source · Developer Tools

How a simple session summariser turned into a token-safe, local-first developer tool.

$ devbrief list → devbrief raw → devbrief estimate → devbrief brief

Local preview, no LLM call.

1. The problem

I started building devbrief because of a very practical problem.

I use Claude Code heavily in the terminal. It is where a lot of real work happens: debugging, refactoring, writing scripts, checking logs, reviewing files, and gradually moving a task from ambiguity to closure.

But after a session ends, the work becomes surprisingly hard to revisit.

Claude Code does store terminal sessions locally as JSONL transcripts under ~/.claude/projects. The problem is not that the history does not exist. The problem is that it is not pleasant to browse.

If I want to remember what happened in a previous coding session, I do not want to dig through raw JSON. I want to know:

What was the task?
What did I ask Claude to do?
What files were touched?
What commands were run?
Did the session finish?
Was it blocked?
Did it stop because of a usage limit?
Do I need to continue this later?

That was the original motivation for devbrief: a local browser for Claude Code terminal history. But the first version went in the wrong direction.

The same instinct shows up in my Claude status line work and the broader terminal workstack — I want clear visibility over what the AI is doing on my machine before I trust any of it.

2. The first wrong design

The obvious first idea was: automatically summarise every completed Claude Code session.

It sounded convenient. Every time a session ended, devbrief would read the transcript, call Claude, extract a nice summary, and store it locally. Something like:

Claude Code session ends
        ↓
devbrief reads full transcript
        ↓
Claude summarises problem → approach → outcome
        ↓
summary saved locally

On paper, that looked useful. In practice, it was the wrong default.

A Claude Code transcript is not just a clean conversation. It can contain tool calls, shell output, long logs, file contents, diffs, JSON, prompts, errors, repeated command output, and sometimes the output of previous analysis steps.

One of my sessions had this shape:

One real session

Raw transcript chars: 833,676
Compact evidence chars: 16,045
Approx input tokens: 4,011
Excluded long tool outputs: 85
Excluded full file contents: 39

Those numbers changed how I thought about the product.

The expensive part was not "summarising a task". The expensive part was blindly feeding a huge raw development transcript back into a model, sometimes repeatedly.

The first version had another problem: the hook could run automatically. That meant a tool designed to help me understand Claude Code sessions could itself start burning Claude usage in the background. That broke the product boundary.

Local evidence, not billing proof

After the first implementation, I found local evidence that the unsafe Stop hook behaviour was real. The SQLite database contained recursive self-analysis rows — sessions where devbrief had effectively analysed its own analyser prompt instead of a normal development task. The local history also showed usage-limit endings during that period.

That is not billing proof, and it does not tell me the exact number of tokens spent. But it was enough evidence to confirm the product risk: an automatic summariser can become part of the problem it is supposed to explain.

This is the same lesson I keep relearning in token cost work and in labelling every LLM call: the costs you do not see are the ones that hurt.

3. The product correction

The real product principle became clear:

Local visibility first.
AI only when explicitly asked.

That changed the entire architecture. devbrief should not be an automatic AI summariser. It should first be a local terminal history browser. The raw history browser should be useful even with no API key, no Claude quota, no hook installed, and no AI calls. AI should be an optional second layer, not the default behaviour.

So the product split into two layers. The first layer is local and zero-token:

devbrief list
devbrief raw SESSION_ID
devbrief view SESSION_ID
devbrief estimate SESSION_ID
devbrief doctor

The second layer is explicit and token-consuming:

devbrief brief SESSION_ID

That command only works on one selected session. It shows an estimate first. It asks for confirmation. Only then does it call Claude. This became the core safety model of the tool.

Local-first by default

The best AI tooling is not the one that calls AI everywhere. It is the one that knows when not to.

4. What devbrief does

devbrief is a local Claude Code terminal history browser with optional AI briefs.

The product surface is intentionally small:

Browse local Claude Code sessions, grouped by project, with raw previews that never call a model.
Read a session: deterministic outcome detection (completed, blocked, usage-limited, needs-followup) without sending anything to Claude.
Decide: estimate the cost of an AI brief locally, then optionally generate one brief for one selected session, after explicit confirmation.

Everything else is a variation of these three. The goal is not to replace Claude Code. It is to make its terminal history easier to inspect, understand, and continue from.

5. Raw history browsing

The raw preview is the most important part of the product. It does not call Claude. It does not spend tokens. It reads the local JSONL transcript and organises the useful parts into a readable view.
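A minimal sketch of the local read step, assuming only what is stated above: transcripts are JSONL files under ~/.claude/projects. The one-directory-per-project layout and the helper names are assumptions for illustration, not devbrief's actual parser.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(jsonl_path: Path) -> Iterator[dict]:
    """Yield one dict per valid JSON line, skipping blank and malformed lines."""
    with jsonl_path.open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # a partially written last line should not break browsing
            if isinstance(record, dict):
                yield record


def find_transcripts(root: Path) -> dict[str, list[Path]]:
    """Group *.jsonl files by parent directory name (assumed: one directory per project)."""
    grouped: dict[str, list[Path]] = {}
    for path in sorted(root.glob("*/*.jsonl")):
        grouped.setdefault(path.parent.name, []).append(path)
    return grouped
```

Nothing here touches the network or a model: it is file I/O and JSON parsing only, which is what keeps this layer zero-token.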

A session preview includes:

Session ID
Home project
CWD
Started / updated time
Status
Turn count
Session outcome
Human request
What happened locally
Files touched or inspected
Final assistant response

For example, one real session ended like this:

$ devbrief raw 6f9fdb83

Session Outcome
  Status      usage_limited
  Completion  incomplete
  Confidence  high
  Reason      Final closeout/verification was requested,
              but Claude Code stopped because usage
              limit was reached.
  Signals     final response contains "out of extra usage";
              late request asks for "final closeout"

This is more useful than a generic summary. It tells me not just what the session was about, but whether it actually finished.

That distinction matters. A session that ended with "done" is very different from a session that ended because usage limit was reached halfway through verification.

6. Session outcome detection

One of the features I care about most is local session outcome detection. Without calling a model, devbrief tries to infer whether a session is:

completed
incomplete
blocked
usage_limited
interrupted
needs_followup
unknown

It does this with deterministic local heuristics. For example, if the final assistant response contains "out of extra usage", "usage limit", "resets", or "rate limit", then the session can be marked as usage_limited.

If the final user request asked for a closeout report, verification, deploy status, or remaining risks, and the session ended with a usage limit message, devbrief marks the completion state as incomplete.

This is simple, but useful. It turns raw history into something closer to a task log.
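The usage-limit rule described above can be sketched as plain string matching. This is an illustration under assumptions: the phrase lists come from the text, but the Outcome shape and the detect_outcome name are hypothetical, and only this one rule is covered (the other statuses are left out).

```python
from dataclasses import dataclass

# Phrases taken from the description above; matching is case-insensitive.
USAGE_LIMIT_PHRASES = ("out of extra usage", "usage limit", "resets", "rate limit")
CLOSEOUT_PHRASES = ("closeout", "verification", "deploy status", "remaining risks")


@dataclass
class Outcome:
    status: str
    completion: str
    signals: list[str]


def detect_outcome(last_user_request: str, final_response: str) -> Outcome:
    """Classify a session from its last request and final response, with no model call."""
    request = last_user_request.lower()
    response = final_response.lower()
    signals = [f'final response contains "{p}"' for p in USAGE_LIMIT_PHRASES if p in response]
    if not signals:
        # This sketch implements only the usage-limit rule.
        return Outcome(status="unknown", completion="unknown", signals=[])
    asked = [p for p in CLOSEOUT_PHRASES if p in request]
    signals += [f'late request asks for "{p}"' for p in asked]
    completion = "incomplete" if asked else "unknown"
    return Outcome(status="usage_limited", completion=completion, signals=signals)
```

Because the rules are deterministic, the same transcript always yields the same outcome, and the recorded signals explain why.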

7. Token safety model

The most important design constraint is that browsing should never silently spend tokens. In normal use, exactly one command is meant to spend tokens.

Command	Calls Claude?	Spends tokens?	Notes
`devbrief list`	No	No	Local JSONL + SQLite only
`devbrief raw SESSION_ID`	No	No	Local preview
`devbrief view SESSION_ID`	No	No	Reads stored brief from SQLite
`devbrief estimate SESSION_ID`	No	No	Shows packet size only
`devbrief doctor`	No	No	Local diagnostics
`devbrief tui`	No	No	Local browsing
`devbrief capture --hook`	No	No	Metadata only
`devbrief brief SESSION_ID`	Yes	Yes	Only after estimate + confirmation
`devbrief digest SESSION_ID`	Yes	Yes	Deprecated alias for brief; avoid using it
`devbrief report`	No	No	Disabled compatibility stub

The old multi-session report command was disabled because it did not fit the safety model. Multi-session AI reporting can easily become expensive. The product is now intentionally narrower:

One session.
One brief.
Explicit confirmation.

AI only after confirmation

That restraint is a feature.

8. Compact evidence before AI

When an AI brief is requested, devbrief does not send the full raw transcript. It first builds a compact evidence packet.

The compactor removes or truncates:

long shell outputs
full file contents
large diffs
repeated JSON
internal analyser prompts
huge tool results

It keeps the things that matter:

human requests
final assistant response
commands run
files touched
errors and blockers
tool names
session metadata
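A rough sketch of the keep/trim split listed above. The event shape (dicts with "kind" and "text"), the kind names, and the 400-character limit are all assumptions made for illustration; devbrief's real compactor and its max_chars handling may differ.

```python
MAX_TRIMMED_CHARS = 400  # assumed limit, not devbrief's real setting

KEEP_KINDS = {"human_request", "assistant_final", "command", "error"}
TRIM_KINDS = {"tool_output", "file_content", "diff"}


def compact(events: list[dict]) -> tuple[list[dict], int]:
    """Return (compacted events, number of entries that were truncated)."""
    compacted = []
    truncated = 0
    for event in events:
        kind = event.get("kind")
        text = event.get("text", "")
        if kind in KEEP_KINDS:
            compacted.append({"kind": kind, "text": text})
        elif kind in TRIM_KINDS:
            if len(text) > MAX_TRIMMED_CHARS:
                text = text[:MAX_TRIMMED_CHARS] + " [truncated]"
                truncated += 1
            compacted.append({"kind": kind, "text": text})
        # Any other kind (for example internal analyser prompts) is dropped.
    return compacted, truncated
```

Dropping unknown kinds by default is the conservative choice here: new transcript noise stays out of the packet until it is explicitly allowed in.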

Then it shows an estimate. The estimate itself never calls Claude:

$ devbrief estimate 6f9fdb83

Token estimate for session 6f9fdb83

  Raw transcript chars       833,676
  Compact evidence chars      16,045
  Approx input tokens          4,011
  Truncated                   yes

No LLM call has been made.
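The figures in this estimate are consistent with a four-characters-per-token rule of thumb (16,045 / 4 is roughly 4,011). Whether devbrief computes it exactly this way is an assumption; the sketch below only shows that the estimate needs arithmetic, not a model or tokenizer.

```python
def estimate_tokens(compact_chars: int, chars_per_token: int = 4) -> int:
    """Rough input-token estimate from a character count."""
    return compact_chars // chars_per_token


def reduction_ratio(raw_chars: int, compact_chars: int) -> float:
    """Fraction of the raw transcript that the compact packet leaves out."""
    return 1 - compact_chars / raw_chars


if __name__ == "__main__":
    print(estimate_tokens(16_045))
    print(f"{reduction_ratio(833_676, 16_045):.1%}")
```

For the session above, the compact packet is about 2% of the raw transcript by character count.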

Only when you explicitly run devbrief brief does devbrief ask before spending tokens:

$ devbrief brief 6f9fdb83

Session: glia-core/6f9fdb83

  Compact evidence chars : 16,045
  Approx input tokens    : 4,011
  ⚠ Transcript was truncated to fit max_chars

Generate brief and spend Claude tokens? [y/n] (n):

The default is n. That default matters.
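A default-no gate can be sketched in a few lines of standard-library Python. The function name and the injectable reader are hypothetical, chosen so the behaviour is easy to test; this is not devbrief's actual prompt code.

```python
from typing import Callable


def confirm_spend(prompt: str, read: Callable[[str], str] = input) -> bool:
    """Return True only for an explicit yes; empty input, EOF, or anything else means no."""
    try:
        answer = read(f"{prompt} [y/n] (n): ")
    except EOFError:
        return False  # non-interactive contexts (hooks, pipes) never spend tokens
    return answer.strip().lower() in {"y", "yes"}
```

Pressing Enter, closing stdin, or typing anything other than y or yes all resolve to the safe outcome, so an accidental keypress or a background process cannot trigger a model call.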

9. The optional hook

Claude Code supports hooks, but devbrief treats them carefully. The optional hook is capture-only:

devbrief capture --hook

It records lightweight metadata, such as:

session_id
jsonl_path
project_name
cwd
created_at
updated_at
turn_count
status = pending/raw

It does not call Claude. It does not generate a brief. It does not spend tokens.

Unsafe hook patterns are explicitly avoided. They are shown here as anti-patterns, not recommended commands:

devbrief digest --hook
devbrief brief --hook
claude -p
claude --print

The hook is not required for browsing. devbrief can read Claude Code's local JSONL history directly. This means the safest default is:

No hook installed.
Browse locally.
Generate AI only when needed.

Capture-only hook

The hook captures metadata into the local SQLite store and nothing more. No transcript content leaves the machine, and Claude is never invoked behind your back.
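A minimal sketch of a metadata-only capture using the standard sqlite3 module. The column names follow the field list above; the table name, the schema details, and the upsert behaviour are assumptions, not devbrief's real schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id   TEXT PRIMARY KEY,
    jsonl_path   TEXT NOT NULL,
    project_name TEXT,
    cwd          TEXT,
    created_at   TEXT,
    updated_at   TEXT,
    turn_count   INTEGER,
    status       TEXT NOT NULL DEFAULT 'pending/raw'
)
"""


def capture(conn: sqlite3.Connection, meta: dict) -> None:
    """Upsert session metadata only; transcript content is never read or stored here."""
    conn.execute(SCHEMA)
    conn.execute(
        """
        INSERT INTO sessions (session_id, jsonl_path, project_name, cwd,
                              created_at, updated_at, turn_count)
        VALUES (:session_id, :jsonl_path, :project_name, :cwd,
                :created_at, :updated_at, :turn_count)
        ON CONFLICT(session_id) DO UPDATE SET
            updated_at = excluded.updated_at,
            turn_count = excluded.turn_count
        """,
        meta,
    )
    conn.commit()
```

The upsert leaves an existing status untouched, so a repeated hook call on an already briefed session would refresh its counters without resetting it to pending/raw.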

10. Interactive terminal browser

devbrief also includes an interactive terminal UI:

devbrief
# or
devbrief tui

The interface is a keyboard-first split-pane browser. The left pane shows sessions:

$ devbrief — sessions

ID         Status        Outcome          Project        Title
────────   ───────────   ──────────────   ────────────   ──────────────────────────
6f9fdb83   pending/raw   usage_limited    glia-core      quote provenance bug
e5aab70b   briefed       completed        glia-core      Fix LLM-fabricated prose
6534fc72   briefed       completed        devbrief       Build devbrief CLI

The right pane shows the selected session — outcome, request, what happened locally, files touched, and the final assistant response. If an AI brief already exists, the detail pane can show that instead.

TUI keybindings — local, fast, zero-token.

The important part is that opening and navigating the TUI is still local-only. It does not call Claude just because I browse around.

Key bindings include:

j / k or ↑ / ↓    move selection
Enter             open / focus detail
v                 toggle raw preview / AI brief
d or b            generate brief, after estimate + confirmation
r                 refresh
a                 toggle current project / all projects
?                 help
q                 quit

The interface is meant to make previous Claude Code work feel navigable, not buried inside JSONL files.

11. Architecture

The architecture is intentionally simple.

Claude Code JSONL transcripts
~/.claude/projects
        ↓
devbrief parser
        ↓
local raw preview
        ↓
session outcome detector
        ↓
SQLite metadata + stored briefs
        ↓
optional AI brief after confirmation

Local paths

Claude transcripts: ~/.claude/projects
devbrief config: ~/.config/devbrief/config.toml
devbrief database: ~/.local/share/devbrief/sessions.db

The stack is deliberately lightweight:

Python
Click for CLI commands
Rich for terminal output
Textual for the interactive TUI
SQLite for local metadata and stored briefs

The tool does not need a backend. It does not need a hosted service. It does not need a database server. It sits next to Claude Code and helps me see what already exists locally.

12. Example workflow

A normal pass through devbrief starts with local browsing and only reaches AI at the final, confirmation-gated step.

Step 1 — Browse · zero-token

$ devbrief list

glia-core
  6f9fdb83  2026-04-25 19:42   42 turns   usage_limited
  9c12ab07  2026-04-23 11:08   18 turns   completed

dear-artist
  4a3e51d2  2026-04-22 09:14   31 turns   completed
  2b8f0e6c  2026-04-20 22:01   7 turns    needs_followup

4 sessions across 2 projects.

Local listing of recent Claude Code sessions, grouped by project. No model called.

Step 2 — Read · zero-token

$ devbrief raw 6f9fdb83

Session: glia-core/6f9fdb83
Started:  2026-04-25 17:03    Ended: 2026-04-25 19:42
Turns:    42                  Outcome: usage_limited

— last user request —
Can you give me a closeout report on the migration: what landed,
what is still open, and what to verify before deploy?

— last assistant response —
You are out of extra usage. Limits reset in 2 hours.

No LLM call has been made.

Deterministic preview built from the local JSONL. No tokens spent.

Step 3 — Estimate · zero-token

$ devbrief estimate 6f9fdb83

Token estimate for session 6f9fdb83

  Raw transcript chars       833,676
  Compact evidence chars      16,045
  Approx input tokens          4,011
  Truncated                   yes

No LLM call has been made.

Estimate the cost of an AI brief locally, before deciding to spend tokens.

Step 4 — Brief · confirmation-gated

$ devbrief brief 6f9fdb83

Session: glia-core/6f9fdb83

  Compact evidence chars : 16,045
  Approx input tokens    : 4,011
  ⚠ Transcript was truncated to fit max_chars

Generate brief and spend Claude tokens? [y/n] (n):

This is the only command meant to spend tokens (devbrief digest is a deprecated alias for it). It always asks first; the default is no.

This keeps the default path local and cheap, while still allowing AI to be used when it actually adds value.

13. From experiment to open source

After the core product model stabilised, I prepared the project for GitHub. The bigger work was language, not code: devbrief is positioned as a local history browser with optional AI briefs, not an automatic summarisation tool. That difference is the whole product.

The repository is public at https://github.com/yuannh/devbrief. The README covers installation, the full command reference, hook safety notes, where local data is stored, and usage examples. This note stays focused on the design decisions; the README handles the manual.

Quickstart

Clone the repository.
Install in editable mode.
Verify the local setup with devbrief doctor.

$ git clone https://github.com/yuannh/devbrief.git
$ cd devbrief
$ pip install -e .
$ devbrief doctor

devbrief doctor runs local health checks only — no network calls, no API key required.

14. Reflection

devbrief started as a convenience tool, but the real lesson was product restraint.

In AI tools, the tempting default is to send everything back to a model: transcripts, logs, diffs, files, prompts, command output. It feels intelligent, but it can quietly create cost, latency, and trust problems.

The better default is local visibility first, AI only when it adds clear value.

That decision shaped every command, every flag, and every line of the safety table. I started by trying to summarise more. I ended by building a tool that summarises less, but lets me see better.

Restraint is a feature.

Originally published at: https://dearartist.xyz/blog/devbrief-local-history-browser

Label Every LLM Call: How We Cut AI Backend Cost Without Downgrading Quality

Mon, 27 Apr 2026 10:00:00 GMT

A real backend refactor from one-model-fits-all dispatch to per-callsite LLM routing, prompt caching, and token-level observability.

Abstract

"The problem was not that we were calling LLMs too often. The problem was that every task was routed through the same expensive model tier."

Key changes — at a glance

Change	Est. impact
Per-callsite LLM routing across 30 callsites	structural
13 extraction/classification calls → Flash-class	~$3–5/day
Profile summaries → Haiku-class	~$0.5–1/day
Chat output cap 4096 → 2048 tokens	~$0.5–1/day
Anthropic prompt caching on stable prefixes	~$0.3–2/day
Token-level logging on every callsite	observability
Pinned quality paths before flipping primary	no regression
Total expected QA saving	~$4.7–10/day

QA baseline ~$10–13/day → target ~$2–5/day. Pending 24h billing validation.

Opening

In the early days of building an AI-powered product, it is tempting to wire everything to your best model and ship.

The quality is great. Iteration is fast. The product feels smarter. Cost can feel like a future problem, especially when you are still at MVP scale.

Then you check the API bill.

In our QA environment, LLM costs were running around $10–13 per day. Not production. Not a high-traffic period. Just a testing environment with a handful of active users.

At first glance, nothing looked obviously broken. There was no runaway loop. No duplicate worker. No infinite retry storm. The infrastructure was behaving normally.

The problem was more structural:

We had one expensive model acting as the default for almost every LLM task, and we had never priced the intent of each call.

Chat, narrative composition, entity extraction, memory patching, story validation, thread routing, profile summaries — everything was flowing through the same primary provider.

Some of those calls needed a high-quality narrative model. Most did not.

This post walks through how we audited 30 LLM callsites, introduced per-callsite model routing, added token observability, and cut expected QA cost from roughly $10–13/day to a target of $2–5/day (pending billing validation), without changing product logic or downgrading user-visible quality.

This post is a real engineering note from building Glia, a personal AI for reflection. The specific numbers come from our QA environment during MVP development, but the pattern is broadly applicable to any AI backend with multiple LLM tasks.

1. The Product Context

Glia is a personal AI for reflection. Behind the conversational surface, several backend agents work together (chat, longer-form story composition, memory patching, people-card extraction, thread routing), each one an LLM call with different quality and cost requirements. As a conversation unfolds, the backend:

extracts people, places, media, and other entities from messages
patches long-term memory
routes messages into topic threads
creates cards and relationship signals
periodically composes narrative stories from those threads
builds a user narrative model used to personalize future responses

The backend already had a unified LLM dispatch function:

result, meta = generate_text_with_failover(
    prompt=prompt,
    where="entity_extract",
    timeout_s=20.0,
    temperature=0.0,
    max_output_tokens=512,
    response_mime_type="application/json",
    context=ctx,
)

The important field here is where.

It was originally used for logging and error attribution. It told us which part of the system made the LLM call:

entity_extract
memory_patch
story_compose
theme_chapter_delta
entity_description
story_gate
cards_extract

But where did not control routing.

The routing was global.

APP_PRIMARY_PROVIDER=anthropic

That meant almost every generic LLM call resolved to a Sonnet-class model first.

This was fine for chat and narrative composition. It was wasteful for structured extraction.

2. The Real Problem: One Model for Every Job

The issue was not simply "too many LLM calls."

The issue was task-model mismatch.

A quality-critical narrative call and a deterministic JSON extraction call were using the same default model.

For example:

# narrative composition — quality matters
text, meta = generate_text_with_failover(
    prompt=compose_prompt,
    where="story_compose",
    timeout_s=55.0,
    temperature=0.7,
    max_output_tokens=2000,
    context=ctx,
)

This kind of task deserves a strong model. It is user-visible, tone-sensitive, and narrative-heavy.

But this was also using the same default provider:

# entity extraction — structured JSON
text, meta = generate_text_with_failover(
    prompt=extraction_prompt,
    where="entity_extract",
    timeout_s=20.0,
    temperature=0.0,
    max_output_tokens=512,
    response_mime_type="application/json",
    context=ctx,
)

This task is not asking the model to write beautifully. It is asking for structured extraction under a schema.

Same dispatcher. Same global model. Very different requirements.

That was the core cost bug.

3. The Audit: 30 LLM Callsites

The first step was not optimization. It was inventory.

We searched the codebase for every call to:

generate_text_with_failover(...)

Then we built a simple table.

For each callsite, we asked:

What does this call do?
Is the output user-visible?
Is it free-form prose or structured JSON?
Does quality materially affect product experience?
Can failure be retried later?
Is this a background task?
Does this need a premium reasoning/writing model?

That gave us 30 callsites.

They naturally collapsed into three tiers.

4. Provider Tier Architecture

We defined the target architecture as a tiered routing model.

L1 — Narrative Generation (Sonnet-class model)

narrative_compose
collection_story_delta
user_context_model
timeline_summary
event_split
copyedit / polish / retry
onboarding_story

L2 — Structured Writing (Haiku-class model)

profile_summary generation

L3 — Extraction / Classification (Flash-class model)

entity_extract
entity_admission
memory_fact_extract
card_extract
quality_gate
thread_router
semantic_referee
soft_links
relationship_extract

Chat — independent path: stream_chat controlled by APP_CHAT_PROVIDER.

The design principle was simple:

APP_PRIMARY_PROVIDER should not be the business routing mechanism.

Instead:

quality-critical paths are explicitly pinned to the high-quality model
structured writing uses a cheaper writing-capable model
extraction/classification uses a cheaper fast model
global primary becomes a fallback/default, not a business decision

This distinction matters. The full per-callsite dispatch — and the deployment sequence that has to follow it — is described in the next sections.

5. The Routing Layer: Per-Callsite Dispatch

We already had a label for each LLM call: where.

So we reused it.

The new convention:

APP_LLM_PROVIDER_<WHERE>
APP_LLM_MODEL_<WHERE>

Examples:

APP_LLM_PROVIDER_STORY_COMPOSE=anthropic
APP_LLM_PROVIDER_ENTITY_EXTRACT=gemini
APP_LLM_PROVIDER_MEMORY_PATCH=gemini
APP_LLM_PROVIDER_ENTITY_DESCRIPTION=anthropic
APP_LLM_MODEL_ENTITY_DESCRIPTION=claude-haiku-4-5-20251001

The suffix is derived by uppercasing the where string and replacing non-alphanumeric characters with underscores.

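import os
import re

def _where_env_key(where: str) -> str:
    # "story_compose" -> "STORY_COMPOSE"; runs of non-alphanumerics become one underscore
    return re.sub(r"[^a-zA-Z0-9]+", "_", where).strip("_").upper()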

def _per_where_provider(where: str) -> str | None:
    key = f"APP_LLM_PROVIDER_{_where_env_key(where)}"
    val = (os.getenv(key) or "").strip().lower()
    return val if val in {"anthropic", "gemini", "openai"} else None

def _per_where_model(where: str) -> str | None:
    key = f"APP_LLM_MODEL_{_where_env_key(where)}"
    val = (os.getenv(key) or "").strip()
    return val or None

Examples:

where value	Provider env var
story_compose	APP_LLM_PROVIDER_STORY_COMPOSE
entity_extract	APP_LLM_PROVIDER_ENTITY_EXTRACT
theme_chapter_delta	APP_LLM_PROVIDER_THEME_CHAPTER_DELTA
entity_description	APP_LLM_PROVIDER_ENTITY_DESCRIPTION + APP_LLM_MODEL_ENTITY_DESCRIPTION

The routing flow became:

generate_text_with_failover(where='entity_extract')
Check APP_LLM_PROVIDER_ENTITY_EXTRACT — if set (gemini), use [gemini, global_secondary]
If unset, check global failover circuit breaker
Circuit open → use global secondary; closed → use APP_PRIMARY_PROVIDER
Call LLM
On per-callsite failure: log warning, fall through to secondary
On global-primary failure: update circuit breaker, fall through to secondary
Return text + metadata

A simplified version of the dispatch logic looks like this:

def generate_text_with_failover(*, prompt, where="unknown", ...) -> tuple[str, dict]:
    per_where_provider = _per_where_provider(where)
    per_where_model = _per_where_model(where)

    decision = choose_provider(kind="text", request_id=request_id)
    global_primary = decision.primary
    global_secondary = decision.secondary

    if per_where_provider:
        providers = [per_where_provider]
        if per_where_provider != global_secondary:
            providers.append(global_secondary)
    else:
        providers = [decision.provider]
        if decision.provider == global_primary:
            providers = [global_primary, global_secondary]

    for provider in providers:
        explicit_model = (
            per_where_model
            if per_where_provider and provider == per_where_provider
            else None
        )
        model, fallback_models = model_chain_from_env(
            provider=provider,
            explicit_primary=explicit_model,
        )
        try:
            text = generate_text_with_fallback(
                provider=provider,
                model=model,
                fallback_models=fallback_models,
                prompt=prompt,
                ...
            )
            if provider == global_primary:
                record_primary_success(request_id=request_id, where=where)
            return text, {
                "provider": provider,
                "model": model,
                "per_where_override": per_where_provider is not None,
                "latency_ms": int((time.perf_counter() - started_at) * 1000),
            }
        except Exception as exc:
            if provider == global_primary and not per_where_provider:
                if should_failover(exc):
                    record_primary_failure(request_id=request_id, where=where)
                continue
            if per_where_provider and provider == per_where_provider:
                log.warning(
                    "llm_per_where_primary_failed provider=%s where=%s",
                    provider,
                    where,
                )
                continue
            raise

Two decisions mattered here.

Circuit breaker isolation

A Gemini failure on entity_extract should not globally trip failover for story composition. Per-callsite provider failures are logged, but they do not update the global Redis circuit breaker.

Secondary provider fallback

L3 tasks are Gemini-first, not necessarily Gemini-only. If Gemini fails, the call can still fall back to the global secondary provider. This preserves reliability. If a product wants strict cost caps later, this can be tightened per callsite.

6. Supporting Optimization 1: Independent Chat Provider

Chat streaming used a separate path, not the batch generate_text_with_failover path.

That meant it needed its own provider pin:

chosen = (
    llm_provider
    or os.getenv("APP_CHAT_PROVIDER")
    or ""
).strip().lower() or None

Then the environment can say:

APP_CHAT_PROVIDER=anthropic
APP_CHAT_MAX_OUTPUT_TOKENS=2048

This keeps chat on the high-quality model even if:

APP_PRIMARY_PROVIDER=gemini

That separation is important. Chat is product-critical. It should not accidentally follow a global background-job optimization.

7. Supporting Optimization 2: Prompt Caching

Every chat request included a large system prompt: product voice, behavioral contract, memory instructions, and response policy.

A big part of that prompt was stable across consecutive turns. So we enabled Anthropic prompt caching for the system prompt:

if system_prompt:
    kwargs["system"] = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]

The point is not to assume caching always works.

The cache is ephemeral and TTL-bound. Long idle periods, changing prompt prefixes, or moving cache boundaries can all reduce hit rate.

So we also logged:

cache_creation_input_tokens
cache_read_input_tokens

Do not assume prompt caching savings. Measure them.
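Measuring can be as simple as aggregating the two logged fields against total prompt tokens. The sketch assumes each logged row carries the usage fields named above, and that input_tokens counts only uncached tokens, so the three fields add up to the full prompt size; the function name and the sample rows in the test are illustrative only.

```python
def cache_read_ratio(usage_rows: list[dict]) -> float:
    """Share of prompt tokens served from cache across logged calls (0.0 if nothing logged)."""
    fresh = sum(row.get("input_tokens") or 0 for row in usage_rows)
    written = sum(row.get("cache_creation_input_tokens") or 0 for row in usage_rows)
    read = sum(row.get("cache_read_input_tokens") or 0 for row in usage_rows)
    total = fresh + written + read
    return read / total if total else 0.0
```

A ratio near zero over a day of chat traffic would suggest the cached prefix is changing between turns or expiring before reuse, in which case the expected saving is not materialising.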

8. Supporting Optimization 3: Token Usage Logging

Before this refactor, we had logs that said an LLM call happened. But we did not have reliable token usage per callsite. That turned cost analysis into guesswork.

We added token logging on both paths.

Sync path

resp = client.messages.create(
    model=model,
    messages=messages,
    ...
)
usage = getattr(resp, "usage", None)
if usage:
    log.info("llm_usage", extra={
        "provider": "anthropic",
        "model": model,
        "input_tokens": getattr(usage, "input_tokens", None),
        "output_tokens": getattr(usage, "output_tokens", None),
        "cache_creation_input_tokens": getattr(
            usage,
            "cache_creation_input_tokens",
            None,
        ),
        "cache_read_input_tokens": getattr(
            usage,
            "cache_read_input_tokens",
            None,
        ),
    })

Streaming path

Chat streaming is different. Usage data arrives across stream events.

input_tokens = None
output_tokens = None
cache_creation_input_tokens = None
cache_read_input_tokens = None

async for event in stream:
    if event.type == "message_start":
        usage = getattr(getattr(event, "message", None), "usage", None)
        if usage:
            input_tokens = getattr(usage, "input_tokens", None)
            cache_creation_input_tokens = getattr(
                usage,
                "cache_creation_input_tokens",
                None,
            )
            cache_read_input_tokens = getattr(
                usage,
                "cache_read_input_tokens",
                None,
            )
    elif event.type == "content_block_delta" and hasattr(event.delta, "text"):
        text = event.delta.text
        if text:
            yield TextDelta(type="text_delta", text=text)
    elif event.type == "message_delta":
        usage = getattr(event, "usage", None)
        if usage:
            output_tokens = getattr(usage, "output_tokens", None)

log.info("llm_usage_stream", extra={
    "provider": "anthropic",
    "model": cfg.model,
    "input_tokens": input_tokens,
    "output_tokens": output_tokens,
    "cache_creation_input_tokens": cache_creation_input_tokens,
    "cache_read_input_tokens": cache_read_input_tokens,
})

This made the biggest cost path — chat — observable.

9. Supporting Optimization 4: Configurable Output Caps

Two output caps were too large by default:

# Before
max_tokens = 4096              # chat
max_output_tokens = 8192       # narrative model

Those values were not based on observed output length. They were "safe" defaults.

Safe defaults can be expensive defaults.

We made both configurable:

max_tokens = cfg.max_output_tokens or int(
    os.getenv("APP_CHAT_MAX_OUTPUT_TOKENS", "2048")
)

max_output_tokens = int(
    os.getenv("APP_NARRATIVE_MODEL_MAX_OUTPUT_TOKENS", "4096")
)

In QA, we used:

APP_CHAT_MAX_OUTPUT_TOKENS=2048
APP_NARRATIVE_MODEL_MAX_OUTPUT_TOKENS=3000

Chat responses rarely needed 4096 tokens. The narrative model output fit comfortably within the lower cap.
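Once output_tokens is logged per call, a cap can be derived from observed lengths instead of a guess. The sketch below is one possible approach, not the method used for the values above: the percentile, the headroom factor, and the suggest_cap name are all assumptions.

```python
import math


def suggest_cap(
    observed_output_tokens: list[int],
    percentile: float = 0.99,
    headroom: float = 1.25,
) -> int:
    """Pick a cap from logged output lengths: the chosen percentile plus headroom."""
    if not observed_output_tokens:
        raise ValueError("need at least one observation")
    ordered = sorted(observed_output_tokens)
    # Nearest-rank percentile, clamped to the last element.
    index = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return math.ceil(ordered[index] * headroom)
```

The headroom matters because a cap that sits exactly at typical output length risks truncating the occasional longer, legitimate response.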

10. Supporting Optimization 5: Disable a Broken Feature

One feature was enabled but failing on every call:

APP_MEMORY_READ_PLAN_LLM_ENABLED=1

It was not a major billing driver because failed requests were rejected, but it added latency and noise to every chat turn.

We disabled it:

APP_MEMORY_READ_PLAN_LLM_ENABLED=0

11. The Fallback Escalation Trap

This was the easiest bug to miss.

We wanted profile summary generation to use a Haiku-class model:

APP_LLM_PROVIDER_ENTITY_DESCRIPTION=anthropic
APP_LLM_MODEL_ENTITY_DESCRIPTION=claude-haiku-4-5-20251001

But what happens when Haiku fails?

That depends on APP_ANTHROPIC_MODEL_FALLBACKS.

If fallbacks include a Sonnet-class model, then the call silently escalates:

Haiku fails → Sonnet fallback → successful response → larger bill

That defeats the point of routing the call to Haiku.

The safe case is:

APP_ANTHROPIC_MODEL_FALLBACKS=claude-haiku-4-5-20251001

The model chain resolver filters out fallback models that equal the primary model:

primary   = claude-haiku-4-5-20251001
fallbacks = [claude-haiku-4-5-20251001]

after filtering:
fallbacks = []

So if Haiku fails, the background task fails rather than escalating to Sonnet.

For this specific task, that is acceptable. A profile summary can retry later. Unexpected premium-model spend is worse.

We added tests for both cases:

def test_entity_description_no_sonnet_escalation(monkeypatch):
    monkeypatch.setenv(
        "APP_ANTHROPIC_MODEL_FALLBACKS",
        "claude-haiku-4-5-20251001",
    )
    primary, fallbacks = model_chain_from_env(
        provider="anthropic",
        explicit_primary="claude-haiku-4-5-20251001",
    )
    assert primary == "claude-haiku-4-5-20251001"
    assert fallbacks == []

And the risky case:

def test_entity_description_sonnet_escalation_risk_documented(monkeypatch):
    monkeypatch.setenv(
        "APP_ANTHROPIC_MODEL_FALLBACKS",
        "claude-sonnet-4-20250514",
    )
    primary, fallbacks = model_chain_from_env(
        provider="anthropic",
        explicit_primary="claude-haiku-4-5-20251001",
    )
    assert primary == "claude-haiku-4-5-20251001"
    assert "claude-sonnet-4-20250514" in fallbacks

When you introduce model tiers, trace the entire fallback chain. A fallback configuration that was safe for a uniform-model setup can create expensive surprises in a tiered setup.
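
The filtering rule in this section fits in a few lines. This sketch illustrates the behaviour described above and is not the project's `model_chain_from_env`. The names `resolve_model_chain` and `unexpected_fallbacks`, and the comma-separated format of the fallback variable, are assumptions.

```python
from typing import Iterable, List, Tuple


def resolve_model_chain(primary: str, fallbacks_csv: str) -> Tuple[str, List[str]]:
    """Return (primary, fallbacks) with empty, duplicate, and primary entries removed."""
    fallbacks: List[str] = []
    for item in fallbacks_csv.split(","):
        model = item.strip()
        if model and model != primary and model not in fallbacks:
            fallbacks.append(model)
    return primary, fallbacks


def unexpected_fallbacks(fallbacks: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Return fallbacks outside the allowed set, i.e. possible silent escalations."""
    allowed_set = set(allowed)
    return [model for model in fallbacks if model not in allowed_set]
```

Running `unexpected_fallbacks` at startup for each cheap callsite is one way to turn a misconfigured chain into a visible warning instead of a larger bill.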

12. Deployment: Why Order Matters

The dangerous step was not adding the routing code.

The dangerous step was flipping:

APP_PRIMARY_PROVIDER=gemini

If you do that before pinning quality-critical paths, story generation and theme generation can silently move to a cheaper model.

That is the wrong kind of cost optimization.

The deployment sequence mattered.

Deployment order:

1. Deploy routing code
2. Set memory/narrative config
3. Pin L1 quality paths to Sonnet
4. Pin profile summaries to Haiku
5. Pin extraction to Flash
6. Verify env inside containers
7. Flip APP_PRIMARY_PROVIDER=gemini
8. Recreate containers
9. Verify health and logs

The final routing looked like this:

Chat                         → Anthropic / Sonnet-class
Story generation             → Anthropic / Sonnet-class
Theme generation             → Anthropic / Sonnet-class
Narrative model              → Anthropic / Sonnet-class
Onboarding story             → Anthropic / Sonnet-class
Profile summaries            → Anthropic / Haiku-class
Entity extraction            → Gemini    / Flash-class
Memory patching              → Gemini    / Flash-class
Card extraction              → Gemini    / Flash-class
Story gates                  → Gemini    / Flash-class
Thread routing               → Gemini    / Flash-class
Semantic referee             → Gemini    / Flash-class
Soft links                   → Gemini    / Flash-class
Global primary fallback      → Gemini
Global secondary fallback    → Anthropic
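
Most of this table reduces to a small lookup: a per-callsite override wins, otherwise the global primary applies. The sketch below assumes the override variable is `APP_LLM_PROVIDER_` plus the upper-cased `where` label, which matches the variable names used in this note. `resolve_provider`, `KNOWN_PROVIDERS`, and the `anthropic` default are illustrative assumptions, not the project's dispatcher. Chat is pinned through a separate `APP_CHAT_PROVIDER` variable and is not covered here.

```python
from typing import Mapping, Tuple

KNOWN_PROVIDERS = {"anthropic", "gemini"}


def resolve_provider(where: str, env: Mapping[str, str]) -> Tuple[str, bool]:
    """Return (provider, per_where_override) for a callsite label."""
    global_primary = (env.get("APP_PRIMARY_PROVIDER") or "anthropic").strip().lower()
    override = (env.get(f"APP_LLM_PROVIDER_{where.upper()}") or "").strip().lower()
    if override in KNOWN_PROVIDERS:
        return override, True
    # Unset or unrecognised override: keep the global provider.
    return global_primary, False
```

With this shape, flipping `APP_PRIMARY_PROVIDER` only moves callsites that have no explicit pin, which is why the pins have to exist before the flip.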

13. Docker Compose: .env Is Not Enough

A deployment detail caught us.

We added all the new variables to .env, restarted containers, and expected them to appear.

They did not.

Why?

Because Docker Compose only injects variables into a container if they are referenced in the service's environment: block or passed through an env_file.

The .env file alone is used for Compose interpolation. It does not automatically expose every variable to the container.

So this was not enough:

# .env
APP_LLM_PROVIDER_ENTITY_EXTRACT=gemini
APP_LLM_PROVIDER_STORY_COMPOSE=anthropic

We also had to update docker-compose.yml:

services:
  api:
    environment:
      - APP_PRIMARY_PROVIDER=${APP_PRIMARY_PROVIDER:-anthropic}
      - APP_CHAT_PROVIDER=${APP_CHAT_PROVIDER:-}
      - APP_CHAT_MAX_OUTPUT_TOKENS=${APP_CHAT_MAX_OUTPUT_TOKENS:-2048}
      - APP_LLM_PROVIDER_STORY_COMPOSE=${APP_LLM_PROVIDER_STORY_COMPOSE:-}
      - APP_LLM_PROVIDER_ENTITY_EXTRACT=${APP_LLM_PROVIDER_ENTITY_EXTRACT:-}
      - APP_LLM_PROVIDER_ENTITY_DESCRIPTION=${APP_LLM_PROVIDER_ENTITY_DESCRIPTION:-}
      - APP_LLM_MODEL_ENTITY_DESCRIPTION=${APP_LLM_MODEL_ENTITY_DESCRIPTION:-}
      # ... repeated for each callsite

  worker:
    environment:
      - APP_PRIMARY_PROVIDER=${APP_PRIMARY_PROVIDER:-anthropic}
      - APP_CHAT_PROVIDER=${APP_CHAT_PROVIDER:-}
      - APP_CHAT_MAX_OUTPUT_TOKENS=${APP_CHAT_MAX_OUTPUT_TOKENS:-2048}
      - APP_LLM_PROVIDER_STORY_COMPOSE=${APP_LLM_PROVIDER_STORY_COMPOSE:-}
      - APP_LLM_PROVIDER_ENTITY_EXTRACT=${APP_LLM_PROVIDER_ENTITY_EXTRACT:-}
      - APP_LLM_PROVIDER_ENTITY_DESCRIPTION=${APP_LLM_PROVIDER_ENTITY_DESCRIPTION:-}
      - APP_LLM_MODEL_ENTITY_DESCRIPTION=${APP_LLM_MODEL_ENTITY_DESCRIPTION:-}
      # ... repeated for each callsite

We first made this change directly on the QA server to unblock deployment.

Then we caught the problem: the server had a modified docker-compose.yml that was not committed to the repo.

That is configuration drift.

We moved the change back into the repository, committed it, pushed it, pulled it on QA, and recreated the containers from the committed Compose file.

The lesson:

Infrastructure state that exists only on the server is a missing commit waiting to cause an incident.
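
A small check run inside the container makes this verification repeatable. The sketch below is one possible shape for it. `missing_env` and the `REQUIRED` list are hypothetical, and the list should mirror whatever the Compose file is expected to pass through. Because the Compose entries above use `${VAR:-}`, a variable absent from .env reaches the container as an empty string, so the check treats empty as missing.

```python
from typing import Iterable, List, Mapping

# Hypothetical list; keep it in sync with the environment: block in Compose.
REQUIRED = [
    "APP_PRIMARY_PROVIDER",
    "APP_CHAT_MAX_OUTPUT_TOKENS",
    "APP_LLM_PROVIDER_STORY_COMPOSE",
    "APP_LLM_PROVIDER_ENTITY_EXTRACT",
]


def missing_env(required: Iterable[str], environ: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not (environ.get(name) or "").strip()]
```

Saved as a script that calls `missing_env(REQUIRED, os.environ)` and exits non-zero when the list is not empty, this could be run with something like `docker compose exec api python check_env.py` (the script name is an assumption) for each service before the provider flip.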

14. What We Did Not Change

This was a cost refactor, not a product behavior refactor.

We did not:

change product prompts
lower chat quality
lower story generation quality
remove failover
ask business logic to know about model providers
make extraction tasks hard-fail on the first provider error
change user-facing memory or story logic
flip the global provider until quality-sensitive callsites were pinned

That constraint mattered.

The goal was not "use cheaper models everywhere."

The goal was:

Use the expensive model where it buys product quality, and stop using it where it does not.

15. Tests

We added tests for:

chat provider pinning
chat max token env override
narrative model max token env override
prompt caching system prompt format
sync token usage logging
streaming token usage logging
per-callsite provider override
per-callsite model override
invalid provider override fallback
no-override behavior preserving global provider
Haiku profile summary fallback safety
Sonnet escalation risk if fallback env is misconfigured

A few examples:

def test_per_where_routes_entity_extract_to_gemini(monkeypatch):
    monkeypatch.setenv("APP_PRIMARY_PROVIDER", "anthropic")
    monkeypatch.setenv("APP_LLM_PROVIDER_ENTITY_EXTRACT", "gemini")

    captured = []

    def fake_generate_text_with_fallback(*, provider, **kwargs):
        captured.append(provider)
        return "ok"

    monkeypatch.setattr(
        failover,
        "generate_text_with_fallback",
        fake_generate_text_with_fallback,
    )

    text, meta = generate_text_with_failover(
        prompt="extract entities",
        where="entity_extract",
        timeout_s=10,
        temperature=0,
        max_output_tokens=512,
    )

    assert text == "ok"
    assert captured[0] == "gemini"
    assert meta["per_where_override"] is True

def test_per_where_routes_story_compose_to_anthropic(monkeypatch):
    monkeypatch.setenv("APP_PRIMARY_PROVIDER", "gemini")
    monkeypatch.setenv("APP_LLM_PROVIDER_STORY_COMPOSE", "anthropic")

    captured = []

    def fake_generate_text_with_fallback(*, provider, **kwargs):
        captured.append(provider)
        return "ok"

    monkeypatch.setattr(
        failover,
        "generate_text_with_fallback",
        fake_generate_text_with_fallback,
    )

    text, meta = generate_text_with_failover(
        prompt="write a story",
        where="story_compose",
        timeout_s=30,
        temperature=0.7,
        max_output_tokens=2000,
    )

    assert text == "ok"
    assert captured[0] == "anthropic"
    assert meta["per_where_override"] is True

The test suite gave us confidence that changing the global primary provider would not accidentally move story generation to the cheaper tier.

16. Results

We estimated the cost impact before deployment, then set up log-based validation to check the estimate afterwards.

Change	Est. daily saving
13 extraction/classification callsites to Flash-class model	~$3–5
Profile summaries to Haiku-class model	~$0.5–1
Chat max output tokens 4096 → 2048	~$0.5–1
Narrative model token reduction	~$0.4
Prompt caching when cache is warm	~$0.3–2
Disable broken memory read plan	latency only
Total expected saving	~$4.7–9.4/day

QA had been around $10–13/day.

The expected target after deployment was $2–5/day, pending 24-hour billing and log validation.

The important part is that we were no longer guessing.

With token logs and per-callsite metadata, we could now answer:

Which callsite spent the most?
Which provider handled it?
Which model handled it?
How many input tokens?
How many output tokens?
Did prompt caching hit?
Did a cheap model fall back to an expensive one?

That is the difference between "the bill went up" and "this specific callsite is expensive."
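
Answering those questions is mostly a group-by over the usage logs. Below is a minimal sketch, assuming each log line has been parsed into a dict with the fields from the `llm_usage_stream` entry shown earlier plus a `where` label. The `where` key in the record and the function name are assumptions.

```python
from typing import Dict, Iterable, Mapping, Tuple

TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)

UsageKey = Tuple[str, str, str]


def summarize_usage(records: Iterable[Mapping]) -> Dict[UsageKey, Dict[str, int]]:
    """Sum token counts per (where, provider, model); None or missing counts as 0."""
    totals: Dict[UsageKey, Dict[str, int]] = {}
    for rec in records:
        key = (
            rec.get("where", "unknown"),
            rec.get("provider", "unknown"),
            rec.get("model", "unknown"),
        )
        bucket = totals.setdefault(key, dict.fromkeys(TOKEN_FIELDS, 0))
        bucket["calls"] = bucket.get("calls", 0) + 1
        for field in TOKEN_FIELDS:
            bucket[field] += rec.get(field) or 0
    return totals
```

A cheap callsite that shows up under an expensive model in this summary is the fallback escalation from section 11 made visible.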

17. The Reusable Pattern

This pattern applies to any AI backend with multiple LLM tasks.

1. Label every LLM call. Every call needs a stable identifier such as where="entity_extract", where="story_compose", where="memory_patch". If you already have labels for logging, reuse them for routing.

2. Classify by task type. Do not classify by code ownership. Classify by output requirement: narrative generation → premium writing model; structured writing → lightweight writing model; extraction/classification → fast cheap model.

3. Separate routing from business logic. Business code should not decide providers. The caller says where="entity_extract"; the dispatcher decides entity_extract → Gemini Flash.

4. Pin quality paths before changing defaults. Never flip the global provider until user-visible paths are explicitly pinned. Global defaults are dangerous when the product has mixed call types.

5. Make token caps configurable. Hardcoded max tokens are almost always too high. Configurable caps let you tune without redeploying code.

6. Log token usage on every path. Sync calls and streaming calls behave differently. Instrument both.

7. Trace fallback chains. A "cheap" callsite can silently become expensive if fallback models escalate. Always test the fallback chain.

8. Verify container env, not just .env. This one is boring and important. The variable in .env does not matter if the container never receives it.

18. LLM Cost Audit Checklist

Here is the checklist I would use next time:

List every LLM callsite.
Add or reuse a stable callsite label.
Record provider, model, latency, input tokens, output tokens.
Separate sync and streaming usage logging.
Classify each callsite by output type and user visibility.
Identify quality-critical paths.
Pin quality-critical paths explicitly.
Route extraction/classification to cheaper models.
Route structured writing to a lightweight writing model.
Make max output tokens configurable.
Add prompt caching where prompt prefixes are stable.
Log cache creation and cache read tokens.
Trace fallback chains for expensive escalation.
Verify environment variables inside running containers.
Treat Compose/Kubernetes/env passthrough as source code.
Monitor provider distribution after deploy.
Compare expected savings against 24-hour billing data.

19. Final Takeaway

Early-stage AI products are usually right to optimize for iteration speed first.

But some cost is not a trade-off.

Routing a structured JSON extraction task through your best narrative model is not buying you quality. It is buying you a larger API bill.

The fix was not a giant rewrite. It was mostly:

audit every callsite
label every call
route by task complexity
pin quality paths
add token observability
deploy carefully

The hard part was not the code.

The hard part was admitting that "one model for everything" had quietly become part of the architecture.

Once every LLM call had a label, the cost model became visible.

And once the cost model became visible, the refactor was obvious.

For a complementary cost pass on AI-assisted development — trimming fixed prompt noise and reducing file-load granularity — see the companion note.

Originally published at https://dearartist.xyz/blog/label-every-llm-call-ai-backend-cost-audit.

Glia Narrative Model: Turning AI Memory Into Continuity

Fri, 24 Apr 2026 10:00:00 GMT

A structured, cautious memory layer that helps Glia remember the user as a person-in-progress — without dumping raw transcript history back into the product experience.

Evidence becomes structured continuity, then surfaces as gentle context.


A user opens Glia for the third time this month and types, "I don't know why I'm so tired this week." Without memory, the assistant answers competently but generically — sleep, workload, stress. The reply is plausible. It just doesn't sound like it remembers anything about the person on the other side of the screen.

Narrative Model is the layer we built so Glia can respond from a more situated place — not by replaying old transcripts, and not by claiming to know the user, but by maintaining a small, structured, inspectable sketch of who the user appears to be becoming.

It is not a profile. It is not a retrieval index. It is a narrow memory layer that distills conversation evidence into a typed, validated representation the product can reason about.

The core path is simple: conversation evidence becomes a validated narrative model, and the product receives only a gentle context block — not raw transcript replay.

Why this exists

Glia is a personal AI for reflection. Three surfaces matter most. Chat happens through the Dot Agent, the conversational agent users actually talk to. Longer-form reflections come from the Story Agent, which composes written pieces from a user's recent context. Recurring people are handled by the People Card pipeline, which promotes well-grounded named relationships into reusable memory objects.

All three share the same underlying problem: each session tends to start cold.

The instinct is to reach for retrieval. And retrieval has its place. But it answers a narrow question — "what was said before that resembles this?" — and misses the more human one: "what is the user in the middle of becoming?"

Related note: How Glia Actually Remembers — Raw messages, async patches, hybrid retrieval, and the evidence pipeline behind continuity.

Retrieval / vector memory	Narrative Model
Finds fragments of past text that look similar to the current query. Good at recall. Silent on shape.	Maintains a structured, validated state of who the user appears to be right now. Good at continuity. Silent on trivia.

The two are complementary. Retrieval can recover a specific fragment when it's needed. The Narrative Model carries the steady picture between sessions so the product doesn't have to rebuild its sense of the user from scratch every time.

This has to be done with care. Memory shouldn't feel like surveillance. It shouldn't overfit. It shouldn't psychoanalyze. And it shouldn't leak hidden fields back at the user. The goal is a product that responds with more situated care, not more dramatic certainty.

Design constraint

The model should help Glia carry continuity without making the assistant sound like it is secretly reading from a dossier.

Early signal: memory that feels situated

One early signal was not simply that Glia could remember more. It was that memory felt emotionally situated.

The strongest feedback was about specificity: not a bigger context window, not transcript replay, but a system that can carry forward details that feel worth remembering.

Feedback during early Narrative Model validation.

This distinction matters. The goal is not larger context for its own sake. The goal is memory that knows what is worth carrying forward, and uses it gently.

The shape of the model

Narrative Model is structured around five broad dimensions: Chapter, Drive, Relationships, Self-image, and Energy.

The purpose is not to classify the user permanently. The purpose is to maintain a cautious, updatable sketch of what appears to matter right now.

                  ┌────────────────────────────┐
                  │      Narrative Model       │
                  │  cautious · updatable ·    │
                  │         grounded           │
                  └────────────┬───────────────┘
                               │
   ┌──────────┬────────────────┼────────────────┬───────────┐
   │          │                │                │           │
Chapter    Drive          Relationships     Self-image    Energy
(arc)   (motivation +   (grounded people)   (claimed +   (sources ·
        tension)                            emerging      drains ·
                                            roles)       absence)

Dimension	Sub	Description
Chapter	current arc	What life phase or season the user seems to be in.
Drive	motivation + tension	What the user appears to be trying to prove, protect, build, or resolve.
Relationships	grounded people	Recurring named people and relational dynamics.
Self-image	claimed + emerging roles	Roles the user claims, resists, or is beginning to inhabit.
Energy	sources · drains · absence	What seems to replenish, drain, or be missing.

Chapter

Chapter captures the user's current arc in plain language. It should feel like a careful summary of the season the user is in, not a permanent identity label.

Example: "Building an early company while learning how to ask for help."

A good chapter is specific enough to be useful, but not so specific that it traps the user. It should update as the user changes. It should avoid diagnosis, moral judgment, or dramatic interpretation.

Drive

Drive captures what appears to be pulling the user forward. In the current implementation, it can also include a tension — the friction between what the user wants and what is making that difficult.

Drive — "Wants to prove the product can become real without losing personal grounding."

Tension — "Ambition is pulling against exhaustion."

The drive field is valuable because many user messages are not just about tasks. They are about the force behind the tasks. But the model must keep this grounded. It should not invent grand motives from thin evidence.

Relationships

Relationships capture recurring named people and the dynamics around them when there is enough grounding. This is not meant to turn every name into a permanent object. It is a cautious bridge between relational context and memory.

Named relationships can later become candidates for People Cards, but only when there is enough evidence. The system should avoid creating people-memory objects from weak, accidental, or one-off mentions.

Why this shape

People are often the most important part of a user's context, but they are also the easiest place to overreach. The model treats relationships as grounded candidates, not automatic conclusions.

Self-image

Self-image captures roles the user actively claims and roles that appear to be emerging. This is especially important because people often change before they are ready to name the change.

For example, a user may call themselves a builder or founder, but not yet feel comfortable calling themselves a leader. The model can represent that as a budding role with a do_not_label constraint.

The assistant should not prematurely label the user. Emerging identity should be protected. A budding role can guide sensitivity, but it should not be pushed back at the user as a statement of fact.

Energy

Energy captures what appears to replenish the user, drain them, or be missing from their current life.

Sources	Drains	Currently absent
deep work	ambiguous relationships	rest
long walks	context switching	founder community
clear feedback

This dimension helps the assistant respond more usefully to vague emotional states. If a user says they are tired, the product can respond from a more situated understanding without overexplaining or psychoanalyzing.

Prompt rules that matter

Each dimension carries a status and a confidence level, so the output never pretends every field is equally well supported. Where evidence is thin, the prompt prefers blank space over invention — sparseness is correct output, not failure.

When signals conflict, the model is told to record tension rather than smooth it away — important for drive and self-image, where people are often mid-transition. Named people only appear when grounded, and budding roles are protected with a "do not label" constraint so emerging identity is never pushed back at the user as fact.

Dimensions also move at different speeds: fast-moving ones can shift from a single recent session, slow-moving ones need repeated evidence. That temporal shape keeps the model responsive without becoming volatile.

The rules in one place

Never invent facts.

Named people only when grounded.

Hold contradictions as tension.

Protect emerging identities.

Use confidence fields.

Leave blanks when evidence is thin.

The schema

The schema is where the Narrative Model stops being a loose idea and becomes an inspectable system.

At the center is a top-level model containing schema version, update time, session count, dimensions, and a few auxiliary fields like thread, openness, tone, and archetype traits. But the most important part is the dimensions object itself.

The production schema is larger, but the simplified version below shows the important shape: typed dimensions, bounded fields, and explicit uncertainty.

# simplified
from pydantic import BaseModel, Field

class NamedPerson(BaseModel):
    name: str
    role: str = ""
    dynamic: str = ""
    emotional_weight: str = ""
    people_card_candidate: bool = False

class DriveDimension(DimensionSlice):
    intrinsic: str = ""
    tension: str = ""

class SelfImageDimension(DimensionSlice):
    active_roles: list[str] = Field(default_factory=list)
    performed_roles: list[str] = Field(default_factory=list)
    budding_roles: list[BuddingRole] = Field(default_factory=list)

class EnergyDimension(DimensionSlice):
    sources: list[str] = Field(default_factory=list)
    drains: list[str] = Field(default_factory=list)
    currently_absent: list[str] = Field(default_factory=list)
    current_capacity: str = ""

class NarrativeModel(BaseModel):
    schema_version: str = "1"
    updated_at: str = ""
    session_count: int = 0
    dimensions: NarrativeDimensions
    archetype_traits: str = ""
    narrative_tone: str = "neutral"
    openness: str = "medium"
    thread: str = ""

Simplified and sanitized. Some supporting types (DimensionSlice, BuddingRole, NarrativeDimensions) are omitted for brevity. The point is the architectural shape, not exact production source.
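
For readers who want something runnable, the omitted types could look roughly like the following. This is a guess at their shape, not production source. It uses stdlib dataclasses instead of Pydantic so it runs without dependencies. The `status` and `confidence` fields follow the prompt rules described earlier, `BuddingRole` mirrors the fields used in the synthetic example in this note, and every default value is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DimensionSlice:
    # Shared by every dimension so no field pretends to be better
    # supported than it is. Both are assumed to be short string labels.
    status: str = ""
    confidence: str = ""


@dataclass
class BuddingRole:
    role: str
    evidence: str = ""
    # Assumed default: an emerging role is protected unless stated otherwise.
    do_not_label: bool = True


@dataclass
class SelfImageDimension(DimensionSlice):
    active_roles: List[str] = field(default_factory=list)
    performed_roles: List[str] = field(default_factory=list)
    budding_roles: List[BuddingRole] = field(default_factory=list)
```

Blank defaults everywhere are deliberate in this sketch: a model with nothing grounded should still be a valid object.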

This structure matters because it makes memory usable. A free-form blob is hard to validate, hard to compare over time, and hard to inject back into downstream systems in a disciplined way. A structured schema gives the product something it can reason about.

Related note: Reducing Token Cost While Working on Glia — Trimming prompts, caching context, and quiet wins on the cost line.

Confidence and status fields matter because they force the model to admit uncertainty. Named relationships matter because a person list without role or dynamic is too weak to be useful. Self-image matters because claimed roles and emerging roles are not the same thing. Energy matters because what fills, drains, and remains absent is often the difference between a generic response and a grounded one.

The schema is also what makes the system inspectable. It is not a mystical memory black box. It is a typed representation with explicit fields and visible failure modes.

A synthetic example

A populated model for a hypothetical user might look like this. All values are invented for illustration.

{
  "chapter": {
    "value": "Building an early company while learning how to ask for help",
    "confidence": "high"
  },
  "drive": {
    "value": "Wants to prove the product can become real without losing personal grounding",
    "tension": "Ambition is pulling against exhaustion",
    "confidence": "high"
  },
  "relationships": {
    "named_people": [
      {
        "name": "Nora",
        "role": "co-founder",
        "dynamic": "trusted but currently under-communicated",
        "emotional_weight": "high",
        "people_card_candidate": true
      }
    ]
  },
  "self_image": {
    "active_roles": ["founder", "builder"],
    "budding_roles": [
      {
        "role": "leader",
        "evidence": "The user is starting to coordinate others but does not yet fully claim this identity.",
        "do_not_label": true
      }
    ]
  },
  "energy": {
    "sources": ["deep work", "long walks", "clear feedback"],
    "drains": ["ambiguous relationships", "context switching"],
    "currently_absent": ["rest", "founder community"],
    "current_capacity": "medium"
  }
}

Synthetic example. This is not a production record and does not describe a real user.

Pipeline overview

The model is refreshed asynchronously after conversation. The system gathers recent conversation evidence, builds a transcript representation, reads the current narrative model if one exists, asks the Narrative Model agent to produce strict JSON, validates the output, and only then saves the structured representation.

┌──────────────┐   ┌───────────────┐   ┌───────────────────┐   ┌──────────────┐
│   Capture    │ → │    Reason     │ → │     Validate      │ → │   Surface    │
│              │   │               │   │ (validation gate) │   │              │
├──────────────┤   ├───────────────┤   ├───────────────────┤   ├──────────────┤
│ 01 Conv.     │   │ 04 Build      │   │ 07 Validate       │   │ 09 Prepare   │
│    happens   │   │    transcript │   │    strict JSON    │   │    context   │
│ 02 Async     │   │ 05 Read curr. │   │ 08 Save struct.   │   │    for       │
│    refresh   │   │    model      │   │    model          │   │    product   │
│ 03 Fetch     │   │ 06 Run NM     │   │                   │   │    surfaces  │
│    messages  │   │    agent      │   │                   │   │              │
└──────────────┘   └───────────────┘   └───────────────────┘   └──────────────┘
                                                                      │
                                                                      ▼
                                                        available to: Chat ·
                                                        Stories · People Cards

Legend: Evidence in · Structure out · Validation gate · Product context

A simplified view of the job flow looks like this:

# simplified
def refresh_narrative_model_job(user_id: str, conversation_id: str):
    if debounce_should_skip(user_id):
        return
    messages = fetch_messages_for_conversation(user_id, conversation_id, limit=200)
    transcript = transcript_from_messages(messages)
    current = get_narrative_model_dict(user_id)
    updated = run_narrative_model_agent(
        current_model=current,
        transcript=transcript,
    )
    if not updated:
        return
    save_narrative_model(user_id=user_id, model=updated)
    maybe_trigger_people_cards(user_id, updated)

Architectural shape: debounce, fetch, build, run, validate, save, optionally trigger people candidates.
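
In the simplified job above, validation appears to be folded into the agent call, which returns nothing when its output is unusable. Here is a sketch of what such a gate might check, assuming the agent returns a JSON string shaped like the synthetic example. `validate_agent_output`, the required keys, and the allowed confidence labels are assumptions for illustration.

```python
import json
from typing import Any, Dict, Optional

REQUIRED_DIMENSIONS = ("chapter", "drive", "relationships", "self_image", "energy")
ALLOWED_CONFIDENCE = {"", "low", "medium", "high"}  # assumed label set


def validate_agent_output(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed dimensions, or None if the output should be discarded."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass.
        return None
    if not isinstance(data, dict):
        return None
    for name in REQUIRED_DIMENSIONS:
        dimension = data.get(name)
        if not isinstance(dimension, dict):
            return None
        confidence = dimension.get("confidence", "")
        if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
            return None
    return data
```

Note that an empty dimension passes the gate. Sparse output is valid here; only malformed structure is rejected.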

Grounding rule

The model may infer only from grounded evidence. It should preserve uncertainty, tolerate blanks, and avoid converting weak signals into confident claims.

Storage and history

In the current implementation, storage is intentionally simple. Rather than introducing a separate narrative-model table, the system keeps the active model in a structured JSON field on the user profile. That keeps the read path lightweight and avoids an extra join for the main consumers.

The save path also snapshots the prior model before overwrite. Each time a new model is successfully saved, the previous one is appended into model_history. That history is capped at 50 snapshots so it remains bounded.

updated_at must be system-owned. A simplified version of the save path looks like this:

# simplified — field names below are illustrative
from datetime import datetime, timezone

def save_narrative_model(db, user_id: str, model: NarrativeModel) -> None:
    row = load_profile_row(db, user_id)
    existing = load_current_model(row) or {}
    history = (existing.get("model_history") or [])
    if existing:
        snapshot = {k: v for k, v in existing.items() if k != "model_history"}
        history = (history + [snapshot])[-50:]
    new_data = model.model_dump(mode="json")
    new_data["updated_at"] = datetime.now(timezone.utc).isoformat()
    new_data["model_history"] = history
    write_current_model(row, new_data)
    db.commit()

Simplified and sanitized. Save the active model, retain bounded history, and keep update metadata under system control.

The history is not yet a user-facing product surface. There is no built-out "you in March versus you in July" experience today. But keeping the snapshots creates a foundation for future longitudinal features without forcing those product decisions prematurely.

Injection into product experience

Chat

For chat, the system converts the stored model into a short plain-language narrative context block and injects that block into the prompt as soft background. That wording matters. It is not raw transcript text. It is not hidden truth. It is not meant to override what the user is saying now.

A simplified version of the injection path looks like this:

# simplified
narrative_block = build_narrative_context_block(model)
template_vars = {
    "messages": transcript,
    "memory_evidence_json": memory_evidence,
    "narrative_context": narrative_block or "",
}
prompt = render_prompt(contract["body"], template_vars)

Shape of the injection path: structured model → plain-language block → prompt variable.

The prompt contract for chat treats narrative_context as optional background from prior sessions. If it conflicts with what the user says in the present moment, the present message wins. That is the right boundary.
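
The source does not show how the context block is built. Here is a rough sketch of the idea, assuming a model dict shaped like the synthetic example: emit a few plain-language lines, skip blanks, and never render budding roles. The function name `render_context_block_sketch` and the wording of each line are assumptions.

```python
from typing import Any, Dict, List


def render_context_block_sketch(model: Dict[str, Any]) -> str:
    """Render a short, soft background block; returns '' when nothing is grounded."""
    lines: List[str] = []
    chapter = (model.get("chapter") or {}).get("value", "")
    if chapter:
        lines.append(f"Current chapter (tentative): {chapter}")
    drive = model.get("drive") or {}
    if drive.get("value"):
        lines.append(f"Apparent drive: {drive['value']}")
    if drive.get("tension"):
        lines.append(f"Unresolved tension: {drive['tension']}")
    roles = (model.get("self_image") or {}).get("active_roles") or []
    if roles:
        lines.append("Roles the user claims: " + ", ".join(roles))
    energy = model.get("energy") or {}
    if energy.get("drains"):
        lines.append("Seems drained by: " + ", ".join(energy["drains"]))
    # Budding roles are deliberately not rendered: they may guide sensitivity
    # elsewhere, but must not be stated back to the user as fact.
    if not lines:
        return ""
    header = "Background from prior sessions (optional; the user's current message wins):"
    return "\n".join([header] + lines)
```

Returning an empty string for an empty model matches the `narrative_block or ""` fallback in the injection path above.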

Stories

The story system can also receive narrative context. This helps story generation stay coherent with the user's current chapter, unresolved tension, relational landscape, and openness level.

That does not mean the story system should become more dramatic. It means it becomes less arbitrary. If the user is in a transitional chapter and carrying unresolved tension, the story generator should not suddenly write as if everything has already resolved.

People Cards

The relationships dimension also feeds a more object-like memory path. If a named person is sufficiently grounded and marked as a candidate, the system can create or resolve a provisional people entity. That gives Glia a bridge from narrative understanding into reusable product memory objects.
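
As a sketch of that selection step: the filter below keeps only named people that are flagged as candidates and carry a role or a dynamic, following the earlier point that a bare name is too weak to be useful. The function name and the exact grounding test are assumptions, since the real criteria are not shown in this note.

```python
from typing import Any, Dict, List


def people_card_candidates(model: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return flagged named people that also have a role or dynamic recorded."""
    people = (model.get("relationships") or {}).get("named_people") or []
    selected: List[Dict[str, Any]] = []
    for person in people:
        if not person.get("people_card_candidate"):
            continue  # not marked as a candidate by the model
        if not (person.get("name") or "").strip():
            continue  # a card needs a name to attach to
        if not (person.get("role") or person.get("dynamic")):
            continue  # name alone is treated as insufficient grounding
        selected.append(person)
    return selected
```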

Related note: Glia Social / Share: How AI Memories Become Safely Shareable — Entity-scoped Connection, card-scoped Single Story Share, and the boundaries that keep memory sharing safe.

This is where the Narrative Model becomes product infrastructure rather than just metadata. The full job flow that drives this lives in the Pipeline overview above.

The bug that mattered: false staleness

One of the clearest engineering lessons in the Narrative Model rollout came from a surprisingly small field: updated_at.

Originally, the model's updated_at came from LLM output. That seemed harmless at first because the field fit naturally into the generated JSON. But it was a category mistake.

An LLM does not know the actual runtime date. It can generate something that looks like a timestamp, but that is not the same as owning operational truth. In practice, that meant freshly generated models could appear hundreds of days old.

That mattered because the narrative context builder uses updated_at to decide whether to prepend a stale-model warning. Once the timestamps were wrong, the system could tell downstream prompts to treat fresh models as stale.

The fix was simple and important: overwrite updated_at at save time with the system clock in UTC.

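# simplified
from datetime import datetime, timezone

new_data = model.model_dump(mode="json")
# Overwrite whatever the LLM produced with the system clock, in UTC.
new_data["updated_at"] = datetime.now(timezone.utc).isoformat()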

The timestamp is assigned by the system at save time.
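To see why a wrong timestamp leaks into prompts, here is a rough sketch of a staleness check. The 30-day window, the warning text, and both function names are assumptions for illustration; the note does not state the real threshold.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the note does not state the real staleness window.
STALE_AFTER = timedelta(days=30)


def is_stale(updated_at_iso: str, now: datetime) -> bool:
    """Return True when the saved model is older than STALE_AFTER."""
    updated_at = datetime.fromisoformat(updated_at_iso)
    return now - updated_at > STALE_AFTER


def render_context(body: str, updated_at_iso: str, now: datetime) -> str:
    """Prepend a stale-model warning only when the timestamp is actually old."""
    if is_stale(updated_at_iso, now):
        return "Note: this narrative model may be out of date.\n" + body
    return body
```

With a check like this, a model saved a minute ago but stamped by the LLM with a date from over a year earlier would be flagged as stale, which is the failure described above.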

Engineering lesson

LLMs can generate useful structure, but they should not be trusted with operational metadata like timestamps, IDs, or state transitions.

Current validation state

The core pipeline is in place, and the work has shifted from building to observing. Chat and story paths receive narrative context. Provisional People Card candidates can be triggered when relationships are sufficiently grounded. Freshly saved models no longer carry false stale warnings caused by LLM-generated timestamps.

Across validation runs, the system shows the expected shape: richer conversational histories produce denser dimensions and more coherent continuity, while sparser histories produce blanks or lower-confidence fields rather than fabricated certainty. Sparse output is a feature, not a bug.

Attention at this stage sits on correctness, prompt health, and structured-output reliability — whether refreshes happen on schedule, whether the agent keeps returning valid structured output, and whether richer models translate into useful continuity without over-personalization.

Implementation overview

Narrative Model was built in staged increments rather than as a single drop. Each stage was small enough to validate on its own and large enough to land a real piece of product behavior. The end-to-end flow is in the Pipeline overview above; the rollout shape is below.

Stage	Title	Description
Stage 1	Core foundations	Agent, typed schema, persistence, async refresh job, chat-side hook, and the regression tests that anchor them.
Stage 2	Schema expansion & onboarding	Additional dimension fields and explicit support for early-session users, where evidence is naturally thin and blanks are correct output.
Stage 3	Story composition injection	Wire the structured model into the Story Agent so longer-form reflections stay coherent with the user's current chapter and tension.
Stage 4	Provisional People Card triggering	Promote well-grounded named relationships into candidate memory objects, with grounding requirements rather than name-matching.
Stage 5	Validation & monitoring	Operational notes covering refresh health, structured-output reliability, and the boundaries of what the model is allowed to claim.

The current surface area covers the agent, the async refresh job, the schema, the persistence layer, the prompt contract, the chat and story injection hooks, the provisional People Card trigger, and a regression suite. This is a product pipeline, not a prompt-only experiment.

Why this matters

Memory that lives only inside a prompt is fragile. Memory that lives in a validated schema, a persistence layer, and a refresh job is something the product can actually depend on.

The product surface: what continuity feels like

Simulated UI · synthetic content. These mockups are illustrative. They show how a structured memory layer gently changes the surface a user actually touches — without exposing hidden fields or making the system feel like it is watching.

Without vs. with the Narrative Model.

Without Narrative Model — Glia · Chat

User: I don't know why I'm so tired this week.

Glia: Sounds like you may be under stress. Try getting more sleep, reducing workload, and taking breaks.

Generic but plausible.

With Narrative Model — Glia · Chat

User: I don't know why I'm so tired this week.

Glia: You've been carrying a lot between the launch push and trying to keep things steady around it. The tiredness may be tracking that, not just the week itself.

✨ context used lightly

More situated, without exposing hidden fields.

Specific recall, used lightly (synthetic example)

User: I'm going for a hill workout today.

Glia: Back to that hill again? Your legs may complain, but you'll probably feel better after.

✨ context used lightly

Synthetic example inspired by early usage patterns. The user does not see raw memory fields; they feel continuity through a small, relevant detail.

Story Agent preview

The Cost of Holding Momentum

A reflection on building, uncertainty, and staying grounded.

You are not only trying to move faster. You are trying to move without losing the parts of yourself that make the work worth doing.

chapter-aware · tone-matched · grounded reflection

Narrative context keeps story generation coherent with the current chapter — not transcript replay.

People Card candidate

Nora (co-founder) · candidate · needs grounding

Relationship signal: Trusted, but currently under-communicated.

Provisional. Only recurring, grounded relationships are promoted into memory objects.

The user does not see the raw model. They feel its effect through continuity: a reply that starts from the right emotional neighborhood, a story that matches the current chapter, or a relationship object that appears only when there is enough grounding.

The better answer is not better because it is more poetic or more invasive. It is better because it is more situated — grounded in what the system has already understood about the user's ongoing context.

The important constraint is how gently that context gets used. The assistant should not dump hidden model fields back at the user, sound like it is reading from a file, or psychoanalyze. The best outcome is a response that feels naturally coherent, not theatrically personalized.

Design principles

Structured memory beats raw memory. Raw history contains evidence, but structure makes it reusable. The product needs something more disciplined than transcript replay.
Confidence matters. If the system cannot say how sure it is, it will tend to sound more certain than it should.
Blank is better than hallucinated. A sparse field is not a failure. In personal memory systems, invention is usually worse than omission.
Emerging identity should be protected. A user can be becoming something without wanting that identity imposed on them.
Relationships must be grounded. Names matter, but only when supported. Otherwise relationship memory turns into noise.
Operational metadata must be system-owned. Timestamps, IDs, save semantics, and state transitions should belong to the runtime, not the generator.
Context should guide tone, not dominate it. The goal is better continuity, not to make every answer feel narrated by memory.

What we are not doing yet

Narrative Model is intentionally narrow, and some boundaries are important.

Selective injection. Narrative context is not yet wired into every adjacent surface. Broader theme- or line-style injection should be evaluated deliberately.
History stays internal. Model snapshots are retained but not surfaced to the user. The system preserves longitudinal shape before committing to a public surface.
Context-size limits. Long-context users may eventually need explicit limits or a summarization layer when many named relationships are active.
Structured-output reliability. Treated as an ongoing operational concern, not a solved problem. Validation and recovery paths stay first-class.

Most importantly, this is a validation and monitoring phase, not a claim of perfect memory. The responsible thing at this point is to observe, refine only where evidence justifies it, and resist turning a narrow memory layer into a sprawling system too early.

Optional future surface: model history over time

Because each successful save snapshots the prior model, the system is quietly retaining the shape of change. That opens a possible future surface where a user could see continuity over time — not as surveillance, not as overconfident analysis, but as reflective movement.

[Snapshot 1] ──→ [Snapshot 2] ──→ [Snapshot 3]
Chapter:        Chapter:           Chapter:
Starting the    Learning to        Delegating
company alone   ask for help       without feeling
                                   absent

That surface is not shipped today, and should not be rushed. The underlying structure is there when the product question is ready.
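The snapshot-on-save behavior can be sketched roughly as follows. The NarrativeStore class and its in-memory fields are hypothetical stand-ins for the real persistence layer, which is not shown in this note.

```python
from datetime import datetime, timezone
from typing import List, Optional


class NarrativeStore:
    """In-memory stand-in for the persistence layer (hypothetical)."""

    def __init__(self) -> None:
        self.current: Optional[dict] = None
        self.snapshots: List[dict] = []

    def save(self, new_data: dict) -> dict:
        # Keep the prior model before it is overwritten.
        if self.current is not None:
            self.snapshots.append(self.current)
        # The runtime, not the generator, owns the timestamp.
        stamped = dict(new_data)
        stamped["updated_at"] = datetime.now(timezone.utc).isoformat()
        self.current = stamped
        return stamped
```

Each save leaves one more entry in snapshots, so the history accumulates as a side effect of normal refreshes and needs no separate job.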

Closing

Narrative Model is an attempt to make AI memory feel less like retrieval and more like continuity — without pretending the system fully knows the person, and without turning memory into a black box that can't be inspected.

What it holds is small: where the user seems to be in life, what tensions are still alive, who matters, what identity movement is emerging, and what fills or drains them. A modest claim, but a useful one.

If personal AI is going to feel human over time, it needs memory that is more than recall and less than overreach.

Privacy checklist

This note describes a real production system, but everything shown on this page has been reviewed for privacy.

✓ All chat examples, JSON payloads, and named people are synthetic. No real user data is shown.
✓ All personal identifiers — usernames, avatars, timestamps, and platform UI — have been removed from the early-signal feedback image.
✓ No internal commit SHAs, file paths, or deployment targets are exposed in the code snippets.
✓ Code blocks are simplified for readability and labeled accordingly; they are not production source.
✓ Real cofounder feedback is paraphrased and presented without handles or social media chrome.

Originally published at: https://dearartist.xyz/blog/glia-narrative-model-v1


Glia Social / Share: How AI Memories Become Safely Shareable

Thu, 23 Apr 2026 11:00:00 GMT

In most software products, sharing is simple. You share a photo, a document, or a link, and the system mostly needs to answer one question: who can access this resource?

In Glia ↗, the object being shared is not a file or a URL. It is a memory. And memory is different.

A memory is generated from conversations, lived context, relationships, and AI-assisted narrative structure. The thing being shared is not just content — it is content about someone. That changes the problem entirely.

Core idea

Glia does not simply send content outward. It delivers memories about a person to the right people, within the right boundary.

Object · about · boundary — three columns of the share model.

Why sharing memory is not ordinary sharing

Once memory becomes the object, sharing is no longer just an access-control feature. It becomes a question of who the memory is about, who should be allowed to see it, whether relationship context matters, and how to make sharing useful without making it dangerously broad.

That is why Glia's social/share is not a traditional "share sheet + URL" feature. It is a memory-sharing system built around people entities and controlled boundaries. The architecture below is what it takes to make that workable in production.

What Glia Social / Share actually is

Glia Social / Share is implemented through two different sharing models that look similar from the outside but mean very different things underneath:

Connection Share — entity-scoped relationship access.
Single Story Share — card-scoped outbound access.

That distinction is the foundation of the whole system. Collapsing the two into one mechanism would blur the semantics and weaken the permission model.

The four semantic rules

Glia's current social/share system is defined by four rules. They look compact, but they determine nearly every important design choice in the stack.

Final semantics

Connection grants access only to moment stories related to a specific people entity; it does not grant access to all of the owner's stories.

Single Story Share applies only to one moment story.

The current share-link is a public bearer token; it is not identity-bound to the viewer.

Theme stories are excluded from the current social/share system.

Core concepts at a glance

Connection Share — A long-term sharing boundary anchored to a people entity. Opens access only to the owner's moment stories that relate to that entity.

Single Story Share — An explicit, card-scoped outbound share. One token, one story. It does not establish any relationship with the receiver.

Public Bearer Token — The current share-link is not identity-bound. Anyone who holds the token can view that single story until it expires.

Theme Story Excluded — Theme stories are deliberately outside this system today. Their narrative shape does not fit entity-scoped sharing.

Two business lines: Connection vs Single Story

Connection Share

Suppose user A has a people entity in Glia representing a real person B. If A invites B and the connection is accepted (see also: fixing Glia's social invite flow), B does not get access to all of A's stories. Instead, B gets access only to A's moment stories that are specifically related to that people entity.

The permission anchor is not the owner account, and not the feed as a whole. It is entity_id. This makes connection feel less like a traditional social graph and more like a controlled sharing contract around a person-centered memory scope.

Single Story Share

Single Story Share is the explicit, card-level outbound share. A user can generate a share-link for one specific moment story. It is mainly used to send a memory outward — especially to someone who does not yet have access through connection.

it applies to one card
it does not create a lasting relationship
it does not establish connection
it does not verify that the receiver is "the person in the story"

In the current version, this link is a public bearer token. Anyone who has the token can view that one story. That is exactly why share-link is not the same thing as connection.

Diagram 1 — Business semantics

┌──────────────────────────┬──────────────────────────┬──────────────────────────┐
│ Connection Share         │ Single Story Share       │ Theme Story              │
│ (entity-scoped)          │ (card-scoped)            │ (excluded)               │
├──────────────────────────┼──────────────────────────┼──────────────────────────┤
│ Anchored to a            │ One moment story per     │ Not part of social/share │
│ people entity            │ token                    │                          │
│                          │                          │                          │
│ Only related moment      │ Public bearer token      │ Different narrative      │
│ stories                  │                          │ shape                    │
│                          │                          │                          │
│ Per-card self_only       │ No recipient identity    │ May get its own model    │
│ override                 │ binding                  │ later                    │
└──────────────────────────┴──────────────────────────┴──────────────────────────┘

The permission model

The most important thing about this system is not that it has endpoints. It is that its permissions match its semantics.

Connection authorization

A viewer can read a story via connection only if all of the following are true:

there is an active connection

the connection owner matches the card owner

the connection recipient matches the viewer

the connection entity_id matches the card entity_id

the card is a social moment story

the card is not marked self_only

Invite framed as 'memories about you' — consent to a relationship, not a file.

This is what prevents a subtle but dangerous failure mode: "I have one connection to this owner, therefore I can read unrelated stories from the same owner." That is exactly the kind of permission expansion the model is designed to avoid.

Share-link authorization

The token-based path has a different shape. A viewer can read a story via share-link if:

the token exists

the token is not expired

the token belongs to the same card

the card is a social moment story

There is no viewer identity check in this path. That is why the link is a public bearer token.

System architecture

At a high level, Glia splits the responsibility across three surfaces — the authenticated Social API, the public Social Web for landing pages, and a dedicated Social OG router for preview images. Underneath sits the policy and store layer, which is where the real business logic lives.

Diagram 2 — System architecture

CLIENT             ROUTERS                  DOMAIN                  DATABASE
─────────          ─────────                ─────────               ─────────
iOS App      ──►   Social API Router  ──►   Social Store / Policy ──► entity_shares
Public Web   ──►   Social Web Router  ──►   Push Jobs            ──► story_share_tokens
                   Social OG Router                                   cards
                                                                      card_visibility_overrides
                                                                      user_profiles
                                                                      social_notifications

Routers delegate to the store / policy layer. The store is the single place where authorization rules live.

The system does not rely on routers alone to express semantics. It relies on a centralized domain layer to keep the rules coherent across iOS, web, and background jobs.

Data model

The data model is what makes the semantics enforceable.

`entity_shares`

Represents connection-level sharing. It captures the share owner, which people entity it is about, who accepted the connection, and the current state. The key point is that entity_id is the permission anchor. That is what makes connection sharing entity-scoped.

`story_share_tokens`

Represents single-story sharing — the sharer, the card, the token, the expiration. It does not represent a relationship. It represents a single-card grant.

`card_visibility_overrides`

Captures per-card deny rules. Even when a connection exists, the owner can still hide a specific card by marking it self_only. That preserves an important real-world truth: not every memory about someone should automatically be visible to them.

Connection flow

An invite is created by the owner, previewed by the recipient, and accepted before any feed access opens up. The state machine stays simple: pending → active.

Diagram 3 — Connection flow

01  User A → API     POST /api/social/invites
02  API    → Store   create_invite(owner, entity_id)
03  Store  → DB      insert entity_share(status=pending)
04  API    → User A  returns invite_url
05  User A → User B  sends invite link
06  User B → API     GET /api/social/invites/{token}
07  Store  → DB      load preview (entity, owner, sample)
08  API    → User B  renders invite preview
09  User B → API     POST /api/social/invites/{token}/accept
10  Store  → DB      update status → active
11  API    → User B  connection created

Step 11 in production — connection accepted, feed opens.
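The pending → active transition can be sketched as a small guard. The EntityShare fields and the accept_invite helper are assumptions for illustration; only the two states and the entity_id anchor come from this note.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ShareStatus(Enum):
    PENDING = "pending"
    ACTIVE = "active"


@dataclass
class EntityShare:
    owner_user_id: str
    entity_id: str
    status: ShareStatus = ShareStatus.PENDING
    recipient_user_id: Optional[str] = None


def accept_invite(share: EntityShare, recipient_user_id: str) -> EntityShare:
    """Move a pending invite to active and bind the accepting user."""
    if share.status is not ShareStatus.PENDING:
        raise ValueError("only a pending invite can be accepted")
    share.status = ShareStatus.ACTIVE
    share.recipient_user_id = recipient_user_id
    return share
```

Rejecting anything that is not pending means an already accepted invite cannot be reused to rebind the connection to a different recipient.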

Single Story Share flow

The share-link path never touches connection state. It mints a token for one card, drops the receiver onto a public landing page, and forwards a token-bearing request into the detail endpoint.

Diagram 4 — Single Story Share flow

01  User A → API     POST /api/social/share-link
02  API    → Store   create_or_get_share_link(card_id)
03  API    → User A  returns /s/{token}
04  User A → Viewer  sends link
05  Viewer → Web     GET /s/{token}
06  Web    → Store   get_by_share_token(token)
07  Web    → Viewer  renders public landing page
08  Viewer → API     GET /api/social/story/{card_id}?token=...
09  Store  → API     validate token (exists, not expired, matches card)
10  API    → Viewer  story content or 403

Step 1 in production — the panel that mints /s/{token}.
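Steps 01 to 03 can be sketched like this. The seven-day lifetime, the in-memory dictionary, and the reuse of an unexpired token are assumptions; the reuse is suggested by the create_or_get name but not confirmed in this note.

```python
import secrets
from datetime import datetime, timedelta, timezone

# Hypothetical lifetime; the note does not state the real expiry window.
TOKEN_TTL = timedelta(days=7)

# card_id -> {"token": str, "expires_at": datetime}; stands in for story_share_tokens.
_tokens_by_card: dict = {}


def create_or_get_share_link(card_id: str, now: datetime) -> str:
    """Return /s/{token} for a card, reusing an unexpired token if one exists."""
    record = _tokens_by_card.get(card_id)
    if record is None or record["expires_at"] <= now:
        record = {
            # Unguessable URL-safe token from the standard library.
            "token": secrets.token_urlsafe(32),
            "expires_at": now + TOKEN_TTL,
        }
        _tokens_by_card[card_id] = record
    return "/s/" + record["token"]
```

Because the link is a bearer token, its only protection is that it is hard to guess and that it expires, so the token should come from a cryptographically secure source such as the secrets module.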

Key code patterns

The authorization rules above translate into very small, legible functions. The boundaries do most of the work.

Social moment story detection

def is_social_moment_story(card):
    return (
        card.type == "story"
        and card.entity_id is not None
    )

Two conditions are enough to define the boundary: it must be a story card and it must be anchored to a people entity. This naturally excludes theme stories from the current social/share system.

Connection authorization

def can_view_via_connection(viewer_user_id, card):
    share = find_active_entity_share(
        owner_user_id=card.user_id,
        recipient_user_id=viewer_user_id,
        entity_id=card.entity_id,
    )
    if not share:
        return False
    if has_self_only_override(card.user_id, card.id):
        return False
    return is_social_moment_story(card)

The critical line is entity_id=card.entity_id. That is what prevents owner-wide leakage and keeps connection access entity-scoped.
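A self-contained version makes the leakage case concrete. The Card dataclass and the two in-memory sets are hypothetical stand-ins for cards, entity_shares, and card_visibility_overrides; the rule itself is the one stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Card:
    id: str
    user_id: str
    type: str
    entity_id: Optional[str]


# Hypothetical in-memory stand-ins for entity_shares and card_visibility_overrides.
ACTIVE_SHARES = {("owner-a", "viewer-b", "entity-b")}  # (owner, recipient, entity)
SELF_ONLY = {("owner-a", "card-3")}  # (owner, card)


def is_social_moment_story(card: Card) -> bool:
    return card.type == "story" and card.entity_id is not None


def can_view_via_connection(viewer_user_id: str, card: Card) -> bool:
    # The lookup key includes entity_id, so one connection never covers
    # the owner's stories about other people.
    if (card.user_id, viewer_user_id, card.entity_id) not in ACTIVE_SHARES:
        return False
    if (card.user_id, card.id) in SELF_ONLY:
        return False
    return is_social_moment_story(card)


about_b = Card("card-1", "owner-a", "story", "entity-b")
about_c = Card("card-2", "owner-a", "story", "entity-c")
hidden = Card("card-3", "owner-a", "story", "entity-b")
```

Here viewer-b can read about_b, but not about_c (same owner, different entity) and not hidden (same entity, marked self_only).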

Share-link authorization

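from datetime import datetime, timezone

def can_view_via_share_token(token, card):
    record = get_story_share_token(token)
    if not record:
        return False  # unknown token
    if record.card_id != card.id:
        return False  # token was minted for a different card
    # assumes expires_at is stored as a timezone-aware UTC datetime
    if record.expires_at < datetime.now(timezone.utc):
        return False
    return is_social_moment_story(card)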

Notice what is missing: no viewer identity check, no connection requirement, no recipient matching. This is why the share-link is a public bearer token.

`self_only` visibility control

def visible_in_connection_feed(card):
    if not is_social_moment_story(card):
        return False
    if has_self_only_override(card.user_id, card.id):
        return False
    return True

Connection defines the default sharing scope. self_only defines the owner's right to narrow that scope at the card level. Small but conceptually important.

Supporting surfaces: landing pages, OG, push

A complete share system is not just an API. To feel like a real product, it also needs a few well-aligned surrounding surfaces.

Public landing pages

/s/{token} is how a shared memory enters the outside world. It gives the receiver a readable preview, a path into the app, and a stable handoff point between web and native UX.

Open Graph images

Invite links and story-share links are not the same thing. An invite means "I want to establish a connection." A story share means "I want to send you one memory." They deserve different OG previews because they mean different things — and the previews are part of how users learn what each link is.

Push behavior

Notifications also need semantic integrity. For example, new_shared_card should respect daily push limits, while connection_accepted should not be blocked by the same budget. This sounds like an implementation detail, but it is really part of the product contract.
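That rule fits in a few lines. The limit of three and the function signature are assumptions for illustration; only the two notification kinds and their different treatment come from this note.

```python
# Hypothetical values; the note names the two notification kinds but not the limit.
DAILY_PUSH_LIMIT = 3
BUDGETED_KINDS = {"new_shared_card"}


def should_send_push(kind: str, budgeted_sent_today: int) -> bool:
    """Apply the daily budget only to kinds that are subject to it."""
    if kind in BUDGETED_KINDS:
        return budgeted_sent_today < DAILY_PUSH_LIMIT
    # Relationship events such as connection_accepted bypass the budget.
    return True
```

Keeping the budgeted kinds in an explicit set means a new notification type is unbudgeted until someone decides otherwise, which may or may not be the right default for a given product.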

Product and engineering tradeoffs

Why connection and share-link must be separate

This is the most important design decision in the whole system. If connection and share-link are collapsed into one mechanism, everything gets blurry: is this relationship access? a temporary grant? an invitation? a durable permission? By keeping them separate, the system stays legible: connection = entity-scoped relationship access; share-link = card-scoped outbound access.

Why share-link is not recipient-bound

A natural question: if the story is about someone, why not bind the share-link to that recipient? The answer is not that it is impossible. It is that it is not the right tradeoff for the current version. Recipient-bound links would require solving much heavier problems:

how to identify the intended viewer before login
whether each person gets a unique token
how to model a story that references multiple people
what to do when links are forwarded
how native app, web landing, install flow, and authentication all connect

That is an entirely different system. The current design is intentionally narrower: connection handles relationship access, share-link handles explicit card sharing, and the link is public but the scope is narrow. That keeps the model simple and the rollout practical.

Why theme stories are excluded

This is not just an implementation gap. It is a deliberate boundary. Moment stories are closely tied to specific people, events, and relational context. Theme stories are more abstract, more synthesized, and often span broader narrative terrain. The current social/share system is intentionally optimized for people-related moment stories, where entity-scoped sharing is coherent. Theme stories may eventually deserve their own sharing model — but reusing this one would blur the semantics.

Why pending invite tri-state is intentionally not implemented yet

A tri-state model (e.g. none / pending / connected) in the iOS UI sounds like a small change, but it pulls in invite ownership semantics, sender-vs-recipient pending states, expiration UI, and resolution flows on both ends. Until those are designed end-to-end, the iOS surface keeps the simpler binary "connected vs not connected" model. It is better to be honest about that than to ship a third state that does not have a coherent product story yet.

What Glia explicitly does not support yet

Some systems become confusing because they quietly imply features they do not really have. This one is better understood by being explicit.

Pending invite tri-state. The current iOS model is still effectively connected vs not connected.
Recipient-bound share links. Current links are public bearer tokens; they are not identity-bound.
Theme story socialization. Theme stories are excluded from this sharing model entirely.

Calling these out matters. It keeps the system honest and it keeps the boundaries from drifting one PR at a time.

Closing reflection

What makes Glia Social / Share interesting is not that it adds a share button. It is that it treats memory as a different kind of object. A memory (see also: how Glia actually remembers) is not just content. It is content with relationship context. It is often about someone. And that means sharing it cannot be modeled as generic URL access alone.

Glia's answer is to separate two ideas cleanly: Connection for ongoing, entity-scoped sharing, and Single Story Share for explicit, card-scoped outbound sharing. That separation is what makes the system useful without making it careless.

Glia does not simply send content outward. It delivers memories about a person to the right people, within the right boundary.

Originally published at https://dearartist.xyz/blog/glia-social-share.

How Glia Actually Remembers

Mon, 20 Apr 2026 10:00:00 GMT

Raw messages, async patches, hybrid retrieval, and the evidence pipeline behind continuity.

When a user says, "Glia ↗ remembered something I mentioned weeks ago," it is tempting to explain that experience with a single word: memory.

In a production system, memory is rarely one thing. Sometimes it means durable storage. Sometimes summarization. Sometimes retrieval. Sometimes the model is only seeing a carefully assembled slice of prior evidence and turning that into a fluent reply. These mechanisms can look identical from the outside while behaving very differently under the hood.

In Glia, continuity does not come from stuffing full history into the model on every turn. It comes from a layered architecture: raw messages are persisted, recent windows are transformed into structured memory artifacts, those artifacts are projected into a searchable index, and each new chat request builds a bounded evidence payload that the writer model uses as grounding.

Core idea

Glia does not remember by keeping everything in context. It remembers by turning conversations into retrievable memory artifacts.

Pipeline / conversation → evidence → reply.

Diagram 1 — Memory architecture overview

WRITE PATH                      READ PATH
──────────                      ─────────
User message                    New chat request
    ↓                                ↓
messages                        planner / fallback
    ↓                                ↓
extract_memory_patch_task       build_evidence_pack
    ↓                                ↓
memory_patches                  hybrid_retrieve
    ↓                                ↓
ingest_from_patch               memory_evidence_json
    ↓                                ↓
memory_items                    llm_chat_stream
                                     ↓
                                 reply

Stories · themes · entities also feed memory_items.

Why this distinction matters

Products that feel personal are easy to mis-explain. If a team talks about AI memory as if it were human memory, it becomes harder to reason about failure modes, observability, privacy boundaries, and product trust. Talking about memory in terms of persistence, extraction, indexing, retrieval scopes, and prompt contracts makes the system easier to debug and easier to improve.

This is not a philosophical essay about AI memory. It is a technical description of what memory currently means in Glia's deployed architecture: what is stored, what is derived, what is retrieved, and what the model actually sees before it replies.

What users call memory

From the user's perspective, memory is any behavior that makes the assistant feel continuous. It may name a person mentioned once before. It may reconnect to an earlier emotional thread. It may surface a detail that the user did not repeat in the current conversation. The engineering task is to separate that experience from the mechanisms that create it.

In Glia today, those mechanisms include several distinct artifact types: messages as the source of truth for raw text; memory_patches as structured post-chat extractions; memory_items as a unified retrieval index for patch-, story-, and note-derived content; entities, aliases, and profiles that stabilize recurring people, concepts, and organizations; and a request-time memory_evidence_json payload that becomes part of the writer model's prompt context.

The last piece is the most important. The assistant's final language is still generated by a model, but the product's practical memory behavior depends heavily on what evidence is assembled and injected before generation begins.

Diagram 2 — Memory layers

Layer	Name	Code	Description
L1	Raw conversation	`messages`	Source of truth for what was actually said.
L2	Structured memory	`memory_patches`	Async extractions distilled from recent windows.
L3	Searchable index	`memory_items`	Unified retrieval substrate across sources.
L4	Narrative artifacts	`stories · themes · entities`	Longer-horizon structures that also feed memory.
L5	Request-time evidence	`memory_evidence_json`	Curated payload assembled per turn.
L6	Final model response	`llm_chat_stream`	Generated under transcript and evidence constraints.

The architecture in one sentence

Glia's current memory system is a split write/read architecture:

the write path persists raw messages, extracts structured memory patches, and projects them into retrievable memory items;
the read path gathers relevant memory artifacts at request time and passes them to the model as bounded evidence.

That division turns out to be the cleanest way to understand the system.

Write path: from messages to memory artifacts

The write path begins with persistence. Every conversation turn is stored in messages. This remains the canonical record of what was actually said. If a later memory artifact exists, it ultimately traces back here.

After a turn completes, a background task runs extract_memory_patch_task. This task loads a recent message window, generates a structured memory patch, and stores it in memory_patches with metadata including conversation_id, story_thread_id, success state, and supporting_message_ids.

This is the first important compression step. Instead of relying on future prompts to carry large raw transcript segments, the system distills a recent conversational slice into a more structured representation.

If the patch is valid, the next step is projection. ingest_from_patch converts structured patch content into one or more memory_items rows. These rows become part of the retrieval substrate used later by Memory V2. A fact, event, or relationship is no longer only buried in chat logs — it becomes an indexed artifact.
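The projection step might look roughly like this. The patch layout (facts, events, relationships), the MemoryItem fields, and the function body are assumptions for illustration; the real schema and the real ingest_from_patch are not shown in this note.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed mapping from patch field to item kind.
KIND_BY_FIELD = {"facts": "fact", "events": "event", "relationships": "relationship"}


@dataclass
class MemoryItem:
    kind: str
    text: str
    source: str
    supporting_message_ids: List[str]


def ingest_from_patch(patch: Dict) -> List[MemoryItem]:
    """Project one structured patch into zero or more index rows."""
    if not patch.get("success", False):
        return []  # failed extractions are never indexed
    items = []
    for field_name, kind in KIND_BY_FIELD.items():
        for text in patch.get(field_name, []):
            if text.strip():  # skip blank entries instead of indexing them
                items.append(
                    MemoryItem(
                        kind=kind,
                        text=text.strip(),
                        source="patch",
                        supporting_message_ids=list(
                            patch.get("supporting_message_ids", [])
                        ),
                    )
                )
    return items
```

Carrying supporting_message_ids onto every row keeps each indexed item traceable back to the raw messages it was derived from.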

There is a second write path beyond chat patches. Story pipelines, theme pipelines, and entity refresh jobs can also produce durable artifacts that later become retrievable. Stories, in particular, may be ingested into memory_items, which lets longer-horizon narrative structure participate in the same memory retrieval layer as patch-derived artifacts.

Read path: from memory artifacts to prompt evidence

The read path happens inside POST /api/chat. Before the writer model generates a reply, Glia constructs an evidence bundle. That bundle can include:

recent memory_patches across conversation, user, and story-thread scopes;
entity lookups when the current turn contains known names;
temporal message windows for time-oriented prompts;
and Memory V2 hybrid retrieval results from memory_items.

This evidence is serialized into memory_evidence_json and passed into the chat streaming layer alongside only a limited tail of recent transcript messages. The model is not receiving unbounded history. It is receiving a curated, typed summary of what the application thinks is relevant.

This is the key to the product's continuity. The system does not rely on the model to preserve everything across sessions. The application reconstructs relevant context on each turn.
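
A rough sketch of that assembly step, assuming hypothetical caps and helper names (`build_evidence`, `transcript_tail`, and both `MAX_*` constants are mine, not Glia's):

```python
import json

MAX_TAIL_MESSAGES = 8    # hypothetical cap on raw transcript sent along
MAX_EVIDENCE_ITEMS = 20  # hypothetical cap on evidence entries


def build_evidence(patches, entities, temporal_windows, hybrid_hits,
                   limit=MAX_EVIDENCE_ITEMS):
    """Assemble a typed, bounded evidence bundle and serialize it to JSON."""
    evidence = []
    for kind, items in (
        ("memory_patch", patches),
        ("entity", entities),
        ("temporal_window", temporal_windows),
        ("memory_item", hybrid_hits),
    ):
        for item in items:
            # Every entry is tagged with its type so the writer can tell sources apart.
            evidence.append({"type": kind, "data": item})
    return json.dumps({"evidence": evidence[:limit]})


def transcript_tail(messages, n=MAX_TAIL_MESSAGES):
    """Only the most recent n messages travel with the request."""
    return messages[-n:] if n > 0 else []
```

The simple slice favours whichever source is appended first. A real system would more likely budget per source or rank across sources before truncating.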

Where the evidence enters generation

The bridge between retrieval and generation is explicit in code:

# app/api/chat.py
base_stream = llm_chat_stream(
    user_id,
    conversation_id,
    topic_id,
    message_text,
    memory_evidence_json,
    ...
)

The writer model receives a curated evidence payload, not full history. Memory is an application-level construct injected into the prompt, not an invisible model property.

The fallback path is more important than it looks

Memory planning is not always LLM-driven. A conservative gate sits in front of the LLM-based read planner: if the user turn does not strongly resemble recall, temporal lookup, reflection, or an explicit "who is X?" question, the system stays on the fast path.
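
A gate like this can be very small. The sketch below is an assumption about its shape, not the deployed rules: the cue patterns and the `needs_planner` and `plan_source` names are hypothetical.

```python
import re

# Hypothetical cue patterns; the real gate's rules are not shown in this note.
RECALL = re.compile(r"\b(remember|recall|last time|you said|we talked)\b", re.I)
TEMPORAL = re.compile(
    r"\b(yesterday|last (week|month|year)|\d+ (days|weeks|months) ago)\b", re.I
)
REFLECTION = re.compile(r"\b(looking back|over time|pattern)\b", re.I)
WHO_IS = re.compile(r"\bwho(?:'s| is| was)\s+\w+", re.I)


def needs_planner(message_text):
    """True only when the turn looks like recall, temporal lookup, reflection, or 'who is X?'."""
    cues = (RECALL, TEMPORAL, REFLECTION, WHO_IS)
    return any(cue.search(message_text) for cue in cues)


def plan_source(message_text):
    # Everything that does not trip a cue stays on the deterministic fast path.
    return "planner" if needs_planner(message_text) else "fallback"
```

The gate is deliberately conservative: a false negative only costs the planner's extra precision, because the fallback plan still retrieves recent structured memory.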

The fast path is not empty. It includes a deterministic fallback retrieval plan that pulls recent structured memory at multiple scopes:

# app/api/chat.py
qs.append({"type": "memory_patches_recent", "scope": "conversation", "days": 30, "limit": 12, ...})
qs.append({"type": "memory_patches_recent", "scope": "user",         "days": 90, "limit": 12, ...})
if story_thread_id:
    qs.append({"type": "memory_patches_recent", "scope": "story_thread", "days": 30, "limit": 12, ...})

When the planner is skipped, this deterministic plan still pulls structured memory across three scopes — conversation, user, and story thread.

Continuity is not waiting for a special memory mode. Even without planner involvement, recent structured memory can re-enter the evidence pack through normal fallback retrieval. In practice, that makes the fallback path one of the most important memory features in the system.

A redacted traced example

About the example

This article uses a redacted QA case study. Message text, user-identifiable details, and sensitive context have been removed. The goal is to explain the architecture and lineage of memory artifacts, not to expose user data.

In a recent QA trace, three messages from a single conversation (A, B, C) were later linked to a successful patch row:

memory_patches.id = mp_8ae2…124f
supporting_message_ids = {A, B, C}
story_thread_id = sth_b3be…63d

The patch payload contained one extracted entity and two events. A blog-safe anchor in that payload was a public literary entity. That patch was then projected into memory_items, including rows whose source_id was shaped like mp_8ae2…124f:entity:0, carrying the patch's structured fact representation.

Later, a memory_read_runs row for the same user showed a fallback read plan whose JSON matched the deterministic strategy in code: conversation-scoped patches, user-scoped patches, and story-thread patches including the same thread identifier.

Diagram 3 — Redacted memory trace

01  Earlier conversation
       ↓
02  message A · B · C
       ↓
03  structured patch created — mp_8ae2…124f
       ↓
04  projected into memory item — mi_5ac4…13f7d
       ↓
05  later chat request
       ↓
06  fallback / hybrid retrieval
       ↓
07  evidence injected — model can reference prior detail

This gives a concrete chain: messages → memory_patch → memory_items → read plan → evidence eligibility. What the chain does not establish on its own is that a later assistant reply explicitly named the artifact in a user-visible response. That stronger claim would require a tighter request-level reconstruction.

The narrower, more precise statement: the storage path is proven, the retrieval path is proven, and later resurfacing through patch recall or hybrid retrieval is technically supported by the deployed system.

Redacted artifacts

// memory_patches
{
  "id": "mp_8ae2…124f",
  "supporting_message_ids": ["A", "B", "C"],
  "story_thread_id": "sth_b3be…63d",
  "ok": true,
  "payload_json": {
    "entities": [{"display_name": ""}],
    "events":   [{"event_type": "life_event"}]
  }
}

// memory_items
{
  "id": "mi_5ac4…13f7d",
  "source_type": "chat_patch",
  "source_id":   "mp_8ae2…124f:entity:0",
  "item_type":   "fact",
  "content_text":"",
  "embedding":   null
}

// memory_read_runs
{
  "plan_source": "fallback",
  "memory_patches_n": 12,
  "plan_json": {
    "queries": [
      {"scope": "conversation",  "days": 30},
      {"scope": "user",          "days": 90},
      {"scope": "story_thread",  "days": 30}
    ]
  }
}
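
Walking the lineage back from a memory item to its supporting messages is mostly string handling. A minimal sketch, assuming `source_id` always has the `patch_id:kind:index` shape seen above; `parse_source_id` and `trace_item` are hypothetical helpers, and the ids in the usage note are placeholders rather than the redacted ones.

```python
def parse_source_id(source_id):
    """Split 'patch_id:kind:index' into its three parts."""
    patch_id, kind, index = source_id.rsplit(":", 2)
    return patch_id, kind, int(index)


def trace_item(item, patches_by_id):
    """Return the lineage of a patch-derived memory item, or None if untraceable."""
    if item.get("source_type") != "chat_patch":
        return None  # stories and other sources need their own lineage rules
    patch_id, kind, index = parse_source_id(item["source_id"])
    patch = patches_by_id.get(patch_id)
    if patch is None:
        return None
    return {
        "patch_id": patch_id,
        "kind": kind,
        "index": index,
        "message_ids": list(patch["supporting_message_ids"]),
    }
```

With a patch stored under `"mp_demo"` and an item whose `source_id` is `"mp_demo:entity:0"`, `trace_item` returns the patch id, the kind `"entity"`, index `0`, and the supporting message ids.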

Hybrid retrieval is the real production story

It is easy to describe memory retrieval as vector search and move on. Memory V2 in Glia is more honest and more robust than that. The retrieval logic combines vector recall, keyword recall, and weighted fusion, a pattern often called hybrid retrieval:

# app/core/memory_retrieval_v2.py
# Stage 1: vector recall
if query_embedding:
    vector_results = vector_search(...)

# Stage 2: keyword recall
kw_items = keyword_search(...)

score = (
    W_SEMANTIC * semantic
    + W_KEYWORD * keyword
    + W_RECENCY * recency
    + W_SOURCE_QUALITY * source_q
    + W_IMPORTANCE * importance
)

Vector and keyword recall are fused with recency, source quality, and importance weights — not a single similarity score.

Production nuance

Not every memory item has an embedding. In production systems, hybrid retrieval matters because vector coverage is often incomplete. Keyword overlap and recency-based patch recall remain important parts of the memory path.

A system that assumes perfect vector coverage is elegant in theory. A system that combines vector and lexical retrieval is usually stronger in practice.
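
To make the fusion concrete, here is a self-contained sketch of a weighted score that still ranks items with a null embedding. The weights, the half-life, and every helper name are assumptions for illustration; only the five signal names come from the excerpt above.

```python
import math

# Hypothetical weights; the note does not state the production values.
W_SEMANTIC, W_KEYWORD, W_RECENCY, W_SOURCE_QUALITY, W_IMPORTANCE = 0.4, 0.25, 0.15, 0.1, 0.1


def cosine(a, b):
    """Cosine similarity clamped to [0, 1]; returns 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return max(0.0, dot / norm) if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of query terms that appear in the item text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def recency(age_days, half_life_days=30.0):
    """Exponential decay: 1.0 for a brand-new item, 0.5 after one half-life."""
    return 0.5 ** (max(0.0, age_days) / half_life_days)


def fused_score(item, query, query_embedding=None):
    embedding = item.get("embedding")
    # No embedding means no semantic credit, but the item still competes
    # on keyword overlap, recency, source quality, and importance.
    semantic = cosine(query_embedding, embedding) if query_embedding and embedding else 0.0
    return (
        W_SEMANTIC * semantic
        + W_KEYWORD * keyword_overlap(query, item.get("content_text", ""))
        + W_RECENCY * recency(item.get("age_days", 0.0))
        + W_SOURCE_QUALITY * item.get("source_quality", 0.5)
        + W_IMPORTANCE * item.get("importance", 0.5)
    )
```

An item with `"embedding": None` scores lower than an otherwise identical item with a matching vector, but it never drops to zero, which is why it stays retrievable.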

Deterministic, heuristic, and LLM-driven layers

The cleanest way to reason about Glia's memory stack is to separate its different kinds of logic:

Stage	Kind
Message persistence	Deterministic
Fallback retrieval plan	Deterministic
Patch extraction	LLM-driven
Story / theme composition	LLM-driven
Hybrid ranking	Heuristic
Final reply	LLM under constraints

A concise way to put it: the database remembers, the retriever selects, the model interprets under constraints. That is more useful than saying "the model remembered."

Why this feels seamless to users

The complexity is hidden at the right layer. Users do not see patch ids, retrieval plans, or evidence JSON. They see a final reply that incorporates relevant prior context. Because retrieval happens before generation, and because the writer contract encourages proactive use of memory evidence, the output reads like direct recall rather than assembled grounding.

Why it feels seamless

The product feels continuous because retrieval happens before writing, and the writer model sees a bounded evidence payload rather than full history.

Tradeoffs and failure modes

Compression risk. Structured patches are compact and retrievable, but they can be wrong or subtly distorted (see also: Reducing token cost).
Durability. A detail only survives over time if it remains accessible through patches, memory items, stories, entities, or notes.
Observability. Internal rows like memory_read_runs provide useful telemetry, but user-facing provenance is still limited.
Retrieval coverage. Null embeddings mean not every memory item participates equally in vector search, which raises the importance of lexical retrieval and recency-based patch recall.
Telemetry ambiguity. Some read-side signals still blur the difference between "planner skipped" and "planner failed," which makes debugging harder than it needs to be.

These are not signs that the architecture is weak. They are signs that memory in production is a systems problem, not a slogan.

What this architecture gets right

It avoids the cost and unreliability of sending full history to the model every turn.
It preserves raw source-of-truth messages while creating compact derived artifacts.
It separates write-time extraction from read-time retrieval.
It allows narrative systems like stories and entities to feed into the same memory substrate.
It preserves lineage through supporting_message_ids, which is far better than treating all summaries as free-floating facts.

What should improve next

Telemetry should distinguish more clearly between "planner skipped" and "planner failed."
Internal tooling should expose top evidence artifacts per request, including patch and memory item ids.
Embedding coverage should be treated as an operational metric, not an invisible implementation detail.
Lightweight provenance UX could surface whether a detail came from earlier chat summaries, story artifacts, or other memory sources — without exposing raw internals.
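
The first item is cheap to prototype. A sketch, assuming a hypothetical `planner_outcome` column on the read-run row; `"fallback"` is the `plan_source` value seen in the trace, while `"planner"` and everything else here are my own names.

```python
from enum import Enum


class PlannerOutcome(str, Enum):
    SKIPPED = "skipped"      # the gate decided the planner was not needed
    FAILED = "failed"        # the planner ran and errored or timed out
    SUCCEEDED = "succeeded"  # the planner produced the plan that was used


def classify_read_run(gate_passed, planner_error=None):
    """Build the telemetry fields that separate 'skipped' from 'failed'."""
    if not gate_passed:
        outcome = PlannerOutcome.SKIPPED
    elif planner_error:
        outcome = PlannerOutcome.FAILED
    else:
        outcome = PlannerOutcome.SUCCEEDED
    return {
        "planner_outcome": outcome.value,
        # Both skipped and failed runs end up on the deterministic plan.
        "plan_source": "planner" if outcome is PlannerOutcome.SUCCEEDED else "fallback",
        "planner_error": planner_error if outcome is PlannerOutcome.FAILED else None,
    }
```

Skipped and failed runs both report `plan_source = "fallback"`, which is exactly the ambiguity today. The extra `planner_outcome` field is what lets a dashboard tell them apart.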

Closing

Glia's memory is not a single module and not a single model behavior. It is an architecture. Messages live in Postgres. Background jobs compress them into structured artifacts. Those artifacts are indexed, retrieved, and assembled into bounded evidence. The model then writes inside those rails.

That is a more useful way to think about continuity than asking whether the model "really remembers." The better question is:

What artifacts should become durable memory, how should they be indexed, and what evidence should be injected on each turn?

That is where memory becomes a product system rather than a metaphor.

Originally published at https://dearartist.xyz/blog/glia-actually-remembers.

From Claude HUD to My Own Status Line

Fri, 17 Apr 2026 16:00:00 GMT

Two lines, one horizontal scan: model · plan · project · branch — context · usage · weekly.

Intro

Claude Code's status line is small, but I look at it constantly. It sits at the bottom of every session, summarising who I'm talking to, where I'm working, and how much budget is left. When the layout is stable, it fades into the background. When it isn't, it becomes surprisingly distracting.

This note walks through how I went from installing a third-party plugin to tracing its rendering path, finding a few layout issues, watching some get patched upstream, and eventually open-sourcing a small replacement that matches the two-line layout I wanted from the start.

What I wanted from the status line

My target UX was specific, not vague:

Line 1: model + plan + project + git branch
Line 2: context + usage + weekly usage, all on the same line

Compact, horizontally scannable, predictable. This is not an aesthetic preference — it is about making the data easy to scan during real use. I wanted to know, in one glance, which model I'm on, which project I'm in, and how close I am to my limits. Two lines. Same shape every time.

What I changed

The goal was never "make it pretty." It was "make the same information land in the same place every time." Layout stability beats visual polish.

Starting with claude-hud

The obvious starting point was claude-hud. It is well-presented, it ships through Claude Code's plugin system, and the README screenshots showed something close to the layout I wanted. Installation is a four-command flow inside Claude Code itself.

/plugin marketplace add jarrodwatts/claude-hud
/plugin install claude-hud
/reload-plugins
/claude-hud:setup

Within a minute I had a status line. Within five minutes I knew it wasn't quite the layout I wanted. That gap is what this note is about.

What worked

Plugin installation worked first try.

It exposed the key pieces of session state I cared about: model, context, usage, and git/project info.

It made the statusLine surface concrete enough that I could debug it instead of guessing.

The README and examples made the plugin feel approachable as a starting point.

It made the available stdin payload concrete, which mattered later when I built my own.

None of this is faint praise. claude-hud was a perfectly reasonable starting point. It taught me how the rendering surface worked. It also gave me a clear reference for what "good enough" looked like before I knew what I actually wanted.

Investigation

The README implied a layout close to this:

Expected:
[Sonnet 4.6 | Max] | project-name git:(main)
Context ████░░ 40% | Usage ██░ 20% | Weekly █ 8%

Observed in my env:
Sonnet 4.6 | project-name
Context ████░░ 40%
Usage ██░ 20%
Weekly █ 8%

In real use, the second line stacked into three or four rows. The data was the same, but the scanning experience was very different — instead of one horizontal bar of state, I had a little column of independent facts.

Before assuming this was a bug, I checked the obvious sources of a mismatch: my configuration (compact vs expanded), the stdin payload Claude Code was sending, terminal width detection, and claude-hud's own rendering logic. I traced this through the source rather than guessing — where the layout is rendered, how compact differs from expanded, what happens when rate_limits is missing, and how width is detected when stdout is a pipe instead of a TTY.

Each of those turned out to matter. None of them in isolation explained the full picture.

Root cause

There were four distinct factors, not one.

1. Terminal width fallback (primary bug)

Claude Code launches the status line as a subprocess with stdout piped, not a TTY. That means standard terminal width detection can fail. When all fallbacks failed, claude-hud used:

UNKNOWN_TERMINAL_WIDTH = 40

40 columns is far narrower than any modern terminal. Under that constraint, the horizontal layout had no choice but to wrap into stacked rows. In my environment, this was the main reason the README-style layout didn't appear.
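
For illustration, here is how a defensive width probe might look in Python. This is not claude-hud's code (the plugin is not written in Python). `detect_width` and `DEFAULT_WIDTH` are hypothetical names, and whether any of the standard streams is still attached to the terminal depends on how the host launches the subprocess.

```python
import os

DEFAULT_WIDTH = 80  # last-resort fallback when nothing can be measured


def detect_width(default=DEFAULT_WIDTH):
    """Best-effort terminal width for a process whose stdout may be a pipe."""
    # 1. Honour an explicit COLUMNS value if the environment provides one.
    cols = os.environ.get("COLUMNS", "")
    if cols.isdigit() and int(cols) > 0:
        return int(cols)
    # 2. Probe each standard stream: stdout may be piped while stderr
    #    or stdin is still connected to the real terminal.
    for fd in (1, 2, 0):
        try:
            width = os.get_terminal_size(fd).columns
        except (OSError, ValueError):
            continue  # this stream is not a terminal
        if width > 0:
            return width
    # 3. Nothing worked, so fall back to a conservative default.
    return default
```

The important design choice is the last line: when every probe fails, the default decides the layout, so it should be a width real terminals actually have.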

2. Expanded layout pre-split context and usage too early

The expanded renderer pre-split context and usage before the smarter wrapping logic could decide whether to keep them together. Combined with the 40-column fallback, this made the layout fragile in a predictable way.

3. Missing rate_limits data from Claude Code stdin

When rate_limits is absent from the stdin payload, the usage block becomes incomplete or disappears entirely. Even with correct layout logic, the visible output can still diverge from the README because the data itself is incomplete.

4. README screenshots represent ideal conditions

The README implicitly assumes that terminal width is detected correctly, that rate_limits is present, that the model name is short, and that the outer Claude Code UI isn't consuming critical width. It's a useful conceptual example, not a guaranteed representation of every runtime environment.

None of this reduces to "user error." Even with reasonable configuration, the 40-column fallback could force the HUD into a stacked layout, and switching to compact mode could make it worse, because compact emits one long line that wraps aggressively when width detection fails. I couldn't find a config-only path that reproduced the README layout reliably in my environment. In my case, the issue was primarily implementation and runtime behavior, not configuration.

Root cause

One implementation bug (width fallback), one rendering ordering bug (pre-split), one runtime data gap (rate_limits), and one expectation gap (README under ideal conditions). Each looks like the others until you separate them.

What got fixed in claude-hud

The investigation surfaced two real rendering issues, and fixes were applied for both.

Terminal width fallback was raised from 40 → 80.
Expanded rendering no longer pre-splits context and usage. They are combined first and then passed through the smarter wrapping logic, so they only split when there isn't enough room.
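
The "combine first, split only when needed" idea can be sketched as a greedy wrap over segments. This is a Python illustration of the technique, not the plugin's implementation, and `wrap_segments` is a hypothetical name.

```python
def wrap_segments(segments, width, sep=" | "):
    """Greedily pack segments onto lines, starting a new line only on overflow."""
    lines = []
    current = ""
    for segment in segments:
        candidate = segment if not current else current + sep + segment
        if current and len(candidate) > width:
            # The next segment does not fit, so close the line and start over.
            lines.append(current)
            current = segment
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


segments = ["Context ████░░ 40%", "Usage ██░ 20%", "Weekly █ 8%"]
for width in (80, 40, 20):
    print(width, wrap_segments(segments, width))
```

At 80 columns all three segments share one line, at 40 Weekly drops to a second line, and at 20 every segment gets its own row. `len()` counts code points, which is adequate here because the bar glyphs are single-width.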

The effect was incremental but real:

Before either fix: often 4 rows.
After the width fix: typically 3 rows.
After the rendering fix: context and usage group much more intelligently.
With a correctly detected real terminal width, the layout becomes much closer to the README expectation.

What got fixed

Two real implementation fixes — not cosmetic tuning. The width fallback was a bug. The pre-split ordering was a bug. Both are now better.

What still remained unsolved

Even after both fixes, the result still depends on runtime conditions.

If the real terminal width is still not detected and the renderer falls back to 80 columns, Context + Usage + Weekly may still not fit on one line, so Weekly drops to a third row.
If rate_limits is missing from stdin, usage info still disappears or becomes incomplete.
The "Max" plan label I wanted was never implemented in claude-hud at all. This is not a config issue and not a bug — the codebase simply does not render a plan field. Adding it would require Claude Code to include that field in stdin and claude-hud to parse and render it.

What still remained

Some things were bugs that got fixed. Some things were runtime conditions that no plugin can fully control. Some things were missing features. They look the same from the outside, and they aren't.

Why I still replaced it

Even after the fixes, I still chose to build my own. Not because claude-hud was bad — it had genuinely improved — but because my requirements were narrow and specific:

Full layout control.
Deterministic output, every session.
A graceful fallback when rate_limits is missing.
A renderer small enough that I could understand and change end-to-end.
A final UI that matches my workflow exactly, not approximately.

At that point, a small custom renderer felt simpler than continuing to tune wrapping behavior indirectly. For a surface this small, owning the renderer was easier to maintain than negotiating with someone else's heuristics.

My custom open-source solution

The result lives here: yuannh/claude-code-custom-statusline. It replaces plugin rendering with a small Python script that Claude Code invokes through its native statusLine command interface.

Wire it up in ~/.claude/settings.json:

{
  "enabledPlugins": {
    "swift-lsp@claude-plugins-official": true
  },
  "statusLine": {
    "type": "command",
    "command": "python3 \"$HOME/.claude/statusline.py\""
  }
}

The script itself is intentionally short and readable.

#!/usr/bin/env python3
import sys
import json
import os
import subprocess
from datetime import datetime, timezone
from pathlib import Path

CACHE = Path.home() / ".claude" / "statusline-debug.json"

def bar(pct, width=10):
    try:
        pct = float(pct)
    except Exception:
        pct = 0
    pct = max(0, min(100, pct))
    filled = int(round(width * pct / 100))
    return "█" * filled + "░" * (width - filled)

def time_left(ts):
    try:
        ts = int(ts)
        now = int(datetime.now(timezone.utc).timestamp())
        diff = max(0, ts - now)
        d = diff // 86400
        h = (diff % 86400) // 3600
        m = (diff % 3600) // 60
        if d > 0:
            return f"{d}d {h}h"
        if h > 0:
            return f"{h}h {m}m"
        return f"{m}m"
    except Exception:
        return ""

def git_text():
    project = os.path.basename(os.getcwd())
    try:
        branch = subprocess.check_output(
            ["git", "branch", "--show-current"],
            stderr=subprocess.DEVNULL,
            text=True
        ).strip()
    except Exception:
        branch = ""

    dirty = ""
    try:
        inside = subprocess.run(
            ["git", "rev-parse", "--is-inside-work-tree"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL
        ).returncode == 0
        if inside:
            dirty_work = subprocess.run(
                ["git", "diff", "--quiet"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL
            ).returncode != 0
            dirty_index = subprocess.run(
                ["git", "diff", "--cached", "--quiet"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL
            ).returncode != 0
            if dirty_work or dirty_index:
                dirty = "*"
    except Exception:
        pass

    if branch:
        return f"{project} git:({branch}{dirty})"
    return project

def load_json(text):
    try:
        return json.loads(text) if text else {}
    except Exception:
        return {}

raw = sys.stdin.read().strip()
current = load_json(raw)

cached = {}
if CACHE.exists():
    try:
        cached = json.loads(CACHE.read_text())
    except Exception:
        cached = {}

if current.get("rate_limits"):
    try:
        CACHE.write_text(json.dumps(current))
    except Exception:
        pass

if not current.get("rate_limits") and cached.get("rate_limits"):
    current["rate_limits"] = cached["rate_limits"]

for key in ["model", "context_window", "workspace", "cwd"]:
    if not current.get(key) and cached.get(key):
        current[key] = cached[key]

data = current

model = data.get("model", {}).get("display_name") or "Sonnet 4.6"
plan = "Max"

ctx = data.get("context_window", {}).get("used_percentage")
if ctx is None:
    ctx = 0

rate_limits = data.get("rate_limits") or {}
five = rate_limits.get("five_hour") or {}
week = rate_limits.get("seven_day") or {}

# A missing or null percentage becomes 0 so the int() calls below cannot fail.
five_pct = five.get("used_percentage") or 0
week_pct = week.get("used_percentage") or 0

five_left = time_left(five.get("resets_at"))
week_left = time_left(week.get("resets_at"))

line1 = f"[{model} | {plan}] | {git_text()}"
line2 = (
    f"Context {bar(ctx)} {int(ctx)}% | "
    f"Usage {bar(five_pct)} {int(five_pct)}%"
    + (f" ({five_left})" if five_left else "")
    + " | "
    f"Weekly {bar(week_pct)} {int(week_pct)}%"
    + (f" ({week_left})" if week_left else "")
)

print(line1)
print(line2)

It reads stdin JSON, caches the last good payload to ~/.claude/statusline-debug.json, falls back to the cache when rate_limits is missing, and prints exactly two lines. No wrap heuristics. No conditional layouts. Same shape, every time.
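
To exercise the script outside Claude Code, it helps to have a payload generator. The sketch below includes only the fields the script above reads; the real stdin payload may carry more, and `sample_payload` and the file name in the usage note are hypothetical.

```python
import json
import time


def sample_payload(now=None):
    """Build a minimal stdin payload with just the fields the script reads."""
    now = int(time.time() if now is None else now)
    return {
        "model": {"display_name": "Sonnet 4.6"},
        "context_window": {"used_percentage": 40},
        "rate_limits": {
            # resets_at is a Unix timestamp in seconds, as time_left() expects.
            "five_hour": {"used_percentage": 20, "resets_at": now + 2 * 3600},
            "seven_day": {"used_percentage": 8, "resets_at": now + 3 * 86400},
        },
    }


if __name__ == "__main__":
    print(json.dumps(sample_payload()))
```

Saved as, say, sample_payload.py, it can be piped straight into the renderer: python3 sample_payload.py | python3 ~/.claude/statusline.py. Deleting the rate_limits key from the dict is a quick way to check the cache fallback.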

What my implementation fixes

What I changed

Deterministic two-line layout.

Stable horizontal second line.

Cache fallback when rate_limits is missing.

Simpler rendering path — no width detection at all.

Easier debugging and full ownership.

Explicit layout control instead of indirect tuning.

It is not trying to be more capable than claude-hud. It is simply narrower, smaller, and easier for me to reason about.

Side-by-side comparison

README expected:
[Sonnet 4.6 | Max] | proj git:(main)
Context ███░ 40% | Usage ██ 20% | Weekly █ 8%

Actual claude-hud:
Sonnet 4.6 | proj
Context ███░ 40%
Usage ██ 20%
Weekly █ 8%

Patched claude-hud:
Sonnet 4.6 | proj git:(main)
Context ███░ 40% | Usage ██ 20%
Weekly █ 8%

My implementation:
[Sonnet 4.6 | Max] | proj git:(main*)
Context ███░ 40% | Usage ██ 20% | Weekly █ 8%

Screenshots — before vs after

These are CSS recreations of what the status line actually rendered as in each phase — same data, different layout behavior.

Status line evolution across four iterations.

The open-source repo on GitHub.

Links and repositories

jarrodwatts/claude-hud — The original plugin I started from and investigated. A great way to understand the Claude Code statusLine surface.
yuannh/claude-code-custom-statusline — My smaller, purpose-built implementation for a deterministic two-line layout. Python, no dependencies, ~110 lines.

Lessons learned

Plugins are a great starting point. Sometimes they aren't the final answer.
Rendering bugs can hide behind what looks like a config mismatch.
Terminal width detection matters far more than it first appears, especially when stdout is piped.
Documentation screenshots usually represent ideal conditions, not all runtime realities.
Even when a tool improves, owning the final rendering path can be the better tradeoff if the UX needs to be exact.
Source-level investigation is often the fastest way to separate user error from real implementation problems.

Closing reflection

I started by trying to customize a plugin. I ended up tracing the rendering path, understanding the limits of the original implementation, and open-sourcing a version that matched the interface I actually wanted.

In the end, I didn't need a more feature-rich status line. I needed one whose layout I could predict. For my workflow, that turned out to be easier to build than to negotiate. That tradeoff won't be right for everyone, but it was the right one for me.

Originally published at https://dearartist.xyz/blog/claude-statusline.

My Terminal Workstack: Ghostty + Yazi + lazygit + Claude Code

Fri, 17 Apr 2026 15:00:00 GMT

One terminal window, four tools, and a workflow that finally feels like a workstation instead of a command line.

Three panes, one terminal: Claude Code · Yazi · lazygit.

Stack Summary

Ghostty → terminal and panes
Yazi → file navigation
lazygit → Git workflow
Claude Code → AI coding layer

Why I moved more of my work into the terminal

For a while, my day looked like this: Finder for files, a GUI Git client for commits, a browser tab for searching the codebase, an IDE for editing, another window for the terminal, and a separate chat window for AI. Each tool was fine in isolation. The context switching was not.

I wanted one place where navigation, code context, Git operations, and AI-assisted development could sit next to each other — composable, focused, and fast. The terminal turned out to be the most honest answer. Not because the terminal is morally superior, but because it lets me arrange the pieces I actually use into a single layout that never moves.

The stack

Four tools, each doing one thing well.

Ghostty — terminal

The foundation. Fast, GPU-accelerated, native-feeling on macOS, and serious about split panes. It's the first terminal I've used where the window itself feels like part of the work, not a frame around it.

GitHub
Docs

Yazi — file manager

A blazing-fast terminal file manager written in Rust. I use it to move through projects and directories without lifting my hands off the keyboard, and to drop back into a shell exactly where I left off.

GitHub
Docs

lazygit — git ui

A keyboard-first Git UI. Staging hunks, reviewing diffs, writing commit messages, rebasing — all without typing long commands or leaving the terminal. The thing GUI Git clients always wanted to be.

GitHub
Wiki

Claude Code — ai layer

The AI layer in the terminal. I use it for repo-level understanding, code audits, implementation help, and quick iteration on patches. It reads the actual code, not a summary I had to paste in.

Docs
GitHub

How the workflow fits together

A typical session looks like this.

Open Ghostty. Split it into a left pane and a right pane.
In the left pane, run yy to launch Yazi and jump into the project I want.
In the right pane, start Claude Code. Ask it to explain a module, find usages, or sketch a fix.
Split the right pane into top and bottom halves (cmd+shift+d in the config below). The top half stays with Claude. The bottom half opens lazygit so I can watch the working tree change as I edit.
Edit in my editor of choice (often invoked from the same terminal). Stage, review, and commit in lazygit. Push.

Nothing here is novel on its own. The point is that all of it lives in one window, in panes I can see at a glance, with the same key bindings I use for everything else.

What I actually like about this setup

Fewer context switches. Files, code, Git, and AI are visible at the same time. I stop losing my train of thought between apps.
More keyboard flow. Once your hands stop moving to the trackpad, you start thinking in steps instead of clicks.
Better repo awareness. Yazi keeps the directory tree in my peripheral vision, so I always know where I am.
Faster Git reviews. lazygit's diff view makes it easy to actually read what I'm about to commit instead of rubber-stamping it.
Terminal as a real workspace. Not a command prompt I tolerate, but a layout I designed.

My basic setup and config

The relevant pieces from my dotfiles. Nothing exotic — just the things that earn their keep.

Ghostty config

Lives at ~/.config/ghostty/config.

# =====================================
# Ghostty — macOS polished setup
# =====================================
font-family = JetBrainsMono Nerd Font Mono
font-size = 13
window-padding-x = 18
window-padding-y = 16
cursor-style = block
cursor-style-blink = false
theme = Catppuccin Mocha
background-opacity = 0.96
background-blur-radius = 20
macos-titlebar-style = transparent
scrollback-limit = 300000
shell-integration = detect
clipboard-read = allow
clipboard-write = allow
copy-on-select = false
window-inherit-working-directory = true
confirm-close-surface = true
mouse-scroll-multiplier = precision:1.0,discrete:3
keybind = cmd+t=new_tab
keybind = cmd+w=close_surface
keybind = cmd+n=new_window
keybind = cmd+]=next_tab
keybind = cmd+[=previous_tab
keybind = cmd+d=new_split:right
keybind = cmd+shift+d=new_split:down
keybind = alt+left=goto_split:left
keybind = alt+right=goto_split:right
keybind = alt+up=goto_split:up
keybind = alt+down=goto_split:down
keybind = cmd+alt+left=resize_split:left,10
keybind = cmd+alt+right=resize_split:right,10
keybind = cmd+alt+up=resize_split:up,10
keybind = cmd+alt+down=resize_split:down,10
keybind = cmd+ctrl+f=toggle_fullscreen
keybind = cmd+shift+p=toggle_command_palette

Yazi shell helper

The yy function launches Yazi and, when you quit, drops your shell into whatever directory you ended up in.

function yy() {
    local tmp="$(mktemp -t "yazi-cwd.XXXXXX")" cwd
    yazi "$@" --cwd-file="$tmp"
    if cwd="$(cat -- "$tmp")" && [ -n "$cwd" ] && [ "$cwd" != "$PWD" ]; then
        cd -- "$cwd"
    fi
    rm -f -- "$tmp"
}

Helpful aliases

alias codehome='cd ~/Code'
alias lg='lazygit'

Typical usage

yy ~/Code
cd ~/Code/glia-core
lazygit

Notes on how I use Git in the terminal

lazygit doesn't change what Git is. It just makes the parts you do every day easier to see. The mental model I keep in mind:

Working tree — what's actually changed on disk.
Staging — picking which of those changes belong in the next commit.
Commit — saving a meaningful checkpoint with a message you'd be willing to read later.
Push — publishing it to the remote so it stops being only yours.

When all four are visible in a pane next to my code, I commit smaller, stage more deliberately, and write better messages. That's the whole pitch.

What this setup is not

It's not a manifesto for ditching every GUI forever. I still use a browser, a Figma window, and a real editor. I'm not pretending the terminal is universally better.

It's also not about terminal purity. If a tool only exists as a GUI and it's good, I use it.

What this setup does is reduce the friction in the part of my day where I'm actually building. The value isn't in any single tool. It's in the composition — four small, well-behaved pieces sitting next to each other, doing one job each, never getting in the way.

Resources

Ghostty — Terminal emulator with native UI and GPU acceleration
- GitHub
- Docs
Yazi — Async terminal file manager written in Rust
- GitHub
- Docs
lazygit — Keyboard-first terminal UI for Git
- GitHub
- Wiki
Claude Code — Terminal-native AI coding workflow
- Docs
- GitHub

Still evolving. Like any good terminal setup.

Originally published at https://dearartist.xyz/blog/terminal-workstack.

Reducing Token Cost While Working on Glia

Fri, 17 Apr 2026 11:00:00 GMT

Engineering Note · Glia · AI-Assisted Development · Cost Optimization · Codebase Governance

2026-04-17 09:30 · 12 min read · Engineering, AI-Assisted Development, Cost Optimization, Codebase Governance, Refactoring

A practical governance pass on an AI-heavy codebase: separating fixed prompt noise from real context cost, extracting safe helpers, validating aggressively, and stopping before optimization became a refactor project.

Key Findings

In Glia, the main token cost driver was repeated large-file context loading, not just prompt overhead.

Trimming CLAUDE.md helped reduce fixed noise, but it was not the main billing lever.

Extracting pure helper functions improved file-load granularity for narrow AI queries.

The biggest remaining cost center is still orchestration-level reasoning inside giant job bodies.

Opening

While working on Glia, I ran into a practical problem: our AI-assisted development workflow was getting expensive in ways that were easy to feel but hard to measure. I was not trying to clean up the codebase for aesthetics. I was trying to understand what was actually driving token usage, and which changes would reduce cost without turning the effort into an uncontrolled refactor.

This was an engineering economics problem more than a code quality one. Repeated context loading of giant files, a long always-applied CLAUDE.md, and dense iterative debugging loops were quietly compounding. The instinct to make things "cleaner" is not the same instinct as making things "cheaper", and the two diverge quickly once you look at where the bytes actually go.

The Glia-specific problem

The concrete situation was easy to characterize once I sat down with the file sizes. A handful of files dominated the working surface that AI tooling kept reaching into.

File	Size	Notes
`extract_story_job.py`	~10,598 lines · ~467 KB	Compose pipeline orchestration
`app/api/cards.py`	~223 KB	Card-surface API
`theme_story_service.py`	~175 KB	Theme story assembly
`CLAUDE.md`	~18 KB (pre-cleanup)	Always-applied instructions

At first glance, there were two obvious suspects: the oversized CLAUDE.md injected into every conversation, and several extremely large Python files, especially extract_story_job.py.

The first important insight was that the instinct to trim CLAUDE.md was directionally right, but incomplete. It was fixed overhead. The much more expensive pattern was repeated full-file loading of giant source files whenever AI tooling needed even a tiny helper function, a local rule, or a narrow explanation of behavior. The same 467 KB was being shipped into context for questions that genuinely only needed 200 lines.

Fixed noise vs dynamic cost

The reframing that mattered was separating two things that often get conflated in AI cost discussions:

Fixed noise	Dynamic cost
`CLAUDE.md`	giant source files
always-applied context	repeated full-file loads
lower leverage	much higher leverage

In Glia, CLAUDE.md still mattered, but mainly as context pollution rather than the primary billing driver. The true cost driver was large-file ingestion during repeated debugging and code review cycles. That changed the optimization strategy from "compress everything" to "reduce the granularity of what needs to be loaded."
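To make the difference in leverage concrete, here is a back-of-envelope sketch. It assumes roughly 4 bytes per token, which is a common rule of thumb and not a measured figure for any particular tokenizer, and it uses only the file sizes from the table above. The absolute numbers are rough; the ratio between a fixed instruction file and one full load of the largest source file is the point.

```python
# Rough context-size estimate from file sizes.
# BYTES_PER_TOKEN is an assumed heuristic, not a tokenizer measurement.
BYTES_PER_TOKEN = 4

# Sizes in KB, taken from the file table earlier in this post.
FILE_SIZES_KB = {
    "CLAUDE.md (pre-cleanup)": 18,
    "theme_story_service.py": 175,
    "app/api/cards.py": 223,
    "extract_story_job.py": 467,
}


def estimate_tokens(size_kb: float, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Approximate token count for one full load of a file of this size."""
    return round(size_kb * 1024 / bytes_per_token)


estimates = {name: estimate_tokens(kb) for name, kb in FILE_SIZES_KB.items()}

for name, tokens in estimates.items():
    print(f"{name:26s} ~{tokens:>7,} tokens per full load")

# How many copies of CLAUDE.md fit into one load of the largest file.
ratio = FILE_SIZES_KB["extract_story_job.py"] / FILE_SIZES_KB["CLAUDE.md (pre-cleanup)"]
print(f"extract_story_job.py is about {ratio:.0f}x the size of CLAUDE.md")
```

Under that assumption, a single full load of extract_story_job.py costs about as much context as 26 copies of the pre-cleanup CLAUDE.md, which is why shrinking what gets loaded matters more than shrinking the instruction file.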

Process flow

Cost Investigation  →  CLAUDE.md Trim  →  Helper Extraction  →  Verification  →  Stop / Observe
(What actually        (Reduce fixed       (Smaller modules)     (Tests +         (Measure,
 loads)                noise)                                    container)       don't refactor)

Phase 1: Trimming CLAUDE.md

The first move was the low-risk one. I went through CLAUDE.md and removed historical rollout and recovery sections that no longer affected current decision-making. I preserved only the still-relevant constraints, and merged isolated environment variables back into the main feature flag table so they would be findable in one place.

✓ Kept	✗ Removed
Still-active env variables	Completed rollout history
The one still-open issue worth preserving	Resolved issue logs
Active constraints such as the `SM-5` guard	Validation snapshots
	Temporary monitoring notes
	Recovery narrative no longer affecting behavior

The result was modest in raw numbers: CLAUDE.md went from roughly 385 lines to 322 lines, and from about 18 KB to about 13.5 KB. This was worth doing, but not because it would dramatically lower cost by itself. Its real value was reducing fixed noise and improving the quality of the model's working context. A leaner instruction file is easier to keep coherent and easier to reason about during review.

Phase 2: Reducing file-load granularity

The higher-leverage move was structural rather than textual. Instead of debating giant files in the abstract, I extracted pure helper functions out of extract_story_job.py into smaller modules so future AI queries could load narrow-purpose files instead of a 467 KB monolith.

My rule was intentionally strict: only move helpers that were pure, self-contained, independently queryable, and one-way-import safe. No DB access. No LLM calls. No Celery or task context. No orchestration logic. The whole point was that the extracted modules had to be safe to load in isolation.

Honestly, the first pass overshot the original minimal plan. Instead of extracting only a tiny timeline helper cluster, the change moved a broader pure-function surface into two new modules:

app/core/story_text_utils.py
app/core/story_identity_guard.py

Going broader immediately raised the bar for validation, because once you stop being minimal you can no longer rely on minimality as your safety argument.

Concrete examples from the Glia codebase

extract_story_job.py contained helper logic that AI tools might need to inspect independently: tokenization rules, sentence normalization, and identity-token comparison. Before extraction, those narrow questions often dragged in the whole 467 KB file. After extraction, helper-level questions could often target story_text_utils.py or story_identity_guard.py instead, which changed the loading shape for this kind of rule-level debugging.

# Imports after extraction
from app.core.story_text_utils import (
    _tokenize_universal,
    _normalize_timeline_sentence,
)
from app.core.story_identity_guard import (
    _identity_tokens,
    _token_jaccard,
)

Loading shape

Before: helper question → load extract_story_job.py (~467 KB)
After:  helper question → load story_text_utils.py / story_identity_guard.py

The governance correction

The most important lesson was not just about code movement. It was about governance. A broader refactor cannot be accepted on the basis of vague or incomplete validation, and the first acceptance pass on this change was honestly a little too soft.

Partial import issues appeared during early checks. Some validation claims were too loose. "All good" was asserted too early, on evidence that did not actually cover the QA worker container. That kind of premature green light is exactly how a small extraction quietly turns into a regression in a runtime path you never tested.

Instead of continuing into more extraction work, I stopped and required a strict verification pass. The required checks were explicit:

Enumerate every moved function.
Confirm no duplicate definitions remained in extract_story_job.py.
Confirm one-way import direction only.
Verify re-export behavior where tests still imported symbols from the old path.
Run a larger, reproducible test set.
Verify import success in the QA worker container.
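Two of these checks (no duplicate definitions, one-way import direction) can be automated with the standard library. The sketch below is illustrative only and is not the script used in Glia: the two source strings are toy stand-ins for the real files, and treating any import from app.jobs as a back-import is an assumption based on the module paths mentioned in this post.

```python
import ast


def defined_functions(source: str) -> set[str]:
    """Names of functions defined at the top level of a module."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def imported_modules(source: str) -> set[str]:
    """Absolute module names referenced by import statements anywhere in the file."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


# Toy stand-ins for extract_story_job.py and story_text_utils.py.
JOB_SOURCE = '''
from app.core.story_text_utils import _tokenize_universal

def extract_story_job():
    return _tokenize_universal("hello world")
'''

HELPER_SOURCE = '''
def _tokenize_universal(text):
    return text.lower().split()
'''

# Assumed rule: helper modules must never import from the jobs package.
FORBIDDEN_PREFIX = "app.jobs"

duplicates = defined_functions(JOB_SOURCE) & defined_functions(HELPER_SOURCE)
back_imports = {
    module
    for module in imported_modules(HELPER_SOURCE)
    if module.startswith(FORBIDDEN_PREFIX)
}

print("duplicate definitions:", sorted(duplicates))
print("back-imports from helper module:", sorted(back_imports))
```

Both sets should be empty for the extraction to count as clean. A static check like this does not replace the container import test, since it never executes the modules and it ignores relative imports.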

Note

This was not a heroic rewrite. It was a constrained governance pass: reduce cost where the evidence was strong, validate aggressively, and stop before the optimization frontier became a refactor project.

What was actually verified

After the corrected verification pass, the outcomes I was willing to stand behind were narrow but solid.

✓ No duplicate moved-function definitions remained in extract_story_job.py.
✓ New modules did not import back from the original giant job file.
✓ Representative behavior checks passed for the moved helper functions.
✓ 191 tests passed in the selected verification scope.
✓ QA worker import check succeeded: `from app.jobs.extract_story_job import extract_story_job` ran without error.

The QA worker check mattered most. It closed the gap between "seems fine in a partial local environment" and "imports correctly in the actual runtime container." Local greens that do not mirror runtime are the exact failure mode I was trying to avoid this time.

# qa-worker · import smoke check
$ docker exec glia-core-worker-1 python -c \
    "from app.jobs.extract_story_job import extract_story_job; print('import OK')"

import OK

What changed structurally

Before	After
Helper logic buried inside a massive job file.	Text and identity helpers are available as smaller modules.
Narrow questions often triggered giant context loads.	Future AI-assisted queries can target smaller files in many cases.
Fixed prompt context contained too much stale history.	`CLAUDE.md` contains less dead historical state.
	Validation discipline improved before accepting structural changes.

What this did NOT solve

Helper extraction solved only one class of cost: narrow helper-level queries. It did not solve the most expensive remaining problem.

⚠ Still Expensive

extract_story_job.py still contains two enormous orchestration bodies:

extract_story_job()

process_story_pipeline_job()

These functions remain the main cost center for compose-flow debugging. Any conversation that needs to understand those paths can still trigger very large context loads. Helper extraction is not the same thing as orchestration decomposition, and conflating the two would be a real architectural mistake.

Why I did not continue into theme_story_service.py

I did analyze theme_story_service.py as the next candidate, and I found a reasonable pure helper cluster that could potentially be extracted later. The file contained candidate pure helpers such as date formatting helpers, summary sanitation helpers, feed summary helpers, and lightweight text transformation helpers.

But I deliberately stopped. The analysis surfaced near-duplicate utility behavior that was not obviously safe to unify without design intent review.

Two concrete examples from the codebase:

_tokenize in theme_story_service.py was close to _tokenize_universal in story_text_utils.py — but their CJK behavior was not identical.
_truncate_at_sentence also looked very close to _truncate_snippet_at_sentence, but "near-duplicate" is not the same as "safe to consolidate".

The analysis showed that theme_story_service.py was a plausible next target, but it also showed why obvious cleanup is sometimes less obvious than it looks. Unifying two functions that behave subtly differently is a design decision, not a tidy-up.
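One way to turn "near-duplicate" into evidence is to run both candidates over a shared probe set and list every input where they disagree. The sketch below is a minimal illustration: the two tokenizers are hypothetical stand-ins written for this example, not the real _tokenize or _tokenize_universal, and the probe strings are invented.

```python
import re
from typing import Callable

Tokenizer = Callable[[str], list[str]]


def tokenize_by_whitespace(text: str) -> list[str]:
    """Hypothetical variant A: lowercase, split on whitespace only."""
    return text.lower().split()


def tokenize_cjk_aware(text: str) -> list[str]:
    """Hypothetical variant B: like A, but each CJK ideograph is its own token."""
    return re.findall(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+", text.lower())


def find_divergences(
    a: Tokenizer, b: Tokenizer, probes: list[str]
) -> list[tuple[str, list[str], list[str]]]:
    """Return (probe, output_a, output_b) for every probe where the outputs differ."""
    return [(p, a(p), b(p)) for p in probes if a(p) != b(p)]


PROBES = [
    "hello world",
    "Project Kaiwen kickoff",
    "東京 trip notes",
    "新加坡",
]

divergences = find_divergences(tokenize_by_whitespace, tokenize_cjk_aware, PROBES)

for probe, out_a, out_b in divergences:
    print(f"{probe!r}: A={out_a} B={out_b}")
```

With these stand-ins, the two functions agree on the ASCII probes and split on the two CJK ones. A non-empty divergence list is the signal that consolidation needs a decision about intended behavior first.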

Why I deliberately stopped

Once the validated low-risk improvements were in place, I stopped. Beyond this point, optimization would no longer be "small extraction work." It would become real refactoring:

Redefining pipeline stage boundaries.
Extracting orchestration phases.
Designing interfaces between evidence selection, compose, gating, and publish.
Increasing coordination and regression risk.

Not every technically possible optimization should be executed immediately. Good engineering governance includes knowing when to stop. The easiest mistake at that point would have been to keep going on momentum and pretend it was still a focused optimization.

Key lessons

Separate fixed noise from dynamic cost. Always-applied context and per-query file loads are different cost regimes. Treat them with different tools.
Large-file loading is often the real billing driver in AI-assisted workflows. Prompt overhead is visible. Repeated multi-hundred-KB ingest is not, but it is usually where the money goes.
"Cleaner" is not the same as "cheaper". A nicer-looking codebase does not automatically reduce token cost. Tie every change to a specific load path you actually want to shrink.
Validation quality matters more when scope expands. The minute a minimal change becomes a broader extraction, the minimality argument disappears. Verification has to scale with scope.
Governance is not bureaucracy; it is how you stop local improvements from turning into uncontrolled refactors. Knowing when to stop is part of the work. Do not continue optimizing without measurement.

What I would do next

Observe real-world token usage after the current changes before committing to more structural work.
Measure cost across typical task types so future decisions are data-grounded, not vibes-grounded.
Determine whether future helper extraction in theme_story_service.py is justified once the duplicate-behavior questions are resolved.
Treat decomposition of the giant orchestration functions as a dedicated engineering sprint, not a side quest.

Closing

The most useful outcome of this pass was not just a smaller CLAUDE.md or a couple of helper modules. It was a clearer mental model of where AI development cost actually comes from inside Glia. Once that became visible, decision-making changed. Some optimizations were worth shipping immediately. Others were worth deferring. That distinction is the difference between optimization as engineering and optimization as noise.

"This was not a refactor sprint. It was a governance pass. The goal was to reduce real AI development cost without confusing motion for progress."

Originally published at: https://dearartist.xyz/blog/glia-token-cost

From cat to bat: A Small CLI Change That Made My Terminal Better

Thu, 16 Apr 2026 10:00:00 GMT

A personal walkthrough of installing and configuring bat — the cat clone with syntax highlighting, Git integration, and theme support.

I don't remember exactly when I first heard about bat. It was probably a passing mention in someone's dotfiles repo, or a one-liner in a Hacker News comment. The premise was simple: it's like cat, but with syntax highlighting, line numbers, and Git-aware diff markers. I installed it, tried it once, and then forgot about it for weeks.

When I finally came back to it, I realized there was more to configure than I expected. The defaults are fine, but getting it to feel right — choosing a theme, understanding how paging works, wiring it into my shell — took a bit of reading and trial and error. This post is that process, written down so I can reference it later and so anyone else starting from scratch doesn't have to piece it together from scattered GitHub issues.

What bat Actually Does

If you've used cat to quickly dump a file to the terminal, bat does the same thing — but makes the output substantially easier to read. Here's what it adds:

Syntax highlighting. bat detects the file type and applies color to code, config files, Markdown, JSON, YAML, and hundreds of other formats. You see structure instead of a wall of monochrome text.
Line numbers. Every line gets a number in the gutter, which makes it easier to reference specific lines when debugging or discussing code.
Git modification markers. If the file is inside a Git repository, bat shows additions, deletions, and modifications in the left margin — similar to what you'd see in a code editor.
Automatic paging. When the output is longer than your terminal window, bat pipes it through a pager (like less) so you can scroll. Short files print directly.
Theme support. bat ships with dozens of themes and supports custom ones. You can match your terminal's color scheme exactly.
It just feels nicer. This is subjective, but looking at a syntax-highlighted config file is meaningfully less fatiguing than reading raw plaintext. Small thing, real difference.

bat was originally created by sharkdp. It's open source, written in Rust, and actively maintained. What follows is entirely about my own experience setting it up and using it; it is not a fork or modification of the project itself.

Why I Even Cared

I spend a lot of time reading config files, checking environment variables, and scanning log output in the terminal. Most of the time I'm not editing — I'm just looking. And plain cat output makes that harder than it needs to be. No line numbers, no color, no context. For short files it's fine. For anything over twenty lines, I'd find myself opening the file in an editor just to get syntax highlighting, which felt excessive for a read-only glance.

bat solves exactly that problem. It turns "let me quickly look at this file" into something that actually works, without requiring me to leave the terminal or launch a heavier tool.

cat vs bat — same file, different reading experience.

My Actual Setup Journey

I installed bat via Homebrew — brew install bat — and ran it against a random file. It worked. Colors appeared. I thought: great, done.

Then I ran bat --list-themes to see what other themes were available, and got hit with a wall of output — every single built-in theme rendering the same sample file, one after another, scrolling endlessly through the pager. It was useful information, but overwhelming in presentation. I didn't know which theme I was even looking at half the time because the theme name scrolled past before I could read it.

That's when I learned the difference between bat's own output and the pager that wraps it. By default, bat pipes everything through less, which is great for long files but confusing when you're trying to visually compare themes. Running bat --list-themes --paging=never let me see all themes in a continuous stream, which was much easier to scan.

bat --list-themes — previewing color schemes side by side.

Next came the config file. bat supports a persistent configuration file so you don't have to pass flags every time. Finding it was its own small adventure:

$ bat --config-file
/Users/yuanh/.config/bat/config

The file didn't exist yet. I created it and opened it:

$ mkdir -p "$(dirname "$(bat --config-file)")" && touch "$(bat --config-file)"
$ open -e "$(bat --config-file)"

From there I started experimenting. I tried a few themes, toggled different style options on and off, and eventually settled on a configuration that felt right. I chose Catppuccin Mocha — it matches the palette I already use in my editor and terminal, so the colors don't clash when I switch between tools.

Editing ~/.config/bat/config.

The Confusing Parts

Two things tripped me up early on, and I think they'd trip up most people who aren't already familiar with how Unix pagers work.

Pager output vs. bat output. bat sends its colored, styled output to a pager by default. The pager controls scrolling, search, and when the output disappears. If you're used to cat, which just dumps everything and returns to the prompt, this feels different. You're suddenly in a scrollable view and you have to press q to exit. It's not wrong — it's actually better for long files — but it takes a moment to understand what's happening.

Theme previews are overwhelming. bat ships with a lot of themes. Running --list-themes shows all of them at once. There's no "pick from a menu" experience. You scroll through a long list, try to remember which names you liked, then set them in your config and restart. It works, but it's not a polished selection flow. I eventually just searched for Catppuccin by name after deciding on it separately.

My Final Configuration

Here's what I ended up with. The config file lives at the path returned by bat --config-file.

# Theme
--theme="Catppuccin Mocha"

# Display style
--style="numbers,changes,header"

# Paging behavior
--paging=auto

# Typography
--italic-text=always
--tabs=4

# Syntax mappings
--map-syntax="*.ino:C++"
--map-syntax=".ignore:Git Ignore"

And the shell aliases I added to ~/.zshrc:

# Use bat as default cat replacement
alias cat='bat --paging=never'

# Configure bat's pager
export BAT_PAGER="less -RF"

After saving, reload the shell and verify:

$ source ~/.zshrc

$ bat ~/.zshrc
  # outputs .zshrc with syntax highlighting, line numbers, and header

$ cat ~/.zshrc
  # now uses bat under the hood, but without paging

$ bat --list-themes --paging=never
  # browse all available themes in a continuous stream

$ bat /etc/hosts
  # quick system file inspection with full styling

Before / After

The difference is easier to feel than to describe. Here's a rough comparison of viewing the same file with cat versus a configured bat.

Left — cat .zshrc (plain, monochrome):

# Path exports
export PATH="/usr/local/bin:$PATH"
export EDITOR="vim"

# Aliases
alias ll='ls -la'
alias cat='bat --paging=never'

# Starship
eval "$(starship init zsh)"

Right — bat .zshrc (line numbers, syntax highlighting, Git + marker on changed line 7):

1  # Path exports
2  export PATH="/usr/local/bin:$PATH"
3  export EDITOR="vim"
4
5  # Aliases
6  alias ll='ls -la'
7+ alias cat='bat --paging=never'
8
9  # Starship
10 eval "$(starship init zsh)"

Left: plain cat. Right: bat with Catppuccin Mocha, line numbers, and Git markers.

Comparing terminal output before and after configuration.

What Changed in Daily Use

bat is not a transformative tool. It doesn't change what you can do — it changes how it feels to do the things you already do. And that matters more than I expected.

Terminal output is easier to read. I don't skim past important lines as often because syntax highlighting creates visual structure that my eyes can follow. Comments look different from code. Strings look different from keywords. That's it, but it's enough.

Config files feel less intimidating. Opening a long .env or Nginx config with bat makes it immediately parseable. The line numbers let me say "look at line 47" instead of "scroll down a bit, it's somewhere in the middle."

The friction for everyday file inspection dropped. I used to open files in VS Code just to glance at them. Now I stay in the terminal. It's a small workflow change, but it compounds across dozens of file reads per day.

It's a small tool. It improves the texture of daily development work. That's the best way I can describe it.

Credits and Links

bat was created and is maintained by sharkdp. This post is not affiliated with the project; it's a personal write-up about my own setup experience and daily usage.

Originally published at https://dearartist.xyz/blog/bat-setup.

Fixing "Repetitive and Contrived" AI Stories as a Structural Systems Problem

Wed, 15 Apr 2026 20:00:00 GMT

A user complaint that looked like a writing issue turned out to be an upstream problem in entity modeling, theme deduplication, and narrative assignment.

"What looked like a writing problem was actually a systems problem."

Metric	Value	Detail
Ready themes	23 → 9	After dedup + suppression
Org entities	13 → 10	3 spelling variants merged
Overlap pairs	13 → 3	Jaccard ≥ 0.50
Suppressed	14	5 overlap + 9 min-members

TL;DR

Summary

Sometimes the most revealing user feedback sounds subjective. In this case, the feedback was simple: "These stories all feel repetitive and contrived." The obvious reaction would have been to tweak prompts, soften the writing style, or make the output sound more natural. But after auditing the full data path in Glia's theme story system, it became clear that this was not primarily a composition problem. It was a structural systems problem. A three-phase fix (low-signal suppression → overlap dedup → fuzzy entity merge) brought ready themes from 23 down to 9, and overlap pairs from 13 down to 3.

Background: What Theme Stories Are Supposed to Do

Glia is trying to do something unusually difficult: turn a user's ongoing conversations, places, projects, relationships, and recurring reflections into a narrative interface they can revisit and make sense of.

A theme story is not just a summary. It is meant to act more like a durable narrative unit around an ongoing topic in a person's life:

A long-running project
An important relationship
A place that shapes a phase of life
A recurring value, tension, or reflective thread

When this works, the system feels like it genuinely understands what the user is living through.

When it fails, the effect is immediate:

"Why am I seeing the same thing over and over, just rewritten with a different title?"

That second failure mode is what this post is about.

The User Complaint: 8–9 Moments, but 21+ Stories

The user was based in Singapore and gave direct feedback that the generated stories felt repetitive and contrived.

For a narrative product, repetition is not just a stylistic issue. It is a trust problem. Once the system starts rewriting the same thing as multiple stories, users stop feeling understood and start feeling processed.

The QA snapshot for this user looked like this:

-- user_audit_snapshot.sql
-- User data snapshot (Singapore user)
-- ─────────────────────────────────────────
  metric                    │ value
 ───────────────────────────┼───────
  Messages                  │ 109
  Distinct threads          │ 21
  Ready theme stories       │ 23
  Active org entities       │ 13
  Duplicate org variants    │ 4 (same project)
  Overlap pairs (J ≥ 0.50)  │ 13
  Single-member themes      │ 9
 ───────────────────────────┴───────

Even before reading any generated text, the shape of the data already looked wrong.

Twenty-one threads should not naturally produce twenty-three ready themes. That meant the system was not distilling themes. It was amplifying fragments.
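For readers who want to reproduce this kind of audit, the two shape metrics in the snapshot (overlap pairs at J ≥ 0.50 and single-member themes) can be computed from theme membership alone. The sketch below shows one way to define them; the theme names and thread IDs are toy data, not the user's real data, and this is not the actual audit query.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two thread-ID sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def audit(themes: dict[str, set[str]], threshold: float = 0.50):
    """Return (overlap_pairs, single_member_themes) for a theme -> members map."""
    overlap_pairs = [
        (x, y)
        for x, y in combinations(sorted(themes), 2)
        if jaccard(themes[x], themes[y]) >= threshold
    ]
    single_member = sorted(k for k, members in themes.items() if len(members) == 1)
    return overlap_pairs, single_member


# Toy membership data: three name variants sharing most of the same threads.
THEMES = {
    "kawen": {"t1", "t2", "t3"},
    "project kaiwen": {"t1", "t2", "t3", "t4"},
    "project kai wen": {"t2", "t3"},
    "singapore": {"t5", "t6"},
    "one-off": {"t7"},
}

overlap_pairs, single_member = audit(THEMES)
print("overlap pairs:", overlap_pairs)
print("single-member themes:", single_member)
```

In the toy data, every pair among the three name variants clears the threshold. That is the same pattern the real snapshot showed at larger scale.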

Why I Did Not Change the Prompt

At first glance, it would have been easy to blame the language model: maybe the prose was too literary, maybe the prompt was too eager to find meaning, maybe the titles were over-written.

But when I traced the chain upstream, the problem was already present before composition ever ran.

If the system splits one real-world project into multiple entities, creates separate themes for each variant, and then feeds many of the same threads into each of those themes, no prompt can save the result. Even a perfectly restrained writing prompt would still produce outputs that feel repetitive and forced.

The writing layer was not causing the issue. It was faithfully exposing it.

The real failure was in theme formation.

Root Cause Analysis: Three Gaps in One Bad Chain

1. Entity Fragmentation

The user was really talking about one project, but the system had created multiple org entities for it:

# entity_variants
Project Kaiwen
Project Kaiwan
Project Kai Wen
Kawen

These were not four distinct projects. They were spacing, spelling, and prefix variants of the same one. Without canonical convergence at the entity layer, every downstream stage would continue to treat them as separate sources of meaning.

2. Theme Duplication

Once entity variants exist, theme creation can easily multiply them. If deduplication only checks exact string equality, then kawen, project kaiwen, and project kai wen all survive as separate theme candidates.

So one entity-level split quickly becomes multiple theme states.

3. Member Over-assignment

Even with some duplication upstream, a stricter membership layer could have reduced the damage. But in this case, member assignment behaved more like independent attraction than coordinated allocation. Similar themes each absorbed overlapping threads using keyword overlap, without enough cross-theme restraint.

The result:

Theme A contains those threads
Theme B also contains those threads
Theme C contains them again

By the time the user sees the output, it looks like multiple stories with different titles but almost the same underlying material.

Root Cause Chain

┌──────────────┐   ┌──────────────┐   ┌────────────────┐   ┌──────────────┐
│   Entity     │ → │    Theme     │ → │    Member      │ → │  Repetitive  │
│ Fragmentation│   │ Duplication  │   │ Over-assignment│   │   Stories    │
└──────────────┘   └──────────────┘   └────────────────┘   └──────────────┘

Designing the Fix: Stop the Bleeding, Then Close the Source

I did not want to solve this with a large architectural rewrite. The safer path was a phased, reversible repair strategy.

Phase D1: Suppress Low-Signal Themes

The quickest and lowest-risk win was to stop generating themes from only one member thread. If a theme has just one member thread, it usually does not deserve to become a standalone theme story.

# feature_flags.env
GLIA_THEME_MIN_MEMBERS=2

If a theme has fewer than two member threads, it is suppressed before composition. This immediately removes a large class of shallow, low-signal stories.

Phase D2: Deduplicate Themes by Member Overlap

Next, I added a guard against duplicate themes at the theme level. If a candidate theme's member_thread_ids overlap too heavily with an older ready theme, the newer one is suppressed.

# feature_flags.env
GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD=0.50

This is not the final form of cross-theme allocation, but it is an effective v1. It directly reduces the symptom the user actually experiences: repeated stories built from the same thread set.

Phase D3: Fuzzy Merge for Org Entities

Finally, I addressed the source of the duplication. I added conservative fuzzy matching to the org entity resolve path, specifically to catch spelling variants, spacing variants, and prefix noise.

Removing leading noise tokens like "project"
Collapsing spaces
Comparing core forms
Applying a conservative Levenshtein threshold

# feature_flags.env
GLIA_ORG_ENTITY_FUZZY_ENABLED=1

I kept this intentionally conservative and avoided broad fuzzy merge behavior for people entities.

Full Feature Flag Configuration

# glia-core/.env (theme repair flags)
GLIA_THEME_MIN_MEMBERS=2
GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD=0.50
GLIA_ORG_ENTITY_FUZZY_ENABLED=1

What Changed in Code

1. Low-Signal Theme Suppression

# theme_story_compose.py
min_members = settings.GLIA_THEME_MIN_MEMBERS

if min_members > 0 and len(member_thread_ids) < min_members:
    suppress_theme(
        reason="min_members_guard",
        detail=f"members={len(member_thread_ids)} < threshold={min_members}",
    )
    return

This was a small change, but it removed a large number of thin themes that never should have composed in the first place.

2. Jaccard-Based Theme Deduplication

# theme_dedup.py
def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

score = jaccard(
    set(current.member_thread_ids),
    set(existing.member_thread_ids),
)

if score >= settings.GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD:
    suppress_theme(
        reason="member_overlap_dedup",
        detail=f"overlap_with={existing.theme_key}, jaccard={score:.2f}",
    )

This let the system reject duplicate themes without rewriting the entire membership architecture.
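To show how the two guards interact, here is a self-contained sketch that runs them in sequence over toy themes. It makes several assumptions: themes arrive ordered oldest first, a candidate is compared only against themes that have already survived, and the Theme dataclass and apply_guards function are hypothetical names for illustration, not Glia's real types.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MEMBERS = 2            # mirrors GLIA_THEME_MIN_MEMBERS
OVERLAP_THRESHOLD = 0.50   # mirrors GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD


@dataclass
class Theme:
    theme_key: str
    member_thread_ids: set[str]
    suppressed_reason: Optional[str] = None


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def apply_guards(themes: list[Theme]) -> list[Theme]:
    """Return surviving themes; suppressed ones get a reason recorded in place."""
    kept: list[Theme] = []
    for theme in themes:  # assumed order: oldest first
        # Guard 1: too few member threads to justify a standalone story.
        if len(theme.member_thread_ids) < MIN_MEMBERS:
            theme.suppressed_reason = "min_members_guard"
            continue
        # Guard 2: membership overlaps too heavily with an older surviving theme.
        if any(
            jaccard(theme.member_thread_ids, older.member_thread_ids) >= OVERLAP_THRESHOLD
            for older in kept
        ):
            theme.suppressed_reason = "member_overlap_dedup"
            continue
        kept.append(theme)
    return kept


themes = [
    Theme("kawen", {"t1", "t2", "t3"}),
    Theme("project kaiwen", {"t1", "t2", "t3", "t4"}),   # J = 0.75 with "kawen"
    Theme("singapore", {"t3", "t5", "t6", "t7"}),        # J = 1/6 with "kawen"
    Theme("one-off", {"t8"}),
]

kept = apply_guards(themes)
print("kept:", [t.theme_key for t in kept])
print("suppressed:", {t.theme_key: t.suppressed_reason for t in themes if t.suppressed_reason})
```

In this sketch the "singapore" theme shares one thread with "kawen" and still survives. That is the behavior you want: moderate sharing between distinct angles is allowed, near-identical membership is not.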

3. Org Entity Fuzzy Merge

# org_entity_resolve.py
def _org_core_collapsed(name: str) -> str:
    parts = normalize(name).split()
    if parts and parts[0] in {"project", "the", "my"}:
        parts = parts[1:]
    return "".join(parts)

def _levenshtein(a: str, b: str) -> int:
    # standard dynamic programming implementation
    ...

def _find_fuzzy_org_candidate(
    name: str,
    candidates: list[str],
) -> str | None:
    core = _org_core_collapsed(name)
    for candidate in candidates:
        other = _org_core_collapsed(candidate)
        if len(core) >= 4 and _levenshtein(core, other) <= 2:
            return candidate
    return None

This made it possible for variants like Kaiwen, Kaiwan, Kai Wen, and Kawen to converge to a single canonical entity rather than continuing to branch.
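The snippet above elides the Levenshtein body and relies on a normalize helper that is not shown. A runnable sketch of the same idea follows, with two labeled assumptions: normalize is taken to mean lowercasing plus whitespace collapsing, and the edit distance is a textbook two-row dynamic programming version, not necessarily the one in Glia. The function names drop the leading underscore to mark them as illustrative.

```python
from typing import Optional

NOISE_PREFIXES = {"project", "the", "my"}


def normalize(name: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def org_core_collapsed(name: str) -> str:
    """Drop one leading noise token, then remove all spaces."""
    parts = normalize(name).split()
    if parts and parts[0] in NOISE_PREFIXES:
        parts = parts[1:]
    return "".join(parts)


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def find_fuzzy_org_candidate(name: str, candidates: list[str]) -> Optional[str]:
    """Return the first candidate within edit distance 2 of name's core form."""
    core = org_core_collapsed(name)
    if len(core) < 4:  # very short names are too ambiguous to fuzzy-match
        return None
    for candidate in candidates:
        if levenshtein(core, org_core_collapsed(candidate)) <= 2:
            return candidate
    return None


canonical = ["Project Kaiwen"]
for variant in ["Project Kaiwan", "Project Kai Wen", "Kawen", "Singapore"]:
    print(f"{variant!r} -> {find_fuzzy_org_candidate(variant, canonical)!r}")
```

The three spelling variants resolve to the canonical name and the unrelated name does not. One thing to watch with a fixed distance of 2: the snippet only checks the length of the incoming name, so a very short existing candidate could still match, and a stricter version might require both core forms to meet the minimum length.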

QA Validation: Did the Fix Actually Work?

The important question was not whether the code looked reasonable. It was whether the user-level outcome improved.

I ran a before/after audit on the target user in QA.

Metric	Before	After
Ready themes	23	9
Active org entities	13	10
Overlap pairs	13	3
Single-member themes	9	—
Suppressed themes	—	14 (5 overlap + 9 min-members)

# repair_user_themes.sh
# QA verify
docker compose exec -T api python scripts/repair_user_a45d_themes.py --dry-run

# deploy
git push origin main
ssh glia-qa "cd ~/glia-core && docker compose pull && docker compose up -d"

# before_after_output.txt
Before:
  Ready themes: 23
  Active org entities: 13
  Overlap pairs: 13

After:
  Ready themes: 9
  Active org entities: 10
  Overlap pairs: 3

Those remaining 3 overlap pairs were not bugs. They represented natural cross-theme sharing between truly different narrative angles: a close collaborator as a person, Singapore as a lived context, New Zealand as a retreat plan, The Future of Truth as a reflective theme. They share some threads, but they are still distinct stories.

Surviving Narrative Angles

After the repair, the surviving themes made sense as separate angles rather than duplicate rewrites:

Kawen: the core project arc
User E: personal influence on the project vision
Co-working Space: work environment and network
User T: quiet support
Restless Egg: place and internal tension
Singapore: life context and self-compassion
The Future of Truth: reflective thread
New Zealand: retreat and planning
User R: creative follow-through

That was the moment the original complaint stopped looking subjective. The data had changed shape in exactly the right way.

Results and Engineering Impact

On the surface, this looks like a simple reduction from 23 to 9. But the real improvement was not that the number got smaller. It was that the system stopped amplifying fragments and started converging on durable narrative units.

1. Better Narrative Trust

The system now behaves more like a thematic compressor and less like a duplication engine. When users see fewer, more distinct stories, the product feels like it understands their life rather than reprocessing their data.

2. Correct Diagnosis Over Superficial Patching

The most important decision in this repair was not a code trick. It was resisting the urge to patch the writing layer before understanding the upstream structure. Prompt tuning would have masked the symptom while leaving the structural problem intact.

3. Safer Rollout Through Feature Flags

I shipped the fix behind three reversible flags: GLIA_THEME_MIN_MEMBERS, GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD, and GLIA_ORG_ENTITY_FUZZY_ENABLED. That made it easy to validate in QA, tune thresholds, and roll back safely if needed.

What Still Remains Imperfect

I did not implement full thread-level cross-theme exclusivity. That would require a more global membership architecture, because the current refresh flow runs mostly per theme rather than per user with a joint allocation pass.

For this stage, that heavier redesign was unnecessary. The combination of entity merge, low-signal suppression, and overlap dedup was enough to resolve the main user-facing failure mode.

A few surviving theme bodies still contained the old "Kaiwen" spelling, but that was only stale generated text from before the entity merge. It was a surface artifact, not a structural issue, and would disappear on recompose.

How I Would Extend This in a Later Phase

The natural next step would be a global thread allocation pass: instead of each theme independently attracting member threads via keyword similarity, run a single per-user allocation step that assigns each thread to its highest-scoring theme and enforces cross-theme exclusivity.

This would require reworking the refresh flow from a per-theme loop to a per-user joint optimization. It's a meaningful architectural change, and the current phased fix deliberately stops short of it. But the data model and feature flags are already in place to support it when the time is right.
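
As a rough illustration of what such an allocation pass could look like, the sketch below assigns each thread to exactly one theme. The `allocate_threads` function, the score table, the `min_score` cut-off, and the alphabetical tie-break are all hypothetical; the post does not describe how scores would be produced or how ties should be resolved.

```python
from collections import defaultdict


def allocate_threads(
    scores: dict[str, dict[str, float]],
    min_score: float = 0.0,
) -> dict[str, list[str]]:
    """Assign each thread to the single theme where it scores highest.

    `scores` maps thread_id -> {theme_key: score}. Threads whose best
    score is below `min_score` stay unassigned.
    """
    allocation: dict[str, list[str]] = defaultdict(list)
    for thread_id, by_theme in scores.items():
        if not by_theme:
            continue
        # Sorting the keys first makes ties resolve deterministically
        # (alphabetically first theme key wins).
        best_theme = max(sorted(by_theme), key=lambda key: by_theme[key])
        if by_theme[best_theme] >= min_score:
            allocation[best_theme].append(thread_id)
    return dict(allocation)


# Invented scores for illustration only.
scores = {
    "t1": {"singapore": 0.8, "new-zealand": 0.3},
    "t2": {"singapore": 0.4, "new-zealand": 0.7},
    "t3": {"singapore": 0.5, "new-zealand": 0.5},  # tie
    "t4": {"singapore": 0.1},                      # below the cut-off
}
allocation = allocate_threads(scores, min_score=0.2)
```

Because every thread ends up in at most one theme, member overlap between themes is zero by construction. That is also the cost: the legitimate cross-theme sharing described earlier would no longer be representable without a softer rule.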

Diagnostic Method: When Users Say AI Output Feels Fake

One lesson from this repair is that when users say AI output feels fake, repetitive, or contrived, the problem is often not where the prose is written.

Before changing prompts, it is worth asking:

Did the system split one real-world object into multiple identities?
Did it model similar themes more than once?
Did it assign the same source material into too many narrative units?

If the answer to any of those is yes, then what looks like a content problem is probably a systems problem.

That is the kind of engineering problem I find most interesting: taking a vague human complaint and turning it into a concrete, testable, structural diagnosis.

Engineering Takeaways

When generated output looks fake, audit upstream data structures before touching the prompt.
Entity fragmentation is a class of system degradation that is easy to overlook. Each spelling variant becomes a separate entity, and everything built downstream of that entity is duplicated along with it.
Jaccard overlap is a simple but effective dedup signal. Good enough as a pragmatic v1 before building a global allocation system.
Feature flags are not just for gradual rollouts. They are an expression of engineering judgment, letting you validate layer by layer and roll back with confidence.

Originally published at https://dearartist.xyz/blog/glia-theme-debugging.

When "No New Moments" Wasn't a Simple Bug

Wed, 15 Apr 2026 16:00:00 GMT

Debugging thread routing, gate logic, and LLM failure modes in Glia.

Metric	Value	Detail
Messages → Thread	24 → 1	Over-merge on 4/14
Ready rate recovery	37.5% → 72.7%	After replay
Baseline	76%	Prior steady state
Root causes	3	Identified & fixed

TL;DR

Summary

A user reported barely receiving new moments despite uploading many photos. After tracing through QA data, backend routing, gate prompts, and compose fallbacks, the real picture was more nuanced: the main issue was thread over-aggregation (24 messages merged into 1 thread), a secondary issue was an over-conservative story gate, and a diagnostic blind spot in compose validation made failures harder to debug. After targeted fixes and historical replay, the effective ready rate recovered from 37.5% to 72.7%, close to the 76% baseline.

Background

The original product feedback was simple:

"I've uploaded quite a few photos recently, and they should be linkable to moments. But I haven't really been getting many moments in the past few days. I'm wondering whether other users are seeing the same issue too."

This was a good reminder that user-facing symptoms often compress several different failure modes into one sentence.

The temptation with LLM-backed systems is to jump straight to the most obvious theory: maybe image uploads are not triggering generation, maybe photos are not being converted into evidence, maybe media-only inputs are ignored, maybe the moment generation pipeline silently broke.

But the correct first move was not "fixing." It was instrumented narrowing.

The first hypothesis was wrong

One early idea was that the user's recent activity might have been mostly photos with little or no text, and that media-only inputs were simply less likely to produce moments.

That hypothesis did not survive data validation.

After checking QA data directly, it became clear that her recent messages were not empty media posts. Most uploads were accompanied by text captions. So the issue was not "photo-only content can't generate moments."

That mattered because it changed the entire debugging direction. Instead of asking why photo-only posts were not producing moments, the question became:

"Why is meaningful recent activity ending up with so few visible moments?"

The data pattern that changed everything

Looking at two adjacent windows of activity made the shape of the problem visible.

-- user_e_thread_comparison.sql
-- User E's thread / output comparison
-- ─────────────────────────────────────────────────
  window       │ messages │ threads │ ready │ rate
 ──────────────┼──────────┼─────────┼───────┼──────
  4/03 – 4/08  │  ~40     │  8      │  6    │ 76%
  4/10 – 4/14  │  ~38     │  8      │  3    │ 37.5%
 ──────────────┴──────────┴─────────┴───────┴──────

The first thing that stood out was not just the drop in ready rate. It was the collapse in thread count.

   24 messages   →   1 thread

On 4/14, the user sent 24 messages, but they all landed in one thread. Historically, their activity looked more like many short threads, multiple distinct topics, several chances to generate moments. On 4/14, it looked like one long continuous session with many topic shifts but only one story thread — therefore at most one ready moment.

That changed the priority stack immediately.

Root Cause #1: Session-Based Thread Aggregation Was Over-Merging

At the top of the stack was a design flaw in the thread router. The router had a session-aware aggregation path designed for diary-like usage: if messages arrived within a short time window, they were assumed to belong to the same thought stream.

In principle, that sounds reasonable. In practice, it was too blunt.

The problem

Messages were being merged into the most recent thread based almost entirely on time proximity:

# story_thread_router.py
session_enabled = os.getenv(
    "GLIA_STORY_THREAD_SESSION_ENABLED", "0"
).strip().lower() in {"1", "true", "yes", "on"}

session_window_min = max(
    1,
    _env_int("GLIA_STORY_THREAD_SESSION_WINDOW_MINUTES", 10)
)

if session_enabled and candidates and not debug.get("low_signal"):
    most_recent = min(
        candidates,
        key=lambda c: (now - c.last_msg_at).total_seconds(),
    )
    gap_min = max(
        0.0,
        (now - most_recent.last_msg_at).total_seconds() / 60.0
    )
    if gap_min <= float(session_window_min):
        return most_recent.story_thread_id

The key issue was not that session aggregation existed. It was that it bypassed semantic boundary checks. That meant: no robust topic continuity check, no hard cap on thread growth, no protection against "continuous chat about multiple unrelated things."

In User E's 4/14 case, that produced exactly the wrong behavior:

playlist
outfit
dinner party prep
cleaning
running
reflections about people

All of that got swallowed into one story thread because each message arrived within the session window.

Why this mattered so much

Downstream, story compose was effectively bounded by thread granularity. In practice:

1 thread → 1 daily compose → ≤ 1 ready moment

So if routing crushed four meaningful clusters into one thread, every downstream component was already operating on a degraded input structure. This is why I considered router over-merge the true upstream root cause.

Root Cause #2: The Story Gate Was Too Conservative

The second issue sat in the story gate. Even when a thread was coherent and meaningful, the gate sometimes suppressed it as too lightweight.

This was especially visible for content like:

preparing for a dinner party
discussing outfits
reacting to a friend's playlist or gesture
small but real emotional movement in social life

These are not dramatic events. But they are exactly the kind of material that memory-native products should preserve.

Two examples

A 3-message thread about playlist / outfit / social anticipation had been suppressed with a rationale along the lines of: "brief reaction / acknowledgment; low narrative signal."

A 6-message thread around dinner party prep was suppressed as: "short positive update without deeper narrative or new life event."

From a narrow "major event detection" lens, that might sound reasonable. From a product lens, it was too strict.

The prompt was partially to blame

The gate prompt contained a conservative instruction:

# gate_prompt.txt
If uncertain between none and story, choose none.

This sounds safe, but in practice it biases the model against the entire middle band of content: not trivial, not dramatic, but still personally meaningful.

The few-shot examples also leaned too hard toward extremes. There were not enough examples of everyday but meaningful social content. So the gate was not "broken" in the usual sense. It was miscalibrated.

Root Cause #3: Compose Failure Was Harder to Diagnose Than It Should Have Been

The third issue was smaller in impact, but important operationally. A high-value thread about self-discovery and relationship reflection failed during compose. The failure surfaced as:

# compose_logs
story_thread_compose_contract_invalid:unknown

That "unknown" was a problem in itself. It made failures harder to cluster and debug. After inspecting the compose path, I found a bad observability pattern: a redundant try/except Exception: pass was swallowing signal.

Before

# compose_validator.py — before
try:
    from app.validators.contracts import _validate_story_blocks as _vsb
    ok, reason = _vsb(payload)
except Exception:
    pass

After

# compose_validator.py — after
ok, reason = _validate_story_blocks(payload)

if sub_reason == "unknown":
    logger.warning(
        "compose_contract_sub_reason_unknown",
        extra={
            "story_thread_id": story_thread_id,
            "request_id": request_id,
            "payload_keys": list(payload.keys()) if isinstance(payload, dict) else [],
            "has_blocks": isinstance(payload, dict) and "blocks" in payload,
        },
    )

This did not magically fix compose quality. But it fixed something equally important: the ability to understand why compose failed. In LLM systems, sometimes you should improve generation. Sometimes you just need to stop losing the error signal.
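
The general pattern can be sketched in isolation. In the example below, the validator rules, the reason strings, and the `check_contract` wrapper are hypothetical and are not Glia's actual contract. The point is only that every failure path returns or logs something specific, including the case where the validator itself raises.

```python
import logging
from typing import Callable

logger = logging.getLogger("compose_validator_sketch")


def validate_story_blocks(payload: object) -> tuple[bool, str]:
    """Hypothetical validator that names a specific reason on every failure."""
    if not isinstance(payload, dict):
        return False, "payload_not_dict"
    blocks = payload.get("blocks")
    if not isinstance(blocks, list) or not blocks:
        return False, "blocks_missing_or_empty"
    if any(not isinstance(b, dict) or not b.get("text") for b in blocks):
        return False, "block_missing_text"
    return True, "ok"


def check_contract(
    payload: object,
    story_thread_id: str,
    validator: Callable[[object], tuple[bool, str]] = validate_story_blocks,
) -> tuple[bool, str]:
    try:
        ok, reason = validator(payload)
    except Exception as exc:
        # Keep the exception type and traceback instead of discarding
        # them with `pass`.
        logger.exception(
            "compose_contract_validator_crashed",
            extra={"story_thread_id": story_thread_id},
        )
        return False, f"validator_error:{type(exc).__name__}"
    if not ok:
        logger.warning(
            "compose_contract_invalid",
            extra={"story_thread_id": story_thread_id, "sub_reason": reason},
        )
    return ok, reason


def broken_validator(payload: object) -> tuple[bool, str]:
    raise KeyError("blocks")


empty_result = check_contract({"blocks": []}, "thread-1")
valid_result = check_contract({"blocks": [{"text": "hi"}]}, "thread-1")
crash_result = check_contract({}, "thread-1", validator=broken_validator)
```

With this shape, an "unknown" reason can only appear if someone writes it deliberately, which makes it a useful alarm rather than a default.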

Root Cause Summary

Priority	Title	Description
P0	Over-merge in session routing	Session-aware aggregation merged semantically unrelated messages into a single thread based purely on time proximity.
P1	Over-conservative gate prompt	Gate suppressed everyday social and life-planning content as 'too lightweight' due to conservative uncertainty policy.
P2	Poor failure attribution	Compose validation failures surfaced as `sub_reason=unknown`, with the diagnostic signal swallowed by a blanket `except Exception: pass`.
—	Independent provider issue	Anthropic API returning HTTP 400 in QA, forcing fallback to OpenAI with different quality characteristics.

Fix strategy: solve the upstream quantity problem first

Once the root causes were ranked, the repair sequence became straightforward.

User feedback → QA validation → Code audit → Router fix (P0) → Gate fix (P1) → Replay → Provider split
   Symptom       Data audit     Root causes                                       Verification    Separate track

P0 — Fix thread over-merge first

I intentionally chose the smallest effective change rather than redesigning the entire routing system.

Guard A: Cap session thread growth.

# story_thread_router.py — guardrail
session_max_user_messages = _env_int(
    "GLIA_STORY_THREAD_SESSION_MAX_USER_MESSAGES",
    8,
)

if session_enabled and candidates and gap_min <= float(session_window_min):
    current_user_msgs = _count_user_messages(db, most_recent.story_thread_id)
    if (
        session_max_user_messages > 0
        and current_user_msgs >= session_max_user_messages
    ):
        debug["session_blocked"] = "max_user_messages"
    else:
        return most_recent.story_thread_id

This avoids infinite growth in a single active session. Not a fancy model — a practical guardrail.
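
A toy simulation shows the effect of the cap on a burst like the 4/14 one. Everything here is a simplified assumption: the `Thread` dataclass, the `route_messages` function, and the synthetic timestamps are invented, and a blocked message simply opens a new thread, whereas the real router presumably falls through to its remaining routing logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Thread:
    thread_id: int
    last_msg_at: datetime
    user_messages: int = 1


def route_messages(
    timestamps: list[datetime],
    window_min: int = 10,
    max_user_messages: int = 8,
) -> list[Thread]:
    """Join the most recent thread if the message is inside the session
    window and the thread is under the cap; otherwise start a new thread."""
    threads: list[Thread] = []
    for now in timestamps:
        target = None
        if threads:
            most_recent = max(threads, key=lambda t: t.last_msg_at)
            gap_min = max(
                0.0, (now - most_recent.last_msg_at).total_seconds() / 60.0
            )
            # A cap of 0 or less disables the guard, as in the snippet above.
            under_cap = (
                max_user_messages <= 0
                or most_recent.user_messages < max_user_messages
            )
            if gap_min <= float(window_min) and under_cap:
                target = most_recent
        if target is None:
            threads.append(Thread(thread_id=len(threads) + 1, last_msg_at=now))
        else:
            target.user_messages += 1
            target.last_msg_at = now
    return threads


# Synthetic burst: 24 messages, 3 minutes apart.
start = datetime(2026, 4, 14, 9, 0)
burst = [start + timedelta(minutes=3 * i) for i in range(24)]

uncapped = route_messages(burst, max_user_messages=0)
capped = route_messages(burst, max_user_messages=8)
```

Without the cap, the whole burst lands in one thread; with a cap of 8 it splits into three threads of eight. The split is by count, not by topic, so a thread boundary can still fall in the middle of a subject.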

Guard B: Topic-shift protection. I also added a topic-shift gate based on lexical overlap and Jaccard similarity. The initial version was directionally right but too sensitive in replay — it fragmented sessions too aggressively. So I rolled its default back to 0 and kept the cap-based protection as the primary deployed fix.

Not every theoretically good guard belongs in the default config immediately.
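
For illustration, a lexical topic-shift check along these lines might look like the sketch below. The tokenizer, stopword list, `is_topic_shift` function, sample messages, and the 0.1 threshold are all assumptions; the post does not show the actual gate. Treating a threshold of 0 as "disabled" is likewise a guess at what a default of 0 means.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "to", "of", "for", "i", "my", "is", "it", "in", "on"}


def tokens(text: str) -> set[str]:
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_topic_shift(thread_messages: list[str], incoming: str, min_overlap: float) -> bool:
    """Flag a shift when the incoming message shares too little vocabulary
    with the thread so far. A min_overlap of 0 or less disables the check."""
    if min_overlap <= 0:
        return False
    thread_vocab: set[str] = set()
    for message in thread_messages:
        thread_vocab |= tokens(message)
    return jaccard(thread_vocab, tokens(incoming)) < min_overlap


thread = [
    "Making a playlist for the dinner party",
    "Still picking songs for the playlist",
]
shift_unrelated = is_topic_shift(thread, "Went running this morning", 0.1)
shift_same_words = is_topic_shift(thread, "Added two more songs to the playlist", 0.1)
shift_same_topic_new_words = is_topic_shift(thread, "Finally queued up the last few tracks", 0.1)
```

The last case is one plausible reason a purely lexical gate fragments sessions: a message that continues the same topic in different words has zero token overlap and is flagged as a shift, exactly like the unrelated message about running.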

P1 — Relax the gate for meaningful everyday content

The second repair was prompt-level. I updated the gate prompt in three ways: new few-shot examples (dinner prep, playlist/friend interaction, outfit/social anticipation), a weaker uncertainty policy, and a clearer instruction:

# gate_prompt.txt — updated
Everyday social moments that reflect a real experience,
relationship dynamic, preparation, anticipation, or emotional reaction
can still be worth preserving as a story.

If uncertain, lean none —
unless the thread clearly reflects a real lived moment.

This was a better first move than changing scoring formulas, because it directly addressed the model's decision boundary without broadening the whole system indiscriminately.

P2 — Improve validation observability

Finally, I cleaned up contract-failure attribution: I removed the swallowed signal, logged unresolved attribution properly, and made future compose failures easier to inspect.

Verifying the fixes

One of the most important parts of this case was resisting the urge to declare victory based only on code changes.

At first, the fixes were deployed, but the user had not sent any new messages yet. That meant no new router runs, no new gate decisions, and no real-world verification. So I took a safer middle path: do not rewrite historical thread assignments, but replay specific suppressed threads that were safe to re-run.

# replay_results.log
replay job: 3 historical threads
──────────────────────────────────────────────────────
 thread                        │ previous   │ result
──────────────────────────────┼────────────┼─────────
 self-discovery + reflection   │ compose ✗  │ ready ✓
 playlist / outfit / social    │ gate: none │ ready ✓
 dinner party prep             │ gate: none │ ready ✓
──────────────────────────────┴────────────┴─────────

The two threads previously suppressed by the gate (playlist / outfit / social and dinner party prep) were the clearest signal. They showed that the gate prompt fix was not theoretical: it changed actual outcomes.

   37.5%   →   72.7%   |   76%
   before      after        baseline
                replay

That improvement includes targeted replay of historical suppressed threads. So it is not identical to "natural live traffic recovered by itself." But it is a strong validation that the main logic was repaired correctly.

A separate problem emerged: provider reliability

One thread remained suppressed even after replay. During replay, QA Anthropic requests were returning HTTP 400 consistently, which forced all calls to fail over to OpenAI. That introduced a separate class of failures: banned phrase checks, title grounding failure, provider-specific output quality issues.

This mattered, but it did not invalidate the earlier fixes. It simply meant the debugging work had split into two different tracks:

Track A — Fixed	Track B — Follow-up
Router over-merge	Anthropic API 400 in QA
Gate over-suppression	OpenAI fallback compose quality
Validation observability gap	Finer attribution between validation layers

That separation is useful. Otherwise engineering work turns into one giant undifferentiated bucket.

Lessons learned

1. Session continuity is not the same thing as topic continuity

Just because messages arrive within ten minutes of each other does not mean they belong to the same semantic unit. That assumption works for some journaling behavior, but not for all conversational behavior.

2. Fix upstream granularity problems before downstream quality problems

If routing collapses four opportunities into one, then gate and compose are already operating at a disadvantage. This is why "fix the router first" was the right call.

3. Conservative gates can quietly erase the product's real value

The easiest thing for a classifier or gate to do is say "no." But products like this are not supposed to preserve only dramatic life events. A lot of value lives in smaller but emotionally real moments.

4. Observability is part of product quality

An "unknown" failure reason is not just an ops annoyance. It slows down every future iteration. In LLM systems, diagnosability is not optional.

5. Not every discovered issue belongs in the same fix batch

The Anthropic 400 problem was real, but it was not the same as the original root cause. Separating "fixed main issue" from "new independent issue" kept the work focused.

Closing thoughts

What I liked about this case is that it reinforced a pattern I trust more and more in product engineering:

The visible symptom is often not the real unit of failure.

"No new moments" sounded like a generation problem. In reality, it was a combination of routing granularity, product calibration, and observability quality.

And once those were disentangled, the fixes became smaller, clearer, and much more effective.

That is usually a good sign you are finally solving the right problem.

Originally published at https://dearartist.xyz/blog/glia-thread-debugging.

Closing the Invite Flow Gaps in Glia's Social Feature

Wed, 15 Apr 2026 11:30:00 GMT

Engineering Case Study · 2026-04-15 11:30 · 14 min read · Engineering, Product, iOS, Backend, Systems Thinking, Rollout

An end-to-end audit of invite creation, onboarding handoff, state sync, push routing, and rollout readiness.

TL;DR

Summary

I audited the full invite flow of Glia's social feature across backend and iOS. The backend capabilities mostly existed, but the client had several critical flow breaks. I fixed onboarding handoff, connection state refresh, dead-end push routing, feature flag gating, decline feedback, and funnel logging. After these changes, the main invite path became closed-loop and suitable for gradual rollout.

Context

Glia is a memory-native product. One of its social surfaces lets a user tap Invite to Connect from a people entity card and send an invite link; once another user accepts the invite, memory sharing around that person is unlocked.

This work was not about building a brand new feature from scratch. It was about answering a more important question:

Is this feature actually rollout-ready, or does it only look complete when reading the code?

That distinction matters. In social systems, a feature can "exist" in the sense that endpoints are implemented, views render, and the happy path works in isolation. But users do not experience isolated endpoints. They experience a chain of transitions: from one screen to another, from one app state to another, and from one expectation to another.

I audited both glia-core backend and glia-ios client. The goal was to determine whether the flow was actually complete enough to support gradual rollout.

Why this audit mattered

The problem was not that the feature did not exist. The problem was that the user could still fall out of the flow.

That is the kind of issue that is easy to underestimate when looking at a codebase. A team can see invite creation working, see a landing page render, see an accept endpoint return success, and conclude that the feature is basically done. But from a product perspective, that is not enough.

A social invite flow is only real if it stays intact across the edges:

a new user installing from an invite link
a returning user re-opening a people profile after acceptance
a push notification that actually lands somewhere useful
a disabled feature that stays hidden instead of surfacing as a broken interaction

This was a good example of why endpoint completeness is not the same thing as product completeness.

The end-to-end flow I audited

The audit started from the exact user action that matters most: tapping Invite to Connect on a people entity card.

People Card  →  Create Invite  →  Share Link  →  Install / Open  →  Accept Invite  →  Social Feed
(Tap invite)    (Backend API)    (External)     (Onboarding)        (State trans.)    (Connection live)

At a glance, most of these parts already existed. The backend could create invites. The accept endpoint existed. Push jobs were present. The feed path existed. But when I reconstructed the full chain, several important gaps became obvious.

What I found

The audit identified three true P1 breaks and several P2 issues.

Finding	Severity	Description
Onboarding handoff broken	P1 / high	New users from invite links lost the pending invite after completing onboarding. They had to manually reopen the link.
Stale connection state	P1 / high	People card showed "Invite to Connect" even when the connection was already active. Misled the inviter.
Dead-end push routing	P1 / high	`new_shared_card` notifications opened an empty static page with no content and no useful CTA.

The three P1 breaks

1. Pending invite was dropped after onboarding for new users

A new user could install the app from an invite link, complete onboarding, and still lose the pending invite flow. The token had been captured, but onboarding completion did not consume it. That meant the user had to reopen the invite link manually.

This is exactly the kind of bug that makes a feature look healthy in code and broken in reality.

2. The connection state on the people card did not refresh

A connection could already be active, but the people profile still showed Invite to Connect. This was not just stale UI. It actively misled the inviter into thinking the invite had not worked.

This was not a backend failure. It was a state synchronization failure on the client.

3. new_shared_card push notifications led to a dead end

Tapping the notification opened an empty static detail page with no content and no useful CTA. The same dead-end behavior also existed in Notification Center.

This was a classic example of a route existing without being a good destination.

The P2 issues

Finding	Severity	Description
Feature flag not respected	P2 / medium	Social invite button still rendered when the backend feature flag was disabled.
No decline feedback	P2 / medium	Decline action had no success or error feedback. Users could not tell if their action worked.
Missing env documentation	P2 / medium	`.env.example` was missing social-related environment variables for safe deployment.
No funnel analytics	P2 / medium	There was no basic invite funnel logging, making it impossible to judge conversion during rollout.

These were not the highest-severity breaks, but they mattered for rollout quality.

How I fixed them

I tried to keep the fixes narrow and explicit instead of redesigning the surface.

Fix 1: Consume pending invite at onboarding exits

On iOS, the pending invite token now gets consumed at onboarding completion paths, including the normal completion flow and the skip/test path.

// OnboardingCompleteView.swift
await MainActor.run {
    SocialDeepLinkHandler.shared.consumePendingIfNeeded()
    router.clear()
    router.navigateTo(.chatHome)
}

The important point was not the exact screen name. It was the guarantee that onboarding could no longer swallow the pending invite state.

Fix 2: Refresh connection state on people profile appear

PeopleProfileDetailView now refreshes social connections on appear, so the UI reflects the real connection state.

// PeopleProfileDetailView.swift
.onAppear {
    loadProfile()
    loadTimeline()
    Task { await socialService.refreshConnections() }
}

This was intentionally small. I did not redesign the people card or move the logic into a more elaborate state machine. The goal was simply to ensure that the button reflects the actual backend state instead of stale cached data.

Fix 3: Gate social entry by backend-driven availability

The client now derives social feature availability from actual backend responses, instead of relying on build-time assumptions.

// SocialService.swift
@Published var isSocialFeatureAvailable: Bool = true

func refreshConnections() async {
    do {
        let response = try await getConnections()
        cachedConnections = response.connections
        isSocialFeatureAvailable = true
    } catch {
        if Self.isFeatureDisabledError(error) {
            isSocialFeatureAvailable = false
            cachedConnections = []
        }
    }
}

When the backend returns feature disabled, the invite button is hidden rather than shown-and-fail.

Fix 4: Route push notifications to socialFeed

Instead of opening an empty static page, new_shared_card now routes to socialFeed and triggers a feed refresh.

// NotificationRouter.swift
case "new_shared_card":
    NotificationCenter.default.post(
        name: .socialFeedNeedsRefresh, object: nil
    )
    Router.shared.navigateTo(.socialFeed)

Both push handling and Notification Center taps converge on the same route. That route already had real content-loading behavior.

Fix 5: Decline feedback

Decline now exposes loading and error state, and shows an alert on failure. Users should not have to guess whether their action actually happened.

Fix 6: Funnel logging

I added lightweight structured event logging consistent with the project's existing pattern.

// SocialAnalytics.swift
print("[event] social_invite_tapped entity_id=\(entityId)")
print("[event] social_invite_created entity_id=\(entityId) invite_id=\(response.inviteId)")
print("[event] social_invite_accepted")
print("[event] social_invite_declined")

These are intentionally basic. They are not meant to replace a full analytics pipeline. They are just enough to make the rollout observable.
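
Because the events are plain `[event] name key=value` lines, a few lines of scripting are enough to turn captured logs into funnel counts. The sketch below is one hypothetical way to do that in Python; the `parse_event` and `funnel_counts` helpers and the sample log lines are invented, and only the event names come from the snippet above.

```python
import re
from collections import Counter
from typing import Optional

EVENT_RE = re.compile(r"\[event\]\s+(?P<name>\w+)(?P<rest>.*)")
FUNNEL_STEPS = [
    "social_invite_tapped",
    "social_invite_created",
    "social_invite_accepted",
    "social_invite_declined",
]


def parse_event(line: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (event_name, fields) for an [event] line, or None otherwise."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    pairs = (p.split("=", 1) for p in match.group("rest").split() if "=" in p)
    return match.group("name"), {key: value for key, value in pairs}


def funnel_counts(lines: list[str]) -> dict[str, int]:
    counts: Counter = Counter()
    for line in lines:
        parsed = parse_event(line)
        if parsed is not None and parsed[0] in FUNNEL_STEPS:
            counts[parsed[0]] += 1
    return {step: counts[step] for step in FUNNEL_STEPS}


# Synthetic log lines for illustration.
sample_lines = [
    "[event] social_invite_tapped entity_id=e1",
    "[event] social_invite_created entity_id=e1 invite_id=i1",
    "[event] social_invite_tapped entity_id=e2",
    "unrelated log line",
    "[event] social_invite_accepted",
]
counts = funnel_counts(sample_lines)
```

A gap between tapped and created counts points at invite creation failures, while a gap between created and accepted points at the share, install, or onboarding steps.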

Fix 7: Backend env documentation

The backend .env.example was updated with social-related environment variables and rollout notes.

# .env.example
# Social Connections
GLIA_SOCIAL_ENABLED=0
GLIA_SOCIAL_INVITE_EXPIRY_DAYS=30
GLIA_SOCIAL_INVITE_RATE_LIMIT=20
GLIA_SOCIAL_PUSH_DAILY_LIMIT=3

That did not change runtime behavior directly, but it reduced rollout risk by making the deployment surface explicit.
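
On the backend side, variables like these are typically read once into a typed settings object. The following is a sketch of that idea, not the glia-core implementation: `SocialSettings`, `read_int`, and `load_social_settings` are hypothetical names, the truthy-string set mirrors the router snippet from the thread-routing post, and the fallback values simply repeat the ones in `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Optional

TRUTHY = {"1", "true", "yes", "on"}


def read_int(env: dict[str, str], key: str, default: int) -> int:
    # Fall back to the default when the variable is missing or malformed.
    try:
        return int(env.get(key, "").strip())
    except ValueError:
        return default


@dataclass(frozen=True)
class SocialSettings:
    enabled: bool
    invite_expiry_days: int
    invite_rate_limit: int
    push_daily_limit: int


def load_social_settings(env: Optional[dict[str, str]] = None) -> SocialSettings:
    source = dict(os.environ) if env is None else env
    return SocialSettings(
        enabled=source.get("GLIA_SOCIAL_ENABLED", "0").strip().lower() in TRUTHY,
        invite_expiry_days=read_int(source, "GLIA_SOCIAL_INVITE_EXPIRY_DAYS", 30),
        invite_rate_limit=read_int(source, "GLIA_SOCIAL_INVITE_RATE_LIMIT", 20),
        push_daily_limit=read_int(source, "GLIA_SOCIAL_PUSH_DAILY_LIMIT", 3),
    )


settings = load_social_settings(
    {"GLIA_SOCIAL_ENABLED": "true", "GLIA_SOCIAL_PUSH_DAILY_LIMIT": "oops"}
)
defaults = load_social_settings({})
```

Defaulting `enabled` to off when the variable is absent keeps an unconfigured environment from exposing the feature by accident.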

Why most of the fixes were on iOS, not backend

This is an important distinction.

The reason most fixes landed on iOS was not that the backend was ignored. The backend was part of the audit from the beginning. In fact, one of the useful outcomes of the audit was that it clarified what the backend already had:

invite creation
entity share state transitions
connection acceptance
push jobs
social feed read path

Those core capabilities mostly existed.

The bigger gaps were in how the client consumed and surfaced them:

onboarding was not preserving the flow
people profile state was not staying in sync
feature gating was not reflected in entry visibility
notifications did not route somewhere useful

The backend had the capabilities, but the client was still failing to turn them into a coherent user experience.

That is why the right fix was not "rewrite the backend." It was to close the gaps where user trust and flow continuity were actually breaking.

What I intentionally did not do

There are a few things I explicitly avoided.

I did not redesign the existing UI. The issue was not that the visual design was wrong.
I did not introduce a new backend capability endpoint just to represent feature availability.
I did not overcomplicate the rollout path. The fixes were intentionally narrow.

This work was less about building a feature, and more about making an existing feature trustworthy.

Before vs After

Before the audit, the feature looked mostly complete on paper. But from a user perspective, the flow still had major holes.

✗ Before	✓ After
Onboarding could swallow a pending invite	Onboarding exits now continue the pending invite flow
People profile could show the wrong connection state	People profile refreshes and reflects actual connection state
Push and Notification Center could route to an empty destination	Push and Notification Center route to `socialFeed`
A disabled feature could still expose a broken button	Feature-disabled state hides the social entry
Decline had no feedback	Decline provides loading and error feedback
Rollout had almost no funnel visibility	Basic invite funnel is observable

The main invite path became a real closed loop: create invite → share → accept → see feed.

What made the feature ready for gradual rollout

Before these changes, I would not have considered the feature rollout-ready. The problem was not just polish. There were genuine P1 breaks in the primary path.

Dimension	What changed
Entry gating	Client respects backend feature availability instead of exposing a button that fails when tapped.
State correctness	People card converges toward the real connection state instead of misleading the inviter.
Dead-end removal	Push and Notification Center now route to a useful destination.
Flow continuity	New-user onboarding no longer drops the invite chain.
Observability	Enough funnel logging to see where conversion drops during rollout.
Deployment safety	Social env variables documented in `.env.example` for safer rollout.

Rollout readiness depends on closed loops, not isolated endpoints.

Remaining known issues

A few issues remain, but they do not block gradual rollout.

The incoming direction semantics inside connectionForEntity are still imperfect. In certain two-sided connection scenarios, the button state may not fully reflect the relationship the way it should.

There is also a short visibility window on first load before the client learns that the backend social feature is disabled. I accepted that tradeoff for now because it keeps the default experience more natural while still preventing the broken tap-to-error behavior.

These are follow-up issues, not reasons to block rollout.

Final takeaways

This audit ended up reinforcing a lesson I keep seeing in product engineering:

Social systems fail at edges, not just in the happy path.

A feature can look complete in code while still being untrustworthy in practice. The gaps are often not dramatic failures in core business logic. They are routing gaps, stale state, dead-end destinations, missing entry gating, and silent feedback failures.

Those are exactly the kinds of issues that damage trust because they make the product feel inconsistent, even when much of the system technically works.

The most valuable work here was not adding more surface area. It was closing the distance between system capability and user-complete flow.

The job was not to make the feature bigger. It was to make the existing feature trustworthy.

Originally published at: https://dearartist.xyz/blog/glia-invite-flow-audit


From Audit to Deployment: Fixing Glia's Social Invite Flow End-to-End

Tue, 14 Apr 2026 18:24:00 GMT

Engineering Case Study · 2026-04-14 18:24 · 10 min read · Engineering, Product, Backend, iOS, QA, Deployment, Systems Thinking

A real product-engineering note on auditing a user-facing invite flow, tightening the state machine, aligning backend behavior, adding dynamic OG images, deploying to QA, and verifying the result.

Key Takeaways

The hardest part was not writing code. It was defining scope correctly.

A user-facing flow is not "done" if backend, client, deployment, and verification are not aligned.

Feature flags, state transitions, and preview surfaces are easy places for product truth and technical behavior to drift apart.

Deployment and smoke verification are part of the work, not an afterthought.

Documentation matters because otherwise the next person has to reverse-engineer intent from code.

Context

I recently spent time auditing Glia's social invite flow around people/entity cards.

At first glance, the problem looked narrow: when a user taps invite from a people card, does the flow actually work?

But once I started tracing it properly, it became clear that this was not really a question about one button or one API endpoint. It was a question about whether the whole user-facing chain actually closed: the entry point, the backend behavior, the state transitions, the landing page, the preview surface, the deployment setup, the runtime environment, and the final verification.

That distinction matters. A lot of systems look complete when you inspect them file by file. Far fewer are actually complete when you follow the path a real user would take.

What Glia social sharing actually is

Glia's social sharing is not really a generic public share feature.

It is more specific than that. A user has memories inside the system, and some of those memories get organized around people entities. From there, the product can open a share/invite flow tied to a specific person. In other words, the user is not just sharing a link. They are initiating a relationship-aware flow around memories connected to a particular person.

That matters because the object being shared is more sensitive than a normal URL. It has identity, context, preview behavior, landing behavior, accept/decline semantics, and downstream effects on connection state, feed surfaces, and notifications.

People Card  →  Invite Creation  →  Share / Landing  →  Accept / Decline  →  Connection
(Entry point)   (Backend API)       (Preview + OG)      (State transition)    (Feed + Notify)

So when I looked at this system, I was not asking whether a link could technically be generated. I was asking whether the whole product truth of that flow actually held together.

Why I looked into this flow

I was not interested in doing a generic code review.

The real question was whether the people-card invite experience was actually closed end to end. If a user starts from a people card, taps invite, shares something outward, and another person receives it, does the system really support that whole sequence in a coherent way?

That means checking more than backend correctness. It means checking whether the iOS entry point, backend semantics, preview behavior, landing page, connection state, and deployment reality all describe the same product.

That is the kind of work I care about most. Not just "does the code exist," but "does the system tell the truth."

What I was actually auditing

The object of the audit was not social sharing in a broad sense.

It was a narrower and more specific flow:

A user has a people entity card. They trigger an invite-related action from that card. That action eventually creates a share/invite flow tied to that person. The invite can then be opened, previewed, accepted, declined, and turned into a connection with downstream effects on notifications and social surfaces.

So the real question was: What exactly happens after the user taps invite on a people card, and is that experience actually closed end to end?

That required looking across multiple layers:

backend invite creation and validation
invite landing page behavior
feature flag behavior
accept/decline state transitions
preview content and metadata
deployment assumptions
QA runtime behavior
documentation quality

What the audit quickly revealed

What I found was not one catastrophic bug. It was something more common in real product systems: the flow mostly existed, but several important layers were not fully aligned.

The backend had the right general shape, but there were places where runtime behavior and product semantics could drift apart. The invite acceptance path needed a stronger transition model. The preview layer turned out to matter more than it first looked. And some pieces of rollout truth — especially around deployment and environment behavior — were not things I wanted to leave implicit.

That changed the nature of the work. This was not about adding a feature from scratch. It was about making an existing flow honest, stable, and actually deployable.

The biggest risks I found

The most important problems were not cosmetic. They lived in the places that usually create the biggest mismatch between product expectations and real behavior.

Finding	Severity	Description
Feature flag drift	Medium	Module-level constants vs runtime reads. Same flag, different behavior depending on code path and process timing.
Non-atomic acceptance	High	Read-check-write pattern on invite state. TOCTOU-vulnerable. Two concurrent accepts could both succeed.
Preview surface gaps	Medium	Missing OG images, mixed error semantics. The link preview did not represent the actual product state.

1. Feature flag behavior was not fully aligned

The API layer and the rest of the system were not all reading the social feature flag the same way.

Some paths read the flag dynamically at runtime. Others captured it as a module-level constant. That meant a feature toggle could produce different behavior depending on which path was executed and when the process started.

# config.py: stale module-level constant
import os

# This value is captured once at import time and never re-evaluated
SOCIAL_ENABLED = os.getenv("GLIA_SOCIAL_ENABLED", "0") == "1"

# Every call site that reads SOCIAL_ENABLED sees the value
# from when the module was first imported, not the current state.

# config.py — runtime check (fix)
def is_social_enabled() -> bool:
    """Read the flag at call time, not import time."""
    return os.getenv("GLIA_SOCIAL_ENABLED", "0") == "1"

That is the kind of issue that looks small until you try to operate the system in QA or production.
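
To make the drift concrete, here is a minimal sketch that puts the two read styles side by side in one process. It reuses the GLIA_SOCIAL_ENABLED variable and the is_social_enabled helper shown above. Flipping the variable through os.environ is only a stand-in for whatever changes the flag in practice (for example, a test that patches the environment after the module was imported).

```python
import os

# Start with the feature off, as a freshly started process might see it.
os.environ["GLIA_SOCIAL_ENABLED"] = "0"

# Import-time style: evaluated exactly once, when this line runs.
SOCIAL_ENABLED = os.getenv("GLIA_SOCIAL_ENABLED", "0") == "1"


def is_social_enabled() -> bool:
    """Call-time style: re-reads the environment on every call."""
    return os.getenv("GLIA_SOCIAL_ENABLED", "0") == "1"


# The flag flips after the constant was captured.
os.environ["GLIA_SOCIAL_ENABLED"] = "1"

stale_view = SOCIAL_ENABLED        # still reflects the old value
fresh_view = is_social_enabled()   # reflects the current value

print(f"constant={stale_view} function={fresh_view}")
```

Any call site that reads the constant keeps answering "disabled" while call sites that use the function answer "enabled", which is the split behavior described above.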

2. The invite acceptance path had a real state transition problem

The original accept flow had a classic read-check-write shape.

That means two near-simultaneous requests could both pass the pending check before one overwrote the other. In other words, the invite state machine was vulnerable to a TOCTOU-style race.

Even if the practical probability was low, the semantics were wrong. For a user-facing invite flow, that matters.
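
The race is easier to see with the interleaving written out by hand. The sketch below is not the original Glia code: the Invite class and both helper functions are hypothetical stand-ins for a read-check-write handler, and the two "requests" are interleaved manually so the outcome is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Invite:
    status: str = "pending"
    recipient: Optional[str] = None


def is_pending(invite: Invite) -> bool:
    """Step 1 of the vulnerable flow: read the row and check its status."""
    return invite.status == "pending"


def write_accept(invite: Invite, user: str) -> None:
    """Step 2 of the vulnerable flow: write the new state unconditionally."""
    invite.status = "active"
    invite.recipient = user


invite = Invite()

# Both requests run step 1 before either one runs step 2.
alice_passed = is_pending(invite)
bob_passed = is_pending(invite)

if alice_passed:
    write_accept(invite, "alice")
if bob_passed:
    write_accept(invite, "bob")  # silently overwrites alice's acceptance

# Every request that passed the check believes it accepted the invite.
winners = [name for name, ok in (("alice", alice_passed), ("bob", bob_passed)) if ok]
print(winners, invite.recipient)
```

Both callers are told the accept succeeded, but only the last write survives. The check and the write need to be one operation for the state machine to hold.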

3. The preview surface and actual landing behavior needed clearer truth

Invite flows do not begin when the recipient opens the app. They begin when the link preview renders in a message thread.

That means the landing page, OG metadata, Twitter metadata, and preview image are not just presentation details. They are part of the functional experience.

If those pieces are broken or missing, the system is technically working but behaviorally incomplete.
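
As a rough illustration of what "complete" preview metadata can look like, here is a small sketch that renders the Open Graph and Twitter card tags for an invite page. The function name, the base URL parameter, the /invite/{token} page path, and the title wording are assumptions for illustration; the image path and the 1200x630 size follow the OG image route described later in this post. It also assumes the token is already URL-safe.

```python
from html import escape


def invite_meta_tags(base_url: str, token: str, inviter_name: str) -> str:
    """Render OG and Twitter card <meta> tags for one invite landing page."""
    page_url = f"{base_url}/invite/{token}"
    image_url = f"{base_url}/og/invite/{token}.png"
    title = f"{inviter_name} invited you to connect"  # hypothetical copy

    tags = [
        ("property", "og:type", "website"),
        ("property", "og:title", title),
        ("property", "og:url", page_url),
        ("property", "og:image", image_url),
        ("property", "og:image:width", "1200"),
        ("property", "og:image:height", "630"),
        ("name", "twitter:card", "summary_large_image"),
        ("name", "twitter:title", title),
        ("name", "twitter:image", image_url),
    ]
    # Escape every value so a display name cannot break out of the attribute.
    return "\n".join(
        f'<meta {attr}="{escape(key)}" content="{escape(value)}">'
        for attr, key, value in tags
    )


print(invite_meta_tags("https://example.test", "abc123", "Ann"))
```

Escaping matters here because the inviter name is user-controlled text that ends up inside HTML served to crawlers and browsers.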

Fixes shipped

I approached the fixes with one principle in mind: minimum correct scope.

Not every issue should become a refactor. Not every rough edge is a P0. The goal was to fix what affected correctness, product truth, and deployability.

Feature flag consistency

I aligned the social flag behavior so the system no longer relied on stale module-level values in critical paths.

That made the runtime behavior more honest and more predictable. If the flag changes, the relevant logic now reflects that at call time rather than depending on process start timing.

Atomic invite acceptance

The most important backend correctness change was making accept_invite atomic.

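# services/social.py: atomic conditional update
# db, token, recipient_user_id, and now come from the enclosing accept_invite call
from sqlalchemy import update
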
result = db.execute(
    update(EntityShare)
    .where(
        EntityShare.invite_token == token,
        EntityShare.status == "pending",
        EntityShare.expires_at > now,
        EntityShare.owner_user_id != recipient_user_id,
    )
    .values(
        status="active",
        recipient_user_id=recipient_user_id,
        accepted_at=now,
    )
)

if result.rowcount == 0:
    raise InvalidInviteError("Token invalid, expired, or already used")

Instead of a read-check-write flow, the logic was changed to a conditional update path. That closes the race window and makes the state transition behave like an actual state transition rather than a sequence of loosely related checks.

That was one of the highest-value fixes in the whole effort because it tightened the core invariant of the invite lifecycle.
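
The same idea can be tried without any of the Glia code. The sketch below uses the standard-library sqlite3 module and a hypothetical entity_share table whose columns mirror the fields in the snippet above; timestamps are plain integers to keep it short. It is a self-contained approximation of the pattern, not the production implementation.

```python
import sqlite3


def accept_invite(conn: sqlite3.Connection, token: str, recipient_id: int, now: int) -> bool:
    """Accept an invite with one conditional UPDATE; True only if this call won."""
    cur = conn.execute(
        """
        UPDATE entity_share
           SET status = 'active', recipient_user_id = ?, accepted_at = ?
         WHERE invite_token = ?
           AND status = 'pending'
           AND expires_at > ?
           AND owner_user_id != ?
        """,
        (recipient_id, now, token, now, recipient_id),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE actually changed.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entity_share (
        invite_token TEXT PRIMARY KEY,
        owner_user_id INTEGER NOT NULL,
        recipient_user_id INTEGER,
        status TEXT NOT NULL,
        expires_at INTEGER NOT NULL,
        accepted_at INTEGER
    )
    """
)
conn.execute(
    "INSERT INTO entity_share (invite_token, owner_user_id, status, expires_at) "
    "VALUES (?, ?, ?, ?)",
    ("tok-1", 1, "pending", 2_000),
)

first = accept_invite(conn, "tok-1", recipient_id=2, now=1_000)
second = accept_invite(conn, "tok-1", recipient_id=3, now=1_001)
print(first, second)
```

The second call matches zero rows because the status is no longer pending, so it cannot overwrite the first recipient. The database enforces the transition instead of application code checking and then hoping nothing changed.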

Invite landing behavior and preview cleanup

The invite page behavior was tightened so it no longer mixed unrelated semantics.

One of the things I wanted to avoid was pretending that feature disabled and invite expired were the same condition. They are not. The system should not tell a user a token has expired when the real issue is that the feature is turned off.

That kind of semantic precision matters more than it looks. It determines whether the product is understandable under failure conditions.
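
One way to keep those conditions from blurring together is to resolve the landing page into an explicit state before rendering anything. The sketch below is an illustration under assumptions: the LandingState names, the Invite fields, and the status values are hypothetical, not taken from the Glia codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LandingState(Enum):
    FEATURE_DISABLED = "feature_disabled"
    NOT_FOUND = "not_found"
    ALREADY_USED = "already_used"
    EXPIRED = "expired"
    VALID = "valid"


@dataclass
class Invite:
    status: str      # assumed values: "pending", "active", "declined"
    expires_at: int  # epoch seconds


def resolve_landing_state(
    social_enabled: bool, invite: Optional[Invite], now: int
) -> LandingState:
    # Check the flag first so a disabled feature is never reported as a token problem.
    if not social_enabled:
        return LandingState.FEATURE_DISABLED
    if invite is None:
        return LandingState.NOT_FOUND
    if invite.status != "pending":
        return LandingState.ALREADY_USED
    if invite.expires_at <= now:
        return LandingState.EXPIRED
    return LandingState.VALID


print(resolve_landing_state(False, Invite("pending", 2_000), now=1_000))
```

With one state per condition, the page template can show a different message for each case instead of collapsing everything into "expired".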

Basic observability

I also added the minimum level of server-side event logging needed to reason about the funnel.

Not a full analytics system. Not attribution. Just enough structured observability to answer basic questions like:

was an invite created
was it viewed
was it accepted
was it declined

That is the difference between operating blind and having a minimally useful signal.
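
A minimal version of that logging might look like the sketch below. The logger name, event names, and payload fields are assumptions chosen to match the four questions above, not the actual Glia event schema. It logs a short digest of the token rather than the token itself, on the assumption that an invite token should be treated like a credential.

```python
import hashlib
import json
import logging
import time
from typing import Optional

logger = logging.getLogger("social.invite")

FUNNEL_EVENTS = {"invite_created", "invite_viewed", "invite_accepted", "invite_declined"}


def log_invite_event(event: str, token: str, user_id: Optional[int] = None) -> str:
    """Emit one structured funnel event as a JSON log line and return that line."""
    if event not in FUNNEL_EVENTS:
        raise ValueError(f"unknown funnel event: {event}")
    payload = {
        "event": event,
        # A short digest lets events for one invite be correlated without logging the token.
        "token_digest": hashlib.sha256(token.encode()).hexdigest()[:12],
        "user_id": user_id,
        "ts": int(time.time()),
    }
    line = json.dumps(payload, sort_keys=True)
    logger.info(line)
    return line


print(log_invite_event("invite_viewed", "tok-1"))
```

Counting these lines per event name is enough to see where the funnel drops, without building anything that resembles an analytics system.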

Why atomic state transitions mattered

This was one of the clearest examples of engineering work that is easy to underestimate.

From the outside, accept invite sounds trivial. But in real systems, the meaning of acceptance is only as strong as the transition model underneath it.

If a flow can be accepted twice under race conditions, or if two requests can compete for ownership of the same invitation state, then the business object is not really stable.

I care a lot about this class of issue because users do not experience systems as source code. They experience them as truth claims. When a product says "this invitation was accepted," that statement should be backed by a state transition that is actually trustworthy.

Dynamic OG images

One of the more interesting parts of the work was the preview image layer.

At first glance, a missing OG image looks like a minor presentation bug. But that is not really what it is.

For link-based invite flows, the preview image is part of the product surface. It changes the way the link looks in iMessage, WhatsApp, Telegram, and Twitter/X. It changes how personal the invitation feels. It changes click behavior.

I ended up treating this not as a static asset cleanup problem, but as a better product opportunity.

# routes/og.py — dynamic OG image endpoint
@router.get("/og/invite/{token}.png")
async def social_invite_og_image(token: str):
    share = get_entity_share_by_token(token)
    if not share:
        return generate_fallback_og_image()

    return generate_og_image(
        inviter_name=share.owner.display_name,
        entity_name=share.entity.name,
        width=1200,
        height=630,
    )

Instead of relying on a missing or generic static image, I added a dynamic OG image route that generates a 1200x630 PNG for each invite.

For valid invites, the image is personalized using the inviter name and entity name.

For invalid or expired invites, the system returns a safe fallback image rather than throwing an error.

That moved the preview layer from a broken asset to a real product capability.

Deployment and verification

This part matters just as much as the code.

The work was only meaningful once it was:

committed cleanly
pushed to both personal and organization remotes
confirmed consistent across local and remote heads
deployed to QA
rebuilt correctly
verified through smoke checks

# Push to remotes
$ git push origin main
To github.com:glia-app/glia-backend.git
   a3f1e2d..6dcd36f  main -> main

$ git push org main
Everything up-to-date

The rebuild detail mattered because I had introduced font dependencies for OG image generation. A plain restart would not have been enough. The container image had to be rebuilt so the runtime environment actually matched the code assumptions.

# Rebuild and deploy to QA
$ docker compose -f docker-compose.yml -f docker-compose.override.yml \
  -f docker-compose.qa.yml up -d --build api worker beat
=> [internal] load build context
=> [stage-1 4/8] RUN apt-get install -y fonts-jetbrains-mono
=> exporting to image
Container glia-api-1     Started
Container glia-worker-1  Started
Container glia-beat-1    Started

After deployment, I verified:

# Smoke verification
$ curl -s http://localhost:8000/health | jq .status
"ok"

$ curl -s http://localhost:8000/invite/
200 OK — invite landing rendered

$ curl -s http://localhost:8000/og/invite/.png -o /dev/null -w '%{http_code}'
200

$ pytest tests/social/ -q
62 passed in 4.31s

# HEAD consistency check
$ echo "local HEAD     = $(git rev-parse --short HEAD)"
local HEAD     = 6dcd36f
$ echo "personal HEAD  = $(git ls-remote personal main | cut -c1-7)"
personal HEAD  = 6dcd36f
$ echo "org HEAD       = $(git ls-remote org main | cut -c1-7)"
org HEAD       = 6dcd36f

That was the point where the work felt complete. Not when the code compiled. Not when tests passed locally. When the deployed system actually behaved the way the product claimed it behaved.
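
The HEAD consistency check is simple enough to script if it needs to run more than once. The sketch below only covers the comparison logic: it parses text in the format that git ls-remote prints (a SHA, a tab, then the ref name) and reports whether all heads agree. The function names are hypothetical, and the SHA in the usage lines is a made-up placeholder, not a real commit.

```python
from typing import Dict, Optional


def parse_ls_remote(output: str, ref: str = "refs/heads/main") -> Optional[str]:
    """Return the full SHA that `git ls-remote` reported for `ref`, if any."""
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == ref:
            return parts[0]
    return None


def heads_consistent(heads: Dict[str, Optional[str]]) -> bool:
    """True only when every head is known and they all point at the same commit."""
    values = list(heads.values())
    return bool(values) and None not in values and len(set(values)) == 1


# Placeholder 40-character SHA for illustration only.
sha = "6dcd36f" + "0" * 33
sample_output = f"{sha}\trefs/heads/main\n"

heads = {
    "local": sha,
    "personal": parse_ls_remote(sample_output),
    "org": parse_ls_remote(sample_output),
}
print(heads_consistent(heads))
```

Treating a missing ref as a failure, rather than skipping it, keeps the check honest when a remote is misconfigured or the branch was never pushed.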

What I intentionally left out

One of the most important parts of this kind of work is deciding what not to do.

I did not try to turn this into a full analytics system.

I did not redesign the broader connection model.

I did not expand the share model into a more generalized multi-recipient system.

I did not try to solve unrelated entity or story problems under the excuse of "while I'm here."

That discipline matters. A lot of product-engineering work goes off the rails because the initial problem is real, but the response is too expansive.

What this revealed about product-engineering work

This work reinforced something I care about a lot:

Good engineering is not just implementation quality. It is scope judgment.

The useful part was not simply finding bugs. It was separating:

correctness problems
rollout problems
observability gaps
product-semantic questions
future design opportunities

Those are not the same class of problem, and treating them as if they were leads to messy priorities and shallow fixes.

It also reinforced that product truth lives across boundaries.

A flow is not closed just because the backend has an endpoint. It is not closed just because the iOS app can make the request. It is not closed just because QA can load a page.

It is closed when the system's behavior is coherent from user intent to deployed reality.

What I fixed

feature flag consistency in social invite paths
atomic invite acceptance
cleaner invite landing behavior
minimum observability for invite funnel actions
OG and Twitter metadata completion
dynamic OG image generation
QA deployment and runtime verification
documentation of final state

What remained intentionally out of scope

full analytics / attribution
broader social model redesign
generalized multi-recipient invite semantics
unrelated entity/story refactors
expansion beyond the current product definition

What I learned

The part of engineering work I trust most is the part that survives deployment.

I like work that can be described clearly after the fact: what was broken, what was actually fixed, what was left out on purpose, and what the system now truthfully does.

That is what makes a code change feel real.

This was a good reminder that some of the most valuable engineering work is not building something from scratch. It is making an already-existing system honest, closed, and operable.

Originally published at: https://dearartist.xyz/blog/glia-invite-flow


My Starship Terminal Setup for macOS

Tue, 14 Apr 2026 09:30:00 GMT

A clean and practical terminal prompt setup using Starship and JetBrainsMono Nerd Font.

I spend more time in my terminal than I'd like to admit, so at some point I decided it should at least look the way I want it to. Not flashy. Not themed within an inch of its life. Just clean enough that I actually enjoy opening it, and informative enough that I don't have to think about where I am or what branch I'm on.

This is that setup. It's small, it's opinionated, and it works well for my day-to-day. I published the config on GitHub in case anyone wants to start from the same place.

Preview

Below is what the prompt actually looks like in practice — a powerline-style segmented prompt with user, directory, git branch + dirty flag, Python version, and a timestamp. Each segment uses a Catppuccin-style color and the seamless arrow transition typical of Starship.

Terminal — zsh
─────────────────────────────────────────────
 yuanh  ~  ❯ cd ~/code/starship-config
 yuanh  ~/code/starship-config  ❯ ls
README.md    screenshots    starship.toml
 yuanh  ~/code/starship-config  ❯ python3 --version
Python 3.11.7
 yuanh  ~/code/starship-config  ❯ cd ~/code/glia-core
 yuanh  …/glia-core   main $!?  🐍 v3.11.7  13:41 ❯ git status
On branch main
Changes not staged for commit:
    modified: .env.dev
    modified: .env.qa
 yuanh  …/glia-core   main $!?  🐍 v3.11.7  13:41 ❯ _

Powerline prompt: user · directory · git · python · time.

What the prompt looks like in practice.

The Stack

macOS Terminal — The native terminal app. Nothing fancy, nothing extra.
zsh — Default shell on macOS. Fast, extensible, well-supported.
Starship — A minimal, fast, cross-shell prompt written in Rust.
JetBrainsMono Nerd Font Mono — Monospace font with ligatures and icon glyphs baked in.

What the Prompt Shows

Current user
Current directory
Git branch and status
Python version
Timestamp

How to Use It

1. Install Starship

curl -sS https://starship.rs/install.sh | sh

2. Enable it in your shell

# Add to ~/.zshrc
eval "$(starship init zsh)"

3. Install the font

Download JetBrainsMono Nerd Font from nerdfonts.com
Set it as your terminal font.

4. Copy the config

cp starship.toml ~/.config/starship.toml

GitHub

View on GitHub

Clone it, tweak it, make it yours.

Originally published at: https://dearartist.xyz/blog/starship-setup
