<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Dear Artist — Notes by Haixiang Yuan</title>
    <link>https://dearartist.xyz/blog</link>
    <description>Notes, experiments, and field reports from a solo builder. Product, AI infrastructure, and developer tooling.</description>
    <language>en</language>
    <lastBuildDate>Thu, 30 Apr 2026 16:50:21 GMT</lastBuildDate>
    <atom:link href="https://dearartist.xyz/rss.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Building devbrief: A Local History Browser for Claude Code Terminal Sessions</title>
      <link>https://dearartist.xyz/blog/devbrief-local-history-browser</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/devbrief-local-history-browser</guid>
      <pubDate>Tue, 28 Apr 2026 10:00:00 GMT</pubDate>
      <description><![CDATA[A technical note on building devbrief, a token-safe local history browser for Claude Code terminal sessions with optional AI briefs.]]></description>
      <category>devbrief</category>
      <category>claude-code</category>
      <category>developer-tools</category>
      <category>cli</category>
      <content:encoded><![CDATA[<p><em>Engineering Note · Open Source · Developer Tools</em></p>
<p>How a simple session summariser turned into a token-safe, local-first developer tool.</p>
<p><code>$ devbrief list → devbrief raw → devbrief estimate → devbrief brief</code></p>
<p><a href="https://github.com/yuannh/devbrief">View devbrief on GitHub →</a></p>
<p><img src="https://dearartist.xyz/substack-visuals/devbrief-local-history-browser/devbrief-local-history-browser-hero-split-pane-terminal-card.png" alt="Split-pane terminal mockup: sessions list on the left, preview on the right" /></p>
<p><em>Local preview, no LLM call.</em></p>
<hr />
<h2>1. The problem</h2>
<p>I started building devbrief because of a very practical problem.</p>
<p>I use Claude Code heavily in the terminal. It is where a lot of real work happens: debugging, refactoring, writing scripts, checking logs, reviewing files, and gradually moving a task from ambiguity to closure.</p>
<p>But after a session ends, the work becomes surprisingly hard to revisit.</p>
<p>Claude Code does store terminal sessions locally as JSONL transcripts under <code>~/.claude/projects</code>. The problem is not that the history does not exist. The problem is that it is not pleasant to browse.</p>
<p>If I want to remember what happened in a previous coding session, I do not want to dig through raw JSON. I want to know:</p>
<ul>
<li>What was the task?</li>
<li>What did I ask Claude to do?</li>
<li>What files were touched?</li>
<li>What commands were run?</li>
<li>Did the session finish?</li>
<li>Was it blocked?</li>
<li>Did it stop because of a usage limit?</li>
<li>Do I need to continue this later?</li>
</ul>
<p>That was the original motivation for devbrief: a local browser for Claude Code terminal history. But the first version went in the wrong direction.</p>
<p>The same instinct shows up in my <a href="https://dearartist.xyz/blog/claude-statusline">Claude status line</a> work and the broader <a href="https://dearartist.xyz/blog/terminal-workstack">terminal workstack</a> — I want clear visibility over what the AI is doing on my machine before I trust any of it.</p>
<h2>2. The first wrong design</h2>
<p>The obvious first idea was: automatically summarise every completed Claude Code session.</p>
<p>It sounded convenient. Every time a session ended, devbrief would read the transcript, call Claude, extract a nice summary, and store it locally. Something like:</p>
<pre><code class="language-text">Claude Code session ends
        ↓
devbrief reads full transcript
        ↓
Claude summarises problem → approach → outcome
        ↓
summary saved locally
</code></pre>
<p>On paper, that looked useful. In practice, it was the wrong default.</p>
<p>A Claude Code transcript is not just a clean conversation. It can contain tool calls, shell output, long logs, file contents, diffs, JSON, prompts, errors, repeated command output, and sometimes the output of previous analysis steps.</p>
<p>One of my sessions had this shape:</p>
<p><strong>One real session</strong></p>
<ul>
<li>Raw transcript chars: <strong>833,676</strong></li>
<li>Compact evidence chars: <strong>16,045</strong></li>
<li>Approx input tokens: <strong>4,011</strong></li>
<li>Excluded long tool outputs: <strong>85</strong></li>
<li>Excluded full file contents: <strong>39</strong></li>
</ul>
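<p>Those numbers changed how I thought about the product. The raw transcript was roughly 50 times larger than the compact evidence needed to describe the session.</p>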
<p>The expensive part was not &quot;summarising a task&quot;. The expensive part was blindly feeding a huge raw development transcript back into a model, sometimes repeatedly.</p>
<p>The first version had another problem: the hook could run automatically. That meant a tool designed to help me understand Claude Code sessions could itself start burning Claude usage in the background. That broke the product boundary.</p>
<blockquote>
<p><strong>Local evidence, not billing proof</strong></p>
<p>After the first implementation, I found local evidence that the unsafe Stop hook behaviour was real. The SQLite database contained recursive self-analysis rows — sessions where devbrief had effectively analysed its own analyser prompt instead of a normal development task. The local history also showed usage-limit endings during that period.</p>
<p>That is not billing proof, and it does not tell me the exact number of tokens spent. But it was enough evidence to confirm the product risk: an automatic summariser can become part of the problem it is supposed to explain.</p>
</blockquote>
<p>This is the same lesson I keep relearning in <a href="https://dearartist.xyz/blog/glia-token-cost">token cost</a> work and in <a href="https://dearartist.xyz/blog/label-every-llm-call-ai-backend-cost-audit">labelling every LLM call</a>: the costs you do not see are the ones that hurt.</p>
<h2>3. The product correction</h2>
<p>The real product principle became clear:</p>
<ul>
<li>Local visibility first.</li>
<li>AI only when explicitly asked.</li>
</ul>
<p>That changed the entire architecture. devbrief should not be an automatic AI summariser. It should first be a local terminal history browser. The raw history browser should be useful even with no API key, no Claude quota, no hook installed, and no AI calls. AI should be an optional second layer, not the default behaviour.</p>
<p>So the product split into two layers. The first layer is local and zero-token:</p>
<pre><code class="language-bash">devbrief list
devbrief raw SESSION_ID
devbrief view SESSION_ID
devbrief estimate SESSION_ID
devbrief doctor
</code></pre>
<p>The second layer is explicit and token-consuming:</p>
<p><code>devbrief brief SESSION_ID</code></p>
<p>That command only works on one selected session. It shows an estimate first. It asks for confirmation. Only then does it call Claude. This became the core safety model of the tool.</p>
<blockquote>
<p><strong>Local-first by default</strong></p>
<p>The best AI tooling is not the one that calls AI everywhere. It is the one that knows when not to.</p>
</blockquote>
<h2>4. What devbrief does</h2>
<p>devbrief is a local Claude Code terminal history browser with optional AI briefs.</p>
<p>The product surface is intentionally small:</p>
<ul>
<li><strong>Browse</strong> local Claude Code sessions, grouped by project, with raw previews that never call a model.</li>
<li><strong>Read</strong> a session: deterministic outcome detection (<code>completed</code>, <code>blocked</code>, <code>usage_limited</code>, <code>needs_followup</code>) without sending anything to Claude.</li>
<li><strong>Decide</strong>: estimate the cost of an AI brief locally, then optionally generate one brief for one selected session, after explicit confirmation.</li>
</ul>
<p>Everything else is a variation of these three. The goal is not to replace Claude Code. It is to make its terminal history easier to inspect, understand, and continue from.</p>
<h2>5. Raw history browsing</h2>
<p>The raw preview is the most important part of the product. It does not call Claude. It does not spend tokens. It reads the local JSONL transcript and organises the useful parts into a readable view.</p>
<p>A session preview includes:</p>
<ul>
<li>Session ID</li>
<li>Home project</li>
<li>CWD</li>
<li>Started / updated time</li>
<li>Status</li>
<li>Turn count</li>
<li>Session outcome</li>
<li>Human request</li>
<li>What happened locally</li>
<li>Files touched or inspected</li>
<li>Final assistant response</li>
</ul>
<p>For example, one real session ended like this:</p>
<pre><code class="language-text">$ devbrief raw 6f9fdb83

Session Outcome
  Status      usage_limited
  Completion  incomplete
  Confidence  high
  Reason      Final closeout/verification was requested,
              but Claude Code stopped because usage
              limit was reached.
  Signals     final response contains &quot;out of extra usage&quot;;
              late request asks for &quot;final closeout&quot;
</code></pre>
<p>This is more useful than a generic summary. It tells me not just what the session was about, but whether it actually finished.</p>
<p>That distinction matters. A session that ended with &quot;done&quot; is very different from a session that ended because a usage limit was reached halfway through verification.</p>
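<p>A minimal sketch of that kind of local read, assuming each JSONL line is an object with a <code>type</code> field and a <code>message.content</code> payload. Those field names and the helper names <code>text_of</code> and <code>preview</code> are assumptions for illustration, not devbrief's actual parser or a documented Claude Code schema:</p>
<pre><code class="language-python">import json


def text_of(record):
    """Return the plain text of one transcript record, or "" if it has none."""
    content = (record.get("message") or {}).get("content")
    if isinstance(content, str):
        return content
    if not isinstance(content, list):
        return ""
    # Assumed shape: a list of blocks; keep only the text blocks.
    parts = [b.get("text", "") for b in content
             if isinstance(b, dict) and b.get("type") == "text"]
    return "\n".join(parts)


def preview(lines):
    """Build a tiny preview from JSONL lines without calling any model."""
    first_request, last_response, turns = "", "", 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if not isinstance(record, dict):
            continue
        text = text_of(record)
        if not text:
            continue
        if record.get("type") == "user":
            turns += 1
            first_request = first_request or text
        elif record.get("type") == "assistant":
            last_response = text
    return {"turns": turns, "request": first_request, "final": last_response}
</code></pre>
<p>Everything here is file I/O and string handling, which is why a preview of this kind can stay zero-token.</p>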
<h2>6. Session outcome detection</h2>
<p>One of the features I care about most is local session outcome detection. Without calling a model, devbrief tries to infer whether a session is:</p>
<ul>
<li><code>completed</code></li>
<li><code>incomplete</code></li>
<li><code>blocked</code></li>
<li><code>usage_limited</code></li>
<li><code>interrupted</code></li>
<li><code>needs_followup</code></li>
<li><code>unknown</code></li>
</ul>
<p>It does this with deterministic local heuristics. For example, if the final assistant response contains <code>out of extra usage</code>, <code>usage limit</code>, <code>resets</code>, or <code>rate limit</code>, then the session can be marked as <code>usage_limited</code>.</p>
<p>If the final user request asked for a closeout report, verification, deploy status, or remaining risks, and the session ended with a usage limit message, devbrief marks the completion state as <code>incomplete</code>.</p>
<p>This is simple, but useful. It turns raw history into something closer to a task log.</p>
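<p>The two rules above can be sketched as plain substring checks. The phrase lists are taken from this section; the function name <code>detect_outcome</code> and the shape of the returned dictionary are hypothetical, and the real detector covers more states than this sketch does:</p>
<pre><code class="language-python">USAGE_PHRASES = ("out of extra usage", "usage limit", "resets", "rate limit")
CLOSEOUT_PHRASES = ("closeout", "verification", "deploy status", "remaining risks")


def detect_outcome(last_request, final_response):
    """Classify a session from its last user request and final response."""
    request = last_request.lower()
    response = final_response.lower()
    signals = [p for p in USAGE_PHRASES if p in response]
    if not signals:
        # This sketch only knows the usage-limit rule.
        return {"status": "unknown", "completion": "unknown", "signals": []}
    wanted_closeout = [p for p in CLOSEOUT_PHRASES if p in request]
    completion = "incomplete" if wanted_closeout else "unknown"
    return {
        "status": "usage_limited",
        "completion": completion,
        "signals": signals + wanted_closeout,
    }
</code></pre>
<p>Because the checks are deterministic, the same transcript always produces the same outcome, and the matched phrases can be shown to the user as the reason for the classification.</p>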
<h2>7. Token safety model</h2>
<p>The most important design constraint is that browsing should never silently spend tokens. In normal use, exactly one command is meant to spend tokens.</p>
<table>
<thead>
<tr>
<th>Command</th>
<th>Calls Claude?</th>
<th>Spends tokens?</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td><code>devbrief list</code></td>
<td>No</td>
<td>No</td>
<td>Local JSONL + SQLite only</td>
</tr>
<tr>
<td><code>devbrief raw SESSION_ID</code></td>
<td>No</td>
<td>No</td>
<td>Local preview</td>
</tr>
<tr>
<td><code>devbrief view SESSION_ID</code></td>
<td>No</td>
<td>No</td>
<td>Reads stored brief from SQLite</td>
</tr>
<tr>
<td><code>devbrief estimate SESSION_ID</code></td>
<td>No</td>
<td>No</td>
<td>Shows packet size only</td>
</tr>
<tr>
<td><code>devbrief doctor</code></td>
<td>No</td>
<td>No</td>
<td>Local diagnostics</td>
</tr>
<tr>
<td><code>devbrief tui</code></td>
<td>No</td>
<td>No</td>
<td>Local browsing</td>
</tr>
<tr>
<td><code>devbrief capture --hook</code></td>
<td>No</td>
<td>No</td>
<td>Metadata only</td>
</tr>
<tr>
<td><strong><code>devbrief brief SESSION_ID</code></strong></td>
<td><strong>Yes</strong></td>
<td><strong>Yes</strong></td>
<td><strong>Only after estimate + confirmation</strong></td>
</tr>
<tr>
<td><code>devbrief digest SESSION_ID</code></td>
<td>Yes</td>
<td>Yes</td>
<td>Deprecated alias for brief; avoid using it</td>
</tr>
<tr>
<td><code>devbrief report</code></td>
<td>No</td>
<td>No</td>
<td>Disabled compatibility stub</td>
</tr>
</tbody></table>
<p>The old multi-session report command was disabled because it did not fit the safety model. Multi-session AI reporting can easily become expensive. The product is now intentionally narrower:</p>
<ul>
<li>One session.</li>
<li>One brief.</li>
<li>Explicit confirmation.</li>
</ul>
<blockquote>
<p><strong>AI only after confirmation</strong></p>
<p>That restraint is a feature.</p>
</blockquote>
<h2>8. Compact evidence before AI</h2>
<p>When an AI brief is requested, devbrief does not send the full raw transcript. It first builds a compact evidence packet.</p>
<p><strong>The compactor removes or truncates:</strong></p>
<ul>
<li>long shell outputs</li>
<li>full file contents</li>
<li>large diffs</li>
<li>repeated JSON</li>
<li>internal analyser prompts</li>
<li>huge tool results</li>
</ul>
<p><strong>It keeps the things that matter:</strong></p>
<ul>
<li>human requests</li>
<li>final assistant response</li>
<li>commands run</li>
<li>files touched</li>
<li>errors and blockers</li>
<li>tool names</li>
<li>session metadata</li>
</ul>
<p>Then it shows an estimate. The estimate itself never calls Claude:</p>
<pre><code class="language-text">$ devbrief estimate 6f9fdb83

Token estimate for session 6f9fdb83

  Raw transcript chars       833,676
  Compact evidence chars      16,045
  Approx input tokens          4,011
  Truncated                   yes

No LLM call has been made.
</code></pre>
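<p>A minimal sketch of the compact-then-estimate step. The figures above are consistent with a rule of roughly four characters per token (16,045 // 4 = 4,011), but that heuristic, the per-item cap, and the names <code>compact</code> and <code>estimate</code> are assumptions for illustration rather than devbrief's confirmed implementation:</p>
<pre><code class="language-python">MAX_ITEM_CHARS = 400  # hypothetical per-item cap, not devbrief's real setting


def compact(items, max_item_chars=MAX_ITEM_CHARS):
    """Keep short evidence items; replace long ones with a placeholder.

    items is a list of (kind, text) pairs, e.g. ("tool_output", "...").
    Returns the compacted list and the number of items that were cut.
    """
    kept, excluded = [], 0
    for kind, text in items:
        if len(text) > max_item_chars:
            excluded += 1
            note = "[" + kind + " omitted: " + str(len(text)) + " chars]"
            kept.append((kind, note))
        else:
            kept.append((kind, text))
    return kept, excluded


def estimate(items):
    """Return character and approximate token counts for a packet."""
    chars = sum(len(text) for _, text in items)
    # Assumed heuristic: about four characters per token.
    return {"chars": chars, "approx_tokens": chars // 4}
</code></pre>
<p>Both functions are pure string arithmetic, so running the estimate costs nothing and can be repeated as often as needed before deciding.</p>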
<p>Only when you explicitly run <code>devbrief brief</code> does devbrief ask before spending tokens:</p>
<pre><code class="language-text">$ devbrief brief 6f9fdb83

Session: glia-core/6f9fdb83

  Compact evidence chars : 16,045
  Approx input tokens    : 4,011
  ⚠ Transcript was truncated to fit max_chars

Generate brief and spend Claude tokens? [y/n] (n):
</code></pre>
<p>The default is <code>n</code>. That default matters.</p>
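<p>A sketch of a default-no gate of this kind, written with the standard library only. The function name <code>confirm_brief</code> and the injectable <code>ask</code> parameter are hypothetical; devbrief's actual prompt code may differ:</p>
<pre><code class="language-python">def confirm_brief(approx_tokens, ask=input):
    """Return True only on an explicit yes; empty input means no."""
    prompt = (
        "Approx input tokens: " + format(approx_tokens, ",") + "\n"
        + "Generate brief and spend Claude tokens? [y/n] (n): "
    )
    answer = ask(prompt).strip().lower()
    # Anything other than an explicit yes keeps the session local.
    return answer in ("y", "yes")
</code></pre>
<p>Pressing Enter, mistyping, or piping in empty input all resolve to "no", so tokens are only spent on a deliberate keystroke.</p>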
<h2>9. The optional hook</h2>
<p>Claude Code supports hooks, but devbrief treats them carefully. The optional hook is capture-only:</p>
<pre><code class="language-bash">devbrief capture --hook
</code></pre>
<p>It records lightweight metadata, such as:</p>
<ul>
<li><code>session_id</code></li>
<li><code>jsonl_path</code></li>
<li><code>project_name</code></li>
<li><code>cwd</code></li>
<li><code>created_at</code></li>
<li><code>updated_at</code></li>
<li><code>turn_count</code></li>
<li><code>status = pending/raw</code></li>
</ul>
<p>It does not call Claude. It does not generate a brief. It does not spend tokens.</p>
<p>Unsafe hook patterns are explicitly avoided. They are shown here as anti-patterns, not recommended commands:</p>
<pre><code class="language-bash">devbrief digest --hook
devbrief brief --hook
claude -p
claude --print
</code></pre>
<p>The hook is not required for browsing. devbrief can read Claude Code&#39;s local JSONL history directly. This means the safest default is:</p>
<ul>
<li>No hook installed.</li>
<li>Browse locally.</li>
<li>Generate AI only when needed.</li>
</ul>
<blockquote>
<p><strong>Capture-only hook</strong></p>
<p>The hook captures metadata into the local SQLite store and nothing more. No transcript content leaves the machine, and Claude is never invoked behind your back.</p>
</blockquote>
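<p>A capture-only write can be sketched with Python's built-in <code>sqlite3</code> module. The column names follow the metadata list above; the table name <code>sessions</code>, the schema, and the function name <code>capture</code> are assumptions, not devbrief's actual storage layout:</p>
<pre><code class="language-python">import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id   TEXT PRIMARY KEY,
    jsonl_path   TEXT,
    project_name TEXT,
    cwd          TEXT,
    created_at   TEXT,
    updated_at   TEXT,
    turn_count   INTEGER,
    status       TEXT
)
"""


def capture(conn, meta):
    """Store one metadata row; no transcript content, no model call."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (
            meta["session_id"], meta["jsonl_path"], meta["project_name"],
            meta["cwd"], meta["created_at"], meta["updated_at"],
            meta["turn_count"], "pending/raw",
        ),
    )
    conn.commit()
</code></pre>
<p>The row stores a path to the transcript rather than the transcript itself. A fuller version would probably preserve an existing <code>briefed</code> status instead of resetting it on every capture.</p>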
<h2>10. Interactive terminal browser</h2>
<p>devbrief also includes an interactive terminal UI:</p>
<pre><code class="language-bash">devbrief
# or
devbrief tui
</code></pre>
<p>The interface is a keyboard-first split-pane browser. The left pane shows sessions:</p>
<pre><code class="language-text">$ devbrief — sessions

ID         Status        Outcome          Project        Title
────────   ───────────   ──────────────   ────────────   ──────────────────────────
6f9fdb83   pending/raw   usage_limited    glia-core      quote provenance bug
e5aab70b   briefed       completed        glia-core      Fix LLM-fabricated prose
6534fc72   briefed       completed        devbrief       Build devbrief CLI
</code></pre>
<p>The right pane shows the selected session — outcome, request, what happened locally, files touched, and the final assistant response. If an AI brief already exists, the detail pane can show that instead.</p>
<p><img src="https://dearartist.xyz/substack-visuals/devbrief-local-history-browser/devbrief-local-history-browser-tui-split-pane.png" alt="TUI split-pane mockup with j/k navigation footer" /></p>
<p><em>TUI keybindings — local, fast, zero-token.</em></p>
<p>The important part is that opening and navigating the TUI is still local-only. It does not call Claude just because I browse around.</p>
<p>Key bindings include:</p>
<pre><code class="language-text">j / k or ↑ / ↓    move selection
Enter             open / focus detail
v                 toggle raw preview / AI brief
d or b            generate brief, after estimate + confirmation
r                 refresh
a                 toggle current project / all projects
?                 help
q                 quit
</code></pre>
<p>The interface is meant to make previous Claude Code work feel navigable, not buried inside JSONL files.</p>
<h2>11. Architecture</h2>
<p>The architecture is intentionally simple.</p>
<pre><code class="language-text">Claude Code JSONL transcripts
~/.claude/projects
        ↓
devbrief parser
        ↓
local raw preview
        ↓
session outcome detector
        ↓
SQLite metadata + stored briefs
        ↓
optional AI brief after confirmation
</code></pre>
<p><strong>Local paths</strong></p>
<ul>
<li>Claude transcripts: <code>~/.claude/projects</code></li>
<li>devbrief config: <code>~/.config/devbrief/config.toml</code></li>
<li>devbrief database: <code>~/.local/share/devbrief/sessions.db</code></li>
</ul>
<p>The stack is deliberately lightweight:</p>
<ul>
<li>Python</li>
<li>Click for CLI commands</li>
<li>Rich for terminal output</li>
<li>Textual for the interactive TUI</li>
<li>SQLite for local metadata and stored briefs</li>
</ul>
<p>The tool does not need a backend. It does not need a hosted service. It does not need a database server. It sits next to Claude Code and helps me see what already exists locally.</p>
<h2>12. Example workflow</h2>
<p>A normal pass through devbrief starts with local browsing and only reaches AI at the final, confirmation-gated step.</p>
<h3>Step 1 — Browse · zero-token</h3>
<pre><code class="language-text">$ devbrief list

glia-core
  6f9fdb83  2026-04-25 19:42   42 turns   usage_limited
  9c12ab07  2026-04-23 11:08   18 turns   completed

dear-artist
  4a3e51d2  2026-04-22 09:14   31 turns   completed
  2b8f0e6c  2026-04-20 22:01   7 turns    needs_followup

4 sessions across 2 projects.
</code></pre>
<p><em>Local listing of recent Claude Code sessions, grouped by project. No model called.</em></p>
<h3>Step 2 — Read · zero-token</h3>
<pre><code class="language-text">$ devbrief raw 6f9fdb83

Session: glia-core/6f9fdb83
Started:  2026-04-25 17:03    Ended: 2026-04-25 19:42
Turns:    42                  Outcome: usage_limited

— last user request —
Can you give me a closeout report on the migration: what landed,
what is still open, and what to verify before deploy?

— last assistant response —
You are out of extra usage. Limits reset in 2 hours.

No LLM call has been made.
</code></pre>
<p><em>Deterministic preview built from the local JSONL. No tokens spent.</em></p>
<h3>Step 3 — Estimate · zero-token</h3>
<pre><code class="language-text">$ devbrief estimate 6f9fdb83

Token estimate for session 6f9fdb83

  Raw transcript chars       833,676
  Compact evidence chars      16,045
  Approx input tokens          4,011
  Truncated                   yes

No LLM call has been made.
</code></pre>
<p><em>Estimate the cost of an AI brief locally, before deciding to spend tokens.</em></p>
<h3>Step 4 — Brief · confirmation-gated</h3>
<pre><code class="language-text">$ devbrief brief 6f9fdb83

Session: glia-core/6f9fdb83

  Compact evidence chars : 16,045
  Approx input tokens    : 4,011
  ⚠ Transcript was truncated to fit max_chars

Generate brief and spend Claude tokens? [y/n] (n):
</code></pre>
<p><em>The only supported command that can spend tokens. It always asks first; the default is no.</em></p>
<p>This keeps the default path local and cheap, while still allowing AI to be used when it actually adds value.</p>
<h2>13. From experiment to open source</h2>
<p>After the core product model stabilised, I prepared the project for GitHub. The bigger work was language, not code: devbrief is positioned as a local history browser with optional AI briefs, not an automatic summarisation tool. That difference is the whole product.</p>
<p>The repository is public at <a href="https://github.com/yuannh/devbrief">https://github.com/yuannh/devbrief</a>. The README covers installation, the full command reference, hook safety notes, where local data is stored, and usage examples. This note stays focused on the design decisions; the README handles the manual.</p>
<h3>Quickstart</h3>
<ul>
<li><strong>Clone</strong> the repository.</li>
<li><strong>Install</strong> in editable mode.</li>
<li><strong>Verify</strong> the local setup with <code>devbrief doctor</code>.</li>
</ul>
<pre><code class="language-bash">git clone https://github.com/yuannh/devbrief.git
cd devbrief
pip install -e .
devbrief doctor
</code></pre>
<p><code>devbrief doctor</code> runs local health checks only — no network calls, no API key required.</p>
<p>References:</p>
<ul>
<li><a href="https://github.com/yuannh/devbrief#readme">README</a></li>
<li><a href="https://github.com/yuannh/devbrief/blob/main/docs/usage.md">Command reference</a></li>
<li><a href="https://github.com/yuannh/devbrief/blob/main/docs/token-safety.md">Hook safety notes</a></li>
</ul>
<h2>14. Reflection</h2>
<p>devbrief started as a convenience tool, but the real lesson was product restraint.</p>
<p>In AI tools, the tempting default is to send everything back to a model: transcripts, logs, diffs, files, prompts, command output. It feels intelligent, but it can quietly create cost, latency, and trust problems.</p>
<p>The better default is local visibility first, AI only when it adds clear value.</p>
<p>That decision shaped every command, every flag, and every line of the safety table. I started by trying to summarise more. I ended by building a tool that summarises less, but lets me see better.</p>
<p><em>Restraint is a feature.</em></p>
<hr />
<p><em>Originally published at: <a href="https://dearartist.xyz/blog/devbrief-local-history-browser">https://dearartist.xyz/blog/devbrief-local-history-browser</a></em></p>]]></content:encoded>
    </item>
    <item>
      <title>Label Every LLM Call: How We Cut AI Backend Cost Without Downgrading Quality</title>
      <link>https://dearartist.xyz/blog/label-every-llm-call-ai-backend-cost-audit</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/label-every-llm-call-ai-backend-cost-audit</guid>
      <pubDate>Mon, 27 Apr 2026 10:00:00 GMT</pubDate>
      <description><![CDATA[A real engineering case study on reducing LLM backend cost using per-callsite routing, prompt caching, token logging, and safer model fallback design.]]></description>
      <category>llm</category>
      <category>ai-infrastructure</category>
      <category>cost-optimization</category>
      <category>backend</category>
      <content:encoded><![CDATA[<p><em>A real backend refactor from one-model-fits-all dispatch to per-callsite LLM routing, prompt caching, and token-level observability.</em></p>
<blockquote>
<p><strong>Abstract</strong></p>
<p><em>&quot;The problem was not that we were calling LLMs too often. The problem was that every task was routed through the same expensive model tier.&quot;</em></p>
</blockquote>
<h2>Key changes — at a glance</h2>
<table>
<thead>
<tr>
<th>Change</th>
<th>Est. impact</th>
</tr>
</thead>
<tbody><tr>
<td>Per-callsite LLM routing across 30 callsites</td>
<td>structural</td>
</tr>
<tr>
<td>13 extraction/classification calls → Flash-class</td>
<td>~$3–5/day</td>
</tr>
<tr>
<td>Profile summaries → Haiku-class</td>
<td>~$0.5–1/day</td>
</tr>
<tr>
<td>Chat output cap 4096 → 2048 tokens</td>
<td>~$0.5–1/day</td>
</tr>
<tr>
<td>Anthropic prompt caching on stable prefixes</td>
<td>~$0.3–2/day</td>
</tr>
<tr>
<td>Token-level logging on every callsite</td>
<td>observability</td>
</tr>
<tr>
<td>Pinned quality paths before flipping primary</td>
<td>no regression</td>
</tr>
<tr>
<td><strong>Total expected QA saving</strong></td>
<td><strong>~$4.7–10/day</strong></td>
</tr>
</tbody></table>
<p>QA baseline ~$10–13/day → target ~$2–5/day. Pending 24h billing validation.</p>
<h2>Opening</h2>
<p>In the early days of building an AI-powered product, it is tempting to wire everything to your best model and ship.</p>
<p>The quality is great. Iteration is fast. The product feels smarter. Cost can feel like a future problem, especially when you are still at MVP scale.</p>
<p>Then you check the API bill.</p>
<p>In our QA environment, LLM costs were running around <strong>$10–13 per day</strong>. Not production. Not a high-traffic period. Just a testing environment with a handful of active users.</p>
<p>At first glance, nothing looked obviously broken. There was no runaway loop. No duplicate worker. No infinite retry storm. The infrastructure was behaving normally.</p>
<p>The problem was more structural:</p>
<blockquote>
<p>We had one expensive model acting as the default for almost every LLM task, and we had never priced the intent of each call.</p>
</blockquote>
<p>Chat, narrative composition, entity extraction, memory patching, story validation, <a href="https://dearartist.xyz/blog/glia-thread-debugging">thread routing</a>, profile summaries — everything was flowing through the same primary provider.</p>
<p>Some of those calls needed a high-quality narrative model. Most did not.</p>
<p>This post walks through how we audited 30 LLM callsites, introduced per-callsite model routing, added token observability, and reduced expected QA cost significantly without changing product logic or downgrading user-visible quality.</p>
<p>It is a real engineering note from building <a href="https://gliahq.com">Glia</a>. The specific numbers come from our QA environment during MVP development, but the pattern applies to any AI backend that runs several kinds of LLM task.</p>
<h2>1. The Product Context</h2>
<p><a href="https://gliahq.com">Glia</a> is a personal AI for reflection. Behind the conversational surface, several backend agents work together (chat, longer-form story composition, memory patching, people-card extraction, thread routing), and each one is an LLM call with its own quality and cost requirements. Among other things, the backend:</p>
<ul>
<li>extracts people, places, media, and other entities from messages</li>
<li>patches <a href="https://dearartist.xyz/blog/glia-actually-remembers">long-term memory</a></li>
<li>routes messages into <a href="https://dearartist.xyz/blog/glia-thread-debugging">topic threads</a></li>
<li>creates cards and relationship signals</li>
<li>periodically composes narrative stories from those threads</li>
<li>builds a <a href="https://dearartist.xyz/blog/glia-narrative-model-v1">user narrative model</a> used to personalize future responses</li>
</ul>
<p>The backend already had a unified LLM dispatch function:</p>
<pre><code class="language-python">result, meta = generate_text_with_failover(
    prompt=prompt,
    where=&quot;entity_extract&quot;,
    timeout_s=20.0,
    temperature=0.0,
    max_output_tokens=512,
    response_mime_type=&quot;application/json&quot;,
    context=ctx,
)
</code></pre>
<p>The important field here is <code>where</code>.</p>
<p>It was originally used for logging and error attribution. It told us which part of the system made the LLM call:</p>
<pre><code class="language-text">entity_extract
memory_patch
story_compose
theme_chapter_delta
entity_description
story_gate
cards_extract
</code></pre>
<p>But <code>where</code> did not control routing.</p>
<p>The routing was global.</p>
<pre><code class="language-bash">APP_PRIMARY_PROVIDER=anthropic
</code></pre>
<p>That meant almost every generic LLM call resolved to a Sonnet-class model first.</p>
<p>This was fine for chat and narrative composition. It was wasteful for structured extraction.</p>
<h2>2. The Real Problem: One Model for Every Job</h2>
<p>The issue was not simply &quot;too many LLM calls.&quot;</p>
<p>The issue was <strong>task-model mismatch</strong>.</p>
<p>A quality-critical narrative call and a deterministic JSON extraction call were using the same default model.</p>
<p>For example:</p>
<pre><code class="language-python"># narrative composition — quality matters
text, meta = generate_text_with_failover(
    prompt=compose_prompt,
    where=&quot;story_compose&quot;,
    timeout_s=55.0,
    temperature=0.7,
    max_output_tokens=2000,
    context=ctx,
)
</code></pre>
<p>This kind of task deserves a strong model. It is user-visible, tone-sensitive, and narrative-heavy.</p>
<p>But this was also using the same default provider:</p>
<pre><code class="language-python"># entity extraction — structured JSON
text, meta = generate_text_with_failover(
    prompt=extraction_prompt,
    where=&quot;entity_extract&quot;,
    timeout_s=20.0,
    temperature=0.0,
    max_output_tokens=512,
    response_mime_type=&quot;application/json&quot;,
    context=ctx,
)
</code></pre>
<p>This task is not asking the model to write beautifully. It is asking for structured extraction under a schema.</p>
<p>Same dispatcher. Same global model. Very different requirements.</p>
<p>That was the core cost bug.</p>
<h2>3. The Audit: 30 LLM Callsites</h2>
<p>The first step was not optimization. It was inventory.</p>
<p>We searched the codebase for every call to:</p>
<pre><code class="language-python">generate_text_with_failover(...)
</code></pre>
<p>Then we built a simple table.</p>
<p>For each callsite, we asked:</p>
<ul>
<li>What does this call do?</li>
<li>Is the output user-visible?</li>
<li>Is it free-form prose or structured JSON?</li>
<li>Does quality materially affect product experience?</li>
<li>Can failure be retried later?</li>
<li>Is this a background task?</li>
<li>Does this need a premium reasoning/writing model?</li>
</ul>
<p>That gave us 30 callsites.</p>
<p>They naturally collapsed into three tiers.</p>
<h2>4. Provider Tier Architecture</h2>
<p>We defined the target architecture as a tiered routing model.</p>
<p><strong>L1 — Narrative Generation (Sonnet-class model)</strong></p>
<ul>
<li>narrative_compose</li>
<li>collection_story_delta</li>
<li>user_context_model</li>
<li>timeline_summary</li>
<li>event_split</li>
<li>copyedit / polish / retry</li>
<li>onboarding_story</li>
</ul>
<p><strong>L2 — Structured Writing (Haiku-class model)</strong></p>
<ul>
<li>profile_summary generation</li>
</ul>
<p><strong>L3 — Extraction / Classification (Flash-class model)</strong></p>
<ul>
<li>entity_extract</li>
<li>entity_admission</li>
<li>memory_fact_extract</li>
<li>card_extract</li>
<li>quality_gate</li>
<li>thread_router</li>
<li>semantic_referee</li>
<li>soft_links</li>
<li>relationship_extract</li>
</ul>
<p><strong>Chat — independent path:</strong> <code>stream_chat</code> controlled by <code>APP_CHAT_PROVIDER</code>.</p>
<p>The design principle was simple:</p>
<blockquote>
<p><code>APP_PRIMARY_PROVIDER</code> should not be the business routing mechanism.</p>
</blockquote>
<p>Instead:</p>
<ul>
<li>quality-critical paths are explicitly pinned to the high-quality model</li>
<li>structured writing uses a cheaper writing-capable model</li>
<li>extraction/classification uses a cheaper fast model</li>
<li>global primary becomes a fallback/default, not a business decision</li>
</ul>
<p>This distinction matters. The full per-callsite dispatch — and the deployment sequence that has to follow it — is described in the next sections.</p>
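<p>One way to turn the audit questions into a tier assignment is a small record per callsite plus a rule. The rule below is a hypothetical simplification that uses only two of the audit questions; the field names and the <code>tier</code> function are assumptions, not the classification we actually ran:</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Callsite:
    where: str
    free_form_prose: bool   # prose output rather than schema-bound JSON
    quality_critical: bool  # output quality materially affects the product


def tier(site):
    """Map one audited callsite to a routing tier."""
    if site.free_form_prose and site.quality_critical:
        return "L1"  # narrative generation
    if site.free_form_prose:
        return "L2"  # structured writing
    return "L3"      # extraction / classification
</code></pre>
<p>Under this rule a narrative call such as <code>story_compose</code> lands in L1 and a JSON extraction call such as <code>entity_extract</code> lands in L3, which matches the tiers listed above.</p>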
<h2>5. The Routing Layer: Per-Callsite Dispatch</h2>
<p>We already had a label for each LLM call: <code>where</code>.</p>
<p>So we reused it.</p>
<p>The new convention:</p>
<pre><code class="language-bash">APP_LLM_PROVIDER_&lt;WHERE&gt;
APP_LLM_MODEL_&lt;WHERE&gt;
</code></pre>
<p>Examples:</p>
<pre><code class="language-bash">APP_LLM_PROVIDER_STORY_COMPOSE=anthropic
APP_LLM_PROVIDER_ENTITY_EXTRACT=gemini
APP_LLM_PROVIDER_MEMORY_PATCH=gemini
APP_LLM_PROVIDER_ENTITY_DESCRIPTION=anthropic
APP_LLM_MODEL_ENTITY_DESCRIPTION=claude-haiku-4-5-20251001
</code></pre>
<p>The <code>&lt;WHERE&gt;</code> suffix is derived from the <code>where</code> string: each run of non-alphanumeric characters becomes a single underscore, leading and trailing underscores are trimmed, and the result is uppercased.</p>
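<pre><code class="language-python">import os
import re

def _where_env_key(where: str) -&gt; str:
    # Collapse runs of non-alphanumerics to &quot;_&quot;, trim the ends, then uppercase.
    return re.sub(r&quot;[^a-zA-Z0-9]+&quot;, &quot;_&quot;, where).strip(&quot;_&quot;).upper()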

def _per_where_provider(where: str) -&gt; str | None:
    key = f&quot;APP_LLM_PROVIDER_{_where_env_key(where)}&quot;
    val = (os.getenv(key) or &quot;&quot;).strip().lower()
    return val if val in {&quot;anthropic&quot;, &quot;gemini&quot;, &quot;openai&quot;} else None

def _per_where_model(where: str) -&gt; str | None:
    key = f&quot;APP_LLM_MODEL_{_where_env_key(where)}&quot;
    val = (os.getenv(key) or &quot;&quot;).strip()
    return val or None
</code></pre>
<p>Examples:</p>
<table>
<thead>
<tr>
<th>where value</th>
<th>Provider env var</th>
</tr>
</thead>
<tbody><tr>
<td>story_compose</td>
<td>APP_LLM_PROVIDER_STORY_COMPOSE</td>
</tr>
<tr>
<td>entity_extract</td>
<td>APP_LLM_PROVIDER_ENTITY_EXTRACT</td>
</tr>
<tr>
<td>theme_chapter_delta</td>
<td>APP_LLM_PROVIDER_THEME_CHAPTER_DELTA</td>
</tr>
<tr>
<td>entity_description</td>
<td>APP_LLM_PROVIDER_ENTITY_DESCRIPTION + APP_LLM_MODEL_ENTITY_DESCRIPTION</td>
</tr>
</tbody></table>
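<p>The same normalization also handles labels that contain spaces, slashes, or dots. The snippet below repeats the helper so it runs on its own; the mixed-punctuation labels are made-up inputs for illustration, not real callsites.</p>
<pre><code class="language-python">import re


def _where_env_key(where: str) -&gt; str:
    # Collapse each run of non-alphanumerics into one underscore,
    # trim underscores at the ends, then uppercase.
    return re.sub(r&quot;[^a-zA-Z0-9]+&quot;, &quot;_&quot;, where).strip(&quot;_&quot;).upper()


# Illustrative inputs only.
for label in [&quot;story_compose&quot;, &quot;copyedit / polish&quot;, &quot;entity-extract.v2&quot;, &quot; chat &quot;]:
    print(f&quot;{label!r} -&gt; APP_LLM_PROVIDER_{_where_env_key(label)}&quot;)
</code></pre>
<p>One consequence worth knowing: distinct labels such as <code>a-b</code> and <code>a_b</code> normalize to the same key, so labels should differ by more than punctuation.</p>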
<p>The routing flow became:</p>
<ol>
<li><code>generate_text_with_failover(where=&#39;entity_extract&#39;)</code></li>
<li>Check <code>APP_LLM_PROVIDER_ENTITY_EXTRACT</code> — if set (gemini), use <code>[gemini, global_secondary]</code></li>
<li>If unset, check global failover circuit breaker</li>
<li>Circuit open → use global secondary; closed → use <code>APP_PRIMARY_PROVIDER</code></li>
<li>Call LLM</li>
<li>On per-callsite failure: log warning, fall through to secondary</li>
<li>On global-primary failure: update circuit breaker, fall through to secondary</li>
<li>Return text + metadata</li>
</ol>
<p>A simplified version of the dispatch logic looks like this:</p>
<pre><code class="language-python">def generate_text_with_failover(*, prompt, where=&quot;unknown&quot;, ...) -&gt; tuple[str, dict]:
    per_where_provider = _per_where_provider(where)
    per_where_model = _per_where_model(where)

    decision = choose_provider(kind=&quot;text&quot;, request_id=request_id)
    global_primary = decision.primary
    global_secondary = decision.secondary

    if per_where_provider:
        providers = [per_where_provider]
        if per_where_provider != global_secondary:
            providers.append(global_secondary)
    else:
        providers = [decision.provider]
        if decision.provider == global_primary:
            providers = [global_primary, global_secondary]

    for provider in providers:
        explicit_model = (
            per_where_model
            if per_where_provider and provider == per_where_provider
            else None
        )
        model, fallback_models = model_chain_from_env(
            provider=provider,
            explicit_primary=explicit_model,
        )
        try:
            text = generate_text_with_fallback(
                provider=provider,
                model=model,
                fallback_models=fallback_models,
                prompt=prompt,
                ...
            )
            if provider == global_primary:
                record_primary_success(request_id=request_id, where=where)
            return text, {
                &quot;provider&quot;: provider,
                &quot;model&quot;: model,
                &quot;per_where_override&quot;: per_where_provider is not None,
                &quot;latency_ms&quot;: int((time.perf_counter() - started_at) * 1000),
            }
        except Exception as exc:
            if provider == global_primary and not per_where_provider:
                if should_failover(exc):
                    record_primary_failure(request_id=request_id, where=where)
                continue
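            if per_where_provider and provider == per_where_provider:
                log.warning(
                    &quot;llm_per_where_primary_failed provider=%s where=%s&quot;,
                    provider,
                    where,
                )
                if provider == providers[-1]:
                    # No secondary left (override equals global secondary):
                    # re-raise instead of leaving the loop and returning None.
                    raise
                continue
            raise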
</code></pre>
<p>Two decisions mattered here.</p>
<h3>Circuit breaker isolation</h3>
<p>A Gemini failure on <code>entity_extract</code> should not globally trip failover for story composition. Per-callsite provider failures are logged, but they do not update the global Redis circuit breaker.</p>
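<p>To make the isolation rule concrete, here is a minimal in-memory sketch. It is an assumption-laden stand-in, not the real breaker (which lives in Redis): <code>CircuitBreaker</code>, <code>on_provider_failure</code>, and the threshold and cooldown values are all hypothetical.</p>
<pre><code class="language-python">import time


class CircuitBreaker:
    &quot;&quot;&quot;In-memory stand-in for the global breaker (hypothetical).&quot;&quot;&quot;

    def __init__(self, threshold: int = 3, cooldown_s: float = 60.0) -&gt; None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp while open, else None

    def record_failure(self) -&gt; None:
        self.failures += 1
        if self.failures &gt;= self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -&gt; None:
        self.failures = 0
        self.opened_at = None

    def is_open(self) -&gt; bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at &gt;= self.cooldown_s:
            # Cooldown elapsed: close the breaker and start counting again.
            self.record_success()
            return False
        return True


def on_provider_failure(breaker: CircuitBreaker, *, per_where_override: bool) -&gt; None:
    # Only failures on the global-primary path count toward the breaker.
    if not per_where_override:
        breaker.record_failure()


breaker = CircuitBreaker(threshold=2)
for _ in range(5):
    on_provider_failure(breaker, per_where_override=True)  # pinned callsite failing
print(breaker.is_open())
on_provider_failure(breaker, per_where_override=False)
on_provider_failure(breaker, per_where_override=False)
print(breaker.is_open())
</code></pre>
<p>Five failures on a pinned callsite leave the breaker closed; two failures on the global-primary path open it.</p>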
<h3>Secondary provider fallback</h3>
<p>L3 tasks are Gemini-first, not necessarily Gemini-only. If Gemini fails, the call can still fall back to the global secondary provider. This preserves reliability. If a product wants strict cost caps later, this can be tightened per callsite.</p>
<h2>6. Supporting Optimization 1: Independent Chat Provider</h2>
<p>Chat streaming used a separate path, not the batch <code>generate_text_with_failover</code> path.</p>
<p>That meant it needed its own provider pin:</p>
<pre><code class="language-python">chosen = (
    llm_provider
    or os.getenv(&quot;APP_CHAT_PROVIDER&quot;)
    or &quot;&quot;
).strip().lower() or None
</code></pre>
<p>Then the environment can say:</p>
<pre><code class="language-bash">APP_CHAT_PROVIDER=anthropic
APP_CHAT_MAX_OUTPUT_TOKENS=2048
</code></pre>
<p>This keeps chat on the high-quality model even if:</p>
<pre><code class="language-bash">APP_PRIMARY_PROVIDER=gemini
</code></pre>
<p>That separation is important. Chat is product-critical. It should not accidentally follow a global background-job optimization.</p>
<h2>7. Supporting Optimization 2: Prompt Caching</h2>
<p>Every chat request included a large system prompt: product voice, behavioral contract, memory instructions, and response policy.</p>
<p>A big part of that prompt was stable across consecutive turns. So we enabled Anthropic prompt caching for the system prompt:</p>
<pre><code class="language-python">if system_prompt:
    kwargs[&quot;system&quot;] = [{
        &quot;type&quot;: &quot;text&quot;,
        &quot;text&quot;: system_prompt,
        &quot;cache_control&quot;: {&quot;type&quot;: &quot;ephemeral&quot;},
    }]
</code></pre>
<p>The point is not to assume caching always works.</p>
<p>The cache is ephemeral and TTL-bound. Long idle periods, changing prompt prefixes, or moving cache boundaries can all reduce hit rate.</p>
<p>So we also logged:</p>
<pre><code class="language-text">cache_creation_input_tokens
cache_read_input_tokens
</code></pre>
<blockquote>
<p>Do not assume prompt caching savings. Measure them.</p>
</blockquote>
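<p>One way to turn those two counters into a number is to compute the share of prompt tokens served from cache. The sketch below assumes the three input counters in each logged record are disjoint, so their sum is the full prompt size; <code>cache_read_share</code> is a hypothetical helper and the sample figures are invented.</p>
<pre><code class="language-python">def cache_read_share(records: list[dict]) -&gt; float:
    &quot;&quot;&quot;Fraction of all prompt tokens that were served from cache.&quot;&quot;&quot;
    read = created = uncached = 0
    for rec in records:
        # Missing or None fields count as zero.
        read += rec.get(&quot;cache_read_input_tokens&quot;) or 0
        created += rec.get(&quot;cache_creation_input_tokens&quot;) or 0
        uncached += rec.get(&quot;input_tokens&quot;) or 0
    total = read + created + uncached
    return read / total if total else 0.0


# Made-up numbers: the first turn writes the cache, the next two read it.
turns = [
    {&quot;input_tokens&quot;: 200, &quot;cache_creation_input_tokens&quot;: 3000, &quot;cache_read_input_tokens&quot;: 0},
    {&quot;input_tokens&quot;: 250, &quot;cache_creation_input_tokens&quot;: 0, &quot;cache_read_input_tokens&quot;: 3000},
    {&quot;input_tokens&quot;: 300, &quot;cache_creation_input_tokens&quot;: None, &quot;cache_read_input_tokens&quot;: 3000},
]
print(f&quot;{cache_read_share(turns):.2%}&quot;)
</code></pre>
<p>A share that stays near zero after the first turn of a session suggests the cached prefix is changing between requests or the cache is expiring between turns.</p>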
<h2>8. Supporting Optimization 3: Token Usage Logging</h2>
<p>Before this refactor, we had logs that said an LLM call happened. But we did not have reliable token usage per callsite, so cost analysis was guesswork.</p>
<p>We added token logging on both paths.</p>
<h3>Sync path</h3>
<pre><code class="language-python">resp = client.messages.create(
    model=model,
    messages=messages,
    ...
)
usage = getattr(resp, &quot;usage&quot;, None)
if usage:
    log.info(&quot;llm_usage&quot;, extra={
        &quot;provider&quot;: &quot;anthropic&quot;,
        &quot;model&quot;: model,
        &quot;input_tokens&quot;: getattr(usage, &quot;input_tokens&quot;, None),
        &quot;output_tokens&quot;: getattr(usage, &quot;output_tokens&quot;, None),
        &quot;cache_creation_input_tokens&quot;: getattr(
            usage,
            &quot;cache_creation_input_tokens&quot;,
            None,
        ),
        &quot;cache_read_input_tokens&quot;: getattr(
            usage,
            &quot;cache_read_input_tokens&quot;,
            None,
        ),
    })
</code></pre>
<h3>Streaming path</h3>
<p>Chat streaming is different. Usage data arrives across stream events.</p>
<pre><code class="language-python">input_tokens = None
output_tokens = None
cache_creation_input_tokens = None
cache_read_input_tokens = None

async for event in stream:
    if event.type == &quot;message_start&quot;:
        usage = getattr(getattr(event, &quot;message&quot;, None), &quot;usage&quot;, None)
        if usage:
            input_tokens = getattr(usage, &quot;input_tokens&quot;, None)
            cache_creation_input_tokens = getattr(
                usage,
                &quot;cache_creation_input_tokens&quot;,
                None,
            )
            cache_read_input_tokens = getattr(
                usage,
                &quot;cache_read_input_tokens&quot;,
                None,
            )
    elif event.type == &quot;content_block_delta&quot; and hasattr(event.delta, &quot;text&quot;):
        text = event.delta.text
        if text:
            yield TextDelta(type=&quot;text_delta&quot;, text=text)
    elif event.type == &quot;message_delta&quot;:
        usage = getattr(event, &quot;usage&quot;, None)
        if usage:
            output_tokens = getattr(usage, &quot;output_tokens&quot;, None)

log.info(&quot;llm_usage_stream&quot;, extra={
    &quot;provider&quot;: &quot;anthropic&quot;,
    &quot;model&quot;: cfg.model,
    &quot;input_tokens&quot;: input_tokens,
    &quot;output_tokens&quot;: output_tokens,
    &quot;cache_creation_input_tokens&quot;: cache_creation_input_tokens,
    &quot;cache_read_input_tokens&quot;: cache_read_input_tokens,
})
</code></pre>
<p>This made the biggest cost path — chat — observable.</p>
<h2>9. Supporting Optimization 4: Configurable Output Caps</h2>
<p>Two output caps were too large by default:</p>
<pre><code class="language-python"># Before
max_tokens = 4096              # chat
max_output_tokens = 8192       # narrative model
</code></pre>
<p>Those values were not based on observed output length. They were &quot;safe&quot; defaults.</p>
<p>Safe defaults can be expensive defaults.</p>
<p>We made both configurable:</p>
<pre><code class="language-python">max_tokens = cfg.max_output_tokens or int(
    os.getenv(&quot;APP_CHAT_MAX_OUTPUT_TOKENS&quot;, &quot;2048&quot;)
)

max_output_tokens = int(
    os.getenv(&quot;APP_NARRATIVE_MODEL_MAX_OUTPUT_TOKENS&quot;, &quot;4096&quot;)
)
</code></pre>
<p>In QA, we used:</p>
<pre><code class="language-bash">APP_CHAT_MAX_OUTPUT_TOKENS=2048
APP_NARRATIVE_MODEL_MAX_OUTPUT_TOKENS=3000
</code></pre>
<p>Chat responses rarely needed 4096 tokens. The narrative model output fit comfortably within the lower cap.</p>
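<p>Once output tokens are logged, a cap can be derived from observed lengths instead of picked by feel. The following is a sketch of one possible approach, not how the values above were chosen: <code>suggest_cap</code>, the 99th-percentile choice, and the 25% headroom are all assumptions, and the sample data is invented.</p>
<pre><code class="language-python">import math


def suggest_cap(output_token_counts: list[int], quantile: float = 0.99,
                headroom: float = 1.25) -&gt; int:
    &quot;&quot;&quot;Nearest-rank quantile of observed output lengths, padded by a headroom factor.&quot;&quot;&quot;
    if not output_token_counts:
        raise ValueError(&quot;need at least one observation&quot;)
    ordered = sorted(output_token_counts)
    rank = max(1, math.ceil(quantile * len(ordered)))  # nearest-rank, 1-indexed
    return math.ceil(ordered[rank - 1] * headroom)


# Made-up sample of chat output lengths, in tokens.
observed = [180, 220, 240, 310, 350, 420, 515, 640, 900, 1400]
print(suggest_cap(observed))
</code></pre>
<p>With real logs, the same function can be run per callsite, so that each cap reflects what that task actually produces.</p>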
<h2>10. Supporting Optimization 5: Disable a Broken Feature</h2>
<p>One feature was enabled but failing on every call:</p>
<pre><code class="language-bash">APP_MEMORY_READ_PLAN_LLM_ENABLED=1
</code></pre>
<p>It was not a major billing driver because failed requests were rejected, but it added latency and noise to every chat turn.</p>
<p>We disabled it:</p>
<pre><code class="language-bash">APP_MEMORY_READ_PLAN_LLM_ENABLED=0
</code></pre>
<h2>11. The Fallback Escalation Trap</h2>
<p>This was the easiest bug to miss.</p>
<p>We wanted profile summary generation to use a Haiku-class model:</p>
<pre><code class="language-bash">APP_LLM_PROVIDER_ENTITY_DESCRIPTION=anthropic
APP_LLM_MODEL_ENTITY_DESCRIPTION=claude-haiku-4-5-20251001
</code></pre>
<p>But what happens when Haiku fails?</p>
<p>That depends on <code>APP_ANTHROPIC_MODEL_FALLBACKS</code>.</p>
<p>If fallbacks include a Sonnet-class model, then the call silently escalates:</p>
<pre><code class="language-text">Haiku fails → Sonnet fallback → successful response → larger bill
</code></pre>
<p>That defeats the point of routing the call to Haiku.</p>
<p>The safe case is:</p>
<pre><code class="language-bash">APP_ANTHROPIC_MODEL_FALLBACKS=claude-haiku-4-5-20251001
</code></pre>
<p>The model chain resolver filters fallback models that equal the primary model:</p>
<pre><code class="language-text">primary   = claude-haiku-4-5-20251001
fallbacks = [claude-haiku-4-5-20251001]

after filtering:
fallbacks = []
</code></pre>
<p>So if Haiku fails, the background task fails rather than escalating to Sonnet.</p>
<p>For this specific task, that is acceptable. A profile summary can retry later. Unexpected premium-model spend is worse.</p>
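<p>The resolver itself is not shown in this note. A minimal sketch of the filtering behavior, assuming the fallback variable holds a comma-separated list, might look like the following. <code>resolve_model_chain</code> is a hypothetical stand-in, not the actual <code>model_chain_from_env</code>, and it takes the fallback string as an argument instead of reading the environment.</p>
<pre><code class="language-python">def resolve_model_chain(primary: str, fallbacks_csv: str) -&gt; tuple[str, list[str]]:
    &quot;&quot;&quot;Return (primary, fallbacks) with the primary and duplicates filtered out.&quot;&quot;&quot;
    seen = {primary}
    fallbacks = []
    for name in fallbacks_csv.split(&quot;,&quot;):
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            fallbacks.append(name)
    return primary, fallbacks


haiku = &quot;claude-haiku-4-5-20251001&quot;
sonnet = &quot;claude-sonnet-4-20250514&quot;

print(resolve_model_chain(haiku, haiku))                # safe: nothing left to escalate to
print(resolve_model_chain(haiku, f&quot;{sonnet},{haiku}&quot;))  # risky: Sonnet stays in the chain
</code></pre>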
<p>We added tests for both cases:</p>
<pre><code class="language-python">def test_entity_description_no_sonnet_escalation(monkeypatch):
    monkeypatch.setenv(
        &quot;APP_ANTHROPIC_MODEL_FALLBACKS&quot;,
        &quot;claude-haiku-4-5-20251001&quot;,
    )
    primary, fallbacks = model_chain_from_env(
        provider=&quot;anthropic&quot;,
        explicit_primary=&quot;claude-haiku-4-5-20251001&quot;,
    )
    assert primary == &quot;claude-haiku-4-5-20251001&quot;
    assert fallbacks == []
</code></pre>
<p>And the risky case:</p>
<pre><code class="language-python">def test_entity_description_sonnet_escalation_risk_documented(monkeypatch):
    monkeypatch.setenv(
        &quot;APP_ANTHROPIC_MODEL_FALLBACKS&quot;,
        &quot;claude-sonnet-4-20250514&quot;,
    )
    primary, fallbacks = model_chain_from_env(
        provider=&quot;anthropic&quot;,
        explicit_primary=&quot;claude-haiku-4-5-20251001&quot;,
    )
    assert primary == &quot;claude-haiku-4-5-20251001&quot;
    assert &quot;claude-sonnet-4-20250514&quot; in fallbacks
</code></pre>
<blockquote>
<p>When you introduce model tiers, trace the entire fallback chain. A fallback configuration that was safe for a uniform-model setup can create expensive surprises in a tiered setup.</p>
</blockquote>
<h2>12. Deployment: Why Order Matters</h2>
<p>The dangerous step was not adding the routing code.</p>
<p>The dangerous step was flipping:</p>
<pre><code class="language-bash">APP_PRIMARY_PROVIDER=gemini
</code></pre>
<p>If you do that before pinning quality-critical paths, story generation and theme generation can silently move to a cheaper model.</p>
<p>That is the wrong kind of cost optimization.</p>
<p>The deployment sequence mattered.</p>
<p><strong>Deployment order:</strong></p>
<ol>
<li>Deploy routing code</li>
<li>Set memory/narrative config</li>
<li>Pin L1 quality paths to Sonnet</li>
<li>Pin profile summaries to Haiku</li>
<li>Pin extraction to Flash</li>
<li>Verify env inside containers</li>
<li>Flip <code>APP_PRIMARY_PROVIDER=gemini</code></li>
<li>Recreate containers</li>
<li>Verify health and logs</li>
</ol>
<p>The final routing looked like this:</p>
<pre><code class="language-text">Chat                         → Anthropic / Sonnet-class
Story generation             → Anthropic / Sonnet-class
Theme generation             → Anthropic / Sonnet-class
Narrative model              → Anthropic / Sonnet-class
Onboarding story             → Anthropic / Sonnet-class
Profile summaries            → Anthropic / Haiku-class
Entity extraction            → Gemini    / Flash-class
Memory patching              → Gemini    / Flash-class
Card extraction              → Gemini    / Flash-class
Story gates                  → Gemini    / Flash-class
Thread routing               → Gemini    / Flash-class
Semantic referee             → Gemini    / Flash-class
Soft links                   → Gemini    / Flash-class
Global primary fallback      → Gemini
Global secondary fallback    → Anthropic
</code></pre>
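<p>Step 6 of the deployment order lends itself to a small preflight check. This is a sketch under assumptions: <code>REQUIRED_PINS</code> is a hypothetical, abbreviated list built from the env examples in this post, and <code>pin_problems</code> is an illustrative helper.</p>
<pre><code class="language-python"># Hypothetical, abbreviated list of pins that must exist before the global flip.
REQUIRED_PINS = {
    &quot;APP_CHAT_PROVIDER&quot;: &quot;anthropic&quot;,
    &quot;APP_LLM_PROVIDER_STORY_COMPOSE&quot;: &quot;anthropic&quot;,
    &quot;APP_LLM_PROVIDER_ENTITY_DESCRIPTION&quot;: &quot;anthropic&quot;,
    &quot;APP_LLM_PROVIDER_ENTITY_EXTRACT&quot;: &quot;gemini&quot;,
}


def pin_problems(env: dict[str, str], required: dict[str, str] = REQUIRED_PINS) -&gt; list[str]:
    &quot;&quot;&quot;List every required pin that is missing or set to an unexpected value.&quot;&quot;&quot;
    problems = []
    for key, expected in required.items():
        actual = (env.get(key) or &quot;&quot;).strip().lower()
        if actual != expected:
            problems.append(f&quot;{key}: expected {expected!r}, got {actual or &#39;unset&#39;!r}&quot;)
    return problems


# Sample environment with two pins missing.
sample_env = {
    &quot;APP_CHAT_PROVIDER&quot;: &quot;anthropic&quot;,
    &quot;APP_LLM_PROVIDER_STORY_COMPOSE&quot;: &quot;Anthropic &quot;,
}
for line in pin_problems(sample_env):
    print(line)
</code></pre>
<p>Run inside the container with <code>dict(os.environ)</code> as the input, an empty result is the signal that it is safe to flip <code>APP_PRIMARY_PROVIDER</code>.</p>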
<h2>13. Docker Compose: .env Is Not Enough</h2>
<p>A deployment detail caught us.</p>
<p>We added all the new variables to <code>.env</code>, restarted containers, and expected them to appear.</p>
<p>They did not.</p>
<p>Why?</p>
<p>Because Docker Compose only injects variables into a container if they are referenced in the service&#39;s <code>environment:</code> block or passed through an <code>env_file</code>.</p>
<p>The <code>.env</code> file alone is used for Compose interpolation. It does not automatically expose every variable to the container.</p>
<p>So this was not enough:</p>
<pre><code class="language-bash"># .env
APP_LLM_PROVIDER_ENTITY_EXTRACT=gemini
APP_LLM_PROVIDER_STORY_COMPOSE=anthropic
</code></pre>
<p>We also had to update <code>docker-compose.yml</code>:</p>
<pre><code class="language-yaml">services:
  api:
    environment:
      - APP_PRIMARY_PROVIDER=${APP_PRIMARY_PROVIDER:-anthropic}
      - APP_CHAT_PROVIDER=${APP_CHAT_PROVIDER:-}
      - APP_CHAT_MAX_OUTPUT_TOKENS=${APP_CHAT_MAX_OUTPUT_TOKENS:-2048}
      - APP_LLM_PROVIDER_STORY_COMPOSE=${APP_LLM_PROVIDER_STORY_COMPOSE:-}
      - APP_LLM_PROVIDER_ENTITY_EXTRACT=${APP_LLM_PROVIDER_ENTITY_EXTRACT:-}
      - APP_LLM_PROVIDER_ENTITY_DESCRIPTION=${APP_LLM_PROVIDER_ENTITY_DESCRIPTION:-}
      - APP_LLM_MODEL_ENTITY_DESCRIPTION=${APP_LLM_MODEL_ENTITY_DESCRIPTION:-}
      # ... repeated for each callsite

  worker:
    environment:
      - APP_PRIMARY_PROVIDER=${APP_PRIMARY_PROVIDER:-anthropic}
      - APP_CHAT_PROVIDER=${APP_CHAT_PROVIDER:-}
      - APP_CHAT_MAX_OUTPUT_TOKENS=${APP_CHAT_MAX_OUTPUT_TOKENS:-2048}
      - APP_LLM_PROVIDER_STORY_COMPOSE=${APP_LLM_PROVIDER_STORY_COMPOSE:-}
      - APP_LLM_PROVIDER_ENTITY_EXTRACT=${APP_LLM_PROVIDER_ENTITY_EXTRACT:-}
      - APP_LLM_PROVIDER_ENTITY_DESCRIPTION=${APP_LLM_PROVIDER_ENTITY_DESCRIPTION:-}
      - APP_LLM_MODEL_ENTITY_DESCRIPTION=${APP_LLM_MODEL_ENTITY_DESCRIPTION:-}
      # ... repeated for each callsite
</code></pre>
<p>We first made this change directly on the QA server to unblock deployment.</p>
<p>Then we caught the problem: the server had a modified <code>docker-compose.yml</code> that was not committed to the repo.</p>
<p>That is configuration drift.</p>
<p>We moved the change back into the repository, committed it, pushed it, pulled it on QA, and recreated the containers from the committed Compose file.</p>
<p>The lesson:</p>
<blockquote>
<p>Infrastructure state that exists only on the server is an uncommitted change waiting to cause an incident.</p>
</blockquote>
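<p>A cheap guard against the passthrough problem is to compare the names assigned in <code>.env</code> with the names the Compose file interpolates. The sketch below is deliberately text-based and makes assumptions: it does not parse YAML, it ignores <code>env_file</code>, and it does not distinguish between services. The helper names are hypothetical.</p>
<pre><code class="language-python">import re


def env_keys(dotenv_text: str) -&gt; set[str]:
    &quot;&quot;&quot;Variable names assigned in a .env file (comments and blanks skipped).&quot;&quot;&quot;
    keys = set()
    for line in dotenv_text.splitlines():
        line = line.strip()
        if line and not line.startswith(&quot;#&quot;) and &quot;=&quot; in line:
            keys.add(line.split(&quot;=&quot;, 1)[0].strip())
    return keys


def interpolated_keys(compose_text: str) -&gt; set[str]:
    &quot;&quot;&quot;Names referenced as ${NAME} or ${NAME:-default} in a Compose file.&quot;&quot;&quot;
    return set(re.findall(r&quot;\$\{([A-Za-z_][A-Za-z0-9_]*)&quot;, compose_text))


def not_passed_through(dotenv_text: str, compose_text: str) -&gt; set[str]:
    &quot;&quot;&quot;Names set in .env that the Compose file never references.&quot;&quot;&quot;
    return env_keys(dotenv_text) - interpolated_keys(compose_text)


DOTENV = &quot;&quot;&quot;# .env
APP_LLM_PROVIDER_ENTITY_EXTRACT=gemini
APP_LLM_PROVIDER_STORY_COMPOSE=anthropic
&quot;&quot;&quot;

COMPOSE = &quot;&quot;&quot;services:
  api:
    environment:
      - APP_LLM_PROVIDER_STORY_COMPOSE=${APP_LLM_PROVIDER_STORY_COMPOSE:-}
&quot;&quot;&quot;

print(sorted(not_passed_through(DOTENV, COMPOSE)))
</code></pre>
<p>A check like this fits naturally in CI, where it runs against the committed Compose file and not whatever happens to be on the server.</p>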
<h2>14. What We Did Not Change</h2>
<p>This was a cost refactor, not a product behavior refactor.</p>
<p>We did not:</p>
<ul>
<li>change product prompts</li>
<li>lower chat quality</li>
<li>lower story generation quality</li>
<li>remove failover</li>
<li>ask business logic to know about model providers</li>
<li>make extraction tasks hard-fail on the first provider error</li>
<li>change user-facing memory or story logic</li>
<li>flip the global provider until quality-sensitive callsites were pinned</li>
</ul>
<p>That constraint mattered.</p>
<p>The goal was not &quot;use cheaper models everywhere.&quot;</p>
<p>The goal was:</p>
<blockquote>
<p>Use the expensive model where it buys product quality, and stop using it where it does not.</p>
</blockquote>
<h2>15. Tests</h2>
<p>We added tests for:</p>
<ul>
<li>chat provider pinning</li>
<li>chat max token env override</li>
<li>narrative model max token env override</li>
<li>prompt caching system prompt format</li>
<li>sync token usage logging</li>
<li>streaming token usage logging</li>
<li>per-callsite provider override</li>
<li>per-callsite model override</li>
<li>invalid provider override fallback</li>
<li>no-override behavior preserving global provider</li>
<li>Haiku profile summary fallback safety</li>
<li>Sonnet escalation risk if fallback env is misconfigured</li>
</ul>
<p>A few examples:</p>
<pre><code class="language-python">def test_per_where_routes_entity_extract_to_gemini(monkeypatch):
    monkeypatch.setenv(&quot;APP_PRIMARY_PROVIDER&quot;, &quot;anthropic&quot;)
    monkeypatch.setenv(&quot;APP_LLM_PROVIDER_ENTITY_EXTRACT&quot;, &quot;gemini&quot;)

    captured = []

    def fake_generate_text_with_fallback(*, provider, **kwargs):
        captured.append(provider)
        return &quot;ok&quot;

    monkeypatch.setattr(
        failover,
        &quot;generate_text_with_fallback&quot;,
        fake_generate_text_with_fallback,
    )

    text, meta = generate_text_with_failover(
        prompt=&quot;extract entities&quot;,
        where=&quot;entity_extract&quot;,
        timeout_s=10,
        temperature=0,
        max_output_tokens=512,
    )

    assert text == &quot;ok&quot;
    assert captured[0] == &quot;gemini&quot;
    assert meta[&quot;per_where_override&quot;] is True
</code></pre>
<pre><code class="language-python">def test_per_where_routes_story_compose_to_anthropic(monkeypatch):
    monkeypatch.setenv(&quot;APP_PRIMARY_PROVIDER&quot;, &quot;gemini&quot;)
    monkeypatch.setenv(&quot;APP_LLM_PROVIDER_STORY_COMPOSE&quot;, &quot;anthropic&quot;)

    captured = []

    def fake_generate_text_with_fallback(*, provider, **kwargs):
        captured.append(provider)
        return &quot;ok&quot;

    monkeypatch.setattr(
        failover,
        &quot;generate_text_with_fallback&quot;,
        fake_generate_text_with_fallback,
    )

    text, meta = generate_text_with_failover(
        prompt=&quot;write a story&quot;,
        where=&quot;story_compose&quot;,
        timeout_s=30,
        temperature=0.7,
        max_output_tokens=2000,
    )

    assert text == &quot;ok&quot;
    assert captured[0] == &quot;anthropic&quot;
    assert meta[&quot;per_where_override&quot;] is True
</code></pre>
<p>The test suite gave us confidence that changing the global primary provider would not accidentally move story generation to the cheaper tier.</p>
<h2>16. Results</h2>
<p>We estimated the cost impact before deployment, then set up logging so the estimates could be validated afterwards.</p>
<table>
<thead>
<tr>
<th>Change</th>
<th>Est. daily saving</th>
</tr>
</thead>
<tbody><tr>
<td>13 extraction/classification callsites to Flash-class model</td>
<td>~$3–5</td>
</tr>
<tr>
<td>Profile summaries to Haiku-class model</td>
<td>~$0.5–1</td>
</tr>
<tr>
<td>Chat max output tokens 4096 → 2048</td>
<td>~$0.5–1</td>
</tr>
<tr>
<td>Narrative model token reduction</td>
<td>~$0.4</td>
</tr>
<tr>
<td>Prompt caching when cache is warm</td>
<td>~$0.3–2</td>
</tr>
<tr>
<td>Disable broken memory read plan</td>
<td>latency only</td>
</tr>
<tr>
<td><strong>Total expected saving</strong></td>
<td><strong>~$4.7–9.4/day</strong></td>
</tr>
</tbody></table>
<p>QA had been around $10–13/day.</p>
<p>The expected target after deployment was $2–5/day, pending 24-hour billing and log validation.</p>
<p>The important part is that we were no longer guessing.</p>
<p>With token logs and per-callsite metadata, we could now answer:</p>
<ul>
<li>Which callsite spent the most?</li>
<li>Which provider handled it?</li>
<li>Which model handled it?</li>
<li>How many input tokens?</li>
<li>How many output tokens?</li>
<li>Did prompt caching hit?</li>
<li>Did a cheap model fall back to an expensive one?</li>
</ul>
<p>That is the difference between &quot;the bill went up&quot; and &quot;this specific callsite is expensive.&quot;</p>
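<p>As a sketch of how those questions can be answered from logs, the records can be grouped by callsite and priced. Everything here is illustrative: the model names and per-million-token prices in <code>PRICES</code> are placeholders, not real rates, the records are invented, and cache-token pricing is ignored for brevity.</p>
<pre><code class="language-python">from collections import defaultdict

# Placeholder prices in USD per million tokens; substitute your providers&#39; real rates.
PRICES = {
    &quot;premium-model&quot;: {&quot;input&quot;: 3.00, &quot;output&quot;: 15.00},
    &quot;cheap-model&quot;: {&quot;input&quot;: 0.10, &quot;output&quot;: 0.40},
}


def cost_by_callsite(records: list[dict]) -&gt; dict[str, float]:
    &quot;&quot;&quot;Sum estimated spend per `where` label from usage log records.&quot;&quot;&quot;
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        price = PRICES[rec[&quot;model&quot;]]
        totals[rec[&quot;where&quot;]] += (
            (rec.get(&quot;input_tokens&quot;) or 0) * price[&quot;input&quot;]
            + (rec.get(&quot;output_tokens&quot;) or 0) * price[&quot;output&quot;]
        ) / 1_000_000
    return dict(totals)


records = [
    {&quot;where&quot;: &quot;story_compose&quot;, &quot;model&quot;: &quot;premium-model&quot;, &quot;input_tokens&quot;: 4000, &quot;output_tokens&quot;: 1200},
    {&quot;where&quot;: &quot;entity_extract&quot;, &quot;model&quot;: &quot;cheap-model&quot;, &quot;input_tokens&quot;: 3000, &quot;output_tokens&quot;: 300},
    # Same callsite after a fallback to the premium model.
    {&quot;where&quot;: &quot;entity_extract&quot;, &quot;model&quot;: &quot;premium-model&quot;, &quot;input_tokens&quot;: 3000, &quot;output_tokens&quot;: 300},
]
for where, usd in sorted(cost_by_callsite(records).items(), key=lambda kv: -kv[1]):
    print(f&quot;{where:16s} ${usd:.4f}&quot;)
</code></pre>
<p>In this made-up data, a single fallback to the premium model accounts for almost all of the extraction spend, which is the kind of pattern the per-callsite view is meant to expose.</p>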
<h2>17. The Reusable Pattern</h2>
<p>This pattern applies to any AI backend with multiple LLM tasks.</p>
<p><strong>1. Label every LLM call.</strong> Every call needs a stable identifier such as <code>where=&quot;entity_extract&quot;</code>, <code>where=&quot;story_compose&quot;</code>, <code>where=&quot;memory_patch&quot;</code>. If you already have labels for logging, reuse them for routing.</p>
<p><strong>2. Classify by task type.</strong> Do not classify by code ownership. Classify by output requirement: narrative generation → premium writing model; structured writing → lightweight writing model; extraction/classification → fast cheap model.</p>
<p><strong>3. Separate routing from business logic.</strong> Business code should not decide providers. The caller says <code>where=&quot;entity_extract&quot;</code>; the dispatcher decides <code>entity_extract → Gemini Flash</code>.</p>
<p><strong>4. Pin quality paths before changing defaults.</strong> Never flip the global provider until user-visible paths are explicitly pinned. Global defaults are dangerous when the product has mixed call types.</p>
<p><strong>5. Make token caps configurable.</strong> Hardcoded max tokens are almost always too high. Configurable caps let you tune without redeploying code.</p>
<p><strong>6. Log token usage on every path.</strong> Sync calls and streaming calls behave differently. Instrument both.</p>
<p><strong>7. Trace fallback chains.</strong> A &quot;cheap&quot; callsite can silently become expensive if fallback models escalate. Always test the fallback chain.</p>
<p><strong>8. Verify container env, not just .env.</strong> This one is boring and important. The variable in <code>.env</code> does not matter if the container never receives it.</p>
<h2>18. LLM Cost Audit Checklist</h2>
<p>Here is the checklist I would use next time:</p>
<ol>
<li>List every LLM callsite.</li>
<li>Add or reuse a stable callsite label.</li>
<li>Record provider, model, latency, input tokens, output tokens.</li>
<li>Separate sync and streaming usage logging.</li>
<li>Classify each callsite by output type and user visibility.</li>
<li>Identify quality-critical paths.</li>
<li>Pin quality-critical paths explicitly.</li>
<li>Route extraction/classification to cheaper models.</li>
<li>Route structured writing to a lightweight writing model.</li>
<li>Make max output tokens configurable.</li>
<li>Add prompt caching where prompt prefixes are stable.</li>
<li>Log cache creation and cache read tokens.</li>
<li>Trace fallback chains for expensive escalation.</li>
<li>Verify environment variables inside running containers.</li>
<li>Treat Compose/Kubernetes/env passthrough as source code.</li>
<li>Monitor provider distribution after deploy.</li>
<li>Compare expected savings against 24-hour billing data.</li>
</ol>
<h2>19. Final Takeaway</h2>
<p>Early-stage AI products are usually right to optimize for iteration speed first.</p>
<p>But some cost is not a trade-off.</p>
<p>Routing a structured JSON extraction task through your best narrative model is not buying you quality. It is buying you a larger API bill.</p>
<p>The fix was not a giant rewrite. It was mostly:</p>
<ul>
<li>audit every callsite</li>
<li>label every call</li>
<li>route by task complexity</li>
<li>pin quality paths</li>
<li>add token observability</li>
<li>deploy carefully</li>
</ul>
<p>The hard part was not the code.</p>
<p>The hard part was admitting that &quot;one model for everything&quot; had quietly become part of the architecture.</p>
<p>Once every LLM call had a label, the cost model became visible.</p>
<p>And once the cost model became visible, the refactor was obvious.</p>
<p>For <a href="https://dearartist.xyz/blog/glia-token-cost">a complementary cost pass on AI-assisted development</a> — trimming fixed prompt noise and reducing file-load granularity — see the companion note.</p>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/label-every-llm-call-ai-backend-cost-audit">https://dearartist.xyz/blog/label-every-llm-call-ai-backend-cost-audit</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Glia Narrative Model: Turning AI Memory Into Continuity</title>
      <link>https://dearartist.xyz/blog/glia-narrative-model-v1</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-narrative-model-v1</guid>
      <pubDate>Fri, 24 Apr 2026 10:00:00 GMT</pubDate>
      <description><![CDATA[A structured, cautious memory layer that helps Glia remember the user as a person-in-progress — without dumping raw transcript history back into the product experience.]]></description>
      <category>glia</category>
      <category>ai-memory</category>
      <category>product-design</category>
      <content:encoded><![CDATA[<p>A structured, cautious memory layer that helps Glia remember the user as a person-in-progress — without dumping raw transcript history back into the product experience.</p>
<p><img src="https://dearartist.xyz/substack-visuals/glia-narrative-model-v1/glia-narrative-model-v1-hero-three-nodes-card.png" alt="Three-node Narrative Model diagram: Conversation evidence → Narrative Model → Product context" /></p>
<p><em>Evidence becomes structured continuity, then surfaces as gentle context.</em></p>
<p>A user opens <a href="https://gliahq.com/">Glia</a> for the third time this month and types, <em>&quot;I don&#39;t know why I&#39;m so tired this week.&quot;</em> Without memory, the assistant answers competently but generically — sleep, workload, stress. The reply is plausible. It just doesn&#39;t sound like it remembers anything about the person on the other side of the screen.</p>
<p>Narrative Model is the layer we built so Glia can respond from a more situated place — not by replaying old transcripts, and not by claiming to know the user, but by maintaining a small, structured, inspectable sketch of who the user appears to be becoming.</p>
<p>It is not a profile. It is not a retrieval index. It is a narrow memory layer that distills conversation evidence into a typed, validated representation the product can reason about.</p>
<blockquote>
<p>The core path is simple: conversation evidence becomes a validated narrative model, and the product receives only a gentle context block — not raw transcript replay.</p>
</blockquote>
<h2>Why this exists</h2>
<p>Glia is a personal AI for reflection. Three surfaces matter most. Chat happens through the <strong>Dot Agent</strong>, the conversational agent users actually talk to. Longer-form reflections come from the <strong>Story Agent</strong>, which composes written pieces from a user&#39;s recent context. Recurring people are handled by the <strong>People Card pipeline</strong>, which promotes well-grounded named relationships into reusable memory objects.</p>
<p>All three share the same underlying problem: each session tends to start cold.</p>
<p>The instinct is to reach for retrieval. And <a href="https://dearartist.xyz/blog/glia-actually-remembers">retrieval has its place</a>. But it answers a narrow question — &quot;what was said before that resembles this?&quot; — and misses the more human one: &quot;what is the user in the middle of becoming?&quot;</p>
<blockquote>
<p><strong>Related note:</strong> <a href="https://dearartist.xyz/blog/glia-actually-remembers">How Glia Actually Remembers</a> — Raw messages, async patches, hybrid retrieval, and the evidence pipeline behind continuity.</p>
</blockquote>
<table>
<thead>
<tr>
<th>Retrieval / vector memory</th>
<th>Narrative Model</th>
</tr>
</thead>
<tbody><tr>
<td>Finds fragments of past text that look similar to the current query. Good at recall. Silent on shape.</td>
<td>Maintains a structured, validated state of who the user appears to be right now. Good at continuity. Silent on trivia.</td>
</tr>
</tbody></table>
<p>The two are complementary. Retrieval can recover a specific fragment when it&#39;s needed. The Narrative Model carries the steady picture between sessions so the product doesn&#39;t have to rebuild its sense of the user from scratch every time.</p>
<p>This has to be done with care. Memory shouldn&#39;t feel like surveillance. It shouldn&#39;t overfit. It shouldn&#39;t psychoanalyze. And it shouldn&#39;t leak hidden fields back at the user. The goal is a product that responds with more situated care, not more dramatic certainty.</p>
<blockquote>
<p><strong>Design constraint</strong></p>
<p>The model should help Glia carry continuity without making the assistant sound like it is secretly reading from a dossier.</p>
</blockquote>
<h2>Early signal: memory that feels situated</h2>
<p>One early signal was not simply that Glia could remember more. It was that memory felt emotionally situated.</p>
<p>The strongest feedback was about specificity: not a bigger context window, not transcript replay, but a system that can carry forward details that feel worth remembering.</p>
<p><img src="https://dearartist.xyz/blog-images/glia-narrative-model/glia-feedback-quote.jpg" alt="Quoted user feedback on a black background reading: &quot;Machine memory is fascinating, especially in the emotional context — not jamming the context window, but weaving layers that ask what is worth remembering.&quot; Identifying social media details have been removed." /></p>
<p><em>Feedback during early Narrative Model validation.</em></p>
<p>This distinction matters. The goal is not larger context for its own sake. The goal is memory that knows what is worth carrying forward, and uses it gently.</p>
<h2>The shape of the model</h2>
<p>Narrative Model is structured around five broad dimensions: Chapter, Drive, Relationships, Self-image, and Energy.</p>
<p>The purpose is not to classify the user permanently. The purpose is to maintain a cautious, updatable sketch of what appears to matter right now.</p>
<pre><code class="language-text">                  ┌────────────────────────────┐
                  │      Narrative Model       │
                  │  cautious · updatable ·    │
                  │         grounded           │
                  └────────────┬───────────────┘
                               │
   ┌──────────┬────────────────┼────────────────┬───────────┐
   │          │                │                │           │
Chapter    Drive          Relationships     Self-image    Energy
(arc)   (motivation +   (grounded people)   (claimed +   (sources ·
        tension)                            emerging      drains ·
                                            roles)       absence)
</code></pre>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>Sub</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td>Chapter</td>
<td>current arc</td>
<td>What life phase or season the user seems to be in.</td>
</tr>
<tr>
<td>Drive</td>
<td>motivation + tension</td>
<td>What the user appears to be trying to prove, protect, build, or resolve.</td>
</tr>
<tr>
<td>Relationships</td>
<td>grounded people</td>
<td>Recurring named people and relational dynamics.</td>
</tr>
<tr>
<td>Self-image</td>
<td>claimed + emerging roles</td>
<td>Roles the user claims, resists, or is beginning to inhabit.</td>
</tr>
<tr>
<td>Energy</td>
<td>sources · drains · absence</td>
<td>What seems to replenish, drain, or be missing.</td>
</tr>
</tbody></table>
<h3>Chapter</h3>
<p>Chapter captures the user&#39;s current arc in plain language. It should feel like a careful summary of the season the user is in, not a permanent identity label.</p>
<p>Example: <em>&quot;Building an early company while learning how to ask for help.&quot;</em></p>
<p>A good chapter is specific enough to be useful, but not so specific that it traps the user. It should update as the user changes. It should avoid diagnosis, moral judgment, or dramatic interpretation.</p>
<h3>Drive</h3>
<p>Drive captures what appears to be pulling the user forward. In the current implementation, it can also include a tension — the friction between what the user wants and what is making that difficult.</p>
<blockquote>
<p><strong>Drive</strong> — <em>&quot;Wants to prove the product can become real without losing personal grounding.&quot;</em></p>
<p><strong>Tension</strong> — <em>&quot;Ambition is pulling against exhaustion.&quot;</em></p>
</blockquote>
<p>The drive field is valuable because many user messages are not just about tasks. They are about the force behind the tasks. But the model must keep this grounded. It should not invent grand motives from thin evidence.</p>
<h3>Relationships</h3>
<p>Relationships capture recurring named people and the dynamics around them when there is enough grounding. This is not meant to turn every name into a permanent object. It is a cautious bridge between relational context and memory.</p>
<p>Named relationships can later become candidates for People Cards, but only when there is enough evidence. The system should avoid creating people-memory objects from weak, accidental, or one-off mentions.</p>
<blockquote>
<p><strong>Why this shape</strong></p>
<p>People are often the most important part of a user&#39;s context, but they are also the easiest place to overreach. The model treats relationships as grounded candidates, not automatic conclusions.</p>
</blockquote>
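<p>As a minimal sketch of what "enough evidence" could mean, the snippet below promotes a name only when it recurs across sessions. The function name, the input shape, and both thresholds are assumptions made for illustration, not the production grounding logic.</p>

```python
from collections import defaultdict

# Hypothetical sketch: the thresholds and the (session_id, name) input shape are assumptions.
def grounded_candidates(mentions, min_sessions=2, min_mentions=3):
    """Return names that recur often enough, across enough sessions, to be candidates."""
    sessions_by_name = defaultdict(set)
    count_by_name = defaultdict(int)
    for session_id, name in mentions:
        sessions_by_name[name].add(session_id)
        count_by_name[name] += 1
    return [
        name
        for name in sorted(count_by_name)
        if len(sessions_by_name[name]) >= min_sessions
        and count_by_name[name] >= min_mentions
    ]

mentions = [
    ("s1", "Nora"), ("s1", "Nora"), ("s2", "Nora"),
    ("s2", "Sam"),  # a one-off mention stays out of people memory
]
candidates = grounded_candidates(mentions)
```

<p>A one-off mention such as "Sam" never reaches the candidate list, which is the behavior the paragraph above asks for.</p>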
<h3>Self-image</h3>
<p>Self-image captures roles the user actively claims and roles that appear to be emerging. This is especially important because people often change before they are ready to name the change.</p>
<p>For example, a user may call themselves a builder or founder, but not yet feel comfortable calling themselves a leader. The model can represent that as a budding role with a <code>do_not_label</code> constraint.</p>
<p>The assistant should not prematurely label the user. Emerging identity should be protected. A budding role can guide sensitivity, but it should not be pushed back at the user as a statement of fact.</p>
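<p>One way to picture the constraint: budding roles stay in the model, but only unprotected roles are ever eligible to be stated back to the user. The helper below is a hypothetical sketch that assumes the dictionary shape used in the synthetic example later in this note.</p>

```python
# Hypothetical sketch: the helper name and the dict shape are assumptions.
def roles_safe_to_mention(self_image):
    """Return roles the assistant may state back to the user."""
    named = list(self_image.get("active_roles", []))
    for budding in self_image.get("budding_roles", []):
        # Treat a missing flag as protected, so the default is the cautious one.
        if not budding.get("do_not_label", True):
            named.append(budding["role"])
    return named

self_image = {
    "active_roles": ["founder", "builder"],
    "budding_roles": [{"role": "leader", "do_not_label": True}],
}
mentionable = roles_safe_to_mention(self_image)
```

<p>Here "leader" can still shape tone downstream, but it is never returned as something to say out loud.</p>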
<h3>Energy</h3>
<p>Energy captures what appears to replenish the user, drain them, or be missing from their current life.</p>
<table>
<thead>
<tr>
<th>Sources</th>
<th>Drains</th>
<th>Currently absent</th>
</tr>
</thead>
<tbody><tr>
<td>deep work</td>
<td>ambiguous relationships</td>
<td>rest</td>
</tr>
<tr>
<td>long walks</td>
<td>context switching</td>
<td>founder community</td>
</tr>
<tr>
<td>clear feedback</td>
<td></td>
<td></td>
</tr>
</tbody></table>
<p>This dimension helps the assistant respond more usefully to vague emotional states. If a user says they are tired, the product can respond from a more situated understanding without overexplaining or psychoanalyzing.</p>
<h2>Prompt rules that matter</h2>
<p>Each dimension carries a status and a confidence level, so the output never pretends every field is equally well supported. Where evidence is thin, the prompt prefers blank space over invention — sparseness is correct output, not failure.</p>
<p>When signals conflict, the model is told to record tension rather than smooth it away — important for drive and self-image, where people are often mid-transition. Named people only appear when grounded, and budding roles are protected with a &quot;do not label&quot; constraint so emerging identity is never pushed back at the user as fact.</p>
<p>Dimensions also move at different speeds: fast-moving ones can shift from a single recent session, slow-moving ones need repeated evidence. That temporal shape keeps the model responsive without becoming volatile.</p>
<blockquote>
<p><strong>The rules in one place</strong></p>
<ul>
<li>Never invent facts.</li>
<li>Named people only when grounded.</li>
<li>Hold contradictions as tension.</li>
<li>Protect emerging identities.</li>
<li>Use confidence fields.</li>
<li>Leave blanks when evidence is thin.</li>
</ul>
</blockquote>
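<p>The different update speeds can be sketched as a per-dimension evidence threshold. Which dimensions count as fast or slow, and the numbers themselves, are assumptions for illustration only; the note does not specify them.</p>

```python
# Assumption: this mapping is illustrative, not the production configuration.
REQUIRED_SESSIONS = {
    "energy": 1,   # treated as fast-moving here: one recent session can shift it
    "chapter": 3,  # treated as slow-moving here: needs repeated evidence
}

def should_update(dimension, supporting_sessions, default_required=2):
    """Return True when enough sessions support changing this dimension."""
    required = REQUIRED_SESSIONS.get(dimension, default_required)
    return supporting_sessions >= required

fast_ok = should_update("energy", 1)
slow_blocked = should_update("chapter", 1)
```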
<h2>The schema</h2>
<p>The schema is where the Narrative Model stops being a loose idea and becomes an inspectable system.</p>
<p>At the center is a top-level model containing schema version, update time, session count, dimensions, and a few auxiliary fields like thread, openness, tone, and archetype traits. But the most important part is the dimensions object itself.</p>
<p>The production schema is larger, but the simplified version below shows the important shape: typed dimensions, bounded fields, and explicit uncertainty.</p>
<pre><code class="language-python"># simplified
from pydantic import BaseModel, Field

class NamedPerson(BaseModel):
    name: str
    role: str = &quot;&quot;
    dynamic: str = &quot;&quot;
    emotional_weight: str = &quot;&quot;
    people_card_candidate: bool = False

class DriveDimension(DimensionSlice):
    intrinsic: str = &quot;&quot;
    tension: str = &quot;&quot;

class SelfImageDimension(DimensionSlice):
    active_roles: list[str] = Field(default_factory=list)
    performed_roles: list[str] = Field(default_factory=list)
    budding_roles: list[BuddingRole] = Field(default_factory=list)

class EnergyDimension(DimensionSlice):
    sources: list[str] = Field(default_factory=list)
    drains: list[str] = Field(default_factory=list)
    currently_absent: list[str] = Field(default_factory=list)
    current_capacity: str = &quot;&quot;

class NarrativeModel(BaseModel):
    schema_version: str = &quot;1&quot;
    updated_at: str = &quot;&quot;
    session_count: int = 0
    dimensions: NarrativeDimensions
    archetype_traits: str = &quot;&quot;
    narrative_tone: str = &quot;neutral&quot;
    openness: str = &quot;medium&quot;
    thread: str = &quot;&quot;
</code></pre>
<p><em>Simplified and sanitized. Some supporting types (DimensionSlice, BuddingRole, NarrativeDimensions) are omitted for brevity. The point is the architectural shape, not exact production source.</em></p>
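<p>For readers who want a feel for the omitted supporting types, here is a hypothetical sketch using standard-library dataclasses rather than Pydantic. The field names follow what this note describes (a status and a confidence per dimension, and a budding role with evidence and a <code>do_not_label</code> flag); the allowed values and defaults are assumptions.</p>

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the omitted types; not the production definitions.
@dataclass
class DimensionSlice:
    status: str = "empty"    # assumed values: "empty", "tentative", "established"
    confidence: str = "low"  # assumed values: "low", "medium", "high"

@dataclass
class BuddingRole:
    role: str = ""
    evidence: str = ""
    do_not_label: bool = True  # protected by default

blank = DimensionSlice()
budding = BuddingRole(role="leader", evidence="Starting to coordinate others.")
```

<p>Defaulting every field to an empty or low-confidence value keeps a blank dimension valid, which matches the rule that sparseness is correct output.</p>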
<p>This structure matters because it makes memory usable. A free-form blob is hard to validate, hard to compare over time, and hard to <a href="https://dearartist.xyz/blog/glia-token-cost">inject back into downstream systems</a> in a disciplined way. A structured schema gives the product something it can reason about.</p>
<blockquote>
<p><strong>Related note:</strong> <a href="https://dearartist.xyz/blog/glia-token-cost">Reducing Token Cost While Working on Glia</a> — Trimming prompts, caching context, and quiet wins on the cost line.</p>
</blockquote>
<p>Confidence and status fields matter because they force the model to admit uncertainty. Named relationships matter because a person list without role or dynamic is too weak to be useful. Self-image matters because claimed roles and emerging roles are not the same thing. Energy matters because what fills, drains, and remains absent is often the difference between a generic response and a grounded one.</p>
<p>The schema is also what makes the system inspectable. It is not a mystical memory black box. It is a typed representation with explicit fields and visible failure modes.</p>
<h2>A synthetic example</h2>
<p>A populated model for a hypothetical user might look like this. All values are invented for illustration.</p>
<pre><code class="language-json">{
  &quot;chapter&quot;: {
    &quot;value&quot;: &quot;Building an early company while learning how to ask for help&quot;,
    &quot;confidence&quot;: &quot;high&quot;
  },
  &quot;drive&quot;: {
    &quot;value&quot;: &quot;Wants to prove the product can become real without losing personal grounding&quot;,
    &quot;tension&quot;: &quot;Ambition is pulling against exhaustion&quot;,
    &quot;confidence&quot;: &quot;high&quot;
  },
  &quot;relationships&quot;: {
    &quot;named_people&quot;: [
      {
        &quot;name&quot;: &quot;Nora&quot;,
        &quot;role&quot;: &quot;co-founder&quot;,
        &quot;dynamic&quot;: &quot;trusted but currently under-communicated&quot;,
        &quot;emotional_weight&quot;: &quot;high&quot;,
        &quot;people_card_candidate&quot;: true
      }
    ]
  },
  &quot;self_image&quot;: {
    &quot;active_roles&quot;: [&quot;founder&quot;, &quot;builder&quot;],
    &quot;budding_roles&quot;: [
      {
        &quot;role&quot;: &quot;leader&quot;,
        &quot;evidence&quot;: &quot;The user is starting to coordinate others but does not yet fully claim this identity.&quot;,
        &quot;do_not_label&quot;: true
      }
    ]
  },
  &quot;energy&quot;: {
    &quot;sources&quot;: [&quot;deep work&quot;, &quot;long walks&quot;, &quot;clear feedback&quot;],
    &quot;drains&quot;: [&quot;ambiguous relationships&quot;, &quot;context switching&quot;],
    &quot;currently_absent&quot;: [&quot;rest&quot;, &quot;founder community&quot;],
    &quot;current_capacity&quot;: &quot;medium&quot;
  }
}
</code></pre>
<p><em>Synthetic example. This is not a production record and does not describe a real user.</em></p>
<h2>Pipeline overview</h2>
<p>The model is refreshed asynchronously after conversation. The system gathers recent conversation evidence, builds a transcript representation, reads the current narrative model if one exists, asks the Narrative Model agent to produce strict JSON, validates the output, and only then saves the structured representation.</p>
<pre><code class="language-text">┌────────────┐   ┌─────────────┐   ┌─────────────────┐   ┌────────────┐
│  Capture   │ → │   Reason    │ → │    Validate     │ → │  Surface   │
├────────────┤   ├─────────────┤   │ (validation     │   ├────────────┤
│01 Conv.    │   │04 Build     │   │       gate)     │   │09 Prepare  │
│   happens  │   │   transcript│   ├─────────────────┤   │   context  │
│02 Async    │   │05 Read curr.│   │07 Validate      │   │   for      │
│   refresh  │   │   model     │   │   strict JSON   │   │   product  │
│03 Fetch    │   │06 Run NM    │   │08 Save struct.  │   │   surfaces │
│   messages │   │   agent     │   │   model         │   │            │
└────────────┘   └─────────────┘   └─────────────────┘   └────────────┘
                                                               │
                                                               ▼
                                              available to: Chat ·
                                              Stories · People Cards

Legend: Evidence in · Structure out · Validation gate · Product context
</code></pre>
<p>A simplified view of the job flow looks like this:</p>
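<pre><code class="language-python"># simplified
def refresh_narrative_model_job(user_id: str, conversation_id: str):
    # Debounce: skip when a refresh for this user ran too recently.
    if debounce_should_skip(user_id):
        return
    # Gather recent evidence and flatten it into a transcript.
    messages = fetch_messages_for_conversation(user_id, conversation_id, limit=200)
    transcript = transcript_from_messages(messages)
    # Pass in the existing model (if any) so the agent updates it rather than rebuilding.
    current = get_narrative_model_dict(user_id)
    updated = run_narrative_model_agent(
        current_model=current,
        transcript=transcript,
    )
    # Validation gate: the agent call is assumed to return a falsy value
    # when its output fails validation, so nothing invalid is saved.
    if not updated:
        return
    save_narrative_model(user_id=user_id, model=updated)
    maybe_trigger_people_cards(user_id, updated)
</code></pre>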
<p><em>Architectural shape: debounce, fetch, build, run, validate, save, optionally trigger people candidates.</em></p>
<blockquote>
<p><strong>Grounding rule</strong></p>
<p>The model may infer only from grounded evidence. It should preserve uncertainty, tolerate blanks, and avoid converting weak signals into confident claims.</p>
</blockquote>
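<p>The validation gate can be pictured as a function that either returns a usable dictionary or returns nothing. The sketch below uses only the standard library; the function name, the required keys, and the allowed confidence values are assumptions, and the production check is stricter and schema-driven.</p>

```python
import json

# Assumptions: key names mirror the synthetic example; allowed values are illustrative.
REQUIRED_DIMENSIONS = ("chapter", "drive", "relationships", "self_image", "energy")
ALLOWED_CONFIDENCE = {"low", "medium", "high"}

def validate_agent_output(raw_text):
    """Return the parsed model dict, or None if the output fails the gate."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name in REQUIRED_DIMENSIONS:
        dimension = data.get(name)
        # A blank dimension ({}) is fine; a missing or non-object one is not.
        if not isinstance(dimension, dict):
            return None
        confidence = dimension.get("confidence")
        if confidence is not None and confidence not in ALLOWED_CONFIDENCE:
            return None
    return data

sparse = validate_agent_output(json.dumps({name: {} for name in REQUIRED_DIMENSIONS}))
rejected = validate_agent_output("Sure! Here is the JSON you asked for: {")
```

<p>Note that an all-blank model passes the gate. Rejecting sparse output would push the agent toward invention, which is the opposite of the grounding rule.</p>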
<h2>Storage and history</h2>
<p>In the current implementation, storage is intentionally simple. Rather than introducing a separate narrative-model table, the system keeps the active model in a structured JSON field on the user profile. That keeps the read path lightweight and avoids an extra join for the main consumers.</p>
<p>The save path also snapshots the prior model before overwriting it. Each time a new model is successfully saved, the previous one is appended to <code>model_history</code>. That history is capped at the 50 most recent snapshots so it remains bounded.</p>
<p><code>updated_at</code> must be system-owned. A simplified version of the save path looks like this:</p>
<pre><code class="language-python"># simplified — field names below are illustrative
from datetime import datetime, timezone

def save_narrative_model(db, user_id: str, model: NarrativeModel) -&gt; None:
    row = load_profile_row(db, user_id)
    existing = load_current_model(row) or {}
    history = (existing.get(&quot;model_history&quot;) or [])
    if existing:
        # Snapshot the prior model without its nested history, then keep the latest 50.
        snapshot = {k: v for k, v in existing.items() if k != &quot;model_history&quot;}
        history = (history + [snapshot])[-50:]
    new_data = model.model_dump(mode=&quot;json&quot;)
    # The timestamp comes from the system clock, never from LLM output.
    new_data[&quot;updated_at&quot;] = datetime.now(timezone.utc).isoformat()
    new_data[&quot;model_history&quot;] = history
    write_current_model(row, new_data)
    db.commit()
</code></pre>
<p><em>Simplified and sanitized. Save the active model, retain bounded history, and keep update metadata under system control.</em></p>
<p>The history is not yet a user-facing product surface. There is no built-out &quot;you in March versus you in July&quot; experience today. But keeping the snapshots creates a foundation for future longitudinal features without forcing those product decisions prematurely.</p>
<h2>Injection into product experience</h2>
<h3>Chat</h3>
<p>For chat, the system converts the stored model into a short plain-language narrative context block and injects that block into the prompt as soft background. That wording matters. It is not raw transcript text. It is not hidden truth. It is not meant to override what the user is saying now.</p>
<p>A simplified version of the injection path looks like this:</p>
<pre><code class="language-python"># simplified
narrative_block = build_narrative_context_block(model)
template_vars = {
    &quot;messages&quot;: transcript,
    &quot;memory_evidence_json&quot;: memory_evidence,
    &quot;narrative_context&quot;: narrative_block or &quot;&quot;,
}
prompt = render_prompt(contract[&quot;body&quot;], template_vars)
</code></pre>
<p><em>Shape of the injection path: structured model → plain-language block → prompt variable.</em></p>
<p>The prompt contract for chat treats <code>narrative_context</code> as optional background from prior sessions. If it conflicts with what the user says in the present moment, the present message wins. That is the right boundary.</p>
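<p>To make the "soft background" idea concrete, here is a hypothetical sketch of a block builder. It is not the production <code>build_narrative_context_block</code>; the function name, the wording of the header, and the dictionary shape (borrowed from the synthetic example) are all assumptions.</p>

```python
# Hypothetical sketch: header wording and dict shape are assumptions.
HEADER = (
    "Background from prior sessions. "
    "If it conflicts with the current message, the current message wins."
)

def build_context_block(model):
    """Render a short plain-language block; return an empty string if nothing is usable."""
    lines = []
    chapter = model.get("chapter", {}).get("value", "")
    if chapter:
        lines.append("Current chapter: " + chapter)
    tension = model.get("drive", {}).get("tension", "")
    if tension:
        lines.append("Unresolved tension: " + tension)
    if not lines:
        return ""
    return "\n".join([HEADER] + lines)

block = build_context_block({
    "chapter": {"value": "Building an early company while learning how to ask for help"},
    "drive": {"tension": "Ambition is pulling against exhaustion"},
})
empty_block = build_context_block({})
```

<p>Returning an empty string for a blank model lets the prompt template omit the background entirely instead of injecting a hollow block.</p>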
<h3>Stories</h3>
<p>The story system can also receive narrative context. This helps story generation stay coherent with the user&#39;s current chapter, unresolved tension, relational landscape, and openness level.</p>
<p>That does not mean the story system should become more dramatic. It means it becomes less arbitrary. If the user is in a transitional chapter and carrying unresolved tension, the story generator should not suddenly write as if everything has already resolved.</p>
<h3>People Cards</h3>
<p>The relationships dimension also feeds a more object-like memory path. If a named person is sufficiently grounded and marked as a candidate, the system can create or resolve a provisional people entity. That gives Glia a bridge from narrative understanding into <a href="https://dearartist.xyz/blog/glia-social-share">reusable product memory objects</a>.</p>
<blockquote>
<p><strong>Related note:</strong> <a href="https://dearartist.xyz/blog/glia-social-share">Glia Social / Share: How AI Memories Become Safely Shareable</a> — Entity-scoped Connection, card-scoped Single Story Share, and the boundaries that keep memory sharing safe.</p>
</blockquote>
<p>This is where the Narrative Model becomes product infrastructure rather than just metadata. The full job flow that drives this lives in the Pipeline overview above.</p>
<h2>The bug that mattered: false staleness</h2>
<p>One of the clearest engineering lessons in the Narrative Model rollout came from a surprisingly small field: <code>updated_at</code>.</p>
<p>Originally, the model&#39;s <code>updated_at</code> came from LLM output. That seemed harmless at first because the field fit naturally into the generated JSON. But it was a category mistake.</p>
<p>An LLM does not know the actual runtime date. It can generate something that looks like a timestamp, but that is not the same as owning operational truth. In practice, that meant freshly generated models could appear hundreds of days old.</p>
<p>That mattered because the narrative context builder uses <code>updated_at</code> to decide whether to prepend a stale-model warning. Once the timestamps were wrong, the system could tell downstream prompts to treat fresh models as stale.</p>
<p>The fix was simple and important: overwrite <code>updated_at</code> at save time with the system clock in UTC.</p>
<pre><code class="language-python"># simplified
from datetime import datetime, timezone

new_data = model.model_dump(mode=&quot;json&quot;)
new_data[&quot;updated_at&quot;] = datetime.now(timezone.utc).isoformat()
</code></pre>
<p><em>The timestamp is assigned by the system at save time.</em></p>
<blockquote>
<p><strong>Engineering lesson</strong></p>
<p>LLMs can generate useful structure, but they should not be trusted with operational metadata like timestamps, IDs, or state transitions.</p>
</blockquote>
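<p>The failure mode is easy to reproduce with a small staleness check. The sketch below is hypothetical: the function name and the 30-day window are assumptions, and the real context builder may use a different threshold.</p>

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: the 30-day window is an assumed threshold.
def is_stale(updated_at, now, max_age_days=30):
    """True when the ISO timestamp is missing, unparseable, or older than the window."""
    if not updated_at:
        return True
    try:
        saved = datetime.fromisoformat(updated_at)
    except ValueError:
        return True
    if saved.tzinfo is None:
        # Assume naive timestamps were written in UTC.
        saved = saved.replace(tzinfo=timezone.utc)
    return now - saved > timedelta(days=max_age_days)

now = datetime(2026, 4, 28, tzinfo=timezone.utc)
system_stamped = is_stale("2026-04-27T10:00:00+00:00", now)   # written by the runtime
model_invented = is_stale("2025-01-15T00:00:00+00:00", now)   # plausible-looking LLM guess
```

<p>A model saved yesterday but stamped with an invented date trips the warning, even though nothing about its content is old.</p>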
<h2>Current validation state</h2>
<p>In the current validation state, the core pipeline is in place and the work has shifted from building to observing. Chat and story paths receive narrative context. Provisional People Card candidates can be triggered when relationships are sufficiently grounded. Freshly saved models no longer carry false stale warnings caused by LLM-generated timestamps.</p>
<p>Across validation runs, the system shows the expected shape: richer conversational histories produce denser dimensions and more coherent continuity, while sparser histories produce blanks or lower-confidence fields rather than fabricated certainty. Sparse output is a feature, not a bug.</p>
<p>Attention at this stage sits on correctness, prompt health, and structured-output reliability — whether refreshes happen on schedule, whether the agent keeps returning valid structured output, and whether richer models translate into useful continuity without over-personalization.</p>
<h2>Implementation overview</h2>
<p>Narrative Model was built in staged increments rather than as a single drop. Each stage was small enough to validate on its own and large enough to land a real piece of product behavior. The end-to-end flow is in the Pipeline overview above; the rollout shape is below.</p>
<table>
<thead>
<tr>
<th>Stage</th>
<th>Title</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td>Stage 1</td>
<td>Core foundations</td>
<td>Agent, typed schema, persistence, async refresh job, chat-side hook, and the regression tests that anchor them.</td>
</tr>
<tr>
<td>Stage 2</td>
<td>Schema expansion &amp; onboarding</td>
<td>Additional dimension fields and explicit support for early-session users, where evidence is naturally thin and blanks are correct output.</td>
</tr>
<tr>
<td>Stage 3</td>
<td>Story composition injection</td>
<td>Wire the structured model into the Story Agent so longer-form reflections stay coherent with the user&#39;s current chapter and tension.</td>
</tr>
<tr>
<td>Stage 4</td>
<td>Provisional People Card triggering</td>
<td>Promote well-grounded named relationships into candidate memory objects, with grounding requirements rather than name-matching.</td>
</tr>
<tr>
<td>Stage 5</td>
<td>Validation &amp; monitoring</td>
<td>Operational notes covering refresh health, structured-output reliability, and the boundaries of what the model is allowed to claim.</td>
</tr>
</tbody></table>
<p>The current surface area covers the agent, the async refresh job, the schema, the persistence layer, the prompt contract, the chat and story injection hooks, the provisional People Card trigger, and a regression suite. This is a product pipeline, not a prompt-only experiment.</p>
<blockquote>
<p><strong>Why this matters</strong></p>
<p>Memory that lives only inside a prompt is fragile. Memory that lives in a validated schema, a persistence layer, and a refresh job is something the product can actually depend on.</p>
</blockquote>
<h2>The product surface: what continuity feels like</h2>
<p><em>Simulated UI · synthetic content. These mockups are illustrative. They show how a structured memory layer gently changes the surface a user actually touches — without exposing hidden fields or making the system feel like it is watching.</em></p>
<p><img src="https://dearartist.xyz/substack-visuals/glia-narrative-model-v1/glia-narrative-model-v1-phone-with-without.png" alt="Side-by-side phone mockups comparing chat with and without the Narrative Model" /></p>
<p><em>Without vs. with the Narrative Model.</em></p>
<p><strong>Without Narrative Model</strong> — Glia · Chat</p>
<blockquote>
<p>User: <em>I don&#39;t know why I&#39;m so tired this week.</em></p>
<p>Glia: <em>Sounds like you may be under stress. Try getting more sleep, reducing workload, and taking breaks.</em></p>
</blockquote>
<p><em>Generic but plausible.</em></p>
<p><strong>With Narrative Model</strong> — Glia · Chat</p>
<blockquote>
<p>User: <em>I don&#39;t know why I&#39;m so tired this week.</em></p>
<p>Glia: <em>You&#39;ve been carrying a lot between the launch push and trying to keep things steady around it. The tiredness may be tracking that, not just the week itself.</em></p>
<p>✨ context used lightly</p>
</blockquote>
<p><em>More situated, without exposing hidden fields.</em></p>
<h3>Specific recall, used lightly <em>(synthetic example)</em></h3>
<blockquote>
<p>User: <em>I&#39;m going for a hill workout today.</em></p>
<p>Glia: <em>Back to that hill again? Your legs may complain, but you&#39;ll probably feel better after.</em></p>
<p>✨ context used lightly</p>
</blockquote>
<p><em>Synthetic example inspired by early usage patterns. The user does not see raw memory fields; they feel continuity through a small, relevant detail.</em></p>
<h3>Story Agent preview</h3>
<p><strong>The Cost of Holding Momentum</strong>
<em>A reflection on building, uncertainty, and staying grounded.</em></p>
<blockquote>
<p>You are not only trying to move faster. You are trying to move without losing the parts of yourself that make the work worth doing.</p>
</blockquote>
<p><code>chapter-aware</code> · <code>tone-matched</code> · <code>grounded reflection</code></p>
<p><em>Narrative context keeps story generation coherent with the current chapter — not transcript replay.</em></p>
<h3>People Card candidate</h3>
<p><strong>Nora</strong> — <em>candidate · needs grounding</em>
co-founder</p>
<blockquote>
<p><strong>Relationship signal:</strong> Trusted, but currently under-communicated.</p>
</blockquote>
<p><em>Provisional. Only recurring, grounded relationships are promoted into memory objects.</em></p>
<p>The user does not see the raw model. They feel its effect through continuity: a reply that starts from the right emotional neighborhood, a story that matches the current chapter, or a relationship object that appears only when there is enough grounding.</p>
<p>The better answer is not better because it is more poetic or more invasive. It is better because it is more situated — grounded in what the system has already understood about the user&#39;s ongoing context.</p>
<p>The important constraint is how gently that context gets used. The assistant should not dump hidden model fields back at the user, sound like it is reading from a file, or psychoanalyze. The best outcome is a response that feels naturally coherent, not theatrically personalized.</p>
<h2>Design principles</h2>
<ul>
<li><strong>Structured memory beats raw memory.</strong> Raw history contains evidence, but structure makes it reusable. The product needs something more disciplined than transcript replay.</li>
<li><strong>Confidence matters.</strong> If the system cannot say how sure it is, it will tend to sound more certain than it should.</li>
<li><strong>Blank is better than hallucinated.</strong> A sparse field is not a failure. In personal memory systems, invention is usually worse than omission.</li>
<li><strong>Emerging identity should be protected.</strong> A user can be becoming something without wanting that identity imposed on them.</li>
<li><strong>Relationships must be grounded.</strong> Names matter, but only when supported. Otherwise relationship memory turns into noise.</li>
<li><strong>Operational metadata must be system-owned.</strong> Timestamps, IDs, save semantics, and state transitions should belong to the runtime, not the generator.</li>
<li><strong>Context should guide tone, not dominate it.</strong> The goal is better continuity, not to make every answer feel narrated by memory.</li>
</ul>
<h2>What we are not doing yet</h2>
<p>Narrative Model is intentionally narrow, and some boundaries are important.</p>
<ul>
<li><strong>Selective injection.</strong> Narrative context is not yet wired into every adjacent surface. Broader theme- or line-style injection should be evaluated deliberately.</li>
<li><strong>History stays internal.</strong> Model snapshots are retained but not surfaced to the user. The system preserves longitudinal shape before committing to a <a href="https://dearartist.xyz/blog/glia-social-share">public surface</a>.</li>
<li><strong>Context-size limits.</strong> Long-context users may eventually need explicit limits or a summarization layer when many named relationships are active.</li>
<li><strong>Structured-output reliability.</strong> Treated as an ongoing operational concern, not a solved problem. Validation and recovery paths stay first-class.</li>
</ul>
<p>Most importantly, this is a validation and monitoring phase, not a claim of perfect memory. The responsible thing at this point is to observe, refine only where evidence justifies it, and resist turning a narrow memory layer into a sprawling system too early.</p>
<h2>Optional future surface: model history over time</h2>
<p>Because each successful save snapshots the prior model, the system is quietly retaining the shape of change. That opens a possible future surface where a user could see continuity over time — not as surveillance, not as overconfident analysis, but as reflective movement.</p>
<pre><code class="language-text">[Snapshot 1] ──→ [Snapshot 2] ──→ [Snapshot 3]
Chapter:        Chapter:           Chapter:
Starting the    Learning to        Delegating
company alone   ask for help       without feeling
                                   absent
</code></pre>
<p>That surface is not shipped today, and should not be rushed. The underlying structure is there when the product question is ready.</p>
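<p>As a rough illustration of how little is needed on top of the stored snapshots, the sketch below reads a chapter timeline out of a history list. The function name and the snapshot shape are assumptions (the shape follows the synthetic example, not the production record).</p>

```python
# Hypothetical sketch: snapshot shape follows the synthetic example above.
def chapter_timeline(history):
    """Return chapter values in order, dropping blanks and consecutive repeats."""
    timeline = []
    for snapshot in history:
        value = snapshot.get("chapter", {}).get("value", "")
        if value and (not timeline or timeline[-1] != value):
            timeline.append(value)
    return timeline

history = [
    {"chapter": {"value": "Starting the company alone"}},
    {"chapter": {"value": "Starting the company alone"}},
    {"chapter": {}},
    {"chapter": {"value": "Learning to ask for help"}},
]
timeline = chapter_timeline(history)
```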
<h2>Closing</h2>
<p>Narrative Model is an attempt to make AI memory feel less like retrieval and more like continuity — without pretending the system fully knows the person, and without turning memory into a black box that can&#39;t be inspected.</p>
<p>What it holds is small: where the user seems to be in life, what tensions are still alive, who matters, what identity movement is emerging, and what fills or drains them. A modest claim, but a useful one.</p>
<blockquote>
<p>If personal AI is going to feel human over time, it needs memory that is more than recall and less than overreach.</p>
</blockquote>
<hr />
<p><strong>Privacy checklist</strong></p>
<p>This note describes a real production system, but everything shown on this page has been reviewed for privacy.</p>
<ul>
<li>✓ All chat examples, JSON payloads, and named people are synthetic. No real user data is shown.</li>
<li>✓ All personal identifiers — usernames, avatars, timestamps, and platform UI — have been removed from the early-signal feedback image.</li>
<li>✓ No internal commit SHAs, file paths, or deployment targets are exposed in the code snippets.</li>
<li>✓ Code blocks are simplified for readability and labeled accordingly; they are not production source.</li>
<li>✓ Real co-founder feedback is paraphrased and presented without handles or social media chrome.</li>
</ul>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/glia-narrative-model-v1">https://dearartist.xyz/blog/glia-narrative-model-v1</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Glia Social / Share: How AI Memories Become Safely Shareable</title>
      <link>https://dearartist.xyz/blog/glia-social-share</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-social-share</guid>
      <pubDate>Thu, 23 Apr 2026 11:00:00 GMT</pubDate>
      <description><![CDATA[How Glia turns private AI memories into something that is safely shareable, with the right consent and redaction model.]]></description>
      <category>glia</category>
      <category>social</category>
      <category>privacy</category>
      <content:encoded><![CDATA[<p>In most software products, sharing is simple. You share a photo, a document, or a link, and the system mostly needs to answer one question: <em>who can access this resource?</em></p>
<p>In <a href="https://gliahq.com/">Glia ↗</a>, the object being shared is not a file or a URL. It is a <em>memory</em>. And memory is different.</p>
<p>A memory is generated from conversations, lived context, relationships, and AI-assisted narrative structure. The thing being shared is not just content — it is <em>content about someone</em>. That changes the problem entirely.</p>
<blockquote>
<p><strong>Core idea</strong></p>
<p>Glia does not simply send content outward. It delivers memories about a person to the right people, within the right boundary.</p>
</blockquote>
<p><img src="https://dearartist.xyz/substack-visuals/glia-social-share/glia-social-share-hero-three-columns.png" alt="Three-column hero: moment story · people entity · connection / share-link" /></p>
<p><em>Object · about · boundary — three columns of the share model.</em></p>
<h2>Why sharing memory is not ordinary sharing</h2>
<p>Once memory becomes the object, sharing is no longer just an access-control feature. It becomes a question of who the memory is about, who should be allowed to see it, whether relationship context matters, and how to make sharing useful without making it dangerously broad.</p>
<p>That is why Glia&#39;s social/share system is not a traditional &quot;share sheet + URL&quot; feature. It is a memory-sharing system built around <strong>people entities</strong> and controlled boundaries. The architecture below is what it takes to make that workable in production.</p>
<h2>What Glia Social / Share actually is</h2>
<p>Glia Social / Share is implemented through two different sharing models that look similar from the outside but mean very different things underneath:</p>
<ul>
<li><strong>Connection Share</strong> — <em>entity-scoped</em> relationship access.</li>
<li><strong>Single Story Share</strong> — <em>card-scoped</em> outbound access.</li>
</ul>
<p>That distinction is the foundation of the whole system. Collapsing the two into one mechanism would blur the semantics and weaken the permission model.</p>
<h2>The four semantic rules</h2>
<p>Glia&#39;s current social/share system is defined by four rules. They look compact, but they determine nearly every important design choice in the stack.</p>
<blockquote>
<p><strong>Final semantics</strong></p>
<ol>
<li>Connection grants access only to moment stories related to a specific people entity; it does not grant access to all of the owner&#39;s stories.</li>
<li>Single Story Share applies only to one moment story.</li>
<li>The current share-link is a public bearer token; it is not identity-bound to the viewer.</li>
<li>Theme stories are excluded from the current social/share system.</li>
</ol>
</blockquote>
<h2>Core concepts at a glance</h2>
<blockquote>
<p><strong>Connection Share</strong> — A long-term sharing boundary anchored to a <strong>people entity</strong>. Opens access only to the owner&#39;s moment stories that relate to that entity.</p>
</blockquote>
<blockquote>
<p><strong>Single Story Share</strong> — An explicit, card-scoped outbound share. One token, one story. It does not establish any relationship with the receiver.</p>
</blockquote>
<blockquote>
<p><strong>Public Bearer Token</strong> — The current share-link is not identity-bound. Anyone who holds the token can view that single story until it expires.</p>
</blockquote>
<blockquote>
<p><strong>Theme Story Excluded</strong> — Theme stories are deliberately outside this system today. Their narrative shape does not fit entity-scoped sharing.</p>
</blockquote>
<h2>Two business lines: Connection vs Single Story</h2>
<h3>Connection Share</h3>
<p>Suppose user A has a people entity in Glia representing a real person B. If A invites B and the connection is accepted (see also: <a href="https://dearartist.xyz/blog/glia-invite-flow">fixing Glia&#39;s social invite flow</a>), B does <em>not</em> get access to all of A&#39;s stories. Instead, B gets access only to A&#39;s moment stories that are specifically related to that people entity.</p>
<p>The permission anchor is not the owner account, and not the feed as a whole. It is <code>entity_id</code>. This makes connection feel less like a traditional social graph and more like a controlled sharing contract around a person-centered memory scope.</p>
<h3>Single Story Share</h3>
<p>Single Story Share is the explicit, card-level outbound share. A user can generate a share-link for one specific <strong>moment story</strong>. It is mainly used to send a memory outward — especially to someone who does not yet have access through connection.</p>
<ul>
<li>it applies to one card</li>
<li>it does not create a lasting relationship</li>
<li>it does not establish connection</li>
<li>it does not verify that the receiver is &quot;the person in the story&quot;</li>
</ul>
<p>In the current version, this link is a <strong>public bearer token</strong>. Anyone who has the token can view that one story. That is exactly why <em>share-link is not the same thing as connection</em>.</p>
<h2>Diagram 1 — Business semantics</h2>
<pre><code>┌──────────────────────────┬──────────────────────────┬──────────────────────────┐
│ Connection Share         │ Single Story Share       │ Theme Story              │
│ (entity-scoped)          │ (card-scoped)            │ (excluded)               │
├──────────────────────────┼──────────────────────────┼──────────────────────────┤
│ Anchored to a            │ One moment story per     │ Not part of social/share │
│ people entity            │ token                    │                          │
│                          │                          │                          │
│ Only related moment      │ Public bearer token      │ Different narrative      │
│ stories                  │                          │ shape                    │
│                          │                          │                          │
│ Per-card self_only       │ No recipient identity    │ May get its own model    │
│ override                 │ binding                  │ later                    │
└──────────────────────────┴──────────────────────────┴──────────────────────────┘
</code></pre>
<h2>The permission model</h2>
<p>The most important thing about this system is not that it has endpoints. It is that its permissions match its semantics.</p>
<blockquote>
<p><strong>Connection authorization</strong></p>
<p>A viewer can read a story via connection only if all of the following are true:</p>
<ul>
<li>there is an active connection</li>
<li>the connection owner matches the card owner</li>
<li>the connection recipient matches the viewer</li>
<li>the connection <code>entity_id</code> matches the card <code>entity_id</code></li>
<li>the card is a social moment story</li>
<li>the card is not marked <code>self_only</code></li>
</ul>
</blockquote>
<p><img src="https://dearartist.xyz/substack-visuals/glia-social-share/glia-social-share-ios-invite-accept.jpg" alt="iOS invite preview — &#39;Selin wants to share memories about you&#39; with Accept and Decline actions" /></p>
<p><em>Invite framed as &#39;memories about you&#39; — consent to a relationship, not a file.</em></p>
<p>This is what prevents a subtle but dangerous failure mode: <em>&quot;I have one connection to this owner, therefore I can read unrelated stories from the same owner.&quot;</em> That is exactly the kind of permission expansion the model is designed to avoid.</p>
<blockquote>
<p><strong>Share-link authorization</strong></p>
<p>The token-based path has a different shape. A viewer can read a story via share-link if:</p>
<ul>
<li>the token exists</li>
<li>the token is not expired</li>
<li>the token belongs to the same card</li>
<li>the card is a social moment story</li>
</ul>
<p>There is no viewer identity check in this path. That is why the link is a public bearer token.</p>
</blockquote>
<h2>System architecture</h2>
<p>At a high level, Glia splits the responsibility across three surfaces — the authenticated <strong>Social API</strong>, the public <strong>Social Web</strong> for landing pages, and a dedicated <strong>Social OG</strong> router for preview images. Underneath sits the policy and store layer, which is where the real business logic lives.</p>
<h3>Diagram 2 — System architecture</h3>
<pre><code>CLIENT             ROUTERS                  DOMAIN                  DATABASE
─────────          ─────────                ─────────               ─────────
iOS App      ──►   Social API Router  ──►   Social Store / Policy ──► entity_shares
Public Web   ──►   Social Web Router  ──►   Push Jobs            ──► story_share_tokens
                   Social OG Router                                   cards
                                                                      card_visibility_overrides
                                                                      user_profiles
                                                                      social_notifications
</code></pre>
<p><em>Routers delegate to the store / policy layer. The store is the single place where authorization rules live.</em></p>
<p>The system does not rely on routers alone to express semantics. It relies on a centralized domain layer to keep the rules coherent across iOS, web, and background jobs.</p>
<h2>Data model</h2>
<p>The data model is what makes the semantics enforceable.</p>
<h3><code>entity_shares</code></h3>
<p>Represents connection-level sharing. It captures the share owner, which people entity it is about, who accepted the connection, and the current state. The key point is that <code>entity_id</code> is the permission anchor. That is what makes connection sharing entity-scoped.</p>
<h3><code>story_share_tokens</code></h3>
<p>Represents single-story sharing — the sharer, the card, the token, the expiration. It does not represent a relationship. It represents a single-card grant.</p>
<h3><code>card_visibility_overrides</code></h3>
<p>Captures per-card deny rules. Even when a connection exists, the owner can still hide a specific card by marking it <code>self_only</code>. That preserves an important real-world truth: <em>not every memory about someone should automatically be visible to them.</em></p>
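<p>As a rough sketch, the three tables above could be modeled like this. The field names are assumptions inferred from the descriptions in this section, not the production schema.</p>

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class EntityShare:
    """Connection-level grant; entity_id is the permission anchor."""
    owner_user_id: str
    entity_id: str
    recipient_user_id: Optional[str]  # assumed to be unset until the invite is accepted
    status: str                       # "pending" or "active"


@dataclass(frozen=True)
class StoryShareToken:
    """Single-card grant; it carries no relationship."""
    token: str
    card_id: str
    sharer_user_id: str
    expires_at: datetime


@dataclass(frozen=True)
class CardVisibilityOverride:
    """Per-card deny rule set by the owner."""
    owner_user_id: str
    card_id: str
    visibility: str  # "self_only" hides the card from connections
```

<p>Keeping the relationship grant and the single-card grant as separate record types mirrors the semantic split: neither shape can be mistaken for the other.</p>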
<h2>Connection flow</h2>
<p>An invite is created by the owner, previewed by the recipient, and accepted before any feed access opens up. The state machine stays simple: <code>pending</code> → <code>active</code>.</p>
<h3>Diagram 3 — Connection flow</h3>
<pre><code>01  User A → API     POST /api/social/invites
02  API    → Store   create_invite(owner, entity_id)
03  Store  → DB      insert entity_share(status=pending)
04  API    → User A  returns invite_url
05  User A → User B  sends invite link
06  User B → API     GET /api/social/invites/{token}
07  Store  → DB      load preview (entity, owner, sample)
08  API    → User B  renders invite preview
09  User B → API     POST /api/social/invites/{token}/accept
10  Store  → DB      update status → active
11  API    → User B  connection created
</code></pre>
<p><img src="https://dearartist.xyz/substack-visuals/glia-social-share/glia-social-share-ios-notification-accepted.jpg" alt="iOS notification — &#39;Selin accepted your connection · 3 minutes ago&#39;" /></p>
<p><em>Step 11 in production — connection accepted, feed opens.</em></p>
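<p>The <code>pending</code> → <code>active</code> state machine is small enough to sketch directly. This is an illustrative version that uses a dict-shaped row; <code>transition</code> and <code>accept_invite</code> are hypothetical helper names, not the store&#39;s real API.</p>

```python
from enum import Enum


class ShareStatus(str, Enum):
    PENDING = "pending"
    ACTIVE = "active"


# The only transition described in this post: pending -> active.
ALLOWED = {(ShareStatus.PENDING, ShareStatus.ACTIVE)}


def transition(current, target):
    """Return the new status, or raise if the move is not allowed."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def accept_invite(share, viewer_user_id):
    """Accept a pending invite and record who accepted it."""
    new_status = transition(ShareStatus(share["status"]), ShareStatus.ACTIVE)
    return {**share, "status": new_status.value, "recipient_user_id": viewer_user_id}
```

<p>Listing the allowed transitions explicitly means that accepting an already active share fails loudly instead of silently rewriting the recipient.</p>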
<h2>Single Story Share flow</h2>
<p>The share-link path never touches connection state. It mints a token for one card, drops the receiver onto a public landing page, and forwards a token-bearing request into the detail endpoint.</p>
<h3>Diagram 4 — Single Story Share flow</h3>
<pre><code>01  User A → API     POST /api/social/share-link
02  API    → Store   create_or_get_share_link(card_id)
03  API    → User A  returns /s/{token}
04  User A → Viewer  sends link
05  Viewer → Web     GET /s/{token}
06  Web    → Store   get_by_share_token(token)
07  Web    → Viewer  renders public landing page
08  Viewer → API     GET /api/social/story/{card_id}?token=...
09  Store  → API     validate token (exists, not expired, matches card)
10  API    → Viewer  story content or 403
</code></pre>
<p><img src="https://dearartist.xyz/substack-visuals/glia-social-share/glia-social-share-ios-share-memory-panel.png" alt="iOS Share Memory bottom sheet showing &#39;people in this memory&#39; and a generated invite link with copy action" /></p>
<p><em>Step 1 in production — the panel that mints /s/{token}.</em></p>
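<p>Step 02 of the flow, <code>create_or_get_share_link</code>, might look roughly like the sketch below. The in-memory dict stands in for <code>story_share_tokens</code>, and the seven-day lifetime and token length are assumptions; the post does not state the real values.</p>

```python
import secrets
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(days=7)  # assumed lifetime, not stated in the post
_tokens_by_card = {}           # in-memory stand-in for story_share_tokens


def create_or_get_share_link(card_id, sharer_user_id, now=None):
    """Return the existing unexpired link for a card, or mint a new one."""
    now = now or datetime.now(timezone.utc)
    record = _tokens_by_card.get(card_id)
    # A token counts as expired once the current time has passed expires_at.
    if record is None or now > record["expires_at"]:
        record = {
            "token": secrets.token_urlsafe(32),  # unguessable bearer token
            "card_id": card_id,
            "sharer_user_id": sharer_user_id,
            "expires_at": now + TOKEN_TTL,
        }
        _tokens_by_card[card_id] = record
    return "/s/" + record["token"]
```

<p>Because the link is a bearer token, its only protection is that it cannot be guessed, which is why the sketch draws it from <code>secrets</code> rather than a sequential ID.</p>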
<h2>Key code patterns</h2>
<p>The authorization rules above translate into very small, legible functions. The boundaries do most of the work.</p>
<h3>Social moment story detection</h3>
<pre><code class="language-python">def is_social_moment_story(card):
    return (
        card.type == &quot;story&quot;
        and card.entity_id is not None
    )
</code></pre>
<p>Two conditions are enough to define the boundary: it must be a story card and it must be anchored to a people entity. This naturally excludes theme stories from the current social/share system.</p>
<h3>Connection authorization</h3>
<pre><code class="language-python">def can_view_via_connection(viewer_user_id, card):
    # Look up an active share for this exact (owner, viewer, entity) triple.
    share = find_active_entity_share(
        owner_user_id=card.user_id,
        recipient_user_id=viewer_user_id,
        entity_id=card.entity_id,
    )
    if not share:
        return False
    # The owner can still hide an individual card from connections.
    if has_self_only_override(card.user_id, card.id):
        return False
    return is_social_moment_story(card)
</code></pre>
<p>The critical line is <code>entity_id=card.entity_id</code>. That is what prevents owner-wide leakage and keeps connection access entity-scoped.</p>
<h3>Share-link authorization</h3>
<pre><code class="language-python">def can_view_via_share_token(token, card):
    record = get_story_share_token(token)
    # Unknown token: deny.
    if not record:
        return False
    # A token grants access to exactly one card.
    if record.card_id != card.id:
        return False
    # Expired token: deny.
    if record.expires_at &lt; now():
        return False
    return is_social_moment_story(card)
</code></pre>
<p>Notice what is missing: no viewer identity check, no connection requirement, no recipient matching. This is why the share-link is a public bearer token.</p>
<h3><code>self_only</code> visibility control</h3>
<pre><code class="language-python">def visible_in_connection_feed(card):
    if not is_social_moment_story(card):
        return False
    if has_self_only_override(card.user_id, card.id):
        return False
    return True
</code></pre>
<p>Connection defines the default sharing scope. <code>self_only</code> defines the owner&#39;s right to narrow that scope at the card level. Small but conceptually important.</p>
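<p>To see the entity scoping at work, here is a self-contained toy version of the connection check, run against four cards from the same owner. The in-memory sets and IDs are made up for illustration; the real lookups go through the store.</p>

```python
from collections import namedtuple

Card = namedtuple("Card", "id user_id entity_id type")

# Hypothetical state: one active connection from owner_a to viewer_b about entity e_b.
ACTIVE_SHARES = {("owner_a", "viewer_b", "e_b")}
SELF_ONLY = {("owner_a", "c3")}


def is_social_moment_story(card):
    return card.type == "story" and card.entity_id is not None


def can_view_via_connection(viewer_user_id, card):
    key = (card.user_id, viewer_user_id, card.entity_id)
    if key not in ACTIVE_SHARES:
        return False
    if (card.user_id, card.id) in SELF_ONLY:
        return False
    return is_social_moment_story(card)


cards = [
    Card("c1", "owner_a", "e_b", "story"),  # about B: visible
    Card("c2", "owner_a", "e_x", "story"),  # about someone else: hidden
    Card("c3", "owner_a", "e_b", "story"),  # about B but self_only: hidden
    Card("c4", "owner_a", None, "story"),   # no entity anchor: hidden
]
visible = [c.id for c in cards if can_view_via_connection("viewer_b", c)]
print(visible)  # ['c1']
```

<p>Only <code>c1</code> survives. Holding a connection to the owner is not enough on its own; the card has to be about the connected entity and must not be marked <code>self_only</code>.</p>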
<h2>Supporting surfaces: landing pages, OG, push</h2>
<p>A complete share system is not just an API. To feel like a real product, it also needs a few well-aligned surrounding surfaces.</p>
<h3>Public landing pages</h3>
<p><code>/s/{token}</code> is how a shared memory enters the outside world. It gives the receiver a readable preview, a path into the app, and a stable handoff point between web and native UX.</p>
<h3>Open Graph images</h3>
<p>Invite links and story-share links are not the same thing. An invite means &quot;I want to establish a connection.&quot; A story share means &quot;I want to send you one memory.&quot; They deserve different OG previews because they mean different things — and the previews are part of how users learn what each link is.</p>
<h3>Push behavior</h3>
<p>Notifications also need semantic integrity. For example, <code>new_shared_card</code> should respect daily push limits, while <code>connection_accepted</code> should not be blocked by the same budget. This sounds like an implementation detail, but it is really part of the product contract.</p>
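<p>One way to express that contract is a budget that only some notification types consume. This is a minimal sketch: the limit of three per day and the in-memory counter are assumptions, and only the two notification type names come from the text above.</p>

```python
from collections import defaultdict
from datetime import date

DAILY_PUSH_LIMIT = 3                  # assumed value, not stated in the post
BUDGETED_TYPES = {"new_shared_card"}  # types that count against the daily limit
_sent = defaultdict(int)              # (user_id, day) -> budgeted pushes sent


def should_send_push(user_id, notification_type, today=None):
    """Decide whether a push goes out; only budgeted types consume the budget."""
    today = today or date.today()
    if notification_type not in BUDGETED_TYPES:
        return True  # e.g. connection_accepted is never blocked by the budget
    key = (user_id, today)
    if _sent[key] >= DAILY_PUSH_LIMIT:
        return False
    _sent[key] += 1
    return True
```

<p>Making the budgeted set explicit keeps the rule visible in one place, so a new notification type has to be opted in to the limit deliberately.</p>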
<h2>Product and engineering tradeoffs</h2>
<h3>Why connection and share-link must be separate</h3>
<p>This is the most important design decision in the whole system. If connection and share-link are collapsed into one mechanism, everything gets blurry: is this relationship access? a temporary grant? an invitation? a durable permission? By keeping them separate, the system stays legible: <strong>connection = entity-scoped relationship access</strong>; <strong>share-link = card-scoped outbound access</strong>.</p>
<h3>Why share-link is not recipient-bound</h3>
<p>A natural question: if the story is about someone, why not bind the share-link to that recipient? The answer is not that it is impossible. It is that it is not the right tradeoff for the current version. Recipient-bound links would require solving much heavier problems:</p>
<ul>
<li>how to identify the intended viewer before login</li>
<li>whether each person gets a unique token</li>
<li>how to model a story that references multiple people</li>
<li>what to do when links are forwarded</li>
<li>how native app, web landing, install flow, and authentication all connect</li>
</ul>
<p>That is an entirely different system. The current design is intentionally narrower: connection handles relationship access, share-link handles explicit card sharing, and the link is public but the scope is narrow. That keeps the model simple and the rollout practical.</p>
<h3>Why theme stories are excluded</h3>
<p>This is not just an implementation gap. It is a deliberate boundary. Moment stories are closely tied to specific people, events, and relational context. Theme stories are more abstract, more synthesized, and often span broader narrative terrain. The current social/share system is intentionally optimized for <em>people-related moment stories</em>, where entity-scoped sharing is coherent. Theme stories may eventually deserve their own sharing model — but reusing this one would blur the semantics.</p>
<h3>Why pending invite tri-state is intentionally not implemented yet</h3>
<p>A tri-state model (e.g. <code>none</code> / <code>pending</code> / <code>connected</code>) in the iOS UI sounds like a small change, but it pulls in invite ownership semantics, sender-vs-recipient pending states, expiration UI, and resolution flows on both ends. Until those are designed end-to-end, the iOS surface keeps the simpler binary &quot;connected vs not connected&quot; model. It is better to be honest about that than to ship a third state that does not have a coherent product story yet.</p>
<h2>What Glia explicitly does not support yet</h2>
<p>Some systems become confusing because they quietly imply features they do not really have. This one is better understood by being explicit.</p>
<ul>
<li><strong>Pending invite tri-state.</strong> The current iOS model is still effectively connected vs not connected.</li>
<li><strong>Recipient-bound share links.</strong> Current links are public bearer tokens; they are not identity-bound.</li>
<li><strong>Theme story socialization.</strong> Theme stories are excluded from this sharing model entirely.</li>
</ul>
<p>Calling these out matters. It keeps the system honest and it keeps the boundaries from drifting one PR at a time.</p>
<h2>Closing reflection</h2>
<p>What makes Glia Social / Share interesting is not that it adds a share button. It is that it treats memory as a different kind of object. A memory (see also: <a href="https://dearartist.xyz/blog/glia-actually-remembers">how Glia actually remembers</a>) is not just content. It is content with relationship context. It is often about someone. And that means sharing it cannot be modeled as generic URL access alone.</p>
<p>Glia&#39;s answer is to separate two ideas cleanly: <strong>Connection</strong> for ongoing, entity-scoped sharing, and <strong>Single Story Share</strong> for explicit, card-scoped outbound sharing. That separation is what makes the system useful without making it careless.</p>
<blockquote>
<p>Glia does not simply send content outward. It delivers memories about a person to the right people, within the right boundary.</p>
</blockquote>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/glia-social-share">https://dearartist.xyz/blog/glia-social-share</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>How Glia Actually Remembers</title>
      <link>https://dearartist.xyz/blog/glia-actually-remembers</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-actually-remembers</guid>
      <pubDate>Mon, 20 Apr 2026 10:00:00 GMT</pubDate>
      <description><![CDATA[The memory system inside Glia, in practice.]]></description>
      <category>glia</category>
      <category>ai-memory</category>
      <content:encoded><![CDATA[<p><em>Raw messages, async patches, hybrid retrieval, and the evidence pipeline behind continuity.</em></p>
<p>When a user says, &quot;<a href="https://gliahq.com/">Glia ↗</a> remembered something I mentioned weeks ago,&quot; it is tempting to explain that experience with a single word: <em>memory</em>.</p>
<p>In a production system, memory is rarely one thing. Sometimes it means durable storage. Sometimes summarization. Sometimes retrieval. Sometimes the model is only seeing a carefully assembled slice of prior evidence and turning that into a fluent reply. These mechanisms can look identical from the outside while behaving very differently under the hood.</p>
<p>In Glia, continuity does not come from stuffing full history into the model on every turn. It comes from a layered architecture: raw messages are persisted, recent windows are transformed into structured memory artifacts, those artifacts are projected into a searchable index, and each new chat request builds a bounded evidence payload that the writer model uses as grounding.</p>
<blockquote>
<p><strong>Core idea</strong></p>
<p>Glia does not remember by keeping everything in context. It remembers by turning conversations into retrievable memory artifacts.</p>
</blockquote>
<p><img src="https://dearartist.xyz/substack-visuals/glia-actually-remembers/glia-actually-remembers-hero-pipeline.png" alt="Pipeline diagram: messages → memory_patches → memory_items → memory_evidence_json → reply" /></p>
<p><em>Pipeline / conversation → evidence → reply.</em></p>
<h2>Diagram 1 — Memory architecture overview</h2>
<pre><code>WRITE PATH                      READ PATH
──────────                      ─────────
User message                    New chat request
    ↓                                ↓
messages                        planner / fallback
    ↓                                ↓
extract_memory_patch_task       build_evidence_pack
    ↓                                ↓
memory_patches                  hybrid_retrieve
    ↓                                ↓
ingest_from_patch               memory_evidence_json
    ↓                                ↓
memory_items                    llm_chat_stream
                                     ↓
                                 reply
</code></pre>
<p><em>Stories · themes · entities also feed memory_items.</em></p>
<h2>Why this distinction matters</h2>
<p>Products that feel personal are easy to mis-explain. If a team talks about AI memory as if it were human memory, it becomes harder to reason about failure modes, observability, privacy boundaries, and product trust. Talking about memory in terms of persistence, extraction, indexing, retrieval scopes, and prompt contracts makes the system easier to debug and easier to improve.</p>
<p>This is not a philosophical essay about AI memory. It is a technical description of what memory currently means in Glia&#39;s deployed architecture: what is stored, what is derived, what is retrieved, and what the model actually sees before it replies.</p>
<h2>What users call memory</h2>
<p>From the user&#39;s perspective, memory is any behavior that makes the assistant feel continuous. It may name a person mentioned once before. It may reconnect to an earlier emotional thread. It may surface a detail that the user did not repeat in the current conversation. The engineering task is to separate that experience from the mechanisms that create it.</p>
<p>In Glia today, those mechanisms include several distinct artifact types: <code>messages</code> as the source of truth for raw text; <code>memory_patches</code> as structured post-chat extractions; <code>memory_items</code> as a unified retrieval index for patch-, story-, and note-derived content; <code>entities</code>, <code>aliases</code>, and <code>profiles</code> that stabilize recurring people, concepts, and organizations; and a request-time <code>memory_evidence_json</code> payload that becomes part of the writer model&#39;s prompt context.</p>
<p>The last piece is the most important. The assistant&#39;s final language is still generated by a model, but the product&#39;s practical memory behavior depends heavily on what evidence is assembled and injected before generation begins.</p>
<h2>Diagram 2 — Memory layers</h2>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Name</th>
<th>Code</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td>L1</td>
<td>Raw conversation</td>
<td><code>messages</code></td>
<td>Source of truth for what was actually said.</td>
</tr>
<tr>
<td>L2</td>
<td>Structured memory</td>
<td><code>memory_patches</code></td>
<td>Async extractions distilled from recent windows.</td>
</tr>
<tr>
<td>L3</td>
<td>Searchable index</td>
<td><code>memory_items</code></td>
<td>Unified retrieval substrate across sources.</td>
</tr>
<tr>
<td>L4</td>
<td>Narrative artifacts</td>
<td><code>stories · themes · entities</code></td>
<td>Longer-horizon structures that also feed memory.</td>
</tr>
<tr>
<td>L5</td>
<td>Request-time evidence</td>
<td><code>memory_evidence_json</code></td>
<td>Curated payload assembled per turn.</td>
</tr>
<tr>
<td>L6</td>
<td>Final model response</td>
<td><code>llm_chat_stream</code></td>
<td>Generated under transcript and evidence constraints.</td>
</tr>
</tbody></table>
<h2>The architecture in one sentence</h2>
<p>Glia&#39;s current memory system is a split write/read architecture:</p>
<ul>
<li>the write path persists raw messages, extracts structured memory patches, and projects them into retrievable memory items;</li>
<li>the read path gathers relevant memory artifacts at request time and passes them to the model as bounded evidence.</li>
</ul>
<p>That division turns out to be the cleanest way to understand the system.</p>
<h2>Write path: from messages to memory artifacts</h2>
<p>The write path begins with persistence. Every conversation turn is stored in <code>messages</code>. This remains the canonical record of what was actually said. If a later memory artifact exists, it ultimately traces back here.</p>
<p>After a turn completes, the background task <code>extract_memory_patch_task</code> runs. It loads a recent message window, generates a structured memory patch, and stores it in <code>memory_patches</code> with metadata including <code>conversation_id</code>, <code>story_thread_id</code>, success state, and <code>supporting_message_ids</code>.</p>
<p>This is the first important compression step. Instead of relying on future prompts to carry large raw transcript segments, the system distills a recent conversational slice into a more structured representation.</p>
<p>If the patch is valid, the next step is projection. <code>ingest_from_patch</code> converts structured patch content into one or more <code>memory_items</code> rows. These rows become part of the retrieval substrate used later by Memory V2. A fact, event, or relationship is no longer only buried in chat logs — it becomes an indexed artifact.</p>
<p>There is a second write path beyond chat patches. Story pipelines, theme pipelines, and entity refresh jobs can also produce durable artifacts that later become retrievable. Stories, in particular, may be ingested into <code>memory_items</code>, which lets longer-horizon narrative structure participate in the same memory retrieval layer as patch-derived artifacts.</p>
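<p>The projection step can be sketched as a pure function from one patch to a list of index rows. This is not the real <code>ingest_from_patch</code>. The <code>source_id</code> shape follows the redacted trace later in this post, while the <code>event</code> item type and the <code>content</code> field are assumptions.</p>

```python
# (payload key, source_id segment, item_type); the "event" item type is assumed.
PROJECTIONS = (
    ("entities", "entity", "fact"),
    ("events", "event", "event"),
)


def ingest_from_patch(patch):
    """Project one patch into memory_items-shaped rows."""
    if not patch.get("ok"):
        return []  # failed patches are not projected
    payload = patch.get("payload_json", {})
    rows = []
    for key, segment, item_type in PROJECTIONS:
        for index, entry in enumerate(payload.get(key, [])):
            rows.append({
                "source_type": "chat_patch",
                # Each row keeps a pointer back to the patch it came from.
                "source_id": f"{patch['id']}:{segment}:{index}",
                "item_type": item_type,
                "content": entry,
            })
    return rows
```

<p>Because every row carries its <code>source_id</code>, an indexed artifact can always be traced back to its patch, and from there to the supporting messages.</p>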
<h2>Read path: from memory artifacts to prompt evidence</h2>
<p>The read path happens inside <code>POST /api/chat</code>. Before the writer model generates a reply, Glia constructs an evidence bundle. That bundle can include:</p>
<ul>
<li>recent <code>memory_patches</code> across conversation, user, and story-thread scopes;</li>
<li>entity lookups when the current turn contains known names;</li>
<li>temporal message windows for time-oriented prompts;</li>
<li>and Memory V2 hybrid retrieval results from <code>memory_items</code>.</li>
</ul>
<p>This evidence is serialized into <code>memory_evidence_json</code> and passed into the chat streaming layer alongside only a limited tail of recent transcript messages. The model is not receiving unbounded history. It is receiving a curated, typed summary of what the application thinks is relevant.</p>
<p>This is the key to the product&#39;s continuity. The system does not rely on the model to preserve everything across sessions. The application reconstructs relevant context on each turn.</p>
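<p>A minimal sketch of that assembly step, assuming the evidence sources have already been fetched. The function name, the caps, and the JSON keys are illustrative; only <code>memory_evidence_json</code> and the idea of a limited transcript tail come from the description above.</p>

```python
import json

TRANSCRIPT_TAIL = 8  # assumed number of recent messages passed verbatim


def build_chat_inputs(messages, patches, entity_hits, retrieved_items, max_items=12):
    """Assemble bounded prompt inputs: typed evidence plus a short transcript tail."""
    evidence = {
        "memory_patches": patches[:max_items],
        "entities": entity_hits[:max_items],
        "memory_items": retrieved_items[:max_items],
    }
    memory_evidence_json = json.dumps(evidence, ensure_ascii=False)
    # Only the tail of the transcript travels with the evidence.
    return memory_evidence_json, messages[-TRANSCRIPT_TAIL:]
```

<p>Both outputs are capped, so the prompt size stays bounded no matter how long the conversation history grows.</p>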
<h2>Where the evidence enters generation</h2>
<p>The bridge between retrieval and generation is explicit in code:</p>
<pre><code class="language-python"># app/api/chat.py
base_stream = llm_chat_stream(
    user_id,
    conversation_id,
    topic_id,
    message_text,
    memory_evidence_json,
    ...
)
</code></pre>
<p>The writer model receives a curated evidence payload, not full history. Memory is an application-level construct injected into the prompt, not an invisible model property.</p>
<h2>The fallback path is more important than it looks</h2>
<p>Memory planning is not always LLM-driven. A conservative gate sits in front of the LLM-based read planner: if the user turn does not strongly resemble recall, temporal lookup, reflection, or an explicit &quot;who is X?&quot; question, the system stays on the fast path.</p>
<p>The fast path is not empty. It includes a deterministic fallback retrieval plan that pulls recent structured memory at multiple scopes:</p>
<pre><code class="language-python"># app/api/chat.py
qs.append({&quot;type&quot;: &quot;memory_patches_recent&quot;, &quot;scope&quot;: &quot;conversation&quot;, &quot;days&quot;: 30, &quot;limit&quot;: 12, ...})
qs.append({&quot;type&quot;: &quot;memory_patches_recent&quot;, &quot;scope&quot;: &quot;user&quot;,         &quot;days&quot;: 90, &quot;limit&quot;: 12, ...})
if story_thread_id:
    qs.append({&quot;type&quot;: &quot;memory_patches_recent&quot;, &quot;scope&quot;: &quot;story_thread&quot;, &quot;days&quot;: 30, &quot;limit&quot;: 12, ...})
</code></pre>
<p>When the planner is skipped, this deterministic plan still pulls structured memory across three scopes — conversation, user, and story thread.</p>
<p>Continuity is not waiting for a special memory mode. Even without planner involvement, recent structured memory can re-enter the evidence pack through normal fallback retrieval. In practice, that makes the fallback path one of the most important memory features in the system.</p>
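<p>Put together, the gate and the fallback could be sketched as follows. The cue patterns are invented for illustration, since the post does not show the real gate&#39;s rules, and the <code>story_thread_id</code> field on the third query is an assumption. The scopes, day windows, and limits mirror the snippet above.</p>

```python
import re

# Hypothetical cue patterns standing in for the real gate.
RECALL_CUES = re.compile(
    r"\b(remember|last time|you said|earlier|yesterday|last week|who is)\b",
    re.IGNORECASE,
)


def needs_planner(message_text):
    """Conservative gate: only recall-like turns reach the LLM read planner."""
    return bool(RECALL_CUES.search(message_text))


def fallback_plan(story_thread_id=None):
    """Deterministic plan: recent patches at conversation, user, and thread scope."""
    qs = [
        {"type": "memory_patches_recent", "scope": "conversation", "days": 30, "limit": 12},
        {"type": "memory_patches_recent", "scope": "user", "days": 90, "limit": 12},
    ]
    if story_thread_id:
        qs.append({"type": "memory_patches_recent", "scope": "story_thread",
                   "days": 30, "limit": 12, "story_thread_id": story_thread_id})
    return qs


def plan_reads(message_text, story_thread_id=None):
    if needs_planner(message_text):
        # The LLM planner would be called here; this sketch leaves it out.
        return {"plan_source": "planner", "queries": None}
    return {"plan_source": "fallback", "queries": fallback_plan(story_thread_id)}
```

<p>An ordinary turn such as &quot;I had pasta for lunch&quot; stays on the fast path in this sketch, yet still leaves with a two- or three-query plan.</p>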
<h2>A redacted traced example</h2>
<blockquote>
<p><strong>About the example</strong></p>
<p>This article uses a redacted QA case study. Message text, user-identifiable details, and sensitive context have been removed. The goal is to explain the architecture and lineage of memory artifacts, not to expose user data.</p>
</blockquote>
<p>In a recent QA trace, three messages from a single conversation (<code>A</code>, <code>B</code>, <code>C</code>) were later linked to a successful patch row:</p>
<ul>
<li><code>memory_patches.id = mp_8ae2…124f</code></li>
<li><code>supporting_message_ids = {A, B, C}</code></li>
<li><code>story_thread_id = sth_b3be…63d</code></li>
</ul>
<p>The patch payload contained one extracted entity and two events. A blog-safe anchor in that payload was a public literary entity. That patch was then projected into <code>memory_items</code>, including rows whose <code>source_id</code> was shaped like <code>mp_8ae2…124f:entity:0</code>, carrying the patch&#39;s structured fact representation.</p>
<p>Later, a <a href="https://dearartist.xyz/blog/glia-thread-debugging"><code>memory_read_runs</code></a> row for the same user showed a fallback read plan whose JSON matched the deterministic strategy in code: conversation-scoped patches, user-scoped patches, and story-thread patches including the same thread identifier.</p>
<h3>Diagram 3 — Redacted memory trace</h3>
<pre><code>01  Earlier conversation
       ↓
02  message A · B · C
       ↓
03  structured patch created — mp_8ae2…124f
       ↓
04  projected into memory item — mi_5ac4…13f7d
       ↓
05  later chat request
       ↓
06  fallback / hybrid retrieval
       ↓
07  evidence injected — model can reference prior detail
</code></pre>
<p>This gives a concrete chain: <em>messages → memory_patches → memory_items → read plan → evidence eligibility</em>. What the chain does not establish on its own is that a later assistant reply explicitly named the artifact in a user-visible response. That stronger claim would require a tighter request-level reconstruction.</p>
<p>The narrower, more precise statement: the storage path is proven, the retrieval path is proven, and later resurfacing through patch recall or hybrid retrieval is technically supported by the deployed system.</p>
<p><strong>Redacted artifacts</strong></p>
<pre><code class="language-json">// memory_patches
{
  &quot;id&quot;: &quot;mp_8ae2…124f&quot;,
  &quot;supporting_message_ids&quot;: [&quot;A&quot;, &quot;B&quot;, &quot;C&quot;],
  &quot;story_thread_id&quot;: &quot;sth_b3be…63d&quot;,
  &quot;ok&quot;: true,
  &quot;payload_json&quot;: {
    &quot;entities&quot;: [{&quot;display_name&quot;: &quot;&lt;public literary entity&gt;&quot;}],
    &quot;events&quot;:   [{&quot;event_type&quot;: &quot;life_event&quot;}]
  }
}
</code></pre>
<pre><code class="language-json">// memory_items
{
  &quot;id&quot;: &quot;mi_5ac4…13f7d&quot;,
  &quot;source_type&quot;: &quot;chat_patch&quot;,
  &quot;source_id&quot;:   &quot;mp_8ae2…124f:entity:0&quot;,
  &quot;item_type&quot;:   &quot;fact&quot;,
  &quot;content_text&quot;:&quot;&lt;redacted summary referencing public literary entity&gt;&quot;,
  &quot;embedding&quot;:   null
}
</code></pre>
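<p>The <code>source_id</code> in the row above suggests a simple projection rule: one memory item per extracted entity or event, keyed by patch id, kind, and position. A minimal sketch, assuming that rule; <code>project_patch</code>, the <code>:event:N</code> suffix, and the <code>record</code> field are illustrative names, not Glia&#39;s actual code:</p>
<pre><code class="language-python">def project_patch(patch):
    """Flatten one patch payload into memory-item rows (hypothetical helper)."""
    payload = patch.get("payload_json") or {}
    rows = []
    # (payload key, source_id segment). Only "entity" appears in the trace;
    # "event" is an assumed counterpart.
    for key, kind in (("entities", "entity"), ("events", "event")):
        for index, record in enumerate(payload.get(key) or []):
            rows.append({
                "source_type": "chat_patch",
                "source_id": f"{patch['id']}:{kind}:{index}",
                "item_type": "fact",
                "record": record,
                "embedding": None,  # may be filled in later, or never
            })
    return rows


patch = {
    "id": "mp_example",
    "payload_json": {
        "entities": [{"display_name": "example entity"}],
        "events": [{"event_type": "life_event"}],
    },
}
for row in project_patch(patch):
    print(row["source_id"])
</code></pre>
<p>Keeping the patch id inside <code>source_id</code> means any memory item can be traced back to its patch, and from there to <code>supporting_message_ids</code>.</p>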
<pre><code class="language-json">// memory_read_runs
{
  &quot;plan_source&quot;: &quot;fallback&quot;,
  &quot;memory_patches_n&quot;: 12,
  &quot;plan_json&quot;: {
    &quot;queries&quot;: [
      {&quot;scope&quot;: &quot;conversation&quot;,  &quot;days&quot;: 30},
      {&quot;scope&quot;: &quot;user&quot;,          &quot;days&quot;: 90},
      {&quot;scope&quot;: &quot;story_thread&quot;,  &quot;days&quot;: 30}
    ]
  }
}
</code></pre>
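<p>Because the fallback plan is deterministic, it can be written as a pure function of the request context. A minimal sketch that reproduces the shape of the <code>plan_json</code> above; <code>build_fallback_plan</code> is a hypothetical name, and dropping the story-thread query when no thread is attached is an assumption:</p>
<pre><code class="language-python">import json


def build_fallback_plan(story_thread_id=None):
    """Return a fixed read plan; no model call is involved."""
    queries = [
        {"scope": "conversation", "days": 30},
        {"scope": "user", "days": 90},
    ]
    if story_thread_id is not None:
        # Assumed: thread-scoped recall only applies when a thread is attached.
        queries.append({"scope": "story_thread", "days": 30})
    return {"plan_source": "fallback", "plan_json": {"queries": queries}}


print(json.dumps(build_fallback_plan("sth_example"), indent=2))
</code></pre>
<p>The same inputs always produce the same plan, which is what makes a stored <code>memory_read_runs</code> row checkable against the code.</p>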
<h2>Hybrid retrieval is the real production story</h2>
<p>It is easy to describe memory retrieval as vector search and move on. Memory V2 in Glia is more honest and more robust than that. The retrieval logic combines vector recall, keyword recall, and weighted fusion — a pattern often called <a href="https://www.pinecone.io/learn/hybrid-search-intro/">hybrid retrieval ↗</a>:</p>
<pre><code class="language-python"># app/core/memory_retrieval_v2.py
# Stage 1: vector recall
if query_embedding:
    vector_results = vector_search(...)

# Stage 2: keyword recall
kw_items = keyword_search(...)

# Weighted fusion of the per-item signals
score = (
    W_SEMANTIC * semantic
    + W_KEYWORD * keyword
    + W_RECENCY * recency
    + W_SOURCE_QUALITY * source_q
    + W_IMPORTANCE * importance
)
</code></pre>
<p>Vector and keyword recall are fused with recency, source quality, and importance weights — not a single similarity score.</p>
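<p>To make the fusion concrete, here is a self-contained sketch of the scoring step. The weight values and the <code>fuse</code> helper are assumptions chosen for illustration, not the values in <code>memory_retrieval_v2.py</code>. The point is that an item with no embedding still gets a score from its other signals:</p>
<pre><code class="language-python"># Illustrative weights only; the production values are not shown in this post.
W_SEMANTIC = 0.45
W_KEYWORD = 0.25
W_RECENCY = 0.15
W_SOURCE_QUALITY = 0.10
W_IMPORTANCE = 0.05


def fuse(semantic=None, keyword=0.0, recency=0.0, source_q=0.0, importance=0.0):
    """Combine per-item signals (each assumed to be in [0, 1]) into one score."""
    if semantic is None:
        semantic = 0.0  # no embedding, so vector recall contributes nothing
    return (
        W_SEMANTIC * semantic
        + W_KEYWORD * keyword
        + W_RECENCY * recency
        + W_SOURCE_QUALITY * source_q
        + W_IMPORTANCE * importance
    )


candidates = {
    "embedded item": fuse(semantic=0.8, keyword=0.2, recency=0.3, source_q=0.5, importance=0.5),
    "no-embedding item": fuse(semantic=None, keyword=0.9, recency=0.9, source_q=0.5, importance=0.5),
}
for name, score in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {name}")
</code></pre>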
<blockquote>
<p><strong>Production nuance</strong></p>
<p>Not every memory item has an embedding. In production systems, hybrid retrieval matters because vector coverage is often incomplete. Keyword overlap and recency-based patch recall remain important parts of the memory path.</p>
</blockquote>
<p>A system that assumes perfect vector coverage is elegant in theory. A system that combines vector and lexical retrieval is usually stronger in practice.</p>
<h2>Deterministic, heuristic, and LLM-driven layers</h2>
<p>The cleanest way to reason about Glia&#39;s memory stack is to separate its different kinds of logic:</p>
<table>
<thead>
<tr>
<th>Stage</th>
<th>Kind</th>
</tr>
</thead>
<tbody><tr>
<td>Message persistence</td>
<td>Deterministic</td>
</tr>
<tr>
<td>Fallback retrieval plan</td>
<td>Deterministic</td>
</tr>
<tr>
<td>Patch extraction</td>
<td>LLM-driven</td>
</tr>
<tr>
<td>Story / theme composition</td>
<td>LLM-driven</td>
</tr>
<tr>
<td>Hybrid ranking</td>
<td>Heuristic</td>
</tr>
<tr>
<td>Final reply</td>
<td>LLM under constraints</td>
</tr>
</tbody></table>
<p>A concise way to put it: <em>the database remembers, the retriever selects, the model interprets under constraints.</em> That is more useful than saying &quot;the model remembered.&quot;</p>
<h2>Why this feels seamless to users</h2>
<p>The complexity is hidden at the right layer. Users do not see patch ids, retrieval plans, or evidence JSON. They see a final reply that incorporates relevant prior context. Because retrieval happens before generation, and because the writer contract encourages proactive use of memory evidence, the output reads like direct recall rather than assembled grounding.</p>
<blockquote>
<p><strong>Why it feels seamless</strong></p>
<p>The product feels continuous because retrieval happens before writing, and the writer model sees a bounded evidence payload rather than full history.</p>
</blockquote>
<h2>Tradeoffs and failure modes</h2>
<ul>
<li><strong>Compression risk.</strong> Structured patches are compact and retrievable, but they can be wrong or subtly distorted (see also: <a href="https://dearartist.xyz/blog/glia-token-cost">Reducing token cost</a>).</li>
<li><strong>Durability.</strong> A detail only survives over time if it remains accessible through patches, memory items, stories, entities, or notes.</li>
<li><strong>Observability.</strong> Internal rows like <code>memory_read_runs</code> provide useful telemetry, but user-facing provenance is still limited.</li>
<li><strong>Retrieval coverage.</strong> Null embeddings mean not every memory item participates equally in vector search, which raises the importance of lexical retrieval and recency-based patch recall.</li>
<li><strong>Telemetry ambiguity.</strong> Some read-side signals still blur the difference between &quot;planner skipped&quot; and &quot;planner failed,&quot; which makes debugging harder than it needs to be.</li>
</ul>
<p>These are not signs that the architecture is weak. They are signs that memory in production is a systems problem, not a slogan.</p>
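<p>The telemetry ambiguity listed above is cheap to remove at the point where the plan is chosen. A sketch under assumptions: <code>PlanOutcome</code>, <code>choose_plan</code>, and the idea of passing the planner in as a callable are all hypothetical, not Glia&#39;s interfaces:</p>
<pre><code class="language-python">from enum import Enum


class PlanOutcome(Enum):
    PLANNER_OK = "planner_ok"
    PLANNER_SKIPPED = "planner_skipped"  # planner deliberately not run
    PLANNER_FAILED = "planner_failed"    # planner ran and raised or returned nothing


def choose_plan(planner, fallback_plan, planner_enabled=True):
    """Return (plan, outcome) so the read-run row records why fallback was used."""
    if not planner_enabled:
        return fallback_plan, PlanOutcome.PLANNER_SKIPPED
    try:
        plan = planner()
    except Exception:
        return fallback_plan, PlanOutcome.PLANNER_FAILED
    if not plan:
        return fallback_plan, PlanOutcome.PLANNER_FAILED
    return plan, PlanOutcome.PLANNER_OK


def broken_planner():
    raise TimeoutError("planner timed out")


fallback = {"queries": [{"scope": "conversation", "days": 30}]}
print(choose_plan(broken_planner, fallback)[1].value)
print(choose_plan(broken_planner, fallback, planner_enabled=False)[1].value)
</code></pre>
<p>Both calls return the same fallback plan, but the recorded outcome differs, which is the distinction a bare <code>plan_source = fallback</code> value cannot express.</p>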
<h2>What this architecture gets right</h2>
<ul>
<li>It avoids the cost and unreliability of sending full history to the model every turn.</li>
<li>It preserves raw source-of-truth messages while creating compact derived artifacts.</li>
<li>It separates write-time extraction from read-time retrieval.</li>
<li>It allows narrative systems like stories and entities to feed into the same memory substrate.</li>
<li>It preserves lineage through <code>supporting_message_ids</code>, which is far better than treating all summaries as free-floating facts.</li>
</ul>
<h2>What should improve next</h2>
<ul>
<li>Telemetry should distinguish more clearly between &quot;planner skipped&quot; and &quot;planner failed.&quot;</li>
<li>Internal tooling should expose top evidence artifacts per request, including patch and memory item ids.</li>
<li>Embedding coverage should be treated as an operational metric, not an invisible implementation detail.</li>
<li>Lightweight provenance UX could surface whether a detail came from earlier chat summaries, story artifacts, or other memory sources — without exposing raw internals.</li>
</ul>
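<p>Of these, embedding coverage is the easiest to turn into a number. A minimal sketch, assuming memory items are available as dicts with an <code>embedding</code> field like the redacted row earlier; <code>embedding_coverage</code> is a hypothetical helper:</p>
<pre><code class="language-python">def embedding_coverage(items):
    """Fraction of memory items that can participate in vector recall."""
    items = list(items)
    if not items:
        return 0.0
    embedded = sum(1 for item in items if item.get("embedding") is not None)
    return embedded / len(items)


sample = [
    {"id": "a", "embedding": [0.1, 0.2]},
    {"id": "b", "embedding": None},
    {"id": "c", "embedding": None},
    {"id": "d", "embedding": [0.3, 0.4]},
]
print(f"embedding coverage: {embedding_coverage(sample):.0%}")
</code></pre>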
<h2>Closing</h2>
<p>Glia&#39;s memory is not a single module and not a single model behavior. It is an architecture. Messages live in Postgres. Background jobs compress them into structured artifacts. Those artifacts are indexed, retrieved, and assembled into bounded evidence. The model then writes inside those rails.</p>
<p>That is a more useful way to think about continuity than asking whether the model &quot;really remembers.&quot; The better question is:</p>
<p><em>What artifacts should become durable memory, how should they be indexed, and what evidence should be injected on each turn?</em></p>
<p>That is where memory becomes a product system rather than a metaphor.</p>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/glia-actually-remembers">https://dearartist.xyz/blog/glia-actually-remembers</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>From Claude HUD to My Own Status Line</title>
      <link>https://dearartist.xyz/blog/claude-statusline</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/claude-statusline</guid>
      <pubDate>Fri, 17 Apr 2026 16:00:00 GMT</pubDate>
      <description><![CDATA[Building a personal status line for Claude Code.]]></description>
      <category>claude-code</category>
      <category>developer-tools</category>
      <category>terminal</category>
      <content:encoded><![CDATA[<p><img src="https://dearartist.xyz/substack-visuals/claude-statusline/claude-statusline-hero-statusline-card.png" alt="Hero terminal mockup showing the Claude Code two-line status line" /></p>
<p><em>Two lines, one horizontal scan: model · plan · project · branch — context · usage · weekly.</em></p>
<h2>Intro</h2>
<p>Claude Code&#39;s status line is small, but I look at it constantly. It sits at the bottom of every session, summarising who I&#39;m talking to, where I&#39;m working, and how much budget is left. When the layout is stable, it fades into the background. When it isn&#39;t, it becomes surprisingly distracting.</p>
<p>This note walks through how I went from installing a third-party plugin to tracing its rendering path, finding a few layout issues, watching some get patched upstream, and eventually open-sourcing a small replacement that matches the two-line layout I wanted from the start.</p>
<h2>What I wanted from the status line</h2>
<p>My target UX was specific, not vague:</p>
<ul>
<li><strong>Line 1:</strong> model + plan + project + git branch</li>
<li><strong>Line 2:</strong> context + usage + weekly usage, all on the same line</li>
</ul>
<p>Compact, horizontally scannable, predictable. This is not an aesthetic preference — it is about making the data easy to scan during real use. I wanted to know, in one glance, which model I&#39;m on, which project I&#39;m in, and how close I am to my limits. Two lines. Same shape every time.</p>
<blockquote>
<p><strong>What I changed</strong></p>
<p>The goal was never &quot;make it pretty.&quot; It was &quot;make the same information land in the same place every time.&quot; Layout stability beats visual polish.</p>
</blockquote>
<h2>Starting with claude-hud</h2>
<p>The obvious starting point was <a href="https://github.com/jarrodwatts/claude-hud">claude-hud</a>. It is well-presented, it ships through Claude Code&#39;s plugin system, and the README screenshots showed something close to the layout I wanted. Installation is a four-command flow inside Claude Code itself.</p>
<pre><code class="language-bash">/plugin marketplace add jarrodwatts/claude-hud
/plugin install claude-hud
/reload-plugins
/claude-hud:setup
</code></pre>
<p>Within a minute I had a status line. Within five minutes I knew it wasn&#39;t quite the layout I wanted. That gap is what this note is about.</p>
<h2>What worked</h2>
<blockquote>
<p><strong>What worked</strong></p>
<ul>
<li>Plugin installation worked first try.</li>
<li>It exposed the key pieces of session state I cared about: model, context, usage, and git/project info.</li>
<li>It made the <code>statusLine</code> surface concrete enough that I could debug it instead of guessing.</li>
<li>The README and examples made the plugin feel approachable as a starting point.</li>
<li>It made the available stdin payload concrete, which mattered later when I built my own.</li>
</ul>
</blockquote>
<p>None of this is faint praise. claude-hud was a perfectly reasonable starting point. It taught me how the rendering surface worked. It also gave me a clear reference for what &quot;good enough&quot; looked like before I knew what I actually wanted.</p>
<h2>Investigation</h2>
<p>The README implied a layout close to this:</p>
<pre><code class="language-text">Expected:
[Sonnet 4.6 | Max] | project-name git:(main)
Context ████░░ 40% | Usage ██░ 20% | Weekly █ 8%
</code></pre>
<pre><code class="language-text">Observed in my env:
Sonnet 4.6 | project-name
Context ████░░ 40%
Usage ██░ 20%
Weekly █ 8%
</code></pre>
<p>In real use, the second line stacked into three or four rows. The data was the same, but the scanning experience was very different — instead of one horizontal bar of state, I had a little column of independent facts.</p>
<p>Before assuming this was a bug, I checked the obvious sources of a mismatch: my configuration (compact vs expanded), the stdin payload Claude Code was sending, terminal width detection, and claude-hud&#39;s own rendering logic. I traced this through the source rather than guessing — where the layout is rendered, how compact differs from expanded, what happens when <code>rate_limits</code> is missing, and how width is detected when stdout is a pipe instead of a TTY.</p>
<p>Each of those turned out to matter. None of them in isolation explained the full picture.</p>
<h2>Root cause</h2>
<p>There were four distinct factors, not one.</p>
<h3>1. Terminal width fallback (primary bug)</h3>
<p>Claude Code launches the status line as a subprocess with stdout piped, not a TTY. That means standard terminal width detection can fail. When all fallbacks failed, claude-hud used:</p>
<pre><code class="language-ts">const UNKNOWN_TERMINAL_WIDTH = 40;
</code></pre>
<p>40 columns is far narrower than any modern terminal. Under that constraint, the horizontal layout had no choice but to wrap into stacked rows. In my environment, this was the main reason the README-style layout didn&#39;t appear.</p>
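<p>The failure is easy to reproduce outside claude-hud. The sketch below is Python rather than the plugin&#39;s TypeScript, and <code>detect_width</code> is a hypothetical helper, not claude-hud&#39;s code. It shows a typical chain of checks and why a process with piped streams can end up at the fallback:</p>
<pre><code class="language-python">import os


def detect_width(fallback=80, env=None, fds=(1, 2, 0)):
    """Best-effort terminal width for a process whose stdout may be a pipe."""
    env = os.environ if env is None else env
    # 1. An explicit COLUMNS value wins when it is a positive integer.
    columns = str(env.get("COLUMNS", "")).strip()
    if columns.isdigit() and int(columns) > 0:
        return int(columns)
    # 2. Ask stdout, stderr, then stdin. os.get_terminal_size raises OSError
    #    when the descriptor is a pipe or file rather than a TTY.
    for fd in fds:
        try:
            size = os.get_terminal_size(fd)
        except (OSError, ValueError):
            continue
        if size.columns > 0:
            return size.columns
    # 3. Nothing answered, so the fallback alone decides the layout.
    return fallback


print(detect_width())
</code></pre>
<p>Run from an interactive shell, this prints the real width. Run with every stream piped and no <code>COLUMNS</code> set, it prints the fallback, which is why the choice of that constant matters so much.</p>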
<h3>2. Expanded layout pre-split context and usage too early</h3>
<p>The expanded renderer pre-split context and usage before the smarter wrapping logic could decide whether to keep them together. Combined with the 40-column fallback, this made the layout fragile in a predictable way.</p>
<h3>3. Missing rate_limits data from Claude Code stdin</h3>
<p>When <code>rate_limits</code> is absent from the stdin payload, the usage block becomes incomplete or disappears entirely. Even with correct layout logic, the visible output can still diverge from the README because the data itself is incomplete.</p>
<h3>4. README screenshots represent ideal conditions</h3>
<p>The README implicitly assumes that terminal width is detected correctly, that <code>rate_limits</code> is present, that the model name is short, and that the outer Claude Code UI isn&#39;t consuming critical width. It&#39;s a useful conceptual example, not a guaranteed representation of every runtime environment.</p>
<p>None of this is reducible to &quot;user error.&quot; Even with reasonable configuration, the 40-column fallback could force the HUD into a stacked layout, and switching to compact mode could make it worse, because compact emits one long line that wraps aggressively when width detection fails. I couldn&#39;t find a config-only path that reliably reproduced the README layout in my environment. In my case, the issue was primarily implementation and runtime behavior, not configuration.</p>
<blockquote>
<p><strong>Root cause</strong></p>
<p>One implementation bug (width fallback), one rendering ordering bug (pre-split), one runtime data gap (rate_limits), and one expectation gap (README under ideal conditions). Each looks like the others until you separate them.</p>
</blockquote>
<h2>What got fixed in claude-hud</h2>
<p>The investigation surfaced two real rendering issues, and fixes were applied for both.</p>
<ul>
<li>Terminal width fallback was raised from <strong>40 → 80</strong>.</li>
<li>Expanded rendering no longer pre-splits context and usage. They are combined first and then passed through the smarter wrapping logic, so they only split when there isn&#39;t enough room.</li>
</ul>
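<p>The second fix amounts to a greedy wrap over segments that start out joined. A rough Python sketch of that idea (claude-hud itself is TypeScript, and <code>wrap_segments</code> is a hypothetical name, not the plugin&#39;s function):</p>
<pre><code class="language-python">def wrap_segments(segments, width, sep=" | "):
    """Pack segments onto as few rows as possible without exceeding width.

    Uses len(), so it assumes every character is one terminal cell wide.
    """
    rows = []
    current = ""
    for segment in segments:
        candidate = segment if not current else current + sep + segment
        if current and len(candidate) > width:
            # No room left on this row: close it and start a new one.
            rows.append(current)
            current = segment
        else:
            current = candidate
    if current:
        rows.append(current)
    return rows


segments = ["Context ████░░ 40%", "Usage ██░ 20%", "Weekly █ 8%"]
for width in (80, 40, 20):
    print(width, wrap_segments(segments, width))
</code></pre>
<p>With this ordering a row only breaks when the next segment does not fit, so the number of rows depends on the detected width alone rather than on a split made before the width was known.</p>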
<p>The effect was incremental but real:</p>
<ul>
<li>Before either fix: often <strong>4 rows</strong>.</li>
<li>After the width fix: typically <strong>3 rows</strong>.</li>
<li>After the rendering fix: context and usage group much more intelligently.</li>
<li>With a correctly detected real terminal width, the layout becomes much closer to the README expectation.</li>
</ul>
<blockquote>
<p><strong>What got fixed</strong></p>
<p>Two real implementation fixes — not cosmetic tuning. The width fallback was a bug. The pre-split ordering was a bug. Both are now better.</p>
</blockquote>
<h2>What still remained unsolved</h2>
<p>Even after both fixes, the result still depends on runtime conditions.</p>
<ul>
<li>If the real terminal width is still not detected and the renderer falls back to 80 columns, Context + Usage + Weekly may still not fit on one line, so Weekly drops to a third row.</li>
<li>If <code>rate_limits</code> is missing from stdin, usage info still disappears or becomes incomplete.</li>
<li>The &quot;Max&quot; plan label I wanted was never implemented in claude-hud at all. This is not a config issue and not a bug — the codebase simply does not render a plan field. Adding it would require Claude Code to include that field in stdin and claude-hud to parse and render it.</li>
</ul>
<blockquote>
<p><strong>What still remained</strong></p>
<p>Some things were bugs that got fixed. Some things were runtime conditions that no plugin can fully control. Some things were missing features. They look the same from the outside, and they aren&#39;t.</p>
</blockquote>
<h2>Why I still replaced it</h2>
<p>Even after the fixes, I still chose to build my own. Not because claude-hud was bad — it had genuinely improved — but because my requirements were narrow and specific:</p>
<ul>
<li>Full layout control.</li>
<li>Deterministic output, every session.</li>
<li>A graceful fallback when <code>rate_limits</code> is missing.</li>
<li>A renderer small enough that I could understand and change end-to-end.</li>
<li>A final UI that matches my workflow exactly, not approximately.</li>
</ul>
<p>At that point, a small custom renderer felt simpler than continuing to tune wrapping behavior indirectly. For a surface this small, owning the renderer was easier to maintain than negotiating with someone else&#39;s heuristics.</p>
<h2>My custom open-source solution</h2>
<p>The result lives here: <a href="https://github.com/yuannh/claude-code-custom-statusline">yuannh/claude-code-custom-statusline</a>. It replaces plugin rendering with a small Python script that Claude Code invokes through its native <code>statusLine</code> command interface.</p>
<p>Wire it up in <code>~/.claude/settings.json</code>:</p>
<pre><code class="language-json">{
  &quot;enabledPlugins&quot;: {
    &quot;swift-lsp@claude-plugins-official&quot;: true
  },
  &quot;statusLine&quot;: {
    &quot;type&quot;: &quot;command&quot;,
    &quot;command&quot;: &quot;python3 \&quot;$HOME/.claude/statusline.py\&quot;&quot;
  }
}
</code></pre>
<p>The script itself is intentionally short and readable.</p>
<pre><code class="language-python">#!/usr/bin/env python3
import sys
import json
import os
import subprocess
from datetime import datetime, timezone
from pathlib import Path

CACHE = Path.home() / &quot;.claude&quot; / &quot;statusline-debug.json&quot;

def bar(pct, width=10):
    try:
        pct = float(pct)
    except Exception:
        pct = 0
    pct = max(0, min(100, pct))
    filled = int(round(width * pct / 100))
    return &quot;█&quot; * filled + &quot;░&quot; * (width - filled)

def time_left(ts):
    try:
        ts = int(ts)
        now = int(datetime.now(timezone.utc).timestamp())
        diff = max(0, ts - now)
        d = diff // 86400
        h = (diff % 86400) // 3600
        m = (diff % 3600) // 60
        if d &gt; 0:
            return f&quot;{d}d {h}h&quot;
        if h &gt; 0:
            return f&quot;{h}h {m}m&quot;
        return f&quot;{m}m&quot;
    except Exception:
        return &quot;&quot;

def git_text():
    project = os.path.basename(os.getcwd())
    try:
        branch = subprocess.check_output(
            [&quot;git&quot;, &quot;branch&quot;, &quot;--show-current&quot;],
            stderr=subprocess.DEVNULL,
            text=True
        ).strip()
    except Exception:
        branch = &quot;&quot;

    dirty = &quot;&quot;
    try:
        inside = subprocess.run(
            [&quot;git&quot;, &quot;rev-parse&quot;, &quot;--is-inside-work-tree&quot;],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL
        ).returncode == 0
        if inside:
            dirty_work = subprocess.run(
                [&quot;git&quot;, &quot;diff&quot;, &quot;--quiet&quot;],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL
            ).returncode != 0
            dirty_index = subprocess.run(
                [&quot;git&quot;, &quot;diff&quot;, &quot;--cached&quot;, &quot;--quiet&quot;],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL
            ).returncode != 0
            if dirty_work or dirty_index:
                dirty = &quot;*&quot;
    except Exception:
        pass

    if branch:
        return f&quot;{project} git:({branch}{dirty})&quot;
    return project

def load_json(text):
    try:
        return json.loads(text) if text else {}
    except Exception:
        return {}

raw = sys.stdin.read().strip()
current = load_json(raw)

cached = {}
if CACHE.exists():
    try:
        cached = json.loads(CACHE.read_text())
    except Exception:
        cached = {}

if current.get(&quot;rate_limits&quot;):
    try:
        CACHE.write_text(json.dumps(current))
    except Exception:
        pass

if not current.get(&quot;rate_limits&quot;) and cached.get(&quot;rate_limits&quot;):
    current[&quot;rate_limits&quot;] = cached[&quot;rate_limits&quot;]

for key in [&quot;model&quot;, &quot;context_window&quot;, &quot;workspace&quot;, &quot;cwd&quot;]:
    if not current.get(key) and cached.get(key):
        current[key] = cached[key]

data = current

model = data.get(&quot;model&quot;, {}).get(&quot;display_name&quot;) or &quot;Sonnet 4.6&quot;
plan = &quot;Max&quot;

ctx = data.get(&quot;context_window&quot;, {}).get(&quot;used_percentage&quot;)
if ctx is None:
    ctx = 0

limits = data.get(&quot;rate_limits&quot;) or {}
five = limits.get(&quot;five_hour&quot;) or {}
week = limits.get(&quot;seven_day&quot;) or {}

# &quot;or 0&quot; also covers an explicit null, which int() would reject below
five_pct = five.get(&quot;used_percentage&quot;) or 0
week_pct = week.get(&quot;used_percentage&quot;) or 0

five_left = time_left(five.get(&quot;resets_at&quot;))
week_left = time_left(week.get(&quot;resets_at&quot;))

line1 = f&quot;[{model} | {plan}] | {git_text()}&quot;
line2 = (
    f&quot;Context {bar(ctx)} {int(ctx)}% | &quot;
    f&quot;Usage {bar(five_pct)} {int(five_pct)}%&quot;
    + (f&quot; ({five_left})&quot; if five_left else &quot;&quot;)
    + &quot; | &quot;
    f&quot;Weekly {bar(week_pct)} {int(week_pct)}%&quot;
    + (f&quot; ({week_left})&quot; if week_left else &quot;&quot;)
)

print(line1)
print(line2)
</code></pre>
<p>It reads stdin JSON, caches the last good payload to <code>~/.claude/statusline-debug.json</code>, falls back to the cache when <code>rate_limits</code> is missing, and prints exactly two lines. No wrap heuristics. No conditional layouts. Same shape, every time.</p>
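<p>The cache fallback is the one piece of logic here worth testing on its own. Pulled out as a pure function it looks like the sketch below; <code>merge_with_cache</code> is a hypothetical refactoring of the script&#39;s merge step, not code from the repository:</p>
<pre><code class="language-python">def merge_with_cache(current, cached):
    """Fill gaps in the live payload from the last cached payload."""
    merged = dict(current)  # leave the caller's dict untouched
    for key in ("rate_limits", "model", "context_window", "workspace", "cwd"):
        if not merged.get(key) and cached.get(key):
            merged[key] = cached[key]
    return merged


cached = {
    "model": {"display_name": "Sonnet 4.6"},
    "rate_limits": {"five_hour": {"used_percentage": 20}},
}
live = {
    "model": {"display_name": "Sonnet 4.6"},
    "context_window": {"used_percentage": 40},
}

merged = merge_with_cache(live, cached)
print(merged["rate_limits"]["five_hour"]["used_percentage"])
</code></pre>
<p>One consequence of this design: when the live payload lacks <code>rate_limits</code>, the bars show the last cached values, which may be stale.</p>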
<h2>What my implementation fixes</h2>
<blockquote>
<p><strong>What I changed</strong></p>
<ul>
<li>Deterministic two-line layout.</li>
<li>Stable horizontal second line.</li>
<li>Cache fallback when <code>rate_limits</code> is missing.</li>
<li>Simpler rendering path — no width detection at all.</li>
<li>Easier debugging and full ownership.</li>
<li>Explicit layout control instead of indirect tuning.</li>
</ul>
</blockquote>
<p>It is not trying to be more capable than claude-hud. It is simply narrower, smaller, and easier for me to reason about.</p>
<h3>Side-by-side comparison</h3>
<pre><code class="language-text">README expected:
[Sonnet 4.6 | Max] | proj git:(main)
Context ███░ 40% | Usage ██ 20% | Weekly █ 8%
</code></pre>
<pre><code class="language-text">Actual claude-hud:
Sonnet 4.6 | proj
Context ███░ 40%
Usage ██ 20%
Weekly █ 8%
</code></pre>
<pre><code class="language-text">Patched claude-hud:
Sonnet 4.6 | proj git:(main)
Context ███░ 40% | Usage ██ 20%
Weekly █ 8%
</code></pre>
<pre><code class="language-text">My implementation:
[Sonnet 4.6 | Max] | proj git:(main*)
Context ███░ 40% | Usage ██ 20% | Weekly █ 8%
</code></pre>
<h2>Screenshots — before vs after</h2>
<p>These are CSS recreations of what the status line actually rendered as in each phase — same data, different layout behavior.</p>
<p><img src="https://dearartist.xyz/substack-visuals/claude-statusline/claude-statusline-four-tones-comparison.png" alt="Four stacked terminal screenshots: expected, observed, after spec rewrite, and final status line" /></p>
<p><em>Status line evolution across four iterations.</em></p>
<p><img src="https://dearartist.xyz/substack-visuals/claude-statusline/claude-statusline-github-repo-card.png" alt="GitHub-style repo card for yuannh/claude-code-custom-statusline" /></p>
<p><em>The open-source repo on GitHub.</em></p>
<h2>Links and repositories</h2>
<ul>
<li><a href="https://github.com/jarrodwatts/claude-hud">jarrodwatts/claude-hud</a> — The original plugin I started from and investigated. A great way to understand the Claude Code statusLine surface.</li>
<li><a href="https://github.com/yuannh/claude-code-custom-statusline">yuannh/claude-code-custom-statusline</a> — My smaller, purpose-built implementation for a deterministic two-line layout. Python, no dependencies, ~110 lines.</li>
</ul>
<h2>Lessons learned</h2>
<ul>
<li>Plugins are a great starting point. Sometimes they aren&#39;t the final answer.</li>
<li>Rendering bugs can hide behind what looks like a config mismatch.</li>
<li>Terminal width detection matters far more than it first appears, especially when stdout is piped.</li>
<li>Documentation screenshots usually represent ideal conditions, not all runtime realities.</li>
<li>Even when a tool improves, owning the final rendering path can be the better tradeoff if the UX needs to be exact.</li>
<li>Source-level investigation is often the fastest way to separate user error from real implementation problems.</li>
</ul>
<h2>Closing reflection</h2>
<p>I started by trying to customize a plugin. I ended up tracing the rendering path, understanding the limits of the original implementation, and open-sourcing a version that matched the interface I actually wanted.</p>
<p>In the end, I didn&#39;t need a more feature-rich status line. I needed one whose layout I could predict. For my workflow, that turned out to be easier to build than to negotiate. That tradeoff won&#39;t be right for everyone, but it was the right one for me.</p>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/claude-statusline">https://dearartist.xyz/blog/claude-statusline</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>My Terminal Workstack: Ghostty + Yazi + lazygit + Claude Code</title>
      <link>https://dearartist.xyz/blog/terminal-workstack</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/terminal-workstack</guid>
      <pubDate>Fri, 17 Apr 2026 15:00:00 GMT</pubDate>
      <description><![CDATA[The terminal stack I use day to day.]]></description>
      <category>terminal</category>
      <category>developer-tools</category>
      <category>workflow</category>
      <content:encoded><![CDATA[<p>One terminal window, four tools, and a workflow that finally feels like a workstation instead of a command line.</p>
<p><img src="https://dearartist.xyz/substack-visuals/terminal-workstack/terminal-workstack-hero-three-panes-card.png" alt="Three-pane terminal mockup: Claude Code · Yazi · lazygit" /></p>
<p><em>Three panes, one terminal: Claude Code · Yazi · lazygit.</em></p>
<p><strong>Stack Summary</strong></p>
<ul>
<li><strong>Ghostty</strong> → terminal and panes</li>
<li><strong>Yazi</strong> → file navigation</li>
<li><strong>lazygit</strong> → Git workflow</li>
<li><strong>Claude Code</strong> → AI coding layer</li>
</ul>
<h2>Why I moved more of my work into the terminal</h2>
<p>For a while, my day looked like this: Finder for files, a GUI Git client for commits, a browser tab for searching the codebase, an IDE for editing, another window for the terminal, and a separate chat window for AI. Each tool was fine in isolation. The context switching was not.</p>
<p>I wanted one place where navigation, code context, Git operations, and AI-assisted development could sit next to each other — composable, focused, and fast. The terminal turned out to be the most honest answer. Not because the terminal is morally superior, but because it lets me arrange the pieces I actually use into a single layout that never moves.</p>
<h2>The stack</h2>
<p>Four tools, each doing one thing well.</p>
<h3>Ghostty — terminal</h3>
<p>The foundation. Fast, GPU-accelerated, native-feeling on macOS, and serious about split panes. It&#39;s the first terminal I&#39;ve used where the window itself feels like part of the work, not a frame around it.</p>
<ul>
<li><a href="https://github.com/ghostty-org/ghostty">GitHub</a></li>
<li><a href="https://ghostty.org">Docs</a></li>
</ul>
<h3>Yazi — file manager</h3>
<p>A blazing-fast terminal file manager written in Rust. I use it to move through projects and directories without lifting my hands off the keyboard, and to drop back into a shell exactly where I left off.</p>
<ul>
<li><a href="https://github.com/sxyazi/yazi">GitHub</a></li>
<li><a href="https://yazi-rs.github.io">Docs</a></li>
</ul>
<h3>lazygit — git ui</h3>
<p>A keyboard-first Git UI. Staging hunks, reviewing diffs, writing commit messages, rebasing — all without typing long commands or leaving the terminal. The thing GUI Git clients always wanted to be.</p>
<ul>
<li><a href="https://github.com/jesseduffield/lazygit">GitHub</a></li>
<li><a href="https://github.com/jesseduffield/lazygit/wiki">Wiki</a></li>
</ul>
<h3>Claude Code — ai layer</h3>
<p>The AI layer in the terminal. I use it for repo-level understanding, code audits, implementation help, and quick iteration on patches. It reads the actual code, not a summary I had to paste in.</p>
<ul>
<li><a href="https://docs.claude.com/claude-code">Docs</a></li>
<li><a href="https://github.com/anthropics/claude-code">GitHub</a></li>
</ul>
<h2>How the workflow fits together</h2>
<p>A typical session looks like this.</p>
<ol>
<li>Open Ghostty. Split it into a left pane and a right pane.</li>
<li>In the left pane, run <code>yy</code> to launch Yazi and jump into the project I want.</li>
<li>In the right pane, start Claude Code. Ask it to explain a module, find usages, or sketch a fix.</li>
<li>Split the right pane into a top and a bottom half. The top half stays with Claude. The bottom half opens lazygit so I can watch the working tree change as I edit.</li>
<li>Edit in my editor of choice (often invoked from the same terminal). Stage, review, and commit in lazygit. Push.</li>
</ol>
<p>Nothing here is novel on its own. The point is that all of it lives in one window, in panes I can see at a glance, with the same key bindings I use for everything else.</p>
<h2>What I actually like about this setup</h2>
<ul>
<li><strong>Fewer context switches.</strong> Files, code, Git, and AI are visible at the same time. I stop losing my train of thought between apps.</li>
<li><strong>More keyboard flow.</strong> Once your hands stop moving to the trackpad, you start thinking in steps instead of clicks.</li>
<li><strong>Better repo awareness.</strong> Yazi keeps the directory tree in my peripheral vision, so I always know where I am.</li>
<li><strong>Faster Git reviews.</strong> lazygit&#39;s diff view makes it easy to actually read what I&#39;m about to commit instead of rubber-stamping it.</li>
<li><strong>Terminal as a real workspace.</strong> Not a command prompt I tolerate, but a layout I designed.</li>
</ul>
<h2>My basic setup and config</h2>
<p>The relevant pieces from my dotfiles. Nothing exotic — just the things that earn their keep.</p>
<h3>Ghostty config</h3>
<p>Lives at <code>~/.config/ghostty/config</code>.</p>
<pre><code class="language-conf"># =====================================
# Ghostty — macOS polished setup
# =====================================
font-family = JetBrainsMono Nerd Font Mono
font-size = 13
window-padding-x = 18
window-padding-y = 16
cursor-style = block
cursor-style-blink = false
theme = Catppuccin Mocha
background-opacity = 0.96
background-blur-radius = 20
macos-titlebar-style = transparent
scrollback-limit = 300000
shell-integration = detect
clipboard-read = allow
clipboard-write = allow
copy-on-select = false
window-inherit-working-directory = true
confirm-close-surface = true
mouse-scroll-multiplier = precision:1.0,discrete:3
keybind = cmd+t=new_tab
keybind = cmd+w=close_surface
keybind = cmd+n=new_window
keybind = cmd+]=next_tab
keybind = cmd+[=previous_tab
keybind = cmd+d=new_split:right
keybind = cmd+shift+d=new_split:down
keybind = alt+left=goto_split:left
keybind = alt+right=goto_split:right
keybind = alt+up=goto_split:up
keybind = alt+down=goto_split:down
keybind = cmd+alt+left=resize_split:left,10
keybind = cmd+alt+right=resize_split:right,10
keybind = cmd+alt+up=resize_split:up,10
keybind = cmd+alt+down=resize_split:down,10
keybind = cmd+ctrl+f=toggle_fullscreen
keybind = cmd+shift+p=toggle_command_palette
</code></pre>
<h3>Yazi shell helper</h3>
<p>The <code>yy</code> function launches Yazi and, when you quit, drops your shell into whatever directory you ended up in.</p>
<pre><code class="language-zsh">function yy() {
    # Temp file where Yazi records the directory it was in on exit
    local tmp=&quot;$(mktemp -t &quot;yazi-cwd.XXXXXX&quot;)&quot; cwd
    yazi &quot;$@&quot; --cwd-file=&quot;$tmp&quot;
    # `command cat` bypasses any cat alias; only cd if the directory changed
    if cwd=&quot;$(command cat -- &quot;$tmp&quot;)&quot; &amp;&amp; [ -n &quot;$cwd&quot; ] &amp;&amp; [ &quot;$cwd&quot; != &quot;$PWD&quot; ]; then
        cd -- &quot;$cwd&quot;
    fi
    rm -f -- &quot;$tmp&quot;
}
</code></pre>
<h3>Helpful aliases</h3>
<pre><code class="language-zsh">alias codehome=&#39;cd ~/Code&#39;
alias lg=&#39;lazygit&#39;
</code></pre>
<h3>Typical usage</h3>
<pre><code class="language-bash">yy ~/Code
cd ~/Code/glia-core
lazygit
</code></pre>
<h2>Notes on how I use Git in the terminal</h2>
<p>lazygit doesn&#39;t change what Git is. It just makes the parts you do every day easier to see. The mental model I keep in mind:</p>
<ul>
<li><strong>Working tree</strong> — what&#39;s actually changed on disk.</li>
<li><strong>Staging</strong> — picking which of those changes belong in the next commit.</li>
<li><strong>Commit</strong> — saving a meaningful checkpoint with a message you&#39;d be willing to read later.</li>
<li><strong>Push</strong> — publishing it to the remote so it stops being only yours.</li>
</ul>
<p>When all four are visible in a pane next to my code, I commit smaller, stage more deliberately, and write better messages. That&#39;s the whole pitch.</p>
<h2>What this setup is not</h2>
<p>It&#39;s not a manifesto for ditching every GUI forever. I still use a browser, a Figma window, and a real editor. I&#39;m not pretending the terminal is universally better.</p>
<p>It&#39;s also not about terminal purity. If a tool only exists as a GUI and it&#39;s good, I use it.</p>
<p>What this setup does is reduce the friction in the part of my day where I&#39;m actually building. The value isn&#39;t in any single tool. It&#39;s in the composition — four small, well-behaved pieces sitting next to each other, doing one job each, never getting in the way.</p>
<h2>Resources</h2>
<ul>
<li><strong>Ghostty</strong> — Terminal emulator with native UI and GPU acceleration<ul>
<li><a href="https://github.com/ghostty-org/ghostty">GitHub</a></li>
<li><a href="https://ghostty.org/docs">Docs</a></li>
</ul>
</li>
<li><strong>Yazi</strong> — Async terminal file manager written in Rust<ul>
<li><a href="https://github.com/sxyazi/yazi">GitHub</a></li>
<li><a href="https://yazi-rs.github.io/docs/installation">Docs</a></li>
</ul>
</li>
<li><strong>lazygit</strong> — Keyboard-first terminal UI for Git<ul>
<li><a href="https://github.com/jesseduffield/lazygit">GitHub</a></li>
<li><a href="https://github.com/jesseduffield/lazygit/wiki">Wiki</a></li>
</ul>
</li>
<li><strong>Claude Code</strong> — Terminal-native AI coding workflow<ul>
<li><a href="https://docs.claude.com/claude-code">Docs</a></li>
<li><a href="https://github.com/anthropics/claude-code">GitHub</a></li>
</ul>
</li>
</ul>
<p><em>Still evolving. Like any good terminal setup.</em></p>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/terminal-workstack">https://dearartist.xyz/blog/terminal-workstack</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Reducing Token Cost While Working on Glia</title>
      <link>https://dearartist.xyz/blog/glia-token-cost</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-token-cost</guid>
      <pubDate>Fri, 17 Apr 2026 11:00:00 GMT</pubDate>
      <description><![CDATA[Practical tactics to lower token spend.]]></description>
      <category>glia</category>
      <category>llm</category>
      <category>cost-optimization</category>
      <content:encoded><![CDATA[<p><em>Engineering Note · Glia · AI-Assisted Development · Cost Optimization · Codebase Governance</em></p>
<p><em>2026-04-17 09:30 · 12 min read · Engineering, AI-Assisted Development, Cost Optimization, Codebase Governance, Refactoring</em></p>
<p>A practical governance pass on an AI-heavy codebase: separating fixed prompt noise from real context cost, extracting safe helpers, validating aggressively, and stopping before optimization became a refactor project.</p>
<blockquote>
<p><strong>Key Findings</strong></p>
<ul>
<li>In Glia, the main token cost driver was repeated large-file context loading, not just prompt overhead.</li>
<li>Trimming <code>CLAUDE.md</code> helped reduce fixed noise, but it was not the main billing lever.</li>
<li>Extracting pure helper functions improved file-load granularity for narrow AI queries.</li>
<li>The biggest remaining cost center is still orchestration-level reasoning inside giant job bodies.</li>
</ul>
</blockquote>
<h2>Opening</h2>
<p>While working on <a href="https://gliahq.com">Glia ↗</a>, I ran into a practical problem: our AI-assisted development workflow was getting expensive in ways that were easy to feel but hard to measure. I was not trying to clean up the codebase for aesthetics. I was trying to understand what was actually driving token usage, and which changes would reduce cost without turning the effort into an uncontrolled refactor.</p>
<p>This was an engineering economics problem more than a code quality one. Repeated context loading of giant files, a long always-applied <code>CLAUDE.md</code>, and dense iterative debugging loops were quietly compounding. The instinct to make things &quot;cleaner&quot; is not the same instinct as making things &quot;cheaper&quot;, and the two diverge quickly once you look at where the bytes actually go.</p>
<h2>The Glia-specific problem</h2>
<p>The concrete situation was easy to characterize once I sat down with the file sizes. A handful of files dominated the working surface that AI tooling kept reaching into.</p>
<table>
<thead>
<tr>
<th>File</th>
<th>Size</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td><code>extract_story_job.py</code></td>
<td>~10,598 lines · ~467 KB</td>
<td>Compose pipeline orchestration</td>
</tr>
<tr>
<td><code>app/api/cards.py</code></td>
<td>~223 KB</td>
<td>Card-surface API</td>
</tr>
<tr>
<td><code>theme_story_service.py</code></td>
<td>~175 KB</td>
<td>Theme story assembly</td>
</tr>
<tr>
<td><code>CLAUDE.md</code></td>
<td>~18 KB (pre-cleanup)</td>
<td>Always-applied instructions</td>
</tr>
</tbody></table>
<p>At first glance, there were two obvious suspects: the oversized <code>CLAUDE.md</code> injected into every conversation, and several extremely large Python files, especially <code>extract_story_job.py</code>.</p>
<p>The first important insight was that the instinct to trim <code>CLAUDE.md</code> was directionally right, but incomplete. It was fixed overhead. The much more expensive pattern was repeated full-file loading of giant source files whenever AI tooling needed even a tiny helper function, a local rule, or a narrow explanation of behavior. The same 467 KB was being shipped into context for questions that genuinely only needed 200 lines.</p>
<h2>Fixed noise vs dynamic cost</h2>
<p>The reframing that mattered was separating two things that often get conflated in AI cost discussions:</p>
<table>
<thead>
<tr>
<th>Fixed noise</th>
<th>Dynamic cost</th>
</tr>
</thead>
<tbody><tr>
<td><code>CLAUDE.md</code></td>
<td>giant source files</td>
</tr>
<tr>
<td>always-applied context</td>
<td>repeated full-file loads</td>
</tr>
<tr>
<td>lower leverage</td>
<td>much higher leverage</td>
</tr>
</tbody></table>
<p>In Glia, <code>CLAUDE.md</code> still mattered, but mainly as context pollution rather than the primary billing driver. The true cost driver was large-file ingestion during repeated debugging and code review cycles. That changed the optimization strategy from &quot;compress everything&quot; to &quot;reduce the granularity of what needs to be loaded.&quot;</p>
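<p>To make the two regimes concrete, here is a back-of-envelope sketch. The file sizes come from the table above; the 4-bytes-per-token ratio is a crude assumption rather than a measurement, and the number of full-file loads per session is hypothetical.</p>

```python
# Back-of-envelope comparison of fixed vs dynamic context cost.
# ASSUMPTION: ~4 bytes per token is a crude heuristic, not a measurement.
BYTES_PER_TOKEN = 4


def approx_tokens(size_kb: float) -> int:
    """Rough token estimate for a file of the given size in KB."""
    return round(size_kb * 1024 / BYTES_PER_TOKEN)


def session_cost(fixed_kb: float, file_kb: float, file_loads: int) -> dict:
    """Split one session's context into its fixed and dynamic parts."""
    fixed = approx_tokens(fixed_kb)                # counted once in this sketch
    dynamic = approx_tokens(file_kb) * file_loads  # paid on every full-file load
    return {"fixed": fixed, "dynamic": dynamic}


# Sizes from the table; three loads per session is a hypothetical figure.
cost = session_cost(fixed_kb=18, file_kb=467, file_loads=3)
print(cost)
```

<p>Under these assumptions the dynamic part dwarfs the fixed part well before the load count gets large, which is the shape of the argument above even if the exact ratio differs in practice.</p>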
<h2>Process flow</h2>
<pre><code>Cost Investigation  →  CLAUDE.md Trim  →  Helper Extraction  →  Verification  →  Stop / Observe
(What actually        (Reduce fixed       (Smaller modules)     (Tests +         (Measure,
 loads)                noise)                                    container)       don&#39;t refactor)
</code></pre>
<h2>Phase 1: Trimming CLAUDE.md</h2>
<p>The first move was the low-risk one. I went through <code>CLAUDE.md</code> and removed historical rollout and recovery sections that no longer affected current decision-making. I preserved only the still-relevant constraints, and merged isolated environment variables back into the main feature flag table so they would be findable in one place.</p>
<table>
<thead>
<tr>
<th>✓ Kept</th>
<th>✗ Removed</th>
</tr>
</thead>
<tbody><tr>
<td>Still-active env variables</td>
<td>Completed rollout history</td>
</tr>
<tr>
<td>The one still-open issue worth preserving</td>
<td>Resolved issue logs</td>
</tr>
<tr>
<td>Active constraints such as the <code>SM-5</code> guard</td>
<td>Validation snapshots</td>
</tr>
<tr>
<td></td>
<td>Temporary monitoring notes</td>
</tr>
<tr>
<td></td>
<td>Recovery narrative no longer affecting behavior</td>
</tr>
</tbody></table>
<p>The result was modest in raw numbers: <code>CLAUDE.md</code> went from roughly 385 lines to 322 lines, and from about 18 KB to about 13.5 KB. This was worth doing, but not because it would dramatically lower cost by itself. Its real value was reducing fixed noise and improving the quality of the model&#39;s working context. A leaner instruction file is easier to keep coherent and easier to reason about during review.</p>
<h2>Phase 2: Reducing file-load granularity</h2>
<p>The higher-leverage move was structural rather than textual. Instead of debating giant files in the abstract, I extracted pure helper functions out of <code>extract_story_job.py</code> into smaller modules so future AI queries could load narrow-purpose files instead of a 467 KB monolith.</p>
<p>My rule was intentionally strict: only move helpers that were pure, self-contained, independently queryable, and one-way-import safe. No DB access. No LLM calls. No Celery or task context. No orchestration logic. The whole point was that the extracted modules had to be safe to load in isolation.</p>
<p>Honestly, the first pass overshot the original minimal plan. Instead of extracting only a tiny timeline helper cluster, the change moved a broader pure-function surface into two new modules:</p>
<ul>
<li><code>app/core/story_text_utils.py</code></li>
<li><code>app/core/story_identity_guard.py</code></li>
</ul>
<p>This was broader than originally intended. That immediately raised the bar for validation, because once you stop being minimal you can no longer rely on minimality as your safety argument.</p>
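<p>The extraction rule (no DB, no LLM, no Celery, no imports back into the job file) can be checked mechanically. The sketch below is one way to do that with the standard <code>ast</code> module. The deny-list entries are illustrative assumptions, not Glia&#39;s actual dependency names, and relative imports are ignored for brevity.</p>

```python
import ast

# ASSUMPTION: illustrative deny-list; the real project's package names may differ.
FORBIDDEN_ROOTS = {"celery", "sqlalchemy", "anthropic", "app.jobs"}


def imported_modules(source: str) -> set[str]:
    """Return every absolute module name imported anywhere in the source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def violations(source: str) -> set[str]:
    """Imports that would make a helper module unsafe to load in isolation."""
    return {
        name
        for name in imported_modules(source)
        if any(name == root or name.startswith(root + ".") for root in FORBIDDEN_ROOTS)
    }


# Hypothetical module sources used only to exercise the check.
pure_helper = "import re\nfrom collections import Counter\n"
leaky_helper = "import re\nfrom app.jobs.extract_story_job import extract_story_job\n"

print(violations(pure_helper))
print(violations(leaky_helper))
```

<p>Because the check parses source text instead of importing it, it can run without the worker&#39;s dependencies installed.</p>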
<h2>Concrete examples from the Glia codebase</h2>
<p><code>extract_story_job.py</code> contained helper logic that AI tools might need to inspect independently (tokenization rules, sentence normalization, identity-token comparison), but those narrow questions often dragged in the whole 467 KB file. After extraction, the same rule-level debugging questions can usually target <code>story_text_utils.py</code> or <code>story_identity_guard.py</code> instead.</p>
<pre><code class="language-python"># Imports after extraction
from app.core.story_text_utils import (
    _tokenize_universal,
    _normalize_timeline_sentence,
)
from app.core.story_identity_guard import (
    _identity_tokens,
    _token_jaccard,
)
</code></pre>
<p><strong>Loading shape</strong></p>
<pre><code>Before: helper question → load extract_story_job.py (~467 KB)
After:  helper question → load story_text_utils.py / story_identity_guard.py
</code></pre>
<h2>The governance correction</h2>
<p>The most important lesson was not just about code movement. It was about governance. A broader refactor cannot be accepted on the basis of vague or incomplete validation, and the first acceptance pass on this change was honestly a little too soft.</p>
<p>Partial import issues appeared during early checks. Some validation claims were too loose. &quot;All good&quot; was asserted too early, on evidence that did not actually cover the QA worker container. That kind of premature green light is exactly how a small extraction quietly turns into a regression in a runtime path you never tested.</p>
<p>Instead of continuing into more extraction work, I stopped and required a strict verification pass. The required checks were explicit:</p>
<ul>
<li>Enumerate every moved function.</li>
<li>Confirm no duplicate definitions remained in <code>extract_story_job.py</code>.</li>
<li>Confirm one-way import direction only.</li>
<li>Verify re-export behavior where tests still imported symbols from the old path.</li>
<li>Run a larger, reproducible test set.</li>
<li>Verify import success in the QA worker container.</li>
</ul>
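<p>The &quot;no duplicate definitions&quot; check from the list above is easy to script. This is a minimal sketch using <code>ast</code>; the inline module sources are hypothetical stand-ins for the real files, and it only looks at top-level functions.</p>

```python
import ast


def top_level_functions(source: str) -> set[str]:
    """Names of functions defined at module top level."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def leftover_duplicates(old_source: str, new_source: str) -> set[str]:
    """Functions still defined in the old file after moving to the new one."""
    return top_level_functions(old_source).intersection(top_level_functions(new_source))


# HYPOTHETICAL stand-ins for story_text_utils.py and extract_story_job.py.
new_module = "def _tokenize_universal(text):\n    return text.split()\n"
old_module_clean = (
    "from app.core.story_text_utils import _tokenize_universal\n"
    "def extract_story_job():\n    pass\n"
)
old_module_dirty = old_module_clean + "def _tokenize_universal(text):\n    return []\n"

print(leftover_duplicates(old_module_clean, new_module))
print(leftover_duplicates(old_module_dirty, new_module))
```

<p>An empty set means the old file only re-exports the moved name through its import. A non-empty set means a stale copy is still shadowing the extracted helper.</p>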
<blockquote>
<p><strong>Note</strong></p>
<p>This was not a heroic rewrite. It was a constrained governance pass: reduce cost where the evidence was strong, validate aggressively, and stop before the optimization frontier became a refactor project.</p>
</blockquote>
<h2>What was actually verified</h2>
<p>After the corrected verification pass, the outcomes I was willing to stand behind were narrow but solid.</p>
<ul>
<li>✓ No duplicate moved-function definitions remained in <code>extract_story_job.py</code>.</li>
<li>✓ New modules did not import back from the original giant job file.</li>
<li>✓ Representative behavior checks passed for the moved helper functions.</li>
<li>✓ 191 tests passed in the selected verification scope.</li>
<li>✓ QA worker import check succeeded for <code>from app.jobs.extract_story_job import extract_story_job</code>.</li>
</ul>
<p>The QA worker check mattered most. It closed the gap between &quot;seems fine in a partial local environment&quot; and &quot;imports correctly in the actual runtime container.&quot; Local greens that do not mirror runtime are the exact failure mode I was trying to avoid this time.</p>
<pre><code class="language-bash"># qa-worker · import smoke check
$ docker exec glia-core-worker-1 python -c \
    &quot;from app.jobs.extract_story_job import extract_story_job; print(&#39;import OK&#39;)&quot;

import OK
</code></pre>
<h2>What changed structurally</h2>
<table>
<thead>
<tr>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody><tr>
<td>Helper logic buried inside a massive job file.</td>
<td>Text and identity helpers are available as smaller modules.</td>
</tr>
<tr>
<td>Narrow questions often triggered giant context loads.</td>
<td>Future AI-assisted queries can target smaller files in many cases.</td>
</tr>
<tr>
<td>Fixed prompt context contained too much stale history.</td>
<td><code>CLAUDE.md</code> contains less dead historical state.</td>
</tr>
<tr>
<td></td>
<td>Validation discipline improved before accepting structural changes.</td>
</tr>
</tbody></table>
<h2>What this did NOT solve</h2>
<p>Helper extraction solved only one class of cost: narrow helper-level queries. It did not solve the most expensive remaining problem.</p>
<blockquote>
<p><strong>⚠ Still Expensive</strong></p>
<p><code>extract_story_job.py</code> still contains two enormous orchestration bodies:</p>
<ul>
<li><code>extract_story_job()</code></li>
<li><code>process_story_pipeline_job()</code></li>
</ul>
<p>These functions remain the main cost center for compose-flow debugging. Any conversation that needs to understand those paths can still trigger very large context loads. Helper extraction is not the same thing as orchestration decomposition, and conflating the two would be a real architectural mistake.</p>
</blockquote>
<h2>Why I did not continue into theme_story_service.py</h2>
<p>I did analyze <code>theme_story_service.py</code> as the next candidate, and I found a reasonable pure helper cluster that could potentially be extracted later. The file contained candidate pure helpers such as date formatting helpers, summary sanitation helpers, feed summary helpers, and lightweight text transformation helpers.</p>
<p>But I deliberately stopped. The analysis surfaced near-duplicate utility behavior that was not obviously safe to unify without design intent review.</p>
<p>Two concrete examples from the codebase:</p>
<ul>
<li><code>_tokenize</code> in <code>theme_story_service.py</code> was close to <code>_tokenize_universal</code> in <code>story_text_utils.py</code> — but their CJK behavior was not identical.</li>
<li><code>_truncate_at_sentence</code> also looked very close to <code>_truncate_snippet_at_sentence</code>, but &quot;near-duplicate&quot; is not the same as &quot;safe to consolidate&quot;.</li>
</ul>
<p>The analysis showed that <code>theme_story_service.py</code> was a plausible next target, but it also showed why obvious cleanup is sometimes less obvious than it looks. Unifying two functions that behave subtly differently changes behavior for at least one set of callers, so it is a design decision, not a tidy-up.</p>
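<p>One cheap way to decide whether two near-duplicates are safe to merge is to run both over the same samples and list the inputs where they disagree. The two tokenizers below are hypothetical stand-ins (the real <code>_tokenize</code> and <code>_tokenize_universal</code> are not reproduced here); they only mimic &quot;similar, but different on CJK input&quot;.</p>

```python
import re


# HYPOTHETICAL stand-ins, not the project's real functions.
def tokenize_a(text: str) -> list[str]:
    """Lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def tokenize_b(text: str) -> list[str]:
    """Like tokenize_a, but emits each CJK ideograph as its own token."""
    return re.findall(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+", text.lower())


def disagreements(samples: list[str]) -> list[str]:
    """Inputs on which the two tokenizers return different results."""
    return [s for s in samples if tokenize_a(s) != tokenize_b(s)]


samples = ["Theme story summary", "glia 故事 pipeline", "2026-04-17 release"]
print(disagreements(samples))
```

<p>If the list is empty across a representative corpus, consolidation is probably a tidy-up. If it is not, each disagreement is a behavior question someone has to answer on purpose.</p>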
<h2>Why I deliberately stopped</h2>
<p>Once the validated low-risk improvements were in place, I stopped. Beyond this point, optimization would no longer be &quot;small extraction work.&quot; It would become real refactoring:</p>
<ul>
<li>Redefining pipeline stage boundaries.</li>
<li>Extracting orchestration phases.</li>
<li>Designing interfaces between evidence selection, compose, gating, and publish.</li>
<li>Increasing coordination and regression risk.</li>
</ul>
<p>Not every technically possible optimization should be executed immediately. Good engineering governance includes knowing when to stop. The easiest mistake at that point would have been to keep going on momentum and pretend it was still a focused optimization.</p>
<h2>Key lessons</h2>
<ol>
<li><strong>Separate fixed noise from dynamic cost.</strong> Always-applied context and per-query file loads are different cost regimes. Treat them with different tools.</li>
<li><strong>Large-file loading is often the real billing driver in AI-assisted workflows.</strong> Prompt overhead is visible. Repeated multi-hundred-KB ingest is not, but it is usually where the money goes.</li>
<li><strong>&quot;Cleaner&quot; is not the same as &quot;cheaper&quot;.</strong> A nicer-looking codebase does not automatically reduce token cost. Tie every change to a specific load path you actually want to shrink.</li>
<li><strong>Validation quality matters more when scope expands.</strong> The minute a minimal change becomes a broader extraction, the minimality argument disappears. Verification has to scale with scope.</li>
<li><strong>Governance is not bureaucracy; it is how you stop local improvements from turning into uncontrolled refactors.</strong> Knowing when to stop is part of the work. Do not continue optimizing without measurement.</li>
</ol>
<h2>What I would do next</h2>
<ul>
<li>Observe real-world token usage after the current changes before committing to more structural work.</li>
<li>Measure cost across typical task types so future decisions are data-grounded, not vibes-grounded.</li>
<li>Determine whether future helper extraction in <code>theme_story_service.py</code> is justified once the duplicate-behavior questions are resolved.</li>
<li>Treat decomposition of the giant orchestration functions as a dedicated engineering sprint, not a side quest.</li>
</ul>
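<p>Measuring by task type does not need much tooling. A minimal sketch, assuming usage records are available as dictionaries with a task label and an input-token count (both field names and all values below are hypothetical placeholders, not measured data):</p>

```python
from collections import defaultdict
from statistics import mean


def tokens_by_task(records: list[dict]) -> dict[str, float]:
    """Average input tokens per task type from a list of usage records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["task"]].append(record["input_tokens"])
    return {task: mean(values) for task, values in grouped.items()}


# HYPOTHETICAL records: placeholders that show the shape, not real measurements.
records = [
    {"task": "helper-question", "input_tokens": 6000},
    {"task": "helper-question", "input_tokens": 8000},
    {"task": "compose-debug", "input_tokens": 120000},
]
print(tokens_by_task(records))
```

<p>Even a table this coarse is enough to tell whether the next extraction would touch a task type that actually costs money.</p>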
<h2>Closing</h2>
<p>The most useful outcome of this pass was not just a smaller <code>CLAUDE.md</code> or a couple of helper modules. It was a clearer mental model of where AI development cost actually comes from inside Glia. Once that became visible, decision-making changed. Some optimizations were worth shipping immediately. Others were worth deferring. That distinction is the difference between optimization as engineering and optimization as noise.</p>
<blockquote>
<p>&quot;This was not a refactor sprint. It was a governance pass. The goal was to reduce real AI development cost without confusing motion for progress.&quot;</p>
</blockquote>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/glia-token-cost">https://dearartist.xyz/blog/glia-token-cost</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>From cat to bat: A Small CLI Change That Made My Terminal Better</title>
      <link>https://dearartist.xyz/blog/bat-setup</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/bat-setup</guid>
      <pubDate>Thu, 16 Apr 2026 10:00:00 GMT</pubDate>
      <description><![CDATA[Replacing cat with bat for a nicer terminal.]]></description>
      <category>cli</category>
      <category>terminal</category>
      <category>developer-tools</category>
      <content:encoded><![CDATA[<p>A personal walkthrough of installing and configuring <a href="https://github.com/sharkdp/bat">bat</a> — the cat clone with syntax highlighting, Git integration, and theme support.</p>
<p>I don&#39;t remember exactly when I first heard about bat. It was probably a passing mention in someone&#39;s dotfiles repo, or a one-liner in a Hacker News comment. The premise was simple: it&#39;s like <code>cat</code>, but with syntax highlighting, line numbers, and Git-aware diff markers. I installed it, tried it once, and then forgot about it for weeks.</p>
<p>When I finally came back to it, I realized there was more to configure than I expected. The defaults are fine, but getting it to feel right — choosing a theme, understanding how paging works, wiring it into my shell — took a bit of reading and trial and error. This post is that process, written down so I can reference it later and so anyone else starting from scratch doesn&#39;t have to piece it together from scattered GitHub issues.</p>
<h2>What bat Actually Does</h2>
<p>If you&#39;ve used <code>cat</code> to quickly dump a file to the terminal, bat does the same thing — but makes the output substantially easier to read. Here&#39;s what it adds:</p>
<ul>
<li><strong>Syntax highlighting.</strong> bat detects the file type and applies color to code, config files, Markdown, JSON, YAML, and hundreds of other formats. You see structure instead of a wall of monochrome text.</li>
<li><strong>Line numbers.</strong> Every line gets a number in the gutter, which makes it easier to reference specific lines when debugging or discussing code.</li>
<li><strong>Git modification markers.</strong> If the file is inside a Git repository, bat shows additions, deletions, and modifications in the left margin — similar to what you&#39;d see in a code editor.</li>
<li><strong>Automatic paging.</strong> When the output is longer than your terminal window, bat pipes it through a pager (like <code>less</code>) so you can scroll. Short files print directly.</li>
<li><strong>Theme support.</strong> bat ships with dozens of themes and supports custom ones. You can match your terminal&#39;s color scheme exactly.</li>
<li><strong>It just feels nicer.</strong> This is subjective, but looking at a syntax-highlighted config file is meaningfully less fatiguing than reading raw plaintext. Small thing, real difference.</li>
</ul>
<p>bat was originally created by <a href="https://github.com/sharkdp">sharkdp</a>. It&#39;s open source, written in Rust, and actively maintained. What follows is entirely about my own experience setting it up and using it, not a fork or modification of the project itself.</p>
<h2>Why I Even Cared</h2>
<p>I spend a lot of time reading config files, checking environment variables, and scanning log output in the terminal. Most of the time I&#39;m not editing — I&#39;m just looking. And plain <code>cat</code> output makes that harder than it needs to be. No line numbers, no color, no context. For short files it&#39;s fine. For anything over twenty lines, I&#39;d find myself opening the file in an editor just to get syntax highlighting, which felt excessive for a read-only glance.</p>
<p>bat solves exactly that problem. It turns &quot;let me quickly look at this file&quot; into something that actually works, without requiring me to leave the terminal or launch a heavier tool.</p>
<p><img src="https://dearartist.xyz/substack-visuals/bat-setup/bat-setup-cat-vs-bat-zshrc-card.png" alt="Side-by-side terminal comparison of cat vs bat output for a .zshrc file" /></p>
<p><em>cat vs bat — same file, different reading experience.</em></p>
<h2>My Actual Setup Journey</h2>
<p>I installed bat via Homebrew — <code>brew install bat</code> — and ran it against a random file. It worked. Colors appeared. I thought: great, done.</p>
<p>Then I ran <code>bat --list-themes</code> to see what other themes were available, and got hit with a wall of output — every single built-in theme rendering the same sample file, one after another, scrolling endlessly through the pager. It was useful information, but overwhelming in presentation. I didn&#39;t know which theme I was even looking at half the time because the theme name scrolled past before I could read it.</p>
<p>That&#39;s when I learned the difference between bat&#39;s own output and the pager that wraps it. By default, when writing to a terminal, bat sends its output through <code>less</code>, which is great for long files but confusing when you&#39;re trying to visually compare themes. Running <code>bat --list-themes --paging=never</code> let me see all themes in a continuous stream, which was much easier to scan.</p>
<p><img src="https://dearartist.xyz/substack-visuals/bat-setup/bat-setup-list-themes-mockup.png" alt="Stacked terminal previews of bat themes (Dracula, Catppuccin Mocha selected, GitHub Light)" /></p>
<p><em>bat --list-themes — previewing color schemes side by side.</em></p>
<p>Next came the config file. bat supports a persistent configuration file so you don&#39;t have to pass flags every time. Finding it was its own small adventure:</p>
<pre><code class="language-zsh">$ bat --config-file
/Users/yuanh/.config/bat/config
</code></pre>
<p>The file didn&#39;t exist yet, so I created it first (<code>bat --generate-config-file</code> writes a default one to that path) and then opened it:</p>
<pre><code class="language-zsh">$ open -e &quot;$(bat --config-file)&quot;
</code></pre>
<p>From there I started experimenting. I tried a few themes, toggled different style options on and off, and eventually settled on a configuration that felt right. I chose Catppuccin Mocha — it matches the palette I already use in my editor and terminal, so the colors don&#39;t clash when I switch between tools.</p>
<p><img src="https://dearartist.xyz/substack-visuals/bat-setup/bat-setup-config-file-mockup.png" alt="Editor view of ~/.config/bat/config showing theme, style, and syntax mapping" /></p>
<p><em>Editing ~/.config/bat/config.</em></p>
<h2>The Confusing Parts</h2>
<p>Two things tripped me up early on, and I think they&#39;d trip up most people who aren&#39;t already familiar with how Unix pagers work.</p>
<p><strong>Pager output vs. bat output.</strong> bat sends its colored, styled output to a pager by default. The pager controls scrolling, search, and when the output disappears. If you&#39;re used to <code>cat</code>, which just dumps everything and returns to the prompt, this feels different. You&#39;re suddenly in a scrollable view and you have to press <code>q</code> to exit. It&#39;s not wrong — it&#39;s actually better for long files — but it takes a moment to understand what&#39;s happening.</p>
<p><strong>Theme previews are overwhelming.</strong> bat ships with a lot of themes. Running <code>--list-themes</code> shows all of them at once. There&#39;s no &quot;pick from a menu&quot; experience. You scroll through a long list, try to remember which names you liked, then set one in your config and rerun bat to see it. It works, but it&#39;s not a polished selection flow. I eventually just searched for Catppuccin by name after deciding on it separately.</p>
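<p>A small workaround for the missing menu is to preview only a shortlist, one theme at a time, against a file you actually read. This is a rough sketch and not part of bat itself: the helper names are mine, and it relies only on bat&#39;s <code>--theme</code>, <code>--paging</code>, and <code>--color</code> flags.</p>

```python
import shutil
import subprocess


def preview_command(theme: str, path: str) -> list[str]:
    """Build a bat invocation that renders one file in one theme, unpaged."""
    return ["bat", f"--theme={theme}", "--paging=never", "--color=always", path]


def preview_shortlist(themes: list[str], path: str) -> None:
    """Print the same file once per theme, with the theme name as a banner."""
    for theme in themes:
        print(f"\n=== {theme} ===", flush=True)
        subprocess.run(preview_command(theme, path), check=False)


if __name__ == "__main__":
    # Example shortlist; swap in any names printed by `bat --list-themes`.
    if shutil.which("bat"):
        preview_shortlist(["Dracula", "Catppuccin Mocha"], "/etc/hosts")
```

<p>The banner solves the &quot;which theme am I looking at&quot; problem, and keeping the list short makes the comparison bearable.</p>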
<h2>My Final Configuration</h2>
<p>Here&#39;s what I ended up with. The config file lives at the path returned by <code>bat --config-file</code>.</p>
<pre><code class="language-conf"># Theme
--theme=&quot;Catppuccin Mocha&quot;

# Display style
--style=&quot;numbers,changes,header&quot;

# Paging behavior
--paging=auto

# Typography
--italic-text=always
--tabs=4

# Syntax mappings
--map-syntax=&quot;*.ino:C++&quot;
--map-syntax=&quot;.ignore:Git Ignore&quot;
</code></pre>
<p>And the alias and pager setting I added to <code>~/.zshrc</code>:</p>
<pre><code class="language-zsh"># Use bat as default cat replacement
alias cat=&#39;bat --paging=never&#39;

# Configure bat&#39;s pager
export BAT_PAGER=&quot;less -RF&quot;
</code></pre>
<p>After saving, reload the shell and verify:</p>
<pre><code class="language-zsh">$ source ~/.zshrc

$ bat ~/.zshrc
  # outputs .zshrc with syntax highlighting, line numbers, and header

$ cat ~/.zshrc
  # now uses bat under the hood, but without paging

$ bat --list-themes --paging=never
  # browse all available themes in a continuous stream

$ bat /etc/hosts
  # quick system file inspection with full styling
</code></pre>
<h2>Before / After</h2>
<p>The difference is easier to feel than to describe. Here&#39;s a rough comparison of viewing the same file with <code>cat</code> versus a configured <code>bat</code>.</p>
<p><strong>Left — <code>cat .zshrc</code> (plain, monochrome):</strong></p>
<pre><code class="language-zsh"># Path exports
export PATH=&quot;/usr/local/bin:$PATH&quot;
export EDITOR=&quot;vim&quot;

# Aliases
alias ll=&#39;ls -la&#39;
alias cat=&#39;bat --paging=never&#39;

# Starship
eval &quot;$(starship init zsh)&quot;
</code></pre>
<p><strong>Right — <code>bat .zshrc</code> (line numbers, syntax highlighting, Git <code>+</code> marker on changed line 7):</strong></p>
<pre><code class="language-zsh">1  # Path exports
2  export PATH=&quot;/usr/local/bin:$PATH&quot;
3  export EDITOR=&quot;vim&quot;
4
5  # Aliases
6  alias ll=&#39;ls -la&#39;
7+ alias cat=&#39;bat --paging=never&#39;
8
9  # Starship
10 eval &quot;$(starship init zsh)&quot;
</code></pre>
<p><img src="https://dearartist.xyz/substack-visuals/bat-setup/bat-setup-cat-vs-bat-zshrc.png" alt="Side-by-side comparison: cat vs bat on .zshrc, showing syntax highlighting, line numbers, and a Git change marker" /></p>
<p><em>Left: plain cat. Right: bat with Catppuccin Mocha, line numbers, and Git markers.</em></p>
<p><img src="https://dearartist.xyz/substack-visuals/bat-setup/bat-setup-cat-vs-bat-package-json.png" alt="Side-by-side comparison: cat vs bat on package.json, showing JSON keys highlighted blue and a Git change marker" /></p>
<p><em>Comparing terminal output before and after configuration.</em></p>
<h2>What Changed in Daily Use</h2>
<p>bat is not a transformative tool. It doesn&#39;t change what you can do — it changes how it feels to do the things you already do. And that matters more than I expected.</p>
<p>Terminal output is easier to read. I don&#39;t skim past important lines as often because syntax highlighting creates visual structure that my eyes can follow. Comments look different from code. Strings look different from keywords. That&#39;s it, but it&#39;s enough.</p>
<p>Config files feel less intimidating. Opening a long <code>.env</code> or Nginx config with bat makes it immediately parseable. The line numbers let me say &quot;look at line 47&quot; instead of &quot;scroll down a bit, it&#39;s somewhere in the middle.&quot;</p>
<p>The friction for everyday file inspection dropped. I used to open files in VS Code just to glance at them. Now I stay in the terminal. It&#39;s a small workflow change, but it compounds across dozens of file reads per day.</p>
<p>It&#39;s a small tool. It improves the texture of daily development work. That&#39;s the best way I can describe it.</p>
<h2>Credits and Links</h2>
<p>bat was created and is maintained by <a href="https://github.com/sharkdp">sharkdp</a>. This post is not affiliated with the project; it&#39;s a personal write-up about my own setup experience and daily usage.</p>
<ul>
<li><a href="https://github.com/yuannh/bat">My shared repo</a></li>
<li><a href="https://github.com/sharkdp/bat">Original project by sharkdp</a></li>
</ul>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/bat-setup">https://dearartist.xyz/blog/bat-setup</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Fixing &quot;Repetitive and Contrived&quot; AI Stories as a Structural Systems Problem</title>
      <link>https://dearartist.xyz/blog/glia-theme-debugging</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-theme-debugging</guid>
      <pubDate>Wed, 15 Apr 2026 20:00:00 GMT</pubDate>
      <description><![CDATA[Why "repetitive" AI output is a systems issue.]]></description>
      <category>glia</category>
      <category>ai</category>
      <category>debugging</category>
      <content:encoded><![CDATA[<p><em>A user complaint that looked like a writing issue turned out to be an upstream problem in entity modeling, theme deduplication, and narrative assignment.</em></p>
<blockquote>
<p><em>&quot;What looked like a writing problem was actually a systems problem.&quot;</em></p>
</blockquote>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
<th>Detail</th>
</tr>
</thead>
<tbody><tr>
<td>Ready themes</td>
<td>23 → 9</td>
<td>After dedup + suppression</td>
</tr>
<tr>
<td>Org entities</td>
<td>13 → 10</td>
<td>3 spelling variants merged</td>
</tr>
<tr>
<td>Overlap pairs</td>
<td>13 → 3</td>
<td>Jaccard ≥ 0.50</td>
</tr>
<tr>
<td>Suppressed</td>
<td>14</td>
<td>5 overlap + 9 min-members</td>
</tr>
</tbody></table>
<h2>TL;DR</h2>
<blockquote>
<p><strong>Summary</strong></p>
<p>Sometimes the most revealing user feedback sounds subjective. In this case, the feedback was simple: &quot;These stories all feel repetitive and contrived.&quot; The obvious reaction would have been to tweak prompts, soften the writing style, or make the output sound more natural. But after auditing the full data path in <a href="https://gliahq.com">Glia ↗</a>&#39;s theme story system, it became clear that this was not primarily a composition problem. It was a structural systems problem. A three-phase fix (low-signal suppression → overlap dedup → fuzzy entity merge) brought ready themes from <strong>23</strong> down to <strong>9</strong>, and overlap pairs from <strong>13</strong> down to <strong>3</strong>.</p>
</blockquote>
<h2>Background: What Theme Stories Are Supposed to Do</h2>
<p>Glia is trying to do something unusually difficult: turn a user&#39;s ongoing conversations, places, projects, relationships, and recurring reflections into a narrative interface they can revisit and make sense of.</p>
<p>A theme story is not just a summary. It is meant to act more like a durable narrative unit around an ongoing topic in a person&#39;s life:</p>
<ul>
<li>A long-running project</li>
<li>An important relationship</li>
<li>A place that shapes a phase of life</li>
<li>A recurring value, tension, or reflective thread</li>
</ul>
<p>When this works, the system feels like it genuinely understands what the user is living through.</p>
<p>When it fails, the effect is immediate:</p>
<blockquote>
<p><em>&quot;Why am I seeing the same thing over and over, just rewritten with a different title?&quot;</em></p>
</blockquote>
<p>That failure mode is what this post is about.</p>
<h2>The User Complaint: 8–9 Moments, but 21+ Stories</h2>
<p>The user was based in Singapore and gave direct feedback that the generated stories felt repetitive and contrived.</p>
<p>For a narrative product, repetition is not just a stylistic issue. It is a trust problem. Once the system starts rewriting the same thing as multiple stories, users stop feeling understood and start feeling processed.</p>
<p>The QA snapshot for this user looked like this:</p>
<pre><code class="language-sql">-- user_audit_snapshot.sql
-- User data snapshot (Singapore user)
-- ─────────────────────────────────────────
  metric                    │ value
 ──────────────────────────┼───────
  Messages                  │ 109
  Distinct threads           │ 21
  Ready theme stories        │ 23
  Active org entities        │ 13
  Duplicate org variants     │ 4 (same project)
  Overlap pairs (J ≥ 0.50)  │ 13
  Single-member themes       │ 9
 ──────────────────────────┴───────
</code></pre>
<p>Even before reading any generated text, the shape of the data already looked wrong.</p>
<p>Twenty-one threads should not naturally produce twenty-three ready themes. That meant the system was not distilling themes. It was <strong>amplifying fragments</strong>.</p>
<h2>Why I Did Not Change the Prompt</h2>
<p>At first glance, it would have been easy to blame the language model: maybe the prose was too literary, maybe the prompt was too eager to find meaning, maybe the titles were over-written.</p>
<p>But when I traced the chain upstream, the problem was already present before composition ever ran.</p>
<p>If the system splits one real-world project into multiple entities, creates separate themes for each variant, and then feeds many of the same threads into each of those themes, no prompt can save the result. Even a perfectly restrained writing prompt would still produce outputs that feel repetitive and forced.</p>
<blockquote>
<p><em>The writing layer was not causing the issue. It was faithfully exposing it.</em></p>
</blockquote>
<p>The real failure was in theme formation.</p>
<h2>Root Cause Analysis: Three Gaps in One Bad Chain</h2>
<h3>1. Entity Fragmentation</h3>
<p>The user was really talking about one project, but the system had created multiple org entities for it:</p>
<pre><code class="language-text"># entity_variants
Project Kaiwen
Project Kaiwan
Project Kai Wen
Kawen
</code></pre>
<p>These were not four distinct projects. They were spacing, spelling, and prefix variants of the same one. Without canonical convergence at the entity layer, every downstream stage would continue to treat them as separate sources of meaning.</p>
<h3>2. Theme Duplication</h3>
<p>Once entity variants exist, theme creation can easily multiply them. If deduplication only checks exact string equality, then <code>kawen</code>, <code>project kaiwen</code>, and <code>project kai wen</code> all survive as separate theme candidates.</p>
<p>So one entity-level split quickly becomes multiple theme states.</p>
<h3>3. Member Over-assignment</h3>
<p>Even with some duplication upstream, a stricter membership layer could have reduced the damage. But in this case, member assignment behaved more like independent attraction than coordinated allocation. Similar themes each absorbed overlapping threads using keyword overlap, without enough cross-theme restraint.</p>
<p>The result:</p>
<ul>
<li>Theme A contains a set of threads</li>
<li>Theme B contains most of the same threads</li>
<li>Theme C contains them again</li>
</ul>
<p>By the time the user sees the output, it looks like multiple stories with different titles but almost the same underlying material.</p>
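<p>The over-assignment described above can be measured directly from theme membership. The following is a minimal sketch, not the audit script used for the snapshot: the function name <code>overlap_pairs</code> and the sample theme data are hypothetical, and only the 0.50 Jaccard threshold comes from the audit.</p>

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))


def overlap_pairs(
    themes: dict[str, set[str]], threshold: float = 0.50
) -> list[tuple[str, str, float]]:
    """Return every pair of themes whose member overlap meets the threshold."""
    pairs = []
    for key_a, key_b in combinations(sorted(themes), 2):
        score = jaccard(themes[key_a], themes[key_b])
        if score >= threshold:
            pairs.append((key_a, key_b, score))
    return pairs


# Hypothetical data: three themes built from nearly the same threads.
themes = {
    "theme_a": {"t1", "t2", "t3"},
    "theme_b": {"t1", "t2", "t3", "t4"},
    "theme_c": {"t2", "t3"},
    "theme_d": {"t9"},
}

for key_a, key_b, score in overlap_pairs(themes):
    print(f"{key_a} ~ {key_b}: jaccard={score:.2f}")
```

<p>Counting pairs rather than themes is useful here because it makes the duplication visible before any generated text is read.</p>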
<h2>Root Cause Chain</h2>
<pre><code>┌──────────────┐   ┌──────────────┐   ┌────────────────┐   ┌──────────────┐
│   Entity     │ → │    Theme     │ → │    Member      │ → │  Repetitive  │
│ Fragmentation│   │ Duplication  │   │ Over-assignment│   │   Stories    │
└──────────────┘   └──────────────┘   └────────────────┘   └──────────────┘
</code></pre>
<h2>Designing the Fix: Stop the Bleeding, Then Close the Source</h2>
<p>I did not want to solve this with a large architectural rewrite. The safer path was a phased, reversible repair strategy.</p>
<h3>Phase D1: Suppress Low-Signal Themes</h3>
<p>The quickest and lowest-risk win was to stop composing stories from themes backed by a single member thread. One thread is usually a fragment rather than a durable topic, so it rarely deserves a standalone theme story.</p>
<pre><code class="language-bash"># feature_flags.env
GLIA_THEME_MIN_MEMBERS=2
</code></pre>
<p>If a theme has fewer than two member threads, it is suppressed before composition. This immediately removes a large class of shallow, low-signal stories.</p>
<h3>Phase D2: Deduplicate Themes by Member Overlap</h3>
<p>Next, I added a guard against duplicate themes at the theme level. If a candidate theme&#39;s <code>member_thread_ids</code> overlap too heavily with an older ready theme, the newer one is suppressed.</p>
<pre><code class="language-bash"># feature_flags.env
GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD=0.50
</code></pre>
<p>This is not the final form of cross-theme allocation, but it is an effective v1. It directly reduces the symptom the user actually experiences: repeated stories built from the same thread set.</p>
<h3>Phase D3: Fuzzy Merge for Org Entities</h3>
<p>Finally, I addressed the source of the duplication. I added conservative fuzzy matching to the org entity resolve path, specifically to catch spelling variants, spacing variants, and prefix noise. The matcher works by:</p>
<ul>
<li>Removing leading noise tokens like &quot;project&quot;</li>
<li>Collapsing spaces</li>
<li>Comparing core forms</li>
<li>Applying a conservative Levenshtein threshold</li>
</ul>
<pre><code class="language-bash"># feature_flags.env
GLIA_ORG_ENTITY_FUZZY_ENABLED=1
</code></pre>
<p>I kept this intentionally conservative and avoided broad fuzzy merge behavior for people entities.</p>
<h3>Full Feature Flag Configuration</h3>
<pre><code class="language-bash"># glia-core/.env (theme repair flags)
GLIA_THEME_MIN_MEMBERS=2
GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD=0.50
GLIA_ORG_ENTITY_FUZZY_ENABLED=1
</code></pre>
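<p>For readers who want to see how flags like these might be read, here is a minimal sketch. The <code>ThemeRepairFlags</code> class and its fallback values are assumptions for illustration; the post only shows the values being read from <code>settings</code>, not how <code>settings</code> is populated. The truthy-string parsing mirrors the pattern used for another Glia flag elsewhere on this blog.</p>

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ThemeRepairFlags:
    """Hypothetical container for the three repair flags."""

    min_members: int
    overlap_dedup_threshold: Optional[float]
    org_fuzzy_enabled: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ThemeRepairFlags":
        env = os.environ if env is None else env
        raw_threshold = env.get("GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD", "").strip()
        return cls(
            # Assumed fallbacks: every guard is off when its flag is unset.
            min_members=int(env.get("GLIA_THEME_MIN_MEMBERS", "0")),
            overlap_dedup_threshold=float(raw_threshold) if raw_threshold else None,
            org_fuzzy_enabled=env.get("GLIA_ORG_ENTITY_FUZZY_ENABLED", "0").strip().lower()
            in _TRUTHY,
        )


flags = ThemeRepairFlags.from_env(
    {
        "GLIA_THEME_MIN_MEMBERS": "2",
        "GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD": "0.50",
        "GLIA_ORG_ENTITY_FUZZY_ENABLED": "1",
    }
)
print(flags)
```

<p>Representing an unset threshold as <code>None</code> rather than <code>0</code> is a deliberate choice in this sketch: a threshold of zero would match every pair of themes.</p>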
<h2>What Changed in Code</h2>
<h3>1. Low-Signal Theme Suppression</h3>
<pre><code class="language-python"># theme_story_compose.py
min_members = settings.GLIA_THEME_MIN_MEMBERS

if min_members &gt; 0 and len(member_thread_ids) &lt; min_members:
    suppress_theme(
        reason=&quot;min_members_guard&quot;,
        detail=f&quot;members={len(member_thread_ids)} &lt; threshold={min_members}&quot;,
    )
    return
</code></pre>
<p>This was a small change, but it removed a large number of thin themes that never should have composed in the first place.</p>
<h3>2. Jaccard-Based Theme Deduplication</h3>
<pre><code class="language-python"># theme_dedup.py
def jaccard(a: set[str], b: set[str]) -&gt; float:
    if not a or not b:
        return 0.0
    return len(a &amp; b) / len(a | b)

score = jaccard(
    set(current.member_thread_ids),
    set(existing.member_thread_ids),
)

if score &gt;= settings.GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD:
    suppress_theme(
        reason=&quot;member_overlap_dedup&quot;,
        detail=f&quot;overlap_with={existing.theme_key}, jaccard={score:.2f}&quot;,
    )
</code></pre>
<p>This let the system reject duplicate themes without rewriting the entire membership architecture.</p>
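<p>To show the older-theme-wins rule end to end, here is a self-contained sketch. The <code>Theme</code> dataclass, the <code>created_order</code> field, and the sample data are hypothetical stand-ins; the real system compares against existing ready themes in the database.</p>

```python
from dataclasses import dataclass


@dataclass
class Theme:
    theme_key: str
    created_order: int  # lower means older; stand-in for a creation timestamp
    member_thread_ids: set[str]


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))


def dedup_by_member_overlap(
    themes: list[Theme], threshold: float = 0.50
) -> tuple[list[Theme], list[tuple[str, str]]]:
    """Keep older themes; suppress newer ones that overlap a kept theme."""
    kept: list[Theme] = []
    suppressed: list[tuple[str, str]] = []  # (suppressed_key, overlaps_with_key)
    for theme in sorted(themes, key=lambda t: t.created_order):
        duplicate_of = next(
            (
                k
                for k in kept
                if jaccard(theme.member_thread_ids, k.member_thread_ids) >= threshold
            ),
            None,
        )
        if duplicate_of is None:
            kept.append(theme)
        else:
            suppressed.append((theme.theme_key, duplicate_of.theme_key))
    return kept, suppressed


# Hypothetical sample: two variants of one project plus an unrelated theme.
sample = [
    Theme("kawen", 2, {"t1", "t2", "t3", "t4"}),
    Theme("project kaiwen", 1, {"t1", "t2", "t3"}),
    Theme("singapore", 3, {"t3", "t7", "t8", "t9"}),
]
kept, suppressed = dedup_by_member_overlap(sample)
print([t.theme_key for t in kept])
print(suppressed)
```

<p>In this sketch a candidate is only compared against themes that were kept, so a suppressed theme can never cause a further suppression.</p>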
<h3>3. Org Entity Fuzzy Merge</h3>
<pre><code class="language-python"># org_entity_resolve.py
def _org_core_collapsed(name: str) -&gt; str:
    parts = normalize(name).split()
    if parts and parts[0] in {&quot;project&quot;, &quot;the&quot;, &quot;my&quot;}:
        parts = parts[1:]
    return &quot;&quot;.join(parts)

def _levenshtein(a: str, b: str) -&gt; int:
    # standard dynamic programming implementation
    ...

def _find_fuzzy_org_candidate(
    name: str,
    candidates: list[str],
) -&gt; str | None:
    core = _org_core_collapsed(name)
    for candidate in candidates:
        other = _org_core_collapsed(candidate)
        if len(core) &gt;= 4 and _levenshtein(core, other) &lt;= 2:
            return candidate
    return None
</code></pre>
<p>This made it possible for variants like Kaiwen, Kaiwan, Kai Wen, and Kawen to converge to a single canonical entity rather than continuing to branch.</p>
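<p>The excerpt above elides the edit-distance helper and does not show <code>normalize</code>. The sketch below fills both in so the four variants can be checked by hand. The <code>normalize</code> behavior (lowercase, collapse whitespace) is an assumption, and the Levenshtein function is a standard two-row dynamic programming version, not necessarily the one in the codebase.</p>

```python
def normalize(name: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def org_core_collapsed(name: str) -> str:
    """Drop one leading noise token, then join the remaining parts."""
    parts = normalize(name).split()
    if parts and parts[0] in {"project", "the", "my"}:
        parts = parts[1:]
    return "".join(parts)


def levenshtein(a: str, b: str) -> int:
    """Edit distance using two rows of the dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


variants = ["Project Kaiwen", "Project Kaiwan", "Project Kai Wen", "Kawen"]
canonical = org_core_collapsed("Kawen")
for variant in variants:
    core = org_core_collapsed(variant)
    print(f"{variant!r}: core={core!r}, distance={levenshtein(core, canonical)}")
```

<p>Every variant lands within the distance threshold of 2 once the prefix and spaces are removed, which is why stripping noise before comparing matters more than loosening the threshold.</p>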
<h2>QA Validation: Did the Fix Actually Work?</h2>
<p>The important question was not whether the code looked reasonable. It was whether the user-level outcome improved.</p>
<p>I ran a before/after audit on the target user in QA.</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody><tr>
<td>Ready themes</td>
<td>23</td>
<td><strong>9</strong></td>
</tr>
<tr>
<td>Active org entities</td>
<td>13</td>
<td><strong>10</strong></td>
</tr>
<tr>
<td>Overlap pairs</td>
<td>13</td>
<td><strong>3</strong></td>
</tr>
<tr>
<td>Single-member themes</td>
<td>9</td>
<td>—</td>
</tr>
<tr>
<td>Suppressed themes</td>
<td>—</td>
<td><strong>14</strong> (5 overlap + 9 min-members)</td>
</tr>
</tbody></table>
<pre><code class="language-bash"># repair_user_themes.sh
# QA verify
docker compose exec -T api python scripts/repair_user_a45d_themes.py --dry-run

# deploy
git push origin main
ssh glia-qa &quot;cd ~/glia-core &amp;&amp; docker compose pull &amp;&amp; docker compose up -d&quot;
</code></pre>
<pre><code class="language-text"># before_after_output.txt
Before:
  Ready themes: 23
  Active org entities: 13
  Overlap pairs: 13

After:
  Ready themes: 9
  Active org entities: 10
  Overlap pairs: 3
</code></pre>
<p>Those remaining 3 overlap pairs were not bugs. They represented natural cross-theme sharing between truly different narrative angles: a close collaborator as a person, Singapore as a lived context, New Zealand as a retreat plan, The Future of Truth as a reflective theme. They share some threads, but they are still distinct stories.</p>
<h3>Surviving Narrative Angles</h3>
<p>After the repair, the surviving themes made sense as separate angles rather than duplicate rewrites:</p>
<ul>
<li>Kawen: the core project arc</li>
<li>User E: personal influence on the project vision</li>
<li>Co-working Space: work environment and network</li>
<li>User T: quiet support</li>
<li>Restless Egg: place and internal tension</li>
<li>Singapore: life context and self-compassion</li>
<li>The Future of Truth: reflective thread</li>
<li>New Zealand: retreat and planning</li>
<li>User R: creative follow-through</li>
</ul>
<p>That was the moment the original complaint stopped looking subjective. The data had changed shape in exactly the right way.</p>
<h2>Results and Engineering Impact</h2>
<p>On the surface, this looks like a simple reduction from 23 to 9. But the real improvement was not that the number got smaller. It was that the system stopped amplifying fragments and started converging on durable narrative units.</p>
<blockquote>
<p><strong>1. Better Narrative Trust</strong></p>
<p>The system now behaves more like a thematic compressor and less like a duplication engine. When users see fewer, more distinct stories, the product feels like it understands their life rather than reprocessing their data.</p>
</blockquote>
<blockquote>
<p><strong>2. Correct Diagnosis Over Superficial Patching</strong></p>
<p>The most important decision in this repair was not a code trick. It was resisting the urge to patch the writing layer before understanding the upstream structure. Prompt tuning would have masked the symptom while leaving the structural problem intact.</p>
</blockquote>
<blockquote>
<p><strong>3. Safer Rollout Through Feature Flags</strong></p>
<p>I shipped the fix behind three reversible flags: <code>GLIA_THEME_MIN_MEMBERS</code>, <code>GLIA_THEME_MEMBER_OVERLAP_DEDUP_THRESHOLD</code>, and <code>GLIA_ORG_ENTITY_FUZZY_ENABLED</code>. That made it easy to validate in QA, tune thresholds, and roll back safely if needed.</p>
</blockquote>
<h2>What Still Remains Imperfect</h2>
<p>I did not implement full thread-level cross-theme exclusivity. That would require a more global membership architecture, because the current refresh flow runs mostly per theme rather than per user with a joint allocation pass.</p>
<p>For this stage, that heavier redesign was unnecessary. The combination of entity merge, low-signal suppression, and overlap dedup was enough to resolve the main user-facing failure mode.</p>
<p>A few surviving theme bodies still contained the old &quot;Kaiwen&quot; spelling, but that was only stale generated text from before the entity merge. It was a surface artifact, not a structural issue, and would disappear on recompose.</p>
<h2>How I Would Extend This in a Later Phase</h2>
<p>The natural next step would be a global thread allocation pass: instead of each theme independently attracting member threads via keyword similarity, run a single per-user allocation step that assigns each thread to its highest-scoring theme and enforces cross-theme exclusivity.</p>
<p>This would require reworking the refresh flow from a per-theme loop to a per-user joint optimization. It&#39;s a meaningful architectural change, and the current phased fix deliberately stops short of it. But the data model and feature flags are already in place to support it when the time is right.</p>
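<p>As a rough illustration of what a per-user allocation pass could look like, here is a minimal sketch. Nothing in it exists in Glia today: the function name <code>allocate_threads</code>, the score table, and the <code>min_score</code> cutoff are all hypothetical, and a real version would need a scoring function that is not specified in this post.</p>

```python
from collections import defaultdict


def allocate_threads(
    scores: dict[str, dict[str, float]], min_score: float = 0.0
) -> dict[str, list[str]]:
    """Assign each thread to exactly one theme: its highest-scoring one.

    `scores` maps thread_id to {theme_key: affinity}. Ties break on the
    alphabetically first theme_key so the result is deterministic.
    """
    allocation: dict[str, list[str]] = defaultdict(list)
    for thread_id in sorted(scores):
        candidates = scores[thread_id]
        if not candidates:
            continue
        best_theme = max(sorted(candidates), key=lambda key: candidates[key])
        if candidates[best_theme] > min_score:
            allocation[best_theme].append(thread_id)
    return dict(allocation)


# Hypothetical affinities for three threads and two themes.
scores = {
    "t1": {"kawen": 0.9, "singapore": 0.4},
    "t2": {"kawen": 0.3, "singapore": 0.7},
    "t3": {"kawen": 0.5, "singapore": 0.5},
}
print(allocate_threads(scores))
```

<p>Strict exclusivity like this would also remove the legitimate sharing seen in the three surviving overlap pairs, so a production version would probably need a softer rule.</p>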
<h2>Diagnostic Method: When Users Say AI Output Feels Fake</h2>
<p>One lesson from this repair is that when users say AI output feels fake, repetitive, or contrived, the problem is often not where the prose is written.</p>
<p>Before changing prompts, it is worth asking:</p>
<ol>
<li>Did the system split one real-world object into multiple identities?</li>
<li>Did it model similar themes more than once?</li>
<li>Did it assign the same source material into too many narrative units?</li>
</ol>
<p>If the answer to any of those is yes, then what looks like a content problem is probably a systems problem.</p>
<p>That is the kind of engineering problem I find most interesting: taking a vague human complaint and turning it into a concrete, testable, structural diagnosis.</p>
<h2>Engineering Takeaways</h2>
<ol>
<li>When generated output looks fake, audit upstream data structures before touching the prompt.</li>
<li>Entity fragmentation is a class of system degradation that is easy to overlook. Each spelling variant is multiplied again at every downstream stage.</li>
<li>Jaccard overlap is a simple but effective dedup signal. It is good enough as a pragmatic v1 before building a global allocation system.</li>
<li>Feature flags are not just for gradual rollouts. They are an expression of engineering judgment, letting you validate layer by layer and roll back with confidence.</li>
</ol>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/glia-theme-debugging">https://dearartist.xyz/blog/glia-theme-debugging</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>When &quot;No New Moments&quot; Wasn&apos;t a Simple Bug</title>
      <link>https://dearartist.xyz/blog/glia-thread-debugging</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-thread-debugging</guid>
      <pubDate>Wed, 15 Apr 2026 16:00:00 GMT</pubDate>
      <description><![CDATA[A debugging story about thread state.]]></description>
      <category>glia</category>
      <category>debugging</category>
      <content:encoded><![CDATA[<p><em>Debugging thread routing, gate logic, and LLM failure modes in <a href="https://gliahq.com">Glia ↗</a>.</em></p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
<th>Detail</th>
</tr>
</thead>
<tbody><tr>
<td>Messages → Thread</td>
<td>24 → 1</td>
<td>Over-merge on 4/14</td>
</tr>
<tr>
<td>Ready rate recovery</td>
<td>37.5% → 72.7%</td>
<td>After replay</td>
</tr>
<tr>
<td>Baseline</td>
<td>76%</td>
<td>Prior steady state</td>
</tr>
<tr>
<td>Root causes</td>
<td>3</td>
<td>Identified &amp; fixed</td>
</tr>
</tbody></table>
<h2>TL;DR</h2>
<blockquote>
<p><strong>Summary</strong></p>
<p>A user reported barely receiving new moments despite uploading many photos. After tracing through QA data, backend routing, gate prompts, and compose fallbacks, the real picture was more nuanced: the main issue was thread over-aggregation (24 messages merged into 1 thread), a secondary issue was an over-conservative story gate, and a diagnostic blind spot in compose validation made failures harder to debug. After targeted fixes and historical replay, the effective ready rate recovered from <strong>37.5%</strong> to <strong>72.7%</strong>, close to the <strong>76%</strong> baseline.</p>
</blockquote>
<h2>Background</h2>
<p>The original product feedback was simple:</p>
<blockquote>
<p><em>&quot;I&#39;ve uploaded quite a few photos recently, and they should be linkable to moments. But I haven&#39;t really been getting many moments in the past few days. I&#39;m wondering whether other users are seeing the same issue too.&quot;</em></p>
</blockquote>
<p>This was a good reminder that user-facing symptoms often compress several different failure modes into one sentence.</p>
<p>The temptation with LLM-backed systems is to jump straight to the most obvious theory: maybe image uploads are not triggering generation, maybe photos are not being converted into evidence, maybe media-only inputs are ignored, maybe the moment generation pipeline silently broke.</p>
<p>But the correct first move was not &quot;fixing.&quot; It was instrumented narrowing.</p>
<h2>The first hypothesis was wrong</h2>
<p>One early idea was that the user&#39;s recent activity might have been mostly photos with little or no text, and that media-only inputs were simply less likely to produce moments.</p>
<p>That hypothesis did not survive data validation.</p>
<p>After checking QA data directly, it became clear that her recent messages were not empty media posts. Most uploads were accompanied by text captions. So the issue was not &quot;photo-only content can&#39;t generate moments.&quot;</p>
<p>That mattered because it changed the entire debugging direction. Instead of asking why photo-only uploads fail to produce moments, the question became:</p>
<blockquote>
<p><em>&quot;Why is meaningful recent activity ending up with so few visible moments?&quot;</em></p>
</blockquote>
<h2>The data pattern that changed everything</h2>
<p>Looking at two adjacent windows of activity made the shape of the problem visible.</p>
<pre><code class="language-sql">-- user_e_thread_comparison.sql
-- User E&#39;s thread / output comparison
-- ─────────────────────────────────────────────────
  window       │ messages │ threads │ ready │ rate
 ──────────────┼──────────┼─────────┼───────┼──────
  4/03 – 4/08  │  ~40     │  8      │  6    │ 76%
  4/10 – 4/14  │  ~38     │  8      │  3    │ 37.5%
 ──────────────┴──────────┴─────────┴───────┴──────
</code></pre>
<p>The first thing that stood out was not just the drop in ready rate. It was how few threads the busiest day had produced.</p>
<pre><code>   24 messages   →   1 thread
</code></pre>
<p>On 4/14, the user sent 24 messages, but they all landed in one thread. Historically, their activity looked more like many short threads, multiple distinct topics, several chances to generate moments. On 4/14, it looked like one long continuous session with many topic shifts but only one story thread — therefore at most one ready moment.</p>
<p>That changed the priority stack immediately.</p>
<h2>Root Cause #1: Session-Based Thread Aggregation Was Over-Merging</h2>
<p>At the top of the stack was a design flaw in the thread router. The router had a session-aware aggregation path designed for diary-like usage: if messages arrived within a short time window, they were assumed to belong to the same thought stream.</p>
<p>In principle, that sounds reasonable. In practice, it was too blunt.</p>
<h3>The problem</h3>
<p>Messages were being merged into the most recent thread based almost entirely on time proximity:</p>
<pre><code class="language-python"># story_thread_router.py
session_enabled = os.getenv(
    &quot;GLIA_STORY_THREAD_SESSION_ENABLED&quot;, &quot;0&quot;
).strip().lower() in {&quot;1&quot;, &quot;true&quot;, &quot;yes&quot;, &quot;on&quot;}

session_window_min = max(
    1,
    _env_int(&quot;GLIA_STORY_THREAD_SESSION_WINDOW_MINUTES&quot;, 10)
)

if session_enabled and candidates and not debug.get(&quot;low_signal&quot;):
    most_recent = min(
        candidates,
        key=lambda c: (now - c.last_msg_at).total_seconds(),
    )
    gap_min = max(
        0.0,
        (now - most_recent.last_msg_at).total_seconds() / 60.0
    )
    if gap_min &lt;= float(session_window_min):
        return most_recent.story_thread_id
</code></pre>
<p>The key issue was not that session aggregation existed. It was that it bypassed semantic boundary checks. That meant: no robust topic continuity check, no hard cap on thread growth, no protection against &quot;continuous chat about multiple unrelated things.&quot;</p>
<p>In User E&#39;s 4/14 case, that produced exactly the wrong behavior:</p>
<ul>
<li>playlist</li>
<li>outfit</li>
<li>dinner party prep</li>
<li>cleaning</li>
<li>running</li>
<li>reflections about people</li>
</ul>
<p>All of that got swallowed into one story thread because each message arrived within the session window.</p>
<h3>Why this mattered so much</h3>
<p>Downstream, story compose was effectively bounded by thread granularity. In practice:</p>
<pre><code>1 thread → 1 daily compose → ≤ 1 ready moment
</code></pre>
<p>So if routing crushed four meaningful clusters into one thread, every downstream component was already operating on a degraded input structure. This is why I considered router over-merge the true upstream root cause.</p>
<h2>Root Cause #2: The Story Gate Was Too Conservative</h2>
<p>The second issue sat in the story gate. Even when a thread was coherent and meaningful, the gate sometimes suppressed it as too lightweight.</p>
<p>This was especially visible for content like:</p>
<ul>
<li>preparing for a dinner party</li>
<li>discussing outfits</li>
<li>reacting to a friend&#39;s playlist or gesture</li>
<li>small but real emotional movement in social life</li>
</ul>
<p>These are not dramatic events. But they are exactly the kind of material that memory-native products should preserve.</p>
<h3>Two examples</h3>
<p>A 3-message thread about playlist / outfit / social anticipation had been suppressed with a rationale along the lines of: <em>&quot;brief reaction / acknowledgment; low narrative signal.&quot;</em></p>
<p>A 6-message thread around dinner party prep was suppressed as: <em>&quot;short positive update without deeper narrative or new life event.&quot;</em></p>
<p>From a narrow &quot;major event detection&quot; lens, that might sound reasonable. From a product lens, it was too strict.</p>
<h3>The prompt was partially to blame</h3>
<p>The gate prompt contained a conservative instruction:</p>
<pre><code class="language-text"># gate_prompt.txt
If uncertain between none and story, choose none.
</code></pre>
<p>This sounds safe, but in practice it biases the model against the entire middle band of content: not trivial, not dramatic, but still personally meaningful.</p>
<p>The few-shot examples also leaned too hard toward extremes. There were not enough examples of everyday but meaningful social content. So the gate was not &quot;broken&quot; in the usual sense. It was miscalibrated.</p>
<h2>Root Cause #3: Compose Failure Was Harder to Diagnose Than It Should Have Been</h2>
<p>The third issue was smaller in impact, but important operationally. A high-value thread about self-discovery and relationship reflection failed during compose. The failure surfaced as:</p>
<pre><code class="language-text"># compose_logs
story_thread_compose_contract_invalid:unknown
</code></pre>
<p>That &quot;unknown&quot; was a problem in itself. It made failures harder to cluster and debug. After inspecting the compose path, I found a bad observability pattern: a redundant <code>try/except Exception: pass</code> was swallowing signal.</p>
<h3>Before</h3>
<pre><code class="language-python"># compose_validator.py — before
try:
    from app.validators.contracts import _validate_story_blocks as _vsb
    ok, reason = _vsb(payload)
except Exception:
    pass
</code></pre>
<h3>After</h3>
<pre><code class="language-python"># compose_validator.py — after
ok, reason = _validate_story_blocks(payload)

if sub_reason == &quot;unknown&quot;:
    logger.warning(
        &quot;compose_contract_sub_reason_unknown&quot;,
        extra={
            &quot;story_thread_id&quot;: story_thread_id,
            &quot;request_id&quot;: request_id,
            &quot;payload_keys&quot;: list(payload.keys()) if isinstance(payload, dict) else [],
            &quot;has_blocks&quot;: isinstance(payload, dict) and &quot;blocks&quot; in payload,
        },
    )
</code></pre>
<p>This did not magically fix compose quality. But it fixed something equally important: the ability to understand <em>why</em> compose failed. In LLM systems, sometimes you should improve generation. Sometimes you just need to stop losing the error signal.</p>
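<p>The same principle can be sketched in isolation. This is an alternative shape, not the change that shipped: the shipped fix removed the <code>try/except</code> entirely, while this sketch keeps one but records the failure and returns a specific reason. The validator and both function names are hypothetical stand-ins.</p>

```python
import logging

logger = logging.getLogger("compose_validator_sketch")


def validate_story_blocks(payload: object) -> tuple[bool, str]:
    """Hypothetical stand-in validator that returns (ok, reason)."""
    if not isinstance(payload, dict):
        return False, "payload_not_dict"
    blocks = payload.get("blocks")
    if not isinstance(blocks, list) or not blocks:
        return False, "blocks_missing_or_empty"
    return True, "ok"


def validate_with_attribution(payload: object, story_thread_id: str) -> tuple[bool, str]:
    """Run validation and always return a specific, loggable reason."""
    try:
        ok, reason = validate_story_blocks(payload)
    except Exception as exc:
        # Record the traceback instead of passing silently.
        logger.exception(
            "compose_contract_validator_crashed",
            extra={"story_thread_id": story_thread_id},
        )
        return False, f"validator_error:{type(exc).__name__}"
    if not ok:
        logger.warning(
            "compose_contract_invalid",
            extra={"story_thread_id": story_thread_id, "reason": reason},
        )
    return ok, reason


print(validate_with_attribution({"blocks": [{"type": "text"}]}, "thread-1"))
print(validate_with_attribution({"blocks": []}, "thread-2"))
```

<p>Either shape works as long as no code path can end with a failure and no recorded reason.</p>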
<h2>Root Cause Summary</h2>
<table>
<thead>
<tr>
<th>Priority</th>
<th>Title</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><strong>P0</strong></td>
<td>Over-merge in session routing</td>
<td>Session-aware aggregation merged semantically unrelated messages into a single thread based purely on time proximity.</td>
</tr>
<tr>
<td><strong>P1</strong></td>
<td>Over-conservative gate prompt</td>
<td>Gate suppressed everyday social and life-planning content as &#39;too lightweight&#39; due to conservative uncertainty policy.</td>
</tr>
<tr>
<td><strong>P2</strong></td>
<td>Poor failure attribution</td>
<td>Compose validation failures surfaced as <code>sub_reason=unknown</code>, with the diagnostic signal swallowed by a silent <code>except Exception: pass</code>.</td>
</tr>
<tr>
<td>—</td>
<td>Independent provider issue</td>
<td>Anthropic API returning HTTP 400 in QA, forcing fallback to OpenAI with different quality characteristics.</td>
</tr>
</tbody></table>
<h2>Fix strategy: solve the upstream quantity problem first</h2>
<p>Once the root causes were ranked, the repair sequence became straightforward.</p>
<pre><code>User feedback → QA validation → Code audit → Router fix (P0) → Gate fix (P1) → Replay → Provider split
   Symptom       Data audit     Root causes                                       Verification    Separate track
</code></pre>
<h3>P0 — Fix thread over-merge first</h3>
<p>I intentionally chose the smallest effective change rather than redesigning the entire routing system.</p>
<p><strong>Guard A: Cap session thread growth.</strong></p>
<pre><code class="language-python"># story_thread_router.py — guardrail
session_max_user_messages = _env_int(
    &quot;GLIA_STORY_THREAD_SESSION_MAX_USER_MESSAGES&quot;,
    8,
)

if session_enabled and candidates and gap_min &lt;= float(session_window_min):
    current_user_msgs = _count_user_messages(db, most_recent.story_thread_id)
    if (
        session_max_user_messages &gt; 0
        and current_user_msgs &gt;= session_max_user_messages
    ):
        debug[&quot;session_blocked&quot;] = &quot;max_user_messages&quot;
    else:
        return most_recent.story_thread_id
</code></pre>
<p>This avoids infinite growth in a single active session. Not a fancy model — a practical guardrail.</p>
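<p>A small simulation makes the effect of the cap concrete. This is a simplified model, not the router: it assumes a message that is blocked by the cap starts a new thread, whereas the real code only records the block and falls through to the rest of the routing logic. The function name <code>route_messages</code> and the timestamps are hypothetical.</p>

```python
from datetime import datetime, timedelta


def route_messages(
    timestamps: list[datetime],
    window_minutes: int = 10,
    max_user_messages: int = 8,
) -> list[int]:
    """Return a thread index per message using time proximity plus a size cap.

    Setting max_user_messages to 0 disables the cap, which reproduces
    time-only merging.
    """
    thread_ids: list[int] = []
    current_thread = -1
    current_size = 0
    last_at = None
    for ts in timestamps:
        within_window = (
            last_at is not None
            and window_minutes >= (ts - last_at).total_seconds() / 60.0
        )
        at_cap = max_user_messages > 0 and current_size >= max_user_messages
        if within_window and not at_cap:
            current_size += 1
        else:
            # Assumption: a blocked or out-of-window message opens a new thread.
            current_thread += 1
            current_size = 1
        thread_ids.append(current_thread)
        last_at = ts
    return thread_ids


# Hypothetical burst: 24 messages, five minutes apart.
start = datetime(2026, 4, 14, 9, 0)
burst = [start + timedelta(minutes=5 * i) for i in range(24)]
print(len(set(route_messages(burst, max_user_messages=0))))  # time-only merging
print(len(set(route_messages(burst, max_user_messages=8))))  # with the cap
```

<p>Under these assumptions the same burst that collapsed into one thread is split into three, without any semantic check at all.</p>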
<p><strong>Guard B: Topic-shift protection.</strong> I also added a topic-shift gate based on lexical overlap and Jaccard similarity. The initial version was directionally right but too sensitive in replay — it fragmented sessions too aggressively. So I rolled its default back to 0 and kept the cap-based protection as the primary deployed fix.</p>
<blockquote>
<p><em>Not every theoretically good guard belongs in the default config immediately.</em></p>
</blockquote>
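<p>For context, a lexical topic-shift check of the kind described in Guard B can be sketched as follows. This is an illustration only: the tokenizer, the <code>min_overlap</code> value, and the function names are assumptions, and the actual gate and its threshold are not shown in this post.</p>

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens of four or more characters (a crude stopword filter)."""
    return {word for word in re.findall(r"[a-z0-9']+", text.lower()) if len(word) >= 4}


def is_topic_shift(thread_text: str, new_message: str, min_overlap: float = 0.10) -> bool:
    """Flag a shift when token Jaccard similarity falls below min_overlap."""
    a, b = tokens(thread_text), tokens(new_message)
    if not a or not b:
        return False  # too little signal, so do not split on empty input
    similarity = len(a.intersection(b)) / len(a.union(b))
    return min_overlap > similarity


thread = "Making a playlist for the dinner party tonight"
on_topic = "Which songs should open the dinner party playlist"
off_topic = "Went for a long run by the river this morning"

print(is_topic_shift(thread, on_topic))
print(is_topic_shift(thread, off_topic))
```

<p>Short messages often share few tokens even when they continue the same topic, which is one plausible way a guard like this ends up splitting sessions too eagerly.</p>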
<h3>P1 — Relax the gate for meaningful everyday content</h3>
<p>The second repair was prompt-level. I updated the gate prompt with new few-shot examples (dinner prep, playlist/friend interaction, outfit/social anticipation), a softer uncertainty policy, and a clearer instruction:</p>
<pre><code class="language-text"># gate_prompt.txt — updated
Everyday social moments that reflect a real experience,
relationship dynamic, preparation, anticipation, or emotional reaction
can still be worth preserving as a story.

If uncertain, lean none —
unless the thread clearly reflects a real lived moment.
</code></pre>
<p>This was a better first move than changing scoring formulas, because it directly addressed the model&#39;s decision boundary without broadening the whole system indiscriminately.</p>
<h3>P2 — Improve validation observability</h3>
<p>Finally, I cleaned up contract-failure attribution: the failure signal is no longer swallowed, unresolved attribution is logged properly, and future compose failures are easier to inspect.</p>
<h2>Verifying the fixes</h2>
<p>One of the most important parts of this case was resisting the urge to declare victory based only on code changes.</p>
<p>At first, the fixes were deployed, but the user had not sent any new messages yet. That meant no new router runs, no new gate decisions, and no real-world verification. So I took a safer middle path: do not rewrite historical thread assignments, but replay specific suppressed threads that were safe to re-run.</p>
<pre><code class="language-text"># replay_results.log
replay job: 3 historical threads
──────────────────────────────────────────────────────
 thread                        │ previous   │ result
──────────────────────────────┼────────────┼─────────
 self-discovery + reflection   │ compose ✗  │ ready ✓
 playlist / outfit / social    │ gate: none │ ready ✓
 dinner party prep             │ gate: none │ ready ✓
──────────────────────────────┴────────────┴─────────
</code></pre>
<p>The two threads previously stopped at <code>gate: none</code> were the clearest signal. They showed that the gate prompt fix was not theoretical: it changed actual outcomes.</p>
<pre><code>   37.5%     →     72.7%          |     76%
   before          after replay         baseline
</code></pre>
<p>That improvement includes targeted replay of historical suppressed threads. So it is not identical to &quot;natural live traffic recovered by itself.&quot; But it is a strong validation that the main logic was repaired correctly.</p>
<h2>A separate problem emerged: provider reliability</h2>
<p>One thread remained suppressed even after replay. During replay, Anthropic requests in QA were consistently returning HTTP 400, which forced all calls to fail over to OpenAI. That introduced a separate class of failures: banned-phrase check failures, title-grounding failures, and provider-specific output quality issues.</p>
<p>This mattered, but it did not invalidate the earlier fixes. It simply meant the debugging work had split into two different tracks:</p>
<table>
<thead>
<tr>
<th>Track A — Fixed</th>
<th>Track B — Follow-up</th>
</tr>
</thead>
<tbody><tr>
<td>Router over-merge</td>
<td>Anthropic API 400 in QA</td>
</tr>
<tr>
<td>Gate over-suppression</td>
<td>OpenAI fallback compose quality</td>
</tr>
<tr>
<td>Validation observability gap</td>
<td>Finer attribution between validation layers</td>
</tr>
</tbody></table>
<p>That separation is useful. Otherwise engineering work turns into one giant undifferentiated bucket.</p>
<h2>Lessons learned</h2>
<blockquote>
<p><strong>1. Session continuity is not the same thing as topic continuity</strong></p>
<p>Just because messages arrive within ten minutes of each other does not mean they belong to the same semantic unit. That assumption works for some journaling behavior, but not for all conversational behavior.</p>
</blockquote>
<blockquote>
<p><strong>2. Fix upstream granularity problems before downstream quality problems</strong></p>
<p>If routing collapses four opportunities into one, then gate and compose are already operating at a disadvantage. This is why &quot;fix the router first&quot; was the right call.</p>
</blockquote>
<blockquote>
<p><strong>3. Conservative gates can quietly erase the product&#39;s real value</strong></p>
<p>The easiest thing for a classifier or gate to do is say &quot;no.&quot; But products like this are not supposed to preserve only dramatic life events. A lot of value lives in smaller but emotionally real moments.</p>
</blockquote>
<blockquote>
<p><strong>4. Observability is part of product quality</strong></p>
<p>An &quot;unknown&quot; failure reason is not just an ops annoyance. It slows down every future iteration. In LLM systems, diagnosability is not optional.</p>
</blockquote>
<blockquote>
<p><strong>5. Not every discovered issue belongs in the same fix batch</strong></p>
<p>The Anthropic 400 problem was real, but it was not the same as the original root cause. Separating &quot;fixed main issue&quot; from &quot;new independent issue&quot; kept the work focused.</p>
</blockquote>
<h2>Closing thoughts</h2>
<p>What I liked about this case is that it reinforced a pattern I trust more and more in product engineering:</p>
<blockquote>
<p><em>The visible symptom is often not the real unit of failure.</em></p>
</blockquote>
<p>&quot;No new moments&quot; sounded like a generation problem. In reality, it was a combination of routing granularity, product calibration, and observability quality.</p>
<p>And once those were disentangled, the fixes became smaller, clearer, and much more effective.</p>
<p><em>That is usually a good sign you are finally solving the right problem.</em></p>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/glia-thread-debugging">https://dearartist.xyz/blog/glia-thread-debugging</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Closing the Invite Flow Gaps in Glia&apos;s Social Feature</title>
      <link>https://dearartist.xyz/blog/glia-invite-flow-audit</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-invite-flow-audit</guid>
      <pubDate>Wed, 15 Apr 2026 11:30:00 GMT</pubDate>
      <description><![CDATA[Auditing the invite flow end-to-end.]]></description>
      <category>glia</category>
      <category>social</category>
      <category>audit</category>
      <content:encoded><![CDATA[<p><em>Engineering Case Study · 2026-04-15 11:30 · 14 min read · Engineering, Product, iOS, Backend, Systems Thinking, Rollout</em></p>
<p>An end-to-end audit of invite creation, onboarding handoff, state sync, push routing, and rollout readiness.</p>
<h2>TL;DR</h2>
<blockquote>
<p><strong>Summary</strong></p>
<p>I audited the full invite flow of Glia&#39;s social feature across backend and iOS. The backend capabilities mostly existed, but the client had several critical flow breaks. I fixed onboarding handoff, connection state refresh, dead-end push routing, feature flag gating, decline feedback, and funnel logging. After these changes, the main invite path became closed-loop and suitable for gradual rollout.</p>
</blockquote>
<h2>Context</h2>
<p><a href="https://gliahq.com">Glia ↗</a> is a memory-native product. One of its social surfaces lets a user tap <code>Invite to Connect</code> from a people entity card, send an invite link, and allow another user to accept that invite to unlock memory sharing around that person.</p>
<p>This work was not about building a brand new feature from scratch. It was about answering a more important question:</p>
<blockquote>
<p><em>Is this feature actually rollout-ready, or does it only look complete when reading the code?</em></p>
</blockquote>
<p>That distinction matters. In social systems, a feature can &quot;exist&quot; in the sense that endpoints are implemented, views render, and the happy path works in isolation. But users do not experience isolated endpoints. They experience a chain of transitions: from one screen to another, from one app state to another, and from one expectation to another.</p>
<p>I audited both <code>glia-core</code> backend and <code>glia-ios</code> client. The goal was to determine whether the flow was actually complete enough to support gradual rollout.</p>
<h2>Why this audit mattered</h2>
<p>The problem was not that the feature did not exist. The problem was that the user could still fall out of the flow.</p>
<p>That is the kind of issue that is easy to underestimate when looking at a codebase. A team can see invite creation working, see a landing page render, see an accept endpoint return success, and conclude that the feature is basically done. But from a product perspective, that is not enough.</p>
<p>A social invite flow is only real if it stays intact across the edges:</p>
<ul>
<li>a new user installing from an invite link</li>
<li>a returning user re-opening a people profile after acceptance</li>
<li>a push notification that actually lands somewhere useful</li>
<li>a disabled feature that stays hidden instead of surfacing as a broken interaction</li>
</ul>
<blockquote>
<p><em>This was a good example of why endpoint completeness is not the same thing as product completeness.</em></p>
</blockquote>
<h2>The end-to-end flow I audited</h2>
<p>The audit started from the exact user action that matters most: tapping <code>Invite to Connect</code> on a people entity card.</p>
<pre><code>People Card  →  Create Invite  →  Share Link  →  Install / Open  →  Accept Invite  →  Social Feed
(Tap invite)    (Backend API)    (External)     (Onboarding)        (State trans.)    (Connection live)
</code></pre>
<p>At a glance, most of these parts already existed. The backend could create invites. The accept endpoint existed. Push jobs were present. The feed path existed. But when I reconstructed the full chain, several important gaps became obvious.</p>
<h2>What I found</h2>
<p>The audit identified three true P1 breaks and several P2 issues.</p>
<table>
<thead>
<tr>
<th>Finding</th>
<th>Severity</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td>Onboarding handoff broken</td>
<td><strong>P1 / high</strong></td>
<td>New users from invite links lost the pending invite after completing onboarding. They had to manually reopen the link.</td>
</tr>
<tr>
<td>Stale connection state</td>
<td><strong>P1 / high</strong></td>
<td>People card showed &quot;Invite to Connect&quot; even when the connection was already active. Misled the inviter.</td>
</tr>
<tr>
<td>Dead-end push routing</td>
<td><strong>P1 / high</strong></td>
<td><code>new_shared_card</code> notifications opened an empty static page with no content and no useful CTA.</td>
</tr>
</tbody></table>
<h2>The three P1 breaks</h2>
<h3>1. Pending invite was dropped after onboarding for new users</h3>
<p>A new user could install the app from an invite link, complete onboarding, and still lose the pending invite flow. The token had been captured, but onboarding completion did not consume it. That meant the user had to reopen the invite link manually.</p>
<p>This is exactly the kind of bug that makes a feature look healthy in code and broken in reality.</p>
<h3>2. The connection state on the people card did not refresh</h3>
<p>A connection could already be active, but the people profile still showed <code>Invite to Connect</code>. This was not just stale UI. It actively misled the inviter into thinking the invite had not worked.</p>
<p>This was not a backend failure. It was a state synchronization failure on the client.</p>
<h3>3. new_shared_card push notifications led to a dead end</h3>
<p>Tapping the notification opened an empty static detail page with no content and no useful CTA. The same dead-end behavior also existed in Notification Center.</p>
<p>This was a classic example of a route existing without being a good destination.</p>
<h2>The P2 issues</h2>
<table>
<thead>
<tr>
<th>Finding</th>
<th>Severity</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td>Feature flag not respected</td>
<td>P2 / medium</td>
<td>Social invite button still rendered when the backend feature flag was disabled.</td>
</tr>
<tr>
<td>No decline feedback</td>
<td>P2 / medium</td>
<td>Decline action had no success or error feedback. Users could not tell if their action worked.</td>
</tr>
<tr>
<td>Missing env documentation</td>
<td>P2 / medium</td>
<td><code>.env.example</code> was missing social-related environment variables for safe deployment.</td>
</tr>
<tr>
<td>No funnel analytics</td>
<td>P2 / medium</td>
<td>There was no basic invite funnel logging, making it impossible to judge conversion during rollout.</td>
</tr>
</tbody></table>
<p>These were not the highest-severity breaks, but they mattered for rollout quality.</p>
<h2>How I fixed them</h2>
<p>I tried to keep the fixes narrow and explicit instead of redesigning the surface.</p>
<h3>Fix 1: Consume pending invite at onboarding exits</h3>
<p>On iOS, the pending invite token now gets consumed at onboarding completion paths, including the normal completion flow and the skip/test path.</p>
<pre><code class="language-swift">// OnboardingCompleteView.swift
await MainActor.run {
    SocialDeepLinkHandler.shared.consumePendingIfNeeded()
    router.clear()
    router.navigateTo(.chatHome)
}
</code></pre>
<p>The important point was not the exact screen name. It was the guarantee that onboarding could no longer swallow the pending invite state.</p>
<h3>Fix 2: Refresh connection state on people profile appear</h3>
<p><code>PeopleProfileDetailView</code> now refreshes social connections on appear, so the UI reflects the real connection state.</p>
<pre><code class="language-swift">// PeopleProfileDetailView.swift
.onAppear {
    loadProfile()
    loadTimeline()
    Task { await socialService.refreshConnections() }
}
</code></pre>
<p>This was intentionally small. I did not redesign the people card or move the logic into a more elaborate state machine. The goal was simply to ensure that the button reflects the actual backend state instead of stale cached data.</p>
<h3>Fix 3: Gate social entry by backend-driven availability</h3>
<p>The client now derives social feature availability from actual backend responses, instead of relying on build-time assumptions.</p>
<pre><code class="language-swift">// SocialService.swift
// Defaults to true, so the entry can be visible until the first refresh completes.
@Published var isSocialFeatureAvailable: Bool = true

func refreshConnections() async {
    do {
        let response = try await getConnections()
        cachedConnections = response.connections
        isSocialFeatureAvailable = true
    } catch {
        // Only a feature-disabled error hides the entry; other errors keep the cached state.
        if Self.isFeatureDisabledError(error) {
            isSocialFeatureAvailable = false
            cachedConnections = []
        }
    }
}
</code></pre>
<p>When the backend reports that the feature is disabled, the invite button is hidden instead of being shown and failing on tap.</p>
<h3>Fix 4: Route push notifications to socialFeed</h3>
<p>Instead of opening an empty static page, <code>new_shared_card</code> now routes to <code>socialFeed</code> and triggers a feed refresh.</p>
<pre><code class="language-swift">// NotificationRouter.swift
case &quot;new_shared_card&quot;:
    NotificationCenter.default.post(
        name: .socialFeedNeedsRefresh, object: nil
    )
    Router.shared.navigateTo(.socialFeed)
</code></pre>
<p>Both push handling and Notification Center taps converge on the same route. That route already had real content-loading behavior.</p>
<h3>Fix 5: Decline feedback</h3>
<p>Decline now exposes loading and error state, and shows an alert on failure. Users should not have to guess whether their action actually happened.</p>
<h3>Fix 6: Funnel logging</h3>
<p>I added lightweight structured event logging consistent with the project&#39;s existing pattern.</p>
<pre><code class="language-swift">// SocialAnalytics.swift
print(&quot;[event] social_invite_tapped entity_id=\(entityId)&quot;)
print(&quot;[event] social_invite_created entity_id=\(entityId) invite_id=\(response.inviteId)&quot;)
print(&quot;[event] social_invite_accepted&quot;)
print(&quot;[event] social_invite_declined&quot;)
</code></pre>
<p>These are intentionally basic. They are not meant to replace a full analytics pipeline. They are just enough to make the rollout observable.</p>
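<p>To show how far plain log lines can go, here is a minimal sketch that tallies the funnel from captured output. The event names come from the Swift snippet above; the parsing regex and the <code>count_funnel</code> helper are assumptions for illustration, not part of the project.</p>

```python
import re
from collections import Counter

# Event names as printed by the client snippet above, in funnel order.
FUNNEL_STEPS = [
    "social_invite_tapped",
    "social_invite_created",
    "social_invite_accepted",
    "social_invite_declined",
]

# Matches the "[event] name" prefix; trailing key=value pairs are ignored here.
EVENT_RE = re.compile(r"\[event\] (\w+)")


def count_funnel(lines):
    """Count each known funnel event across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group(1) in FUNNEL_STEPS:
            counts[match.group(1)] += 1
    return {step: counts[step] for step in FUNNEL_STEPS}
```

<p>Comparing adjacent counts, for example tapped versus created, is enough to see where the flow loses people during a gradual rollout.</p>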
<h3>Fix 7: Backend env documentation</h3>
<p>The backend <code>.env.example</code> was updated with social-related environment variables and rollout notes.</p>
<pre><code class="language-bash"># .env.example
# Social Connections
GLIA_SOCIAL_ENABLED=0
GLIA_SOCIAL_INVITE_EXPIRY_DAYS=30
GLIA_SOCIAL_INVITE_RATE_LIMIT=20
GLIA_SOCIAL_PUSH_DAILY_LIMIT=3
</code></pre>
<p>That did not change runtime behavior directly, but it reduced rollout risk by making the deployment surface explicit.</p>
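<p>One way a backend could read these variables with the same defaults is sketched below. The variable names and default values are taken from the <code>.env.example</code> above; the <code>SocialSettings</code> dataclass and the <code>env_int</code> helper are hypothetical and only illustrate the pattern.</p>

```python
import os
from dataclasses import dataclass


def env_int(name, default, env=None):
    """Parse an integer variable, falling back to the default when unset or malformed."""
    source = os.environ if env is None else env
    try:
        return int(source.get(name, default))
    except (TypeError, ValueError):
        return default


@dataclass(frozen=True)
class SocialSettings:
    enabled: bool
    invite_expiry_days: int
    invite_rate_limit: int
    push_daily_limit: int


def load_social_settings(env=None):
    """Build settings from a mapping (defaults to the process environment)."""
    source = os.environ if env is None else env
    return SocialSettings(
        enabled=source.get("GLIA_SOCIAL_ENABLED", "0") == "1",
        invite_expiry_days=env_int("GLIA_SOCIAL_INVITE_EXPIRY_DAYS", 30, source),
        invite_rate_limit=env_int("GLIA_SOCIAL_INVITE_RATE_LIMIT", 20, source),
        push_daily_limit=env_int("GLIA_SOCIAL_PUSH_DAILY_LIMIT", 3, source),
    )
```

<p>Passing a plain dictionary instead of the real environment keeps the loader easy to exercise in tests.</p>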
<h2>Why most of the fixes were on iOS, not backend</h2>
<p>This is an important distinction.</p>
<p>The reason most fixes landed on iOS was not that the backend was ignored. The backend was part of the audit from the beginning. In fact, one of the useful outcomes of the audit was that it clarified what the backend already had:</p>
<ul>
<li>invite creation</li>
<li>entity share state transitions</li>
<li>connection acceptance</li>
<li>push jobs</li>
<li>social feed read path</li>
</ul>
<p>Those core capabilities mostly existed.</p>
<p>The bigger gaps were in how the client consumed and surfaced them:</p>
<ul>
<li>onboarding was not preserving the flow</li>
<li>people profile state was not staying in sync</li>
<li>feature gating was not reflected in entry visibility</li>
<li>notifications did not route somewhere useful</li>
</ul>
<blockquote>
<p><em>The backend had the capabilities, but the client was still failing to turn them into a coherent user experience.</em></p>
</blockquote>
<p>That is why the right fix was not &quot;rewrite the backend.&quot; It was to close the gaps where user trust and flow continuity were actually breaking.</p>
<h2>What I intentionally did not do</h2>
<p>There are a few things I explicitly avoided.</p>
<ul>
<li>I did not redesign the existing UI. The issue was not that the visual design was wrong.</li>
<li>I did not introduce a new backend capability endpoint just to represent feature availability.</li>
<li>I did not overcomplicate the rollout path. The fixes were intentionally narrow.</li>
</ul>
<blockquote>
<p><em>This work was less about building a feature, and more about making an existing feature trustworthy.</em></p>
</blockquote>
<h2>Before vs After</h2>
<p>Before the audit, the feature looked mostly complete on paper. But from a user perspective, the flow still had major holes.</p>
<table>
<thead>
<tr>
<th>✗ Before</th>
<th>✓ After</th>
</tr>
</thead>
<tbody><tr>
<td>Onboarding could swallow a pending invite</td>
<td>Onboarding exits now continue the pending invite flow</td>
</tr>
<tr>
<td>People profile could show the wrong connection state</td>
<td>People profile refreshes and reflects actual connection state</td>
</tr>
<tr>
<td>Push and Notification Center could route to an empty destination</td>
<td>Push and Notification Center route to <code>socialFeed</code></td>
</tr>
<tr>
<td>A disabled feature could still expose a broken button</td>
<td>Feature-disabled state hides the social entry</td>
</tr>
<tr>
<td>Decline had no feedback</td>
<td>Decline provides loading and error feedback</td>
</tr>
<tr>
<td>Rollout had almost no funnel visibility</td>
<td>Basic invite funnel is observable</td>
</tr>
</tbody></table>
<p>The main invite path became a real closed loop: <code>create invite → share → accept → see feed</code>.</p>
<h2>What made the feature ready for gradual rollout</h2>
<p>Before these changes, I would not have considered the feature rollout-ready. The problem was not just polish. There were genuine P1 breaks in the primary path.</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>What changed</th>
</tr>
</thead>
<tbody><tr>
<td>Entry gating</td>
<td>Client respects backend feature availability instead of exposing a button that fails when tapped.</td>
</tr>
<tr>
<td>State correctness</td>
<td>People card converges toward the real connection state instead of misleading the inviter.</td>
</tr>
<tr>
<td>Dead-end removal</td>
<td>Push and Notification Center now route to a useful destination.</td>
</tr>
<tr>
<td>Flow continuity</td>
<td>New-user onboarding no longer drops the invite chain.</td>
</tr>
<tr>
<td>Observability</td>
<td>Enough funnel logging to see where conversion drops during rollout.</td>
</tr>
<tr>
<td>Deployment safety</td>
<td>Social env variables documented in <code>.env.example</code> for safer rollout.</td>
</tr>
</tbody></table>
<blockquote>
<p><em>Rollout readiness depends on closed loops, not isolated endpoints.</em></p>
</blockquote>
<h2>Remaining known issues</h2>
<p>A few issues remain, but they do not block gradual rollout.</p>
<p>The incoming direction semantics inside <code>connectionForEntity</code> are still imperfect. In certain two-sided connection scenarios, the button state may not fully reflect the relationship the way it should.</p>
<p>There is also a short visibility window on first load before the client learns that the backend social feature is disabled. I accepted that tradeoff for now because it keeps the default experience more natural while still preventing the broken tap-to-error behavior.</p>
<p>These are follow-up issues, not reasons to block rollout.</p>
<h2>Final takeaways</h2>
<p>This audit ended up reinforcing a lesson I keep seeing in product engineering:</p>
<blockquote>
<p><em>Social systems fail at edges, not just in the happy path.</em></p>
</blockquote>
<p>A feature can look complete in code while still being untrustworthy in practice. The gaps are often not dramatic failures in core business logic. They are routing gaps, stale state, dead-end destinations, missing entry gating, and silent feedback failures.</p>
<p>Those are exactly the kinds of issues that damage trust because they make the product feel inconsistent, even when much of the system technically works.</p>
<p>The most valuable work here was not adding more surface area. It was closing the distance between system capability and user-complete flow.</p>
<blockquote>
<p><em>The job was not to make the feature bigger. It was to make the existing feature trustworthy.</em></p>
</blockquote>
<hr />
<p><em>Originally published at <a href="https://dearartist.xyz/blog/glia-invite-flow-audit">https://dearartist.xyz/blog/glia-invite-flow-audit</a>.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>From Audit to Deployment: Fixing Glia&apos;s Social Invite Flow End-to-End</title>
      <link>https://dearartist.xyz/blog/glia-invite-flow</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/glia-invite-flow</guid>
      <pubDate>Tue, 14 Apr 2026 18:24:00 GMT</pubDate>
      <description><![CDATA[Shipping the fix from audit to deployment.]]></description>
      <category>glia</category>
      <category>social</category>
      <category>deployment</category>
      <content:encoded><![CDATA[<p><em>Engineering Case Study · 2026-04-14 18:24 · 10 min read · Engineering, Product, Backend, iOS, QA, Deployment, Systems Thinking</em></p>
<p>A real product-engineering note on auditing a user-facing invite flow, tightening the state machine, aligning backend behavior, adding dynamic OG images, deploying to QA, and verifying the result.</p>
<blockquote>
<p><strong>Key Takeaways</strong></p>
<ul>
<li>The hardest part was not writing code. It was defining scope correctly.</li>
<li>A user-facing flow is not &quot;done&quot; if backend, client, deployment, and verification are not aligned.</li>
<li>Feature flags, state transitions, and preview surfaces are easy places for product truth and technical behavior to drift apart.</li>
<li>Deployment and smoke verification are part of the work, not an afterthought.</li>
<li>Documentation matters because otherwise the next person has to reverse-engineer intent from code.</li>
</ul>
</blockquote>
<h2>Context</h2>
<p>I recently spent time auditing <a href="https://gliahq.com">Glia ↗</a>&#39;s social invite flow around people/entity cards.</p>
<p>At first glance, the problem looked narrow: when a user taps invite from a people card, does the flow actually work?</p>
<p>But once I started tracing it properly, it became clear that this was not really a question about one button or one API endpoint. It was a question about whether the whole user-facing chain actually closed: the entry point, the backend behavior, the state transitions, the landing page, the preview surface, the deployment setup, the runtime environment, and the final verification.</p>
<p>That distinction matters. A lot of systems look complete when you inspect them file by file. Far fewer are actually complete when you follow the path a real user would take.</p>
<h2>What Glia social sharing actually is</h2>
<p>Glia&#39;s social sharing is not really a generic public share feature.</p>
<p>It is more specific than that. A user has memories inside the system, and some of those memories get organized around people entities. From there, the product can open a share/invite flow tied to a specific person. In other words, the user is not just sharing a link. They are initiating a relationship-aware flow around memories connected to a particular person.</p>
<p>That matters because the object being shared is more sensitive than a normal URL. It has identity, context, preview behavior, landing behavior, accept/decline semantics, and downstream effects on connection state, feed surfaces, and notifications.</p>
<pre><code>People Card  →  Invite Creation  →  Share / Landing  →  Accept / Decline  →  Connection
(Entry point)   (Backend API)       (Preview + OG)      (State transition)    (Feed + Notify)
</code></pre>
<p>So when I looked at this system, I was not asking whether a link could technically be generated. I was asking whether the whole product truth of that flow actually held together.</p>
<h2>Why I looked into this flow</h2>
<p>I was not interested in doing a generic code review.</p>
<p>The real question was whether the people-card invite experience was actually closed end to end. If a user starts from a people card, taps invite, shares something outward, and another person receives it, does the system really support that whole sequence in a coherent way?</p>
<p>That means checking more than backend correctness. It means checking whether the iOS entry point, backend semantics, preview behavior, landing page, connection state, and deployment reality all describe the same product.</p>
<blockquote>
<p><em>That is the kind of work I care about most. Not just &quot;does the code exist,&quot; but &quot;does the system tell the truth.&quot;</em></p>
</blockquote>
<h2>What I was actually auditing</h2>
<p>The object of the audit was not social sharing in a broad sense.</p>
<p>It was a narrower and more specific flow:</p>
<p>A user has a people entity card. They trigger an invite-related action from that card. That action eventually creates a share/invite flow tied to that person. The invite can then be opened, previewed, accepted, declined, and turned into a connection with downstream effects on notifications and social surfaces.</p>
<p>So the real question was: What exactly happens after the user taps invite on a people card, and is that experience actually closed end to end?</p>
<p>That required looking across multiple layers:</p>
<ul>
<li>backend invite creation and validation</li>
<li>invite landing page behavior</li>
<li>feature flag behavior</li>
<li>accept/decline state transitions</li>
<li>preview content and metadata</li>
<li>deployment assumptions</li>
<li>QA runtime behavior</li>
<li>documentation quality</li>
</ul>
<h2>What the audit quickly revealed</h2>
<p>What I found was not one catastrophic bug. It was something more common in real product systems: the flow mostly existed, but several important layers were not fully aligned.</p>
<p>The backend had the right general shape, but there were places where runtime behavior and product semantics could drift apart. The invite acceptance path needed a stronger transition model. The preview layer turned out to matter more than it first looked. And some pieces of rollout truth — especially around deployment and environment behavior — were not things I wanted to leave implicit.</p>
<p>That changed the nature of the work. This was not about adding a feature from scratch. It was about making an existing flow honest, stable, and actually deployable.</p>
<h2>The biggest risks I found</h2>
<p>The most important problems were not cosmetic. They lived in the places that usually create the biggest mismatch between product expectations and real behavior.</p>
<table>
<thead>
<tr>
<th>Finding</th>
<th>Severity</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td>Feature flag drift</td>
<td>Medium</td>
<td>Module-level constants vs runtime reads. Same flag, different behavior depending on code path and process timing.</td>
</tr>
<tr>
<td>Non-atomic acceptance</td>
<td><strong>High</strong></td>
<td>Read-check-write pattern on invite state. TOCTOU-vulnerable. Two concurrent accepts could both succeed.</td>
</tr>
<tr>
<td>Preview surface gaps</td>
<td>Medium</td>
<td>Missing OG images, mixed error semantics. The link preview did not represent the actual product state.</td>
</tr>
</tbody></table>
<h3>1. Feature flag behavior was not fully aligned</h3>
<p>The API layer and the rest of the system were not all reading the social feature flag the same way.</p>
<p>Some paths read the flag dynamically at runtime. Others captured it as a module-level constant. That meant a feature toggle could produce different behavior depending on which path was executed and when the process started.</p>
<pre><code class="language-python"># config.py — stale module-level constant
import os

# This value is captured at import time — never re-evaluated
SOCIAL_ENABLED = os.getenv(&quot;GLIA_SOCIAL_ENABLED&quot;, &quot;0&quot;) == &quot;1&quot;

# Every call site that reads SOCIAL_ENABLED sees the value
# from when the module was first imported, not the current state.
</code></pre>
<pre><code class="language-python"># config.py — runtime check (fix)
import os

def is_social_enabled() -&gt; bool:
    &quot;&quot;&quot;Read the flag at call time, not import time.&quot;&quot;&quot;
    return os.getenv(&quot;GLIA_SOCIAL_ENABLED&quot;, &quot;0&quot;) == &quot;1&quot;
</code></pre>
<p>That is the kind of issue that looks small until you try to operate the system in QA or production.</p>
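<p>The difference is easy to reproduce in isolation. The sketch below uses the same variable name as above and flips the flag inside one process; it only demonstrates the in-process behavior and says nothing about how the flag is actually changed in QA or production.</p>

```python
import os

# Start with the feature off, then capture the flag the way a module-level constant would.
os.environ["GLIA_SOCIAL_ENABLED"] = "0"
SOCIAL_ENABLED = os.getenv("GLIA_SOCIAL_ENABLED", "0") == "1"


def is_social_enabled():
    """Read the flag at call time."""
    return os.getenv("GLIA_SOCIAL_ENABLED", "0") == "1"


# The flag is flipped after the constant was captured.
os.environ["GLIA_SOCIAL_ENABLED"] = "1"

stale = SOCIAL_ENABLED        # still reflects the value at capture time
fresh = is_social_enabled()   # reflects the current environment
```

<p>Here <code>stale</code> and <code>fresh</code> disagree, which is the drift described above: two code paths, one flag, two answers.</p>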
<h3>2. The invite acceptance path had a real state transition problem</h3>
<p>The original accept flow had a classic read-check-write shape.</p>
<p>That means two near-simultaneous requests could both pass the pending check before one overwrote the other. In other words, the invite state machine was vulnerable to a TOCTOU-style race.</p>
<p>Even if the practical probability was low, the semantics were wrong. For a user-facing invite flow, that matters.</p>
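<p>The bad ordering can be replayed by hand without any threads. This is a minimal sketch with a plain dictionary standing in for the invite row; the helper names are hypothetical and the interleaving is forced deliberately.</p>

```python
def is_pending(invite):
    """Step 1 of read-check-write: read the row and check its status."""
    return invite["status"] == "pending"


def mark_accepted(invite, user_id):
    """Step 2: write the new state, trusting the earlier check."""
    invite["status"] = "active"
    invite["recipient_user_id"] = user_id


def simulate_interleaved_accepts():
    """Replay the bad ordering: both checks run before either write."""
    invite = {"status": "pending", "recipient_user_id": None}
    first_ok = is_pending(invite)    # request A checks
    second_ok = is_pending(invite)   # request B checks before A writes
    if first_ok:
        mark_accepted(invite, "user_a")
    if second_ok:
        mark_accepted(invite, "user_b")  # silently overwrites request A
    return first_ok, second_ok, invite["recipient_user_id"]
```

<p>Both requests believe they accepted the invite, and the row ends up owned by whichever write landed last.</p>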
<h3>3. The preview surface and actual landing behavior needed clearer truth</h3>
<p>Invite flows do not begin when the recipient opens the app. They begin when the link preview renders in a message thread.</p>
<p>That means the landing page, OG metadata, Twitter metadata, and preview image are not just presentation details. They are part of the functional experience.</p>
<p>If those pieces are broken or missing, the system is technically working but behaviorally incomplete.</p>
<h2>Fixes shipped</h2>
<p>I approached the fixes with one principle in mind: minimum correct scope.</p>
<p>Not every issue should become a refactor. Not every rough edge is a P0. The goal was to fix what affected correctness, product truth, and deployability.</p>
<h3>Feature flag consistency</h3>
<p>I aligned the social flag behavior so the system no longer relied on stale module-level values in critical paths.</p>
<p>That made the runtime behavior more honest and more predictable. If the flag changes, the relevant logic now reflects that at call time rather than depending on process start timing.</p>
<h3>Atomic invite acceptance</h3>
<p>The most important backend correctness change was making <code>accept_invite</code> atomic.</p>
<pre><code class="language-python"># services/social.py — atomic conditional update
from sqlalchemy import update  # assumed: the snippet uses SQLAlchemy-style update()

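from sqlalchemy import update

# The WHERE clause is the check and the UPDATE is the write, so both happen
# in one statement. `now` is assumed to be a single timestamp taken earlier
# in the request.
result = db.execute(
    update(EntityShare)
    .where(
        EntityShare.invite_token == token,
        EntityShare.status == &quot;pending&quot;,  # still unclaimed
        EntityShare.expires_at &gt; now,  # not expired
        EntityShare.owner_user_id != recipient_user_id,  # no self-accept
    )
    .values(
        status=&quot;active&quot;,
        recipient_user_id=recipient_user_id,
        accepted_at=now,
    )
)

# Zero affected rows: a precondition failed, or another request got there first.
if result.rowcount == 0:
    raise InvalidInviteError(&quot;Token invalid, expired, or already used&quot;)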
</code></pre>
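<p>Instead of a read-check-write flow, the logic is now a single conditional update: the <code>WHERE</code> clause carries every precondition, and <code>rowcount</code> reports whether this request was the one that changed the row. That closes the race window and makes the state transition behave like an actual state transition rather than a sequence of loosely related checks.</p>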
<p>That was one of the highest-value fixes in the whole effort because it tightened the core invariant of the invite lifecycle.</p>
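<p>The same idea can be tried end to end with nothing but the standard library. This is a simplified sketch, not the production code: it uses <code>sqlite3</code> instead of SQLAlchemy, a hypothetical <code>entity_share</code> table, and it leaves out the expiry condition for brevity.</p>
<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_share ("
    "invite_token TEXT PRIMARY KEY, status TEXT, "
    "owner_user_id TEXT, recipient_user_id TEXT)"
)
conn.execute(
    "INSERT INTO entity_share VALUES ('tok-1', 'pending', 'owner', NULL)"
)


def accept_invite(token: str, recipient_user_id: str) -> bool:
    """Return True only for the request that flips pending to active."""
    cursor = conn.execute(
        "UPDATE entity_share "
        "SET status = 'active', recipient_user_id = ? "
        "WHERE invite_token = ? AND status = 'pending' "
        "AND owner_user_id != ?",
        (recipient_user_id, token, recipient_user_id),
    )
    conn.commit()
    # rowcount is 1 if this call changed the row, 0 if a condition failed.
    return cursor.rowcount == 1


first = accept_invite("tok-1", "user_a")
second = accept_invite("tok-1", "user_b")
winner = conn.execute(
    "SELECT recipient_user_id FROM entity_share WHERE invite_token = 'tok-1'"
).fetchone()[0]

print(first, second, winner)  # True False user_a
</code></pre>
<p>The second caller gets a clean "no" instead of overwriting the first, and the owner can never accept their own invite.</p>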
<h3>Invite landing behavior and preview cleanup</h3>
<p>The invite page behavior was tightened so it no longer mixed unrelated semantics.</p>
<p>One of the things I wanted to avoid was pretending that feature disabled and invite expired were the same condition. They are not. The system should not tell a user a token has expired when the real issue is that the feature is turned off.</p>
<p>That kind of semantic precision matters more than it looks. It determines whether the product is understandable under failure conditions.</p>
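<p>As a minimal sketch of that distinction, the landing page can resolve one explicit state before it renders anything. The <code>LandingState</code> enum, the <code>Invite</code> dataclass, and <code>resolve_landing_state</code> are hypothetical names for illustration, not the actual route code:</p>
<pre><code class="language-python">from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class LandingState(Enum):
    FEATURE_DISABLED = "feature_disabled"
    NOT_FOUND = "not_found"
    ALREADY_USED = "already_used"
    EXPIRED = "expired"
    PENDING = "pending"


@dataclass
class Invite:
    status: str
    expires_at: datetime


def resolve_landing_state(
    feature_enabled: bool, invite: Optional[Invite], now: datetime
) -> LandingState:
    """Map each failure to its own state instead of a generic 'expired'."""
    if not feature_enabled:
        return LandingState.FEATURE_DISABLED
    if invite is None:
        return LandingState.NOT_FOUND
    if invite.status != "pending":
        return LandingState.ALREADY_USED
    if now >= invite.expires_at:
        return LandingState.EXPIRED
    return LandingState.PENDING


now = datetime(2026, 4, 1, tzinfo=timezone.utc)
live = Invite(status="pending", expires_at=now + timedelta(days=1))

print(resolve_landing_state(False, live, now).value)  # feature_disabled
print(resolve_landing_state(True, live, now).value)   # pending
</code></pre>
<p>Checking the flag first means a disabled feature is never reported as a problem with the token.</p>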
<h3>Basic observability</h3>
<p>I also added the minimum level of server-side event logging needed to reason about the funnel.</p>
<p>Not a full analytics system. Not attribution. Just enough structured observability to answer basic questions like:</p>
<ul>
<li>was an invite created</li>
<li>was it viewed</li>
<li>was it accepted</li>
<li>was it declined</li>
</ul>
<p>That is the difference between operating blind and having a minimally useful signal.</p>
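<p>A sketch of what that minimum could look like with the standard <code>logging</code> module. The event names, the <code>share_id</code> field, and the logger name are assumptions for illustration; the sketch logs a share id rather than the invite token so the secret stays out of the logs.</p>
<pre><code class="language-python">import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("social.invite")

# One event per funnel step listed above.
INVITE_EVENTS = {
    "invite_created",
    "invite_viewed",
    "invite_accepted",
    "invite_declined",
}


def build_invite_event(event: str, share_id: str, now: datetime) -> dict:
    """Build one structured funnel event, rejecting unknown event names."""
    if event not in INVITE_EVENTS:
        raise ValueError(f"unknown invite event: {event}")
    return {"event": event, "share_id": share_id, "ts": now.isoformat()}


def log_invite_event(event: str, share_id: str) -> dict:
    """Emit the event as a single JSON log line and return the payload."""
    payload = build_invite_event(event, share_id, datetime.now(timezone.utc))
    logger.info(json.dumps(payload, sort_keys=True))
    return payload


log_invite_event("invite_created", "share-123")
</code></pre>
<p>One JSON object per line is enough to count each step of the funnel with ordinary log tooling.</p>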
<h2>Why atomic state transitions mattered</h2>
<p>This was one of the clearest examples of engineering work that is easy to underestimate.</p>
<p>From the outside, &quot;accept invite&quot; sounds trivial. But in real systems, the meaning of acceptance is only as strong as the transition model underneath it.</p>
<p>If a flow can be accepted twice under race conditions, or if two requests can compete for ownership of the same invitation state, then the business object is not really stable.</p>
<blockquote>
<p><em>I care a lot about this class of issue because users do not experience systems as source code. They experience them as truth claims. When a product says &quot;this invitation was accepted,&quot; that statement should be backed by a state transition that is actually trustworthy.</em></p>
</blockquote>
<h2>Dynamic OG images</h2>
<p>One of the more interesting parts of the work was the preview image layer.</p>
<p>At first glance, a missing OG image looks like a minor presentation bug. But that is not really what it is.</p>
<p>For link-based invite flows, the preview image is part of the product surface. It changes the way the link looks in iMessage, WhatsApp, Telegram, and Twitter/X. It changes how personal the invitation feels. It changes click behavior.</p>
<p>I ended up treating this not as a static asset cleanup problem, but as a better product opportunity.</p>
<pre><code class="language-python"># routes/og.py — dynamic OG image endpoint
@router.get(&quot;/og/invite/{token}.png&quot;)
async def social_invite_og_image(token: str):
    share = get_entity_share_by_token(token)
    if not share:
        return generate_fallback_og_image()

    return generate_og_image(
        inviter_name=share.owner.display_name,
        entity_name=share.entity.name,
        width=1200,
        height=630,
    )
</code></pre>
<pre><code class="language-html">&lt;!-- invite.html — dynamic meta tags --&gt;
&lt;meta property=&quot;og:image&quot;
      content=&quot;https://api.glia.app/og/invite/{{ token }}.png&quot; /&gt;
&lt;meta property=&quot;og:image:width&quot; content=&quot;1200&quot; /&gt;
&lt;meta property=&quot;og:image:height&quot; content=&quot;630&quot; /&gt;
&lt;meta name=&quot;twitter:card&quot; content=&quot;summary_large_image&quot; /&gt;
&lt;meta name=&quot;twitter:image&quot;
      content=&quot;https://api.glia.app/og/invite/{{ token }}.png&quot; /&gt;
</code></pre>
<p>Instead of relying on a missing or generic static image, I added a dynamic OG image route that generates a 1200x630 PNG for each invite.</p>
<p>For valid invites, the image is personalized using the inviter name and entity name.</p>
<p>For invalid or expired invites, the system returns a safe fallback image rather than throwing an error.</p>
<p>That moved the preview layer from a broken asset to a real product capability.</p>
<h2>Deployment and verification</h2>
<p>This part matters just as much as the code.</p>
<p>The work was only meaningful once it was:</p>
<ul>
<li>committed cleanly</li>
<li>pushed to both personal and organization remotes</li>
<li>confirmed consistent across local and remote heads</li>
<li>deployed to QA</li>
<li>rebuilt correctly</li>
<li>verified through smoke checks</li>
</ul>
<pre><code class="language-bash"># Push to remotes
$ git push origin main
To github.com:glia-app/glia-backend.git
   a3f1e2d..6dcd36f  main -&gt; main

$ git push org main
Everything up-to-date
</code></pre>
<p>The rebuild detail mattered because I had introduced font dependencies for OG image generation. A plain restart would not have been enough. The container image had to be rebuilt so the runtime environment actually matched the code assumptions.</p>
<pre><code class="language-bash"># Rebuild and deploy to QA
$ docker compose -f docker-compose.yml -f docker-compose.override.yml \
  -f docker-compose.qa.yml up -d --build api worker beat
=&gt; [internal] load build context
=&gt; [stage-1 4/8] RUN apt-get install -y fonts-jetbrains-mono
=&gt; exporting to image
Container glia-api-1     Started
Container glia-worker-1  Started
Container glia-beat-1    Started
</code></pre>
<p>After deployment, I verified:</p>
<pre><code class="language-bash"># Smoke verification
$ curl -s http://localhost:8000/health | jq .status
&quot;ok&quot;

# Invite landing page renders
$ curl -s http://localhost:8000/invite/&lt;token&gt; -o /dev/null -w &#39;%{http_code}&#39;
200

$ curl -s http://localhost:8000/og/invite/&lt;token&gt;.png -o /dev/null -w &#39;%{http_code}&#39;
200

$ pytest tests/social/ -q
62 passed in 4.31s
</code></pre>
<pre><code class="language-bash"># HEAD consistency check
$ echo &quot;local HEAD     = $(git rev-parse --short HEAD)&quot;
local HEAD     = 6dcd36f
$ echo &quot;personal HEAD  = $(git ls-remote personal main | cut -c1-7)&quot;
personal HEAD  = 6dcd36f
$ echo &quot;org HEAD       = $(git ls-remote org main | cut -c1-7)&quot;
org HEAD       = 6dcd36f
</code></pre>
<blockquote>
<p><em>That was the point where the work felt complete. Not when the code compiled. Not when tests passed locally. When the deployed system actually behaved the way the product claimed it behaved.</em></p>
</blockquote>
<h2>What I intentionally left out</h2>
<p>One of the most important parts of this kind of work is deciding what not to do.</p>
<p>I did not try to turn this into a full analytics system.</p>
<p>I did not redesign the broader connection model.</p>
<p>I did not expand the share model into a more generalized multi-recipient system.</p>
<p>I did not try to solve unrelated entity or story problems under the excuse of &quot;while I&#39;m here.&quot;</p>
<p>That discipline matters. A lot of product-engineering work goes off the rails because the initial problem is real, but the response is too expansive.</p>
<h2>What this revealed about product-engineering work</h2>
<p>This work reinforced something I care about a lot:</p>
<p>Good engineering is not just implementation quality. It is scope judgment.</p>
<p>The useful part was not simply finding bugs. It was separating:</p>
<ul>
<li>correctness problems</li>
<li>rollout problems</li>
<li>observability gaps</li>
<li>product-semantic questions</li>
<li>future design opportunities</li>
</ul>
<p>Those are not the same class of problem, and treating them as if they were leads to messy priorities and shallow fixes.</p>
<p>It also reinforced that product truth lives across boundaries.</p>
<p>A flow is not closed just because the backend has an endpoint. It is not closed just because the iOS app can make the request. It is not closed just because QA can load a page.</p>
<blockquote>
<p><em>It is closed when the system&#39;s behavior is coherent from user intent to deployed reality.</em></p>
</blockquote>
<h2>What I fixed</h2>
<ul>
<li>feature flag consistency in social invite paths</li>
<li>atomic invite acceptance</li>
<li>cleaner invite landing behavior</li>
<li>minimum observability for invite funnel actions</li>
<li>OG and Twitter metadata completion</li>
<li>dynamic OG image generation</li>
<li>QA deployment and runtime verification</li>
<li>documentation of final state</li>
</ul>
<h2>What remained intentionally out of scope</h2>
<ul>
<li>full analytics / attribution</li>
<li>broader social model redesign</li>
<li>generalized multi-recipient invite semantics</li>
<li>unrelated entity/story refactors</li>
<li>expansion beyond the current product definition</li>
</ul>
<h2>What I learned</h2>
<p>The part of engineering work I trust most is the part that survives deployment.</p>
<p>I like work that can be described clearly after the fact: what was broken, what was actually fixed, what was left out on purpose, and what the system now truthfully does.</p>
<p>That is what makes a code change feel real.</p>
<blockquote>
<p><em>This was a good reminder that some of the most valuable engineering work is not building something from scratch. It is making an already-existing system honest, closed, and operable.</em></p>
</blockquote>
<hr />
<p><em>Originally published at: <a href="https://dearartist.xyz/blog/glia-invite-flow">https://dearartist.xyz/blog/glia-invite-flow</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>My Starship Terminal Setup for macOS</title>
      <link>https://dearartist.xyz/blog/starship-setup</link>
      <guid isPermaLink="true">https://dearartist.xyz/blog/starship-setup</guid>
      <pubDate>Tue, 14 Apr 2026 09:30:00 GMT</pubDate>
      <description><![CDATA[My personal Starship prompt config.]]></description>
      <category>terminal</category>
      <category>macos</category>
      <category>developer-tools</category>
      <content:encoded><![CDATA[<p>A clean and practical terminal prompt setup using <a href="https://starship.rs">Starship</a> and JetBrainsMono Nerd Font.</p>
<p>I spend more time in my terminal than I&#39;d like to admit, so at some point I decided it should at least look the way I want it to. Not flashy. Not themed within an inch of its life. Just clean enough that I actually enjoy opening it, and informative enough that I don&#39;t have to think about where I am or what branch I&#39;m on.</p>
<p>This is that setup. It&#39;s small, it&#39;s opinionated, and it works well for my day-to-day. I published the config on GitHub in case anyone wants to start from the same place.</p>
<h2>Preview</h2>
<p>Below is what the prompt actually looks like in practice — a powerline-style segmented prompt with user, directory, git branch + dirty flag, Python version, and a timestamp. Each segment uses a Catppuccin-style color and the seamless arrow transition typical of Starship.</p>
<pre><code class="language-text">Terminal — zsh
─────────────────────────────────────────────
 yuanh  ~  ❯ cd ~/code/starship-config
 yuanh  ~/code/starship-config  ❯ ls
README.md    screenshots    starship.toml
 yuanh  ~/code/starship-config  ❯ python3 --version
Python 3.11.7
 yuanh  ~/code/starship-config  ❯ cd ~/code/glia-core
 yuanh  …/glia-core   main $!?  🐍 v3.11.7  13:41 ❯ git status
On branch main
Changes not staged for commit:
    modified: .env.dev
    modified: .env.qa
 yuanh  …/glia-core   main $!?  🐍 v3.11.7  13:41 ❯ _
</code></pre>
<p><img src="https://dearartist.xyz/substack-visuals/starship-setup/starship-setup-powerline-prompt-card.png" alt="Catppuccin Mocha Powerline prompt with five colored segments: user, directory, git, python, time" /></p>
<p><em>Powerline prompt: user · directory · git · python · time.</em></p>
<h2>The Stack</h2>
<ul>
<li><strong>macOS Terminal</strong> — The native terminal app. Nothing fancy, nothing extra.</li>
<li><strong>zsh</strong> — Default shell on macOS. Fast, extensible, well-supported.</li>
<li><strong>Starship</strong> — A minimal, fast, cross-shell prompt written in Rust.</li>
<li><strong>JetBrainsMono Nerd Font Mono</strong> — Monospace font with ligatures and icon glyphs baked in.</li>
</ul>
<h2>What the Prompt Shows</h2>
<ul>
<li>Current user</li>
<li>Current directory</li>
<li>Git branch and status</li>
<li>Python version</li>
<li>Timestamp</li>
</ul>
<h2>How to Use It</h2>
<p><strong>1. Install Starship</strong></p>
<pre><code class="language-bash">curl -sS https://starship.rs/install.sh | sh
</code></pre>
<p><strong>2. Enable it in your shell</strong></p>
<pre><code class="language-bash"># Add to ~/.zshrc
eval &quot;$(starship init zsh)&quot;
</code></pre>
<p><strong>3. Install the font</strong></p>
<pre><code class="language-text">Download JetBrainsMono Nerd Font from nerdfonts.com
Set it as your terminal font.
</code></pre>
<p><strong>4. Copy the config</strong></p>
<pre><code class="language-bash"># Create the config directory if it does not exist yet
mkdir -p ~/.config
cp starship.toml ~/.config/starship.toml
</code></pre>
<h2>GitHub</h2>
<p><a href="https://github.com/yuannh/starship-config">View on GitHub</a></p>
<p>Clone it, tweak it, make it yours.</p>
<hr />
<p><em>Originally published at: <a href="https://dearartist.xyz/blog/starship-setup">https://dearartist.xyz/blog/starship-setup</a></em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
