<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Breath First Search]]></title><description><![CDATA[Breath First Search]]></description><link>https://sanjitsaluja.com</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 14:56:05 GMT</lastBuildDate><atom:link href="https://sanjitsaluja.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[I Built a Production MCP Server in 10 Hours. The Real Lever Was Context Engineering.]]></title><description><![CDATA[A few weeks ago I was on call. Busy week — a fair number of pages, a couple of incidents, the usual pressure of being the person who has to figure out what's broken right now. My debugging loop was Cu]]></description><link>https://sanjitsaluja.com/i-built-a-production-mcp-server-in-10-hours-the-real-lever-was-context-engineering</link><guid isPermaLink="true">https://sanjitsaluja.com/i-built-a-production-mcp-server-in-10-hours-the-real-lever-was-context-engineering</guid><dc:creator><![CDATA[Sanjit Saluja]]></dc:creator><pubDate>Mon, 16 Mar 2026 23:10:04 GMT</pubDate><content:encoded><![CDATA[<p>A few weeks ago I was on call. Busy week — a fair number of pages, a couple of incidents, the usual pressure of being the person who has to figure out what's broken right now. My debugging loop was Cursor → TablePlus → Cursor → TablePlus on repeat. Write a query, copy it, paste it into TablePlus, run it, copy the results, paste them back into Cursor so the AI could reason about them, get a follow-up query, copy that, switch back — repeat.</p>
<p>I was spending more time on clipboard management than on actually understanding the problem.</p>
<p>By day three, I started building an MCP server on the side: a tool that gives AI agents direct, read-only access to our production Postgres replicas. Cloudflare tunnel auth, IAM token management, PII scrubbing, four databases, audit logging. I worked on it an hour or two a day between pages. By the end of the week it was in final review: a small set of audited, replica-only query tools with PII protection that let Cursor inspect production data directly — no manual token generation, no copy-pasting between apps. About 10 hours of total effort — and the reason it went that fast wasn't the model. It was that I stopped treating context as chat history and started treating it as a system to design.</p>
<hr />
<h2>Context Engineering</h2>
<p>For a while, my default approach to building with AI was a single long conversation. Describe the feature, iterate, keep going. It works for small things. But once a feature stops fitting in one sitting, things go sideways — the model forgets earlier decisions, contradicts itself, confidently breaks things I already fixed. Past a certain thread length, recent information is just louder than old information.</p>
<p>By context engineering, I don't mean writing clever prompts. I mean deliberately controlling three things: what the model sees, what it doesn't see, and what state lives outside the conversation. It's closer to interface design than prompt engineering.</p>
<p>Here's roughly how I partition it:</p>
<table>
<thead>
<tr>
<th>Step</th>
<th>What I give it</th>
<th>What I leave out</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>Spec</td>
<td>My idea + one question at a time</td>
<td>Full design upfront</td>
<td>I want behavior before implementation details</td>
</tr>
<tr>
<td>Task breakdown</td>
<td>The complete spec</td>
<td>The codebase</td>
<td>Otherwise it reverse-engineers a plan from the current codebase instead of the intended system</td>
</tr>
<tr>
<td>Execution</td>
<td>Spec + current task + what's done so far</td>
<td>All other tasks</td>
<td>Prevents speculative coding and keeps changes local</td>
</tr>
</tbody></table>
<hr />
<h2>How This Played Out on the MCP Server</h2>
<h3>Step 1: Spec Interactively</h3>
<p>I opened a conversation with a reasoning model and gave it one instruction:</p>
<p><em>Spec Development Prompt</em></p>
<pre><code>We're going to develop a detailed spec together in three phases.

## Phase 1: Clarify requirements and constraints

Goal: Build a shared understanding of what we're building and why.

Rules:
- Ask one question at a time. Each builds on my previous answer.
- Do not move to the next question until I've responded.
- Cover: user-facing behavior, data model, integration points, non-functional requirements.
- When I'm vague, ask me to be specific. When I give you a solution, ask me what problem it solves.

**Exit gate:** Summarize your understanding as a bulleted requirements list. Do not proceed until I confirm.

## Phase 2: Challenge the architecture

Goal: Pressure-test the design before we commit.

Behave like a skeptical staff engineer reviewing a design doc. Do not accept my initial architecture at face value. Specifically:
- When you detect an assumption, ask what problem it solves and whether there's a simpler alternative.
- Before converging, ensure we've explored at least one credible alternative.
- Call out irreversible choices and unstated assumptions explicitly.
- Probe for gaps in: retry semantics, failure modes, state lifecycle, observability.

**Exit gate:** List the key decisions made, alternatives rejected with rationale, and open questions. Do not proceed until I confirm.

## Phase 3: Produce the spec

Goal: A developer-ready spec I can hand to a reviewer or work from directly.

Output format:
1. **Overview** — what and why (2-3 sentences)
2. **Requirements** — confirmed from Phase 1
3. **Architecture** — components, data flow, integration points
4. **Key decisions** — what we decided, why, and what we rejected
5. **Data model** — tables, fields, relationships
6. **API surface** — endpoints, events, or interfaces exposed/consumed
7. **Edge cases and failure modes**
8. **Open questions**

---

Begin Phase 1. Ask the first question.

Here's the idea: [IDEA]
</code></pre>
<p>The one-question-at-a-time constraint has become my favorite trick. It forces the model to integrate each answer before moving on. It also forces me to think instead of dumping half-formed ideas into the thread.</p>
<p>I walked in with a rough idea: wrap our existing Cloudflare tunnel + AWS IAM auth flow behind MCP tools so AI agents can query production read replicas without manual setup. Two hours of back-and-forth, and two things happened that I don't think would have happened if I'd just started coding.</p>
<p><strong>The model challenged my architecture.</strong> My initial design proposed shelling out to CLI tools — running <code>aws rds generate-db-auth-token</code> as a subprocess, parsing stdout, piping results around. I'd been doing it manually during on-call, so automating the same commands felt natural. The model pushed back: why spawn processes and parse strings when <code>@aws-sdk/rds-signer</code> gives you <code>Signer.getAuthToken()</code> directly? Testable, no string parsing, respects the credential provider chain natively.</p>
<p>I didn't just take its word for it — I asked it to walk through the tradeoffs. But the conversation surfaced a better design than what I walked in with.</p>
<p><strong>Deep domain knowledge showed up where I didn't expect it.</strong> The model was fluent in IAM auth token mechanics, RDS replica routing, Cloudflare Access tunnel lifecycle, Postgres session-level guardrails. It suggested <code>default_transaction_read_only=on</code> at the session level as defense-in-depth on top of DB role permissions. It flagged that IAM tokens expire after 15 minutes and proposed reconnect-on-auth-failure instead of timer-based refresh — simpler, and it handles edge cases like clock skew naturally. None of this was prompted; it emerged because the spec format gave the model room to reason about the full problem.</p>
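<p>The reconnect-on-auth-failure idea is easy to sketch. What follows is my illustration, not code from the server: <code>connect</code> stands in for opening a connection with a freshly signed token (via <code>Signer.getAuthToken()</code> from <code>@aws-sdk/rds-signer</code>), and <code>isAuthFailure</code> classifies Postgres auth errors.</p>
<pre><code class="language-ts">// Illustrative sketch of reconnect-on-auth-failure, not the server's actual code.
// IAM auth tokens expire after 15 minutes; instead of refreshing on a timer,
// treat an auth failure as "token expired", sign a new token, and retry once.

type Conn = { query: (sql: string) =&gt; Promise&lt;unknown&gt; };

// Postgres reports auth failures as SQLSTATE 28000 / 28P01.
function isAuthFailure(code: string | undefined): boolean {
  return code === "28000" || code === "28P01";
}

async function withFreshTokenRetry&lt;T&gt;(
  connect: () =&gt; Promise&lt;Conn&gt;, // assumed to sign a fresh IAM token on every call
  run: (conn: Conn) =&gt; Promise&lt;T&gt;
): Promise&lt;T&gt; {
  const conn = await connect();
  try {
    return await run(conn);
  } catch (err) {
    if (!isAuthFailure((err as { code?: string }).code)) throw err;
    const fresh = await connect(); // new token, new connection
    return run(fresh);
  }
}
</code></pre>
<p>The nice property, as the model pointed out, is that expiry, clock skew, and credential rotation all collapse into the same recovery path.</p>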
<p>By the end I had a structured document: problem statement, tool interface, connection model, module breakdown, security constraints, error codes, and explicit open decisions.</p>
<h3>Step 2: Task Breakdown</h3>
<p>I use <a href="https://dex.rip">Dex</a> for task management, but the tool matters less than the property: tasks live as files on the local filesystem rather than in a cloud board or a chat thread. This turned out to matter more than I expected.</p>
<p>I fed the finished spec into Dex and broke it into tasks scoped to single modules: target registry with replica-only validation, Cloudflare tunnel lifecycle manager, IAM token generation with reconnect-on-failure, session-level Postgres guardrails, audit logging. Each independently mergeable, with explicit dependencies.</p>
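<p>As a purely illustrative example of what that replica-only validation can look like (the names and hosts here are made up, not the server's actual registry), query tools resolve targets by name and refuse anything not flagged as a read replica:</p>
<pre><code class="language-ts">// Hypothetical target registry; names and hosts are invented for illustration.
type DbTarget = { host: string; database: string; isReplica: boolean };

const targets: Record&lt;string, DbTarget&gt; = {
  "orders-replica": { host: "orders-ro.internal", database: "orders", isReplica: true },
  "orders-primary": { host: "orders.internal", database: "orders", isReplica: false },
};

// Every query tool resolves its target through this gate, so a primary can
// never be queried even if one is added to the registry by mistake.
function resolveTarget(name: string): DbTarget {
  const target = targets[name];
  if (!target) throw new Error(`unknown target: ${name}`);
  if (!target.isReplica) throw new Error(`refusing non-replica target: ${name}`);
  return target;
}
</code></pre>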
<p>The key thing about tasks on disk: they persist between sessions without me having to carry them in a conversation thread. When I opened a new coding session the next day between pages, I pointed the model at the spec and one Dex task file — not yesterday's chat history. The model couldn't build ahead because it only saw one task. It didn't need to remember what I did yesterday. The state lived on the filesystem, not in the context window.</p>
<p>This is the difference between a long thread and externalized state. A thread loses fidelity as it grows. A task file on disk is the same on day one and day five.</p>
<h3>Step 3: Execution</h3>
<p>Each coding session got a three-part context: spec for intent, current task for scope, dependency state for boundaries. For example: target registry done, tunnel manager done, audit logger pending — so the model knew what interfaces already existed and what it should not invent.</p>
<p>The first pass produced working code with the wrong structure — everything in a monolithic <code>index.ts</code> and an ever-growing <code>types.ts</code>. I've seen this enough to think of it as a default AI code smell: a few general-purpose files that just keep expanding.</p>
<p>I opened a fresh session with the spec's module breakdown as explicit context: "here's the target architecture — target-registry, access-adapter, client-adapter, query-orchestrator, audit-logger. Refactor into this structure." Clean separation on the second pass.</p>
<p>The model didn't write bad code because it's bad at code. It wrote monolithic code because nothing in its context told it not to.</p>
<h3>Step 4: Review</h3>
<p>Asking the model that wrote the code to also review it is like reviewing your own PR. So I used a separate model for code review — one better at reading and critiquing than generating. This reflects something I've come to believe more broadly: generation and critique are different tasks and benefit from different contexts, sometimes different tools.</p>
<p>The review caught edge cases in tunnel process cleanup during shutdown, validated that Postgres guardrails were applied on every connection (not just the first), and checked consistency across the error model.</p>
<p>Total: about 10 hours across the week. Two hours on the spec. A few hours building across sessions. An hour each on code review and security review. Fifteen minutes of final QA.</p>
<hr />
<h2>Where This Doesn't Work</h2>
<p>I want to be honest about the limits.</p>
<p><strong>The model assumed an architecture I hadn't decided on.</strong> On a previous project, I got several tasks deep before realizing the model had picked an approach that closed off what I actually wanted. Had to revert and restart. The fix I've landed on: surface architectural decisions as explicit open questions in the spec and flag them before unblocking dependent tasks. The MCP build caught this early — but only because I'd been burned before.</p>
<p><strong>First-pass code quality was mediocre.</strong> The MCP build had this. If nothing in context specifies module boundaries, code piles up. I now include the target architecture explicitly in the execution context.</p>
<p><strong>Spec too big, tasks explode.</strong> Twenty-plus tasks with blurry boundaries usually means the feature isn't one feature. More than eight to ten has been my signal to split the spec.</p>
<p><strong>The model inherited the local testing culture.</strong> In parts of the codebase where tests were sparse, it treated tests as optional. Models mirror visible norms. I've started adding "each task includes tests for the behavior it introduces" to the task breakdown, and calling out low coverage at the start of coding sessions.</p>
<p>There are also categories of work where I don't think this workflow fits: greenfield exploration where requirements are genuinely unknown, large cross-team architectural decisions that need human alignment more than AI output, or domains where correctness depends on deep hidden context that can't easily be written into a spec. This is a workflow for building well-understood features faster. It's not a replacement for figuring out what to build.</p>
<hr />
<h2>What I Took Away</h2>
<p>The biggest shift for me was realizing that reliability doesn't come from asking the model to "remember." It comes from deciding what belongs in the context window and what belongs outside it. Once I started treating context as a designed interface instead of accumulated chat history, multi-day AI-assisted work got much better.</p>
<p>The model matters. But on multi-day engineering work, context has been the bigger lever.</p>
<hr />
<p><em>I'm still iterating on this. If you've found approaches that work for keeping context tight on multi-day features — especially across service boundaries — I'd like to hear about them.</em></p>
]]></content:encoded></item><item><title><![CDATA[How I Cut Page Load Time by 90%]]></title><description><![CDATA[I built Zugzwang, a browser-based chess puzzle app powered by Stockfish WASM. It worked, but the initial load was brutal—nearly 39 seconds on a throttled 4G connection before the board was even visibl]]></description><link>https://sanjitsaluja.com/how-i-cut-page-load-time-by-90</link><guid isPermaLink="true">https://sanjitsaluja.com/how-i-cut-page-load-time-by-90</guid><category><![CDATA[performance]]></category><category><![CDATA[Web Development]]></category><category><![CDATA[wasm]]></category><category><![CDATA[stockfish]]></category><category><![CDATA[chess]]></category><dc:creator><![CDATA[Sanjit Saluja]]></dc:creator><pubDate>Sun, 15 Feb 2026 22:57:02 GMT</pubDate><content:encoded><![CDATA[<p>I built <a href="https://zugzwang.me">Zugzwang</a>, a browser-based chess puzzle app powered by Stockfish WASM. It worked, but the initial load was brutal—nearly <strong>39 seconds</strong> on a throttled 4G connection before the board was even visible. Slower devices showed even worse numbers.</p>
<p>If you're shipping WASM or heavy client-side dependencies, you've probably hit this wall.</p>
<p>What was the holdup? Everything was on the critical path. A modal the user hadn't opened yet, CSS for themes the user hadn't selected, and a 7.3 MB chess engine the user didn't need until their first move.</p>
<p>Three commits later, board-visible time dropped from <strong>38.7s to 3.7s (p50)</strong>. Here's exactly what I did.</p>
<hr />
<h2>Defining the Metrics</h2>
<p>Before making any changes, I needed clear targets.</p>
<p><strong>Board-visible time</strong> is my primary metric: the moment the chessboard element (<code>.ui-board-root</code>) first renders with a non-zero layout box. This is custom instrumentation (a <code>MutationObserver</code> injected into the page via Playwright), and it directly measures "when can the user see the puzzle and start thinking."</p>
<p><strong>LCP (Largest Contentful Paint)</strong> is the standard Web Vital that captures when the largest element finishes rendering. In this app, LCP closely tracks board-visible time but includes additional paint work. I used LCP as a supporting metric via <code>PerformanceObserver</code>.</p>
<p><strong>FCP (First Contentful Paint)</strong> measures when <em>any</em> content first appears. This stayed relatively stable across optimizations—the big wins came from what happened <em>after</em> first paint.</p>
<p>Cloudflare reports LCP and FCP out of the box, and Chrome DevTools and Lighthouse report them as well. For all measurements, I used the same machine with Playwright automation and a throttled 4G simulation (1.6 Mbps down, 750 ms RTT).</p>
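<p>For illustration, the in-page half of the board-visible probe can look roughly like this. The <code>.ui-board-root</code> selector is the one above; the rest is my sketch of code injected through Playwright's <code>page.evaluate</code>, not the exact instrumentation:</p>
<pre><code class="language-ts">// Browser-context sketch: resolve with the performance.now() timestamp at
// which .ui-board-root first has a non-zero layout box.
function waitForBoardVisible(): Promise&lt;number&gt; {
  return new Promise((resolve) =&gt; {
    const hasBox = (): boolean =&gt; {
      const el = document.querySelector(".ui-board-root");
      if (!el) return false;
      const rect = el.getBoundingClientRect();
      return rect.width &gt; 0 &amp;&amp; rect.height &gt; 0;
    };
    const observer = new MutationObserver(() =&gt; {
      if (hasBox()) {
        observer.disconnect();
        resolve(performance.now());
      }
    });
    observer.observe(document.documentElement, {
      childList: true,
      subtree: true,
      attributes: true,
    });
    if (hasBox()) { // the board may already be visible when we attach
      observer.disconnect();
      resolve(performance.now());
    }
  });
}
</code></pre>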
<hr />
<h2>Commit 1: Lazy-Load the Menu Modal</h2>
<p><strong>Problem:</strong> The <code>MenuModal</code> component (stats, settings, theme pickers) was statically imported in the main app. Its code shipped in the main bundle and was parsed on every page load, even though most users don't open the menu immediately.</p>
<p><strong>Fix:</strong> React's <code>lazy()</code> + <code>Suspense</code>.</p>
<pre><code class="language-ts">const LazyMenuModal = lazy(async () =&gt; {
  const module = await import("@/components/MenuModal");
  return { default: module.MenuModal };
});
</code></pre>
<p>A <code>shouldRenderMenu</code> state gate ensures the component tree doesn't even mount <code>&lt;Suspense&gt;</code> until the user first opens the menu, so React never renders the lazy component (and never fetches its chunk) before it's actually wanted:</p>
<pre><code class="language-ts">const [shouldRenderMenu, setShouldRenderMenu] = useState(false);

useEffect(() =&gt; {
  if (isMenuOpen) setShouldRenderMenu(true);
}, [isMenuOpen]);
</code></pre>
<p><strong>Result:</strong></p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before</th>
<th>After</th>
<th>Delta</th>
</tr>
</thead>
<tbody><tr>
<td>Main JS (initial route)</td>
<td>427.60 kB</td>
<td>415.20 kB</td>
<td>-12.40 kB (-2.9%)</td>
</tr>
<tr>
<td>Main JS gzip</td>
<td>133.74 kB</td>
<td>130.99 kB</td>
<td>-2.75 kB (-2.1%)</td>
</tr>
<tr>
<td>MenuModal chunk (loaded on demand)</td>
<td>—</td>
<td>16.88 kB (4.70 kB gzip)</td>
<td>—</td>
</tr>
</tbody></table>
<p>This change barely moved the needle. But it established the mental model for the bigger wins: if a component is behind a user interaction, it doesn't belong in your initial bundle.</p>
<hr />
<h2>Commit 2: Load Only the Selected Board and Piece Styles</h2>
<p><strong>Problem:</strong> The app ships multiple chessboard themes (blue, brown, gray, green) and piece sets. All of them were statically imported as CSS—meaning every user downloaded every theme on first load, even though they can only see one at a time.</p>
<p><strong>Fix:</strong> A tiny style loader (<code>chessground-style-loader.ts</code>) that dynamically imports only the active theme's CSS:</p>
<pre><code class="language-ts">const boardThemeLoaders: Record&lt;string, () =&gt; Promise&lt;unknown&gt;&gt; = {
  blue:  () =&gt; import("@/styles/chessground-board-theme-blue.css"),
  brown: () =&gt; import("@/styles/chessground-board-theme-brown.css"),
  // ...
};

const loadedBoardThemes = new Set&lt;string&gt;();
const loadingBoardThemes = new Map&lt;string, Promise&lt;void&gt;&gt;();

async function loadStyleOnce(
  name: string,
  loaded: Set&lt;string&gt;,
  loading: Map&lt;string, Promise&lt;void&gt;&gt;,
  loaders: Record&lt;string, () =&gt; Promise&lt;unknown&gt;&gt;
) {
  if (loaded.has(name)) return;
  if (loading.has(name)) return loading.get(name);

  const promise = loaders[name]()
    .then(() =&gt; { loaded.add(name); })
    .finally(() =&gt; { loading.delete(name); });

  loading.set(name, promise);
  return promise;
}
</code></pre>
<p>The <code>Board</code> component calls <code>ensureBoardThemeStyles(theme)</code> and <code>ensurePieceSetStyles(pieceSet)</code> in <code>useEffect</code> hooks whenever the theme prop changes. The <code>loaded</code> set and <code>loading</code> map prevent duplicate requests.</p>
<p><strong>UX guardrail:</strong> To avoid a flash of unstyled board, I load the user-selected theme's CSS synchronously in the initial bundle—only <em>alternative</em> themes load dynamically.</p>
<p>I also split the monolithic CSS file into per-theme files using CSS custom properties and gradients, which gave the bundler clean split points.</p>
<p><strong>Result:</strong></p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before</th>
<th>After</th>
<th>Delta</th>
</tr>
</thead>
<tbody><tr>
<td>Initial app CSS</td>
<td>159.51 kB</td>
<td>81.15 kB</td>
<td>-78.36 kB (-49.1%)</td>
</tr>
<tr>
<td>Initial app CSS gzip</td>
<td>33.02 kB</td>
<td>12.92 kB</td>
<td>-20.10 kB (-60.9%)</td>
</tr>
</tbody></table>
<p><strong>A 61% reduction in CSS over the wire.</strong> CSS is render-blocking by default—the browser won't paint anything until it's finished parsing all linked stylesheets. Cutting the CSS payload in half directly accelerated first paint.</p>
<hr />
<h2>Commit 3: Decouple Puzzle Render From Stockfish Startup</h2>
<p>This was the big one. The app loads the Stockfish AI as WASM to simulate computer moves and provide user move feedback.</p>
<p><strong>Problem:</strong> The app waited for the WASM to initialize before rendering the puzzle board. Stockfish's WASM binary is <strong>~7.3 MB</strong>. On a slow connection, the user stared at a loading spinner for 36+ seconds before seeing a single chess piece.</p>
<p>But here's the thing: the user needs to <em>see</em> the board right away. The engine is only needed to <em>validate</em> moves. That's a meaningfully different moment in the user flow.</p>
<h3>Deep Dive</h3>
<p>These waterfall charts show exactly what changed. In the baseline, the board couldn't render until the 36-second WASM download completed:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771195655051/4b198f9e-a03f-43c8-ab5e-258380b0a061.png" alt="" style="display:block;margin:0 auto" />

<p><em>Baseline: Board visibility blocked on Stockfish WASM (36.28s)</em></p>
<p>After decoupling, the board renders in ~3.7s while WASM downloads in the background:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771195694247/f78addf8-7a4f-4b18-a020-b7195f656aee.png" alt="" style="display:block;margin:0 auto" />

<p><em>Current: Board visible at 3.7s; WASM download continues in parallel</em></p>
<p>The striped bar in the current waterfall shows Stockfish still in-flight when the board becomes visible. That's the critical path fix visualized.</p>
<p><strong>Fix:</strong> I restructured the initialization sequence so the puzzle data and board render proceed independently of the engine:</p>
<ol>
<li><p><strong>Fetch puzzles and render immediately.</strong> The puzzle JSON is small (~1.4s to load on throttled 4G). Once it arrives, mount the board.</p>
</li>
<li><p><strong>Initialize Stockfish in the background.</strong> A <code>stockfishRef</code> holds the engine instance; a <code>createPuzzleStrategy()</code> function lazily initializes it on the first move that actually requires engine evaluation.</p>
</li>
<li><p><strong>Show engine state only when relevant.</strong> A new <code>isAwaitingEngineMove</code> flag drives a "Loading engine..." indicator in <code>PuzzleInfo</code>, but only when the user has made a move and the engine hasn't finished loading. Before that, the user sees the board and can think about the position.</p>
</li>
</ol>
<pre><code class="language-ts">// Lazy engine initialization — only when we actually need evaluation
function createPuzzleStrategy() {
  if (stockfishRef.current) return engineStrategy(stockfishRef.current);

  beginEngineWait();
  return initStockfish()
    .then(engine =&gt; {
      stockfishRef.current = engine;
      endEngineWait();
      return engineStrategy(engine);
    })
    .catch(() =&gt; {
      endEngineWait();
      return solutionBasedStrategy(); // graceful fallback
    });
}
</code></pre>
<h3>Graceful Degradation</h3>
<p>The fallback to <code>solutionBasedStrategy()</code> is a deliberate architectural choice. If Stockfish fails to load—network timeout, WASM unsupported, whatever—the app remains functional. It checks moves against the known solution line instead of running a full evaluation. Users lose engine analysis for alternative lines, but they can still solve puzzles. This matters for offline scenarios and older devices where WASM might be flaky.</p>
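<p>A minimal sketch of what a solution-based check can look like. The names and shapes here are my illustration, not the app's actual strategy interface; it assumes the puzzle data carries the solution line as a list of UCI moves:</p>
<pre><code class="language-ts">// Illustrative solution-based fallback: with no engine available, a move is
// "correct" only if it matches the next move of the known solution line.
type MoveVerdict = "correct" | "incorrect" | "solved";

function checkAgainstSolution(
  solutionLine: string[], // e.g. ["e2e4", "e7e5", ...] from the puzzle data
  movesPlayed: number,    // moves already made in the line (user + computer)
  candidate: string       // the move the user just played, in UCI
): MoveVerdict {
  if (candidate !== solutionLine[movesPlayed]) return "incorrect";
  return movesPlayed + 1 &gt;= solutionLine.length ? "solved" : "correct";
}
</code></pre>
<p>The tradeoff is visible right in the sketch: an alternative good move that deviates from the stored line is reported as incorrect, which is exactly the degradation in engine analysis this fallback accepts.</p>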
<p>I also added tests for the loading states (<code>PuzzleInfo.test.tsx</code>) covering the three key scenarios: active play, engine loading during validation, and puzzle completion.</p>
<p><strong>Result:</strong></p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Baseline p50</th>
<th>Current p50</th>
<th>Delta</th>
</tr>
</thead>
<tbody><tr>
<td>Board visible</td>
<td>38,669 ms</td>
<td>3,673 ms</td>
<td>-34,996 ms (-90.5%)</td>
</tr>
<tr>
<td>LCP</td>
<td>38,688 ms</td>
<td>4,488 ms</td>
<td>-34,200 ms (-88.4%)</td>
</tr>
<tr>
<td>FCP</td>
<td>2,376 ms</td>
<td>2,268 ms</td>
<td>-108 ms (-4.5%)</td>
</tr>
</tbody></table>
<p>The board now renders as soon as the puzzle data arrives. Stockfish loads in the background. The user starts thinking about the position <strong>35 seconds earlier</strong>.</p>
<hr />
<h2>What I Considered But Didn't Ship</h2>
<p>I also evaluated SSR, WASM streaming, and service workers.</p>
<p><strong>SSR didn't address this bottleneck.</strong> It could pre-render markup, but Stockfish still downloads and initializes on the client regardless of how the markup arrives. The dominant cost—7.3 MB of WASM—remains unchanged. SSR would add architectural complexity without fixing the actual critical path.</p>
<p><strong>WASM streaming was already present.</strong> Stockfish's runtime uses <code>instantiateStreaming</code> with a fallback path. I tested forcing non-streaming, and it barely moved the needle—engine readiness changed by ~35ms. Not worth pursuing.</p>
<p><strong>Service workers are the one meaningful follow-up.</strong> They won't help cold-load board visibility, but they should dramatically reduce repeat-visit engine startup by caching the WASM binary. I'll try this next.</p>
<p>This investigation reinforced the core lesson: measure before you architect. SSR and streaming sounded like obvious wins until I traced the actual bottleneck.</p>
<hr />
<h2>Summary</h2>
<table>
<thead>
<tr>
<th><strong>Change</strong></th>
<th><strong>What Moved Off the Critical Path</strong></th>
<th><strong>Key Savings</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Lazy-load menu modal</td>
<td>16.88 kB of JS (modal code)</td>
<td>-2.9% initial JS</td>
</tr>
<tr>
<td>Dynamic theme loading</td>
<td>Unused CSS themes and piece sets</td>
<td>-61% initial CSS (gzip)</td>
</tr>
<tr>
<td>Decouple Stockfish</td>
<td>7.3 MB WASM binary</td>
<td>-90.5% board-visible time</td>
</tr>
</tbody></table>
<h3>Full Distribution Results</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Baseline p50</th>
<th>Current p50</th>
<th>Delta</th>
<th>Baseline p95</th>
<th>Current p95</th>
</tr>
</thead>
<tbody><tr>
<td>Board visible</td>
<td>38,669 ms</td>
<td>3,673 ms</td>
<td>-90.5%</td>
<td>38,687 ms</td>
<td>3,675 ms</td>
</tr>
<tr>
<td>LCP</td>
<td>38,688 ms</td>
<td>4,488 ms</td>
<td>-88.4%</td>
<td>38,707 ms</td>
<td>4,495 ms</td>
</tr>
<tr>
<td>FCP</td>
<td>2,376 ms</td>
<td>2,268 ms</td>
<td>-4.5%</td>
<td>2,384 ms</td>
<td>2,268 ms</td>
</tr>
</tbody></table>
<h3>Methodology</h3>
<p>All measurements from 7 cold-load runs per revision using Playwright with Chromium. Network throttled via CDP to simulate Slow 4G (1.6 Mbps down, 0.75 Mbps up, 750 ms RTT). Fresh browser context per run with cache disabled. Board-visible time measured via <code>MutationObserver</code> watching for <code>.ui-board-root</code> with non-zero layout box. LCP and FCP captured via <code>PerformanceObserver</code>.</p>
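<p>The throttling half of that setup can be sketched as follows. <code>cdp</code> stands for a Playwright <code>CDPSession</code> (from <code>context.newCDPSession(page)</code>, Chromium only); it's typed loosely here so the sketch stays self-contained:</p>
<pre><code class="language-ts">// Sketch: apply the Slow 4G profile from the methodology via CDP.
type CDPSession = {
  send: (method: string, params?: Record&lt;string, unknown&gt;) =&gt; Promise&lt;unknown&gt;;
};

// CDP expects throughput in bytes per second.
const mbpsToBytesPerSec = (mbps: number): number =&gt; (mbps * 1_000_000) / 8;

async function applySlow4G(cdp: CDPSession): Promise&lt;void&gt; {
  await cdp.send("Network.emulateNetworkConditions", {
    offline: false,
    latency: 750,                               // ms of round-trip time
    downloadThroughput: mbpsToBytesPerSec(1.6), // 1.6 Mbps down
    uploadThroughput: mbpsToBytesPerSec(0.75),  // 0.75 Mbps up
  });
  await cdp.send("Network.setCacheDisabled", { cacheDisabled: true });
}
</code></pre>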
<hr />
<h2>Takeaways</h2>
<p><strong>Audit your critical path, not your bundle size.</strong> Bundle size is a proxy metric. The real question is: what does the user need to see and interact with <em>right now</em>, and what can wait? In this case, the largest payload (Stockfish WASM) wasn't even render-blocking by nature—I had just wired it up that way.</p>
<p><strong>CSS is the silent blocker.</strong> JavaScript gets all the performance discourse, but CSS is render-blocking by default. Shipping 160 kB of CSS when the user only needs 80 kB means the browser is parsing themes the user will never see before it paints anything.</p>
<p><strong>Lazy loading is a spectrum.</strong> <code>React.lazy()</code> is the obvious tool, but the same principle applies to CSS, WASM, and any asset. The pattern is always the same: identify the trigger (user interaction, route change, first move), load the asset at that trigger, and handle the loading state gracefully.</p>
<p><strong>Measure before you architect.</strong> The "obvious" optimizations (SSR, streaming) weren't the right levers for this problem. The bottleneck wasn't server rendering or download efficiency—it was blocking on resources that weren't needed yet.</p>
<hr />
<p><em>The full source is at</em> <a href="https://github.com/sanjitsaluja/zugzwang-puzzle-trainer"><em>github.com/sanjitsaluja/zugzwang-puzzle-trainer</em></a><em>. If you're building with heavy WASM dependencies, I'd be curious to hear how you've handled the initialization tradeoff.</em></p>
]]></content:encoded></item></channel></rss>