I Built a Production MCP Server in 10 Hours. The Real Lever Was Context Engineering.
A few weeks ago I was on call. Busy week — a fair number of pages, a couple of incidents, the usual pressure of being the person who has to figure out what's broken right now. My debugging loop was Cursor → TablePlus → Cursor → TablePlus on repeat. Write a query, copy it, paste it into TablePlus, run it, copy the results, paste them back into Cursor so the AI could reason about them, get a follow-up query, copy that, switch back — repeat.
I was spending more time on clipboard management than on actually understanding the problem.
By day three, I started building an MCP server on the side: a tool that gives AI agents direct, read-only access to our production Postgres replicas. Cloudflare tunnel auth, IAM token management, PII scrubbing, four databases, audit logging. I worked on it an hour or two a day between pages. By the end of the week it was in final review: a small set of audited, replica-only query tools with PII protection that let Cursor inspect production data directly — no manual token generation, no copy-pasting between apps. About 10 hours of total effort — and the reason it went that fast wasn't the model. It was that I stopped treating context as chat history and started treating it as a system to design.
Context Engineering
For a while, my default approach to building with AI was a single long conversation. Describe the feature, iterate, keep going. It works for small things. But once a feature stops fitting in one sitting, things go sideways — the model forgets earlier decisions, contradicts itself, confidently breaks things I already fixed. Past a certain thread length, recent information is just louder than old information.
By context engineering, I don't mean writing clever prompts. I mean deliberately controlling three things: what the model sees, what it doesn't see, and what state lives outside the conversation. It's closer to interface design than prompt engineering.
Here's roughly how I partition it:
| Step | What I give it | What I leave out | Why |
|---|---|---|---|
| Spec | My idea + one question at a time | Full design upfront | I want behavior before implementation details |
| Task breakdown | The complete spec | The codebase | Otherwise it reverse-engineers a plan from the current codebase instead of the intended system |
| Execution | Spec + current task + what's done so far | All other tasks | Prevents speculative coding and keeps changes local |
How This Played Out on the MCP Server
Step 1: Spec Interactively
I opened a conversation with a reasoning model and gave it one instruction:
Spec Development Prompt
We're going to develop a detailed spec together in three phases.
## Phase 1: Clarify requirements and constraints
Goal: Build a shared understanding of what we're building and why.
Rules:
- Ask one question at a time. Each builds on my previous answer.
- Do not move to the next question until I've responded.
- Cover: user-facing behavior, data model, integration points, non-functional requirements.
- When I'm vague, ask me to be specific. When I give you a solution, ask me what problem it solves.
**Exit gate:** Summarize your understanding as a bulleted requirements list. Do not proceed until I confirm.
## Phase 2: Challenge the architecture
Goal: Pressure-test the design before we commit.
Behave like a skeptical staff engineer reviewing a design doc. Do not accept my initial architecture at face value. Specifically:
- When you detect an assumption, ask what problem it solves and whether there's a simpler alternative.
- Before converging, ensure we've explored at least one credible alternative.
- Call out irreversible choices and unstated assumptions explicitly.
- Probe for gaps in: retry semantics, failure modes, state lifecycle, observability.
**Exit gate:** List the key decisions made, alternatives rejected with rationale, and open questions. Do not proceed until I confirm.
## Phase 3: Produce the spec
Goal: A developer-ready spec I can hand to a reviewer or work from directly.
Output format:
1. **Overview** — what and why (2-3 sentences)
2. **Requirements** — confirmed from Phase 1
3. **Architecture** — components, data flow, integration points
4. **Key decisions** — what we decided, why, and what we rejected
5. **Data model** — tables, fields, relationships
6. **API surface** — endpoints, events, or interfaces exposed/consumed
7. **Edge cases and failure modes**
8. **Open questions**
---
Begin Phase 1. Ask the first question.
Here's the idea: [IDEA]
The one-question-at-a-time constraint has become my favorite trick. It forces the model to integrate each answer before moving on. It also forces me to think instead of dumping half-formed ideas into the thread.
I walked in with a rough idea: wrap our existing Cloudflare tunnel + AWS IAM auth flow behind MCP tools so AI agents can query production read replicas without manual setup. Two hours of back-and-forth, and two things happened that I don't think would have happened if I'd just started coding.
The model challenged my architecture. My initial design proposed shelling out to CLI tools — running aws rds generate-db-auth-token as a subprocess, parsing stdout, piping results around. I'd been doing it manually during on-call, so automating the same commands felt natural. The model pushed back: why spawn processes and parse strings when @aws-sdk/rds-signer gives you Signer.getAuthToken() directly? Testable, no string parsing, respects the credential provider chain natively.
I didn't just take its word for it — I asked it to walk through the tradeoffs. But the conversation surfaced a better design than what I walked in with.
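For illustration, the SDK approach looks roughly like this. A minimal sketch, not the server's actual code; the hostname and username are placeholders:

```typescript
import { Signer } from "@aws-sdk/rds-signer";

// Build a signer for the replica endpoint. Credentials come from the default
// AWS provider chain (env vars, SSO, instance role), so there is no subprocess
// to spawn and no stdout to parse.
const signer = new Signer({
  hostname: "analytics-replica.example.us-east-1.rds.amazonaws.com", // placeholder
  port: 5432,
  username: "readonly_agent", // placeholder DB user mapped to an IAM policy
  region: "us-east-1",
});

// Short-lived IAM auth token, used as the Postgres password.
const token = await signer.getAuthToken();
```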
Deep domain knowledge showed up where I didn't expect it. The model was fluent in IAM auth token mechanics, RDS replica routing, Cloudflare Access tunnel lifecycle, and Postgres session-level guardrails. It suggested default_transaction_read_only=on at the session level as defense-in-depth on top of DB role permissions. It flagged that IAM tokens expire after 15 minutes and proposed reconnect-on-auth-failure instead of timer-based refresh — simpler, and it handles edge cases like clock skew naturally. None of this was prompted.
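A sketch of how those two suggestions might fit together with node-postgres. This assumes a pg Pool; the host, user, and database names are placeholders, and CA/TLS configuration is omitted:

```typescript
import { Pool } from "pg";
import { Signer } from "@aws-sdk/rds-signer";

const replicaHost = "analytics-replica.example.us-east-1.rds.amazonaws.com"; // placeholder
const signer = new Signer({
  hostname: replicaHost,
  port: 5432,
  username: "readonly_agent", // placeholder
  region: "us-east-1",
});

const pool = new Pool({
  host: replicaHost,
  port: 5432,
  user: "readonly_agent",
  database: "app", // placeholder
  ssl: true, // IAM auth requires TLS; CA setup omitted here
  // A fresh token per new connection: tokens expire after ~15 minutes, so an
  // auth failure just means the client reconnects and gets a new token here,
  // rather than refreshing on a timer.
  password: () => signer.getAuthToken(),
});

// Defense-in-depth: force read-only at the session level on every connection
// the pool hands out, on top of the read-only DB role.
pool.on("connect", (client) => {
  client.query("SET default_transaction_read_only = on").catch((err) => {
    console.error("failed to apply read-only guardrail:", err);
  });
});
```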
By the end I had a structured document: problem statement, tool interface, connection model, module breakdown, security constraints, error codes, and explicit open decisions. Those suggestions didn't surface because of a magical prompt. They surfaced because the one-at-a-time format let the model reason about the full problem before I pushed it into implementation.
Step 2: Task Breakdown
I use Dex for task management, but the tool matters less than the property: tasks live as files on the local filesystem rather than in a cloud board or a chat thread. That property turned out to be more important than I expected.
I fed the finished spec into Dex and broke it into tasks scoped to single modules: target registry with replica-only validation, Cloudflare tunnel lifecycle manager, IAM token generation with reconnect-on-failure, session-level Postgres guardrails, audit logging. Each independently mergeable, with explicit dependencies.
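To make that concrete, a task file ends up looking something like this. The shape is illustrative; it isn't Dex's actual format:

```markdown
<!-- illustrative shape; Dex's real format may differ -->
# Task 03: IAM token generation with reconnect-on-failure

Depends on: 01-target-registry, 02-tunnel-manager
Status: pending

- Generate tokens via @aws-sdk/rds-signer (no CLI subprocess)
- On auth failure, drop the connection and reconnect with a fresh token
- Tests cover the reconnect path with an expired token
```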
The key thing about tasks on disk: they persist between sessions without me having to carry them in a conversation thread. When I opened a new coding session the next day between pages, I pointed the model at the spec and one Dex task file — not yesterday's chat history. The model couldn't build ahead because it only saw one task. It didn't need to remember what I did yesterday. The state lived on the filesystem, not in the context window.
This is the difference between a long thread and externalized state. A thread loses fidelity as it grows. A task file on disk is the same on day one and day five.
Step 3: Execution
Each coding session got a three-part context: spec for intent, current task for scope, dependency state for boundaries. For example: target registry done, tunnel manager done, audit logger pending — so the model knew what interfaces already existed and what it should not invent.
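The preamble for a session looked something like this (paraphrased, task names illustrative):

```text
Context for this session:
- Spec (intent): the full spec from Step 1
- Current task (scope): IAM token generation with reconnect-on-failure
- Dependency state (boundaries): target registry done, tunnel manager done,
  audit logger pending. Use the existing interfaces; don't invent new ones.
```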
The first pass produced working code with the wrong structure — everything in a monolithic index.ts and an ever-growing types.ts. I've seen this enough to think of it as a default AI code smell: a few general-purpose files that just keep expanding.
I opened a fresh session with the spec's module breakdown as explicit context: "here's the target architecture — target-registry, access-adapter, client-adapter, query-orchestrator, audit-logger. Refactor into this structure." Clean separation on the second pass.
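For concreteness, the structure I pointed it at looks roughly like this; the src/ layout and file names are my shorthand, not the exact tree:

```text
src/
  target-registry.ts     # known databases + replica-only validation
  access-adapter.ts      # Cloudflare tunnel lifecycle
  client-adapter.ts      # pg connections, IAM tokens, session guardrails
  query-orchestrator.ts  # runs a query against a target, shapes the results
  audit-logger.ts        # who asked what, when, against which target
  index.ts               # MCP server wiring and tool definitions
```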
The model didn't write bad code because it's bad at code. It wrote monolithic code because nothing in its context told it not to.
Step 4: Review
Asking the model that wrote the code to also review it is like reviewing your own PR. So I used a separate model for code review — one better at reading and critiquing than generating. This reflects something I've come to believe more broadly: generation and critique are different tasks and benefit from different contexts, sometimes different tools.
The review caught edge cases in tunnel process cleanup during shutdown, validated that Postgres guardrails were applied on every connection (not just the first), and checked consistency across the error model.
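The shutdown case is representative of what a second pass catches. A sketch of the shape of the fix, assuming the tunnel runs as a cloudflared child process (the command and port are placeholders, not the reviewed code):

```typescript
import { spawn, type ChildProcess } from "node:child_process";

let tunnel: ChildProcess | undefined;

function startTunnel(hostname: string): ChildProcess {
  // Placeholder invocation; the real args depend on the Cloudflare Access setup.
  tunnel = spawn(
    "cloudflared",
    ["access", "tcp", "--hostname", hostname, "--url", "localhost:5433"],
    { stdio: "ignore" }
  );
  return tunnel;
}

function stopTunnel(): void {
  if (tunnel && !tunnel.killed) {
    tunnel.kill("SIGTERM"); // don't leave an orphaned tunnel process behind
  }
  tunnel = undefined;
}

// Clean up on every exit path, not just the happy one.
for (const signal of ["SIGINT", "SIGTERM"] as const) {
  process.on(signal, () => {
    stopTunnel();
    process.exit(0);
  });
}
process.on("exit", stopTunnel);
```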
Total: about 10 hours across the week. Two hours on the spec. A few hours building across sessions. An hour each on code review and security review. Fifteen minutes of final QA.
Where This Doesn't Work
I want to be honest about the limits.
The model assumed an architecture I hadn't decided on. On a previous project, I got several tasks deep before realizing the model had picked an approach that closed off what I actually wanted. Had to revert and restart. The fix I've landed on: surface architectural decisions as explicit open questions in the spec and flag them before unblocking dependent tasks. The MCP build caught this early — but only because I'd been burned before.
First-pass code quality was mediocre. The MCP build hit this too: if nothing in the context specifies module boundaries, code piles up in a few general-purpose files. I now include the target architecture explicitly in the execution context.
Spec too big, tasks explode. Twenty-plus tasks with blurry boundaries usually mean the feature isn't one feature. More than eight to ten tasks has been my signal to split the spec.
The model inherited the local testing culture. In parts of the codebase where tests were sparse, it treated tests as optional. Models mirror visible norms. I've started adding "each task includes tests for the behavior it introduces" to the task breakdown, and calling out low coverage at the start of coding sessions.
There are also categories of work where I don't think this workflow fits: greenfield exploration where requirements are genuinely unknown, large cross-team architectural decisions that need human alignment more than AI output, or domains where correctness depends on deep hidden context that can't easily be written into a spec. This is a workflow for building well-understood features faster. It's not a replacement for figuring out what to build.
What I Took Away
The biggest shift for me was realizing that reliability doesn't come from asking the model to "remember." It comes from deciding what belongs in the context window and what belongs outside it. Once I started treating context as a designed interface instead of accumulated chat history, multi-day AI-assisted work got much better.
The model matters. But on multi-day engineering work, context has been the bigger lever.
I'm still iterating on this. If you've found approaches that work for keeping context tight on multi-day features — especially across service boundaries — I'd like to hear about them.