Table of Contents
- Introduction
- What changed in November 2025
- The experiment: does it actually work?
- Rebuilding the org chart
- Quality at scale: the Handbook
- Adapting the stack
- Opening the build to the whole company
- Learnings
- What’s next
- Conclusion
Introduction
Six months ago, our engineers wrote most of our code by hand. LLMs were around, but scoped: autocomplete, targeted refactors, boilerplate, or very specific workflows. Anything that required real judgement stayed with a human. We had tried handing bigger tasks to AI agents, with little success.
Today, the same backlog gets drained overnight by a fleet of agents we rarely babysit. Designers, PMs, and even CSMs ship features by mentioning a bot in Slack. Engineering day-to-day has been totally reshuffled. And our 25-person team handles problems that, a year ago, would have needed several hundred.
This post is a retrospective of how we got there: what triggered the shift, how we benchmarked the new models before trusting them, how we rebuilt our org, and the infrastructure we had to invent to keep quality intact at 10x the PR rate. It also takes stock of what isn’t solved yet.
What changed in November 2025
Every other month, AI companies and labs release new frontier models that claim to be revolutionary compared to their predecessors, crushing all the records on the major benchmarks. More often than not, this turns out to be more marketing gloss than real improvement. November 2025 was the rare case where the hype came closest to reality.
Within a few weeks, Anthropic shipped Opus 4.5 and OpenAI shipped GPT-5.2. The raw benchmark deltas were one thing, but the qualitative shift mattered more: models could now execute long, complex, autonomous tasks with a reliability that crossed a threshold we cared about. The kind of work we used to reserve for engineers because LLMs would veer off, hallucinate APIs, or quietly produce something that looked right and behaved wrong.
That said, we approached the whole thing with healthy skepticism. There’s a well-documented pattern where senior engineers feel more productive with AI tooling while their actual throughput drops, and METR’s 2025 study on experienced open-source developers remains the canonical example. Add to that the general volume of hype around generative AI broadly speaking, and you have a recipe for bad decisions. We wanted to measure rather than vibe our way into a strategy.
The experiment: does it actually work?
Methodology
We designed the fairest comparison we could think of: run the agent against the product team on the exact same backlog, during a normal sprint cadence, for several weeks.
A few rules we stuck to:
- Same inputs, raw and unedited. Agents received the exact raw material a human would get on the same ticket: the user story, screenshots, design files, wireframes. Rewriting tickets to be agent-friendly would have biased the comparison, so we kept the ambiguities in place. If a ticket was ambiguous for a human, it stayed ambiguous for the agent.
- Real merge bar. A PR was counted as “validated” only when it passed CI, cleared our full quality gates, and would have earned a human reviewer’s merge. The same bar we apply to the rest of the team, with zero concessions for “it kind of works”.
- Count the round-trips. If the agent needed three rounds of feedback to get there, those three rounds counted. A PR that eventually lands after five iterations of corrections is hard to call a win, so we tracked each iteration as a cost.
- Track token consumption (and thus the bill). Every agent run had its token usage and dollar cost logged; a minimal sketch of that accounting follows this list. A task that takes 7 hours of senior time isn’t automatically a good trade at $500 of tokens, and we wanted to see where that curve actually sat in practice.
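The accounting itself is plain token arithmetic. Here is a minimal sketch of the per-run logging we mean, where the per-million-token prices and field names are placeholders rather than any model’s real pricing:

```typescript
// Per-run cost accounting: a sketch. Prices are placeholders, not real model pricing.
interface AgentRun {
  ticket: string;
  inputTokens: number;
  outputTokens: number;
  feedbackRounds: number; // human round-trips, logged as a cost too
}

const PRICE_PER_MTOK = { input: 5, output: 25 }; // placeholder $ per million tokens

function costOf(run: AgentRun): number {
  return (
    (run.inputTokens / 1_000_000) * PRICE_PER_MTOK.input +
    (run.outputTokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}

// 600k input + 40k output tokens -> 0.6 * 5 + 0.04 * 25 = $4 for this run.
console.log(costOf({ ticket: "FIX-123", inputTokens: 600_000, outputTokens: 40_000, feedbackRounds: 1 }));
```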
On the human side, we captured two separate estimates per ticket: one from a senior engineer, one from a junior. Our product team is deliberately heterogeneous (people straight out of school working alongside engineers with decades of experience), and averaging them into a single “team velocity” would have hidden the signal. We wanted to see where the agent landed against each profile.
Results
We ran this for several weeks across most of our active backlog, and the resulting dataset is too long to drop into a blog post as-is. The table below is a representative slice of hand-picked tickets, chosen to span the shape of the experiment and give a fair overview of the overall result: a couple of straightforward bug fixes, some UI polish, a recurrence edge case the agent struggled with, a larger multi-bug cleanup, and a few proper feature-sized pieces of work. Same merge bar as the rest of the team, same raw inputs, same codebase. Every ticket in the slice was picked up by the agent, taken end-to-end, and merged into production.
| Task | Human takeovers | Junior est. (min) | Senior est. (min) | Agent (min) | Cost (€) | Model |
|---|---|---|---|---|---|---|
| Export by agent | 0 | 120 | 60 | 5 | 0.71 | Opus 4.5 |
| UI improvements | 2 | 1,260 | 480 | 16 | 1.22 | Opus 4.5 |
| Fix week calculation | 5 | 120 | 60 | 25 | 2.95 | Opus 4.5 |
| Stop/start recurrence switch | 2 | 300 | 120 | 12 | 5.30 | Opus 4.5 |
| Discard deleted jobs in analytics | 0 | 240 | 60 | 3 | 2.11 | Opus 4.5 |
| Fix export table column | 0 | 180 | 60 | 4 | 4.27 | Opus 4.5 |
| Fix specific recurrence issue | 0 | 120 | 60 | 5 | 1.92 | Opus 4.5 |
| Fix curve diagram stacking | 0 | 300 | 120 | 13 | 8.77 | Opus 4.5 |
| Multiple bugs | 3 | 420 | 180 | 42 | 28.80 | Opus 4.5 |
| Recurrence labels mismatch | 0 | 120 | 60 | 5 | 28.00 | Opus 4.6 |
| Single-task jobs | 3 | 2,160 | 600 | 25 | — | Opus 4.6 |
| Fix main view on job deletion | 0 | 240 | 30 | 5 | — | Opus 4.6 |
| Optional duration fields in creation modal | 1 | 420 | 60 | 23 | — | Opus 4.6 |
| Save user viewing preferences | 2 | 1,200 | 240 | 37 | — | Opus 4.6 |
| Aggregate (14 tasks) | 18 | 7,200 | 2,190 | 220 | 84.05 | — |
Crunched into a handful of headlines, drawn from this slice and confirmed against the broader run:
- The vast majority of tickets merged into production. Across the broader experiment, the overall merge rate settled around 95%.
- At the aggregate level, velocity gains came out to roughly x10 against senior estimates and x32 against junior estimates.
- Median cost per merged PR came in under €3 across the broader run. The most expensive ticket in this slice, a multi-bug cleanup, closed at €28.80, which is still an order of magnitude below what the equivalent senior time would cost us.
- Most tickets merged after one round of human feedback or none at all. The worst case in this slice asked for five rounds, on a tricky ticket that needed real domain discussion to nail down.
The x32 headline sits at the aggregate level and, on its own, hides a wide distribution that we think is worth staring at. Per-ticket ratios against junior estimates stretch from roughly x5 on the trickiest items in the slice up to around x86 on the most boilerplate-heavy ones. Certain classes of ticket (well-scoped, lots of boilerplate, straightforward CRUD) collapse to near-zero for an agent, whereas the same ticket takes a junior half a day because they’re simultaneously learning the codebase and writing the code. Other tickets, usually the ones that require product judgement or tricky state handling, land much closer to parity. That spread, and the consistency of the shape, is really the main signal we took away from the experiment.
These results sent a crystal-clear signal: our backlog could now be drained around the clock, with minimal human intervention, at a quality level we’re willing to ship.
Rebuilding the org chart
While the numbers were great, they naturally raised a set of hard, and entirely legitimate, human questions.
The morning after we shared the results internally, the first question we got from engineering was the obvious one: “if an agent can do what I do every day, what am I still here for?” It was the perfect opportunity to step back and have a deep, honest conversation about what the purpose of an engineer actually is.
So we ran a full-day workshop to reset the fundamentals.
Rethinking (or not) the engineering role
The discussion sessions led us to a first critical conclusion, one I personally hold dearly: coding has never been the mission of an engineer. It’s the tool. The mission has always been to understand a problem and shape a solution that solves it elegantly, safely, and at scale. For most of the industry’s history, coding absorbed so much of the day that we ended up confusing the tool with the job. Now that the tool is close to a commodity, the real job becomes visible again, and it turns out to need more engineering judgement than before, precisely because the consequences of each judgement compound faster than they used to.
Interestingly, engineers were not the only ones shaken by the process. Design, product management, and engineering have been fighting the same battles for years: context lost in translation, tickets that say what but not why, designers shipping mockups that can’t be built, engineers shipping features that don’t match the need. The silos were a necessary evil when each discipline was a full-time job.
AI unifies them. A product engineer (or product builder) is someone who can fetch the context themselves, talk to a customer, run an experiment, shape the UX, implement the feature end-to-end, measure the outcome, and iterate. This works because the agent quietly absorbs the parts that used to demand a separate specialist — the wiring, the boilerplate, the test scaffolding, the pixel-pushing — which frees the human to spend their time on the parts that truly require taste and judgement.
This is the direction we’re pushing the team in, with the goal of getting to fewer silos, shorter feedback loops, and real ownership of a feature all the way from the original user pain through to production. That said, the original skills of each discipline are far from becoming useless. On the contrary, each person retains a deeper expertise in their own field, and should be treated as the reference to turn to for advice or review. In that context, communication, debate, and daily exchanges are more important than ever.
The rise of platform engineering
Building software has gotten easy. Approaches like the BMAD Method are lowering the barrier further every month. But building and maintaining quality software at scale (hundreds of PRs a day, coming from product engineers, designers, and the rest of the company) is meaningfully harder than it used to be.
How do you review hundreds of PRs a day with a small team? How do you keep a consistent UX when twelve different people shipped features this week? How do you keep tech debt from exploding? How do you make sure the infrastructure components are sized to absorb new load and usage patterns? How do onboarding and documentation keep up?
This is where the platform engineer comes in. The role goes well beyond running the pipeline or cutting releases, though both tasks remain part of it. At our scale, a platform engineer is someone who thinks about the codebase as a system, industrializes review and QA, systematizes best practices, and builds the internal tooling that lets a 25-person company ship like a 200-person one.
The honest part: we’re asking people to partially reinvent what they do, fast. That’s brutal, and we’ve been transparent about it from the start. The workshop helped as an opening act, but the real work is a multi-quarter transition supported by career paths, tooling, and genuine time to learn.
Quality at scale: the MerciYanis Handbook
Agents will happily make decisions for you, and that is precisely where the problem starts. Give an unbriefed agent a ticket and it will reach the “done” state by making whatever arbitrary choices get it there. Arbitrary choices without the global context, without the architectural vision, without knowing why a given abstraction exists or why a given library is banned. This is where AI-slop actually comes from: the root cause almost always sits in the prompt, which quietly omitted the context the agent needed to make the right call.
As CTO, the line I draw is simple: we do not ship code we don’t understand and don’t control. Outsourcing your product, whether to a contractor or to an agent, is the worst thing a tech company can do. You’ll pay for it on the first migration, the first scale event, the first incident. So the question became: how do we give agents enough context that their “arbitrary” choices stop being arbitrary?
The answer was the MerciYanis Handbook.
What’s in it
The Handbook is, conceptually, the context an engineer would absorb in their first three months on the team, made explicit, written down, and kept fresh. It essentially consolidates all the onboarding documentation that historically lived in our Notion. It’s structured as a multi-chapter, 100%-markdown repository that anyone in the product team can contribute to. Here is the global index page:
# MerciYanis Handbook
Comprehensive documentation for the MerciYanis platform. Designed to serve engineers, product managers, product designers, operations teams, and LLM agents.
## Purpose
This handbook provides:
- Technical reference for development and operations
- Onboarding material for new team members
- Context for AI-assisted development
- Living documentation that evolves with the platform
## Table of Contents
### 1. [Engineering](./engineering/INDEX.md)
Technical foundation of the platform:
- [Principles](./engineering/PRINCIPLES.md) - Core engineering philosophy
- [Architecture](./engineering/ARCHITECTURE.md) - System design and patterns
- [Codebase](./engineering/CODEBASE.md) - Repository organization
- [Infrastructure](./engineering/INFRASTRUCTURE.md) - DevOps and deployment
- [Best Practices](./engineering/BEST_PRACTICES.md) - Coding standards
- [Unit Tests](./engineering/UNIT_TESTS.md) - Testing guidelines
## Quick Start
### For Engineers
1. Start with [Engineering Principles](./engineering/PRINCIPLES.md)
2. Understand the [Architecture](./engineering/ARCHITECTURE.md)
3. Set up your environment per [Infrastructure](./engineering/INFRASTRUCTURE.md)
4. Follow [Best Practices](./engineering/BEST_PRACTICES.md) when coding
### For LLM Agents
When assisting with MerciYanis codebase:
1. Read relevant handbook sections for context
2. Follow established patterns in [Architecture](./engineering/ARCHITECTURE.md)
3. Adhere to [Best Practices](./engineering/BEST_PRACTICES.md)
4. Generate tests per [Unit Tests](./engineering/UNIT_TESTS.md) guidelines
## Contributing
This is a living document. Update it when:
- Adding new patterns or conventions
- Changing architectural decisions
- Discovering undocumented behavior
- Improving clarity based on questions received
We co-wrote it with our most experienced engineers specifically to capture things that were not in any existing doc: the decades of hard-won intuition about why this pattern is preferred over that one in our particular codebase.
How agents use it
The Handbook only works if agents actually read the right chapter at the right moment. For that, we wrote custom Claude Code slash commands that orchestrate the flow.
A command like /fix wraps “fix this bug” inside a sequenced plan that routes the agent through the right chapters before it writes a single line:
1. Read Chapter 2 (Architecture) — locate the relevant service and boundaries.
2. Read Chapter 5 (Stack) — confirm which libraries and patterns apply.
3. Reproduce the bug locally. Do not proceed if you cannot reproduce it.
4. Implement the fix in the smallest possible diff.
5. Read Chapter 8 (Testing) — write unit tests that would have caught this bug.
6. Run the full test suite. Do not open a PR if it is red.
7. Open the PR with a description following Chapter 9 (PR Hygiene).
Each step points the agent at the relevant chapter before it acts. Arbitrary choices still happen at the edges, but they happen inside a well-defined frame, and the output quality is dramatically more consistent as a result.
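Mechanically, there is nothing exotic behind these commands. Here is a minimal sketch of what a /fix command file can look like, assuming Claude Code’s standard convention of markdown command files under .claude/commands/ with an $ARGUMENTS placeholder; the handbook paths are illustrative, and our real command carries more detail:

```markdown
<!-- .claude/commands/fix.md -->
Fix the bug described below. Ticket: $ARGUMENTS

1. Read handbook/engineering/ARCHITECTURE.md and locate the relevant service and its boundaries.
2. Read handbook/engineering/BEST_PRACTICES.md and confirm which libraries and patterns apply.
3. Reproduce the bug locally. Do not proceed if you cannot reproduce it.
4. Implement the fix in the smallest possible diff.
5. Read handbook/engineering/UNIT_TESTS.md and write unit tests that would have caught this bug.
6. Run the full test suite. Do not open a PR while it is red.
7. Open the PR with a description that follows our PR hygiene guidelines.
```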
With the Handbook in place, we moved from “close to production quality most of the time” to “close to 100% quality” on the tasks we run through it.
Adapting the stack
The Handbook fixes the what; fixing the how required a matching effort on the stack itself, the plumbing that determines whether an agent can actually exercise its decisions safely.
Problem 1: the full stack, from zero, in five minutes
A human engineer working locally typically boots only the microservices they care about. They know the rest of the system is running in staging, or that the library they depend on was published last week, or that they can mock the queue for now. They curate a partial view of the world.
An agent cannot afford that luxury. If you want an agent to truly test its work end-to-end (and you do, because otherwise it will confidently claim a fix works when it doesn’t), it needs the full stack running: databases seeded, Kafka up, every microservice reachable, internal libraries built from source rather than pulled from whichever published version happens to lag behind the branch it’s working on.
So we wrote a playbook and the scripts behind it with one explicit goal: any agent or human, starting from a clean machine, must be able to boot the entire platform in under five minutes. Databases, message queues, all services, realistic seed data, all of it coming up in a single command with no surprise prerequisites to install along the way. Whenever the playbook fails to produce a working stack, we treat that as a bug in the playbook itself and fix it there rather than pushing the burden onto whoever tried to run it.
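As an illustration of the shape (not our actual scripts), a single-command bootstrap can stay very small, assuming Docker Compose and a seedable API service; the service and script names here are hypothetical:

```typescript
// scripts/bootstrap.ts: the shape of a "clean machine to full stack" playbook.
// Docker Compose is the only prerequisite; service and script names are hypothetical.
import { execSync } from "node:child_process";

function run(cmd: string): void {
  console.log(`$ ${cmd}`);
  execSync(cmd, { stdio: "inherit" });
}

// 1. Boot every service, database, and broker declared in the compose file,
//    and wait until their healthchecks pass.
run("docker compose up -d --build --wait");

// 2. Apply migrations and load realistic (but non-sensitive) seed data.
run("docker compose exec api npm run db:migrate");
run("docker compose exec api npm run db:seed");

// 3. Smoke-test: fail loudly if the stack is not actually usable.
run("curl --fail --silent http://localhost:8080/health");

console.log("Full stack is up.");
```

The property that matters most is the last step: the script fails loudly whenever the stack is not genuinely usable, which is what lets us treat any failure as a bug in the playbook itself.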
Problem 2: parallel agents without stepping on each other
Humans work serially, one ticket at a time, one branch at a time. Agents are comfortable running wide: we routinely have dozens of agent sessions executing concurrently, each on a different ticket.
Git worktrees with Claude Code solve the single-repo case cleanly, and they’re a great starting point for anyone with a monorepo. Our situation is further along the complexity curve: multiple repositories, shared infrastructure, and, crucially, schema migrations. An agent working on a data model change will mechanically break every other agent sharing the same database, and trying to parallelize through a single Postgres cluster invites a pile of race conditions and corrupted state.
We needed something stronger: full environment isolation per agent session. A Git branch was nowhere near enough on its own; each session had to come with its own complete, independent replica of the stack.
The environment replication system
The system we built spins up, in a handful of minutes, an entirely isolated replica of the platform: its own databases, its own Kafka, its own message broker, its own instances of every service. When an agent picks up a task, it receives a fresh environment that stays dedicated to it for the duration of the run and is torn down as soon as the session finishes.
A few things made this tractable:
- Everything was already Dockerized, and portable. This was the single biggest gift from our past selves. If the services hadn’t been containerized end-to-end, this project would have been a multi-quarter rewrite instead of a multi-week one.
- The stack is lightweight and modular, which lets us run multiple instances of the whole infrastructure on a single machine; that wouldn’t have been possible with resource-hungry services. We don’t know whether we’ll keep this advantage as we grow, but it holds for now.
- Traefik for routing. Each isolated environment gets its own URL, resolved by Traefik rules to the right set of containers. Fifty copies of the same Docker image can run simultaneously, each reachable at its own stable hostname, with no port collisions to reason about.
The end result: an agent working on a schema migration and another agent fixing a UI bug run in fully independent worlds. Their PRs land independently, their side effects stay contained, and nothing ever crosses the streams.
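For the curious, here is a minimal sketch of what per-session provisioning can look like under these assumptions: one Compose project per session for isolation, and Traefik Host() rules driven by an environment id. The hostname scheme and helper names are illustrative, not our production code:

```typescript
// Per-session environment provisioning: a sketch, not our production code.
// Assumes one compose file for the whole platform and Traefik Host() rules
// interpolated from ENV_ID; the hostname scheme is illustrative.
import { execSync } from "node:child_process";
import { randomUUID } from "node:crypto";

export function provisionEnvironment(): { envId: string; baseUrl: string } {
  const envId = `agent-${randomUUID().slice(0, 8)}`;
  // A dedicated Compose project gives this session its own containers, networks,
  // volumes, databases, and Kafka, fully isolated from every other session.
  execSync(`docker compose -p ${envId} up -d --wait`, {
    stdio: "inherit",
    // ENV_ID is interpolated into the Traefik labels of the compose file,
    // e.g. a Host() rule on `web-${ENV_ID}.preview.example.com`.
    env: { ...process.env, ENV_ID: envId },
  });
  return { envId, baseUrl: `https://web-${envId}.preview.example.com` };
}

export function teardownEnvironment(envId: string): void {
  // Remove containers and volumes so nothing leaks into the next session.
  execSync(`docker compose -p ${envId} down --volumes`, { stdio: "inherit" });
}
```

Compose project names are what make the isolation cheap: the same compose file yields a completely separate set of containers, networks, and volumes per project, and Traefik gives each copy its own stable hostname.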
This was the hardest piece of infrastructure we built, and also the one we get the most leverage out of. The velocity gain from the experiment would have collapsed on first contact with production without this system behind it.
Opening the build to the whole company
For the first few months of 2026, this setup ran internally for the product team, and it worked well enough that the next question quickly became unavoidable: if a non-technical PM can drive a feature end-to-end through an agent, why can’t anyone else in the company do the same?
We’re 25 people. Sales, customer success, ops: everyone is one Slack message away from the product. We’ve always been frustrated by how slowly customer feedback travels from a sales call to a shipped improvement. What if we shortcut that?
The design constraint: no setup
The product team has enough technical fluency to install Docker, clone a repo, and run our playbook. Most of the rest of the company sits further from that world, and pushing them to set up a local dev environment would have been a terrible adoption story on its own, and would have required countless hours of technical support. We needed an interface that was already familiar, already trusted, already central to how work happens here.
Slack was the only serious candidate. It’s where every conversation already happens, it’s public-by-default in a way that suits the collaborative spirit we wanted, and everyone already knows how to use it with confidence.
The Slack bot
We put the stack on a dedicated server, isolated from all our other infrastructure, and wrote a Slack bot on top. The pattern is simple: mention the bot, describe what you want, and it does the rest.
Under the hood, on each request the bot runs through four steps (sketched in code just after this list):
- Spins up a fresh isolated environment (the same system described above).
- Runs the agent against the task, against that isolated stack.
- Produces preview URLs for every affected product surface: web app, back office, whatever applies.
- Posts them back in the thread.
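In code, the entry point is small. A minimal sketch, assuming the bot is built on Slack’s Bolt for JavaScript; provisionEnvironment comes from the environment system described earlier, and runAgent is a hypothetical wrapper around our agent runner:

```typescript
// Illustrative Slack entry point; the Bolt choice and the helper names are
// assumptions for the sketch, not a description of our exact implementation.
import { App } from "@slack/bolt";
import { provisionEnvironment } from "./environments"; // sketched in the previous section

// Hypothetical wrapper around the agent runner; returns preview URLs per affected surface.
declare function runAgent(task: string, baseUrl: string): Promise<{ previewUrls: string[] }>;

const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

app.event("app_mention", async ({ event, say }) => {
  const task = (event.text ?? "").replace(/<@[^>]+>/g, "").trim(); // strip the bot mention
  const { baseUrl } = provisionEnvironment();                      // fresh isolated stack
  const { previewUrls } = await runAgent(task, baseUrl);           // agent works against it

  // Answer in the same thread so the rest of the team can chime in.
  await say({
    thread_ts: event.ts,
    text: `Done. Preview environments:\n${previewUrls.join("\n")}`,
  });
  // The environment itself is torn down later, when the thread goes quiet.
});

void app.start(Number(process.env.PORT) || 3000);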

The preview environment is a full, live replica of the platform, sitting somewhere between a staging environment and a throwaway feature sandbox. It is owned by no one, disposed of when the thread closes, and actually usable in a way that mocks or screenshots never quite manage. A CSM who asked for a back-office shortcut sees the new button live in their browser a few minutes later. A salesperson prototyping a feature on the back of a customer call can share the preview link with that same customer by the afternoon.
The collaborative side turned out to be our favorite emergent property. Because the bot posts everything in public threads, the rest of the team can chime in. “Have you thought about X?” “This clashes with the upcoming Y.” “Let’s make the button smaller.” Features get iterated on by multiple people before a PR ever gets reviewed for production, which gives us the co-construction loop we always wanted and could never quite find the bandwidth for.
The unexpected second life: live documentation
A few weeks in, we noticed people using the bot for something that had never been on our roadmap (a misjudgment on our part, in hindsight): asking questions.
“Does the platform support this integration?” “Is there a way to bulk-export X?” “What’s our current approach to rate limiting?” The bot has read access to the latest codebase, which means it can answer these by grounding itself in the actual source of truth rather than in whatever our documentation happens to say today. And because we prompted it to translate as it answers, it replies in business terms (the vocabulary of the feature, the customer, the workflow) rather than in the vocabulary of the codebase.

Three things make this better than traditional documentation:
- It stays permanently fresh, because the codebase itself is the source of truth the bot reads from.
- It’s answer-shaped, so asking a specific question returns a specific answer rather than dropping you on a documentation page to hunt through.
- It’s trustworthy enough that sales has started using it to answer customer security questions: “which components are certified?”, “how do we handle PII?”, “which vendors do we use?”.
That last use case surprised us the most: sales used to escalate every security question to engineering, which turned into a multi-day loop during procurement cycles, and they now self-serve instead. A meaningful step toward full autonomy for teams that used to depend on engineers for context.
Learnings
The new agentic stack has been running inside the product team for a few months now, and the Slack bot itself has been available to the whole company since March. The results have been very promising so far, with roughly 15 new requests every week (a meaningful volume for a 25-person company), and we expect usage to grow rapidly as more people get comfortable with the tool. Here is what we learned along the way.
Roles are moving faster than people can adapt
Every role in the company is being redefined, and the timescale has collapsed from years to months. A single quarter is now enough to reshape what a given role actually involves day-to-day. Agentic work is becoming the default tool of the trade, the way a laptop became the default twenty years ago, and companies that pretend otherwise will find themselves overtaken by the ones that don’t.
Telling a senior designer that their craft is being recomposed into something new within a quarter is, however, a brutal thing to do to someone. Some people will find their way through this new organization; others, unfortunately, will not. The company’s leadership team has a clear responsibility to carry people through the transition rather than steamroll them, and we’ve put real effort into doing exactly that: honest conversations, clear career paths for the new roles, and dedicated time to learn and share.
The bottlenecks don’t disappear, they move
A 25-person company that ships like a 200-person one inherits the problems that come with that scale: product coherence across dozens of concurrent workstreams, consistent UX when twelve different people shipped features this week, and coordinating product direction now that everyone in the company can build. These used to be problems we would have worried about if we ever got big, and yet here we are facing them today at a fraction of the headcount.
The concrete bottleneck that matters most right now: code review. We have a small engineering team, and the PR rate has gone up by something like an order of magnitude. A clean answer has been elusive so far. We’re experimenting on several fronts in parallel: more automation on the CI side, automated security review, agent-assisted review, agent-on-agent pre-review. Each piece truly helps, though the combined effect still falls short of a silver bullet. Cracking this will likely be one of the defining problems of 2026 for us.
A second bottleneck is climbing fast behind it: product arbitration. With the whole company now able to submit feature requests backed by a working live preview rather than a speculative Jira ticket, the volume of “should this actually ship?” decisions has climbed sharply. The product team has become the final gate on what graduates from a Slack-bot experiment into a shipped feature, and that call has to be anchored in product vision and strategy rather than in whoever posted first or pushed hardest in the thread. Scaling the judgement itself is the part we still need to crack: triaging quickly without killing the co-construction loop, saying no in a way that leaves contributors willing to come back, and keeping the resulting product coherent across dozens of well-intentioned hands. It’s a good problem to have, and it sits squarely on our 2026 list alongside code review.
Docker-native, cloud-native, no vendor lock-in
The environment replication system described earlier was built in about two weeks, because every one of our services was already fully containerized when we started. In a world where we still had a mix of half-containerized and bare-metal services, the same project would have stretched over several quarters of foundational work before the interesting parts could even begin.
The broader architectural decisions made years earlier mattered just as much. Everything dockerized, everything cloud-native, and no vendor lock-in on any critical piece of the infrastructure. Those three choices together are what make it physically possible for us to stand up the full platform in minutes, anywhere we want: a laptop, a dedicated server, a different cloud provider on the afternoon we needed to switch. The “fresh environment per agent request” pattern behind the Slack bot only works because the underlying stack is that portable. Most companies will find this playbook hard to copy overnight, precisely because these foundations are painful to retrofit after the fact.
Stack and framework choices compound
Stack choice matters more than ever, and we are deeply grateful for the specific frameworks we committed to years ago. A well-factored, well-isolated, well-documented stack becomes the foundation for all of this; a tangled one becomes the ceiling.
Three properties of our stack punched well above their weight in the AI-native transition:
- Convention over configuration. Languages and frameworks with strong opinions on file structure, naming, and wiring massively reduce the decision space an agent has to navigate on every task. Without those guardrails, two agent sessions on the same prompt will happily produce two different structures. Opinionated frameworks short-circuit that divergence before it starts. On that front, Perseid played a crucial role for us.
- Scalability built in. We deliberately picked components that scale horizontally, keep services stateless, and separate compute from state cleanly. That scaling headroom matters more as the PR rate climbs, because the stack has to absorb a much higher volume of in-flight work without being re-architected every quarter.
- Explicit concepts over magic. The more a framework names its concepts plainly (this is a controller, this is a repository, this is a migration), the easier it is for an agent to locate the right pattern on its own. Magic frameworks, where behavior emerges from naming conventions or runtime inspection, are exactly the places agents get creative in the wrong ways.
If we were starting MerciYanis today, these properties would sit at the top of our day-one checklist. The stack you pick on day one shapes not just your team’s velocity for the next five years, but also how effectively your team can partner with agents for the next ten.
The time to first PR paid off twice
Long before agents entered the picture, we obsessed over the time to first PR: the time between a new engineer joining the team and opening their first PR. We kept grinding it down: clone the repo, run one command, the full stack boots, tests pass, you ship your first change. In 2024, this time sat at roughly four days; we brought it down to under an hour in 2025, and today it runs under ten minutes.
The original payoff was purely human: faster onboarding, less friction on day one, far fewer “works on my machine” support tickets in week one. The real surprise came later, once agents joined the workflow. Every investment we had made in smoothing the human time to first PR turned out to double as an investment in the agent time to first PR. Agents have the same brittle boot-up needs as a new hire, the same need for seed data that is realistic but not precious, the same intolerance for hidden prerequisites. The exact playbook that gets a human productive in ten minutes is, step for step, the same playbook that lets an agent start work on a fresh ticket in seconds.
If you are deciding where to invest engineering effort in 2026, this onboarding time is one of the highest-leverage places we have found. Any friction you remove pays off twice: once for the next human on the team, once for every agent run after that.
Yes, we still write plenty of code by hand
A common assumption when people first hear how we work is that our engineers have largely stopped writing code. The day-to-day tells a different story: we still maintain a good share of the codebase by hand, and that share carries more weight now than it did a year ago.
A few reasons it earns its place:
- Agents anchor themselves on what already exists. The cleanest way to steer an agent toward the right pattern is to have a well-written reference implementation sitting somewhere adjacent in the codebase. A half-hour spent hand-crafting the first instance of a new pattern pays itself back across every subsequent agent-generated copy: consistency, naming, error handling, test shape. All of it falls in line for free.
- Some code demands to be hand-written. The most complex, most domain-specific, highest-leverage pieces of the system stay with humans: core scheduling logic, thorny data migrations, security-critical paths, and the places where a subtly wrong design decision quietly costs you a quarter of future velocity. Delegating those is how AI-slop ends up dressed as senior code.
- Hand-written code is fuel for the next agent run. The Handbook captures our principles in prose, and prose carries us most of the way. The rest (the small stylistic preferences, the recurring idioms, the naming conventions too context-dependent to fully articulate) lives in the existing code itself. Every time we write something well by hand, we are also enlarging the pool of in-repo examples the agent draws inspiration from on its next task.
The balance we’ve settled on, stated plainly: agents scale the patterns we’ve already chosen to build well, and humans keep writing the anchors those patterns grow from.
What’s next
We’re happy with where we are, and also well aware that we’ve barely finished the first act.
- Harder tasks. The Handbook works well on well-scoped tickets. On truly complex, cross-cutting work (large refactors, tricky data migrations), the agent’s context gets compressed over long sessions and the quality slowly drifts. We’re exploring better task decomposition, per-agent skills, and intermediate checkpoints to keep the thread coherent even on hour-long runs.
- Review automation. As mentioned above, this is the bottleneck, and it’s where we’re investing most of our engineering time right now.
- Design. The design discipline has the most room left to integrate deeper into this loop. Agents currently consume designs as input, whereas we’d like them to participate earlier in the process, up to generating and iterating on design proposals alongside the humans.
- Faster environment spin-up. A handful of minutes is good; just a few seconds would be better, because it changes the ergonomics of “just try it” from “deliberate action” to “idle thought”.
- Treat the codebase as a system. With this many PRs landing every day, our repos have to be treated the way a platform engineer at a large company treats theirs: refactor for clarity, systematize patterns, favor changes that lower the cognitive cost of the next change. This is a mindset shift for the whole team.
- Expand beyond product. The Slack bot pattern extends naturally to marketing, sales, CSM, and even IoT. Each has workflows that could be agent-first in their own way. Product was the natural starting point, and we’re already planning the next rollouts.
Conclusion
Six months ago, we were a product team that used LLMs as assistants. Today, any employee can ship a feature through Slack, our backlog drains overnight, and the bottleneck has cleanly moved from writing code to reviewing it. The numbers behind that shift (a roughly x10 velocity gain, a 95% overall merge rate, a median cost under €3 per merged PR) are what convinced us this was not another hype cycle to wait out.
The ingredients for success are simultaneously straightforward to list and difficult to obtain, depending on a company’s initial technological, cultural, and organizational choices: a lightweight, modular, containerized stack; a handbook that captures the tribal knowledge agents arrive without; a system that spins up isolated environments on demand; a collaborative interface that lowers the barrier for the rest of the company; and a candid, continuing conversation with the team about what their job is becoming.
Code review at scale remains unsolved for us. The shape of the platform engineering role at our size is still coming into focus. The cultural transition is ongoing, and will be for a while longer. Even so, we are convinced that this is the shape of how engineering and product work from here on out.
Questions about going AI-native? Building something similar? Reach out to us at engineering@merciyanis.com — we’re always happy to compare notes.