The Lure of Malleability in AI-generated Software
Evolving Engineering Practices for the AI Era #
Imagine building a house by discussing it with your contractor prompt by prompt. Nobody would seriously build a house that way. But software? Software seems malleable - something we can change easily with just another prompt or an updated requirement spec. In practice, it is harder than it seems. Software is genuinely more malleable than a house, but its malleability is consistently overestimated, and AI amplifies that overestimation further. That is the lure and the trap.
Software has always suffered from a malleability bias. Even in the days when Agile was catching on, there was a tendency to believe there was no need to spend time on specs and design - just build stuff sprint after sprint. Of course, that didn’t last very long. Any production-grade or enterprise-grade software that requires long-term maintainability and extensibility must adhere to basic principles of planning, design, execution and review.
Even with a spec-driven approach to anchor this, there is a tempting and natural tendency to approach AI as though it were an experienced engineer, a seasoned product manager, a domain expert and a veteran architect - all in one! It’s easy to conflate development stages, roles and skills. Having built two AI-first applications using AI, it’s clear to me that if you want consistent, sustainable results without AI going off the rails and degrading your code over time, you need a framework that ensures discipline and process. This matters for human collaboration too - multiple people need to be able to reason about the use-cases, workflows, schemas, edge-case handling and more without having to rely entirely on reading the code.
We are moving from AI as a coding assistant to an AI-first software engineering world - one where agents plan, spec, build, test, and fix with increasing autonomy. This shift makes good engineering practices around AI generation more critical, not less. It is this discipline that explains the wide divergence between optimists and naysayers on AI productivity - it’s not the model or the tools.
In this post, I delve into the evolving AI-driven software engineering practices for building greenfield services and apps, with a closing look at how these practices might extend to brownfield.
Adapting the SDLC to AI constraints #
The framework I describe here is organized around the typical SDLC stages (Spec, Design, Build, Verify) adapted for AI, rather than around agent personas or roles. This is a deliberate choice. In an AI-first world, we are no longer theoretically bound by the number of agents we can have doing the work (unlike the number of humans on a team), nor are we bound by the amount of knowledge or skill an agent can have. Organizing around lifecycle stages keeps the framework flexible yet structured enough, dividing the work into phases - manageable dependency-ordered units of work.
A lifecycle-oriented framework means agent boundaries follow what each lifecycle stage requires, not the agent’s role. For instance, the Architect Agent may operate at both the design stage and the post-coding verification stage. This flexibility allows boundaries to shift as the team matures. Running through all of this is the importance of designing a well-thought-out, human-gated flow with deliberate touch points for humans in the loop at each stage. Applying human judgment thoughtfully is as important as knowing how to get the agents to do the work required.
That said, there is one important caveat. Strong personas make sense for pieces of work where domain knowledge matters a great deal more than generic skills. The Product Manager Agent is the clearest example of this. Domain understanding of your service or app’s target audience is precisely what makes it valuable. Instructing it to take on a well-informed domain persona meaningfully improves the quality of the spec.
Phase is the new Sprint #
With an AI-first approach, each phase of work is a set of features that can be fully tested in isolation. These are iterative - each phase cycles through Build and Verify before the next phase begins.
In human-led development, sprints defined practical boundaries around human constraints: two-week increments, manageable story points, team coordination rhythms, and so on. In AI-first development, phases do the same but are defined by AI constraints instead - limiting the scope of a single session to avoid token exhaustion, containing context to avoid drift over long sessions, and sequencing by dependency so that foundational layers are fully tested before the next phase builds on them. Scoping each phase session explicitly is an important guardrail that prevents the agent from overreaching into future phases. Unlike a human developer, who uses judgment about scope, an AI agent needs the boundary stated explicitly or it will cross it without a stable foundation beneath.
Phases are shaped by a different set of limits and are unconstrained by artificial time boundaries. It’s worth noting that, just as with sprints, not all potential phases need to be fully spec’d out at the beginning (unlike waterfall). While the current phase or two needs a fully realized spec, additional phases can evolve over time.
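As an illustration, a phase definition in such a plan might look like the following markdown fragment. The feature names, dependencies and structure here are entirely hypothetical - the point is the explicit scope boundary and dependency ordering:

```markdown
## Phase 2: Accounts and Authentication
Depends on: Phase 1 (data model, storage layer) - fully tested and committed.

In scope:
- User registration and login flows
- Session handling and password reset

Out of scope (do NOT implement yet):
- Role-based permissions (Phase 3)
- Third-party SSO (Phase 4)

Exit criteria: all Phase 2 tests pass; human review of auth logic complete.
```

The explicit "out of scope" list is what keeps the agent from reaching into future phases.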
Markdown as your programming language #
The shift from prompt-driven to spec-driven development begins with scoping out your MVP before generating a single line of code. Not a verbose, detailed spec, but one good enough to clarify coverage and use-cases. This ensures the Architecture Agents can subsequently generate a more stable and consistent architecture and data model, rather than making sweeping changes at every increment.
These practices follow a sequence - Spec, Design, Build, Verify - and the discipline at each stage builds on the previous. Markdown (MD) - the lingua franca for human-AI interaction - is what ties them together. MD files are both AI-friendly and human-readable, they can be put into version control and are very portable. They represent the foundational structure for providing context at every stage. The spec files, architecture files, code and tests together form the shared context pool for all agents - each agent draws from whichever portions are relevant to its instructions, giving every stage of the lifecycle a consistent and common grounding.
Engaging an AI domain expert #
The first order of business with MD usage is ensuring that your Product Manager Agent has the right system context. AI is great at assuming a well-informed persona in a vast number of domains. The system context should instruct the agent to leverage domain understanding to make informed recommendations about feature completeness, user workflows and edge cases. Spec drift is real and can result in weird use-case assumptions by the Agent. Grounding the Product Manager Agent’s context in the domain is what keeps this in check.
As you converse with the Agent to spec out your service or app (user-flows, functional and non-functional requirements, etc.), specify who will consume the outputs - other AI Agents that design the architecture and the user interface, or a human being. AI Coding Agents need far less verbosity in the spec than a human, simply because of their vast pre-trained knowledge of both coding patterns and the domain. The audience for spec artifacts has fundamentally changed. Specs were traditionally written for human developers to interpret and model their work on. In AI-first development, the primary consumer of the spec is the agent, which needs unambiguous, implementation-ready instructions. This changes not just the style but the purpose of documentation. The right question to drive spec verbosity is “What does the AI Coding Agent need in order to build this correctly?” rather than “How can I explain this thoroughly?”. In that spirit, the Product Manager Agent’s system context must carry instructions to avoid marketing language, verbose explanations and over-documentation.
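As a sketch, the system context for a Product Manager Agent in, say, a personal-finance app might include instructions along these lines. The domain and the exact wording are illustrative assumptions, not a prescribed template:

```markdown
# System context: Product Manager Agent
You are a product manager with deep experience in consumer personal-finance apps.

- Use your domain understanding to recommend missing features, user workflows
  and edge cases (e.g. refunds, currency conversion, duplicate transactions).
- Your primary audience is AI Coding Agents: write unambiguous,
  implementation-ready requirements, not explanations.
- Avoid marketing language, verbose prose and over-documentation.
- When a requirement is ambiguous, ask a clarifying question instead of assuming.
```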
A tandem workflow for features and test specs #
The Product Manager Agent that generates the feature specs should also generate test specs - acceptance criteria and edge cases, data validation and boundary conditions. Unlike traditional test case documentation, these don’t need to be comprehensive or detailed. Things that an AI Coding Agent can easily infer from the feature spec don’t need to be kept in the test spec. Instead, focus on discussing with the Agent and defining non-obvious criteria and constraints. Specifying that any test scenario documented should be automatable ensures that the test specs are actionable rather than theoretical.
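For example, a test spec entry capturing the non-obvious criteria might look like this hypothetical fragment - note that it documents only what the Coding Agent could not infer from the feature spec alone:

```markdown
## Test spec: transaction import
- Duplicate detection: importing the same file twice must not create
  duplicate transactions (match on date + amount + merchant).
- Boundary: amounts of exactly 0.00 are valid; negative amounts are refunds.
- Edge case: a CSV row with a missing date is rejected with a line-level error,
  not a whole-file failure.

All scenarios above must be automatable.
```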
A tandem workflow between the human and the Agent provides structure to how feature and test specs are built together. An illustration of such a workflow: Joint Discovery → Clarification → AI Proposals → Debate → Feature spec → Test spec → Commit. Each commit step acts as a deliberate human gate, offering a checkpoint for review before the next stage proceeds. While discovery may be specific to feature specs, the AI proposals and debates apply across other lifecycle stages as well (architecture, UI, non-functionals). This is what makes the collaboration with AI genuinely productive and insightful - the AI does most of the work while staying grounded and avoiding drift.
As each feature gets written out, a cross-feature check after each new feature ensures that already-written features are not impacted by conflicts, overlapping ownership or cross-references. Without this, Agents can occasionally create incompatible features and, at worst, even introduce bizarre contradictions.
Anchoring the Foundation for both Agents and Humans #
The version control strategy and folder structure must serve human reviewers as much as they serve AI agents. Maintainability and human readability are not a nice-to-have but a core requirement for a sustainable framework. Co-locating the spec files, the architecture and the code is what makes this possible, while also reducing fragmentation risk and keeping this shared context pool accessible to both humans and agents.
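A co-located layout could be as simple as the following sketch - the directory names are illustrative, not a mandated convention:

```markdown
repo/
├── specs/
│   ├── phase-1-features.md
│   ├── phase-1-tests.md
│   └── non-functionals.md
├── architecture/
│   ├── overview.md
│   └── data-model.md
├── src/
└── tests/
```

Because everything lives in one repository, a single commit can atomically update a spec, the architecture note and the code it drives, keeping the shared context pool consistent for agents and reviewers alike.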
With the MVP spec in place, the architecture and data model are the next pieces of the foundation to lock in. This must include things like the tech stack, data flow, key design patterns, integration points, API contracts, storage/indexing requirements, separation of the business layer and even hosting-platform patterns such as serverless. Everything that follows flows from this foundation, and it helps reduce churn downstream.
Non-functional AI guardrails and UI coherence #
Non-functional requirements tend to get the least emphasis with AI code generation, when in reality it is equally critical that they be explicitly spec’d for AI - including the kind of scalability, reliability, security and performance your service or app needs to meet. Unlike a human engineer, AI will not get certain things right by default - it needs them spec’d out or it will simply do its own thing. This includes reliability patterns like retries and idempotency, and specific safety guardrails: deletion safeguards to prevent irreversible actions from being coded into your app, schema and action whitelisting to constrain what can be queried or acted upon, and audit trails to trace what happened and why. If your app uses AI to provide AI-based features to users, pricing guardrails are essential - documenting rate limits and reset timings prevents runaway token usage that a human engineer would typically avoid but an agent may not.
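To make the reliability patterns concrete, here is a minimal Python sketch of retries combined with idempotency - the kind of behavior the spec should demand from the Coding Agent. The class and method names are illustrative assumptions, not drawn from any real library:

```python
import time


class IdempotentExecutor:
    """Runs an operation at most once per idempotency key, retrying on failure.

    Illustrative sketch only: a production version would persist results,
    bound the cache, and distinguish retryable from fatal errors.
    """

    def __init__(self, max_retries=3, base_delay=0.01):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self._results = {}  # idempotency key -> cached result

    def run(self, key, operation):
        if key in self._results:
            # Replayed request: return the cached result, do not re-execute.
            return self._results[key]
        for attempt in range(self.max_retries):
            try:
                result = operation()
                self._results[key] = result
                return result
            except Exception:
                if attempt == self.max_retries - 1:
                    raise  # retries exhausted: surface the failure
                time.sleep(self.base_delay * 2 ** attempt)  # exponential backoff
```

Spelling out a pattern like this in the non-functional spec (rather than assuming the agent will infer it) is what keeps transient failures from turning into duplicate writes.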
For apps with a UI, specifying a design philosophy, visual identity and interaction model upfront serves a similar purpose. It gives the Coding Agent a stable foundation to generate from, rather than making arbitrary choices that produce an incoherent experience. This is especially important if your app supports contextual natural language understanding and offers proactive assistance, where disambiguation of user inputs, user confirmations and action boundaries must be explicitly defined. All of these can be generated by a UI/UX Agent from your spec files and preferences. Mockups can be generated from the feature specs and UI specs, then fed directly as visual context to the Coding Agent, grounding the generated UI in something concrete.
Controlled Autonomy with Human Checkpoints #
Constraining what context the AI can access during code generation is more than a precaution: by limiting web search tools and restricting folder access, the agent stays focused on your codebase and architecture rather than chasing tangents from the open web or your intranet. Requiring the agent to propose changes before writing them - setting it to “Confirm” or “Ask Permissions” - is a useful checkpoint for the human in the loop, even if you don’t review every line or action. It’s also worth thinking through the choice of language and framework carefully. Statically typed languages tend to produce better accuracy downstream in AI-generated code, reducing the inference burden on the agent and the debugging burden on the human reviewer.
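Such constraints can themselves live in a markdown instruction file checked into the repo alongside the specs. A hypothetical sketch (exact settings will depend on your tooling):

```markdown
# Agent constraints: Coding Agent
- Read/write access: src/ and tests/ only. Read-only: specs/, architecture/.
- No web search; use only the spec files and architecture docs as context.
- Propose a change plan and wait for human confirmation before editing files.
- Never run destructive commands (rm, database drops, force-push).
```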
The effectiveness of the AI driven lifecycle hinges on getting the Coding Agent to generate tests based on your features and test specs, then having it run those tests against the generated code, fixing issues and retesting until all tests pass. A human review at the end is beneficial - not necessarily a line-by-line review but at the very least a broad brush check on the core logic and structure. Any issues found during this review should not only be fixed by the Agent, but also routed back to the Test Agent to fix testing gaps or to the Product Manager Agent to update feature specs if necessary for better clarifying the feature requirement.
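The build-verify loop described above can be sketched as orchestration code. The agent objects and their methods below are hypothetical placeholders for whatever agent framework you use, not a real API:

```python
def build_and_verify(coding_agent, test_agent, max_iterations=5):
    """Sketch of the build/verify cycle: generate tests from the specs,
    then generate and fix code until every test passes or we give up.

    `coding_agent` and `test_agent` are assumed interfaces, not a real library.
    """
    tests = test_agent.generate_tests()  # derived from feature + test specs
    for _ in range(max_iterations):
        code = coding_agent.generate_or_fix()
        failures = [t for t in tests if not t.run(code)]
        if not failures:
            return code  # all tests pass: hand off to human review
        coding_agent.report(failures)  # feed failures back into the next fix cycle
    raise RuntimeError("did not converge; escalate to a human reviewer")
```

The explicit iteration cap mirrors the phase-scoping guardrail: when the loop fails to converge, the human is pulled in rather than letting the agent churn indefinitely.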
Towards Malleability of Engineering Teams #
With a framework like this and different system contexts for agents at each step of the lifecycle, incremental feature development accelerates far beyond what prompt-based development alone can sustain. Further, the codification of standards and orchestration of context among multiple agents with humans in the loop - how context flows, where it lives, how team conventions are maintained and how conflicts get resolved - deserve as much attention as the code itself.
It’s worth noting that many of these greenfield practices and benefits could start spilling over to brownfield as well, when a) similar kinds of context and specs can be written (maybe even part-generated) for brownfield apps/services, and b) an accuracy tipping point is reached at which AI productivity gains are so high that having humans hunt for the remaining errors becomes the harder and more time-consuming path.
The software malleability lure is real, but the gains from avoiding the trap are even greater. A similar trap now afflicts human engineering teams - the lure of bolting AI onto existing team structures without genuinely rethinking how people work, where human judgment is required, and how cognitive load and context ownership are distributed across the team. The task expansion from all of this means engineers and PMs reading thousands of lines of spec, reviewing key diffs and resolving conflicts. It means managing context and juggling context-switches all day, every day.
What we need is for engineering teams themselves to become malleable around this immensely productive yet cognitively demanding work. With the lure of AI, what was an architectural malleability trap is now organizational.
