Building AI Agents: The Architecture Guide We Wish We Had
Building AI agents turned out to be one of those adventures that started with wide-eyed optimism and ended with battle-tested wisdom. What began as a simple exploration of intelligent automation became a masterclass in architectural decisions, cost optimization, and the art of surviving payment system mysteries. This is the story of how we went from framework shopping to production-ready agents, with all the plot twists that make software development the thrilling rollercoaster it is.

Chapter 1: The Rom-Com Phase (Everything Looks Amazing)
Our journey began in that magical place where all software projects start: the land of infinite possibilities. We needed to build AI agents, and like any sensible development team, we started by exploring what was already out there. The requirement was clear: we needed agents that could handle complex workflows, integrate with multiple tools, and scale in production.
First stop: Amazon Bedrock. AWS’s managed service looked promising with its foundation model access and built-in agent capabilities. The documentation was polished, the examples were clean, and everything seemed to work seamlessly in the demos. Bedrock offered the classic AWS experience: powerful, enterprise-ready, but with that distinctive complexity that makes you wonder if you need a PhD in cloud architecture just to get started.
Then we discovered Strands SDK, and suddenly everything clicked. Unlike traditional workflow-based frameworks that require developers to hardcode complex task flows, Strands embraced a model-driven approach. The beauty was in its simplicity: just define a prompt, provide some tools, and let the LLM’s reasoning abilities handle the orchestration.
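To make the model-driven idea concrete, here is a minimal sketch in the spirit of the Strands Agents SDK's getting-started examples; the tool and prompts are hypothetical, and the exact API surface may differ slightly between SDK versions.

```python
from strands import Agent, tool

@tool
def lookup_invoice(invoice_id: str) -> str:
    """Return the status of an invoice (hypothetical backend call)."""
    # In a real system this would call an existing REST endpoint.
    return f"Invoice {invoice_id}: paid"

# No hardcoded workflow: the prompt plus the tool schema is all the model needs
# to decide when (and whether) to call the tool.
agent = Agent(
    system_prompt="You are a billing assistant. Use tools when a lookup is required.",
    tools=[lookup_invoice],
)

agent("What's the status of invoice INV-1042?")
```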
The final contender was Model Context Protocol (MCP). Anthropic’s open standard for connecting AI systems with data sources looked elegant and future-proof. The protocol was gaining adoption from major players like OpenAI and Google DeepMind.
But here’s where real-world constraints crash the theoretical party. MCP, despite its technical elegance, wasn’t suitable for our deployment requirements. Since our tool integrations were straightforward API calls to existing REST endpoints, we didn’t need the additional abstraction layer that MCP provides. The protocol excels when you need dynamic tool discovery, complex multi-step workflows, or standardized access to diverse data sources. However, for our use case where we had a defined set of API endpoints with known schemas, implementing MCP servers would have added unnecessary complexity without meaningful benefits.
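For illustration, this is roughly what one of those integrations looks like as a plain tool; the endpoint URL and fields are made up, but the point is that a known REST schema needs nothing more than a decorated function, with no MCP server in between.

```python
import requests
from strands import tool

@tool
def get_order_status(order_id: str) -> str:
    """Fetch an order's status from an existing REST endpoint (URL is illustrative)."""
    response = requests.get(
        f"https://internal-api.example.com/v1/orders/{order_id}",
        timeout=10,
    )
    response.raise_for_status()
    return response.text  # raw JSON payload for the model to read
```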
The Verdict: Strands SDK won our hearts and minds. Its model-driven approach aligned perfectly with how modern LLMs actually work, and the fact that it was already powering production systems at AWS gave us the confidence to move forward. Strands is actively used in production by multiple AWS teams for their AI agents, including Kiro, Amazon Q, and AWS Glue. [1]
Chapter 2: try { findPerfectModel() } catch (EveryPossibleException)
With our framework chosen, we faced the next challenge: selecting the right language model. This wasn’t just about performance—we needed a model that could support tool calling while streaming, which immediately ruled out several options. The streaming requirement was non-negotiable for user experience, but it created technical constraints that would haunt us throughout the project.
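For context, here is roughly how that constraint shows up in code, assuming the async streaming interface described in the Strands docs (the event shape and field names may vary by version); the point is that the chosen model must keep emitting events while remaining able to call tools mid-stream.

```python
import asyncio
from strands import Agent, tool

@tool
def check_inventory(sku: str) -> str:
    """Hypothetical inventory lookup invoked mid-stream."""
    return f"SKU {sku}: 12 units in stock"

agent = Agent(tools=[check_inventory])

async def main() -> None:
    # Text chunks and tool events arrive incrementally; models that cannot
    # combine tool calling with streaming were ruled out at this stage.
    async for event in agent.stream_async("Do we have SKU-8871 in stock?"):
        if "data" in event:  # assumed key for text deltas
            print(event["data"], end="", flush=True)

asyncio.run(main())
```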
Plan A: Our initial choice was Amazon’s Nova models—specifically Nova Lite and Nova Pro. The pricing was attractive: Nova Lite at $0.000071 per 1K input tokens (≈ ₹0.00630) and $0.000284 per 1K output tokens (≈ ₹0.02520), while Nova Pro cost $0.00094 per 1K input tokens (≈ ₹0.08341) and $0.00376 per 1K output tokens (≈ ₹0.33362). Compared to other models in the market, Nova offered excellent value for money at this scale, especially for high-volume, lightweight tasks where ultra‑low per‑1K pricing compounds into meaningful savings. [2]
But the honeymoon phase didn’t last long. We quickly ran into content filtering issues that blocked legitimate generated text: the Nova models’ filters were overly aggressive, treating perfectly reasonable agent responses as potentially harmful content. This wasn’t an occasional hiccup; it was a systematic problem. After investigating with AWS support and reviewing Nova’s documentation on content filters, we found that the models’ safety mechanisms were operating at an overly conservative threshold, and that the filtering parameters weren’t configurable at the model level through Bedrock. Several community discussions in AWS forums described similar experiences from other teams, confirming this wasn’t an isolated incident but an architectural limitation of how Nova’s filtering integrates with Bedrock’s deployment. Short of accepting an unacceptable error rate, the models were effectively unusable for our use case.
Plan B: Switch to Anthropic’s Claude models through Bedrock. This should have been straightforward, but AWS had other plans. The dreaded “payment instrument failed” error appeared, blocking our access to Claude models. After some investigation, we discovered this was an AISPL (Amazon Internet Services Private Limited) account restriction—a common issue for Indian AWS accounts.

Plan C: While waiting for payment issues to resolve, we tried the GPT-OSS-20B model. Everything worked fine initially, but the model had an unfortunate habit of stopping mid-execution during streaming operations: it would identify the correct tool to use but fail to actually execute it. This created unpredictable failures that made the agents unreliable in production scenarios.
After multiple support tickets and payment method changes (temporarily switching to invoice-based billing to resolve the AISPL restrictions), we finally gained access to Anthropic’s Claude models. Claude 3.7 Sonnet became our production choice, recommended by Anthropic’s documentation for its token efficiency and tool use optimization.
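Wiring Claude 3.7 Sonnet in looked roughly like the following sketch; the Bedrock model ID shown is an assumption based on the commonly listed US cross-region inference profile, so confirm the exact identifier for your account and region.

```python
from strands import Agent
from strands.models import BedrockModel

# Model ID is an assumption; verify against the Bedrock console for your region.
claude_sonnet = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
)

orchestrator = Agent(
    model=claude_sonnet,
    system_prompt="You coordinate specialized subagents to answer user requests.",
)
```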
Chapter 3: The Orchestrator Who Knew Too Much
With our model selected, we faced a critical architectural decision. Our initial approach was straightforward: one orchestrator agent managing one subagent that had access to all 10 tools. In Strands SDK, this meant passing the name, description, and input schema for all ten tools whenever the orchestrator called the subagent.
The result was a token consumption nightmare. Every query consumed 15,000-17,000 tokens, translating to approximately ₹4.4 (≈ $0.053 USD) per query with Claude 3.7 Sonnet pricing. The bulk of this cost came from repeatedly sending all tool schemas in every request—a classic case of architectural inefficiency masquerading as simplicity.
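A rough back-of-the-envelope check (illustrative token counts and an assumed exchange rate, not measured billing data) shows how quickly those numbers add up at Sonnet's list pricing:

```python
# Claude 3.7 Sonnet list pricing (USD per token); exchange rate is an assumption.
SONNET_INPUT = 0.003 / 1000
SONNET_OUTPUT = 0.015 / 1000
USD_TO_INR = 83.0

# Roughly what a single query looked like with all 10 tool schemas attached.
input_tokens, output_tokens = 15_000, 500
cost_usd = input_tokens * SONNET_INPUT + output_tokens * SONNET_OUTPUT
print(f"${cost_usd:.3f} ≈ ₹{cost_usd * USD_TO_INR:.1f} per query")  # ~$0.053 ≈ ₹4.4
```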
The Breakthrough: We redesigned the architecture to follow a 1:1 pattern—10 subagents, each managing exactly one tool. This change was transformative. Instead of sending 10 tool schemas with every request, each subagent only needed to know about its specific tool. The orchestrator could now route requests to the appropriate specialized agent, dramatically reducing token overhead.
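In Strands this maps naturally onto an "agents as tools" style: each specialized subagent is wrapped in a small tool function that the orchestrator can call, so only those lightweight wrapper schemas travel with the orchestrator's requests. The sketch below is illustrative (two of the ten subagents, with made-up prompts and tools), not our exact production code.

```python
from strands import Agent, tool

@tool
def fetch_document(doc_id: str) -> str:
    """Hypothetical single tool owned by the document subagent."""
    return f"Contents of document {doc_id}"

@tool
def create_ticket(summary: str) -> str:
    """Hypothetical single tool owned by the ticketing subagent."""
    return f"Created ticket for: {summary}"

# Each subagent sees exactly one tool schema.
document_agent = Agent(system_prompt="You retrieve documents.", tools=[fetch_document])
ticket_agent = Agent(system_prompt="You file support tickets.", tools=[create_ticket])

@tool
def document_assistant(query: str) -> str:
    """Route document-related requests to the document subagent."""
    return str(document_agent(query))

@tool
def ticket_assistant(query: str) -> str:
    """Route ticketing requests to the ticketing subagent."""
    return str(ticket_agent(query))

# The orchestrator carries only the small wrapper schemas, not ten full tool schemas.
orchestrator = Agent(
    system_prompt="Route each request to the most appropriate specialist.",
    tools=[document_assistant, ticket_assistant],
)
```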
The results spoke for themselves: a 33% reduction in token consumption, bringing our costs down from ₹4.4 per query to a range of ₹2.3-4.0. But we didn’t stop there.
The Haiku Optimization: Tiered Model Strategy
For simpler subagents that didn’t require the full reasoning power of Claude 3.7 Sonnet, we implemented Claude 3 Haiku; a configuration sketch follows the pricing comparison below.
The pricing difference is dramatic:
- Claude 3.7 Sonnet: $0.003 per 1K input tokens (≈ ₹0.26619) and $0.015 per 1K output tokens (≈ ₹1.33095).
- Claude 3 Haiku: $0.00025 per 1K input tokens (≈ ₹0.02218) and $0.00125 per 1K output tokens (≈ ₹0.11091).
- That’s roughly a 92% lower price (91.7%, to be exact) for Haiku versus Sonnet on both input and output, and that gap is what makes the hybrid model strategy described in this chapter worthwhile.
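In practice the tiering is just a matter of which model object each subagent is constructed with. A minimal sketch follows; the model IDs are assumptions to verify in your region, and the prompts are placeholders.

```python
from strands import Agent
from strands.models import BedrockModel

# Assumed Bedrock model IDs; confirm the exact identifiers for your account/region.
sonnet = BedrockModel(model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0")
haiku = BedrockModel(model_id="anthropic.claude-3-haiku-20240307-v1:0")

# Reasoning-heavy subagents stay on Sonnet; simple lookup subagents drop to Haiku.
planning_agent = Agent(model=sonnet, system_prompt="Plan multi-step workflows step by step.")
lookup_agent = Agent(model=haiku, system_prompt="Answer simple lookups concisely.")
```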
Chapter 4: The Heist Movie (Stealing Back Our Token Budget)
4.1: The Target (Understanding Token Variability)
We began by analyzing token consumption across different request types. The pattern that emerged was crucial: token consumption varied wildly based on the tools being used.
An API call returning a 500-character JSON response consumed significantly more tokens than a document retrieval.
This variability was the key to our heist. Not all queries were created equal. If we could identify which queries could be handled by Haiku and which genuinely needed Sonnet, we could reclaim substantial budget.
4.2: The Intelligence
We simply assigned queries to the appropriate subagent based on their inherent complexity. Some tasks ran directly on Haiku, and others directly on Sonnet (the mapping is sketched after the list below):
- Tasks requiring multi-step reasoning → Sonnet
- Straightforward lookups or calculations → Haiku
- Potentially nuanced or context-heavy responses → Sonnet
- Simple executions without complex analysis → Haiku
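Expressed as configuration, the assignment above is little more than a lookup table consulted when each subagent is built; the tier names and helper below are illustrative, not our production schema.

```python
# Illustrative mapping of task profiles to the model tier each subagent runs on.
MODEL_TIER_BY_TASK = {
    "multi_step_reasoning": "sonnet",
    "simple_lookup": "haiku",
    "nuanced_response": "sonnet",
    "simple_execution": "haiku",
}

def model_for(task_profile: str) -> str:
    """Return the model tier assigned to a subagent's task profile."""
    return MODEL_TIER_BY_TASK[task_profile]

assert model_for("simple_lookup") == "haiku"
```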
4.3: The Execution (Results of the Hybrid Strategy)
Our hybrid approach using both Claude 3.7 Sonnet and Claude 3 Haiku proved to be the sweet spot. Simple tasks like basic calculations or document downloading could be handled by Haiku at a fraction of the cost. Complex reasoning, multi-step planning, and nuanced analysis remained with Sonnet.
The results:
- Cost reduction: From ₹4.4 per query to ₹2.3-4.0 per query
- Capability preservation: No loss of quality for appropriate tasks
This tiered approach gave us the best of both worlds: cost efficiency where we could afford to optimize, and capability where we needed it most.
The heist was complete. We’d stolen back our token budget without sacrificing the system’s ability to think.

Chapter 5: The Feel-Good Ending
After months of trial, error, and unexpected surprises, we finally reached the point where everything clicked. Looking back, it wasn’t just the models, the frameworks, or even the tools that mattered—it was the decisions we made early, the ones that let us adapt and evolve when the unexpected hit.
The Beauty of Design
Remember the chaos of our first attempt? One orchestrator, one sub-agent, ten tools, queries burning tokens like wildfire. It was messy—but it set the stage for our breakthrough. Because we had invested in modular, clean design from the start, we didn’t need to tear everything down. A simple configuration change transformed our system: ten focused sub-agents, each owning a single tool. Suddenly, performance improved, token costs dropped, and the architecture felt alive, stable, and scalable. The relief was palpable—we had turned chaos into order, and it worked beautifully.
Winning Against Hidden Challenges
Payment restrictions, content filters, mid-stream execution failures—each felt like a roadblock designed to stop us. But tackling them taught us resilience and creative problem-solving. By the time our hybrid model strategy was in place—Claude 3.7 Sonnet handling complex reasoning and Claude 3 Haiku taking care of simpler tasks—we had a system that was not only efficient, but smart, adaptable, and sustainable.
From Lessons to Triumph
What started as a whirlwind of uncertainty had become a production-ready, cost-effective, and reliable AI agent system. The architecture decisions that seemed like abstract best practices earlier now shone as the backbone of our success. Every token saved, every query streamlined, was a small victory—and together they added up to a big win.
The Takeaway
It’s tempting to focus only on models or frameworks, but the real magic comes from thinking ahead, building cleanly, and trusting your design. When the chaos inevitably hits, a solid foundation doesn’t just save time—it saves the day.
Looking at the system now, it’s hard not to smile. What began as a bumpy adventure ended as a triumphant journey. The agents are running, costs are controlled, and the architecture hums along like a well-oiled machine. In the end, that’s the feel-good part: seeing months of work come together into something that just works—and works beautifully.