How we're building AI worker agents at Awell

Date

May 13, 2025

Category

CareOps

Author

Etienne Bomcke

A couple of weeks ago we held a hackathon during which we played with Google’s new A2A protocol. As I was hacking around with their sample app, I realised that since we started building agents at Awell we have faced most of the challenges this protocol aims to address, so it felt like a good opportunity to reflect on how we build agents and share that with the community.

Our agentic stack

Evaluation & observability

The evaluation platform allows us to understand how agents are performing. We use it to tune the prompts we give to the agents and benchmark models to select the best one for the job.

We also use it to evaluate the performance of the agents themselves in simulated real-life scenarios.

Although it feels a bit like stating the obvious, it’s worth explicitly calling out that this is where most of the work goes today. Getting to a working initial implementation of an agent (even a complex multi-agent system) is fairly easy now that we have a platform & deployment automation set up. Ensuring that it performs consistently and that we have the appropriate human-in-the-loop guardrails in place is tough.


Agents

Agents are always designed for a specific job. Clearly defining the job of each agent helps us design effective evaluation strategies and make the best use of the underlying LLMs.

Agents currently communicate with one another using a custom protocol - we’re thinking of replacing it with Google’s Agent2Agent (A2A) protocol for increased interoperability. We primarily use the supervisor pattern to delegate tasks to other agents, with a bit of heuristics added through conditional edges in the agent graph.

We decided to use LangGraph to build our agents as it has little to no vendor lock-in and provides very useful abstraction layers (the agent graph and memory store in particular). Also worth noting that their prototyping app (LangGraph Studio) is delightful.
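To make the supervisor pattern concrete, here is a minimal sketch in plain Python of how heuristic conditional edges can route a request before falling back to LLM-based delegation. The agent names and routing keywords are hypothetical, and the worker functions are stubs standing in for real LLM-backed nodes; in our actual stack this wiring is expressed with LangGraph rather than a hand-rolled dispatch table.

```python
from typing import Callable, Dict

# Hypothetical worker agents; in a real graph each node would call an LLM.
def scheduling_agent(state: dict) -> dict:
    state["result"] = "appointment booked"
    return state

def intake_agent(state: dict) -> dict:
    state["result"] = "intake form sent"
    return state

def supervisor(state: dict) -> dict:
    # Fallback node: in practice this is where an LLM decides who handles it.
    state["result"] = "delegated via LLM"
    return state

NODES: Dict[str, Callable[[dict], dict]] = {
    "scheduling_agent": scheduling_agent,
    "intake_agent": intake_agent,
    "supervisor": supervisor,
}

def route(state: dict) -> str:
    """Conditional edge: cheap keyword heuristics first, supervisor as fallback."""
    text = state["request"].lower()
    if "schedule" in text:
        return "scheduling_agent"
    if "intake" in text:
        return "intake_agent"
    return "supervisor"

def run(request: str) -> dict:
    state = {"request": request}
    return NODES[route(state)](state)
```

The point of the heuristic edges is to keep obvious cases deterministic and cheap, reserving the LLM supervisor for requests that genuinely need judgement.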

Memory

The memory is where we store data related to the interactions between users and our agents.

Examples of how this memory can be used:

  • provide a shared context when a task requires multiple agents

  • record and classify past interactions to help agents understand best practices, then feed these best practices back into the agent prompts to improve the quality of their responses

So far we have mainly played with semantic & episodic memory. Using procedural memory and allowing agents to auto-tune their own prompts remains a “maybe later” for now. (Maybe once system prompt learning takes off?)
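A toy sketch of what such a memory can look like: a store keyed by namespace, with semantic entries holding durable facts shared across agents and episodic entries recording past interactions. The schema, namespaces, and values below are all hypothetical illustrations, not our actual data model (LangGraph ships its own memory-store abstraction for this).

```python
from collections import defaultdict

class MemoryStore:
    """Toy namespaced key-value memory (hypothetical schema)."""

    def __init__(self):
        self._data = defaultdict(dict)

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        self._data[namespace][key] = value

    def get(self, namespace: tuple, key: str):
        return self._data[namespace].get(key)

    def search(self, namespace: tuple) -> list:
        return list(self._data[namespace].values())

store = MemoryStore()

# Semantic memory: durable facts, shared context when multiple agents
# collaborate on the same task.
store.put(("patient_123", "semantic"), "preferences", {"channel": "sms"})

# Episodic memory: a record of a past interaction, which can later be
# classified and fed back into prompts as a best practice.
store.put(("patient_123", "episodic"), "2025-05-01", {"summary": "rescheduled visit"})
```

Namespacing by subject and memory kind keeps the shared context readable by every agent involved in a task while keeping interaction logs separate for later analysis.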

A2A protocol

I wanted to get a sense of how Google’s A2A protocol impacts the way we think about and design agents, so I cloned their sample repo and played around with it for a day.

The biggest insight I got from this exercise is that it provides a structured language to talk about agent-to-agent collaboration. Agents can advertise skills. Agents can exchange messages to perform tasks. Agents can generate artefacts as part of completing tasks.
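Skill advertisement happens through an agent card. The sketch below is a simplified, abridged card expressed as a Python dict; the field names follow the public A2A spec at the time of writing, but treat this as an illustration rather than a reference, and the agent, URL, and skill are made up.

```python
# Simplified A2A-style agent card (hypothetical agent and endpoint).
agent_card = {
    "name": "Scheduling Agent",
    "description": "Books and reschedules patient appointments.",
    "url": "https://example.com/a2a",  # where the agent receives A2A requests
    "version": "1.0.0",
    "capabilities": {"streaming": False},
    "skills": [
        {
            "id": "book_appointment",
            "name": "Book appointment",
            "description": "Finds an open slot and books a visit.",
            "tags": ["scheduling"],
        }
    ],
}
```

A delegating agent can fetch cards like this one and match incoming requests against the advertised skill descriptions, which is exactly where the delegation prompt discussed below comes in.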

Language structures our thinking, so having a common language to talk about how agents can and should collaborate is going to help us find new and hopefully better ways to solve these problems.

The biggest letdown was when I found the prompt that handles task delegation, which essentially boils down to: “You are an expert delegator that can delegate the user request to the appropriate remote agents”.

It sort of worked for prototyping, but even when dealing with very simple tasks specifically worded to match an agent’s skills, I frequently found myself having to confirm suggestions made by the model given this “expert delegator” prompt.
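One obvious direction for hardening this is to build the delegation prompt from the advertised skills themselves, rather than relying on a bare “expert delegator” instruction. The sketch below is a hypothetical prompt builder, not the sample repo’s code; the agent data it consumes is made up, and the refusal token is an assumption about one way to make non-matches explicit.

```python
def build_delegation_prompt(agents: list[dict]) -> str:
    """Assemble a routing prompt that lists each remote agent's skills."""
    lines = [
        "You route user requests to remote agents.",
        "Pick exactly one agent whose skills best match the request;",
        "if no skill clearly matches, answer UNROUTABLE instead of guessing.",
        "",
        "Available agents:",
    ]
    for agent in agents:
        # One line per agent: name plus its advertised skill descriptions.
        skills = "; ".join(skill["description"] for skill in agent["skills"])
        lines.append(f"- {agent['name']}: {skills}")
    return "\n".join(lines)
```

Grounding the prompt in concrete skill descriptions, and giving the model an explicit way to refuse, is the kind of change that moved the needle most in our own evaluations of delegation behaviour.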

If we see more public-facing agents emerge in the wild (which seems inevitable considering the protocol’s section on Agent Discovery), then pretty soon we’ll have to build agents that can efficiently delegate to agents based on skill descriptions that are completely outside of our control. I’m really curious to see what patterns are going to emerge to handle this problem - will it be supervisor-based, swarm-like delegation, hybrid, or something entirely new? And what best practices will we find to advertise agent skills & capabilities?
