Problem Statement
For over a decade, organizations have built internal platforms by stitching together CI/CD pipelines, infrastructure-as-code templates, container orchestrators, and bespoke automation scripts — often with each team rolling its own toolchain. The result is familiar: isolated silos, "ticket ops" bottlenecks where developers wait days for infrastructure, fragmented security posture, and a crushing cognitive load on engineers who must understand every layer from Kubernetes networking to cloud IAM. Platform engineering as a discipline emerged precisely to tame this complexity — treating the internal developer platform as a product, paving golden paths, and delivering self-service capabilities so that developers can ship safely without becoming infrastructure experts. Yet just as organizations began to get a handle on that challenge, a new wave arrived.
Over the last year it has felt like AI agents have exploded everywhere: MCP servers and clients, local and remote LLM endpoints, Microsoft Foundry–hosted agents, custom copilots wired into your apps, and more. Every demo looks magical, but when you try to plug these pieces into your real platform, the same questions keep coming back: Where is this agent actually running? Which identity and permissions is it using? How does it reach my clusters, data stores, and SaaS systems? Who – or what – am I really talking to when I type a prompt?
This first post is for platform and application teams who want to embrace agents without turning their environment into an opaque tangle of services. We'll reframe the problem, introduce a mental model for modern AI-powered platforms on Azure, and set the stage for the rest of the series.
Solution
Before diving into specific services, it helps to separate two complementary ideas that often get conflated:
| | Platform for Building AI Apps | AI-Powered Platform |
|---|---|---|
| Focus | Design and host copilots, agents, and AI-driven features | Use AI to operate, observe, and troubleshoot your infrastructure |
| Example | Build a customer-facing agent in Microsoft Foundry | Ask an MCP-connected agent "why is my pod crash-looping?" |
This series addresses both columns. This first post keeps things brief: it introduces the shared compute and platform building blocks that underpin either approach and sets up the mental model. The roadmap for the rest of the series appears at the end of this post.
Azure compute options for customized platforms
When you design a modern, AI-aware platform on Azure you are usually choosing from a handful of compute runtimes — each with different trade-offs around control, cost, and operational overhead:
| Service | Best for | Infra control | Agent-friendly highlights |
|---|---|---|---|
| AKS Automatic | Teams that want Kubernetes without managing every knob — built-in best practices, auto-upgrades, node auto-provisioning | Medium — opinionated defaults, less to configure | Good fit for hosting MCP servers, agent backends, and traditional microservices side-by-side with minimal cluster ops |
| AKS Standard | Regulated or large-scale environments needing deep networking, GPU node pools, custom admission controllers | High — full control over every layer | Run complex multi-tenant platforms, custom MCP servers, eBPF-based observability (Inspektor Gadget), and heavy agent workloads |
| Azure Container Apps — Consumption | Bursty, event-driven agents and APIs | Low — fully managed | Great for lightweight agent endpoints or webhooks that idle most of the time |
| Azure Container Apps — Dedicated | Steady, long-running agent backends or MCP servers | Medium — reserved capacity | Predictable latency and throughput for always-on agent runtimes |
| Azure Functions — Flex Consumption | Glue logic, orchestration steps, background tasks | Low | Ideal for event-driven extensions to your platform (queue processors, triggers, lightweight orchestration) |
| Hosted Agents in Microsoft Foundry | PaaS-first agent development with managed identity, governance, and Microsoft ecosystem integration | Low | Define agents, tools, and connections without managing infrastructure |
These compute runtimes rarely stand alone. Most platforms also depend on a set of cross-cutting services:
| Service | Role in the platform |
|---|---|
| Azure API Management (APIM) — AI Gateway | Central entry point for LLM and agent traffic — token-rate-limiting, semantic caching, load balancing across model endpoints, usage analytics, and policy-driven governance for AI calls |
| Microsoft Entra ID | Identity backbone — authenticates users, services, agents, and MCP servers; issues OAuth tokens; enforces Conditional Access and RBAC across every layer |
| Azure Monitor (incl. Log Analytics, Managed Prometheus, Container Insights) | Unified observability — metrics, logs, traces, and alerts for clusters, apps, and agent interactions; feeds data to HolmesGPT and other agentic troubleshooting tools |
Key insight: Most real-world platforms are composites. For example, AKS Standard runs the core microservices while an Azure Container Apps (ACA) consumption tier handles bursty agent APIs and Functions wires everything together through events.
Key components of a modern AI-based platform
Agents, LLMs, and MCP — a quick mental model
Think of it like a body:
| Component | Analogy | Role |
|---|---|---|
| LLM | 🧠 Brain | Reasoning, understanding, decision-making |
| Agent | 🎯 Coordinator | Orchestrates the loop between the user, the LLM, and tools |
| MCP | 🗣️ Language / Protocol | An open standard that defines how agents discover and call tools |
| MCP Client | 👂 Ears | The component inside the agent (or IDE) that speaks MCP to a server |
| MCP Server | 🔌 Nervous system | Receives MCP requests, maps them to real operations, and returns results |
| MCP Tools | 💪 Hands & muscles | Individual capabilities exposed by an MCP server (Azure APIs, kubectl, queries, etc.) |
Let's unpack these a bit further:
LLM (Large Language Model) — The reasoning engine behind every agent interaction. When you "talk to AI," you are ultimately sending tokens to an LLM that returns a completion. Examples you'll see across this series:
| LLM | Provider | Typical access path |
|---|---|---|
| GPT-4o / GPT-4.1 | Azure OpenAI Service or OpenAI API | Most common choice when the platform is on Azure |
| o3 / o4-mini (reasoning models) | Azure OpenAI Service or OpenAI API | Multi-step planning and complex troubleshooting |
| Claude (Sonnet / Opus) | Anthropic API | Popular in developer tooling (Cursor, Claude Desktop) |
| Open-source models (Llama, Phi, Mistral) | Self-hosted on AKS (GPU node pools) or Azure AI Model Catalog | Data-sovereignty or cost-sensitive scenarios |
All of the tools discussed in this series (Azure MCP Server, AKS MCP Server, HolmesGPT, Agentic CLI) are BYO-LLM — you point them at whichever model endpoint you prefer.
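To make the BYO-LLM point concrete, here is a minimal Python sketch using the openai package. Switching between a managed and a self-hosted model is mostly a matter of swapping the client and endpoint; all endpoints, keys, deployment names, and model ids below are placeholders.

```python
from openai import AzureOpenAI, OpenAI

# Option 1: Azure OpenAI Service (endpoint and deployment name are placeholders)
azure_client = AzureOpenAI(
    azure_endpoint="https://my-openai.openai.azure.com",
    api_key="<key-or-use-entra-id>",
    api_version="2024-06-01",
)
resp = azure_client.chat.completions.create(
    model="gpt-4o",  # in Azure OpenAI this is the *deployment* name
    messages=[{"role": "user", "content": "Why might a pod crash-loop?"}],
)

# Option 2: a self-hosted, OpenAI-compatible endpoint (e.g. vLLM on AKS GPU nodes)
local_client = OpenAI(
    base_url="http://llama.internal:8000/v1",  # placeholder internal endpoint
    api_key="not-needed-for-local",
)
resp = local_client.chat.completions.create(
    model="llama-3-8b-instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Why might a pod crash-loop?"}],
)
print(resp.choices[0].message.content)
```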
Agent — The orchestrator that sits between the user and the LLM. An agent decides when to call the LLM, which MCP tools to invoke, and whether to ask the user for confirmation before executing a write operation. Examples: GitHub Copilot in VS Code, HolmesGPT, the AKS Agentic CLI, or a custom agent you build in Microsoft Foundry.
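To show what "orchestrating the loop" means in practice, here is a deliberately simplified agent loop in Python, built on the openai package's tool-calling API. The get_pod_logs tool and all names are hypothetical, and a real agent would add safeguards such as asking the user for confirmation before any write operation.

```python
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-openai.openai.azure.com",  # placeholder
    api_key="<key>",
    api_version="2024-06-01",
)

# One hypothetical tool the agent can execute on the user's behalf.
tools = [{
    "type": "function",
    "function": {
        "name": "get_pod_logs",
        "description": "Fetch recent logs for a Kubernetes pod",
        "parameters": {
            "type": "object",
            "properties": {"pod": {"type": "string"}},
            "required": ["pod"],
        },
    },
}]

def get_pod_logs(pod: str) -> str:
    # A real agent would call kubectl or an MCP server here.
    return f"(recent logs for {pod}...)"

messages = [{"role": "user", "content": "Why is pod checkout-7f9c crash-looping?"}]
for _ in range(5):  # bounded loop so the sketch cannot run away
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:        # the LLM produced a final answer
        print(msg.content)
        break
    messages.append(msg)          # keep the assistant's tool request in context
    for call in msg.tool_calls:   # execute each requested tool and return results
        args = json.loads(call.function.arguments)
        result = get_pod_logs(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```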
MCP (Model Context Protocol) — An open standard (originally from Anthropic, now widely adopted) that defines a JSON-RPC interface for agents to discover available tools, invoke them with structured parameters, and receive typed results. Think of it as "USB-C for AI tools" — one protocol, many servers.
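Under the hood, that JSON-RPC interface is quite small. The sketch below shows the shape of the two core requests as Python dicts; the tool name and arguments are illustrative, not taken from any specific server.

```python
# Tool discovery: ask the server what it can do (JSON-RPC 2.0 envelope)
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Tool invocation: call one tool with structured, schema-validated arguments
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "list_blobs",                        # hypothetical tool name
        "arguments": {"storage_account": "mystore"},  # validated against the tool's schema
    },
}
```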
MCP Client — The piece of software inside an agent or IDE that knows how to speak the MCP protocol. For example, GitHub Copilot in VS Code has a built-in MCP client; Claude Desktop has one; and HolmesGPT includes one that can connect to any MCP server.
MCP Server — A process (local binary or remote container) that implements the MCP protocol, advertises a set of tools, and executes them on behalf of the client. Azure MCP Server and AKS MCP Server are two Microsoft-built MCP servers covered in this series.
MCP Tools — The individual operations an MCP server exposes — for example "list blobs in a storage account", "run kubectl get pods", or "start an Inspektor Gadget DNS trace". Each tool has a typed schema so the LLM knows what parameters to provide.
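Putting the server and tool concepts together, here is a minimal sketch of an MCP server exposing one typed tool, written with the FastMCP helper from the official MCP Python SDK. The tool itself is a toy; a real implementation would call the Kubernetes API or kubectl.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("platform-tools")

@mcp.tool()
def get_pod_status(namespace: str, pod: str) -> str:
    """Return the current status of a pod (toy implementation)."""
    # A real server would query the Kubernetes API here.
    return f"{pod} in {namespace}: CrashLoopBackOff (restarts: 12)"

if __name__ == "__main__":
    # stdio transport suits local clients such as VS Code or Claude Desktop
    mcp.run(transport="stdio")
```

Note how the Python type hints become the tool's parameter schema, which is exactly what lets the LLM know which arguments to supply.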
Azure MCP Server
Microsoft's official MCP implementation covering 40+ Azure services (Storage, Cosmos DB, App Configuration, Monitor, Resource Graph, and more). It can run locally on a developer workstation (using az login credentials and stdio transport) or be deployed remotely on Azure Container Apps with Managed Identity for use by Microsoft Foundry or custom remote agents.
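As a sketch of local usage, the snippet below launches Azure MCP Server as a subprocess and lists its tools through the MCP Python SDK. The npx launch command follows the Azure MCP Server documentation at the time of writing; treat it as an assumption and verify it against the current README.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="npx",
    args=["-y", "@azure/mcp@latest", "server", "start"],  # assumed launch command
)

async def main():
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # discover the advertised tools
            print([t.name for t in tools.tools])

asyncio.run(main())
```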
AKS MCP Server
A Kubernetes-specialised MCP server that adds cluster management, workload operations, real-time eBPF observability via Inspektor Gadget, and multi-cluster Fleet management. Like Azure MCP Server, it supports both local and in-cluster (remote) deployment.
HolmesGPT
An open-source, CNCF Sandbox agentic AI framework for root cause analysis. It decides what data to fetch, runs targeted queries against 20+ observability tools (Prometheus, Loki, Datadog, MCP servers, etc.), and iteratively refines its hypothesis. Read-only by design, it is safe to run against production.
Agentic CLI for AKS
An Azure CLI extension (az aks agent) powered by the AKS Agent (based on HolmesGPT). It lets you type natural-language questions — "why is my pod failing?" — and get a structured root cause analysis, either from your laptop (client mode) or from a pod inside the cluster (cluster mode).
How these pieces compose into a platform
- The compute substrate (AKS Automatic / Standard, ACA, Functions) hosts your apps, services, and optionally the agents themselves.
- The control surface (Azure MCP Server, AKS MCP Server, platform APIs, events) exposes safe, identity-governed capabilities that agents can call.
- The agent layer (Foundry-hosted or custom-built, backed by HolmesGPT or the Agentic CLI) becomes an intelligent co-pilot for developers and SREs — helping them deploy, observe, and troubleshoot the platform.
What comes next in this series
The goal of this post is to introduce the new terminology — LLMs, agents, MCP, MCP servers, MCP tools — and provide a high-level view of their capabilities and role in platform engineering. By establishing this shared vocabulary, platform teams can better understand how these AI components fit into the broader landscape of building and operating internal developer platforms.
The broader goal of this series is to demonstrate how platform engineering principles — as defined by the CNCF and Microsoft — can be realized on Azure through secure, reliable, and scalable reference implementations. We'll explore how newer platforms like AKS Automatic and Azure Container Apps enable different implementation patterns, incorporate emerging CNCF projects like kro, and show how agentic AI features can eliminate operational toil while maintaining the self-service, governed framework that platform engineering promises.
Part 2, "Troubleshooting Your Platform the Agentic AI Way", will focus on the second column from the table above — walking through detailed architectures with Azure MCP Server, AKS MCP Server, HolmesGPT, and the Agentic CLI for AKS. Subsequent parts will cover platforms for building AI apps and opinionated reference architectures that blend both perspectives.
References
- Microsoft Azure AI Foundry
- Azure MCP Server – Overview
- Azure MCP Server – Tools Reference
- AKS MCP Server – GitHub
- HolmesGPT (CNCF Sandbox)
- Agentic CLI for AKS – Overview
- AKS Blog – MCP Server Announcement
- AKS Blog – Agentic CLI
- What is Platform Engineering? – CNCF
- What is Platform Engineering? – Microsoft Learn