
A private AI path is only as trustworthy as the telemetry that proves it stays healthy. Equinix Fabric Streams emits route churn, connection state, BGP session events, and metric samples for every Fabric Cloud Router and Connection. This recipe pipes those streams into Datadog or Grafana via private Fabric paths — so even your observability traffic never touches the public internet.

The problem

Your AI inference path crosses three metros (IAD, DFW, SV5), each with a connection to a GPU partner. Latency budgets are tight (sub-30ms P95 to first token). When something degrades, you need to know which hop, which metro, which connection — within seconds. The traditional answer is “ship logs to Datadog over the public internet.” For regulated workloads or sovereign-AI deployments, that makes the observability rail a side-channel for data exfiltration, and it puts your CISO at war with your SRE team. You need observability that:
  1. Flows over the same private Fabric paths as the production traffic.
  2. Samples and rate-limits at the Fabric layer so a noisy metric storm can’t blow your Datadog ingestion budget.
  3. Works across metros without three separate stitch jobs.

The architecture

  IAD                    DFW                    SV5
  ┌────────┐             ┌────────┐             ┌────────┐
  │ FCR    │             │ FCR    │             │ FCR    │
  │ + conn │             │ + conn │             │ + conn │
  └───┬────┘             └───┬────┘             └───┬────┘
      │                      │                      │
      │  Fabric Streams      │  Fabric Streams      │  Fabric Streams
      │  subscription        │  subscription        │  subscription
      └─────────┬────────────┴───────────┬──────────┘
                │                        │
                ▼                        ▼
       ┌────────────────┐       ┌────────────────┐
       │   Datadog      │       │   Grafana      │
       │   (metrics +   │       │   (logs +      │
       │   traces)      │       │   route events)│
       └────────────────┘       └────────────────┘
Each FCR + Connection pair has a Fabric Streams subscription attached. The subscription routes samples to one of:
  • A Datadog AWS Private Link endpoint via Fabric Connection (datadog/observability)
  • A Grafana Cloud private endpoint (grafana/cloud)
  • A Splunk / New Relic / Honeycomb endpoint (drop-in)
Sampling and rate limits are enforced via Network Edge ACL templates sitting between the FCR and the observability sink.
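A minimal ACL template sketch, assuming the equinix_network_acl_template resource from the Equinix Terraform provider (the subnet and ports below are placeholders; adapt them to your sink's addressing):

```hcl
# Restrict the FCR-to-sink path to telemetry traffic only.
# Subnet and ports are illustrative placeholders.
resource "equinix_network_acl_template" "telemetry_only" {
  name        = "acl-observability-sink"
  description = "Allow only HTTPS telemetry toward the observability sink"

  inbound_rule {
    subnet   = "10.40.0.0/24"  # placeholder: sink-facing subnet
    protocol = "TCP"
    src_port = "any"
    dst_port = "443"
  }
}
```

The template would then be referenced from the Network Edge device (equinix/network-edge-device) sitting in the path; rate limiting itself is configured on the VNF, not in the ACL.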

Required provider packages

equinix/fabric-cloud-router

Three FCRs, one per metro.

equinix/fabric-streams

Stream subscriptions, asset attachments, alert rules.

equinix/fabric-connection

Connections from each FCR to the observability provider.

equinix/network-edge-device

Optional rate-limit / DPI between FCR and observability sink.

datadog/observability

Datadog API + private endpoint sink.

grafana/cloud

Grafana Cloud private endpoint, Loki + Mimir + Tempo.

Add the packages

equinix-dev init distributed-ai-observability
cd distributed-ai-observability

equinix-dev add equinix/fabric-cloud-router
equinix-dev add equinix/fabric-streams
equinix-dev add equinix/fabric-connection

# Pick one (or both) — Datadog and Grafana are drop-in alternatives.
equinix-dev add datadog/observability
equinix-dev add grafana/cloud

equinix-dev plan --metros IAD,DFW,SV5

Terraform recipe

locals {
  metros = {
    iad = "DC"
    dfw = "DA"
    sv5 = "SV"
  }
}

# Three FCRs (one per metro).
module "fcr" {
  for_each = local.metros
  source   = "equinix/fabric/equinix"
  version  = "0.28.1"

  cloud_router_name        = "fcr-observability-${each.key}"
  cloud_router_metro_code  = each.value
  cloud_router_package     = "BASIC"
  cloud_router_account_num = var.equinix_account_number
}

# A Fabric Streams subscription per metro, capturing route + metrics events.
resource "equinix_fabric_stream_subscription" "metro_telemetry" {
  for_each = local.metros

  name        = "subscription-observability-${each.key}"
  description = "Route and connection telemetry for ${upper(each.key)}"

  filters {
    type   = "EQUINIX_DEFINED"
    values = ["fabric.connection.*", "fabric.route.*", "fabric.bgp.session.*"]
  }

  sink {
    type = "DATADOG"
    settings = {
      datadog_site = "datadoghq.com"
      api_key_ref  = var.datadog_api_key_secret_ref   # not the secret value
    }
  }

  # Sample 100% of state changes; sample 10% of metric samples to keep
  # ingestion bounded.
  sampling = {
    state_change_rate = 1.0
    metric_rate       = 0.1
  }
}

# Attach each FCR + its connections to the relevant subscription.
resource "equinix_fabric_stream_asset_attachment" "fcr_metros" {
  for_each = local.metros

  subscription_id = equinix_fabric_stream_subscription.metro_telemetry[each.key].id
  asset_type      = "FABRIC_CLOUD_ROUTER"
  asset_id        = module.fcr[each.key].cloud_router_id
}

# Alert rule — page someone if route churn exceeds a threshold.
resource "equinix_fabric_stream_alert_rule" "route_churn" {
  for_each = local.metros

  subscription_id = equinix_fabric_stream_subscription.metro_telemetry[each.key].id
  name            = "alert-route-churn-${each.key}"
  metric_name     = "fabric.route.churn.events_per_minute"
  operator        = "GREATER_THAN"
  threshold       = 50
  window_minutes  = 5
  severity        = "P2"
  notify_emails   = ["sre@example.com"]
}

MCP trace

// 1. List existing subscriptions in the account.
{
  "tool": "search_streams",
  "arguments": { "owner_account": "${EQUINIX_ACCOUNT_NUMBER}" },
  "result": {
    "subscriptions": [
      { "id": "sub-iad-001", "metro": "DC", "asset_count": 4 },
      { "id": "sub-dfw-001", "metro": "DA", "asset_count": 4 }
    ]
  }
}

// 2. Inspect Datadog connector readiness.
{
  "tool": "validate_stream_sink",
  "arguments": { "sink_type": "DATADOG", "site": "datadoghq.com" },
  "result": { "valid": true, "ingestion_budget_remaining_eps": 12000 }
}

// 3. Mutating — would create a subscription — BLOCKED.
{
  "tool": "create_stream_subscription",
  "arguments": { "name": "subscription-observability-sv5", "...": "..." },
  "result": {
    "status": "BLOCKED",
    "reason": "mutation_policy = blocked_by_default_requires_human_confirmation",
    "preflight_gates": [
      "datadog_api_key_secret_present",
      "datadog_ingestion_budget_sufficient",
      "fabric_streams_owner_acknowledged",
      "alert_recipient_email_validated"
    ]
  }
}

Sampling math (so the FinOps team can sleep)

Three FCRs × ~4 connections each × 10 metric samples per minute × 525,600 minutes per year ≈ 63 million samples per year, before sampling. With the recipe’s metric_rate = 0.1 (10%), that drops to ~6.3M samples/year — well under most Datadog standard ingestion plans. State-change events (connection up/down, BGP session events) typically amount to less than 0.1% of metric volume, so keep state_change_rate = 1.0 (100% sampled): you want every state change to land.
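The same arithmetic as a Terraform locals sketch (numbers only; these names are illustrative and are not inputs the recipe consumes):

```hcl
locals {
  fcrs                = 3
  connections_per_fcr = 4              # approximate
  samples_per_minute  = 10             # per connection
  minutes_per_year    = 60 * 24 * 365  # 525,600

  # 3 * 4 * 10 * 525,600 = 63,072,000 (~63M raw samples/year)
  raw_samples_per_year = local.fcrs * local.connections_per_fcr * local.samples_per_minute * local.minutes_per_year

  # metric_rate = 0.1 keeps ~6.3M samples/year
  sampled_per_year = local.raw_samples_per_year * 0.1
}
```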

Variants

  • Grafana instead of Datadog: replace the sink.type = "DATADOG" block with "GRAFANA_CLOUD" plus a grafana_url and mimir_endpoint. The rest is unchanged.
  • Kafka or Pub/Sub: Fabric Streams supports a generic KAFKA_TOPIC sink (with SASL/SCRAM auth) and a GCP_PUBSUB sink. Drop the Datadog package and keep only equinix/fabric-streams.
  • Inline firewall: if your security team requires a stateful firewall in the path between the FCR and the observability sink, drop in equinix/network-edge-device with a Fortinet or Palo Alto VNF.
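The Grafana swap touches only the sink block. A sketch, reusing the grafana_url and mimir_endpoint key names mentioned above (both URLs are placeholders, not real endpoints):

```hcl
sink {
  type = "GRAFANA_CLOUD"
  settings = {
    grafana_url    = "https://example-stack.grafana.net"  # placeholder
    mimir_endpoint = "https://example-mimir.grafana.net"  # placeholder
  }
}
```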

Next

Private AI inference path

The production traffic path that this observability is watching.

Multi-cloud private interconnect

The other production pattern — also worth attaching Fabric Streams to.