Service discovery: DNS vs registry vs sidecar
Trade-offs across the three approaches.
Service A wants to call service B. In a static world, B’s address is in a config file. In the real world, B has 30 pods, spread across 3 availability zones, and the pods come and go every few minutes during rolling deploys, autoscaling, and node failures. Service discovery is how A finds the current set of B’s pods, fast enough to be useful and accurate enough not to keep talking to dead ones. Three patterns dominate: DNS-based, registry-based (client- or server-side), and service mesh / sidecar. Each one trades simplicity for accuracy.
The problem
flowchart LR
A(["Service A"]):::client
Q{"Where is B?"}:::query
B1[("B pod 1<br/>10.0.1.5")]:::server
B2[("B pod 2<br/>10.0.1.6")]:::server
B3[("B pod 3<br/>10.0.1.7")]:::server
A ==> Q
Q -.-> B1
Q -.-> B2
Q -.-> B3
NOTE["Pods come and go.<br/>The list of addresses changes constantly.<br/>How does A keep up?"]:::infra
Q --> NOTE
classDef client fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
classDef query fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
DNS-based discovery: cheap, slow to update
A DNS name resolves to one or more IPs. The DNS server is the source of truth; clients ask DNS and trust the answer. This is how the world has worked since 1983.
sequenceDiagram
autonumber
participant A as Service A
participant DNS as DNS server
participant B as Service B pod
A->>DNS: A record for service-b.internal?
DNS-->>A: [10.0.1.5, 10.0.1.6, 10.0.1.7]
A->>B: HTTP request to 10.0.1.5
Note over A,DNS: client caches the IPs<br/>(typical TTL: 30-300 seconds)
In Kubernetes, every service has an internal DNS name (service-b.namespace.svc.cluster.local); the cluster’s DNS (CoreDNS) keeps it updated as pods join and leave.
Strength. Every language has a DNS client. Zero application code needed. The cluster does the work.
Weakness. DNS caches. A pod that died 10 seconds ago may still be in the client’s cache for another minute. Clients keep hammering dead IPs until the TTL expires. Cannot do health-aware routing (DNS just returns IPs; it does not know which are healthy beyond a coarse “are they registered”).
Used for: internal Kubernetes service-to-service for simple cases, traditional infrastructure, anything where eventual consistency at minute-granularity is acceptable.
Registry-based discovery: more accurate, more moving parts
A central registry (Consul, etcd, Eureka, Kubernetes itself) holds the current set of healthy instances per service. Clients consult the registry directly, or via a server-side load balancer that does.
flowchart TB
subgraph CLIENT["Client-side discovery"]
direction LR
AC(["Service A"]):::client
REG[("Registry<br/>(Consul, etcd)")]:::infra
BS[("Service B pods")]:::server
AC ==>|"who is B?"| REG
REG ==>|"[healthy pods]"| AC
AC ==>|"call directly"| BS
end
subgraph SERVER["Server-side discovery"]
direction LR
AS(["Service A"]):::client
LB[["Load balancer<br/>watches registry"]]:::infra
REG2[("Registry")]:::infra
BS2[("Service B pods")]:::server
AS ==>|"call B's LB endpoint"| LB
REG2 -.->|"healthy pods"| LB
LB ==> BS2
end
classDef client fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
Each service instance registers on startup (“I am service B, here is my address, here is my health check”) and the registry removes it when it fails or goes away. Clients (or LBs) get fresh information within seconds, not minutes.
Strength. Fast updates, health-aware. Rich metadata (tags, version, region).
Weakness. The registry is itself a critical service. Every client needs registry-aware code (or a sidecar; see below). Server-side load balancing adds a hop.
Used for: traditional microservices on VMs (Consul + load balancer), some Kubernetes setups that use etcd directly through the API server.
Service mesh / sidecar: the modern default
Every pod gets a sidecar proxy (Envoy is the canonical implementation; used by Istio, Linkerd, and others). The application makes plain HTTP calls to service-b; the sidecar intercepts, looks up the current healthy endpoints, and forwards. The application never deals with discovery at all.
flowchart LR
AAPP(["Service A<br/>app code"]):::client
AENV[["Envoy sidecar<br/>(Pod A)"]]:::infra
BENV[["Envoy sidecar<br/>(Pod B)"]]:::infra
BAPP[("Service B<br/>app code")]:::server
CP[("Control plane<br/>Istio / Linkerd")]:::store
AAPP -->|"localhost"| AENV
AENV ==>|"mTLS, retries, LB"| BENV
BENV -->|"localhost"| BAPP
CP -.->|"push current endpoints"| AENV
CP -.->|"push current endpoints"| BENV
classDef client fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef store fill:#e9d5ff,stroke:#7e22ce,color:#581c87,stroke-width:1.5px
Strength. Discovery is free at the application level. The mesh also handles mTLS, retries, circuit breaking, traffic shifting, and observability. Polyglot teams stop worrying about library support per language.
Weakness. Operationally heavy: a control plane to maintain, sidecar per pod (memory and CPU overhead), a steeper learning curve, more moving parts during incidents. Sidecars also add a hop to every request, costing a small amount of latency.
Used for: production Kubernetes deployments at scale, mixed-language microservices, anywhere you want service-to-service mTLS and traffic policies as a platform capability.
Side by side
flowchart TB
subgraph D["DNS-based"]
direction LR
D1["update lag: 30s - 5min"]:::weak
D2["health awareness: none"]:::weak
D3["complexity: trivial"]:::strong
end
subgraph R["Registry-based"]
direction LR
R1["update lag: seconds"]:::strong
R2["health awareness: explicit"]:::strong
R3["complexity: registry to operate"]:::mid
end
subgraph M["Service mesh / sidecar"]
direction LR
M1["update lag: seconds, sometimes sub-second"]:::strong
M2["health awareness: rich, with retries and breakers"]:::strong
M3["complexity: significant; full control plane"]:::weak
end
classDef weak fill:#fed7aa,stroke:#c2410c,color:#7c2d12,stroke-width:1.5px
classDef mid fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
Picking an approach
flowchart TB
Q1{"Is the system small,<br/>and Kubernetes default service<br/>routing enough?"}:::query
Q2{"Do you need service mesh features<br/>(mTLS, traffic shifting, retries) as platform capabilities?"}:::query
Q3{"Are you on traditional VMs<br/>or hybrid infra?"}:::query
A1["DNS-based<br/>(Kubernetes Service + CoreDNS).<br/>Good default."]:::strong
A2["Service mesh.<br/>Istio, Linkerd, Consul Connect.<br/>Platform-grade."]:::strong
A3["Registry-based<br/>(Consul, Eureka).<br/>Works across VMs and k8s."]:::mid
Q1 -->|"yes"| A1
Q1 -->|"no"| Q2
Q2 -->|"yes"| A2
Q2 -->|"no"| Q3
Q3 -->|"yes"| A3
classDef query fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef mid fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
Most teams’ progression: start with DNS-based (Kubernetes Service + CoreDNS), graduate to a service mesh once mTLS, traffic shifting, or per-service retries become real requirements.
Two scenarios
Scenario one: a small SaaS on Kubernetes, ~10 services.
Built-in Kubernetes services with CoreDNS. Each service has a stable DNS name. Pods come and go; CoreDNS keeps the records updated. No mesh, no sidecar. Operational simplicity wins.
Scenario two: a fintech with 200 services across 4 languages and mandatory mTLS.
Istio with Envoy sidecars. Service-to-service mTLS is automatic. Traffic-shifting between versions during deploys is a platform capability. Per-service circuit breakers and retries are configured in the mesh, not in code. The complexity is real but every service inherits the resilience patterns for free.
What this connects to
- Load balancer basics. Discovery feeds the LB pool. See Load balancer: why, how, when.
- Health checks. What the discovery layer uses to decide who is healthy. See Health checks: liveness vs readiness vs startup.
- mTLS. Service mesh’s killer feature for security. See API key vs OAuth vs mTLS.
- Circuit breaker. Often configured in the mesh sidecar. See Circuit breaker.
- Microservices vs monolith. Discovery is a problem you only have once you have many services. See Microservices vs monolith.
Common mistakes
- DNS TTL too high. Dead pods stay in client caches for minutes. Drop TTL to 5-30 seconds for service discovery DNS.
- No health-based removal. Discovery returns IPs but does not know if they are healthy. Pair with readiness probes; let the discovery layer filter.
- Sidecar without a control plane. Envoy without Istio’s control plane is just Envoy. The discovery part is missing.
- Service mesh without a real need. A mesh adds operational load; without mTLS, traffic shifting, or polyglot microservices, you are paying for capability you do not use.
- Custom service discovery code. Almost always wrong. The platform’s primitives are tested by thousands of teams; your handwritten code is not.
- Cross-cluster discovery as an afterthought. Multi-cluster discovery is genuinely hard; pick a mesh that supports it explicitly if you need it.
Quick recap
- DNS-based: simple, slow to update, no health awareness. Fine for many setups.
- Registry-based: faster updates, health-aware. Common on VM-based microservices.
- Service mesh / sidecar: discovery plus mTLS, retries, circuit breaking, traffic shifting as platform capabilities. Heavy but powerful.
- Default to Kubernetes’ built-in DNS. Graduate to a mesh when you need its features, not because it sounds modern.
This concept sits in Stage 4 (Scaling and reliability) of the System Design Roadmap.