Partial Feature Rollout in Large-Scale Distributed Systems
A production-grade blueprint using Azure App Configuration, the .NET Feature Management SDK, Service Bus, Front Door, and Azure Monitor — written for engineers who ship to real traffic.
Who this is for: Senior backend engineers and architects building cloud-native systems on Azure. This is not a "what is a feature flag" primer — it is a production-grade design walkthrough with real SDK code, real trade-offs, and honest discussion of failure modes, multi-region topology, testing strategy, and where the Microsoft-native approach wins and loses against alternatives.
Table of Contents
- Problem Statement
- Key Requirements
- Why Azure App Configuration — and When to Choose Something Else
- High-Level Architecture
- Feature Flag Evaluation Flow
- The Cold-Start and Bootstrap Problem
- Rollout Strategies
- Building a Custom Feature Filter — TenantFilter In Full
- Variants and Experimentation — Beyond Boolean Flags
- Consistency vs. Availability Trade-offs
- Multi-Region Topology
- Flag Schema Evolution — Changing Rules Mid-Rollout
- Failure Modes and Mitigations
- Testing Strategy — Unit, Integration, and Contract Tests
- Observability
- Security and Governance
- Common Pitfalls and Anti-Patterns
- Production Readiness Checklist
1. Problem Statement
Shipping software used to mean one thing: deploy, and everyone gets it simultaneously. For a small team with a small user base, that is fine. For a system serving millions of requests per hour across dozens of microservices in multiple Azure regions, it is a liability.
The core problem is simultaneity. A NullReferenceException on an unhandled user configuration hits every user at once. A full rollback on AKS takes 10–30 minutes of degraded service. And a Git commit message does not satisfy the SOC 2 / ISO 27001 requirement to answer "what changed, when, who approved it, and what was the previous state."
A deployment is a mechanical act. A release is a business decision. Feature flags separate the two.
The solution is to ship code that is off by default, and then deliberately, measurably turn it on — for 1% of users, then a specific tenant, then 25%, then globally. Each step is observable, reversible, and audited.
2. Key Requirements
Before any code is written, five properties define the design envelope.
| Property | Requirement | Why It Matters |
|---|---|---|
| Safety / Blast-Radius Control | Kill-switch propagation < 30s; exact percentage targeting | A flag you cannot disable in under a minute is not a safety mechanism — it is theatre |
| Scalability | 10,000+ RPS per service, zero latency addition | Every API call evaluates flags; a slow evaluator becomes the bottleneck in every request path |
| Low Latency | In-process evaluation; no network call on hot path | Calling App Configuration on every request at 10K RPS would add 20–50ms per request and generate 10K API calls/second |
| Auditability | Immutable log: who, when, old value, new value | SOC 2 / ISO 27001 compliance; post-incident root-cause analysis |
| Governance | Approval workflows; no runtime write access | A service that can toggle its own flags is a compliance violation waiting to happen |
One critical consequence of the Low Latency requirement: the local in-process cache is non-negotiable. You load flags from Azure App Configuration at startup and refresh them on a background timer (default: 30 seconds). Flag evaluation is a dictionary lookup — microseconds, not milliseconds. The 30-second refresh window is the consistency trade-off you are deliberately accepting.
3. Why Azure App Configuration — and When to Choose Something Else
Architects justify their tool choices. Here is the honest comparison.
The Alternatives
LaunchDarkly is the market leader in dedicated feature flag platforms. It has a significantly richer targeting engine (multi-variate flags, advanced rules, real-time streaming, built-in experimentation with statistical significance), a purpose-built SDK with sub-second propagation, and first-class support for mobile and client-side SDKs. If your primary requirement is sophisticated A/B experimentation with business-analyst-friendly tooling and you are not constrained to the Microsoft ecosystem, LaunchDarkly is the stronger product for that use case.
Unleash (open-source) and Flagsmith (open-source / SaaS) are credible alternatives if you need self-hosted control over flag data for compliance reasons and do not want to be coupled to any cloud vendor's proprietary service.
Why Azure App Configuration Wins in the Microsoft Ecosystem
| Factor | Azure App Configuration | LaunchDarkly |
|---|---|---|
| Identity integration | Native Entra ID / Managed Identity — zero credential management | API key or OAuth; credential rotation required |
| Azure DevOps / GitHub Actions | First-class pipeline tasks; ARM and Bicep templates | Third-party action; separate secret management |
| Compliance boundary | All data stays in your Azure tenant, in your region, under your retention policies | Data leaves your tenant to LaunchDarkly's SaaS |
| Cost model | Included in your Azure spend / EA agreement | Per-seat SaaS pricing that scales uncomfortably at enterprise size |
| Propagation latency | 2–30s (push + poll hybrid described in this article) | Sub-second streaming (genuine advantage) |
| Experimentation depth | Basic variant support (v4 SDK); no statistical engine | Full A/B with p-values, confidence intervals, guardrail metrics |
The honest verdict: If you need statistical experimentation at scale (not just routing, but measuring lift with significance), LaunchDarkly or a dedicated experimentation platform like Azure Experimentation (preview) is the right call. If you need a governed, auditable, compliance-friendly feature flag system tightly integrated into an existing Azure estate, App Configuration is the right call. Most large Microsoft-stack enterprises need the latter.
4. High-Level Architecture
The system has four logical planes, each with a clear responsibility mapped to specific Azure services.
| Plane | Azure Service | Responsibility |
|---|---|---|
| Control Plane | Azure App Configuration | Source of truth for flag definitions, filter configs, and environment labels |
| Control Plane | Azure Key Vault | Referenced secrets (never stored in App Configuration directly) |
| Data Plane | Microsoft.FeatureManagement SDK | In-process flag evaluation using a locally cached snapshot — no network call per request |
| Data Plane | Azure Front Door | Edge-level traffic routing: canary backends, header injection, geo-targeting |
| Data Plane | Azure Service Bus | Push-based flag-change propagation for near-instant cache invalidation (<2s) |
| Observation Plane | Azure Monitor + App Insights | Flag evaluation telemetry, per-flag error rates, latency metrics, and alerting |
| Observation Plane | Log Analytics Workspace | Centralised KQL-queryable audit and evaluation logs across all services and regions |
| Governance Plane | Azure DevOps / GitHub Actions | Flag lifecycle: creation, environment promotion, approval gates, scheduled expiry |
| Governance Plane | Azure Entra ID | Data Reader for service principals; Data Owner for release managers only |
The Request Flow
[Client Request]
│
▼
[Azure Front Door] ──── WAF + Rules Engine
│ (injects X-Feature-Variant header
│ or routes to canary backend pool)
▼
[AKS / App Service — .NET 8 API]
│
│ IFeatureManager.IsEnabledAsync("NewCheckout", ctx)
▼
[In-Process Cache] ──── refreshed every 30s from App Configuration
│ (background thread — NOT the hot path)
│ Applies: TargetingFilter, PercentageFilter,
│ TimeWindowFilter, TenantFilter (custom)
▼
[Decision: ON / OFF / Variant]
│
▼
[Business Logic Branch] ──── telemetry event → Application Insights
Key Design Insight: Azure App Configuration is never on the hot path. The SDK maintains a local in-memory cache, refreshed asynchronously. Your 10,000 RPS service makes approximately 2 calls to App Configuration per minute per replica — not per request.
5. Feature Flag Evaluation Flow
Step 1 — SDK Registration in Program.cs
dotnet add package Microsoft.FeatureManagement.AspNetCore
dotnet add package Microsoft.Extensions.Configuration.AzureAppConfiguration
Wire up App Configuration with Managed Identity — no connection strings, no secrets in config:
// Always use Managed Identity — never a connection string in config
builder.Configuration.AddAzureAppConfiguration(options =>
{
options
.Connect(new Uri(appConfigEndpoint), new ManagedIdentityCredential())
.Select(KeyFilter.Any, LabelFilter.Null) // baseline (no label)
.Select(KeyFilter.Any, environmentLabel) // env override: "production"
.UseFeatureFlags(ff =>
{
ff.Label = environmentLabel;
ff.CacheExpirationInterval = TimeSpan.FromSeconds(30);
})
.ConfigureRefresh(refresh =>
{
// Refresh ALL flags when the sentinel key changes
refresh
.Register(".appconfig.featureflags", refreshAll: true)
.SetCacheExpiration(TimeSpan.FromSeconds(30));
});
});
builder.Services.AddAzureAppConfiguration();
builder.Services
.AddFeatureManagement()
.AddFeatureFilter<TargetingFilter>()
.AddFeatureFilter<PercentageFilter>()
.AddFeatureFilter<TimeWindowFilter>()
.AddFeatureFilter<TenantFilter>(); // implemented in full in Section 8
// Middleware — drives the background refresh loop
app.UseAzureAppConfiguration();
Step 2 — Strongly Typed Flag Constants
Never use magic strings at call sites. A CI step validates that every constant in FeatureFlags.cs exists in App Configuration — if it does not, the build fails (see Section 14 for the contract test that enforces this):
// One source of truth for flag names.
// A CI contract test validates every constant exists in App Configuration.
public static class FeatureFlags
{
public const string NewCheckout = "NewCheckout";
public const string V2PricingEngine = "V2PricingEngine";
public const string AiRecommendations = "AiRecommendations";
public const string BulkExportV2 = "BulkExportV2";
}
Step 3 — ITargetingContextAccessor Implementation
This is where your identity model meets the flag system. Groups are how you target millions of users via a handful of rules — without exploding the flag configuration size:
public class HttpTargetingContextAccessor : ITargetingContextAccessor
{
    private readonly IHttpContextAccessor _http;

    public HttpTargetingContextAccessor(IHttpContextAccessor http) => _http = http;

    public ValueTask<TargetingContext> GetContextAsync()
    {
        var user = _http.HttpContext?.User;
        var userId = user?.FindFirstValue(ClaimTypes.NameIdentifier) ?? "anonymous";
        var tenantId = user?.FindFirstValue("tid") ?? "default";
        var tier = user?.FindFirstValue("subscription_tier") ?? "free";
        return ValueTask.FromResult(new TargetingContext
        {
            UserId = userId,
            // Groups let you target millions of users via a handful of rules
            Groups = [$"tenant:{tenantId}", $"tier:{tier}"]
        });
    }
}
Step 4 — Evaluation at the Call Site
Centralise flag evaluation in the service layer — never in the data access layer or raw controllers:
public class CheckoutService
{
private readonly IFeatureManager _features;
public async Task<OrderResult> PlaceOrderAsync(Cart cart, CancellationToken ct)
{
// One evaluation. One consistent decision for this request.
var useNewCheckout = await _features.IsEnabledAsync(FeatureFlags.NewCheckout);
return useNewCheckout
? await _newCheckoutPipeline.ExecuteAsync(cart, ct)
: await _legacyCheckoutPipeline.ExecuteAsync(cart, ct);
}
}
6. The Cold-Start and Bootstrap Problem
This is the gap most feature flag articles skip entirely, and it is the first thing an architect asks in a design review: what happens before the cache is warm?
Quick Reference — The Four Mitigations
| # | Mitigation | Protects Against |
|---|---|---|
| 1 | Startup timeout with graceful fallback | App Configuration unreachable at boot → pod crash |
| 2 | Embed safe defaults in appsettings.json | Cold start with no App Configuration response → unknown state |
| 3 | Readiness probe with initialDelaySeconds | Pod receiving traffic before cache is warm |
| 4 | Explicit RefreshAsync() in background workers | Non-HTTP workers never hitting the refresh middleware |
The Problem
On first startup — or on a cold-start in a scale-to-zero environment like Azure Container Apps or Consumption-plan Functions — the AddAzureAppConfiguration call in Program.cs performs a synchronous, blocking load from App Configuration before the application accepts any traffic. Three failure scenarios emerge:
App Configuration is unreachable at boot. The application crashes during startup and Kubernetes restarts the pod in a crash loop. During a rolling update the old replica set keeps serving as long as the new pods never become ready, but a scale-out, node drain, or fresh deployment that coincides with the outage can leave you with no healthy pods at all.
App Configuration responds slowly at boot. Cold start latency spikes. On Azure Container Apps with scale-to-zero, this adds directly to the first-request latency experienced by the user.
Partial flag state at startup. If App Configuration returns a subset of keys (due to a transient error mid-response), the local cache is inconsistently populated. Some flags are missing and silently fall back to their defaults.
The Mitigations
Mitigation 1: Startup timeout with a graceful fallback
Configure the initial load with an explicit timeout. If App Configuration does not respond within the timeout, start with embedded defaults — do not crash:
builder.Configuration.AddAzureAppConfiguration(options =>
{
    options
        .Connect(new Uri(appConfigEndpoint), new ManagedIdentityCredential())
        .UseFeatureFlags(ff => { ff.Label = environmentLabel; })
        .ConfigureStartupOptions(startupOptions =>
        {
            // Bound how long the initial load may block startup.
            startupOptions.Timeout = TimeSpan.FromSeconds(10);
        });
},
// optional: true lets startup continue on the embedded defaults when the
// initial load fails or times out. Without it the provider throws and the
// application crashes. Background refresh corrects the state later.
optional: true);
Mitigation 2: Embed safe defaults as appsettings.json fallback
The AddAzureAppConfiguration call merges on top of the existing IConfiguration. Pre-populate appsettings.json with all feature flags set to their safe default state (the old code path). If App Configuration is unreachable at startup, the service runs with known-safe defaults instead of crashing:
// appsettings.json — safe defaults for every flag
// These are OVERRIDDEN by App Configuration on successful load
{
"FeatureManagement": {
"NewCheckout": false,
"V2PricingEngine": false,
"AiRecommendations": false,
"BulkExportV2": false
}
}
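The layering that Mitigation 2 relies on can be sketched in a few lines. This is a minimal illustration, not the configuration system's actual code: `FlagDefaults` and `Resolve` are hypothetical names, and flags are reduced to plain booleans. It shows why a missing or partial remote snapshot still leaves every flag in a known state.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating layered configuration; not an SDK type.
public static class FlagDefaults
{
    // Starts from the embedded defaults and lets remote values override them
    // key by key. A null or partial remote snapshot leaves the remaining
    // flags at their embedded defaults.
    public static Dictionary<string, bool> Resolve(
        IReadOnlyDictionary<string, bool> embedded,
        IReadOnlyDictionary<string, bool>? remote)
    {
        // .NET configuration keys are case-insensitive, so mirror that here.
        var effective = new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);
        foreach (var (flag, value) in embedded) effective[flag] = value;
        if (remote is not null)
        {
            foreach (var (flag, value) in remote) effective[flag] = value;
        }
        return effective;
    }
}
```

The useful property is that the embedded layer enumerates every flag, so a flag that the remote snapshot omits resolves to the old code path rather than to an undefined state.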
Mitigation 3: Health check for scale-to-zero environments
For Azure Container Apps and Azure Functions on Consumption plan, the cold-start problem compounds because App Configuration's Managed Identity token acquisition adds its own latency (50–200ms for the first token on a fresh instance). Expose a startup health probe and configure a generous initialDelaySeconds:
builder.Services.AddHealthChecks()
.AddAzureAppConfiguration(
name: "appconfig",
tags: ["ready"]);
// Kubernetes / Container Apps readiness probe
// Do not mark the pod ready until App Configuration is warm
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
Predicate = check => check.Tags.Contains("ready")
});
# AKS deployment — give the pod time to warm the cache before receiving traffic
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 3
Mitigation 4: Pre-warm the cache in the background service, not the request pipeline
For long-running background workers (Azure Service Bus consumers, Hangfire workers), the UseAzureAppConfiguration() middleware is not available since there is no HTTP pipeline. Drive refresh explicitly via a hosted service:
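public class AppConfigWarmupService : IHostedService
{
    private readonly IEnumerable<IConfigurationRefresher> _refreshers;

    // AddAzureAppConfiguration() registers IConfigurationRefresherProvider,
    // not IConfigurationRefresher itself.
    public AppConfigWarmupService(IConfigurationRefresherProvider provider)
        => _refreshers = provider.Refreshers;

    public async Task StartAsync(CancellationToken ct)
    {
        // Eagerly refresh on worker startup; do not wait for the first poll interval
        foreach (var refresher in _refreshers)
            await refresher.TryRefreshAsync(ct);
    }

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}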
Scale-to-Zero Rule of Thumb: If your p99 cold-start budget is under 500ms, pre-populate all flag defaults in appsettings.json and treat App Configuration as an async correction layer, not a blocking startup dependency.
7. Rollout Strategies
7.1 — Percentage-Based Rollout
The workhorse of gradual exposure. The Microsoft.Targeting filter hashes each user deterministically (in the current .NET implementation, a SHA-256 hash over the user ID combined with the feature name) and maps the result to a position between 0 and 100. The same user always lands in the same position for a given flag, so there is no flickering between requests.
{
"id": "NewCheckout",
"description": "Gradual rollout of the new checkout flow — Sprint 42",
"enabled": true,
"conditions": {
"client_filters": [
{
"name": "Microsoft.Targeting",
"parameters": {
"Audience": {
"DefaultRolloutPercentage": 10,
"Groups": [
{ "Name": "tier:enterprise", "RolloutPercentage": 100 },
{ "Name": "tier:pro", "RolloutPercentage": 50 }
]
}
}
}
]
}
}
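To make the stable-bucket idea behind percentage rollout concrete, here is a minimal sketch of deterministic bucketing. It is an illustration under stated assumptions, not the SDK's code: `RolloutBucket` is a hypothetical helper, SHA-256 from the base class library stands in for whatever hash the filter uses internally, and the exact hash inputs in the SDK may differ.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

// Illustrative only: a hypothetical helper, not part of Microsoft.FeatureManagement.
public static class RolloutBucket
{
    // Maps (userId, flagName) to a stable position in [0, 100).
    public static double Position(string userId, string flagName)
    {
        // Including the flag name gives every flag its own independent bucketing,
        // so the same users are not always the first to receive every feature.
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{userId}\n{flagName}"));
        uint marker = BinaryPrimitives.ReadUInt32LittleEndian(hash);
        return marker / (uint.MaxValue + 1.0) * 100.0;
    }

    // A user is in the rollout when their position falls below the percentage.
    public static bool IsInRollout(string userId, string flagName, double percentage) =>
        Position(userId, flagName) < percentage;
}
```

Because a user's position never changes, raising the percentage from 10 to 25 only adds users; nobody who already had the feature loses it.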
Production Staging Pattern: Follow this schedule: 1% → 5% → 10% → 25% → 50% → 100%. Encode it as pipeline stages with approval gates. Wait at least one full business cycle (24 hours) at each stage. Compare error rates and p99 latency between flag-ON and flag-OFF cohorts using KQL before advancing.
7.2 — User / Tenant-Based Targeting
Essential for beta programs, early-access customers, and debugging production issues against a specific tenant without a broader rollout:
{
"id": "V2PricingEngine",
"enabled": true,
"conditions": {
"client_filters": [{
"name": "Microsoft.Targeting",
"parameters": {
"Audience": {
"Users": ["alice@contoso.com", "qa-bot@internal.com"],
"Groups": [
{ "Name": "tenant:fabrikam-ltd", "RolloutPercentage": 100 },
{ "Name": "tenant:northwind-inc", "RolloutPercentage": 100 }
],
"DefaultRolloutPercentage": 0
}
}
}]
}
}
DefaultRolloutPercentage: 0 means no one outside the explicit list or groups sees the flag — your closed beta pattern.
7.3 — Environment-Based Rollout via Labels
Azure App Configuration's label system scopes a configuration value to an environment. The same flag key can be enabled: true in development and enabled: false in production:
# Development: AI Recommendations is fully on
# (feature set only creates or updates the definition; enable turns it on)
az appconfig feature enable \
--name my-appconfig --feature AiRecommendations --label development --yes
# Production: deliberately off until approved by release manager
az appconfig feature disable \
--name my-appconfig --feature AiRecommendations --label production --yes
// SDK automatically selects the correct label at startup
options.UseFeatureFlags(ff =>
{
ff.Label = Environment.GetEnvironmentVariable("AZURE_APP_CONFIG_LABEL");
// "development" | "staging" | "production"
});
The SDK merges labels: baseline values are the fallback, environment-specific labels override them. Your pipeline promotes a flag by setting the production label — no code changes, no redeployment.
7.4 — Kill Switches and Instant Rollback
A kill switch is a flag with enabled: false — but propagation speed is what makes it a kill switch rather than a slow rollback.
Polling (baseline): The SDK refreshes every 30 seconds. Worst case: 30 seconds to propagate across all replicas.
Push via Event Grid + Service Bus (optimisation): App Configuration emits change events to Event Grid → Service Bus → your services call IConfigurationRefresher.RefreshAsync() immediately. Typical propagation: under 2 seconds.
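public class FeatureFlagRefreshConsumer : BackgroundService
{
    private readonly IConfigurationRefresher _refresher;
    private readonly ServiceBusProcessor _processor;
    private readonly ILogger<FeatureFlagRefreshConsumer> _logger;

    public FeatureFlagRefreshConsumer(
        IConfigurationRefresherProvider refresherProvider,
        ServiceBusProcessor processor,
        ILogger<FeatureFlagRefreshConsumer> logger)
    {
        // AddAzureAppConfiguration() registers the provider, not the refresher itself
        _refresher = refresherProvider.Refreshers.First();
        _processor = processor;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        _processor.ProcessMessageAsync += OnMessageAsync;
        _processor.ProcessErrorAsync += OnErrorAsync;
        await _processor.StartProcessingAsync(ct);
    }

    private async Task OnMessageAsync(ProcessMessageEventArgs args)
    {
        // Service Bus carries the Event Grid event as the message body
        var evt = EventGridEvent.Parse(args.Message.Body);
        if (evt.EventType == "Microsoft.AppConfiguration.KeyValueModified"
            && evt.TryCreatePushNotification(out PushNotification notification))
        {
            // RefreshAsync alone is a no-op until the refresh interval elapses.
            // ProcessPushNotification marks the cached values dirty so the
            // refresh below actually reloads them.
            _refresher.ProcessPushNotification(notification, TimeSpan.Zero);
            await _refresher.TryRefreshAsync(args.CancellationToken);
            _logger.LogInformation(
                "Flag cache refreshed. Subject={Subject} CorrelationId={Id}",
                evt.Subject, args.Message.CorrelationId);
        }
        await args.CompleteMessageAsync(args.Message);
    }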
// Always log + emit a metric here. Never rethrow — let the processor recover.
private Task OnErrorAsync(ProcessErrorEventArgs args)
{
_logger.LogError(
args.Exception,
"Service Bus error in flag refresh consumer. " +
"Source={Source} EntityPath={EntityPath}. " +
"Falling back to polling TTL.",
args.ErrorSource,
args.EntityPath);
// Emit a metric so your alert rule fires if this happens repeatedly
// e.g. _telemetry.TrackMetric("FeatureFlagRefresh.ServiceBusError", 1);
return Task.CompletedTask; // Do not throw — processor will retry
}
}
⚠️ Anti-Pattern Warning: Do not make push refresh a hard dependency. Service Bus has its own availability SLA. Push refresh is an optimisation that reduces typical propagation from 30s to <2s. Always keep the polling path active as the fallback.
8. Building a Custom Feature Filter — TenantFilter In Full
The built-in Microsoft.Targeting filter handles most multi-tenant use cases. But there are scenarios where you need filter logic that goes beyond what targeting expressions support — for example, checking a tenant's contract state from a database, applying geo-regulatory rules, or gating on a tenant's Azure subscription SKU.
Here is a complete, production-ready TenantFilter implementation:
The Filter Parameters Contract
// The parameters object that maps to the "parameters" block
// in the App Configuration feature flag JSON
public class TenantFilterParameters
{
// Explicit tenant IDs that should always have the flag ON
public List<string> AllowedTenants { get; set; } = [];
// Tenant IDs that should always have the flag OFF (denylist overrides allowlist)
public List<string> BlockedTenants { get; set; } = [];
// Optional: only allow tenants on specific subscription tiers
public List<string> RequiredTiers { get; set; } = [];
}
The Filter Implementation
[FilterAlias("TenantFilter")]
public class TenantFilter : IFeatureFilter
{
private readonly IHttpContextAccessor _http;
private readonly ITenantRepository _tenantRepo;
private readonly ILogger<TenantFilter> _logger;
public TenantFilter(
IHttpContextAccessor http,
ITenantRepository tenantRepo,
ILogger<TenantFilter> logger)
{
_http = http;
_tenantRepo = tenantRepo;
_logger = logger;
}
public async Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
{
var parameters = context.Parameters
.Get<TenantFilterParameters>() ?? new TenantFilterParameters();
var tenantId = _http.HttpContext?.User
.FindFirstValue("tid");
if (string.IsNullOrEmpty(tenantId))
{
_logger.LogDebug(
"TenantFilter: no tenant claim found for flag {Flag}. Returning false.",
context.FeatureName);
return false;
}
// Denylist always wins — an explicitly blocked tenant gets OFF regardless
if (parameters.BlockedTenants.Contains(tenantId, StringComparer.OrdinalIgnoreCase))
{
_logger.LogInformation(
"TenantFilter: tenant {TenantId} is blocked for flag {Flag}.",
tenantId, context.FeatureName);
return false;
}
// Explicit allowlist: if the list is non-empty and the tenant is in it, ON
if (parameters.AllowedTenants.Count > 0)
{
return parameters.AllowedTenants
.Contains(tenantId, StringComparer.OrdinalIgnoreCase);
}
// Tier gate: optionally restrict to tenants on specific subscription tiers
if (parameters.RequiredTiers.Count > 0)
{
var tenant = await _tenantRepo.GetAsync(tenantId);
if (tenant is null)
{
_logger.LogWarning(
"TenantFilter: tenant {TenantId} not found in repository. Returning false.",
tenantId);
return false;
}
return parameters.RequiredTiers
.Contains(tenant.SubscriptionTier, StringComparer.OrdinalIgnoreCase);
}
// No rules configured — default open (all tenants pass)
return true;
}
}
The Corresponding App Configuration JSON
{
"id": "BulkExportV2",
"enabled": true,
"conditions": {
"client_filters": [
{
"name": "TenantFilter",
"parameters": {
"AllowedTenants": [],
"BlockedTenants": ["tenant-under-legal-hold-123"],
"RequiredTiers": ["enterprise", "enterprise-plus"]
}
}
]
}
}
Registration
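builder.Services
    .AddFeatureManagement()
    .AddFeatureFilter<TenantFilter>();
// AddFeatureManagement() registers filters as singletons, so TenantFilter's
// dependencies must be singletons too. A scoped repository here would become
// a captive dependency and fail scope validation.
builder.Services.AddSingleton<ITenantRepository, CosmosTenantRepository>();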
Performance Note: TenantFilter hits ITenantRepository on every evaluation when RequiredTiers is configured. Cache the tenant record in IMemoryCache with a short TTL (60 seconds) rather than going to Cosmos DB on every request. The filter itself has no internal caching; that is your responsibility.
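A short-TTL cache in front of the tenant lookup might look like the sketch below. It is a minimal illustration rather than a drop-in component: `TenantTierCache` is a hypothetical name, the loader delegate stands in for the repository call, and a `ConcurrentDictionary` is used instead of `IMemoryCache` only to keep the example free of package dependencies.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper: caches tenant tier lookups for a short TTL.
public sealed class TenantTierCache
{
    private readonly Func<string, Task<string?>> _load;   // stands in for the repository call
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _now;           // injectable clock for tests
    private readonly ConcurrentDictionary<string, (string? Tier, DateTimeOffset Expires)> _entries =
        new(StringComparer.OrdinalIgnoreCase);

    public TenantTierCache(
        Func<string, Task<string?>> load, TimeSpan ttl, Func<DateTimeOffset>? now = null)
    {
        _load = load;
        _ttl = ttl;
        _now = now ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<string?> GetTierAsync(string tenantId)
    {
        // Fresh hit: answer from memory without touching the repository.
        if (_entries.TryGetValue(tenantId, out var entry) && entry.Expires > _now())
            return entry.Tier;

        // Miss or expired entry: reload and remember the result, including "not found".
        string? tier = await _load(tenantId);
        _entries[tenantId] = (tier, _now() + _ttl);
        return tier;
    }
}
```

Two concurrent misses for the same tenant can both reach the repository; for a 60-second TTL that duplication is usually acceptable, and it avoids holding a lock across an asynchronous call.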
9. Variants and Experimentation — Beyond Boolean Flags
The Microsoft.FeatureManagement v4 SDK introduced variants — the ability to assign different values to different user segments, not just ON/OFF. This is the foundation of A/B testing and multivariate experimentation without reaching for a third-party platform.
Why Variants Matter
A boolean flag answers: should this user see the new feature? A variant answers: which version of the feature should this user see?
The canonical use case: testing two different checkout button colours, three different pricing display formats, or two versions of an AI recommendation algorithm — where the outcome metric differs between variants, not just whether the code path runs.
Defining a Variant Flag
{
"id": "CheckoutButtonVariant",
"enabled": true,
"variants": [
{
"name": "Control",
"configuration_value": "blue"
},
{
"name": "Treatment",
"configuration_value": "green"
}
],
"allocation": {
"default_when_enabled": "Control",
"percentile": [
{ "variant": "Control", "from": 0, "to": 50 },
{ "variant": "Treatment", "from": 50, "to": 100 }
]
},
"telemetry": {
"enabled": true
}
}
Evaluating a Variant in .NET
public class CheckoutController : ControllerBase
{
private readonly IVariantFeatureManager _variantManager;
[HttpGet("checkout")]
public async Task<IActionResult> GetCheckoutPage()
{
// GetVariantAsync returns the assigned variant for the current user
var variant = await _variantManager
.GetVariantAsync("CheckoutButtonVariant", HttpContext.RequestAborted);
var buttonColour = variant?.Configuration?.Value ?? "blue";
// Emit telemetry — this is how you measure which variant converts better
_telemetry.TrackEvent("CheckoutPage.Rendered", new Dictionary<string, string>
{
["Variant"] = variant?.Name ?? "default",
["ButtonColour"] = buttonColour,
["UserId"] = User.FindFirstValue(ClaimTypes.NameIdentifier),
});
return Ok(new { buttonColour });
}
}
The Honest Limitation — and What to Do About It
The Microsoft.FeatureManagement v4 SDK gives you variant assignment and basic telemetry emission. It does not give you statistical analysis, confidence intervals, p-values, or guardrail metric monitoring out of the box.
For a lightweight in-house analysis, you can wire the emitted CheckoutPage.Rendered events directly into a KQL query in Log Analytics to measure conversion lift between variants:
// Step 1: join variant assignment to conversion events
let assignments = customEvents
| where name == "CheckoutPage.Rendered"
| project UserId = tostring(customDimensions["UserId"]),
Variant = tostring(customDimensions["Variant"]),
SessionId = session_Id;
let conversions = customEvents
| where name == "Order.Placed"
| project UserId = tostring(customDimensions["UserId"]),
SessionId = session_Id;
// Step 2: compute conversion rate per variant
assignments
| join kind=leftouter conversions on UserId, SessionId
| summarize
Users = dcount(UserId),
Converted = dcountif(UserId, isnotempty(SessionId1))
by Variant
| extend ConversionRate = round(100.0 * Converted / Users, 2)
| project Variant, Users, Converted, ConversionRate
This tells you which variant converts better and at what sample size. It does not compute statistical significance automatically — for that, feed the raw counts into a two-proportion z-test (trivial in Python or R) or use Azure Experimentation (currently in preview), which handles the full experiment lifecycle including guardrail metrics and auto-stopping rules.
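The two-proportion z-test mentioned above is short enough to sketch directly. This is a minimal illustration using the pooled-variance form; `ConversionTest` is a hypothetical helper, and the inputs are the Users and Converted counts per variant from the KQL query. The statistic is

$$z = \frac{p_T - p_C}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_C} + \frac{1}{n_T}\right)}}, \qquad \hat{p} = \frac{x_C + x_T}{n_C + n_T}$$

```csharp
using System;

// Hypothetical helper for a quick significance check on variant counts.
public static class ConversionTest
{
    // Two-proportion z statistic using the pooled conversion rate.
    // Positive values mean the treatment converts better than control.
    public static double ZScore(
        long controlUsers, long controlConverted,
        long treatmentUsers, long treatmentConverted)
    {
        if (controlUsers <= 0 || treatmentUsers <= 0)
            throw new ArgumentException("Both variants need at least one user.");

        double pControl = (double)controlConverted / controlUsers;
        double pTreatment = (double)treatmentConverted / treatmentUsers;
        double pooled = (double)(controlConverted + treatmentConverted)
                        / (controlUsers + treatmentUsers);
        double standardError = Math.Sqrt(
            pooled * (1 - pooled) * (1.0 / controlUsers + 1.0 / treatmentUsers));

        // No variance at all (everyone or no one converted): no detectable difference.
        return standardError == 0 ? 0 : (pTreatment - pControl) / standardError;
    }

    // |z| > 1.96 corresponds to p < 0.05 for a two-sided test.
    public static bool IsSignificantAt95(double z) => Math.Abs(z) > 1.96;
}
```

For example, 100 conversions out of 1,000 control users against 130 out of 1,000 treatment users gives z of roughly 2.1, which clears the 95% threshold. Checking the result repeatedly while the experiment runs inflates the false-positive rate, so decide the sample size up front.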
10. Consistency vs. Availability Trade-offs
Feature flag systems run into the CAP theorem in the most practical, production-visible way: every cached flag is a choice of availability over consistency. The wrong consistency model causes split-brain bugs that are extremely painful to debug.
The Staleness Window
During the 30-second refresh window, two replicas of the same service can have different views of a flag. Replica A has NewCheckout=ON; Replica B still has NewCheckout=OFF. If a user's retry goes to a different pod, they see inconsistent behaviour.
This is an accepted trade-off. Three mitigations:
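- Front Door session affinity: Route a user to the same origin during a session. Front Door pins a session to an origin, not to an individual pod, so pair it with ingress-level affinity inside the cluster if you need pod stickiness. This masks replica staleness at the cost of slightly uneven load distribution.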
- Idempotent design: Both code paths should produce equivalent state mutations. If both old and new checkout create the same order record, a mid-session switch is invisible to the user.
- Shorter TTL for kill switches: Maintain a separate configuration category for emergency flags with a 5-second TTL. This narrows the consistency window for safety-critical toggles specifically.
Cross-Service Flag Consistency — The Hard Part
In a microservices architecture, NewCheckout may be read by OrderService, InventoryService, and NotificationService. If it is enabled in OrderService but not yet propagated to InventoryService, you can create partially executed distributed transactions. This is the correctness hazard that kills you.
The solution: evaluate the flag exactly once at the system boundary that owns the transaction, then propagate the decision via request context — not the flag name:
// OrderController — evaluate at the API boundary, once, for this transaction
var useNewCheckout = await _features.IsEnabledAsync(FeatureFlags.NewCheckout);
// Propagate the DECISION — not a re-evaluation request
var client = _factory.CreateClient("InventoryService");
client.DefaultRequestHeaders.Add(
"X-Feature-NewCheckout", useNewCheckout ? "1" : "0");
// --- InventoryService (a separate process) ---
// Reads the propagated decision and never re-evaluates the flag, so all services
// operate under one consistent flag state for this transaction.
// A missing header evaluates to false, which selects the legacy path.
var useNewCheckout = httpContext.Request.Headers["X-Feature-NewCheckout"] == "1";
Core Principle: In a distributed transaction, the flag evaluation must happen exactly once, at the boundary that owns the transaction. Re-evaluating at each service boundary creates a distributed consistency hazard — each service might read a different cached state, splitting the transaction across different code paths.
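One detail the propagation pattern leaves implicit is what a downstream service should do when the header is absent or malformed, for example when a call arrives from an older client. A minimal sketch, assuming the `X-Feature-` header convention shown above; `FlagDecisionHeader` is a hypothetical helper, not an SDK type:

```csharp
#nullable enable
using System;

// Hypothetical helper for carrying a flag decision across service calls.
public static class FlagDecisionHeader
{
    public const string Prefix = "X-Feature-";

    // Header name for a given flag, e.g. "X-Feature-NewCheckout".
    public static string NameFor(string flagName) => Prefix + flagName;

    // The boundary service encodes its single evaluation result.
    public static string Encode(bool enabled) => enabled ? "1" : "0";

    // Downstream services decode the decision. Anything other than "1" or "0"
    // falls back to safeDefault, so a missing header never turns a feature on.
    public static bool Decode(string? headerValue, bool safeDefault = false) =>
        headerValue switch
        {
            "1" => true,
            "0" => false,
            _ => safeDefault
        };
}
```

Keeping the encode and decode logic in one shared helper avoids the subtle mismatch where one service writes "true" and another only recognises "1".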
11. Multi-Region Topology
This is the most architect-specific section of this article and the most commonly skipped in feature flag write-ups. If you are running in multiple Azure regions — say, eastus and westeurope — the following topology questions have direct production consequences.
Azure App Configuration's Replication Model
App Configuration is a single-region, geo-redundant service. When you create an App Configuration instance, you choose a primary region. Azure replicates data to a secondary region within the same geography for disaster recovery, but the secondary is not an active read replica — it is a failover target. There is no multi-master write capability and no automatic cross-region read distribution.
The implication: every replica in your westeurope AKS cluster polling the App Configuration instance you created in eastus is making a cross-region API call — adding 60–100ms of latency to the background refresh, and creating a dependency on cross-region network health for flag propagation.
The Recommended Multi-Region Topology
Create one App Configuration instance per region and use your CI/CD pipeline to synchronise flag state across instances:
[Release Pipeline — Azure DevOps]
│
├──► az appconfig feature set ... --name appconfig-eastus --label production
│
└──► az appconfig feature set ... --name appconfig-westeurope --label production
Each regional AKS cluster reads from the App Configuration instance in its own region. Cross-region latency is eliminated from the refresh path. Regional isolation means a flag change in eastus does not block because westeurope's App Configuration is unavailable.
// Each region reads from its own App Configuration instance
// AZURE_APPCONFIG_ENDPOINT is set per-region in the AKS pod environment
var appConfigEndpoint = Environment.GetEnvironmentVariable("AZURE_APPCONFIG_ENDPOINT");
// eastus pods: https://my-appconfig-eastus.azconfig.io
// westeurope pods: https://my-appconfig-westeurope.azconfig.io
builder.Configuration.AddAzureAppConfiguration(options =>
options.Connect(new Uri(appConfigEndpoint), new ManagedIdentityCredential())
.UseFeatureFlags(ff => { ff.Label = environmentLabel; })
);
Regional Drift — The New Problem
Dual-instance topology introduces a new failure mode: regional drift. If the DevOps pipeline fails after updating eastus but before updating westeurope, your two regions run under different flag states indefinitely — until someone notices or the pipeline retries.
Mitigations:
Pipeline atomicity: Make the multi-region flag update a single pipeline job with a fail-fast policy. If any region update fails, alert and do not proceed. Do not partially update.
Drift detection query: Run this KQL query hourly as a scheduled monitor alert. It compares flag evaluation telemetry across regions and alerts if the same flag has meaningfully different exposure rates in different regions — which is the fingerprint of drift:
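customEvents
| where name == "FeatureFlag.Evaluated"
| extend Region = tostring(customDimensions["Region"]) // custom Region property emitted with each evaluation event
| extend Flag = tostring(customDimensions["Flag"])
| extend Enabled = tobool(customDimensions["Enabled"])
| summarize ExposurePct = round(100.0 * countif(Enabled) / count(), 1)
    by Flag, Region, bin(timestamp, 1h)
// Keep regions at 0% exposure: "on in one region, off in another" is the clearest drift signal
| summarize MinPct = min(ExposurePct), MaxPct = max(ExposurePct), Regions = dcount(Region)
    by Flag, timestamp
// Alert if two regions differ by more than 5 percentage points for the same flag
| where Regions > 1 and MaxPct - MinPct > 5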
App Configuration replica feature (preview): Microsoft's geo-replica feature lets you configure a replica endpoint in a second region, managed entirely by App Configuration, with no dual-instance pipeline synchronisation required. The SDK automatically fails over to the replica if the primary is unreachable.
The tradeoff versus the dual-instance approach:
| | Geo-Replica (Preview) | Dual-Instance (GA) |
|---|---|---|
| Operational overhead | Low — Microsoft manages replication | High — your pipeline must stay atomic |
| Drift risk | None — replication is managed | Real — partial pipeline failures cause drift |
| GA status | Preview (no production SLA guarantee yet) | GA |
| Failover control | Automatic, SDK-managed | Manual / pipeline-driven |
Recommendation: prefer dual-instance for production deployments today. Migrate to geo-replica when it reaches GA and the SLA is published. Monitor the App Configuration roadmap for updates.
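With two instances, drift can also be checked against the flag state itself instead of being inferred from telemetry. The following is a minimal sketch, assuming each region's flags have already been read into a dictionary of flag name to normalised JSON value. How they are read is left out, and `FlagDrift` and `Find` are hypothetical names, not part of any Azure SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlagDrift
{
    // Returns the names of flags whose stored value differs between two regions,
    // including flags that exist in only one of them.
    // Assumption: values are normalised strings (same key order, no whitespace noise).
    public static IReadOnlyList<string> Find(
        IReadOnlyDictionary<string, string> regionA,
        IReadOnlyDictionary<string, string> regionB)
    {
        return regionA.Keys.Union(regionB.Keys)
            .Where(name =>
                !regionA.TryGetValue(name, out var a) ||
                !regionB.TryGetValue(name, out var b) ||
                !string.Equals(a, b, StringComparison.Ordinal))
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }
}
```

A non-empty result after a pipeline run is a reasonable trigger for the re-sync alert. Comparing raw strings is deliberately strict, so a real check would probably compare parsed flag definitions.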
Regional Outage Behaviour
If your primary region's App Configuration instance becomes unavailable:
- In-process cache: Services continue operating with their last known flag state until the TTL expires. After the TTL, flags fall back to embedded defaults.
- Push refresh via Service Bus: If the regional Service Bus namespace is also affected, push refresh fails silently. The polling fallback continues.
- Startup of new replicas: New pods starting during a regional App Configuration outage will use the embedded appsettings.json defaults (see Section 6, Mitigation 2). This is why those defaults are non-negotiable, not a nice-to-have.
SLA Reality Check: Azure App Configuration's SLA is 99.9% (approximately 8.7 hours of downtime per year). For a system with a 99.95% availability target, App Configuration cannot be a hard runtime dependency. It must be a correction layer over embedded defaults — not the single source of truth that your application cannot start without.
12. Flag Schema Evolution — Changing Rules Mid-Rollout
This is a production gotcha that bites teams the first time they do it, and it has no coverage in the official documentation. The question is: if you change a flag's targeting configuration while it is live and serving traffic, what happens to users already in the "on" cohort?
The Consistency Hashing Contract
The Microsoft.Targeting filter computes a hash (SHA-256 in the .NET library) over a string seed composed of {userId}\n{featureName} and maps it deterministically to a position on a 0–100 scale, treated here as a bucket number (0–99). Whether a user is in the "on" cohort depends on whether their bucket falls within the RolloutPercentage range.
What this means for schema changes:
| Change You Make | Effect on Existing Users |
|---|---|
| Increase DefaultRolloutPercentage from 10 to 25 | Users in buckets 10–24 are newly added to the "on" cohort. Users in buckets 0–9 stay on. No existing "on" users are turned off. Safe. |
| Decrease DefaultRolloutPercentage from 25 to 10 | Users in buckets 10–24 are removed from the "on" cohort. They will see the old experience after the cache refreshes. Potentially disruptive. |
| Add a new Group rule | No effect on users covered by the existing DefaultRolloutPercentage. Group rules are evaluated first; the default percentage is the fallback. |
| Change the feature flag name | All users lose their bucket assignment. The new flag name produces a different hash, assigning users to entirely different buckets. Never rename a live flag. |
| Add a TenantFilter to a flag already using TargetingFilter | With RequirementType set to All, filter evaluation is AND logic: a user must pass all filters, so existing "on" users who do not pass the TenantFilter will be turned off. Breaking for affected users. (With the SDK default, Any, filters are OR-ed and the extra filter can only widen the cohort.) |
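The first and fourth rows of the table follow directly from deterministic bucketing, which can be illustrated with a small sketch. It assumes a SHA-256 hash over `{userId}\n{featureName}` mapped to a 0–100 scale; `RolloutBucketing`, `Position` and `IsOn` are hypothetical names used only for illustration and are not the SDK's API.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RolloutBucketing
{
    // Maps "{userId}\n{featureName}" to a stable position in the range [0, 100].
    // Assumption: SHA-256 stands in for whatever hash the targeting filter uses.
    public static double Position(string userId, string featureName)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{userId}\n{featureName}"));
        uint marker = BitConverter.ToUInt32(hash, 0); // the first 4 bytes are enough
        return marker / (double)uint.MaxValue * 100;
    }

    // A user is "on" when their position falls below the rollout percentage.
    public static bool IsOn(string userId, string featureName, double rolloutPercentage)
        => Position(userId, featureName) < rolloutPercentage;
}
```

Because a user's position never changes for a given flag name, raising the percentage from 10 to 25 can only move users from off to on. Passing a different feature name changes the seed, so the resulting position is unrelated to the old one, which is why a rename reshuffles the cohort.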
The Safe Mid-Rollout Change Protocol
- Never decrease rollout percentage during active user sessions without a maintenance window or user communication.
- Never add a new filter to a live flag without first auditing what fraction of the current "on" cohort would be excluded.
- Never rename a live flag. Create a new flag, migrate traffic to it, then archive the old one.
- When adding a TenantFilter alongside TargetingFilter, audit the intersection: query App Insights for users currently seeing the flag-ON experience and verify the TenantFilter would not exclude them.
// Before adding TenantFilter: audit which tenants currently have NewCheckout=ON
customEvents
| where name == "FeatureFlag.Evaluated"
| where customDimensions["Flag"] == "NewCheckout"
| where customDimensions["Enabled"] == "True"
| summarize UserCount = dcount(tostring(customDimensions["UserId"]))
by TenantId = tostring(customDimensions["TenantId"])
| order by UserCount desc
13. Failure Modes and Mitigations
A flag system that fails open (all flags enabled) or fails closed (all flags disabled) can be as catastrophic as the bug it was meant to control. Every failure mode needs a defined safe default.
| Failure Mode | Behaviour | Mitigation |
|---|---|---|
| App Configuration Unreachable at Startup | Application crashes during boot; Kubernetes marks pod unhealthy | Embed safe defaults in appsettings.json; configure startup timeout (see Section 6) |
| App Configuration Unreachable at Runtime | SDK serves stale in-process cache until TTL, then falls back to defaults | Safe defaults should be the old code path — always. Document this per flag. |
| Refresh Storm (429 Throttling) | 200 replicas refreshing on the same cycle hit rate limits; cache goes stale | Jitter refresh: CacheExpiration + Random(0, 10s). Exponential backoff on 429. |
| Service Bus Unavailable | Push refresh fails; propagation falls back to polling TTL | Push is an optimisation, not a dependency. Monitor dead-letter queues. |
| Targeting Filter Exception | IFeatureManager throws; uncaught exception fails the request | Wrap all evaluations in try/catch. On exception: log, emit metric, return safe default. |
| Flag Name Typo | Flag not found → silently disabled. Feature never ships. | Use FeatureFlags.cs constants. CI contract test validates every constant (see Section 14). |
| Regional Drift | Two regions operate under different flag states after a partial pipeline run | Drift detection KQL query; atomic multi-region pipeline (see Section 11) |
| Cosmos DB Audit Write Failure | Audit trail incomplete | Audit writes must be async and non-blocking. Never block a flag change on an audit write. |
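The refresh-storm mitigation in the table (jitter plus exponential backoff on 429) can be sketched as a pure scheduling function. This assumes a custom refresh loop that decides its own next delay. `RefreshSchedule`, `NextDelay` and the 300-second backoff cap are hypothetical choices for illustration, not settings of the App Configuration provider.

```csharp
using System;

public static class RefreshSchedule
{
    // Computes the delay before the next refresh attempt:
    // base interval, doubled per consecutive 429 response (capped), plus random jitter.
    public static TimeSpan NextDelay(
        TimeSpan cacheExpiration,
        int consecutiveThrottles,
        Random rng,
        double maxJitterSeconds = 10,   // matches the Random(0, 10s) jitter in the table
        double maxBackoffSeconds = 300) // assumed cap, tune per workload
    {
        double factor = Math.Pow(2, Math.Clamp(consecutiveThrottles, 0, 16));
        // The cap never pushes the delay below the base interval itself.
        double cap = Math.Max(maxBackoffSeconds, cacheExpiration.TotalSeconds);
        double baseSeconds = Math.Min(cacheExpiration.TotalSeconds * factor, cap);
        double jitterSeconds = rng.NextDouble() * maxJitterSeconds;
        return TimeSpan.FromSeconds(baseSeconds + jitterSeconds);
    }
}
```

With a 30-second base, a healthy replica waits between 30 and 40 seconds, while a replica that has just seen two consecutive 429 responses waits between 120 and 130 seconds. The per-replica jitter is what spreads 200 replicas out of lockstep.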
The Defensive Wrapper — Non-Negotiable
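using System.Diagnostics;
using Microsoft.Extensions.Logging;
using Microsoft.FeatureManagement;

public static class FeatureManagerExtensions
{
    // Evaluates a flag and never throws: any failure is logged, tagged on the
    // current trace, and replaced by the caller-supplied safe default.
    public static async Task<bool> IsEnabledSafeAsync(
        this IFeatureManager fm,
        string feature,
        ILogger logger,
        bool defaultValue = false)
    {
        try
        {
            return await fm.IsEnabledAsync(feature);
        }
        catch (Exception ex)
        {
            logger.LogError(ex,
                "Flag evaluation failed: {Feature}. Falling back to default={Default}",
                feature, defaultValue);
            Activity.Current?.SetTag("feature.eval.error", feature);
            return defaultValue;
        }
    }
}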
14. Testing Strategy — Unit, Integration, and Contract Tests
This section is absent from most feature flag articles and is the first practical question a team asks when adopting this pattern. There are three distinct testing layers, and conflating them leads to brittle, slow tests.
Layer 1 — Unit Testing Flag-Gated Business Logic
The goal: test the business logic on both sides of a flag branch, independently of the flag evaluation mechanism. Use an in-memory IFeatureManager fake, not Moq — it is simpler and more readable:
// FakeFeatureManager — set flags to specific values for a given test
public class FakeFeatureManager : IFeatureManager
{
private readonly Dictionary<string, bool> _flags;
public FakeFeatureManager(Dictionary<string, bool> flags)
=> _flags = flags;
public Task<bool> IsEnabledAsync(string feature)
=> Task.FromResult(_flags.TryGetValue(feature, out var val) && val);
public Task<bool> IsEnabledAsync<TContext>(string feature, TContext context)
=> IsEnabledAsync(feature);
public IAsyncEnumerable<string> GetFeatureNamesAsync()
=> _flags.Keys.ToAsyncEnumerable();
}
// Usage in xUnit tests. Moq is still used for the pipeline collaborators, just not for IFeatureManager.
// ICheckoutPipeline is an assumed interface name for the two pipeline dependencies.
public class CheckoutServiceTests
{
    private readonly Mock<ICheckoutPipeline> _newPipeline = new();
    private readonly Mock<ICheckoutPipeline> _legacyPipeline = new();

    [Fact]
    public async Task PlaceOrder_WhenNewCheckoutEnabled_UsesNewPipeline()
    {
        var features = new FakeFeatureManager(
            new() { [FeatureFlags.NewCheckout] = true });
        // Pass the mocked instances (.Object), not the Mock<T> wrappers
        var sut = new CheckoutService(features, _newPipeline.Object, _legacyPipeline.Object);
        await sut.PlaceOrderAsync(TestCart.Build(), CancellationToken.None);
        _newPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Once);
        _legacyPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Never);
    }

    [Fact]
    public async Task PlaceOrder_WhenNewCheckoutDisabled_UsesLegacyPipeline()
    {
        var features = new FakeFeatureManager(
            new() { [FeatureFlags.NewCheckout] = false });
        var sut = new CheckoutService(features, _newPipeline.Object, _legacyPipeline.Object);
        await sut.PlaceOrderAsync(TestCart.Build(), CancellationToken.None);
        _legacyPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Once);
        _newPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Never);
    }
}
Layer 2 — Integration Testing the Targeting Filter Logic
The goal: verify that your TenantFilter (or custom filters) produce the correct ON/OFF decision for given inputs. Use WebApplicationFactory and an in-memory App Configuration provider rather than hitting a real App Configuration instance:
public class TenantFilterIntegrationTests : IClassFixture<WebApplicationFactory<Program>>
{
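private readonly WebApplicationFactory<Program> _factory;
// xUnit injects the shared class fixture through the constructor
public TenantFilterIntegrationTests(WebApplicationFactory<Program> factory)
=> _factory = factory;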
[Theory]
[InlineData("tenant:fabrikam-ltd", true)] // in AllowedTenants
[InlineData("tenant:unknown-corp", false)] // not in AllowedTenants, RequiredTier not met
public async Task TenantFilter_EvaluatesCorrectly(string tenantGroup, bool expectedEnabled)
{
var client = _factory.WithWebHostBuilder(builder =>
{
builder.ConfigureAppConfiguration(config =>
{
// Override App Configuration with an in-memory provider
config.AddInMemoryCollection(new Dictionary<string, string?>
{
// Feature flag JSON encoded as IConfiguration keys
["FeatureManagement:BulkExportV2:EnabledFor:0:Name"]
= "TenantFilter",
["FeatureManagement:BulkExportV2:EnabledFor:0:Parameters:AllowedTenants:0"]
= "tenant:fabrikam-ltd",
});
});
}).CreateClient();
// Set the tenant claim on the test request
// (requires test auth middleware that reads X-Test-Tenant header)
client.DefaultRequestHeaders.Add("X-Test-Tenant", tenantGroup);
var response = await client.GetAsync("/api/feature/BulkExportV2/status");
var result = await response.Content.ReadFromJsonAsync<FeatureStatusResponse>();
Assert.Equal(expectedEnabled, result!.IsEnabled);
}
}
Layer 3 — Contract Tests: CI Validation That Constants Match App Configuration
The most important test most teams never write. A flag constant in FeatureFlags.cs that has no corresponding entry in App Configuration evaluates silently to false. This causes features to silently never ship without any error. Make it a build failure instead:
// FeatureFlagContractTests.cs — runs in CI against a real App Configuration instance
// Uses the test environment label, not production
public class FeatureFlagContractTests
{
private readonly ConfigurationClient _client;
public FeatureFlagContractTests()
{
var endpoint = Environment.GetEnvironmentVariable("APPCONFIG_TEST_ENDPOINT")!;
_client = new ConfigurationClient(new Uri(endpoint), new DefaultAzureCredential());
}
[Fact]
public async Task AllFlagConstants_MustExistInAppConfiguration()
{
// Discover all flag name constants via reflection
var declaredFlags = typeof(FeatureFlags)
.GetFields(BindingFlags.Public | BindingFlags.Static | BindingFlags.FlattenHierarchy)
.Where(f => f.IsLiteral && !f.IsInitOnly && f.FieldType == typeof(string))
.Select(f => (string)f.GetRawConstantValue()!)
.ToList();
var missingFlags = new List<string>();
foreach (var flag in declaredFlags)
{
try
{
var key = $".appconfig.featureflag/{flag}"; // reserved key prefix under which feature flags are stored
await _client.GetConfigurationSettingAsync(key, label: "test");
}
catch (RequestFailedException ex) when (ex.Status == 404)
{
missingFlags.Add(flag);
}
}
Assert.True(
missingFlags.Count == 0,
$"The following flag constants in FeatureFlags.cs have no corresponding entry " +
$"in App Configuration (label=test): {string.Join(", ", missingFlags)}");
}
}
CI Integration: Run Layer 3 tests in a dedicated feature-flag-contract stage in your Azure DevOps pipeline, after infrastructure provisioning but before deployment. Gate the deployment on the contract test passing.
15. Observability
A feature flag without observability is not a controlled rollout — it is a controlled guess. You need to answer three questions from a dashboard in real time:
- What percentage of traffic is hitting the new code path right now?
- Is the new code path's error rate higher than the baseline?
- Is p99 latency regressing?
Structured Telemetry Pattern
var sw = Stopwatch.StartNew();
var flagEnabled = await _features.IsEnabledSafeAsync(FeatureFlags.NewCheckout, _logger);
_telemetry.TrackEvent("FeatureFlag.Evaluated", new Dictionary<string, string>
{
["Flag"] = FeatureFlags.NewCheckout,
["Enabled"] = flagEnabled.ToString(),
["UserId"] = _context.UserId,
["TenantId"] = _context.TenantId,
["CorrelationId"] = Activity.Current?.TraceId.ToString(),
["Environment"] = _env.EnvironmentName,
["Region"] = Environment.GetEnvironmentVariable("AZURE_REGION") ?? "unknown",
});
// Tag the span — correlated across all downstream services via W3C TraceContext
using var activity = ActivitySource.StartActivity("Checkout.PlaceOrder");
activity?.SetTag("feature.new_checkout", flagEnabled);
activity?.SetTag("user.tenant_id", _context.TenantId);
The Three KQL Queries You Need on Day One
// 1. Live exposure ratio — what % of traffic is on the new path?
customEvents
| where name == "FeatureFlag.Evaluated"
| where customDimensions["Flag"] == "NewCheckout"
| summarize
Total = count(),
Enabled = countif(customDimensions["Enabled"] == "True")
by bin(timestamp, 5m)
| extend ExposurePct = round(100.0 * Enabled / Total, 2)
| project timestamp, ExposurePct, Total
// 2. Error rate split by flag state — is the new path degraded?
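requests
| join kind=leftouter (
    customEvents
    | where name == "FeatureFlag.Evaluated"
    | where customDimensions["Flag"] == "NewCheckout"
    | project operation_Id, FlagEnabled = tostring(customDimensions["Enabled"])
) on operation_Id
// Requests with no matching evaluation event come back with an empty string, so test with isempty()
| summarize
    ErrorRate = round(100.0 * countif(success == false) / count(), 2)
    by FlagEnabled = iff(isempty(FlagEnabled), "unknown", FlagEnabled)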
// 3. P99 latency by flag state — is there a performance regression?
customMetrics
| where name == "feature.NewCheckout.latency_ms"
| summarize P99 = percentile(value, 99)
by Enabled = tostring(customDimensions["Enabled"]),
bin(timestamp, 10m)
| render timechart
Alert Rules
| Condition | Severity | Action |
|---|---|---|
| Error rate (flag=ON) − (flag=OFF) > 2% | Critical | PagerDuty alert; hold next rollout increment. Wire to pipeline if you want auto-pause: add a gate task that calls the Azure DevOps Approvals API to block the next stage. |
| P99 latency increase > 200ms (ON vs OFF) | High | Alert release manager; hold next percentage increment |
| FeatureFlag.Evaluated events drop to 0 for 5 min | High | SDK may be failing; all users on defaults |
| Kill-switch toggled in production | Audit | Notify CISO + release manager via Teams webhook |
| Flag cache age > 2× TTL | Medium | App Configuration polling may be throttled or failing |
| Regional exposure drift > 5% between regions | Medium | Possible regional drift; trigger pipeline re-sync |
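The first two alert rules amount to a simple gate on the next rollout increment. A minimal sketch follows, assuming the ON and OFF cohort statistics have already been pulled from the KQL queries above. `RolloutGate`, `CohortStats` and `RolloutDecision` are hypothetical names, and the thresholds are the ones from the table.

```csharp
using System;

public enum RolloutDecision { Proceed, Hold }

// Error rate in percent and p99 latency in milliseconds for one cohort.
public readonly record struct CohortStats(double ErrorRatePct, double P99LatencyMs);

public static class RolloutGate
{
    // Holds the rollout when the flag-ON cohort regresses against flag-OFF by more than
    // 2 percentage points of error rate or 200 ms of p99 latency.
    public static RolloutDecision Evaluate(
        CohortStats flagOn,
        CohortStats flagOff,
        double maxErrorDeltaPct = 2.0,
        double maxP99DeltaMs = 200.0)
    {
        bool errorRegression = flagOn.ErrorRatePct - flagOff.ErrorRatePct > maxErrorDeltaPct;
        bool latencyRegression = flagOn.P99LatencyMs - flagOff.P99LatencyMs > maxP99DeltaMs;
        return errorRegression || latencyRegression
            ? RolloutDecision.Hold
            : RolloutDecision.Proceed;
    }
}
```

Comparing deltas against the OFF cohort, not absolute values, keeps the gate meaningful when the whole system is having a bad day: both cohorts degrade together and the difference stays small.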
16. Security and Governance
RBAC — Least Privilege, Applied Precisely
| Role | Assigned To | What They Can Do |
|---|---|---|
| App Configuration Data Reader | Service principal / Managed Identity of your compute | Read flag values only. Can never modify flags. |
| App Configuration Data Owner | Release managers (humans only, via Entra ID PIM) | Create, modify, delete flag configurations in production |
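| App Configuration Contributor (ARM) | DevOps pipeline service principal | Infrastructure provisioning only, with no data plane write access. A release pipeline that also pushes flag values with az appconfig feature set (Section 11) needs a separately scoped data plane write role for that step. |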
| App Configuration Reader (ARM) | Developers | View-only access in the portal; no flag modification |
⚠️ Critical: Service principals that run your code must never have write access to App Configuration. If a service can toggle its own flags, a compromised service can disable its own rate limiting or kill-switches. Data Reader only for runtime identities — no exceptions.
Immutable Audit Trail — Event Grid → Cosmos DB
// Isolated worker model: the returned record is written by the Cosmos DB output binding.
[Function("FeatureFlagAudit")]
[CosmosDBOutput("featureflags", "audit",
    Connection = "CosmosDbConnection")]
public AuditRecord Run([EventGridTrigger] EventGridEvent evt)
{
    // AppConfigKeyValueModifiedData is an application-defined DTO for the event payload
    var data = evt.Data.ToObjectFromJson<AppConfigKeyValueModifiedData>();
    // This runs off the Event Grid delivery path, so a failed audit write
    // never blocks the flag change itself; Event Grid retries the delivery.
    return new AuditRecord
    {
        Id = Guid.NewGuid().ToString(),
        FlagKey = data.Key,
        Label = data.Label, // the environment
        ChangedAt = evt.EventTime,
        ChangedBy = data.ModifiedBy, // Entra ID object ID; verify your event payload actually carries it
        EventType = evt.EventType, // Modified | Deleted
        PartitionKey = data.Key
    };
}
Flag Lifecycle Governance
Flags with no expiry date are technical debt with a detonator. Enforce these rules:
- Every flag must have a plannedExpiryDate metadata field set at creation time.
- A nightly DevOps pipeline alerts (or auto-archives) flags past their expiry.
- Short-lived release flags must be cleaned up within 2 sprints.
- Long-lived ops/experiment flags require quarterly review approval.
- A CI check validates that every constant in FeatureFlags.cs has a corresponding entry in App Configuration. Orphaned constants are a build failure.
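The nightly expiry check reduces to a filter over flag metadata. A minimal sketch, assuming the flags and their plannedExpiryDate values have already been loaded from wherever that metadata is stored. `FlagMetadata` and `FlagExpiry` are hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record FlagMetadata(string Name, DateOnly? PlannedExpiryDate);

public static class FlagExpiry
{
    // Returns flags that are past their planned expiry date.
    // Flags with no expiry date are reported too, since every flag must have one.
    public static IReadOnlyList<string> FindViolations(
        IEnumerable<FlagMetadata> flags, DateOnly today)
        => flags
            .Where(f => f.PlannedExpiryDate is null || f.PlannedExpiryDate < today)
            .Select(f => f.Name)
            .ToList();
}
```

The pipeline can then alert on, or fail for, any non-empty result. Passing the current date in as a parameter keeps the check deterministic in tests.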
17. Common Pitfalls and Anti-Patterns
🚫 Pitfall 1 — Flag Debt: The Silent Killer
After 18 months you will have 200 active flags, nested dependencies ("FlagB only makes sense if FlagA is also enabled"), and engineers who are afraid to delete anything. When FlagA gets killed, FlagB's behaviour is undefined — and nobody knows what code actually runs in production.
Rule: Each flag must be independently meaningful and independently safe to disable. If you find yourself writing if (flagA && flagB) in business logic, restructure your flags.
🚫 Pitfall 2 — Using Flags as Configuration Stores
Feature flags are binary or variant routing decisions, not configuration values. Do not use a flag to store a timeout value, a batch size, or a URL. Use App Configuration key-value pairs for configuration. Reserve the feature flags API for code path decisions.
🚫 Pitfall 3 — Random Percentage Evaluation (The Flickering Bug)
Always use deterministic, user-ID-based hashing for percentage rollouts, which Microsoft.Targeting does for you automatically. Never implement percentage rollout with Random.Next(100) < rolloutPercentage. This evaluates differently on every request for the same user, so they see the new feature on one page load and the old one on the next. The Microsoft.Targeting filter hashes the user ID together with the feature name to place each user in a stable bucket, guaranteeing the same experience across every request, pod, and replica.
🚫 Pitfall 4 — Evaluating Flags in Hot Loops
IsEnabledAsync is fast, but not free. Never call it per-item in a high-volume loop:
// ❌ WRONG — evaluates 10,000 times, same result every time
foreach (var item in items)
if (await _features.IsEnabledAsync(FeatureFlags.NewPipeline))
await ProcessNew(item);
// ✅ CORRECT — evaluate once, use the result inside the loop
var useNewPipeline = await _features.IsEnabledAsync(FeatureFlags.NewPipeline);
foreach (var item in items)
if (useNewPipeline)
await ProcessNew(item);
🚫 Pitfall 5 — Shipping Without Telemetry
If you cannot answer "what percentage of requests hit the new code path in the last 30 minutes" from a dashboard, your rollout is blind. Every flag evaluation on a business-critical path must emit telemetry. Non-negotiable.
🚫 Pitfall 6 — Environment Parity Failures
If a flag is on in staging but accidentally off in production due to a label misconfiguration, your staging validation means nothing. Add a smoke test post-deploy that asserts flag states in App Configuration match expected values per environment label.
🚫 Pitfall 7 — Flag Evaluation in the Data Access Layer
A flag should never influence how data is written to the database from inside your repository or data access layer. The flag decision belongs in the service layer. Coupling your persistence model to a temporary rollout concern creates schema migration nightmares when the flag is eventually deleted.
🚫 Pitfall 8 — Renaming a Live Flag
Renaming a flag that is currently serving a percentage rollout silently resets the consistent hash bucketing for every user. Users who were in the "on" cohort may move to "off" and vice versa, with no warning and no way to detect it from application logs. Never rename a live flag. Archive it and create a new one.
18. Production Readiness Checklist
Before declaring your system production-ready, every item below should be verified:
Bootstrap and Resilience
- Safe defaults in appsettings.json: every flag has a safe default embedded; App Configuration is a correction layer, not a hard startup dependency
- Startup timeout configured: ConfigureStartupOptions.Timeout set; application does not crash if App Configuration is slow at boot
- Readiness probe wired: pod does not receive traffic until App Configuration cache is warm
- Background worker warm-up: non-HTTP workers call RefreshAsync() on StartAsync
Flag Evaluation
- Managed Identity only: no connection strings in configuration; service principal has Data Reader, not Data Owner
- In-process cache active: zero network calls to App Configuration on the hot path; cache TTL tuned per flag category
- ITargetingContextAccessor implemented: UserId and Groups populated from JWT claims on every authenticated request
- Defensive wrapper on all call sites: try/catch with safe default; no uncaught exception from flag evaluation can crash a request
- FeatureFlags.cs constants in place: no magic strings at call sites
Kill Switch and Propagation
- Kill-switch propagation tested: verified flag disable reaches all replicas within 30s (polling) or <2s (push)
- Service Bus push refresh active: Event Grid → Service Bus → RefreshAsync() pipeline deployed and tested
- Cross-service propagation via header: flag evaluated once at the transaction boundary and propagated, never re-evaluated per downstream service
Multi-Region
- One App Configuration instance per region: no cross-region polling on the refresh path
- Atomic multi-region pipeline: flag changes update all regions in a single pipeline job; partial updates cause alert and halt
- Regional drift detection query: KQL monitor alert running hourly
Testing
- Unit tests cover both flag branches: FakeFeatureManager used; no Moq of IFeatureManager
- Integration tests cover custom filters: TenantFilter (and any custom filters) have targeting logic tested via WebApplicationFactory
- Contract tests in CI: FeatureFlagContractTests runs in pipeline; every constant in FeatureFlags.cs must exist in App Configuration or the build fails
Observability and Governance
- Structured telemetry on critical paths: FeatureFlag.Evaluated event with Flag, Enabled, UserId, CorrelationId, Environment, Region
- KQL dashboards and alert rules live: error rate delta, p99 latency delta, evaluation error rate, regional drift all monitored
- Expiry dates on all flags: nightly pipeline alerts on expired flags; cleanup SLA enforced
- Audit trail in Cosmos DB: every flag change has an immutable record with changedBy (Entra ID OID), timestamp, and event type
- Environment parity smoke test: post-deploy check asserts flag states match expected values per environment label
Closing Thought
Feature flags are not an engineering convenience — they are a risk management instrument. A system that can expose a feature to 1% of users, measure its impact with real production telemetry, and roll it back in five seconds is fundamentally safer than one that cannot.
On the Microsoft stack, the primitives are all there: Azure App Configuration, the Feature Management SDK, Front Door, Service Bus, and Azure Monitor. The architecture described here wires them together into a coherent, production-grade exposure plane.
The hard part is not the technology. It is the discipline: treating flags as first-class engineering artifacts with owners, expiry dates, telemetry requirements, and approval workflows. That discipline is what separates teams that ship fearlessly from teams that ship fearfully.
If this was useful, share it with your team — especially the checklist. The KQL queries in Section 15 are copy-paste ready. The contract test in Section 14 should be in every CI pipeline that uses this pattern.
