Partial Feature Rollout in Large-Scale Distributed Systems

Who this is for: Senior backend engineers and architects building cloud-native systems on Azure. This is not a "what is a feature flag" primer — it is a production-grade design walkthrough with real SDK code, real trade-offs, and honest discussion of failure modes, multi-region topology, testing strategy, and where the Microsoft-native approach wins and loses against alternatives.

Problem Statement
Key Requirements
Why Azure App Configuration — and When to Choose Something Else
High-Level Architecture
Feature Flag Evaluation Flow
The Cold-Start and Bootstrap Problem
Rollout Strategies
Building a Custom Feature Filter — TenantFilter In Full
Variants and Experimentation — Beyond Boolean Flags
Consistency vs. Availability Trade-offs
Multi-Region Topology
Flag Schema Evolution — Changing Rules Mid-Rollout
Failure Modes and Mitigations
Testing Strategy — Unit, Integration, and Contract Tests
Observability
Security and Governance
Common Pitfalls and Anti-Patterns
Production Readiness Checklist

1. Problem Statement

Shipping software used to mean one thing: deploy, and everyone gets it simultaneously. For a small team with a small user base, that is fine. For a system serving millions of requests per hour across dozens of microservices in multiple Azure regions, it is a liability.

The core problem is simultaneity. A NullReferenceException on an unhandled user configuration hits every user at once. A full rollback on AKS takes 10–30 minutes of degraded service. And a Git commit message does not satisfy the SOC 2 / ISO 27001 requirement to answer "what changed, when, who approved it, and what was the previous state."

A deployment is a mechanical act. A release is a business decision. Feature flags separate the two.

The solution is to ship code that is off by default, and then deliberately, measurably turn it on — for 1% of users, then a specific tenant, then 25%, then globally. Each step is observable, reversible, and audited.

2. Key Requirements

Before any code is written, five properties define the design envelope.

Property	Requirement	Why It Matters
Safety / Blast-Radius Control	Kill-switch propagation < 30s; exact percentage targeting	A flag you cannot disable in under a minute is not a safety mechanism — it is theatre
Scalability	10,000+ RPS per service, no measurable latency added to the request path	Every API call evaluates flags; a slow evaluator becomes the bottleneck in every request path
Low Latency	In-process evaluation; no network call on hot path	Calling App Configuration on every request at 10K RPS would add 20–50ms per request and generate 10K API calls/second
Auditability	Immutable log: who, when, old value, new value	SOC 2 / ISO 27001 compliance; post-incident root-cause analysis
Governance	Approval workflows; no runtime write access	A service that can toggle its own flags is a compliance violation waiting to happen

One critical consequence of the Low Latency requirement: the local in-process cache is non-negotiable. You load flags from Azure App Configuration at startup and refresh them in the background once the cache interval (default: 30 seconds) has elapsed. Flag evaluation reads the in-memory snapshot and takes microseconds, not milliseconds. The 30-second refresh window is the consistency trade-off you are deliberately accepting.

3. Why Azure App Configuration — and When to Choose Something Else

Architects justify their tool choices. Here is the honest comparison.

The Alternatives

LaunchDarkly is the market leader in dedicated feature flag platforms. It has a significantly richer targeting engine (multi-variate flags, advanced rules, real-time streaming, built-in experimentation with statistical significance), a purpose-built SDK with sub-second propagation, and first-class support for mobile and client-side SDKs. If your primary requirement is sophisticated A/B experimentation with business-analyst-friendly tooling and you are not constrained to the Microsoft ecosystem, LaunchDarkly is the stronger product for that use case.

Unleash (open-source) and Flagsmith (open-source / SaaS) are credible alternatives if you need self-hosted control over flag data for compliance reasons and do not want to be coupled to any cloud vendor's proprietary service.

Why Azure App Configuration Wins in the Microsoft Ecosystem

Factor	Azure App Configuration	LaunchDarkly
Identity integration	Native Entra ID / Managed Identity — zero credential management	API key or OAuth; credential rotation required
Azure DevOps / GitHub Actions	First-class pipeline tasks; ARM and Bicep templates	Third-party action; separate secret management
Compliance boundary	All data stays in your Azure tenant, in your region, under your retention policies	Data leaves your tenant to LaunchDarkly's SaaS
Cost model	Included in your Azure spend / EA agreement	Per-seat SaaS pricing that scales uncomfortably at enterprise size
Propagation latency	2–30s (push + poll hybrid described in this article)	Sub-second streaming (genuine advantage)
Experimentation depth	Basic variant support (v4 SDK); no statistical engine	Full A/B with p-values, confidence intervals, guardrail metrics

The honest verdict: If you need statistical experimentation at scale (not just routing, but measuring lift with significance), LaunchDarkly or a dedicated experimentation platform like Azure Experimentation (preview) is the right call. If you need a governed, auditable, compliance-friendly feature flag system tightly integrated into an existing Azure estate, App Configuration is the right call. Most large Microsoft-stack enterprises need the latter.

4. High-Level Architecture

The system has four logical planes, each with a clear responsibility mapped to specific Azure services.

Plane	Azure Service	Responsibility
Control Plane	Azure App Configuration	Source of truth for flag definitions, filter configs, and environment labels
Control Plane	Azure Key Vault	Referenced secrets (never stored in App Configuration directly)
Data Plane	Microsoft.FeatureManagement SDK	In-process flag evaluation using a locally cached snapshot — no network call per request
Data Plane	Azure Front Door	Edge-level traffic routing: canary backends, header injection, geo-targeting
Data Plane	Azure Service Bus	Push-based flag-change propagation for near-instant cache invalidation (<2s)
Observation Plane	Azure Monitor + App Insights	Flag evaluation telemetry, per-flag error rates, latency metrics, and alerting
Observation Plane	Log Analytics Workspace	Centralised KQL-queryable audit and evaluation logs across all services and regions
Governance Plane	Azure DevOps / GitHub Actions	Flag lifecycle: creation, environment promotion, approval gates, scheduled expiry
Governance Plane	Microsoft Entra ID	App Configuration Data Reader role for service principals; App Configuration Data Owner role for release managers only

The Request Flow

[Client Request]
      │
      ▼
[Azure Front Door] ──── WAF + Rules Engine
      │                 (injects X-Feature-Variant header
      │                  or routes to canary backend pool)
      ▼
[AKS / App Service — .NET 8 API]
      │
      │  IFeatureManager.IsEnabledAsync("NewCheckout", ctx)
      ▼
[In-Process Cache] ──── refreshed every 30s from App Configuration
      │                 (background thread — NOT the hot path)
      │  Applies: TargetingFilter, PercentageFilter,
      │           TimeWindowFilter, TenantFilter (custom)
      ▼
[Decision: ON / OFF / Variant]
      │
      ▼
[Business Logic Branch] ──── telemetry event → Application Insights

Key Design Insight: Azure App Configuration is never on the hot path. The SDK maintains a local in-memory cache, refreshed asynchronously. Your 10,000 RPS service makes approximately 2 calls to App Configuration per minute per replica — not per request.
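A quick back-of-the-envelope check of that claim, as a sketch. The 30-second interval and the 10,000 RPS figure come from the text above; the replica count of 20 is an illustrative assumption, and both helper functions are hypothetical names.

```csharp
using System;

// Polls per minute per replica when the cache is refreshed at most once per interval.
static double PollsPerMinute(TimeSpan refreshInterval) =>
    TimeSpan.FromMinutes(1) / refreshInterval;

// Calls per minute if every request called App Configuration directly (the anti-pattern).
static double PerRequestCallsPerMinute(int requestsPerSecond) =>
    requestsPerSecond * 60.0;

var interval = TimeSpan.FromSeconds(30);   // refresh interval used in this article
int replicas = 20;                         // assumption: illustrative replica count

double cachedTotal = PollsPerMinute(interval) * replicas;
double uncachedPerReplica = PerRequestCallsPerMinute(10_000);

Console.WriteLine($"Cached:      {cachedTotal} calls/min across {replicas} replicas");
Console.WriteLine($"Per-request: {uncachedPerReplica} calls/min for a single replica");
```

Under these assumptions the whole fleet makes 40 calls per minute with the cache, against 600,000 calls per minute from a single replica without it.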

5. Feature Flag Evaluation Flow

Step 1 — SDK Registration in `Program.cs`

dotnet add package Microsoft.FeatureManagement.AspNetCore
dotnet add package Microsoft.Extensions.Configuration.AzureAppConfiguration

Wire up App Configuration with Managed Identity — no connection strings, no secrets in config:

// Always use Managed Identity — never a connection string in config
builder.Configuration.AddAzureAppConfiguration(options =>
{
    options
        .Connect(new Uri(appConfigEndpoint), new ManagedIdentityCredential())
        .Select(KeyFilter.Any, LabelFilter.Null)       // baseline (no label)
        .Select(KeyFilter.Any, environmentLabel)       // env override: "production"
        .UseFeatureFlags(ff =>
        {
            ff.Label = environmentLabel;
            ff.CacheExpirationInterval = TimeSpan.FromSeconds(30);
        })
        .ConfigureRefresh(refresh =>
        {
            // Refresh ALL flags when the sentinel key changes
            refresh
                .Register(".appconfig.featureflags", refreshAll: true)
                .SetCacheExpiration(TimeSpan.FromSeconds(30));
        });
});

builder.Services.AddAzureAppConfiguration();

builder.Services
    .AddFeatureManagement()
    .AddFeatureFilter<TargetingFilter>()
    .AddFeatureFilter<PercentageFilter>()
    .AddFeatureFilter<TimeWindowFilter>()
    .AddFeatureFilter<TenantFilter>();  // implemented in full in Section 8

var app = builder.Build();

// Middleware: triggers the non-blocking refresh check as requests arrive
app.UseAzureAppConfiguration();

Step 2 — Strongly Typed Flag Constants

Never use magic strings at call sites. A CI step validates that every constant in FeatureFlags.cs exists in App Configuration — if it does not, the build fails (see Section 14 for the contract test that enforces this):

// One source of truth for flag names.
// A CI contract test validates every constant exists in App Configuration.
public static class FeatureFlags
{
    public const string NewCheckout       = "NewCheckout";
    public const string V2PricingEngine   = "V2PricingEngine";
    public const string AiRecommendations = "AiRecommendations";
    public const string BulkExportV2      = "BulkExportV2";
}

Step 3 — `ITargetingContextAccessor` Implementation

This is where your identity model meets the flag system. Groups are how you target millions of users via a handful of rules — without exploding the flag configuration size:

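public class HttpTargetingContextAccessor : ITargetingContextAccessor
{
    private readonly IHttpContextAccessor _http;

    // Requires builder.Services.AddHttpContextAccessor() in Program.cs
    public HttpTargetingContextAccessor(IHttpContextAccessor http) => _http = http;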

    public ValueTask<TargetingContext> GetContextAsync()
    {
        var user     = _http.HttpContext?.User;
        var userId   = user?.FindFirstValue(ClaimTypes.NameIdentifier) ?? "anonymous";
        var tenantId = user?.FindFirstValue("tid") ?? "default";
        var tier     = user?.FindFirstValue("subscription_tier") ?? "free";

        return ValueTask.FromResult(new TargetingContext
        {
            UserId = userId,
            // Groups let you target millions of users via a handful of rules
            Groups = [$"tenant:{tenantId}", $"tier:{tier}"]
        });
    }
}

Step 4 — Evaluation at the Call Site

Centralise flag evaluation in the service layer — never in the data access layer or raw controllers:

public class CheckoutService
{
    private readonly IFeatureManager _features;

    public async Task<OrderResult> PlaceOrderAsync(Cart cart, CancellationToken ct)
    {
        // One evaluation. One consistent decision for this request.
        var useNewCheckout = await _features.IsEnabledAsync(FeatureFlags.NewCheckout);

        return useNewCheckout
            ? await _newCheckoutPipeline.ExecuteAsync(cart, ct)
            : await _legacyCheckoutPipeline.ExecuteAsync(cart, ct);
    }
}

6. The Cold-Start and Bootstrap Problem

This is the gap most feature flag articles skip entirely, and it is the first thing an architect asks in a design review: what happens before the cache is warm?

Quick Reference — The Four Mitigations

#	Mitigation	Protects Against
1	Startup timeout with graceful fallback	App Configuration unreachable at boot → pod crash
2	Embed safe defaults in `appsettings.json`	Cold start with no App Configuration response → unknown state
3	Readiness probe with `initialDelaySeconds`	Pod receiving traffic before cache is warm
4	Explicit `RefreshAsync()` in background workers	Non-HTTP workers never hitting the refresh middleware

The Problem

On first startup — or on a cold-start in a scale-to-zero environment like Azure Container Apps or Consumption-plan Functions — the AddAzureAppConfiguration call in Program.cs performs a synchronous, blocking load from App Configuration before the application accepts any traffic. Three failure scenarios emerge:

App Configuration is unreachable at boot. The application crashes during startup and Kubernetes never marks the new pod as ready. During a rolling update the old replica set keeps serving, but any pod that restarts, reschedules, or scales out while App Configuration is down cannot come back. In a sustained outage this steadily erodes capacity and can take down your entire deployment.
App Configuration responds slowly at boot. Cold start latency spikes. On Azure Container Apps with scale-to-zero, this adds directly to the first-request latency experienced by the user.
Partial flag state at startup. If App Configuration returns a subset of keys (due to a transient error mid-response), the local cache is inconsistently populated. Some flags are missing and silently fall back to their defaults.

The Mitigations

Mitigation 1: Startup timeout with a graceful fallback

Configure the initial load with an explicit timeout. If App Configuration does not respond within the timeout, start with embedded defaults — do not crash:

builder.Configuration.AddAzureAppConfiguration(options =>
{
    options
        .Connect(new Uri(appConfigEndpoint), new ManagedIdentityCredential())
        .UseFeatureFlags(ff => { ff.Label = environmentLabel; })
        .ConfigureStartupOptions(startupOptions =>
        {
            // The timeout bounds how long startup waits for App Configuration.
            // Combined with optional: true below, a failed load does not crash
            // the process; embedded defaults apply until a later refresh succeeds.
            startupOptions.Timeout = TimeSpan.FromSeconds(10);
        });
}, optional: true);

Mitigation 2: Embed safe defaults as appsettings.json fallback

The AddAzureAppConfiguration call merges on top of the existing IConfiguration. Pre-populate appsettings.json with all feature flags set to their safe default state (the old code path). If App Configuration is unreachable at startup, the service runs with known-safe defaults instead of crashing:

// appsettings.json — safe defaults for every flag
// These are OVERRIDDEN by App Configuration on successful load
{
  "FeatureManagement": {
    "NewCheckout":       false,
    "V2PricingEngine":   false,
    "AiRecommendations": false,
    "BulkExportV2":      false
  }
}
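The layering can be pictured as two dictionaries, with the remote snapshot winning whenever it holds a value. This is a simplified sketch of the precedence rule only, not of how IConfiguration is implemented; ResolveFlag and both dictionaries are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

// Resolve a flag: remote value if present, else embedded default, else OFF.
static bool ResolveFlag(
    string name,
    IReadOnlyDictionary<string, bool> remoteSnapshot,    // empty if App Configuration was unreachable
    IReadOnlyDictionary<string, bool> embeddedDefaults)  // mirrors appsettings.json
{
    if (remoteSnapshot.TryGetValue(name, out var remote)) return remote;
    if (embeddedDefaults.TryGetValue(name, out var fallback)) return fallback;
    return false; // unknown flag: fail closed to the old code path
}

var defaults = new Dictionary<string, bool> { ["NewCheckout"] = false, ["BulkExportV2"] = false };
var unreachable = new Dictionary<string, bool>();                       // cold start, no response
var loaded = new Dictionary<string, bool> { ["NewCheckout"] = true };   // successful load

Console.WriteLine(ResolveFlag("NewCheckout", unreachable, defaults)); // False
Console.WriteLine(ResolveFlag("NewCheckout", loaded, defaults));      // True
```

The point of the sketch is the last line of ResolveFlag: a flag that appears in neither source resolves to the old code path rather than to an undefined state.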

Mitigation 3: Health check for scale-to-zero environments

For Azure Container Apps and Azure Functions on Consumption plan, the cold-start problem compounds because App Configuration's Managed Identity token acquisition adds its own latency (50–200ms for the first token on a fresh instance). Expose a startup health probe and configure a generous initialDelaySeconds:

builder.Services.AddHealthChecks()
    .AddAzureAppConfiguration(
        name: "appconfig",
        tags: ["ready"]);

// Kubernetes / Container Apps readiness probe
// Do not mark the pod ready until App Configuration is warm
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

# AKS deployment — give the pod time to warm the cache before receiving traffic
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 3

Mitigation 4: Pre-warm the cache in the background service, not the request pipeline

For long-running background workers (Azure Service Bus consumers, Hangfire workers), the UseAzureAppConfiguration() middleware is not available since there is no HTTP pipeline. Drive refresh explicitly via a hosted service:

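public class AppConfigWarmupService : IHostedService
{
    private readonly IEnumerable<IConfigurationRefresher> _refreshers;

    // IConfigurationRefresher is not registered in DI directly;
    // resolve refreshers through IConfigurationRefresherProvider.
    public AppConfigWarmupService(IConfigurationRefresherProvider provider)
        => _refreshers = provider.Refreshers;

    public async Task StartAsync(CancellationToken ct)
    {
        // Eagerly refresh on worker startup instead of waiting for the first poll interval.
        // TryRefreshAsync returns false on failure rather than throwing.
        foreach (var refresher in _refreshers)
            await refresher.TryRefreshAsync(ct);
    }

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}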

Scale-to-Zero Rule of Thumb: If your p99 cold-start budget is under 500ms, pre-populate all flag defaults in appsettings.json and treat App Configuration as an async correction layer — not a blocking startup dependency.

7. Rollout Strategies

7.1 — Percentage-Based Rollout

The workhorse of gradual exposure. The Microsoft.Targeting filter assigns each user to a stable bucket by hashing the user ID together with the feature name (SHA-256 in the current .NET library), so the assignment is deterministic rather than random. The same user always gets the same experience for a given flag, with no flickering between requests.

{
  "id": "NewCheckout",
  "description": "Gradual rollout of the new checkout flow — Sprint 42",
  "enabled": true,
  "conditions": {
    "client_filters": [
      {
        "name": "Microsoft.Targeting",
        "parameters": {
          "Audience": {
            "DefaultRolloutPercentage": 10,
            "Groups": [
              { "Name": "tier:enterprise", "RolloutPercentage": 100 },
              { "Name": "tier:pro",        "RolloutPercentage": 50  }
            ]
          }
        }
      }
    ]
  }
}
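The stable bucketing behind a percentage rollout can be sketched in a few lines. This is an illustration of the technique under stated assumptions, not the library's exact algorithm: BucketPercent and IsInRollout are hypothetical names, and the choice to hash the user ID and flag name joined by a newline is an assumption made for the example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Map (userId, flagName) to a stable position between 0 and 100.
static double BucketPercent(string userId, string flagName)
{
    // Including the flag name decorrelates a user's bucket across different flags.
    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{userId}\n{flagName}"));
    uint marker = BitConverter.ToUInt32(hash, 0);
    return marker / (double)uint.MaxValue * 100.0;
}

// A user is in the rollout when their bucket falls below the configured percentage.
static bool IsInRollout(string userId, string flagName, double rolloutPercentage) =>
    rolloutPercentage >= 100 || BucketPercent(userId, flagName) < rolloutPercentage;

foreach (var pct in new[] { 1, 10, 50, 100 })
{
    bool enabled = IsInRollout("alice@contoso.com", "NewCheckout", pct);
    Console.WriteLine($"{pct,3}% -> {enabled}");
}
```

Because the bucket depends only on the inputs, raising the percentage never removes a user who was already included: once a line prints True, every later line prints True as well.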

Production Staging Pattern: Follow this schedule: 1% → 5% → 10% → 25% → 50% → 100%. Encode it as pipeline stages with approval gates. Wait at least one full business cycle (24 hours) at each stage. Compare error rates and p99 latency between flag-ON and flag-OFF cohorts using KQL before advancing.
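As a sketch, the advance-or-hold decision at each approval gate might look like the following. The schedule comes from the text; NextStage, the use of error rate alone, and the 10% relative tolerance are assumptions for illustration, and a real gate would also compare p99 latency.

```csharp
using System;

// Decide the next rollout percentage: advance one stage, or hold if the
// flag-ON cohort's error rate exceeds flag-OFF by more than the tolerance.
static int NextStage(
    int[] stages, int currentPercent,
    double errorRateOn, double errorRateOff, double maxRelativeIncrease)
{
    bool regressed = errorRateOn > errorRateOff * (1 + maxRelativeIncrease);
    if (regressed) return currentPercent;

    int index = Array.IndexOf(stages, currentPercent);
    if (index < 0) throw new ArgumentException("Not a scheduled stage", nameof(currentPercent));

    return stages[Math.Min(index + 1, stages.Length - 1)];
}

int[] schedule = { 1, 5, 10, 25, 50, 100 };

// Cohort error rates would come from the KQL comparison; these values are made up.
Console.WriteLine(NextStage(schedule, 10, errorRateOn: 0.010, errorRateOff: 0.010, maxRelativeIncrease: 0.10)); // 25
Console.WriteLine(NextStage(schedule, 10, errorRateOn: 0.020, errorRateOff: 0.010, maxRelativeIncrease: 0.10)); // 10
```

Holding at the current stage, rather than rolling back automatically, keeps the rollback decision with the release manager.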

7.2 — User / Tenant-Based Targeting

Essential for beta programs, early-access customers, and debugging production issues against a specific tenant without a broader rollout:

{
  "id": "V2PricingEngine",
  "enabled": true,
  "conditions": {
    "client_filters": [{
      "name": "Microsoft.Targeting",
      "parameters": {
        "Audience": {
          "Users": ["alice@contoso.com", "qa-bot@internal.com"],
          "Groups": [
            { "Name": "tenant:fabrikam-ltd",  "RolloutPercentage": 100 },
            { "Name": "tenant:northwind-inc", "RolloutPercentage": 100 }
          ],
          "DefaultRolloutPercentage": 0
        }
      }
    }]
  }
}

DefaultRolloutPercentage: 0 means no one outside the explicit list or groups sees the flag — your closed beta pattern.

7.3 — Environment-Based Rollout via Labels

Azure App Configuration's label system scopes a configuration value to an environment. The same flag key can be enabled: true in development and enabled: false in production:

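# Development: create the flag (new flags start disabled), then turn it on
az appconfig feature set \
  --name my-appconfig --feature AiRecommendations --label development --yes
az appconfig feature enable \
  --name my-appconfig --feature AiRecommendations --label development --yes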

# Production: deliberately off until approved by release manager
az appconfig feature disable \
  --name my-appconfig --feature AiRecommendations --label production --yes

// SDK automatically selects the correct label at startup
options.UseFeatureFlags(ff =>
{
    ff.Label = Environment.GetEnvironmentVariable("AZURE_APP_CONFIG_LABEL");
    // "development" | "staging" | "production"
});

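For key-values, the SDK merges the labels you select: baseline values are the fallback and environment-specific labels override them. Feature flags are loaded for the label configured in UseFeatureFlags, so each environment's flag state must exist under its own label. Your pipeline promotes a flag by setting it under the production label, with no code changes and no redeployment.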

7.4 — Kill Switches and Instant Rollback

A kill switch is a flag with enabled: false — but propagation speed is what makes it a kill switch rather than a slow rollback.

Polling (baseline): The SDK refreshes every 30 seconds. Worst case: 30 seconds to propagate across all replicas.

Push via Event Grid + Service Bus (optimisation): App Configuration emits change events to Event Grid → Service Bus → your services call IConfigurationRefresher.RefreshAsync() immediately. Typical propagation: under 2 seconds.

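public class FeatureFlagRefreshConsumer : BackgroundService
{
    private readonly IConfigurationRefresher _refresher;
    private readonly ServiceBusProcessor _processor;
    private readonly ILogger<FeatureFlagRefreshConsumer> _logger;

    // IConfigurationRefresher is not registered in DI directly; resolve it
    // through IConfigurationRefresherProvider (registered by AddAzureAppConfiguration()).
    public FeatureFlagRefreshConsumer(
        IConfigurationRefresherProvider refresherProvider,
        ServiceBusProcessor processor,
        ILogger<FeatureFlagRefreshConsumer> logger)
    {
        _refresher = refresherProvider.Refreshers.First();
        _processor = processor;
        _logger    = logger;
    }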

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        _processor.ProcessMessageAsync += OnMessageAsync;
        _processor.ProcessErrorAsync   += OnErrorAsync;
        await _processor.StartProcessingAsync(ct);
    }

    private async Task OnMessageAsync(ProcessMessageEventArgs args)
    {
        var evt = args.Message.Body.ToObjectFromJson<AppConfigChangeEvent>();

        if (evt?.EventType == "Microsoft.AppConfiguration.KeyValueModified")
        {
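            // RefreshAsync only contacts App Configuration once the cached values
            // have expired or been marked dirty (for example via
            // IConfigurationRefresher.ProcessPushNotification), so invalidate first.
            await _refresher.RefreshAsync();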
            _logger.LogInformation(
                "Flag cache refreshed. Key={Key} CorrelationId={Id}",
                evt.Key, args.Message.CorrelationId);
        }
        await args.CompleteMessageAsync(args.Message);
    }

    // Always log + emit a metric here. Never rethrow — let the processor recover.
    private Task OnErrorAsync(ProcessErrorEventArgs args)
    {
        _logger.LogError(
            args.Exception,
            "Service Bus error in flag refresh consumer. " +
            "Source={Source} EntityPath={EntityPath}. " +
            "Falling back to polling TTL.",
            args.ErrorSource,
            args.EntityPath);

        // Emit a metric so your alert rule fires if this happens repeatedly
        // e.g. _telemetry.TrackMetric("FeatureFlagRefresh.ServiceBusError", 1);

        return Task.CompletedTask; // Do not throw — processor will retry
    }
}

⚠️ Anti-Pattern Warning: Do not make push refresh a hard dependency. Service Bus has its own availability SLA. Push refresh is an optimisation that reduces typical propagation from 30s to <2s. Always keep the polling path active as the fallback.

8. Building a Custom Feature Filter — TenantFilter In Full

The built-in Microsoft.Targeting filter handles most multi-tenant use cases. But there are scenarios where you need filter logic that goes beyond what targeting expressions support — for example, checking a tenant's contract state from a database, applying geo-regulatory rules, or gating on a tenant's Azure subscription SKU.

Here is a complete, production-ready TenantFilter implementation:

The Filter Parameters Contract

// The parameters object that maps to the "parameters" block
// in the App Configuration feature flag JSON
public class TenantFilterParameters
{
    // Explicit tenant IDs that should always have the flag ON
    public List<string> AllowedTenants { get; set; } = [];

    // Tenant IDs that should always have the flag OFF (denylist overrides allowlist)
    public List<string> BlockedTenants { get; set; } = [];

    // Optional: only allow tenants on specific subscription tiers
    public List<string> RequiredTiers { get; set; } = [];
}

The Filter Implementation

[FilterAlias("TenantFilter")]
public class TenantFilter : IFeatureFilter
{
    private readonly IHttpContextAccessor _http;
    private readonly ITenantRepository _tenantRepo;
    private readonly ILogger<TenantFilter> _logger;

    public TenantFilter(
        IHttpContextAccessor http,
        ITenantRepository tenantRepo,
        ILogger<TenantFilter> logger)
    {
        _http       = http;
        _tenantRepo = tenantRepo;
        _logger     = logger;
    }

    public async Task<bool> EvaluateAsync(FeatureFilterEvaluationContext context)
    {
        var parameters = context.Parameters
            .Get<TenantFilterParameters>() ?? new TenantFilterParameters();

        var tenantId = _http.HttpContext?.User
            .FindFirstValue("tid");

        if (string.IsNullOrEmpty(tenantId))
        {
            _logger.LogDebug(
                "TenantFilter: no tenant claim found for flag {Flag}. Returning false.",
                context.FeatureName);
            return false;
        }

        // Denylist always wins — an explicitly blocked tenant gets OFF regardless
        if (parameters.BlockedTenants.Contains(tenantId, StringComparer.OrdinalIgnoreCase))
        {
            _logger.LogInformation(
                "TenantFilter: tenant {TenantId} is blocked for flag {Flag}.",
                tenantId, context.FeatureName);
            return false;
        }

        // Explicit allowlist: when non-empty it is authoritative. Listed tenants get ON,
        // every other tenant gets OFF, and the tier gate below is not consulted.
        if (parameters.AllowedTenants.Count > 0)
        {
            return parameters.AllowedTenants
                .Contains(tenantId, StringComparer.OrdinalIgnoreCase);
        }

        // Tier gate: optionally restrict to tenants on specific subscription tiers
        if (parameters.RequiredTiers.Count > 0)
        {
            var tenant = await _tenantRepo.GetAsync(tenantId);
            if (tenant is null)
            {
                _logger.LogWarning(
                    "TenantFilter: tenant {TenantId} not found in repository. Returning false.",
                    tenantId);
                return false;
            }

            return parameters.RequiredTiers
                .Contains(tenant.SubscriptionTier, StringComparer.OrdinalIgnoreCase);
        }

        // No rules configured — default open (all tenants pass)
        return true;
    }
}

The Corresponding App Configuration JSON

{
  "id": "BulkExportV2",
  "enabled": true,
  "conditions": {
    "client_filters": [
      {
        "name": "TenantFilter",
        "parameters": {
          "AllowedTenants": [],
          "BlockedTenants": ["tenant-under-legal-hold-123"],
          "RequiredTiers":  ["enterprise", "enterprise-plus"]
        }
      }
    ]
  }
}

Registration

builder.Services
    .AddFeatureManagement()
    .AddFeatureFilter<TenantFilter>();

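// TenantFilter depends on ITenantRepository. AddFeatureManagement() registers
// filters as singletons, so the repository must be a singleton too
// (or switch to AddScopedFeatureManagement() to keep a scoped repository).
builder.Services.AddSingleton<ITenantRepository, CosmosTenantRepository>();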

Performance Note: TenantFilter hits ITenantRepository on every evaluation when RequiredTiers is configured. Cache the tenant record in IMemoryCache with a short TTL (60 seconds) rather than going to Cosmos DB on every request. The filter itself has no internal caching — that is your responsibility.
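A minimal sketch of that caching layer, written with only the base class library so it stands alone. In a real service you would more likely use IMemoryCache as suggested above; WithTtlCache, the injected clock, and the string-valued tier lookup are hypothetical choices made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Wrap an async tier lookup with a per-tenant TTL cache.
// Concurrent misses for the same tenant may each call the lookup once;
// that is acceptable here because the lookup is idempotent.
static Func<string, Task<string>> WithTtlCache(
    Func<string, Task<string>> lookup,
    TimeSpan ttl,
    Func<DateTimeOffset> clock)
{
    var cache = new ConcurrentDictionary<string, (string Tier, DateTimeOffset ExpiresAt)>(
        StringComparer.OrdinalIgnoreCase);

    return async tenantId =>
    {
        if (cache.TryGetValue(tenantId, out var entry) && clock() < entry.ExpiresAt)
            return entry.Tier;                      // fresh hit: no repository call

        var tier = await lookup(tenantId);          // miss or expired: go to the repository
        cache[tenantId] = (tier, clock() + ttl);
        return tier;
    };
}

int repositoryCalls = 0;
var now = DateTimeOffset.UtcNow;
var cachedLookup = WithTtlCache(
    tenantId => { repositoryCalls++; return Task.FromResult("enterprise"); },
    TimeSpan.FromSeconds(60),
    () => now);

await cachedLookup("fabrikam-ltd");
await cachedLookup("fabrikam-ltd");       // served from cache
now += TimeSpan.FromSeconds(61);
await cachedLookup("fabrikam-ltd");       // TTL elapsed: repository is called again
Console.WriteLine(repositoryCalls);       // 2
```

Passing the clock in as a function keeps expiry testable without sleeping. The trade-off is the same as for the flag cache itself: a tier change can take up to one TTL to become visible to the filter.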

9. Variants and Experimentation — Beyond Boolean Flags

The Microsoft.FeatureManagement v4 SDK introduced variants — the ability to assign different values to different user segments, not just ON/OFF. This is the foundation of A/B testing and multivariate experimentation without reaching for a third-party platform.

Why Variants Matter

A boolean flag answers: should this user see the new feature? A variant answers: which version of the feature should this user see?

The canonical use case: testing two different checkout button colours, three different pricing display formats, or two versions of an AI recommendation algorithm — where the outcome metric differs between variants, not just whether the code path runs.

Defining a Variant Flag

{
  "id": "CheckoutButtonVariant",
  "enabled": true,
  "variants": [
    {
      "name": "Control",
      "configuration_value": "blue"
    },
    {
      "name": "Treatment",
      "configuration_value": "green"
    }
  ],
  "allocation": {
    "default_when_enabled": "Control",
    "percentile": [
      { "variant": "Control",   "from": 0,  "to": 50 },
      { "variant": "Treatment", "from": 50, "to": 100 }
    ]
  },
  "telemetry": {
    "enabled": true
  }
}

Evaluating a Variant in .NET

public class CheckoutController : ControllerBase
{
    private readonly IVariantFeatureManager _variantManager;

    [HttpGet("checkout")]
    public async Task<IActionResult> GetCheckoutPage()
    {
        // GetVariantAsync returns the assigned variant for the current user
        var variant = await _variantManager
            .GetVariantAsync("CheckoutButtonVariant", HttpContext.RequestAborted);

        var buttonColour = variant?.Configuration?.Value ?? "blue";

        // Emit telemetry — this is how you measure which variant converts better
        _telemetry.TrackEvent("CheckoutPage.Rendered", new Dictionary<string, string>
        {
            ["Variant"]       = variant?.Name ?? "default",
            ["ButtonColour"]  = buttonColour,
            ["UserId"]        = User.FindFirstValue(ClaimTypes.NameIdentifier),
        });

        return Ok(new { buttonColour });
    }
}

The Honest Limitation — and What to Do About It

The Microsoft.FeatureManagement v4 SDK gives you variant assignment and basic telemetry emission. It does not give you statistical analysis, confidence intervals, p-values, or guardrail metric monitoring out of the box.

For a lightweight in-house analysis, you can wire the emitted CheckoutPage.Rendered events directly into a KQL query in Log Analytics to measure conversion lift between variants:

// Step 1: join variant assignment to conversion events
let assignments = customEvents
    | where name == "CheckoutPage.Rendered"
    | project UserId = tostring(customDimensions["UserId"]),
              Variant = tostring(customDimensions["Variant"]),
              SessionId = session_Id;

let conversions = customEvents
    | where name == "Order.Placed"
    | project UserId = tostring(customDimensions["UserId"]),
              SessionId = session_Id;

// Step 2: compute conversion rate per variant
assignments
| join kind=leftouter conversions on UserId, SessionId
| summarize
    Users     = dcount(UserId),
    Converted = dcountif(UserId, isnotempty(SessionId1))
  by Variant
| extend ConversionRate = round(100.0 * Converted / Users, 2)
| project Variant, Users, Converted, ConversionRate

This tells you which variant converts better and at what sample size. It does not compute statistical significance automatically — for that, feed the raw counts into a two-proportion z-test (trivial in Python or R) or use Azure Experimentation (currently in preview), which handles the full experiment lifecycle including guardrail metrics and auto-stopping rules.
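For reference, the two-proportion z-test mentioned above is equally short in C#. This sketch uses the standard pooled-standard-error form; TwoProportionZ is a hypothetical helper name and the counts are made up for illustration.

```csharp
using System;

// z statistic for the difference between two conversion rates,
// using the pooled proportion for the standard error.
static double TwoProportionZ(long convertedA, long usersA, long convertedB, long usersB)
{
    double pA = (double)convertedA / usersA;
    double pB = (double)convertedB / usersB;
    double pooled = (double)(convertedA + convertedB) / (usersA + usersB);
    double standardError = Math.Sqrt(pooled * (1 - pooled) * (1.0 / usersA + 1.0 / usersB));
    return (pB - pA) / standardError;
}

// Control: 200 of 1000 converted. Treatment: 250 of 1000 converted.
double z = TwoProportionZ(convertedA: 200, usersA: 1000, convertedB: 250, usersB: 1000);
bool significantAt95 = Math.Abs(z) > 1.96;   // two-sided test, alpha = 0.05

Console.WriteLine($"z = {z:F2}, significant at 95%: {significantAt95}");
```

With these counts z is approximately 2.68, which clears the 1.96 threshold. Feed in the Users and Converted columns from the KQL query above, and decide the sample size before looking at results so that repeated peeking does not inflate false positives.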

10. Consistency vs. Availability Trade-offs

Feature flag systems force a practical CAP trade-off: the local cache keeps evaluation available and fast, at the price of replicas briefly disagreeing about a flag's state. The wrong consistency model causes split-brain bugs that are extremely painful to debug.

The Staleness Window

During the 30-second refresh window, two replicas of the same service can have different views of a flag. Replica A has NewCheckout=ON; Replica B still has NewCheckout=OFF. If a user's retry goes to a different pod, they see inconsistent behaviour.

This is an accepted trade-off. Three mitigations:

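Front Door session affinity: Route a user to the same origin for the duration of a session. Front Door pins the origin, not the pod, so add ingress-level affinity if you need stickiness down to an individual replica. This masks replica staleness at the cost of slightly uneven load distribution.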
Idempotent design: Both code paths should produce equivalent state mutations. If both old and new checkout create the same order record, a mid-session switch is invisible to the user.
Shorter TTL for kill switches: Maintain a separate configuration category for emergency flags with a 5-second TTL. This narrows the consistency window for safety-critical toggles specifically.

Cross-Service Flag Consistency — The Hard Part

In a microservices architecture, NewCheckout may be read by OrderService, InventoryService, and NotificationService. If it is enabled in OrderService but not yet propagated to InventoryService, you can create partially executed distributed transactions. This is the correctness hazard that kills you.

The solution: evaluate the flag exactly once at the system boundary that owns the transaction, then propagate the decision via request context — not the flag name:

// OrderController — evaluate at the API boundary, once, for this transaction
var useNewCheckout = await _features.IsEnabledAsync(FeatureFlags.NewCheckout);

// Propagate the DECISION — not a re-evaluation request
var client = _factory.CreateClient("InventoryService");
client.DefaultRequestHeaders.Add(
    "X-Feature-NewCheckout", useNewCheckout ? "1" : "0");

// InventoryService reads the propagated decision — never re-evaluates the flag
// This ensures all services operate under one consistent flag state for this transaction
var useNewCheckout = httpContext.Request.Headers["X-Feature-NewCheckout"] == "1";

Core Principle: In a distributed transaction, the flag evaluation must happen exactly once, at the boundary that owns the transaction. Re-evaluating at each service boundary creates a distributed consistency hazard — each service might read a different cached state, splitting the transaction across different code paths.
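
The receiving side of that principle can be sketched as a small codec for the propagated header. This is a minimal sketch, not part of any SDK: PropagatedFlag and its members are hypothetical names, and the only thing taken from the article is the X-Feature-NewCheckout header. A missing or unrecognised value decodes to the safe default, so a caller that forgot to propagate the decision lands on the old code path.

```csharp
using System;

// Hypothetical helper: encodes and decodes a propagated flag decision.
public static class PropagatedFlag
{
    public const string NewCheckoutHeader = "X-Feature-NewCheckout";

    // Called once, at the boundary that owns the transaction.
    public static string Encode(bool enabled) => enabled ? "1" : "0";

    // Called by downstream services. Missing or unrecognised values
    // decode to the safe default instead of throwing.
    public static bool Decode(string? headerValue, bool safeDefault = false)
    {
        if (headerValue is null) return safeDefault;

        return headerValue.Trim() switch
        {
            "1" => true,
            "0" => false,
            _   => safeDefault,
        };
    }
}
```

Downstream services should accept this header only from trusted internal callers; if it can arrive from the public edge, strip it there so that clients cannot choose their own code path.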

11. Multi-Region Topology

This is the most architect-specific section of this article and the most commonly skipped in feature flag write-ups. If you are running in multiple Azure regions — say, eastus and westeurope — the following topology questions have direct production consequences.

Azure App Configuration's Replication Model

App Configuration is a single-region, geo-redundant service. When you create an App Configuration instance, you choose a primary region. Azure replicates data to a secondary region within the same geography for disaster recovery, but the secondary is not an active read replica — it is a failover target. There is no multi-master write capability and no automatic cross-region read distribution.

The implication: every replica in your westeurope AKS cluster polling the App Configuration instance you created in eastus is making a cross-region API call — adding 60–100ms of latency to the background refresh, and creating a dependency on cross-region network health for flag propagation.

The Recommended Multi-Region Topology

Create one App Configuration instance per region and use your CI/CD pipeline to synchronise flag state across instances:

[Release Pipeline — Azure DevOps]
        │
        ├──► az appconfig feature set ... --name appconfig-eastus    --label production
        │
        └──► az appconfig feature set ... --name appconfig-westeurope --label production

Each regional AKS cluster reads from the App Configuration instance in its own region. Cross-region latency is eliminated from the refresh path. Regional isolation also means that a flag change in eastus is not blocked when westeurope's App Configuration instance is unavailable.

// Each region reads from its own App Configuration instance
// AZURE_APPCONFIG_ENDPOINT is set per-region in the AKS pod environment
// eastus pods:     https://my-appconfig-eastus.azconfig.io
// westeurope pods: https://my-appconfig-westeurope.azconfig.io
// Fail with a clear message if the variable is missing, rather than a null Uri error
var appConfigEndpoint = Environment.GetEnvironmentVariable("AZURE_APPCONFIG_ENDPOINT")
    ?? throw new InvalidOperationException("AZURE_APPCONFIG_ENDPOINT is not set for this pod.");

builder.Configuration.AddAzureAppConfiguration(options =>
    options.Connect(new Uri(appConfigEndpoint), new ManagedIdentityCredential())
           .UseFeatureFlags(ff => { ff.Label = environmentLabel; })
);

Regional Drift — The New Problem

Dual-instance topology introduces a new failure mode: regional drift. If the DevOps pipeline fails after updating eastus but before updating westeurope, your two regions run under different flag states indefinitely — until someone notices or the pipeline retries.

Mitigations:

Pipeline atomicity: Make the multi-region flag update a single pipeline job with a fail-fast policy. If any region update fails, alert, stop, and restore the regions already updated to their previous state. Never leave the regions partially updated.
Drift detection query: Run this KQL query hourly as a scheduled monitor alert. It compares flag evaluation telemetry across regions and alerts if the same flag has meaningfully different exposure rates in different regions — which is the fingerprint of drift:

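customEvents
| where name == "FeatureFlag.Evaluated"
| extend Region  = tostring(customDimensions["Region"]) // custom dimension emitted with the event (see Section 15)
| extend Flag    = tostring(customDimensions["Flag"])
| extend Enabled = tostring(customDimensions["Enabled"]) =~ "true"
| summarize ExposurePct = round(100.0 * countif(Enabled) / count(), 1)
  by Flag, Region, bin(timestamp, 1h)
// Keep regions at 0% exposure: a region stuck at 0 is the clearest sign of drift
| summarize MinPct = min(ExposurePct), MaxPct = max(ExposurePct), Regions = dcount(Region)
  by Flag, timestamp
// Alert if two regions differ by more than 5 percentage points for the same flag
| where Regions > 1 and MaxPct - MinPct > 5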
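
The pipeline atomicity mitigation above can be sketched as an all-or-revert loop. This is a sketch under stated assumptions, not a real pipeline task: MultiRegionFlagUpdate is a hypothetical name, and the apply and revert callbacks stand in for whatever writes or restores flag state in one region (for example a wrapper around the az appconfig feature set call).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class MultiRegionFlagUpdate
{
    // apply/revert are hypothetical callbacks that write (or restore)
    // flag state in a single region. Returns true only if every region
    // was updated.
    public static async Task<bool> ApplyAllOrRevertAsync(
        IReadOnlyList<string> regions,
        Func<string, Task> apply,
        Func<string, Task> revert)
    {
        var updated = new List<string>();

        foreach (var region in regions)
        {
            try
            {
                await apply(region);
                updated.Add(region);
            }
            catch (Exception)
            {
                // Fail fast: restore the regions already changed, newest first.
                // A revert that throws propagates to the caller, which should alert.
                for (var i = updated.Count - 1; i >= 0; i--)
                    await revert(updated[i]);

                return false;
            }
        }

        return true;
    }
}
```

A false return is the signal to alert and halt the release stage. The revert step is what keeps a failed run from leaving one region on the new state alone; it narrows the drift window but cannot make the update truly atomic, which is why the drift detection query is still needed.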

App Configuration geo-replication:

Microsoft's geo-replication feature lets you add replicas of a store in other regions, managed entirely by App Configuration, with no dual-instance pipeline synchronisation required. The SDK can fail over to a replica if the primary endpoint is unreachable.

The trade-off versus the dual-instance approach:

	Geo-Replication	Dual-Instance
Operational overhead	Low: Microsoft manages replication	High: your pipeline must keep the regions in step
Drift risk	Low: replication is managed, though replicas are eventually consistent	Real: partial pipeline failures cause drift
Availability status	Generally available; confirm tier requirements and SLA terms in the current documentation	GA
Failover control	Automatic, SDK-managed	Manual / pipeline-driven

Recommendation: choose dual-instance when you need explicit, pipeline-driven control over what each region sees, and accept the drift risk that comes with it. Choose geo-replication when managed replication and automatic failover matter more, and confirm the current SLA and pricing tier before relying on it. The rest of this section assumes the dual-instance topology.

Regional Outage Behaviour

If your primary region's App Configuration instance becomes unavailable:

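In-process cache: Services continue operating with their last known flag state. When a refresh attempt fails, the provider keeps serving the values it last loaded and retries on a later cycle, so a running process does not revert to embedded defaults on its own.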
Push refresh via Service Bus: If the regional Service Bus namespace is also affected, push refresh fails silently. The polling fallback continues.
Startup of new replicas: New pods starting during a regional App Configuration outage will use the embedded appsettings.json defaults (see Section 6, Mitigation 2). This is why those defaults are non-negotiable — not a nice-to-have.

SLA Reality Check: Azure App Configuration's SLA is 99.9% (approximately 8.8 hours of downtime per year). For a system with a 99.95% availability target, App Configuration cannot be a hard runtime dependency. It must be a correction layer over embedded defaults, not the single source of truth that your application cannot start without.
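
The arithmetic behind that comparison is easy to check. A minimal sketch, assuming a 365-day year; SlaBudget is a hypothetical helper name:

```csharp
using System;

public static class SlaBudget
{
    // Hours of permitted downtime per 365-day year for a given
    // availability percentage (for example 99.9).
    public static double DowntimeHoursPerYear(double availabilityPct)
        => (100.0 - availabilityPct) / 100.0 * 365 * 24;
}
```

For 99.9% this gives roughly 8.76 hours per year, and for 99.95% roughly 4.38 hours. A dependency that is allowed twice the downtime of the system that depends on it cannot sit on the critical path.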

12. Flag Schema Evolution — Changing Rules Mid-Rollout

This is a production gotcha that bites teams the first time they do it, and it has no coverage in the official documentation. The question is: if you change a flag's targeting configuration while it is live and serving traffic, what happens to users already in the "on" cohort?

The Consistent Hashing Contract

The Microsoft.Targeting filter hashes a string seed composed of {userId}\n{featureName}. The current .NET library uses SHA-256 for this; treat the exact algorithm as an implementation detail. The hash maps each user deterministically to a position in the 0–100 range, described here as a bucket number (0–99) for simplicity. Whether they are in the "on" cohort depends on whether their bucket falls within the RolloutPercentage range.

What this means for schema changes:

Change You Make	Effect on Existing Users
Increase `DefaultRolloutPercentage` from 10 to 25	Users in buckets 10–24 are newly added to the "on" cohort. Users in buckets 0–9 stay on. No existing "on" users are turned off. Safe.
Decrease `DefaultRolloutPercentage` from 25 to 10	Users in buckets 10–24 are removed from the "on" cohort. They will see the old experience after the cache refreshes. Potentially disruptive.
Add a new Group rule	No effect on users covered by the existing `DefaultRolloutPercentage`. Group rules are evaluated first; the default percentage is the fallback.
Change the feature flag name	All users lose their bucket assignment. The new flag name produces a different hash, assigning users to entirely different buckets. Never rename a live flag.
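Add a `TenantFilter` to a flag already using `TargetingFilter`	Depends on the flag's requirement type. With the default (Any), filters combine with OR logic, so the new filter can only add users to the "on" cohort. With requirement type All, a user must pass every filter, and existing "on" users who do not pass the `TenantFilter` will be turned off. Breaking for affected users.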
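
Two rows of this table (raising the percentage and renaming the flag) can be checked with a small bucketing sketch. This is an illustration of the contract, not the library's code: RolloutBucket is a hypothetical name, and the hash used here is a stand-in chosen for the example rather than a statement about the filter's internals. Only the seed format {userId}\n{featureName} comes from the text above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RolloutBucket
{
    // Maps (userId, featureName) to a stable position in [0, 100].
    // The same inputs always produce the same position.
    public static double Position(string userId, string featureName)
    {
        byte[] seed = Encoding.UTF8.GetBytes($"{userId}\n{featureName}");
        byte[] hash = SHA256.HashData(seed);
        uint marker = BitConverter.ToUInt32(hash, 0);
        return marker / (double)uint.MaxValue * 100.0;
    }

    // A user is "on" when their position falls below the rollout percentage.
    public static bool IsOn(string userId, string featureName, double rolloutPercentage)
        => rolloutPercentage >= 100
           || Position(userId, featureName) < rolloutPercentage;
}
```

Because a user's position never changes for a given flag name, anyone who is on at 10% is still on at 25%: raising the threshold only adds users. Changing featureName changes the seed, so every position is recomputed and the cohort is reshuffled, which is why a rename is never safe on a live flag.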

The Safe Mid-Rollout Change Protocol

Never decrease rollout percentage during active user sessions without a maintenance window or user communication.
Never add a new filter to a live flag without first auditing what fraction of the current "on" cohort would be excluded.
Never rename a live flag. Create a new flag, migrate traffic to it, then archive the old one.
When adding a TenantFilter alongside TargetingFilter, audit the intersection: query App Insights for users currently seeing the flag-ON experience and verify the TenantFilter would not exclude them.

// Before adding TenantFilter: audit which tenants currently have NewCheckout=ON
customEvents
| where name == "FeatureFlag.Evaluated"
| where customDimensions["Flag"]    == "NewCheckout"
| where customDimensions["Enabled"] == "True"
| summarize UserCount = dcount(tostring(customDimensions["UserId"]))
  by TenantId = tostring(customDimensions["TenantId"])
| order by UserCount desc

13. Failure Modes and Mitigations

A flag system that fails open (all flags enabled) or fails closed (all flags disabled) can be as catastrophic as the bug it was meant to control. Every failure mode needs a defined safe default.

Failure Mode	Behaviour	Mitigation
App Configuration Unreachable at Startup	Application crashes during boot; Kubernetes marks pod unhealthy	Embed safe defaults in `appsettings.json`; configure startup timeout (see Section 6)
App Configuration Unreachable at Runtime	SDK keeps serving the last successfully loaded values and retries the refresh; only restarted or newly started pods fall back to embedded defaults	Safe defaults should always be the old code path. Document this per flag.
Refresh Storm (429 Throttling)	200 replicas refreshing on the same cycle hit rate limits; cache goes stale	Jitter refresh: `CacheExpiration + Random(0, 10s)`. Exponential backoff on 429.
Service Bus Unavailable	Push refresh fails; propagation falls back to polling TTL	Push is an optimisation, not a dependency. Monitor dead-letter queues.
Targeting Filter Exception	`IFeatureManager` throws; uncaught exception fails the request	Wrap all evaluations in try/catch. On exception: log, emit metric, return safe default.
Flag Name Typo	Flag not found → silently disabled. Feature never ships.	Use `FeatureFlags.cs` constants. CI contract test validates every constant (see Section 14).
Regional Drift	Two regions operate under different flag states after a partial pipeline run	Drift detection KQL query; atomic multi-region pipeline (see Section 11)
Cosmos DB Audit Write Failure	Audit trail incomplete	Audit writes must be async and non-blocking. Never block a flag change on an audit write.
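
The refresh-storm mitigation in the table combines two calculations: a jittered polling interval and a capped exponential backoff after throttling. A minimal sketch of both, with hypothetical names (RefreshSchedule, NextPoll, BackoffAfter429). The configuration provider may already apply its own retry and backoff behaviour, so check what your version does before layering this on top.

```csharp
using System;

public static class RefreshSchedule
{
    // Base interval plus a random offset in [0, maxJitter), so replicas
    // that started together do not all poll at the same instant.
    public static TimeSpan NextPoll(TimeSpan baseInterval, TimeSpan maxJitter, Random rng)
        => baseInterval
           + TimeSpan.FromMilliseconds(rng.NextDouble() * maxJitter.TotalMilliseconds);

    // Exponential backoff after consecutive 429 responses:
    // initial * 2^(failures - 1), never longer than cap.
    public static TimeSpan BackoffAfter429(int consecutiveFailures, TimeSpan initial, TimeSpan cap)
    {
        if (consecutiveFailures <= 0) return TimeSpan.Zero;

        // Clamp the exponent so the multiplication cannot overflow.
        double factor = Math.Pow(2, Math.Min(consecutiveFailures - 1, 30));
        double ms = Math.Min(initial.TotalMilliseconds * factor, cap.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }
}
```

With a 30-second base and 10 seconds of jitter, 200 replicas spread their polls across a 10-second band instead of arriving together. With a 1-second initial backoff and a 60-second cap, the third consecutive 429 waits 4 seconds and the tenth waits the full 60.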

The Defensive Wrapper — Non-Negotiable

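using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Microsoft.FeatureManagement;

public static class FeatureManagerExtensions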
{
    public static async Task<bool> IsEnabledSafeAsync(
        this IFeatureManager fm,
        string feature,
        ILogger logger,
        bool defaultValue = false)
    {
        try
        {
            return await fm.IsEnabledAsync(feature);
        }
        catch (Exception ex)
        {
            logger.LogError(ex,
                "Flag evaluation failed: {Feature}. Falling back to default={Default}",
                feature, defaultValue);

            Activity.Current?.SetTag("feature.eval.error", feature);

            return defaultValue;
        }
    }
}

14. Testing Strategy — Unit, Integration, and Contract Tests

This section is absent from most feature flag articles and is the first practical question a team asks when adopting this pattern. There are three distinct testing layers, and conflating them leads to brittle, slow tests.

Layer 1 — Unit Testing Flag-Gated Business Logic

The goal: test the business logic on both sides of a flag branch, independently of the flag evaluation mechanism. Use an in-memory IFeatureManager fake, not Moq — it is simpler and more readable:

// FakeFeatureManager — set flags to specific values for a given test
public class FakeFeatureManager : IFeatureManager
{
    private readonly Dictionary<string, bool> _flags;

    public FakeFeatureManager(Dictionary<string, bool> flags)
        => _flags = flags;

    public Task<bool> IsEnabledAsync(string feature)
        => Task.FromResult(_flags.TryGetValue(feature, out var val) && val);

    public Task<bool> IsEnabledAsync<TContext>(string feature, TContext context)
        => IsEnabledAsync(feature);

    public IAsyncEnumerable<string> GetFeatureNamesAsync()
        => _flags.Keys.ToAsyncEnumerable();
}

// Usage in xUnit tests
// ICheckoutPipeline is assumed to be the interface both pipelines implement.
// Moq is used only for the pipeline collaborators, never for IFeatureManager.
public class CheckoutServiceTests
{
    private readonly Mock<ICheckoutPipeline> _newPipeline    = new();
    private readonly Mock<ICheckoutPipeline> _legacyPipeline = new();

    [Fact]
    public async Task PlaceOrder_WhenNewCheckoutEnabled_UsesNewPipeline()
    {
        var features = new FakeFeatureManager(
            new() { [FeatureFlags.NewCheckout] = true });

        var sut = new CheckoutService(features, _newPipeline.Object, _legacyPipeline.Object);

        await sut.PlaceOrderAsync(TestCart.Build(), CancellationToken.None);

        _newPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Once);
        _legacyPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Never);
    }

    [Fact]
    public async Task PlaceOrder_WhenNewCheckoutDisabled_UsesLegacyPipeline()
    {
        var features = new FakeFeatureManager(
            new() { [FeatureFlags.NewCheckout] = false });

        var sut = new CheckoutService(features, _newPipeline.Object, _legacyPipeline.Object);

        await sut.PlaceOrderAsync(TestCart.Build(), CancellationToken.None);

        _legacyPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Once);
        _newPipeline.Verify(p => p.ExecuteAsync(It.IsAny<Cart>(), It.IsAny<CancellationToken>()), Times.Never);
    }
}

Layer 2 — Integration Testing the Targeting Filter Logic

The goal: verify that your TenantFilter (or any other custom filter) produces the correct ON/OFF decision for given inputs. Use WebApplicationFactory and an in-memory configuration provider rather than hitting a real App Configuration instance:

public class TenantFilterIntegrationTests : IClassFixture<WebApplicationFactory<Program>>
{
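    private readonly WebApplicationFactory<Program> _factory;

    // xUnit injects the shared factory through the constructor
    public TenantFilterIntegrationTests(WebApplicationFactory<Program> factory)
        => _factory = factory;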

    [Theory]
    [InlineData("tenant:fabrikam-ltd", true)]   // in AllowedTenants
    [InlineData("tenant:unknown-corp", false)]  // not in AllowedTenants, RequiredTier not met
    public async Task TenantFilter_EvaluatesCorrectly(string tenantGroup, bool expectedEnabled)
    {
        var client = _factory.WithWebHostBuilder(builder =>
        {
            builder.ConfigureAppConfiguration(config =>
            {
                // Override App Configuration with an in-memory provider
                config.AddInMemoryCollection(new Dictionary<string, string?>
                {
                    // Feature flag JSON encoded as IConfiguration keys
                    ["FeatureManagement:BulkExportV2:EnabledFor:0:Name"]
                        = "TenantFilter",
                    ["FeatureManagement:BulkExportV2:EnabledFor:0:Parameters:AllowedTenants:0"]
                        = "tenant:fabrikam-ltd",
                });
            });
        }).CreateClient();

        // Set the tenant claim on the test request
        // (requires test auth middleware that reads X-Test-Tenant header)
        client.DefaultRequestHeaders.Add("X-Test-Tenant", tenantGroup);

        var response = await client.GetAsync("/api/feature/BulkExportV2/status");
        var result   = await response.Content.ReadFromJsonAsync<FeatureStatusResponse>();

        Assert.Equal(expectedEnabled, result!.IsEnabled);
    }
}

Layer 3 — Contract Tests: CI Validation That Constants Match App Configuration

The most important test most teams never write. A flag constant in FeatureFlags.cs that has no corresponding entry in App Configuration evaluates to false without raising any error, so the feature simply never ships. Make it a build failure instead:

// FeatureFlagContractTests.cs — runs in CI against a real App Configuration instance
// Uses the test environment label, not production
public class FeatureFlagContractTests
{
    private readonly ConfigurationClient _client;

    public FeatureFlagContractTests()
    {
        var endpoint = Environment.GetEnvironmentVariable("APPCONFIG_TEST_ENDPOINT")!;
        _client = new ConfigurationClient(new Uri(endpoint), new DefaultAzureCredential());
    }

    [Fact]
    public async Task AllFlagConstants_MustExistInAppConfiguration()
    {
        // Discover all flag name constants via reflection
        var declaredFlags = typeof(FeatureFlags)
            .GetFields(BindingFlags.Public | BindingFlags.Static | BindingFlags.FlattenHierarchy)
            .Where(f => f.IsLiteral && !f.IsInitOnly && f.FieldType == typeof(string))
            .Select(f => (string)f.GetRawConstantValue()!)
            .ToList();

        var missingFlags = new List<string>();

        foreach (var flag in declaredFlags)
        {
            try
            {
                var key = $".appconfig.featureflags/{flag}";
                await _client.GetConfigurationSettingAsync(key, label: "test");
            }
            catch (RequestFailedException ex) when (ex.Status == 404)
            {
                missingFlags.Add(flag);
            }
        }

        Assert.True(
            missingFlags.Count == 0,
            $"The following flag constants in FeatureFlags.cs have no corresponding entry " +
            $"in App Configuration (label=test): {string.Join(", ", missingFlags)}");
    }
}

CI Integration: Run Layer 3 tests in a dedicated feature-flag-contract stage in your Azure DevOps pipeline, after infrastructure provisioning but before deployment. Gate the deployment on the contract test passing.

15. Observability

A feature flag without observability is not a controlled rollout — it is a controlled guess. You need to answer three questions from a dashboard in real time:

What percentage of traffic is hitting the new code path right now?
Is the new code path's error rate higher than the baseline?
Is p99 latency regressing?

Structured Telemetry Pattern

var sw          = Stopwatch.StartNew();
var flagEnabled = await _features.IsEnabledSafeAsync(FeatureFlags.NewCheckout, _logger);

_telemetry.TrackEvent("FeatureFlag.Evaluated", new Dictionary<string, string>
{
    ["Flag"]          = FeatureFlags.NewCheckout,
    ["Enabled"]       = flagEnabled.ToString(),
    ["UserId"]        = _context.UserId,
    ["TenantId"]      = _context.TenantId,
    ["CorrelationId"] = Activity.Current?.TraceId.ToString(),
    ["Environment"]   = _env.EnvironmentName,
    ["Region"]        = Environment.GetEnvironmentVariable("AZURE_REGION") ?? "unknown",
});

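// Tag the span so it is correlated across all downstream services via W3C TraceContext
using var activity = ActivitySource.StartActivity("Checkout.PlaceOrder");
activity?.SetTag("feature.new_checkout", flagEnabled);
activity?.SetTag("user.tenant_id", _context.TenantId);

// ... execute the checkout path selected by flagEnabled ...

// Emit the latency metric read by the P99 query in this section, tagged with the flag state
sw.Stop();
_telemetry.TrackMetric("feature.NewCheckout.latency_ms", sw.Elapsed.TotalMilliseconds,
    new Dictionary<string, string> { ["Enabled"] = flagEnabled.ToString() });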

The Three KQL Queries You Need on Day One

// 1. Live exposure ratio — what % of traffic is on the new path?
customEvents
| where name == "FeatureFlag.Evaluated"
| where customDimensions["Flag"] == "NewCheckout"
| summarize
    Total   = count(),
    Enabled = countif(customDimensions["Enabled"] == "True")
  by bin(timestamp, 5m)
| extend ExposurePct = round(100.0 * Enabled / Total, 2)
| project timestamp, ExposurePct, Total

// 2. Error rate split by flag state — is the new path degraded?
requests
| join kind=leftouter (
    customEvents
    | where name == "FeatureFlag.Evaluated"
    | where customDimensions["Flag"] == "NewCheckout"
    // tostring() so that coalesce() below compares string with string
    | project operation_Id, FlagEnabled = tostring(customDimensions["Enabled"])
  ) on operation_Id
| summarize
    ErrorRate = round(100.0 * countif(success == false) / count(), 2)
  by FlagEnabled = coalesce(FlagEnabled, "unknown")

// 3. P99 latency by flag state — is there a performance regression?
customMetrics
| where name == "feature.NewCheckout.latency_ms"
| summarize P99 = percentile(value, 99)
  by Enabled = tostring(customDimensions["Enabled"]),
     bin(timestamp, 10m)
| render timechart

Alert Rules

Condition	Severity	Action
Error rate (flag=ON) − (flag=OFF) > 2 percentage points	Critical	PagerDuty alert; hold next rollout increment. Wire to pipeline if you want auto-pause: add a gate task that calls the Azure DevOps Approvals API to block the next stage.
P99 latency increase > 200ms (ON vs OFF)	High	Alert release manager; hold next percentage increment
`FeatureFlag.Evaluated` events drop to 0 for 5 min	High	SDK may be failing; all users on defaults
Kill-switch toggled in production	Audit	Notify CISO + release manager via Teams webhook
Flag cache age > 2× TTL	Medium	App Configuration polling may be throttled or failing
Regional exposure drift > 5 percentage points between regions	Medium	Possible regional drift; trigger pipeline re-sync
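
The first two alert rules amount to a gate decision that a pipeline stage can evaluate before the next rollout increment. A minimal sketch, assuming the error rates and P99 values have already been fetched from the queries above; RolloutGate, PathStats, and GateDecision are hypothetical names, and the thresholds default to the values in the table.

```csharp
public enum GateDecision { Proceed, Hold }

// Error rate in percent and P99 latency in milliseconds for one code path.
public readonly record struct PathStats(double ErrorRatePct, double P99LatencyMs);

public static class RolloutGate
{
    // Hold the rollout if the flag-ON path regresses on either signal
    // by more than the allowed delta relative to the flag-OFF path.
    public static GateDecision Evaluate(
        PathStats flagOn,
        PathStats flagOff,
        double maxErrorDeltaPts = 2.0,
        double maxP99DeltaMs = 200.0)
    {
        bool errorRegressed   = flagOn.ErrorRatePct - flagOff.ErrorRatePct > maxErrorDeltaPts;
        bool latencyRegressed = flagOn.P99LatencyMs - flagOff.P99LatencyMs > maxP99DeltaMs;

        return (errorRegressed || latencyRegressed)
            ? GateDecision.Hold
            : GateDecision.Proceed;
    }
}
```

The comparison is always a delta against the flag-OFF baseline rather than an absolute threshold, so a system-wide incident that degrades both paths equally does not get blamed on the new feature.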

16. Security and Governance

RBAC — Least Privilege, Applied Precisely

Role	Assigned To	What They Can Do
App Configuration Data Reader	Service principal / Managed Identity of your compute	Read flag values only. Can never modify flags.
App Configuration Data Owner	Release managers (humans only, via Entra ID PIM)	Create, modify, delete flag configurations in production
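App Configuration Contributor (ARM)	DevOps pipeline service principal	Infrastructure provisioning only, with no data plane write access. The pipeline stage that runs az appconfig feature set (Section 11) writes flag data, so it needs a separate identity with App Configuration Data Owner, used only by that stage.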
App Configuration Reader (ARM)	Developers	View-only access in the portal; no flag modification

⚠️ Critical: Service principals that run your code must never have write access to App Configuration. If a service can toggle its own flags, a compromised service can disable its own rate limiting or kill-switches. Data Reader only for runtime identities — no exceptions.

Immutable Audit Trail — Event Grid → Cosmos DB

[Function("FeatureFlagAudit")]
[CosmosDBOutput("featureflags", "audit",
    Connection = "CosmosDbConnection")]
public AuditRecord Run([EventGridTrigger] EventGridEvent evt)
{
    var data = evt.Data.ToObjectFromJson<AppConfigKeyValueModifiedData>()!;

    // Isolated worker model: the returned record is written by the output binding.
    // The flag change is already committed when this event arrives, so a failed
    // audit write cannot block it; Event Grid retries delivery if the function fails.
    return new AuditRecord
    {
        Id           = Guid.NewGuid().ToString(),
        FlagKey      = data.Key,
        Label        = data.Label,       // the environment
        ChangedAt    = evt.EventTime,
        // The key-value event payload carries key, label, and etag but no caller identity.
        // ModifiedBy is assumed to be enriched from another source, such as the store's audit logs.
        ChangedBy    = data.ModifiedBy,  // Entra ID object ID, once enriched
        EventType    = evt.EventType,    // Modified | Deleted
        PartitionKey = data.Key
    };
}

Flag Lifecycle Governance

Flags with no expiry date are technical debt with a detonator. Enforce these rules:

Every flag must have a plannedExpiryDate metadata field set at creation time.
A nightly DevOps pipeline alerts (or auto-archives) flags past their expiry.
Short-lived release flags must be cleaned up within 2 sprints.
Long-lived ops/experiment flags require quarterly review approval.
A CI check validates that every constant in FeatureFlags.cs has a corresponding entry in App Configuration. Orphaned constants are a build failure.
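
The nightly expiry check in these rules reduces to a filter over flag metadata. A minimal sketch, assuming the flag names and plannedExpiryDate values have already been read from wherever you store them; FlagMetadata and FlagExpiryAudit are hypothetical names. A flag with no expiry date is reported as a violation too, since the first rule requires one at creation time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record FlagMetadata(string Name, DateOnly? PlannedExpiryDate);

public static class FlagExpiryAudit
{
    // Returns the names of flags that are past their planned expiry
    // or that were created without an expiry date, sorted by name.
    public static IReadOnlyList<string> FindViolations(
        IEnumerable<FlagMetadata> flags, DateOnly today)
        => flags
            .Where(f => f.PlannedExpiryDate is null || f.PlannedExpiryDate < today)
            .Select(f => f.Name)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
}
```

A flag that expires today is not yet a violation; it becomes one the following night. Whether the pipeline alerts or auto-archives on a non-empty result is a policy choice that sits outside this check.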

17. Common Pitfalls and Anti-Patterns

🚫 Pitfall 1 — Flag Debt: The Silent Killer

After 18 months you will have 200 active flags, nested dependencies ("FlagB only makes sense if FlagA is also enabled"), and engineers who are afraid to delete anything. When FlagA gets killed, FlagB's behaviour is undefined — and nobody knows what code actually runs in production.

Rule: Each flag must be independently meaningful and independently safe to disable. If you find yourself writing if (flagA && flagB) in business logic, restructure your flags.

🚫 Pitfall 2 — Using Flags as Configuration Stores

Feature flags are binary or variant routing decisions, not configuration values. Do not use a flag to store a timeout value, a batch size, or a URL. Use App Configuration key-value pairs for configuration. Reserve the feature flags API for code path decisions.

🚫 Pitfall 3 — Random Percentage Evaluation (The Flickering Bug)

Always use deterministic, user-ID-based hashing for percentage rollouts, which Microsoft.Targeting does for you automatically. Never implement percentage rollout with Random.Shared.Next(100) < rolloutPercentage. This evaluates differently on every request for the same user, so they see the new feature on one page load and the old one on the next. The Microsoft.Targeting filter hashes the user ID together with the feature name to place each user in a stable bucket, guaranteeing the same experience across every request, pod, and replica.

🚫 Pitfall 4 — Evaluating Flags in Hot Loops

IsEnabledAsync is fast, but not free. Never call it per-item in a high-volume loop:

// ❌ WRONG — evaluates 10,000 times, same result every time
foreach (var item in items)
    if (await _features.IsEnabledAsync(FeatureFlags.NewPipeline))
        await ProcessNew(item);

// ✅ CORRECT — evaluate once, use the result inside the loop
var useNewPipeline = await _features.IsEnabledAsync(FeatureFlags.NewPipeline);
foreach (var item in items)
    if (useNewPipeline)
        await ProcessNew(item);

🚫 Pitfall 5 — Shipping Without Telemetry

If you cannot answer "what percentage of requests hit the new code path in the last 30 minutes" from a dashboard, your rollout is blind. Every flag evaluation on a business-critical path must emit telemetry. Non-negotiable.

🚫 Pitfall 6 — Environment Parity Failures

If a flag is on in staging but accidentally off in production due to a label misconfiguration, your staging validation means nothing. Add a smoke test post-deploy that asserts flag states in App Configuration match expected values per environment label.
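
The comparison at the heart of that smoke test can be sketched independently of how the flag states are fetched. FlagParityCheck is a hypothetical name; expected would come from a checked-in manifest per environment label, and actual from a read of App Configuration for the same label.

```csharp
using System.Collections.Generic;

public static class FlagParityCheck
{
    // Compares the expected flag states for one environment label against
    // the states actually read from the store. Returns one message per
    // problem; an empty list means the environment matches expectations.
    public static IReadOnlyList<string> FindMismatches(
        IReadOnlyDictionary<string, bool> expected,
        IReadOnlyDictionary<string, bool> actual)
    {
        var problems = new List<string>();

        foreach (var (flag, want) in expected)
        {
            if (!actual.TryGetValue(flag, out var got))
                problems.Add($"{flag}: missing");
            else if (got != want)
                problems.Add($"{flag}: expected {want}, actual {got}");
        }

        return problems;
    }
}
```

Fail the post-deploy stage when the list is non-empty and include the messages in the failure output, so that a label misconfiguration is visible in the pipeline log rather than discovered through user reports.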

🚫 Pitfall 7 — Flag Evaluation in the Data Access Layer

A flag should never influence how data is written to the database from inside your repository or data access layer. The flag decision belongs in the service layer. Coupling your persistence model to a temporary rollout concern creates schema migration nightmares when the flag is eventually deleted.

🚫 Pitfall 8 — Renaming a Live Flag

Renaming a flag that is currently serving a percentage rollout silently resets the consistent hash bucketing for every user. Users who were in the "on" cohort may move to "off" and vice versa, with no warning and no way to detect it from application logs. Never rename a live flag. Archive it and create a new one.

18. Production Readiness Checklist

Before declaring your system production-ready, every item below should be verified:

Bootstrap and Resilience

Safe defaults in appsettings.json — every flag has a safe default embedded; App Configuration is a correction layer, not a hard startup dependency
Startup timeout configured — ConfigureStartupOptions.Timeout set; application does not crash if App Configuration is slow at boot
Readiness probe wired — pod does not receive traffic until App Configuration cache is warm
Background worker warm-up — non-HTTP workers call RefreshAsync() on StartAsync

Flag Evaluation

Managed Identity only — no connection strings in configuration; service principal has Data Reader, not Data Owner
In-process cache active — zero network calls to App Configuration on the hot path; cache TTL tuned per flag category
ITargetingContextAccessor implemented — UserId and Groups populated from JWT claims on every authenticated request
Defensive wrapper on all call sites — try/catch with safe default; no uncaught exception from flag evaluation can crash a request
FeatureFlags.cs constants in place — no magic strings at call sites

Kill Switch and Propagation

Kill-switch propagation tested — verified flag disable reaches all replicas within 30s (polling) or <2s (push)
Service Bus push refresh active — Event Grid → Service Bus → RefreshAsync() pipeline deployed and tested
Cross-service propagation via header — flag evaluated once at the transaction boundary and propagated, never re-evaluated per downstream service

Multi-Region

One App Configuration instance per region — no cross-region polling on the refresh path
Atomic multi-region pipeline — flag changes update all regions in a single pipeline job; partial updates cause alert and halt
Regional drift detection query — KQL monitor alert running hourly

Testing

Unit tests cover both flag branches — FakeFeatureManager used; no Moq of IFeatureManager
Integration tests cover custom filters — TenantFilter (and any custom filters) have targeting logic tested via WebApplicationFactory
Contract tests in CI — FeatureFlagContractTests runs in pipeline; every constant in FeatureFlags.cs must exist in App Configuration or the build fails

Observability and Governance

Structured telemetry on critical paths — FeatureFlag.Evaluated event with Flag, Enabled, UserId, CorrelationId, Environment, Region
KQL dashboards and alert rules live — error rate delta, p99 latency delta, evaluation error rate, regional drift all monitored
Expiry dates on all flags — nightly pipeline alerts on expired flags; cleanup SLA enforced
Audit trail in Cosmos DB — every flag change has an immutable record with changedBy (Entra ID OID), timestamp, and event type
Environment parity smoke test — post-deploy check asserts flag states match expected values per environment label

Closing Thought

Feature flags are not an engineering convenience — they are a risk management instrument. A system that can expose a feature to 1% of users, measure its impact with real production telemetry, and roll it back in five seconds is fundamentally safer than one that cannot.

On the Microsoft stack, the primitives are all there: Azure App Configuration, the Feature Management SDK, Front Door, Service Bus, and Azure Monitor. The architecture described here wires them together into a coherent, production-grade exposure plane.

The hard part is not the technology. It is the discipline: treating flags as first-class engineering artifacts with owners, expiry dates, telemetry requirements, and approval workflows. That discipline is what separates teams that ship fearlessly from teams that ship fearfully.

If this was useful, share it with your team — especially the checklist. The KQL queries in Section 15 are copy-paste ready. The contract test in Section 14 should be in every CI pipeline that uses this pattern.

Found a mistake or have a trade-off to add? Drop a comment — I read them all.
