Error Handling and Retry Patterns: Resilience

1. Introduction

Enterprise flows must treat failure as a routine event: APIs throttle, network calls time out, schemas drift, and concurrent updates collide. Without engineered resilience, incidents produce duplicate records, half‑completed business transactions, and opaque logs. This expanded guide provides a layered resilience blueprint: taxonomy, scope orchestration, retry tuning, circuit breakers, idempotency, compensation, dead‑letter handling, structured logging, telemetry, alerting, testing, performance trade‑offs, KPIs, best practices, and advanced troubleshooting.

Key Objectives:

  1. Classify failures for targeted handling (transient vs permanent vs data vs logic vs concurrency vs security).
  2. Implement structured scopes (Validation, Core, Integrations, Compensation, Error Handler).
  3. Calibrate retry policies (fixed vs exponential vs custom jitter) to avoid resource exhaustion.
  4. Enforce circuit breaker with cooldown for sustained outage containment.
  5. Guarantee idempotency & deduplication across create/update operations.
  6. Apply compensating transactions to reverse partial side effects.
  7. Isolate unrecoverable failures via dead letter design for remediation.
  8. Standardize logging schema for analytics & support.
  9. Instrument telemetry for proactive monitoring & SLA adherence.
  10. Establish multi‑tier alerting & incident workflow.
  11. Validate negative paths with synthetic test harness.
  12. Measure resilience maturity & iterate continuously.

2. Failure Taxonomy & Classification Matrix

Type | Example | Detection Signal | Strategy | Escalation
Transient | 429 throttling / 503 service unavailable | HTTP status + Retry-After | Exponential retry + jitter | Warn if repeated
Permanent | 404 missing record / 400 validation | Status class (4xx non-retryable) | Fail fast + notify | Immediate ticket
Data | Invalid JSON / null mandatory field | Parse failure / empty check | Pre-validate & sanitize | Daily summary
Logic | Misrouted branch / wrong condition | Unexpected state variable | Add guard conditions | Design review
Concurrency | 409 conflict | HTTP 409 code | Limited retry + merge logic | Escalate if > threshold
Security | 401 / 403 | Auth failure status | Terminate & credential rotation | Security alert
Latency | > SLA duration | Duration metric | Adaptive timeout & monitoring | Performance analysis

Decision Flow:

if status in [429, 503] -> transient-retry
elif status in [400, 404] -> permanent-fail
elif status == 409 -> concurrency-retry-limited
elif status in [401, 403] -> security-terminate
elif duration > threshold -> latency-monitor

3. Scope Orchestration & Try/Catch Pattern

Recommended scopes:

  1. Validation (input schema & guard conditions)
  2. CoreBusiness (primary transaction logic)
  3. ExternalIntegrations (API / connector calls)
  4. Compensation (reverse partial side effects)
  5. ErrorHandler (central logging, classification, alert routing)

Configure the ErrorHandler scope's Run After setting so it executes when the preceding scope has failed, timed out, or is skipped. Non‑critical actions (notifications, metrics) may still execute for forensic completeness.

Standardized Error Object Compose:

{
	"runId": "@{workflow().run.name}",
	"flowName": "@{workflow().name}",
	"utcTimestamp": "@{utcNow()}",
	"statusCode": "@{coalesce(actions('CoreBusiness').error.code,'0')}",
	"errorType": "@{coalesce(actions('CoreBusiness').error.type,'unknown')}",
	"errorMessage": "@{coalesce(actions('CoreBusiness').error.message,'n/a')}",
	"retryAttempt": "@{int(coalesce(actions('CoreBusiness').retryCount,0))}"
}

Persist to Dataverse or log analytics; attach correlation header if present.

4. Retry Policies & Backoff Tuning

Policy | Typical Use | Example Config | Pros | Cons
Fixed | Minor latency blips | count=3 interval=PT30S | Predictable timing | Inefficient under heavy throttle
Exponential | Throttling / transient outages | count=4 interval=PT10S type=Exponential | Reduces contention | Longer total duration
Custom Jitter | High-contention endpoints | Manual loop + Delay random(5,20)s | Avoids herd effect | Extra actions & complexity

Exponential sequence (base 10s): 10s, 20s, 40s, 80s. Add ±20% jitter for distributed fairness.
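
For actions whose connectors expose retry settings, the policy can be declared directly on the action. A minimal sketch of an exponential policy roughly as it appears in the action's peek-code view; exact property support varies by connector, so treat the values as illustrative:

"retryPolicy": {
	"type": "exponential",
	"count": 4,
	"interval": "PT10S",
	"minimumInterval": "PT5S",
	"maximumInterval": "PT1H"
}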

Guidelines:

  1. Restrict retries to transient codes (429, 503, occasional 500).
  2. Cap cumulative wait time (<5 minutes) for user‑facing flows.
  3. Capture success‑after‑retry metric to refine intervals.
  4. Consider dynamic backoff using the Retry-After header when provided (see the sketch after this list).
  5. Prefer single retry for non-critical read operations to minimize cost.
  6. For parallel branches, stagger delays using randomization to avoid synchronized bursts.
  7. Log each retry attempt with attempt number, scheduled delay, and outcome for analytics.
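
A minimal sketch of guideline 4 combined with jitter, assuming the preceding call is an HTTP action named Call_API (a placeholder) and that Retry-After carries a delay in seconds:

delaySeconds = @{add(int(coalesce(outputs('Call_API')?['headers']?['Retry-After'], '10')), rand(0, 5))}

Feed delaySeconds into a Delay action (Count = delaySeconds, Unit = Second); if the header arrives as an HTTP date instead of seconds, fall back to the base delay.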

5. Circuit Breaker & Cooldown

Maintain failureCount variable. If failureCount >= threshold (e.g., 3):

  1. Send high priority alert.
  2. Terminate with status CircuitOpen.
  3. Record circuitOpenedAt timestamp.

Cooldown: on the next run, evaluate addMinutes(circuitOpenedAt, 15); if the result is still in the future, short‑circuit early to reduce pressure. Reset the failure counter after the first successful run.
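
A sketch of the short-circuit check as a Condition expression, assuming circuitOpenedAt has been read into a variable from the circuit state row:

@less(ticks(utcNow()), ticks(addMinutes(variables('circuitOpenedAt'), 15)))

If the expression is true, terminate early (e.g., status Cancelled with code CircuitOpen); otherwise let the run proceed and reset the failure counter after it succeeds.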

Benefits: Prevents runaway retries, reduces cost, gives backend recovery window.

Enhancement: Add circuit state telemetry dashboard: openings per week, average open duration, mean time to reset. Target <2 openings per quarter; investigate if exceeded.

6. Idempotency & Deduplication

Risk | Scenario | Mitigation
Duplicate Create | POST retried after timeout | ExternalId / alternate key check
Double Notification | Retry branch resends email | Store notification hash + skip if exists
Parallel Update | Two runs mutate same record | Acquire logical lock flag row

Generate a GUID idempotency key at the start of the run and persist it with the entity. On retry, checking for the key's existence prevents duplicate creation.
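
A minimal sketch of the existence check against Dataverse; the column, table, and action names are illustrative:

List rows (filter): externalid eq '@{variables('idempotencyKey')}'
Condition: @empty(outputs('List_rows')?['body/value'])
	true  → create the record, log outcome = 'Created'
	false → skip the create, log outcome = 'Skipped'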

Advanced Pattern: Use composite alternate key (ExternalId + Region) for multi-tenant uniqueness. Maintain IdempotencyAudit table capturing key, outcome (Created|Skipped|Updated), durationMs, and timestamp to measure dedupe efficiency.

7. Compensation & Partial Rollback

Forward Action | Compensating Action
Create Order | Delete / Cancel Order
Reserve Inventory | Release Inventory reservation
Debit Wallet | Credit Wallet reversal
Send External Notification | Send Cancellation Notice

Maintain an array variable compensationSteps. After each irreversible action succeeds, push a reversal descriptor. On failure, iterate the array in reverse order and log each reversal attempt and outcome.
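
A sketch of the reversal descriptor appended (Append to array variable) to compensationSteps after a successful Create Order step; the action name and field path are illustrative:

{
	"type": "CancelOrder",
	"id": "@{outputs('Create_Order')?['body/orderid']}",
	"loggedAt": "@{utcNow()}"
}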

Append correlation suffix -comp to reversal run IDs for grouping. If a compensation fails, enqueue entry into CompensationDeadLetter table for manual remediation and trend analysis.

8. Dead Letter / Quarantine Design

Permanent failures produce dead letter entry:

  1. Store payload + error metadata + correlation ID.
  2. Scheduled remediation flow processes backlog; resolves or escalates.
  3. Metrics: backlog size, clearance rate, average age.

Prevents infinite retry loops & preserves forensic context for analysis.
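
A sketch of the dead letter entry composed before it is written to the quarantine table; variable and column names are illustrative:

{
	"correlationId": "@{variables('correlationId')}",
	"errorType": "@{variables('errorType')}",
	"statusCode": "@{variables('statusCode')}",
	"payload": "@{string(triggerBody())}",
	"receivedUtc": "@{utcNow()}",
	"status": "Unresolved"
}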

Retention Policy: Keep dead letter items for 60 days; nightly archival flow exports resolved items older than 30 days to cold storage before purge. Track deadLetterBacklogAgeP95 KPI.

9. Structured Logging Contract

Field | Description
flowName | Logical name
runId | Unique run identifier
actionName | Failing action
errorType | Taxonomy classification
statusCode | HTTP / internal code
errorMessage | Sanitized friendly message
retryAttempt | Attempt number
correlationId | External trace linkage
durationMs | Execution duration
timestampUtc | Event time

Push to Dataverse or Log Analytics (HTTP Data Collector). Ensure no secrets in logs (sanitize headers).

Enrichment fields: flowVersion, environmentName, triggerType, and tenantId enable multi-dimensional slicing. Maintain a schema version number so the contract can evolve.

10. Telemetry & Metrics

Metric | Target | Purpose
Retry Success Rate | >80% | Validate policy effectiveness
Circuit Opens | <2 / quarter | Stability indicator
Dead Letter Clearance | >90% / month | Recovery efficacy
Duplicate Prevention Incidents | <1 / month | Idempotency health
Compensation Failure Rate | <5% | Reversal reliability
Mean Time to Repair (MTTR) | <2h critical flows | Operational responsiveness
Dead Letter Backlog Age P95 | <14 days | Timely remediation
Circuit Cooldown Utilization | <30% runs during incident | Minimize disruption

11. Alerting Strategy & Escalation

Level | Trigger | Channel | SLA
Info | Single transient auto‑recovered | Dashboard | N/A
Warning | >3 transient retries | Teams (low priority) | 24h
Critical | Circuit open / security failure | Teams + Email + Pager | Immediate
Major Incident | Sustained outage >30m | War room + ticket | RCA 72h

Daily summary aggregator compiles failures by taxonomy; attaches CSV for operations review.

Weekly resilience review evaluates circuit openings, top transient endpoints, dead letter trends, compensation failure ratio, and proposes threshold adjustments.

12. Negative Path Testing Harness

Test cases:

  1. Force 429 using mock endpoint.
  2. Inject invalid JSON to Parse JSON.
  3. Simulate parallel 409 conflict updates.
  4. Trigger partial success then deliberate failure to engage compensation.
  5. Expire credential to validate security termination.

Record evidence (screenshots, run history export) for audit.

Scheduled synthetic harness flow invokes mock endpoints to force controlled 429, 409, and 500 responses; compares expected classification vs actual to detect regression.
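
A sketch of one harness check, assuming a mock endpoint that returns a requested status code (the URL is hypothetical) and that the flow under test writes its classification to a variable:

HTTP GET https://mock.example.com/fail?status=429
expectedErrorType = 'Transient'
Condition: @equals(variables('expectedErrorType'), variables('actualErrorType'))
	false → log regression, raise Warning alert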

13. Performance & Cost Trade‑Offs

Feature | Overhead | Mitigation
Retry | Longer duration | Limit attempts, retry transient codes only
Logging | Additional writes | Batch & compress fields
Compensation | Extra loop actions | Only for irreversible ops
Dead Letter | Storage growth | Purge resolved entries monthly

Target overhead <15% vs baseline run time.

Establish a performance baseline: duplicate the flow without resilience scopes, execute a controlled batch (e.g., 50 runs), and record average duration and action count. Compute the overhead delta and optimize if it exceeds 15%.
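
The overhead delta itself is a single expression; a minimal sketch assuming the averaged durations were captured into variables (names illustrative):

overheadPct = @{div(mul(sub(float(variables('resilientAvgMs')), float(variables('baselineAvgMs'))), 100), float(variables('baselineAvgMs')))}

Flag the flow for optimization when overheadPct exceeds 15.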

14. KPIs & Maturity Model

Level | Traits | Focus
1 Reactive | Manual failure handling | Introduce scoped error handler
2 Structured | Logging + retries | Add idempotency & classification
3 Proactive | Circuit breaker + dead letters | Compensation logic
4 Optimized | Metric dashboards + alerts | Predictive anomaly detection
5 Autonomous | Adaptive tuning | Self‑healing via ML
6 Predictive | Real-time anomaly prevention | Automated threshold recalibration

15. Best Practices (DO / DON'T)

DO

  1. Branch by error type early.
  2. Employ exponential backoff with jitter.
  3. Implement idempotency keys.
  4. Keep logging schema consistent.
  5. Externalize thresholds (variables/env vars).
  6. Test negative paths regularly.
  7. Use correlation IDs end‑to‑end.
  8. Version compensation logic with flow changes.
  9. Set SLOs for resilience metrics.
  10. Review dead letter backlog weekly.
  11. Keep compensation logic in its own dedicated scope.
  12. Maintain a resilience playbook for onboarding engineers.

DON'T

  1. Retry permanent 4xx errors.
  2. Omit logging on compensation failures.
  3. Hardcode retry counts inline.
  4. Ignore concurrency (409) scenarios.
  5. Allow silent circuit breaks without alert.
  6. Mix unrelated error handling styles across flows.
  7. Store sensitive content in logs.
  8. Assume single retry fixes all transients.
  9. Neglect cooldown after breaker open.
  10. Skip classification documentation.
  11. Assume default retry policy fits all connectors.
  12. Ignore latency spikes without adding instrumentation.

16. Troubleshooting (Enhanced)

Issue | Symptom | Root Cause | Resolution | Prevention
Endless retries | Long durations | Misclassified permanent error | Add classification guard | Taxonomy matrix
Silent failure | No alert | Run After misconfigured | Configure dependencies | Template scope pattern
Duplicate creates | Multiple records | No idempotency check | Add existence query | Alternate key strategy
Circuit never resets | Always open | Missing cooldown logic | Implement timestamp check | Cooldown variable
Dead letter growth | Large table | No purge process | Add scheduled purge flow | Retention policy variable
Missing correlation | Hard tracing | Header not injected | Add correlation generator | Policy injection
Compensation failures hidden | No visibility | No logging branch | Log each reversal attempt | Reversal logging schema
Latency spikes | Slow actions | Missing timeout configuration | Add timeout or optimize API | Monitor duration trend
Stale circuit state | Circuit open too long | Reset condition not triggered | Manual reset + analyze root cause | Circuit test scenario

17. Key Takeaways

Resilience transforms flows from brittle scripts into dependable automation assets. Implement layered scopes, tuned retries, circuit breakers, idempotency, compensation, dead letters, structured logging, telemetry, and alerting early—retroactive fixes cost more and risk production instability.

18. Next Steps

  1. Catalog high-impact flows and classify risk.
  2. Introduce standard error handling scope template.
  3. Implement logging schema & correlation ID.
  4. Tune retry policies on transient-prone actions.
  5. Add dead letter remediation process.
  6. Define quarterly resilience KPIs & review cadence.

19. References & Further Reading

20. Implementation Examples

20.1 Correlation ID Generation

Compose action expression:

@{coalesce(triggerOutputs()?['headers']?['x-correlation-id'], guid())}

If the upstream system did not supply an x-correlation-id header, generate a GUID. Persist this value in every log row and in the header of every external call (a Custom Connector policy can inject it automatically).
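
Where policy injection is not available, a minimal sketch of setting the header manually on an outgoing HTTP action (the Compose action name is a placeholder):

"headers": {
	"x-correlation-id": "@{outputs('Compose_CorrelationId')}"
}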

20.2 Circuit Breaker State Variables

Environment variables (recommended) or solution-level variables:

  • CircuitFailureThreshold = 3
  • CircuitCooldownMinutes = 15
  • CircuitStateTable (Dataverse table logical name)

Retrieve current state:

Get a row by alternate key (flowName) → circuitOpenedAt, consecutiveFailures

Update logic (pseudo):

if(consecutiveFailures >= CircuitFailureThreshold) {
  if(utcNow() < addMinutes(circuitOpenedAt, CircuitCooldownMinutes)) {
    // still inside the cooldown window: short-circuit this run
    status = 'CircuitOpen'; terminate;
  }
  // cooldown elapsed: allow the run; reset consecutiveFailures after the first success
}

20.3 Dataverse Logging Table Schema (ResilienceLog)

Column | Type | Notes
runId | Text | Alternate key candidate
flowName | Text | Index for analytics
actionName | Text | Failing or key action
errorType | Choice | Matches taxonomy enumeration
statusCode | Whole Number | HTTP or internal code
errorMessage | Text (max 4k) | Sanitized
correlationId | Text | Global trace
retryAttempt | Whole Number | 0 if initial
durationMs | Whole Number | Action execution duration
timestampUtc | DateTime | Logged moment
environmentName | Text | For multi-env rollups
flowVersion | Text | Semantic version (e.g., 2.4.1)
triggerType | Choice | Recurrence / HTTP / Dataverse
tenantId | Text | Optional multi-tenant scenarios
compensationApplied | Two Options | Yes / No
circuitState | Choice | Closed / Open / Cooldown
schemaVersion | Whole Number | Logging contract evolution

20.4 Log Analytics HTTP Data Collector Payload

{
	"records": [
		{
			"TimeGenerated": "@{utcNow()}",
			"runId": "@{workflow().run.name}",
			"flowName": "@{workflow().name}",
			"correlationId": "@{variables('correlationId')}",
			"errorType": "@{variables('errorType')}",
			"statusCode": "@{variables('statusCode')}",
			"retryAttempt": "@{variables('retryAttempt')}",
			"durationMs": "@{variables('durationMs')}",
			"circuitState": "@{variables('circuitState')}",
			"environmentName": "@{variables('environmentName')}",
			"flowVersion": "@{variables('flowVersion')}"
		}
	]
}

20.5 Compensation Loop Pseudo Implementation

Initialize Array compensationSteps []
After Create Order → Append { type: 'CancelOrder', id: orderId }
After Reserve Inventory → Append { type: 'ReleaseInventory', sku: sku, qty: qty }
After Debit Wallet → Append { type: 'CreditWallet', walletId: walletId, amount: amt }

On Failure:
  ForEach compensationSteps (reverse order)
    Switch type → invoke corresponding reversal connector
    Log success/failure
    If failure → push to CompensationDeadLetter

20.6 Idempotent Create Pattern

GET record by ExternalId
if(found) { skip create; log outcome='Skipped'; }
else { POST create; log outcome='Created'; }

Include IdempotencyAudit insert with fields: key, outcome, elapsedMs.

21. Dynamic Threshold Automation

Static limits become stale. Implement adaptive thresholds derived from rolling averages:

avgDurationLast24h = float(variables('totalDurationWindow')) / float(variables('countWindow'))
dynamicTimeout = mul(avgDurationLast24h, 2.5)

If currentDuration exceeds dynamicTimeout, classify the run as a latency anomaly and emit a proactive warning before user impact escalates.
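
As a workflow expression, the anomaly check could look like this minimal sketch (variable names illustrative, multiplier fixed at 2.5):

isLatencyAnomaly = @{greater(float(variables('currentDurationMs')), mul(div(float(variables('totalDurationWindow')), float(variables('countWindow'))), 2.5))}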

Transient retry tuning example:

baseDelaySeconds = 10
jitter = rand(0,4)
attemptDelay = pow(2, attemptNumber) * baseDelaySeconds + jitter

Cap attemptDelay at 300 seconds to prevent extreme waits.

22. Advanced Telemetry Queries (Kusto)

Dead Letter Backlog Age P95:

ResilienceLog
| where deadLetter == true
| summarize backlogAgeP95 = percentile(datetime_diff('minute', now(), timestampUtc), 95)

Circuit Opens Trend:

ResilienceLog
| where circuitState == 'Open'
| summarize opens=count() by bin(timestampUtc, 1d)

Retry Success Rate:

ResilienceLog
| where retryAttempt > 0
| summarize successes = countif(errorType == 'None'), total = count() 
| extend rate = successes * 100.0 / total

Compensation Failure Ratio:

ResilienceLog
| where compensationApplied == 'Yes'
| summarize failures = countif(errorType != 'None'), total = count()
| extend failureRate = failures * 100.0 / total

23. Resilience Architecture (Text Diagram)

[Trigger]
  → Validation Scope
    → CoreBusiness Scope
      → ExternalIntegrations Scope (retries, idempotency)
        ↘ ErrorHandler Scope (logging, classification, alerting)
      → Compensation Scope (reverse actions cascade)
  → Circuit State Check → Short-Circuit Terminate (if open)
  → Dead Letter Remediation Flow (scheduled)

24. Governance & Operational Cadence

Cadence | Activity | Owner | Artifact
Daily | Review critical failures & dead letters | Operations | Dashboard snapshot
Weekly | Resilience KPI review & tuning | Engineering Lead | KPI report
Monthly | Pattern adoption audit | Architecture | Adoption matrix
Quarterly | Threshold recalibration & taxonomy updates | Architecture + Ops | Resilience baseline doc
Annual | ML predictive pilot evaluation | Data Science | Experiment summary

25. Pattern Catalog Quick Reference

Pattern | Purpose | Trigger | Guard Rails
Retry (Exponential) | Handle transient throttle | 429/503 | Max total wait <5m
Circuit Breaker | Prevent cascade failures | Repeated outage | Cooldown enforced
Idempotency Key | Avoid duplicate create | Timeout & retry scenario | Unique alternate key
Compensation | Reverse partial side effects | Downstream failure after commits | Log reversal attempt
Dead Letter | Isolate unrecoverable payload | Permanent 4xx / logic error | Purge schedule set
Structured Log | Enable analytics | All failures & key actions | No secrets stored
Dynamic Threshold | Adapt to performance drift | Sustained latency increases | Min/max bounds

26. Sample Environment Variable Set

Name | Example Value | Description
CircuitFailureThreshold | 3 | Failures before open
CircuitCooldownMinutes | 15 | Cooldown duration
MaxRetryAttempts | 4 | Global safety cap
BaseRetryDelaySeconds | 10 | Exponential seed
DeadLetterRetentionDays | 60 | Purge policy
WarningRetryCount | 3 | Alert threshold
MaxLatencyMs | 5000 | Static latency ceiling
LogSchemaVersion | 2 | Evolves contract
CompensationEnabled | true | Toggle reversal behavior
DynamicTimeoutMultiplier | 2.5 | Scales rolling mean

27. ML / Predictive Roadmap

Phase 1: Baseline metrics (already implemented).
Phase 2: Anomaly detection (Kusto query detects deviation >3σ from mean latency).
Phase 3: Predictive throttle forecasting (model uses time-series of 429 counts to pre-emptively reduce concurrency).
Phase 4: Automated policy recalibration (function updates environment variables based on sustained trends).
Phase 5: Self-healing orchestration (flows re-route to alternate connector or cached data source during predicted outages).

28. FAQ

Question | Answer
Why not retry 400 errors? | They signal client/data issues; retrying wastes cost.
Should all flows implement a circuit breaker? | Only high-volume or business-critical flows; simple sporadic flows can skip it.
How many log fields are too many? | Favor essential analytics; keep the schema lean (<20 columns) to control storage.
When to archive dead letters? | After remediation plus an age threshold (e.g., 30 days), move them to cheaper storage.
Difference between idempotency and deduplication? | Idempotency prevents duplicate side effects; deduplication filters identical payloads.
Can compensation replace formal rollback? | No; it is a best-effort reversal for distributed operations lacking atomic transactions.
What if compensation causes new errors? | Log them separately, escalate if a pattern emerges, and refine the reversal logic.
How to test dynamic thresholds? | Replay historical logs into the test harness and verify classification boundaries.