SQL Server High Availability: Always On Availability Groups

SQL Server High Availability: Always On Availability Groups

Introduction

[Explain business continuity requirements; compare Always On AG vs failover clustering vs mirroring.]

Prerequisites

  • SQL Server Enterprise Edition (or Azure SQL Managed Instance)
  • Windows Server Failover Clustering (on-prem) or Azure VMs
  • Shared storage or distributed network name

Always On AG Architecture

Component Purpose HA Benefit
Primary Replica Read/write workload Active node
Secondary Replica(s) Synchronous/async copy Automatic failover target
Availability Group Listener Virtual network name Transparent connection redirect
Witness / Quorum Cluster arbitration Split-brain prevention

Step-by-Step Guide

Step 1: Enable Always On Feature

-- Run on each instance
ALTER SERVER CONFIGURATION SET HADR CLUSTER LOCAL;

Or via PowerShell:

Enable-SqlAlwaysOn -ServerInstance 'SQL01' -Force

Step 2: Create Availability Group

CREATE AVAILABILITY GROUP [MyAG]
FOR DATABASE [MyDatabase]
REPLICA ON
    'SQL01' WITH (
        ENDPOINT_URL = 'TCP://SQL01.contoso.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY)
    ),
    'SQL02' WITH (
        ENDPOINT_URL = 'TCP://SQL02.contoso.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY)
    );

Step 3: Create Listener

ALTER AVAILABILITY GROUP [MyAG]
ADD LISTENER 'MyAGListener' (
    WITH IP (('10.0.0.100', '255.255.255.0')),
    PORT = 1433
);

Step 4: Join Secondary Replicas

-- Run on SQL02
ALTER AVAILABILITY GROUP [MyAG] JOIN;
ALTER DATABASE [MyDatabase] SET HADR AVAILABILITY GROUP = [MyAG];

Step 5: Monitor Health

SELECT
    ag.name AS AvailabilityGroup,
    ar.replica_server_name AS Replica,
    ars.role_desc AS Role,
    ars.operational_state_desc AS State,
    drs.synchronization_state_desc AS SyncState,
    drs.synchronization_health_desc AS Health
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id;

Step 6: Configure Backup Strategy

ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL02'
WITH (BACKUP_PRIORITY = 50);

Step 7: Read-Scale-Out Routing

ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL02'
WITH (PRIMARY_ROLE(READ_ONLY_ROUTING_URL = 'TCP://SQL02.contoso.com:1433'));

ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL01'
WITH (PRIMARY_ROLE(READ_ONLY_ROUTING_LIST = ('SQL02')));

Failover Scenarios

Manual Failover

ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

Automatic Failover (Synchronous Mode)

[Cluster detects primary failure; promotes secondary automatically; listener redirects connections]

Forced Failover (Data Loss Possible)

ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;

Performance Tuning

  • Use synchronous commit for zero data loss within region
  • Use asynchronous commit for distant DR replicas
  • Offload read queries to secondary replicas
  • Monitor redo queue length

Monitoring & Alerting

-- Check redo queue size
SELECT
    database_name,
    redo_queue_size / 1024.0 AS RedoQueueSizeMB
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1 AND role_desc = 'SECONDARY';

Alert on high latency:

SELECT
    ag.name,
    ar.replica_server_name,
    drs.last_commit_time,
    DATEDIFF(SECOND, drs.last_commit_time, GETDATE()) AS SecondsBehind
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0
  AND DATEDIFF(SECOND, drs.last_commit_time, GETDATE()) > 30;

Disaster Recovery Best Practices

  • Place third replica in separate datacenter/region
  • Use asynchronous mode for DR replica to avoid latency impact
  • Test failover regularly (manual exercises)
  • Document runbooks for DR scenarios

Troubleshooting

Issue: Listener not resolving
Solution: Check DNS registration, firewall rules, cluster network configuration

Issue: Secondary not synchronizing
Solution: Verify endpoint connectivity, certificate trust, SQL Server service accounts

Issue: High redo queue
Solution: Increase secondary replica compute resources, reduce primary write load

Best Practices

  • Use synchronous commit for local HA, async for DR
  • Offload reporting to secondary replicas
  • Automate AG health monitoring via SQL Agent or Azure Monitor
  • Regularly test failover procedures

Key Takeaways

  • Always On AG provides automatic failover with zero data loss (sync mode).
  • Read-scale-out improves workload distribution.
  • Listener abstraction simplifies connection management.
  • Monitoring redo queue and sync state ensures health visibility.

Next Steps

  • Implement distributed AG for multi-region failover
  • Integrate with Azure Backup for point-in-time restore

Additional Resources


Is your database ready for zero-downtime deployments?