SQL Server High Availability: Always On Availability Groups
Introduction
[Explain business continuity requirements; compare Always On AG vs failover clustering vs mirroring.]
Prerequisites
- SQL Server Enterprise Edition (or Azure SQL Managed Instance)
- Windows Server Failover Clustering (on-prem) or Azure VMs
- Shared storage or distributed network name
Always On AG Architecture
| Component | Purpose | HA Benefit |
|---|---|---|
| Primary Replica | Read/write workload | Active node |
| Secondary Replica(s) | Synchronous/async copy | Automatic failover target |
| Availability Group Listener | Virtual network name | Transparent connection redirect |
| Witness / Quorum | Cluster arbitration | Split-brain prevention |
Step-by-Step Guide
Step 1: Enable Always On Feature
-- Run on each instance
ALTER SERVER CONFIGURATION SET HADR CLUSTER LOCAL;
Or via PowerShell:
Enable-SqlAlwaysOn -ServerInstance 'SQL01' -Force
Step 2: Create Availability Group
CREATE AVAILABILITY GROUP [MyAG]
FOR DATABASE [MyDatabase]
REPLICA ON
'SQL01' WITH (
ENDPOINT_URL = 'TCP://SQL01.contoso.com:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY)
),
'SQL02' WITH (
ENDPOINT_URL = 'TCP://SQL02.contoso.com:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY)
);
Step 3: Create Listener
ALTER AVAILABILITY GROUP [MyAG]
ADD LISTENER 'MyAGListener' (
WITH IP (('10.0.0.100', '255.255.255.0')),
PORT = 1433
);
Step 4: Join Secondary Replicas
-- Run on SQL02
ALTER AVAILABILITY GROUP [MyAG] JOIN;
ALTER DATABASE [MyDatabase] SET HADR AVAILABILITY GROUP = [MyAG];
Step 5: Monitor Health
SELECT
ag.name AS AvailabilityGroup,
ar.replica_server_name AS Replica,
ars.role_desc AS Role,
ars.operational_state_desc AS State,
drs.synchronization_state_desc AS SyncState,
drs.synchronization_health_desc AS Health
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id;
Step 6: Configure Backup Strategy
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL02'
WITH (BACKUP_PRIORITY = 50);
Step 7: Read-Scale-Out Routing
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL02'
WITH (PRIMARY_ROLE(READ_ONLY_ROUTING_URL = 'TCP://SQL02.contoso.com:1433'));
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL01'
WITH (PRIMARY_ROLE(READ_ONLY_ROUTING_LIST = ('SQL02')));
Failover Scenarios
Manual Failover
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;
Automatic Failover (Synchronous Mode)
[Cluster detects primary failure; promotes secondary automatically; listener redirects connections]
Forced Failover (Data Loss Possible)
ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
Performance Tuning
- Use synchronous commit for zero data loss within region
- Use asynchronous commit for distant DR replicas
- Offload read queries to secondary replicas
- Monitor redo queue length
Monitoring & Alerting
-- Check redo queue size
SELECT
database_name,
redo_queue_size / 1024.0 AS RedoQueueSizeMB
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1 AND role_desc = 'SECONDARY';
Alert on high latency:
SELECT
ag.name,
ar.replica_server_name,
drs.last_commit_time,
DATEDIFF(SECOND, drs.last_commit_time, GETDATE()) AS SecondsBehind
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0
AND DATEDIFF(SECOND, drs.last_commit_time, GETDATE()) > 30;
Disaster Recovery Best Practices
- Place third replica in separate datacenter/region
- Use asynchronous mode for DR replica to avoid latency impact
- Test failover regularly (manual exercises)
- Document runbooks for DR scenarios
Troubleshooting
Issue: Listener not resolving
Solution: Check DNS registration, firewall rules, cluster network configuration
Issue: Secondary not synchronizing
Solution: Verify endpoint connectivity, certificate trust, SQL Server service accounts
Issue: High redo queue
Solution: Increase secondary replica compute resources, reduce primary write load
Best Practices
- Use synchronous commit for local HA, async for DR
- Offload reporting to secondary replicas
- Automate AG health monitoring via SQL Agent or Azure Monitor
- Regularly test failover procedures
Key Takeaways
- Always On AG provides automatic failover with zero data loss (sync mode).
- Read-scale-out improves workload distribution.
- Listener abstraction simplifies connection management.
- Monitoring redo queue and sync state ensures health visibility.
Next Steps
- Implement distributed AG for multi-region failover
- Integrate with Azure Backup for point-in-time restore
Additional Resources
Is your database ready for zero-downtime deployments?