Data replication is a cornerstone of high availability in distributed database systems. For ClickHouse, the ReplicatedMergeTree family of table engines provides robust replication capabilities that ensure data durability and fault tolerance. However, without proper monitoring, replication issues can silently accumulate, leading to data inconsistencies, service degradation, or even data loss. This comprehensive guide explores everything you need to know about monitoring ClickHouse replication effectively.
Understanding ClickHouse Replication
ClickHouse replication works differently from traditional database replication. Instead of replicating individual write operations, ClickHouse replicates at the data part level. When you insert data into a replicated table, ClickHouse creates a data part locally and then coordinates with other replicas through ZooKeeper (or ClickHouse Keeper) to ensure all replicas eventually have the same data.
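As a minimal sketch, a replicated table is declared like this (the table name and schema are illustrative, and the {shard} and {replica} macros are assumed to be defined in each server's configuration):

```sql
-- Illustrative replicated table. The first argument is the ZooKeeper path
-- shared by all replicas of this table; the second identifies this replica.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (event_time, user_id);
```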
This architecture provides several advantages:
- Efficiency: Replicating entire parts is more efficient than replicating individual rows
- Consistency: All replicas eventually converge to the same state
- Fault tolerance: Any replica can accept writes and propagate them to others
- Flexibility: Replicas can temporarily go offline and catch up later
The Role of ZooKeeper
ZooKeeper (or ClickHouse Keeper) serves as the coordination layer for replication. It maintains metadata about replicated tables, tracks which parts each replica has, and orchestrates the replication process. Understanding this dependency is crucial because ZooKeeper issues directly impact replication health.
Common ZooKeeper-related problems include:
- Connection timeouts: Network issues between ClickHouse and ZooKeeper
- Session expiration: ZooKeeper sessions timing out due to slow responses
- Coordination failures: Split-brain scenarios or quorum loss
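One quick connectivity check is to query the system.zookeeper table, which proxies reads to the coordination service and fails if the session is unavailable (the root path shown is just an example):

```sql
-- Lists child znodes at the root; an error here usually means ClickHouse
-- cannot reach ZooKeeper / ClickHouse Keeper.
SELECT name, ctime, mtime
FROM system.zookeeper
WHERE path = '/';
```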
Key Replication Metrics to Monitor
Effective replication monitoring requires tracking several critical metrics. Each metric provides insight into different aspects of replication health.
Replication Lag
Replication lag measures how far behind a replica is relative to the most up-to-date replica. It's expressed in seconds and indicates the time delay between when data is written on one replica and when it appears on the others. High replication lag can indicate:
- Network bandwidth constraints between replicas
- Disk I/O bottlenecks on the replica
- Heavy insert load overwhelming the replication process
- Large data parts taking longer to transfer
For most applications, replication lag under 60 seconds is acceptable. Critical systems should aim for single-digit seconds or less.
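A simple way to surface lagging replicas is to read the absolute_delay column of system.replicas; a sketch using the 60-second guideline above:

```sql
-- Replicas more than 60 seconds behind, worst first.
SELECT database, table, absolute_delay
FROM system.replicas
WHERE absolute_delay > 60
ORDER BY absolute_delay DESC;
```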
Queue Size and Composition
The replication queue contains operations waiting to be applied to a replica. Monitoring queue size and its composition (inserts vs. merges) helps identify bottlenecks:
- Large insert queue: Indicates heavy write load or slow network transfers
- Large merge queue: Suggests merge operations are falling behind, possibly due to disk I/O limitations
- Stuck queue: Items remaining in queue for extended periods may indicate failed operations
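The system.replication_queue table exposes the individual queue entries; a sketch that surfaces long-lived or repeatedly failing entries:

```sql
-- Oldest queue entries first; many tries plus a non-empty last_exception
-- usually indicates a stuck operation.
SELECT database, table, type, create_time, num_tries, last_exception
FROM system.replication_queue
ORDER BY create_time ASC
LIMIT 20;
```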
Active Replicas
The count of active replicas versus total replicas is a fundamental health indicator. An active-replica count lower than expected signals potential issues:
- Server hardware failures or maintenance
- Network partitioning between replicas
- ZooKeeper connectivity problems
- ClickHouse process crashes or restarts
Lost Parts
Lost parts represent data that should exist but cannot be found on any replica. This is one of the most critical metrics because it indicates potential data loss. Lost parts can occur due to:
- Disk corruption or failures
- Improper node decommissioning
- Severe replication failures during cluster incidents
- Manual file system operations that bypassed ClickHouse
Read-Only Mode
When a replicated table enters read-only mode, it can no longer accept writes. This typically happens when:
- ZooKeeper connection is lost or unstable
- Replica falls too far behind and needs manual intervention
- Disk space exhaustion prevents new writes
- Internal consistency checks fail
Read-only mode is a critical alert condition that requires immediate attention.
Comprehensive Metrics Reference
Here's a complete reference of replication metrics you should monitor:
| Metric | Description | Warning Threshold |
|---|---|---|
| Absolute Delay | Time behind the most up-to-date replica, in seconds | > 60 seconds |
| Queue Size | Pending operations in replication queue | > 100 items |
| Inserts in Queue | Pending INSERT operations | > 50 items |
| Merges in Queue | Pending MERGE operations | > 50 items |
| Active Replicas | Number of replicas currently active | < expected count |
| Lost Parts | Data parts that cannot be found | > 0 |
| Read-Only | Whether table is in read-only mode | true |
| Last Queue Update | Time since queue was last processed | > 5 minutes |
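All of these metrics map to columns of the system.replicas table (lost parts are exposed as a lost_part_count column only on recent ClickHouse versions). A sketch of a per-table dump:

```sql
SELECT
    database,
    table,
    absolute_delay,      -- lag in seconds
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    active_replicas,
    total_replicas,
    is_readonly
FROM system.replicas
ORDER BY absolute_delay DESC;
```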
Common Replication Issues and Solutions
Replica Lag Growing Continuously
When replica lag keeps increasing without recovery, the write rate may be exceeding the replication throughput. Solutions include:
- Tune replication fetch settings: Adjust max_replicated_fetches_network_bandwidth and related settings
- Improve network bandwidth: Upgrade network infrastructure between replicas
- Optimize insert patterns: Batch inserts to reduce the number of parts created
- Scale horizontally: Add more shards to distribute the load
Stuck Replicas
A stuck replica shows no queue processing activity despite having items in the queue. This often indicates ZooKeeper issues:
- Check ZooKeeper health: Verify ZooKeeper cluster is functioning properly
- Review network connectivity: Ensure stable connections between ClickHouse and ZooKeeper
- Restart if necessary: Sometimes a ClickHouse restart resolves stuck sessions
- Use SYSTEM RESTART REPLICA: Force a replica restart without full server restart
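For the last option, the statement reinitializes one table's replication state without touching the rest of the server (the table name is a placeholder):

```sql
-- Reinitialize ZooKeeper session state and the replication queue for
-- a single table, without restarting the ClickHouse server.
SYSTEM RESTART REPLICA db_name.table_name;
```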
Replica Inconsistency
When replicas have different data, you need to identify and resolve the inconsistency:
- Use SYSTEM SYNC REPLICA: Wait for the replica to fetch and apply everything in its replication queue
- Use SYSTEM RESTORE REPLICA: Re-register a replica whose ZooKeeper metadata was lost while its local data is still present
- Verify checksums: Compare data checksums across replicas to identify discrepancies
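Sketches of the first two statements (table names are placeholders):

```sql
-- Block until this replica has fetched and applied its pending queue.
SYSTEM SYNC REPLICA db_name.table_name;

-- Re-register a replica whose ZooKeeper metadata was lost while local
-- data remains; the table must currently be in read-only mode.
SYSTEM RESTORE REPLICA db_name.table_name;
```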
Too Many Parts Warning
This error occurs when a table accumulates too many parts, often blocking further inserts. The root cause is usually:
- Too frequent small inserts creating many tiny parts
- Merge operations unable to keep up with insert rate
- Insufficient background processing resources
Solutions include adjusting merge settings, batching inserts, and running OPTIMIZE TABLE during maintenance windows.
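To see where parts are accumulating, and then force merges in a specific partition during a maintenance window (table and partition values are illustrative and assume a toYYYYMM partition key):

```sql
-- Active part counts per partition, heaviest first.
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;

-- Force merges for one heavy partition during a maintenance window.
OPTIMIZE TABLE events PARTITION 202401;
```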
Setting Up Replication Alerts
Proactive alerting is essential for catching replication issues before they impact your applications. Here are recommended alert configurations:
Critical Alerts (Immediate Response Required)
- Replication Unhealthy: Any table transitions to an unhealthy state
- Lost Parts > 0: Potential data loss detected
- Read-Only Mode: Table cannot accept writes
- ZooKeeper Errors: Coordination layer problems
Warning Alerts (Investigation Needed)
- Replication Lag > 300s: Replica falling significantly behind
- Queue Size > 100: Pending operations accumulating
- Active Replicas < Expected: Missing replicas
Informational Alerts (Awareness)
- Replication Lag > 60s: Minor lag that may resolve naturally
- Table Not Exists: Replicated table disappeared (intentional or accidental)
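These tiers can be expressed as a single polling query; a sketch using the thresholds above (lost_part_count assumes a recent ClickHouse version, and the alias-in-WHERE shorthand relies on ClickHouse's support for reusing expression aliases):

```sql
SELECT
    database,
    table,
    multiIf(
        is_readonly OR lost_part_count > 0,       'critical',
        absolute_delay > 300 OR queue_size > 100, 'warning',
        absolute_delay > 60,                      'info',
        'ok') AS severity
FROM system.replicas
WHERE severity != 'ok';
```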
Best Practices for Replication Management
Infrastructure Considerations
- Dedicated networks: Use separate network interfaces for replication traffic when possible
- Consistent hardware: Ensure all replicas have similar performance characteristics
- Geographic distribution: Place replicas in different availability zones for disaster recovery
- ZooKeeper placement: Deploy ZooKeeper separately from ClickHouse nodes, with an odd number of nodes to maintain quorum
Operational Practices
- Regular health checks: Monitor replication status as part of daily operations
- Maintenance windows: Schedule heavy operations during low-traffic periods
- Graceful node removal: Always properly detach replicas before decommissioning
- Documentation: Maintain runbooks for common replication issues
Insert Optimization
- Batch inserts: Group many rows into single insert operations
- Use Buffer tables: Buffer small writes and flush them periodically (see the sketch after this list)
- Avoid too many partitions: Design partition keys to balance data distribution
- Monitor part creation: Track how many parts each insert creates
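A sketch of the Buffer-table approach mentioned above (thresholds are illustrative, and events is assumed to be the target replicated table):

```sql
-- Writes land in events_buffer and are flushed to events once any max
-- threshold is reached, or once all min thresholds are met. Signature:
-- Buffer(database, table, num_layers,
--        min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
CREATE TABLE events_buffer AS events
ENGINE = Buffer(currentDatabase(), events, 16,
                10, 100, 10000, 1000000, 10000000, 100000000);
```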
Monitoring at Scale
As your ClickHouse deployment grows, monitoring becomes more complex. Consider these strategies for large-scale environments:
Aggregate Monitoring
Instead of monitoring every individual table, create aggregate views that surface the most critical issues:
- Total unhealthy tables across all databases
- Maximum replication lag across the cluster
- Sum of lost parts across all tables
- Tables with highest queue sizes
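A sketch of such a rollup (this covers the local server; wrapping the table in clusterAllReplicas('<cluster>', system.replicas) extends it to every node):

```sql
-- One summary row per server; feed these values into dashboards.
SELECT
    countIf(is_readonly)                      AS readonly_tables,
    max(absolute_delay)                       AS max_delay_seconds,
    sum(queue_size)                           AS total_queue_size,
    countIf(active_replicas < total_replicas) AS tables_missing_replicas
FROM system.replicas;
```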
Automated Response
For common, well-understood issues, implement automated remediation:
- Auto-restart stuck replicas after threshold duration
- Automatic OPTIMIZE operations during off-peak hours
- Alert escalation based on duration and severity
- Capacity alerts when replication consistently struggles
Historical Analysis
Track replication metrics over time to identify trends and patterns:
- Correlate lag spikes with application events or deployments
- Identify time-of-day patterns in replication performance
- Track improvement or degradation after configuration changes
- Plan capacity based on historical growth trends
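One lightweight way to keep this history is to snapshot system.replicas into a plain MergeTree table on a schedule (the table name and retained columns are illustrative):

```sql
CREATE TABLE IF NOT EXISTS replication_history
(
    ts             DateTime DEFAULT now(),
    database       String,
    table          String,
    absolute_delay UInt64,
    queue_size     UInt32
)
ENGINE = MergeTree
ORDER BY (database, table, ts);

-- Run periodically (e.g., from cron) to append a snapshot.
INSERT INTO replication_history (database, table, absolute_delay, queue_size)
SELECT database, table, absolute_delay, queue_size
FROM system.replicas;
```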
Required Permissions for Monitoring
To monitor replication status, your monitoring user needs SELECT permission on the system.replicas table. This table contains comprehensive information about all replicated tables including:
- Database and table names
- Replica identification (name, path)
- Health status and operational mode
- Queue information and statistics
- Lag metrics and timing information
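A sketch of a minimal monitoring user (the user name and password are placeholders):

```sql
-- Hypothetical read-only monitoring account.
CREATE USER IF NOT EXISTS replication_monitor
    IDENTIFIED WITH sha256_password BY 'change-me';

GRANT SELECT ON system.replicas TO replication_monitor;
```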
Conclusion
ClickHouse replication provides robust data redundancy and high availability, but it requires careful monitoring to ensure optimal operation. By tracking key metrics like replication lag, queue sizes, active replica counts, and lost parts, you can detect issues early and maintain a healthy replicated cluster.
Remember that replication monitoring is not a one-time setup but an ongoing practice. Regular review of metrics, alert thresholds, and operational procedures ensures your monitoring remains effective as your deployment evolves. Invest in comprehensive monitoring today to prevent costly outages and data issues tomorrow.
The key to successful replication management is visibility. With proper monitoring in place, you can confidently operate replicated ClickHouse clusters knowing that issues will be detected and addressed promptly, keeping your data safe and your applications running smoothly.
Ready to implement comprehensive ClickHouse replication monitoring? Try UptimeDock's ClickHouse monitoring solution for automatic replication detection, real-time health tracking, and intelligent alerts. Start your free trial today and ensure your replicated tables stay healthy and synchronized.