
ClickHouse Replication Monitoring: Complete Guide to High Availability

By Achie Barret  - January 16, 2026

Data replication is a cornerstone of high availability in distributed database systems. For ClickHouse, the ReplicatedMergeTree family of table engines provides robust replication capabilities that ensure data durability and fault tolerance. However, without proper monitoring, replication issues can silently accumulate, leading to data inconsistencies, service degradation, or even data loss. This comprehensive guide explores everything you need to know about monitoring ClickHouse replication effectively.

Understanding ClickHouse Replication

ClickHouse replication works differently from traditional database replication. Instead of replicating individual write operations, ClickHouse replicates at the data part level. When you insert data into a replicated table, ClickHouse creates a data part locally and then coordinates with other replicas through ZooKeeper (or ClickHouse Keeper) to ensure all replicas eventually have the same data.
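
As a point of reference, a replicated table is declared by choosing a ReplicatedMergeTree engine and giving it a coordination path and a replica name. The table, cluster, and column names below are placeholders; a minimal sketch might look like this:

    CREATE TABLE events ON CLUSTER my_cluster
    (
        event_time DateTime,
        user_id    UInt64,
        payload    String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    ORDER BY (event_time, user_id);

Each replica registers itself under the given coordination path, and the {shard} and {replica} macros are expanded from each server's configuration.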

This architecture provides several advantages:

  • Efficiency: Replicating entire parts is more efficient than replicating individual rows
  • Consistency: All replicas eventually converge to the same state
  • Fault tolerance: Any replica can accept writes and propagate them to others
  • Flexibility: Replicas can temporarily go offline and catch up later

The Role of ZooKeeper

ZooKeeper (or ClickHouse Keeper) serves as the coordination layer for replication. It maintains metadata about replicated tables, tracks which parts each replica has, and orchestrates the replication process. Understanding this dependency is crucial because ZooKeeper issues directly impact replication health.

Common ZooKeeper-related problems include:

  • Connection timeouts: Network issues between ClickHouse and ZooKeeper
  • Session expiration: ZooKeeper sessions timing out due to slow responses
  • Coordination failures: Split-brain scenarios or quorum loss
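
Two quick checks against the system tables can confirm whether ClickHouse can reach its coordination service; these are sketches and assume a reasonably recent ClickHouse version:

    -- Fails or hangs if the coordination service is unreachable
    SELECT name FROM system.zookeeper WHERE path = '/';

    -- Replicated tables whose coordination session has expired
    SELECT database, table
    FROM system.replicas
    WHERE is_session_expired;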

Key Replication Metrics to Monitor

Effective replication monitoring requires tracking several critical metrics. Each metric provides insight into different aspects of replication health.

Replication Lag

Replication lag measures how far a replica has fallen behind the most up-to-date replica. It's expressed in seconds and indicates the delay between when data is written on one replica and when it appears on the others. High replication lag can indicate:

  • Network bandwidth constraints between replicas
  • Disk I/O bottlenecks on the replica
  • Heavy insert load overwhelming the replication process
  • Large data parts taking longer to transfer

For most applications, replication lag under 60 seconds is acceptable. Critical systems should aim for single-digit seconds or less.
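
The absolute_delay column of system.replicas exposes this lag directly; a query like the following surfaces replicas that have drifted past the 60-second guideline:

    SELECT database, table, absolute_delay
    FROM system.replicas
    WHERE absolute_delay > 60
    ORDER BY absolute_delay DESC;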

Queue Size and Composition

The replication queue contains operations waiting to be applied to a replica. Monitoring queue size and its composition (inserts vs. merges) helps identify bottlenecks:

  • Large insert queue: Indicates heavy write load or slow network transfers
  • Large merge queue: Suggests merge operations are falling behind, possibly due to disk I/O limitations
  • Stuck queue: Items remaining in queue for extended periods may indicate failed operations
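
The system.replication_queue table breaks the queue down by entry type, which makes the insert-versus-merge split and repeatedly retried entries easy to see; for example:

    SELECT database, table, type,
           count()        AS pending,
           max(num_tries) AS max_tries
    FROM system.replication_queue
    GROUP BY database, table, type
    ORDER BY pending DESC;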

Active Replicas

The count of active replicas versus total replicas is a fundamental health indicator. Fewer active replicas than expected signals potential issues:

  • Server hardware failures or maintenance
  • Network partitioning between replicas
  • ZooKeeper connectivity problems
  • ClickHouse process crashes or restarts
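
system.replicas reports both counts, so missing replicas can be spotted with a simple comparison:

    SELECT database, table, active_replicas, total_replicas
    FROM system.replicas
    WHERE active_replicas < total_replicas;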

Lost Parts

Lost parts represent data that should exist but cannot be found on any replica. This is one of the most critical metrics because it indicates potential data loss. Lost parts can occur due to:

  • Disk corruption or failures
  • Improper node decommissioning
  • Severe replication failures during cluster incidents
  • Manual file system operations that bypassed ClickHouse
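
Recent ClickHouse releases expose a lost_part_count column in system.replicas (older versions may not have it), so a check can be as simple as:

    SELECT database, table, lost_part_count
    FROM system.replicas
    WHERE lost_part_count > 0;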

Read-Only Mode

When a replicated table enters read-only mode, it can no longer accept writes. This typically happens when:

  • ZooKeeper connection is lost or unstable
  • Replica falls too far behind and needs manual intervention
  • Disk space exhaustion prevents new writes
  • Internal consistency checks fail

Read-only mode is a critical alert condition that requires immediate attention.
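
Read-only tables are flagged in system.replicas, and checking the session state alongside usually points at the cause:

    SELECT database, table, is_readonly, is_session_expired
    FROM system.replicas
    WHERE is_readonly;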

Comprehensive Metrics Reference

Here's a complete reference of replication metrics you should monitor:

Metric            | Description                                    | Warning Threshold
Absolute Delay    | Seconds the replica is behind                  | > 60 seconds
Queue Size        | Pending operations in the replication queue    | > 100 items
Inserts in Queue  | Pending INSERT operations                      | > 50 items
Merges in Queue   | Pending merge operations                       | > 50 items
Active Replicas   | Number of replicas currently active            | < expected count
Lost Parts        | Data parts that cannot be found on any replica | > 0
Read-Only         | Whether the table is in read-only mode         | true
Last Queue Update | Time since the queue was last processed        | > 5 minutes
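
All of the metrics above (except lost parts, whose column only exists in newer releases) can be pulled in a single pass over system.replicas; a monitoring agent could poll something like:

    SELECT
        database,
        table,
        absolute_delay,      -- seconds behind
        queue_size,
        inserts_in_queue,
        merges_in_queue,
        active_replicas,
        total_replicas,
        is_readonly,
        last_queue_update
    FROM system.replicas;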

Common Replication Issues and Solutions

Replica Lag Growing Continuously

When replica lag keeps increasing without recovery, the write rate may be exceeding the replication throughput. Solutions include:

  • Increase fetch throughput: Raise the background fetch pool size and make sure bandwidth limits such as max_replicated_fetches_network_bandwidth are not throttling transfers (see the queries after this list)
  • Improve network bandwidth: Upgrade network infrastructure between replicas
  • Optimize insert patterns: Batch inserts to reduce the number of parts created
  • Scale horizontally: Add more shards to distribute the load
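
Before changing anything, it helps to confirm what the current limits are and whether fetches are actually in flight; these queries are a sketch and assume a recent ClickHouse version:

    -- Current fetch-related MergeTree settings (a value of 0 usually means unlimited)
    SELECT name, value, changed
    FROM system.merge_tree_settings
    WHERE name LIKE '%fetch%';

    -- Fetches currently running, with their progress
    SELECT database, table, progress, total_size_bytes_compressed
    FROM system.replicated_fetches;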

Stuck Replicas

A stuck replica shows no queue processing activity despite having items in the queue. This often indicates ZooKeeper issues:

  • Check ZooKeeper health: Verify ZooKeeper cluster is functioning properly
  • Review network connectivity: Ensure stable connections between ClickHouse and ZooKeeper
  • Restart if necessary: Sometimes a ClickHouse restart resolves stuck sessions
  • Use SYSTEM RESTART REPLICA: Force a replica restart without full server restart
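
Concretely, the failing-entry check and the per-table restart from the last two points look roughly like this (db.table is a placeholder):

    -- Queue entries that keep retrying, with the last error they hit
    SELECT database, table, node_name, num_tries, last_exception
    FROM system.replication_queue
    WHERE num_tries > 10;

    -- Re-initialize replication state for a single table without restarting the server
    SYSTEM RESTART REPLICA db.table;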

Replica Inconsistency

When replicas have different data, you need to identify and resolve the inconsistency:

  • Use SYSTEM SYNC REPLICA: Force synchronization with other replicas
  • Use SYSTEM RESTORE REPLICA: Recreate lost coordination metadata from the data still present on disk (the table must be in read-only mode)
  • Verify checksums: Compare data checksums across replicas to identify discrepancies
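
Both SYSTEM statements are issued per table; db.table is a placeholder here:

    -- Wait until this replica has caught up with the shared replication log
    SYSTEM SYNC REPLICA db.table;

    -- Recreate lost coordination metadata from the parts still on disk
    -- (applies when the table is stuck in read-only mode)
    SYSTEM RESTORE REPLICA db.table;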

Too Many Parts Warning

This error occurs when a table accumulates too many parts, often blocking further inserts. The root cause is usually:

  • Too frequent small inserts creating many tiny parts
  • Merge operations unable to keep up with insert rate
  • Insufficient background processing resources

Solutions include adjusting merge settings, batching inserts, and running OPTIMIZE TABLE during maintenance windows.
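
A part count per partition shows where the problem is concentrated, and an explicit merge can then be scheduled for a maintenance window (db.table is a placeholder):

    -- Partitions with the most active parts
    SELECT database, table, partition, count() AS active_parts
    FROM system.parts
    WHERE active
    GROUP BY database, table, partition
    ORDER BY active_parts DESC
    LIMIT 10;

    -- Force merges for one table; FINAL is heavy, so prefer off-peak hours
    OPTIMIZE TABLE db.table FINAL;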

Setting Up Replication Alerts

Proactive alerting is essential for catching replication issues before they impact your applications. Here are recommended alert configurations:

Critical Alerts (Immediate Response Required)

  • Replication Unhealthy: Any table transitions to unhealthy state
  • Lost Parts > 0: Potential data loss detected
  • Read-Only Mode: Table cannot accept writes
  • ZooKeeper Errors: Coordination layer problems

Warning Alerts (Investigation Needed)

  • Replication Lag > 300s: Replica falling significantly behind
  • Queue Size > 100: Pending operations accumulating
  • Active Replicas < Expected: Missing replicas

Informational Alerts (Awareness)

  • Replication Lag > 60s: Minor lag that may resolve naturally
  • Table Not Exists: Replicated table disappeared (intentional or accidental)
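
One way to wire these thresholds up is a single classification query that an external alerting system polls. The tiers below mirror the thresholds above and are only a sketch; a lost-parts check would additionally need the lost_part_count column from newer releases:

    SELECT
        database,
        table,
        multiIf(
            is_readonly,          'critical: read-only',
            absolute_delay > 300, 'warning: lag above 300s',
            queue_size > 100,     'warning: queue above 100',
            absolute_delay > 60,  'info: lag above 60s',
            'ok') AS alert_level
    FROM system.replicas
    WHERE alert_level != 'ok';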

Best Practices for Replication Management

Infrastructure Considerations

  • Dedicated networks: Use separate network interfaces for replication traffic when possible
  • Consistent hardware: Ensure all replicas have similar performance characteristics
  • Geographic distribution: Place replicas in different availability zones for disaster recovery
  • ZooKeeper placement: Deploy ZooKeeper (or ClickHouse Keeper) on nodes separate from ClickHouse, with an odd number of members so the ensemble can maintain quorum

Operational Practices

  • Regular health checks: Monitor replication status as part of daily operations
  • Maintenance windows: Schedule heavy operations during low-traffic periods
  • Graceful node removal: Always properly detach replicas before decommissioning
  • Documentation: Maintain runbooks for common replication issues

Insert Optimization

  • Batch inserts: Group many rows into single insert operations
  • Use Buffer tables: Buffer small writes and flush periodically
  • Avoid too many partitions: Design partition keys to balance data distribution
  • Monitor part creation: Track how many parts each insert creates
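
A Buffer table in front of the replicated target is one way to absorb many small writes. The layer counts and flush thresholds below are illustrative and assume an events table like the one sketched earlier:

    -- Writes go to events_buffer; ClickHouse flushes to events in larger blocks
    -- Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
    CREATE TABLE events_buffer AS events
    ENGINE = Buffer(currentDatabase(), events, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

    INSERT INTO events_buffer VALUES (now(), 42, 'example payload');

Note that data sitting in the buffer is lost on an unclean shutdown, so this trade-off suits telemetry-style workloads better than critical records.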

Monitoring at Scale

As your ClickHouse deployment grows, monitoring becomes more complex. Consider these strategies for large-scale environments:

Aggregate Monitoring

Instead of monitoring every individual table, create aggregate views that surface the most critical issues:

  • Total unhealthy tables across all databases
  • Maximum replication lag across the cluster
  • Sum of lost parts across all tables
  • Tables with highest queue sizes
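
A cluster-wide rollup over system.replicas gives a one-row health summary that is easy to plot or alert on; for example:

    SELECT
        countIf(is_readonly)                      AS readonly_tables,
        max(absolute_delay)                       AS max_lag_seconds,
        sum(queue_size)                           AS total_queue_size,
        countIf(active_replicas < total_replicas) AS tables_missing_replicas
    FROM system.replicas;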

Automated Response

For common, well-understood issues, implement automated remediation:

  • Auto-restart stuck replicas after threshold duration
  • Automatic OPTIMIZE operations during off-peak hours
  • Alert escalation based on duration and severity
  • Capacity alerts when replication consistently struggles

Historical Analysis

Track replication metrics over time to identify trends and patterns:

  • Correlate lag spikes with application events or deployments
  • Identify time-of-day patterns in replication performance
  • Track improvement or degradation after configuration changes
  • Plan capacity based on historical growth trends

Required Permissions for Monitoring

To monitor replication status, your monitoring user needs SELECT permission on the system.replicas table. This table contains comprehensive information about all replicated tables including:

  • Database and table names
  • Replica identification (name, path)
  • Health status and operational mode
  • Queue information and statistics
  • Lag metrics and timing information
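
With role-based access control enabled, granting that access is a single statement ('monitoring' is a placeholder user or role):

    GRANT SELECT ON system.replicas TO monitoring;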

Conclusion

ClickHouse replication provides robust data redundancy and high availability, but it requires careful monitoring to ensure optimal operation. By tracking key metrics like replication lag, queue sizes, active replica counts, and lost parts, you can detect issues early and maintain a healthy replicated cluster.

Remember that replication monitoring is not a one-time setup but an ongoing practice. Regular review of metrics, alert thresholds, and operational procedures ensures your monitoring remains effective as your deployment evolves. Invest in comprehensive monitoring today to prevent costly outages and data issues tomorrow.

The key to successful replication management is visibility. With proper monitoring in place, you can confidently operate replicated ClickHouse clusters knowing that issues will be detected and addressed promptly, keeping your data safe and your applications running smoothly.

Ready to implement comprehensive ClickHouse replication monitoring? Try UptimeDock's ClickHouse monitoring solution for automatic replication detection, real-time health tracking, and intelligent alerts. Start your free trial today and ensure your replicated tables stay healthy and synchronized.