How to Monitor the Replication Status in PostgreSQL and Identify Lag

Monitoring replication status and identifying lag issues in PostgreSQL is essential for ensuring that the standby (replica) is synchronized with the primary server, especially in critical systems that rely on high availability and disaster recovery. PostgreSQL provides several tools and system views to help monitor replication health, detect replication lag, and troubleshoot performance issues.

Key Methods for Monitoring Replication and Identifying Lag

1. pg_stat_replication View

The pg_stat_replication system view, queried on the primary, is one of the most commonly used tools for monitoring replication status. It shows one row for each connected standby (one per WAL sender process), including detailed progress and lag information.

Key Columns in pg_stat_replication
  • pid: Process ID of the WAL sender process.
  • usename: The role name of the replication connection.
  • application_name: Name of the standby.
  • client_addr: IP address of the standby.
  • state: The current state of the WAL sender. Possible values include:
      • startup: The WAL sender is starting up.
      • catchup: The connected standby is catching up with the primary.
      • streaming: WAL is being streamed to a standby that has caught up.
      • backup: The WAL sender is sending a base backup.
      • stopping: The WAL sender is shutting down.
  • write_lsn: The log sequence number (LSN) up to which the standby has written the WAL.
  • flush_lsn: The LSN up to which the standby has flushed the WAL to disk.
  • replay_lsn: The LSN up to which the standby has applied (replayed) the WAL.
  • write_lag/flush_lag/replay_lag: Shows replication lag metrics, measuring how long the standby takes to write, flush, and replay the WAL records.
Sample Query to Check Replication Status
SELECT
  pid,
  usename,
  application_name,
  client_addr,
  state,
  write_lsn,
  flush_lsn,
  replay_lsn,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;
  • write_lag: Time difference between when the primary server writes the WAL and when the standby writes it.
  • flush_lag: Time difference between when the primary writes the WAL and when the standby flushes it to disk.
  • replay_lag: Time difference between when the primary writes the WAL and when the standby replays it.
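
The same view also makes a byte-based measurement easy: compare the primary's current WAL position with each standby's replayed position. A minimal sketch, run on the primary, using the columns listed above:

-- Run on the primary: how far behind (in bytes) each standby's replay position is
SELECT application_name,
       client_addr,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS byte_lag
FROM pg_stat_replication;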

2. pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn()

These functions, executed on the standby, return the write-ahead log (WAL) locations that the standby has received and replayed; the gap between them shows how much received WAL is still waiting to be applied.

Measure Replication Lag in Terms of WAL
-- Run this on the standby
SELECT
  pg_last_wal_receive_lsn() AS received_lsn,
  pg_last_wal_replay_lsn() AS replayed_lsn,
  pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes;
  • received_lsn: The LSN up to which the standby has received WAL from the primary.
  • replayed_lsn: The LSN up to which the standby has applied WAL.
  • replay_lag_bytes: The number of bytes of received WAL that the standby has not yet replayed. If this value keeps growing, the standby cannot apply WAL as fast as it arrives.

Note that pg_current_wal_lsn() can only be called on the primary (it raises an error while recovery is in progress), so comparing the standby against the primary's current WAL position requires querying both servers, as shown in section 5 below.
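
If a time-based figure is more useful than bytes, a common approximation is the age of the last replayed transaction. A sketch, run on the standby; note that during periods with no write activity on the primary this value grows even though nothing is missing:

-- Run on the standby: seconds since the last replayed transaction committed on the primary
SELECT CASE
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS replay_delay_seconds;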

3. pg_stat_wal_receiver View

On the standby side, the pg_stat_wal_receiver view provides information about the WAL receiver process, which is responsible for receiving WAL data from the primary.

Key Columns in pg_stat_wal_receiver
  • status: The current status of the WAL receiver process (e.g., streaming).
  • received_lsn: The last WAL location received and written by the standby (renamed to written_lsn in PostgreSQL 13 and later).
  • last_msg_send_time: The last time a message was sent from the primary.
  • last_msg_receipt_time: The last time a message was received from the primary.
  • latest_end_lsn: The last WAL location reported back to the primary's WAL sender.
Sample Query for Monitoring WAL Receiver
SELECT
  status,
  received_lsn,
  last_msg_send_time,
  last_msg_receipt_time,
  latest_end_lsn
FROM pg_stat_wal_receiver;
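
A quick freshness check on the standby is how long ago the last message arrived from the primary; a large and growing value usually points to a broken or stalled connection:

SELECT status,
       now() - last_msg_receipt_time AS time_since_last_message
FROM pg_stat_wal_receiver;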

4. pg_stat_activity View

The pg_stat_activity view helps monitor connections and replication activity between the primary and standby. You can filter this view to check replication processes.

Sample Query for Monitoring Replication Processes
SELECT
  pid,
  usename,
  application_name,
  client_addr,
  backend_start,
  state
FROM pg_stat_activity
WHERE backend_type = 'walsender';

This query shows all WAL sender processes that are replicating data to connected standbys.
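The same filter can also show how long each WAL sender has been connected and what it is currently waiting on, which helps spot stalled senders. A sketch using standard pg_stat_activity columns:

SELECT pid,
       application_name,
       client_addr,
       now() - backend_start AS connected_for,
       wait_event_type,
       wait_event
FROM pg_stat_activity
WHERE backend_type = 'walsender';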

5. Monitoring Replication Lag with WAL Archiving (Asynchronous Replication)

In asynchronous replication setups, particularly those that use WAL archiving, you can measure lag by comparing the WAL positions (LSNs) on the primary and the standby.

Check the Current WAL Position on the Primary
SELECT pg_current_wal_lsn();
Check the Latest Replayed WAL Position on the Standby

On the standby:

SELECT pg_last_wal_replay_lsn();

By comparing these two values, you can identify whether the standby is lagging and by how much WAL. To convert the difference into bytes, pass both LSNs to pg_wal_lsn_diff(), as in the example below.
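A minimal sketch of that conversion, run on the primary; the LSN literal is only a placeholder for the value returned by pg_last_wal_replay_lsn() on the standby:

-- Run on the primary; replace '0/3000060' with the LSN reported by the standby
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000060'::pg_lsn)
       ) AS lag_behind_primary;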

6. pg_replication_slots

Replication slots ensure that the primary retains WAL logs until the standby has confirmed receiving them. Monitoring replication slots can help identify potential lag caused by replication delay.

Query for Replication Slot Information
SELECT
  slot_name,
  plugin,
  slot_type,
  active,
  restart_lsn,
  confirmed_flush_lsn
FROM pg_replication_slots;
  • restart_lsn: The oldest LSN the slot may still need, and therefore the point from which the primary must retain WAL.
  • confirmed_flush_lsn: For logical slots, the LSN up to which the consumer has confirmed receiving data (NULL for physical slots).

A growing gap between the primary's current WAL position (pg_current_wal_lsn()) and a slot's restart_lsn indicates that the consumer of that slot is falling behind and that WAL is piling up on the primary.
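
To see how much WAL each slot is currently forcing the primary to keep, compare restart_lsn with the primary's current position. A sketch, run on the primary:

-- Run on the primary: WAL retained on behalf of each replication slot
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;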

7. Alerting and Monitoring Tools

Using third-party monitoring tools or custom alerting scripts can help detect lag and replication issues automatically:

  • pg_stat_monitor: Percona's advanced query statistics extension for PostgreSQL performance monitoring.
  • pgAdmin: Provides a GUI-based interface to monitor replication.
  • Prometheus and Grafana: PostgreSQL exporters (such as postgres_exporter) can provide replication and lag metrics to be visualized in Grafana dashboards.
  • Nagios or Zabbix: Monitoring tools with PostgreSQL plugins that can monitor replication lag and raise alerts when it exceeds certain thresholds.

8. Network Latency and Performance Bottlenecks

In addition to database-specific metrics, replication lag can also occur due to network latency or performance issues on the standby. Monitoring system performance (CPU, memory, I/O) and network throughput is essential to fully understand the cause of replication lag.

  • Check Network Latency: Use tools like ping, traceroute, or dedicated network monitoring tools to ensure network connections between the primary and standby are healthy.
  • Check I/O Performance: On both primary and standby servers, monitor disk I/O with tools like iostat or vmstat to ensure that slow disk writes are not causing lag.

Steps to Troubleshoot Replication Lag

  1. Check the Replication State: Use the pg_stat_replication view to verify the replication state and see if the standby is lagging (e.g., check write_lag, flush_lag, replay_lag).
  2. Compare LSNs: Compare the current LSN of the primary with the LSNs on the standby (pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn()) to measure how far behind the standby is.
  3. Review WAL Receiver and Sender Status: On the standby, check the pg_stat_wal_receiver view to ensure the standby is actively receiving WAL data from the primary.
  4. Investigate System Resources: Check for network issues, disk I/O performance, or CPU/memory bottlenecks that may be causing replication delays.
  5. Monitor Replication Slots: Use the pg_replication_slots view to ensure that WAL files are not piling up on the primary due to slow replication.
  6. Set Alerts: Use monitoring tools (e.g., Prometheus, Grafana, Nagios) to set up alerts for replication lag so that you can respond quickly to potential issues; a sample threshold query that such a check can run is shown below.
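
As a minimal sketch of such a check, the query below (run on the primary) returns a row for every standby that has fallen more than a given amount of WAL behind; the 100 MB threshold is only an illustrative value:

-- Run on the primary: standbys lagging by more than ~100 MB of WAL
SELECT application_name,
       client_addr,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS bytes_behind
FROM pg_stat_replication
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 100 * 1024 * 1024;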

Conclusion

By using system views such as pg_stat_replication, pg_stat_wal_receiver, and functions like pg_current_wal_lsn(), you can monitor the replication status in PostgreSQL and identify lag. Additionally, combining PostgreSQL’s built-in monitoring with external tools like Prometheus and Grafana allows for more comprehensive replication oversight. Properly identifying and addressing replication lag ensures that your standby databases remain in sync with the primary, helping to maintain high availability and data integrity.