new check: streaming_delay #206

Closed
tobixen wants to merge 2 commits


tobixen
Contributor

@tobixen tobixen commented Nov 7, 2018

measuring how much a slave server is lagging behind its master.

This one is a bit similar to the existing streaming_delta check, except the delta check is supposed to be run from the master, and it measures the data delta in bytes, not the time delta in seconds. In some settings one would probably like to get a quick alert if the slave is significantly lagged compared to the master, even if the data delta size is small.
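To make the difference between the two measurements concrete, here is a minimal sketch. It assumes LSNs in PostgreSQL's usual `hi/lo` hexadecimal text form; the helper names (`lsn_to_bytes`, `byte_delta`, `time_delta_seconds`) and the sample values are made up for illustration and are not part of check_pgactivity.

```python
from datetime import datetime, timezone


def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN in 'hi/lo' hex text form (e.g. '16/B374D848') to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def byte_delta(primary_lsn: str, standby_lsn: str) -> int:
    """Data delta in bytes: how far the standby position is behind the primary."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)


def time_delta_seconds(now: datetime, last_replay: datetime) -> float:
    """Time delta in seconds since the standby last replayed a transaction."""
    return (now - last_replay).total_seconds()


# Hypothetical sample: a tiny byte gap, but nothing replayed for an hour.
small_gap = byte_delta("0/3000148", "0/3000060")
stale_for = time_delta_seconds(
    datetime(2018, 11, 7, 22, 0, tzinfo=timezone.utc),
    datetime(2018, 11, 7, 21, 0, tzinfo=timezone.utc),
)
print(small_gap, stale_for)
```

A byte-based threshold would stay quiet on this sample, while a time-based threshold would alert.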

(requested by a customer of ours)

@tobixen tobixen force-pushed the feature_streaming_delta_time branch from 6f316f6 to 7856835 Compare November 7, 2018 21:56
@ioguix ioguix added this to the release 2.5 milestone Jan 29, 2019
@rjuju
Member

rjuju commented Apr 22, 2019

Hi,

I just looked at this PR. Actually, we didn't implement it previously because the underlying function is not really helpful. Unfortunately, it doesn't return the replication lag, but rather the time since data was last received and replayed. So if there's no write activity on the primary server, this service will probably trigger some false errors. If we were to accept this check, it'd have to be renamed to something like "received_activity_from_primary", or something else that makes clear what it's actually checking.

About the code:

  • there's no pg_last_wal_replay_timestamp function in postgres; pg_last_xact_replay_timestamp has never been renamed AFAIK
  • you document that UNKNOWN is returned if used on primary (and stand-alone server, but I'm not sure of what it means), but that's not the case. You should probably check for pg_is_in_recovery()
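The two points above, plus the idle-primary caveat, can be sketched roughly as follows. This is not the PR's code: the query only uses `pg_is_in_recovery()` and `pg_last_xact_replay_timestamp()`, which do exist in PostgreSQL, while the `classify` function, its thresholds and the choice to return UNKNOWN for a NULL timestamp are assumptions made for illustration.

```python
from datetime import timedelta
from typing import Optional

# Query a check could run on the standby. The column aliases are arbitrary.
QUERY = """
SELECT pg_is_in_recovery() AS in_recovery,
       now() - pg_last_xact_replay_timestamp() AS since_last_replay
"""


def classify(in_recovery: bool, since_last_replay: Optional[timedelta],
             warning: timedelta, critical: timedelta) -> str:
    """Map one result row of QUERY to a Nagios-style status (hypothetical helper)."""
    if not in_recovery:
        return "UNKNOWN"  # primary or stand-alone server: nothing to measure
    if since_last_replay is None:
        return "UNKNOWN"  # assumption: no transaction replayed yet
    if since_last_replay >= critical:
        return "CRITICAL"
    if since_last_replay >= warning:
        return "WARNING"
    return "OK"


# A fully caught-up standby whose primary received no writes for two hours
# still crosses the threshold, which is the false alarm described above.
idle_primary = classify(True, timedelta(hours=2),
                        warning=timedelta(minutes=1), critical=timedelta(minutes=5))
print(idle_primary)
```

The value measured here is time since the last replayed transaction, so it only approximates lag when the primary is written to continuously.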

@ioguix ioguix modified the milestones: release 2.5, release 2.6 Nov 3, 2020
@ioguix
Member

ioguix commented Nov 29, 2023

Hi,

So this PR has been softly rejected for 5 years already. Let's close it for good.

However, this feature request will be supported through some other means using *_lag fields that appeared in pg_stat_replication in v10. See #361 and future PR about it.

Cheers,

@tobixen
Contributor Author

tobixen commented Nov 29, 2023

Oh ... this one went under my radar. I don't have responsibility for any primary-slave postgresql setups at the moment, so this is no priority for me, but I still think it may be useful for primary-slave environments where continuous write activity is expected. If the slave is lagging behind, then something is most likely wrong. I could blow the dust off this one and rename it - but if nobody finds it useful, then let it be :-)

@Krysztophe
Collaborator

Any reason why you want the service on the secondary rather than the primary? Are the write_lag and replay_lag from pg_stat_replication enough?

@tobixen
Contributor Author

tobixen commented Nov 29, 2023

It's been a long time since I was playing with this, but on the primary server you can only check that there exists a slave that is up-to-date. Theoretically things may be set up wrongly so that there is another slave connected to the master, so I think it's nice to check from the slave point of view that it's connected, too. I believe that in November 2018 I was playing around with disaster recovery situations where a slave was switched to master, new slaves were taken up or taken down, etc.

@ioguix
Member

ioguix commented Nov 29, 2023

Hi,

> but on the primary server you can only check that there exists a slave that is up-to-date

No, from pg_stat_replication on the primary you can check:

  • that the specified secondaries are connected
  • how much data each of them is lagging behind (write_lsn, flush_lsn, replay_lsn; we use them)
  • how long they are lagging behind (write_lag, flush_lag, replay_lag; we don't use them... yet)

So the idea is to use the *_lag fields, report them in perfdata and allow setting thresholds on them.
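As a rough illustration of that idea (not the future implementation), the sketch below turns rows from pg_stat_replication into a status and perfdata strings. The column names are the real ones available since v10. The `evaluate` function, the perfdata labels, the thresholds and the choice to treat a NULL lag as 0 seconds are all assumptions.

```python
from datetime import timedelta
from typing import Optional

# Query to run on the primary (PostgreSQL 10 or later).
QUERY = """
SELECT application_name, client_addr, write_lag, flush_lag, replay_lag
FROM pg_stat_replication
"""

LAG_FIELDS = ("write_lag", "flush_lag", "replay_lag")


def lag_seconds(lag: Optional[timedelta]) -> float:
    """Assumption: a NULL lag is reported as 0 seconds."""
    return 0.0 if lag is None else lag.total_seconds()


def evaluate(rows, warning_s: float, critical_s: float):
    """Return (status, perfdata) for rows shaped like the QUERY result."""
    status = "OK"
    perfdata = []
    for row in rows:
        label = f"{row['application_name']}@{row['client_addr']}"
        for field in LAG_FIELDS:
            seconds = lag_seconds(row[field])
            perfdata.append(
                f"'{label} {field}'={seconds:g}s;{warning_s:g};{critical_s:g}"
            )
            if seconds >= critical_s:
                status = "CRITICAL"
            elif seconds >= warning_s and status == "OK":
                status = "WARNING"
    return status, perfdata


# Hypothetical row, as a database driver might return it.
sample_rows = [{
    "application_name": "thisstandbyname",
    "client_addr": "10.20.30.40",
    "write_lag": timedelta(seconds=2),
    "flush_lag": timedelta(seconds=3),
    "replay_lag": timedelta(seconds=45),
}]
status, perfdata = evaluate(sample_rows, warning_s=30, critical_s=60)
print(status, " ".join(perfdata))
```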

> Theoretically things may be set up wrongly so that there is another slave connected to the master

Yes, but you can set up check_pga to explicitly check the streaming to some specific standbys using their application_name + remote IP address, e.g.: --slave 'thisstandbyname 10.20.30.40' --slave 'thisanotherstandbyname 10.20.30.41'
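The matching this implies can be sketched as below. It only illustrates comparing expected "name address" pairs against what the primary reports; `parse_slave_spec` and `missing_standbys` are hypothetical helpers and do not reflect how check_pga actually parses or evaluates --slave.

```python
def parse_slave_spec(spec: str):
    """Split a 'application_name address' spec into its two parts."""
    name, address = spec.strip().rsplit(" ", 1)
    return name, address


def missing_standbys(expected_specs, connected):
    """Return the specs that have no matching (application_name, client_addr) row.

    `connected` is a list of (application_name, client_addr) pairs, as could be
    read from pg_stat_replication on the primary.
    """
    connected_set = {(name, str(addr)) for name, addr in connected}
    return [spec for spec in expected_specs
            if parse_slave_spec(spec) not in connected_set]


expected = ["thisstandbyname 10.20.30.40", "thisanotherstandbyname 10.20.30.41"]
# Hypothetical state: only the first standby is currently streaming.
connected_now = [("thisstandbyname", "10.20.30.40")]
missing = missing_standbys(expected, connected_now)
print(missing)
```

A standby connected under an unexpected name or address is reported as missing for its expected spec, which covers the "wrong slave connected" case from the primary side.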

@tobixen
Contributor Author

tobixen commented Nov 29, 2023

I still feel it would be simpler and easier to be able to check from a specific slave whether it's connected to the master than to do it from the master, but I won't waste time arguing about that - and for the foreseeable future I'm not going to monitor any postgresql slave servers anyway :-)

@ioguix
Member

ioguix commented Nov 29, 2023

> I still feel it would be simpler and easier being able to check from a specific slave if it's connected to the master than to do it from the master

Sure, it would be possible, why not. However, you cannot check the lag from the standby's point of view.

> and for the foreseeable future I'm not going to monitor any postgresql slave servers anyway :-)

Thank you very much for your past discussions and contributions Tobias!

Cheers,


5 participants