PostgreSQLConnection.fetch_remote_status Issue With Large PostgreSQL #1056

Open
gariels opened this issue Jan 17, 2025 · 1 comment

@gariels
gariels commented Jan 17, 2025

Hello, we have been using barman for quite a while with our PostgreSQL
instances, some of which contain more than 27k databases; most of them are
small, around ~20 MB each. Such a setup comprises a lot of files inside the
data directory.

After installing barman 3.12 on one of our barman servers we see a lot of
client connections from barman, all running this query:

SELECT sum(pg_tablespace_size(oid)) FROM pg_tablespace

I can see that since barman 3.10 the receive_wal method of barman.server.Server
has had preflight logic added:

 ..

strategy = CheckStrategy()
self._check_wal_streaming_preflight(strategy, self.get_remote_status())
if strategy.has_error:
	output.error(
		"Impossible to start WAL streaming. Check the log "
		"for more details, or run 'barman check %s'" % self.config.name
	)
	return

 ..


Commit 315bdc8 introduced it.

This runs before iterating over the self.archivers items to call their
receive_wal.

This incurs a costly I/O operation on the database server, because the logic
in fetch_remote_status of barman.postgres.PostgreSQLConnection, which is
eventually called from self.get_remote_status(), executes the PostgreSQL
database size SUM query, see:

result["current_size"] = self.current_size

def current_size(self):

which basically matches the query I found hitting our I/O-hobbled database server:

SELECT sum(pg_tablespace_size(oid)) FROM pg_tablespace

It would be nice if the current_size property could be made optional in the
get_remote_status result, because from my cursory look at the code I cannot
find where current_size is used in the context of receiving WAL from the
database server.

Unfortunately, the problem is exacerbated because this preflight check executes
outside the ServerWalReceiveLock context, which causes multiple concurrent
calls to the database server.

Luckily, the majority of our existing barman installations predate the 3.10
release, which introduced this change.

I would like to propose moving the preflight check inside ServerWalReceiveLock,
so that only one barman client connects to the database server running
SELECT sum(pg_tablespace_size(oid)) FROM pg_tablespace. I made a modification
to server.py at Server.receive_wal:

def receive_wal(self, reset=False):
    """
    Enable the reception of WAL files using streaming protocol.

    Usually started by barman cron command.
    Executing this manually, the barman process will not terminate but
    will continuously receive WAL files from the PostgreSQL server.

    :param reset: When set, resets the status of receive-wal
    """
    # Execute the receive-wal command only if streaming_archiver
    # is enabled
    if not self.config.streaming_archiver:
        output.error(
            "Unable to start receive-wal process: "
            "streaming_archiver option set to 'off' in "
            "barman configuration file"
        )
        return

    try:
        # Take care of the receive-wal lock.
        # Only one receiving process per server is permitted
        with ServerWalReceiveLock(
            self.config.barman_lock_directory, self.config.name
        ):
            # Use the default CheckStrategy to silently check WAL streaming
            # conditions are met and write errors to the log file.
            strategy = CheckStrategy()
            self._check_wal_streaming_preflight(strategy, self.get_remote_status())
            if strategy.has_error:
                output.error(
                    "Impossible to start WAL streaming. Check the log "
                    "for more details, or run 'barman check %s'" % self.config.name
                )
                return

            if not reset:
                output.info("Starting receive-wal for server %s", self.config.name)

            try:
                # Only the StreamingWalArchiver implementation
                # does something.
                # WARNING: This codes assumes that there is only one
                # StreamingWalArchiver in the archivers list.
                for archiver in self.archivers:
                    archiver.receive_wal(reset)
            except ArchiverFailure as e:
                output.error(e)

    except LockFileBusy:
        # If another process is running for this server,
        if reset:
            output.error(
                "Unable to reset the status of receive-wal "
                "for server %s. Process is still running" % self.config.name
            )
        else:
            output.error(
                "Another receive-wal process is already running "
                "for server %s." % self.config.name
            )

This eliminates multiple barman clients connecting to the database server only
to get stuck and thrash the I/O.

I would also propose making get_remote_status (and fetch_remote_status) aware
of the calling context, so that only the information the subsequent code
actually uses is fetched from the database server. An example related to my
issue: not resolving the database size when the context is receiving WAL.
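A minimal sketch of such a context-aware API, assuming a hypothetical `fields` parameter (not barman's actual signature): the caller names the fields it needs, and expensive probes like the tablespace size SUM are skipped unless requested.

```python
class PostgreSQLConnection:
    def __init__(self):
        self.size_queries_run = 0  # instrumentation for this sketch only

    def _query_current_size(self):
        # Stand-in for the costly
        #   SELECT sum(pg_tablespace_size(oid)) FROM pg_tablespace
        self.size_queries_run += 1
        return 12345

    def fetch_remote_status(self, fields=None):
        """Return remote status; `fields` (hypothetical) limits the probes.

        fields=None keeps today's behavior and fetches everything.
        """
        result = {"server_txt_version": "16.1"}  # cheap probe, always done
        if fields is None or "current_size" in fields:
            result["current_size"] = self._query_current_size()
        return result
```

A receive-wal preflight would then call fetch_remote_status(fields={"server_txt_version"}) and never issue the size query, while callers with no `fields` argument keep the current full-status behavior.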

@martinmarques
Contributor

Thank you very much for the thorough message @gariels !

We'll have an internal conversation on this and get back here, but I think your logic here makes a lot of sense. We should be more mindful of what data we need to gather from the server and try to cache it when possible.
