Skip to content

MDEV-39101 Make BACKUP SERVER mutually exclusive with itself and BACKUP STAGE#4892

Open
mariadb-andrzejjarzabek wants to merge 9 commits intoMariaDB:MDEV-14992from
mariadb-andrzejjarzabek:MDEV-39101
Open

MDEV-39101 Make BACKUP SERVER mutually exclusive with itself and BACKUP STAGE#4892
mariadb-andrzejjarzabek wants to merge 9 commits intoMariaDB:MDEV-14992from
mariadb-andrzejjarzabek:MDEV-39101

Conversation

@mariadb-andrzejjarzabek
Copy link
Copy Markdown

BACKUP SERVER blocks when another BACKUp SERVER is in progress or a BACKUP STAGE has been started. Takes advantage of existing mechanism using MDL which BACKUP STAGE uses to block concurrent backups.

dr-m added 8 commits March 27, 2026 11:45
This introduces a basic driver Sql_cmd_backup, storage engine interfaces,
and basic copying of InnoDB data files.
On Windows, we pass a target directory name; elsewhere, we pass a
target directory handle.

TODO: Copy the InnoDB log and prevent concurrent log_t::set_archive().

TODO: Implement proper mutual exclusion with buf_page_t::flush().

TODO: Implement other storage engine interfaces.

TODO: Implement locking to exclude concurrent BACKUP STAGE.

innodb_log_recovery_start: Make settable, for incremental backup.
Alas, the space.get_create_lsn() < checkpoint logic does not work yet.
Implement file-level InnoDB page locking for backup.

buf_flush_list_space(): Check for concurrent backup before writing each
page. This is inefficient, but this function may be invoked from multiple
threads concurrently, and it cannot be changed easily, especially for
fil_crypt_thread().

TODO: Implement finer-grained locking around copying page ranges.
@mariadb-andrzejjarzabek
Copy link
Copy Markdown
Author

Note - this solution uses MDL backup lock which makes it block on concurrent BACKUP SERVER and BACKUP STAGE as required. However BACKUP STAGE also upgrades that lock for various actions like flushing and DDL logging. As written, the various actions performed by BACKUP SERVER do not have access to the MDL request/ticket. If they should attempt to use the MDL lock themselves to perform the backup, we should be mindful of possible interactions.

@Thirunarayanan Thirunarayanan requested a review from dr-m April 2, 2026 12:11
…UP STAGE

BACKUP SERVER blocks when another BACKUp SERVER is in progress or a BACKUP
STAGE has been started. Takes advantage of existing mechanism using MDL
which BACKUP STAGE uses to block concurrent backups.
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gkodinov gkodinov added External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. MariaDB Corporation and removed External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. labels Apr 3, 2026
Copy link
Copy Markdown
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. In a few hours, I plan to update the MDEV-14992 branch soon with a minimal version of this initial part, without any DEBUG_SYNC instrumentation, and with a test that works with CMAKE_BUILD_TYPE=RelWithDebInfo.

As far as I understand from MDEV-36025, acquiring only MDL_BACKUP_START is not going to be sufficient. You should quickly find this out when implementing backup for ENGINE=Aria (MDEV-39092).

Before the backup_end step, we must acquire more locks. The locks can and should be released before the backup_finalize step, to avoid unnecessary waits.

Comment on lines +25 to 27
#include <my_global.h>

class THD;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, my_global.h is supposed to be included first in every file that includes these files. You added an #include for this file to sql/sql_backup.cc, which used to do #include "my_global.h" as the very first thing. If we retain that order, no changes to the sql directory other than sql/sql_backup.cc will be necessary.

Comment on lines 570 to 574
inline void set_type(enum_mdl_type type_arg)
{
DBUG_ASSERT(ticket == NULL);
DBUG_ASSERT(ticket == nullptr);
type= type_arg;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is shorter and more idiomatic to write !ticket.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I sympathise with nullptr, I'd avoid unrelated changes here and everywhere else.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to back out this change, however it seems to me a healthy principle for header files to include (even if transitively) all definitions and declarations they require, rather than count on files that include them to also include other files in specified order. If the convention is to include "my_global.h" before other includes, then headers relying on my_global.h should themselves include it? However since it's neither related not needed, I'm leaving it out.

Comment on lines +16 to 18
#include "debug_sync.h"
#include "mdl.h"
#include "my_global.h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had the #include been added after the my_global.h, then no changes to those files would be necessary.

Comment on lines -80 to -84
bool fail = my_mkdir(target.str, 0755, MYF(MY_WME));

#ifndef _WIN32
int dir= open(target.str, O_DIRECTORY);
if (dir < 0)
int dir;
if(!fail)
{
my_error(EE_CANT_MKDIR, MYF(ME_BELL), target.str, errno);
return true;
dir = open(target.str, O_DIRECTORY);
if (dir < 0)
{
my_error(EE_CANT_MKDIR, MYF(ME_BELL), target.str, errno);
fail= true;
}
}
#endif

bool fail= plugin_foreach_with_mask(thd, backup_start,
MYSQL_STORAGE_ENGINE_PLUGIN,
PLUGIN_IS_DELETED|PLUGIN_IS_READY,
IF_WIN(const_cast<char*>(target.str),
reinterpret_cast<void*>(dir)));
if (!fail)
fail= plugin_foreach_with_mask(thd, backup_step,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This excessive reindentation exercise, which is violating some of our white space rules (there should be no space before assignment operators, and there should be space after if) would have been prevented by making use of a simple goto label:

  if (my_mkdir(target.str, 0755, MYF(MY_WME)))
  {
  err_exit:
    thd->mdl_context.release_lock(mdl_request.ticket);
    return true;
  }
//
  if (dir < 0)
  {
    my_error(EE_CANT_MKDIR, MYF(ME_BELL), target.str, errno);
    goto err_exit;
  }

Simply, each subsequent return true will be replaced with goto err_exit.

Comment on lines +1 to +3
# Check that BACKUP SERVER blocks when another BACKUP SERVER is in progress

--source include/have_debug_sync.inc
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is covering more than the scenario that is listed first. It’s customary to have a common setup section before any tests start. I think that a better name for this test would be main.backup_server_locking. I was thinking if --suite=backup would be a more suitable place for this test, but then I rememberered that some targets, such as AppVeyor, only cover --suite=main. It’s good to have wider coverage for this kind of a test.

Usually, if we include have_debug_sync.inc we also include have_debug.inc so that the test will be filtered out quicker when the tests are being run against a non-debug-instrumented server. It would be even better to remove the DEBUG_SYNC dependency or to make it optional. At least some of the locking can be demonstrated without using any DEBUG_SYNC.

Comment on lines +5 to +8
--connect (con1,localhost,root,,test)
--connect (con2,localhost,root,,test)

--connection con1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connect implies a switch of the current connection. Hence, only the first of these three statements should be needed here. We’re also missing count_sessions.inc at the start of the test and wait_until_count_sessions.inc at the end of the test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count_sessions.inc is broken and we're gradually moving away from it. See e.g. 4654b4f, we're now fixing affected test rather than adding wait_until_count_sessions.inc everywhere.

One of the biggest problems was check testcase connection, which may affect value recorded by count_sessions.inc

Comment on lines +79 to +85
--connection con2
--reap

--connection default

--disconnect con1
--disconnect con2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connections should be closed as soon as possible, so that they can be closed faster on the server side and the test can complete faster, even after adding the missing wait_until_count_sessions.inc.


--connection con2
--replace_result $MYSQLTEST_VARDIR <MYSQLTEST_VARDIR>
--send_eval BACKUP SERVER TO '$MYSQLTEST_VARDIR/another_directory'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no rmdir for either other_directory or another_directory. Tests are not supposed to leave any garbage files or directories behind.

Comment on lines +92 to +119
plugin_foreach_with_mask(thd, backup_finalize, MYSQL_STORAGE_ENGINE_PLUGIN,
PLUGIN_IS_DELETED|PLUGIN_IS_READY, nullptr);
plugin_foreach_with_mask(thd, backup_finalize, MYSQL_STORAGE_ENGINE_PLUGIN,
PLUGIN_IS_DELETED|PLUGIN_IS_READY, nullptr);
#ifndef _WIN32
close(dir);
close(dir);
#endif

}
thd->mdl_context.release_lock(mdl_request.ticket);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock must be released before backup_finalize to avoid unnecessary conflicts. Quoting the MDEV-14992 description:

backup_finalize after BACKUP STAGE END released locks
InnoDB:

  • copy and trim the last 0 to 2 “hot” ib_*.log files
  • if backup_start had set innodb_log_archive=ON:
  1. restore innodb_log_archive, innodb_log_file_size
  2. delete any ib_*.log files that had been created outside the backup directory

RocksDB:

  • in MDEV-38362, stream the files that were hard-linked in backup_end

The streaming backup (MDEV-38362) would likely make use of a working directory for the backup, to facilitate efficient copying of the innodb_log_archive=ON files as well as any ENGINE=RocksDB files.

During this stage, we are working on our private copy in dir, or deleting ib_*.log files from the server data directory. These operations are assumed to be invisible to any other threads. There is no need to block any operations on the server data directory at this final processing step.

@dr-m
Copy link
Copy Markdown
Contributor

dr-m commented Apr 8, 2026

I force-pushed #4817 and included the non-debug test main.backup_server_locking that demonstrates some mutual exclusion without depending on any DEBUG_SYNC. It could make more sense to rely on some DEBUG_SYNC in a storage engine, say, when implementing code and tests for backing up ENGINE=Aria and ENGINE=MyISAM files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants