
MDEV-36025: backup taken from a replica with optimistic parallel replication fails to restore most of the time#4888

Open
hemantdangi-gc wants to merge 1 commit into 10.11 from 10.11_MDEV-36025

@hemantdangi-gc
Contributor

hemantdangi-gc commented Apr 1, 2026

Issue:
Backups taken from a replica running optimistic parallel replication can
restore into a server that aborts on startup with:

Found N prepared transactions! ...

In the MDEV-36025 reproducer the application never issues XA SQL, but on
startup InnoDB reports several transactions “in the XA prepared state” and the
server aborts. These internal XA transactions created by optimistic parallel
replication on the replica are not covered by MDEV-742 and end up prepared after
restore, causing the “Found N prepared transactions” startup failure. This is
reproducible by mariabackup.xa_prepared_on_restore testcase, which fails with
'Found N prepared transactions'.

Solution:
Port the MDEV-21168 fix to MariaDB 10.6.

Add SRV_OPERATION_RESTORE_ROLLBACK_XA server operation mode and
--rollback-xa option (enabled by default) to mariabackup --prepare.
This automatically rolls back prepared XA transactions during prepare,
since the backup does not contain the binary log needed to resolve them.

Prevent the incompatible combination of the --rollback-xa and --export options.
The combination creates an mmap state inconsistency in InnoDB's MTR system,
leading to a crash.
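
As a rough illustration of the option check described above, the following sketch rejects the combination before any recovery work starts. The function name, its parameters, and the error text are hypothetical and are not taken from the patch; the real check would live in mariabackup's option handling.

```cpp
#include <string>

// Hypothetical helper: returns an empty string when the option set is
// acceptable, otherwise a human-readable reason for rejecting it.
// Assumption: the two booleans mirror the parsed --rollback-xa and
// --export options of mariabackup --prepare.
std::string validate_prepare_options(bool rollback_xa, bool export_tables) {
  if (rollback_xa && export_tables) {
    // Illustrative message only; the actual wording in the patch may differ.
    return "--rollback-xa cannot be combined with --export";
  }
  return std::string();
}
```

A caller would run this check right after option parsing and abort the prepare step when the returned string is not empty.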

@andrelkin
Contributor

> However, MDEV-742 does not solve the problem for internal XA transactions, as MDEV-36025 demonstrates

How does it demonstrate that?
I believe it is then covered by an mtr test. Could you please point to that file and its block?

At any rate the commit message should be more verbose in this part. Please describe that scenario.

@hemantdangi-gc
Contributor Author

hemantdangi-gc commented Apr 3, 2026

> However, MDEV-742 does not solve the problem for internal XA transactions, as MDEV-36025 demonstrates
>
> How does it demonstrate that? I believe it is then covered by an mtr test. Could you please point to that file and its block?
>
> At any rate the commit message should be more verbose in this part. Please describe that scenario.

The commit 5836191 (MDEV-21168) was deliberately NOT ported to 10.5+. It added an optional --rollback-xa flag to mariabackup in 10.4 only, with this note in the commit message: "The fix MUST NOT be ported on 10.5+, as MDEV-742 fix solves the issue for slaves."

The mariabackup.xa_prepared_on_restore test, which fails with the MDEV-36025 'Found N prepared transactions' error, passes after porting MDEV-21168.

My point here is that MDEV-742 did not fix this issue, so we do have to port MDEV-21168 to handle the MDEV-36025 error. I added this line because I wanted the commit message to state why MDEV-21168 is needed.

hemantdangi-gc requested a review from dr-m April 3, 2026 10:09
@andrelkin
Contributor

andrelkin commented Apr 3, 2026

@hemantdangi-gc, whatever MDEV-742 failed to fix has to be described in full detail in this PR.
So could you please take care to address my

> MDEV-742 does not solve the problem for internal XA transactions, as MDEV-36025 demonstrates

concern: how specifically were the internal XA (aka normal) transactions not covered by MDEV-21168? Cover this in the PR description at the necessary length; a shorter version (if possible) is good for the commit message.

I thought I would see that failure scenario in some test, and that's exactly what a good commit message must point to.

The solution section needs to be structured better too.
Sure, it starts with a reference to existing work

> Port the MDEV-21168 fix

but it should then expand on how and why that work fixes the MDEV-36025 report.

As MDEV-36025 is reported for slave, the refined issue description must either confirm this is the slave side indeed or exonerate 😄 the good old slave (the blame is on the general server therefore).

PS. If you need to discuss the technical side of the issue I'll be available from next Tue.

@hemantdangi-gc
Contributor Author

> @hemantdangi-gc, whatever MDEV-742 failed to fix has to be described in full detail in this PR. So could you please take care to address my
>
> > MDEV-742 does not solve the problem for internal XA transactions, as MDEV-36025 demonstrates
>
> concern: how specifically were the internal XA (aka normal) transactions not covered by MDEV-21168? Cover this in the PR description at the necessary length; a shorter version (if possible) is good for the commit message.
>
> I thought I would see that failure scenario in some test, and that's exactly what a good commit message must point to.

I have now added the test case details to the commit message:
This is reproducible by mariabackup.xa_prepared_on_restore testcase, which fails with 'Found N prepared transactions'.

> The solution section needs to be structured better too. Sure, it starts with a reference to existing work
>
> > Port the MDEV-21168 fix
>
> but it should then expand on how and why that work fixes the MDEV-36025 report.
>
> As MDEV-36025 is reported for slave, the refined issue description must either confirm this is the slave side indeed or exonerate 😄 the good old slave (the blame is on the general server therefore).

I have now added the replica context to the commit message:
Backups taken from a replica running optimistic parallel replication can restore into a server that aborts on startup with:

> PS. If you need to discuss the technical side of the issue I'll be available from next Tue.

I have revised the commit message based on your suggestions and removed the incorrect expectation that MDEV-742 would resolve the internal XA transaction issue. Please review and let me know if you have any further suggestions.

@andrelkin
Contributor

andrelkin commented Apr 7, 2026

@hemantdangi-gc, thanks for the mariabackup.xa_prepared_on_restore references!

From that test I find that the issue is present in the general server.
So this part of the problem description

> Backups taken from a replica running optimistic parallel replication can restore

is naturally misleading, and the faithful test exposes the general server.

Note that MDEV-21168 removed, or at least was supposed to remove, the user-XA-related option. Normal trx:s were not targeted.
And why?
I expect that the normal prepared trx:s were handled automatically, by rollback, at least in the base of the MDEV-21168 patch.
But my expectation might be wrong, or a bug was introduced later, or MDEV-21168 tried to make it work that way but failed, or (perhaps that is the case) MDEV-21168 removed the option for both the normal trx and the user trx.
Please explore that further on your own.

Once again, the bug agenda is that a normal BEGIN-...-COMMIT trx in the prepared state must be automatically rolled back. A normal trx can be identified by its xid having the "mysql" string prefix as part of its identifier.
If I am right in my guessing (see "perhaps" above), MDEV-21168 seems to have failed to implement a proper solution for
how to recover prepared user XA trx:s. Normally they must be restored in their prepared state.
The removed option just needed to filter the user XA trx:s out of the rollback at backup --prepare.
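
To make the filtering idea concrete, here is a minimal sketch of how prepared transactions could be split into "roll back" and "keep prepared" by looking at the xid prefix. Everything in it is an assumption made for illustration: the `Xid` struct is a simplified stand-in rather than the server's XID type, the prefix is assumed to be `MySQLXid`, and `make_xid`, `is_internal_xid` and `action_for` are hypothetical helpers. The real prefix and length checks should be taken from the server headers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Simplified stand-in for an XA transaction identifier (hypothetical layout).
struct Xid {
  long format_id;
  long gtrid_length;
  long bqual_length;
  char data[128];  // gtrid bytes followed by bqual bytes
};

// Assumption: xids generated internally for normal transactions start with
// this prefix and carry no branch qualifier.
static const char kInternalPrefix[] = "MySQLXid";
static const std::size_t kInternalPrefixLen = sizeof(kInternalPrefix) - 1;

// True when the xid looks like an internally generated one.
bool is_internal_xid(const Xid &xid) {
  return xid.bqual_length == 0 &&
         xid.gtrid_length >= static_cast<long>(kInternalPrefixLen) &&
         std::memcmp(xid.data, kInternalPrefix, kInternalPrefixLen) == 0;
}

// Test helper: builds an Xid from a gtrid and an optional bqual string,
// truncating each part to half of the data buffer.
Xid make_xid(const std::string &gtrid, const std::string &bqual = "") {
  Xid xid{};
  const std::size_t half = sizeof(xid.data) / 2;
  const std::size_t g = std::min(gtrid.size(), half);
  const std::size_t b = std::min(bqual.size(), half);
  xid.format_id = 1;
  xid.gtrid_length = static_cast<long>(g);
  xid.bqual_length = static_cast<long>(b);
  std::memcpy(xid.data, gtrid.data(), g);
  std::memcpy(xid.data + g, bqual.data(), b);
  return xid;
}

enum class PrepareAction { RollBack, KeepPrepared };

// Policy suggested in the comment above: roll back prepared normal
// transactions, leave user XA transactions in the prepared state.
PrepareAction action_for(const Xid &xid) {
  return is_internal_xid(xid) ? PrepareAction::RollBack
                              : PrepareAction::KeepPrepared;
}
```

With such a predicate, backup --prepare could roll back only the transactions for which `action_for` returns `RollBack`, so that user XA transactions survive the restore in their prepared state.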

Comment on lines -6985 to +7064
innodb_shutdown();
/* Without buf_flush_sync(), the rolled-back changes would exist only
in the buffer pool and be lost on shutdown, leaving the data files in
an inconsistent state.
In the innodb_preshutdown(), the condition was updated to include
SRV_OPERATION_RESTORE_ROLLBACK_XA so it waits for transactions when
srv_fast_shutdown == 0. The innodb_preshutdown() is called by
innodb_shutdown(), which will wait for any active transactions to
finish and shut down purge and undo background sources for
SRV_OPERATION_RESTORE_ROLLBACK_XA. */
if (xtrabackup_rollback_xa)
  buf_flush_sync_batch(0, false);

innodb_shutdown();

innodb_free_param();
innodb_free_param();
Contributor


The comment is rather confusing. Why can’t we invoke the higher-level function log_make_checkpoint() here? Can we issue some messages that would indicate that the backup has now diverged from the server it was copied from, and mention the new checkpoint LSN?

Comment on lines +1486 to +1492
{"rollback-xa", OPT_XTRA_ROLLBACK_XA,
"Rollback prepared XA transactions on --prepare. Enabled by default; "
"use --skip-rollback-xa to disable. "
"After preparing target directory with this option "
"it can no longer be a base for incremental backup.",
(G_PTR *) &xtrabackup_rollback_xa, (G_PTR *) &xtrabackup_rollback_xa, 0,
GET_BOOL, NO_ARG, 1, 0, 0, 0, 0, 0},
Contributor


I don’t think it is acceptable to change the default behaviour in stable release series. The need to change a large number of existing tests in a stable release series should be a warning sign to any reviewer.

Furthermore, I don’t think it is acceptable to break incremental backup by default, in any release.

--exec $MYSQLD_LAST_CMD

--let SEARCH_PATTERN= Found .* prepared transactions!
--source include/search_pattern_in_file.inc
Contributor


Note this is the general server, not slave.
