Skip to content

Test for-next (regular, GH kvm)#1624

Open
kdave wants to merge 10000 commits intoci-kvmfrom
for-next
Open

Test for-next (regular, GH kvm)#1624
kdave wants to merge 10000 commits intoci-kvmfrom
for-next

Conversation

@kdave
Copy link
Member

@kdave kdave commented Mar 5, 2026

Keep this open, the build tests are hosted on github CI.

@kdave
Copy link
Member Author

kdave commented Mar 5, 2026

The KVM tests seem to be waiting for the build workflow and don't even start and show up in the list. Ordering could be possible.

@kdave
Copy link
Member Author

kdave commented Mar 5, 2026

Re #1623 .

@kdave kdave changed the title Test for-next (regular, GH kvm) 2 Test for-next (regular, GH kvm) Mar 5, 2026
@kdave kdave closed this Mar 5, 2026
@kdave kdave reopened this Mar 5, 2026
@kdave kdave force-pushed the ci-kvm branch 2 times, most recently from 69fc6c9 to 98bf7e7 Compare March 5, 2026 23:30
Kuen-Han Tsai and others added 23 commits March 11, 2026 16:21
This reverts commit fde0634.

This commit is being reverted as part of a series-wide revert.

By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.

This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.

Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.

All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).

Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-2-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 56a512a.

This commit is being reverted as part of a series-wide revert.

By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.

This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.

Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.

All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).

Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-3-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…_device"

This reverts commit 0c09811.

This commit is being reverted as part of a series-wide revert.

By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.

This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.

Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.

All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).

Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-4-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 7a7930c.

This commit is being reverted as part of a series-wide revert.

By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.

This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.

Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.

All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).

Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-5-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit e065c6a.

This commit is being reverted as part of a series-wide revert.

By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.

This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.

Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.

All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).

Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-6-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The network device outlived its parent gadget device during
disconnection, resulting in dangling sysfs links and null pointer
dereference problems.

A prior attempt to solve this by removing SET_NETDEV_DEV entirely [1]
was reverted due to power management ordering concerns and a NO-CARRIER
regression.

A subsequent attempt to defer net_device allocation to bind [2] broke
1:1 mapping between function instance and network device, making it
impossible for configfs to report the resolved interface name. This
results in a regression where the DHCP server fails on pmOS.

Use device_move to reparent the net_device between the gadget device and
/sys/devices/virtual/ across bind/unbind cycles. This preserves the
network interface across USB reconnection, allowing the DHCP server to
retain their binding.

Introduce gether_attach_gadget()/gether_detach_gadget() helpers and use
__free(detach_gadget) macro to undo attachment on bind failure. The
bind_count ensures device_move executes only on the first bind.

[1] https://lore.kernel.org/lkml/f2a4f9847617a0929d62025748384092e5f35cce.camel@crapouillou.net/
[2] https://lore.kernel.org/linux-usb/795ea759-7eaf-4f78-81f4-01ffbf2d7961@ixit.cz/

Fixes: 40d133d ("usb: gadget: f_ncm: convert to new function interface with backward compatibility")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-7-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 1366cd2.

The fwnode_usb_role_switch_get() returns NULL only if no connection is
found, returns ERR_PTR(-EPROBE_DEFER) if connection is found but deferred
probe is needed, or a valid pointer of usb_role_switch.

When switching from a NULL check to IS_ERR_OR_NULL(), usb_role_switch_get()
returns NULL and overwrites the ERR_PTR(-EPROBE_DEFER) returned by
fwnode_usb_role_switch_get(). This causes the deferred probe indication to
be lost, preventing the USB role switch from ever being retrieved.

Fixes: 1366cd2 ("tcpm: allow looking for role_sw device in the main node")
Cc: stable <stable@kernel.org>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Tested-by: Arnaud Ferraris <arnaud.ferraris@collabora.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://patch.msgid.link/20260309074313.2809867-2-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
usb_role_switch_is_parent() was walking up to the parent node and checking
for the "usb-role-switch" property regardless of the type of the passed
fwnode. This could cause unrelated device nodes to be probed as potential
role switch parent, leading to spurious matches and "-EPROBE_DEFER" being
returned infinitely.

Till now only Type-B connector node will have a parent node which may
present "usb-role-switch" property and register the role switch device.
For Type-C connector node, its parent node will always be a Type-C chip
device which will never register the role switch device. However, it may
still present a non-boolean "usb-role-switch = <&usb_controller>" property
for historical compatibility.

So restrict the helper to only operate on Type-B connector when attempting
to get the role switch from parent node.

Fixes: 6fadd72 ("usb: roles: get usb-role-switch from parent")
Cc: stable <stable@kernel.org>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Tested-by: Arnaud Ferraris <arnaud.ferraris@collabora.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://patch.msgid.link/20260309074313.2809867-3-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The LPVO USB GPIB adapter apparently uses an FTDI 8U232AM with the
default PID, but this device id is already handled by the ftdi_sio
serial driver.

Stop binding to the default PID to avoid breaking existing setups with
FTDI 8U232AM.

Anyone using this driver should blacklist the ftdi_sio driver and add
the device id manually through sysfs (e.g. using udev rules).

Fixes: fce7951 ("staging: gpib: Add LPVO DIY USB GPIB driver")
Fixes: e6ab504 ("staging: gpib: Destage gpib")
Cc: Dave Penkler <dpenkler@gmail.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305151729.10501-2-johan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The DmaGspMem pointer accessor methods (gsp_write_ptr, gsp_read_ptr,
cpu_read_ptr, cpu_write_ptr, advance_cpu_read_ptr,
advance_cpu_write_ptr) dereference a raw pointer to DMA memory, creating
an intermediate reference before calling volatile read/write methods.

This is undefined behavior since DMA memory can be concurrently modified
by the device.

Fix this by moving the implementations into a gsp_mem module in fw.rs
that uses the dma_read!() / dma_write!() macros, making the original
methods on DmaGspMem thin forwarding wrappers.

An alternative approach would have been to wrap the shared memory in
Opaque, but that would have required even more unsafe code.

Since the gsp_mem module lives in fw.rs (to access firmware-specific
binding field names), GspMem, Msgq and their relevant fields are
temporarily widened to pub(super). This will be reverted once IoView
projections are available.

Cc: Gary Guo <gary@garyguo.net>
Closes: https://lore.kernel.org/nouveau/DGUT14ILG35P.1UMNRKU93JUM1@kernel.org/
Fixes: 75f6b1d ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20260309225408.27714-1-dakr@kernel.org
[ Use pub(super) where possible; replace bitwise-and with modulo
  operator analogous to [1]. - Danilo ]
Link: https://lore.kernel.org/all/20260129-nova-core-cmdq1-v3-1-2ede85493a27@nvidia.com/ [1]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Some bootloaders like recent versions of U-Boot may install some DMI
properties with empty values rather than not populate them. This manages
to make its way through the validator and cleanup resulting in a rogue
hyphen being appended to the card longname.

Fixes: 4e01e5d ("ASoC: improve the DMI long card code in asoc-core")
Signed-off-by: Casey Connolly <casey.connolly@linaro.org>
Link: https://patch.msgid.link/20260306174707.283071-2-casey.connolly@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>
…l/git/powerpc/linux

Pull powerpc fixes from Madhavan Srinivasan:
 - Correct MSI allocation tracking
 - Always use 64 bits PTE for powerpc/e500
 - Fix inline assembly for clang build on PPC32
 - Fixes for clang build issues in powerpc64/ftrace
 - Fixes for powerpc64/bpf JIT and tailcall support
 - Cleanup MPC83XX devicetrees
 - Fix keymile vendor prefix
 - Fix to use big-endian types for crash variables

Thanks to Abhishek Dubey, Christophe Leroy (CS GROUP), Hari Bathini,
Heiko Schocher, J. Neuschäfer, Mahesh Salgaonkar, Nam Cao, Nilay Shroff,
Rob Herring (Arm), Saket Kumar Bhaskar, Sourabh Jain, Stan Johnson, and
Venkat Rao Bagalkote.

* tag 'powerpc-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (23 commits)
  powerpc/pseries: Correct MSI allocation tracking
  powerpc: dts: mpc83xx: Add unit addresses to /memory
  powerpc: dts: mpc8315erdb: Add missing #cells properties to SPI bus
  powerpc: dts: mpc8315erdb: Rename LED nodes to comply with schema
  powerpc: dts: mpc8315erdb: Use IRQ_TYPE_* macros
  powerpc: dts: mpc8313erdb: Use IRQ_TYPE_* macros
  powerpc: 83xx: km83xx: Fix keymile vendor prefix
  dt-bindings: powerpc: Add Freescale/NXP MPC83xx SoCs
  powerpc64/bpf: fix kfunc call support
  powerpc64/bpf: fix handling of BPF stack in exception callback
  powerpc64/bpf: remove BPF redzone protection in trampoline stack
  powerpc64/bpf: use consistent tailcall offset in trampoline
  powerpc64/bpf: fix the address returned by bpf_get_func_ip
  powerpc64/bpf: do not increment tailcall count when prog is NULL
  powerpc64/ftrace: workaround clang recording GEP in __patchable_function_entries
  powerpc64/ftrace: fix OOL stub count with clang
  powerpc64: make clang cross-build friendly
  powerpc/crash: adjust the elfcorehdr size
  powerpc/kexec/core: use big-endian types for crash variables
  powerpc/prom_init: Fixup missing #size-cells on PowerMac media-bay nodes
  ...
…rnel/git/remoteproc/linux

Pull remoteproc fixes from Bjorn Andersson:

 - Correct the early return from the i.MX remoteproc prepare
   operation, which prevented the platform-specific prepare
   function from being reached

 - Ensure that the Mediatek SCP clock is released during system
   suspend after the recent refactoring to avoid issues with the
   clock framework's prepare lock.

 - Correct the type of the subsys_name_len field in the sysmon
   event QMI message, as the recent introduction of big endian
   support in the QMI encoder highlighted the type mismatch and
   resulted in a failure to encode the message

 - Roll back the devm_ioremap_resource_wc() to a devm_ioremap_wc()
   in the Qualcomm WCNSS remoteproc driver, after reports that
   requesting this resource fails on some platforms

* tag 'rproc-v7.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
  remoteproc: imx_rproc: Fix unreachable platform prepare_ops
  remoteproc: mediatek: Unprepare SCP clock during system suspend
  remoteproc: sysmon: Correct subsys_name_len type in QMI request
  remoteproc: qcom_wcnss: Fix reserved region mapping failure
When refill_sheaf() partially fills one sheaf (e.g., fills 5 objects
but need to fill 10), it will update sheaf->size and return -ENOMEM.
However, the callers (alloc_full_sheaf() and __pcs_replace_empty_main())
directly call free_empty_sheaf() on failure, which only does kfree(sheaf),
causing the partially allocated objects memory in sheaf->objects[] leaked.

Fix this by calling sheaf_flush_unused() before free_empty_sheaf() to
free objects of sheaf->objects[]. And also add a WARN_ON() in
free_empty_sheaf() to catch any future cases where a non-empty sheaf is
being freed.

Fixes: ed30c4a ("slab: add optimized sheaf refill from partial list")
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Link: https://patch.msgid.link/20260311093617.4155965-1-wangqing7171@gmail.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
…kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 7.0, take #2

- Fix a couple of low-severity bugs in our S2 fault handling path,
  affecting the recently introduced LS64 handling and the even more
  esoteric handling of hwpoison in a nested context

- Address yet another syzkaller finding in the vgic initialisation,
  were we would end-up destroying an uninitialised vgic, with nasty
  consequences

- Address an annoying case of pKVM failing to boot when some of the
  memblock regions that the host is faulting in are not page-aligned

- Inject some sanity in the NV stage-2 walker by checking the limits
  against the advertised PA size, and correctly report the resulting
  faults

- Drop an unnecessary ISB when emulating an EL2 S1 address translation
… into HEAD

KVM/riscv fixes for 7.0, take #1

- Prevent speculative out-of-bounds access using array_index_nospec()
  in APLIC interrupt handling, ONE_REG regiser access, AIA CSR access,
  float register access, and PMU counter access
- Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(),
  kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr()
- Fix potential null pointer dereference in kvm_riscv_vcpu_aia_rmw_topei()
- Fix off-by-one array access in SBI PMU
- Skip THP support check during dirty logging
- Fix error code returned for Smstateen and Ssaia ONE_REG interface
- Check host Ssaia extension when creating AIA irqchip
… into HEAD

KVM generic changes for 7.0

 - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being
   unnecessary and confusing, triggered compiler warnings due to
   -Wflex-array-member-not-at-end.

 - Document that vcpu->mutex is take outside of kvm->slots_lock and
   kvm->slots_arch_lock, which is intentional and desirable despite being
   rather unintuitive.
…kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 7.0, take #3

- Correctly handle deeactivation of out-of-LRs interrupts by
  starting the EOIcount deactivation walk *after* the last irq
  that made it into an LR. This avoids deactivating irqs that
  are in the LRs and that the vcpu hasn't deactivated yet.

- Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when
  pKVM is already enabled -- not only thhis isn't possible (pKVM
  will reject the call), but it is also useless: this can only
  happen for a CPU that has already booted once, and the capability
  will not change.
Increase 'maxnode' when using 'get_mempolicy' syscall in guest_memfd
mmap and NUMA policy tests to fix a failure on one Intel GNR platform.

On a CXL-capable platform, the memory affinity of CXL memory regions may
not be covered by the SRAT.  Since each CXL memory region is enumerated
via a CFMWS table, at early boot the kernel parses all CFMWS tables to
detect all CXL memory regions and assigns a 'faked' NUMA node for each
of them, starting from the highest NUMA node ID enumerated via the SRAT.

This increases the 'nr_node_ids'.  E.g., on the aforementioned Intel GNR
platform which has 4 NUMA nodes and 18 CFMWS tables, it increases to 22.

This results in the 'get_mempolicy' syscall failure on that platform,
because currently 'maxnode' is hard-coded to 8 but the 'get_mempolicy'
syscall requires the 'maxnode' to be not smaller than the 'nr_node_ids'.

Increase the 'maxnode' to the number of bits of 'nodemask', which is
'unsigned long', to fix this.

This may not cover all systems.  Perhaps a better way is to always set
the 'nodemask' and 'maxnode' based on the actual maximum NUMA node ID on
the system, but for now just do the simple way.

Reported-by: Yi Lai <yi1.lai@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221014
Closes: https://lore.kernel.org/all/bug-221014-28872@https.bugzilla.kernel.org%2F
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Link: https://patch.msgid.link/20260302205158.178058-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
… type

Fix a build error in kvmppc_e500_tlb_init() that was introduced by the
conversion to use kzalloc_objs(), as KVM confusingly uses the size of the
structure that is one and only field in tlbe_priv:

  arch/powerpc/kvm/e500_mmu.c:923:33: error: assignment to 'struct tlbe_priv *'
    from incompatible pointer type 'struct tlbe_ref *' [-Wincompatible-pointer-types]
  923 |         vcpu_e500->gtlb_priv[0] = kzalloc_objs(struct tlbe_ref,
      |                                 ^

KVM has been flawed since commit 0164c0f ("KVM: PPC: e500: clear up
confusion between host and guest entries"), but the issue went unnoticed
until kmalloc_obj() came along and enforced types, as "struct tlbe_priv"
was just a wrapper of "struct tlbe_ref" (why on earth the two ever existed
separately...).

Fixes: 69050f8 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Cc: Kees Cook <kees@kernel.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260303190339.974325-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Complete the ~13 year journey started by commit 47bf379
("kvm/ppc/e500: eliminate tlb_refs"), and actually remove "struct
tlbe_ref".

No functional change intended (verified disassembly of e500_mmu.o and
e500_mmu_host.o is identical before and after).

Link: https://patch.msgid.link/20260303190339.974325-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM incorrectly synthesizes CPUID bits for KVM-only leaves, as the
following branch in kvm_cpu_cap_init() is never taken:

    if (leaf < NCAPINTS)
        kvm_cpu_caps[leaf] &= kernel_cpu_caps[leaf];

This means that bits set via SYNTHESIZED_F() for KVM-only leaves are
unconditionally set. This for example can cause issues for SEV-SNP
guests running on Family 19h CPUs, as TSA_SQ_NO and TSA_L1_NO are
always enabled by KVM in 80000021[ECX]. When userspace issues a
SNP_LAUNCH_UPDATE command to update the CPUID page for the guest, SNP
firmware will explicitly reject the command if the page sets sets these
bits on vulnerable CPUs.

To fix this, check in SYNTHESIZED_F() that the corresponding X86
capability is set before adding it to to kvm_cpu_cap_features.

Fixes: 31272ab ("KVM: SVM: Advertise TSA CPUID bits to guests")
Link: https://lore.kernel.org/all/20260208164233.30405-1-clopez@suse.de/
Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://patch.msgid.link/20260209153108.70667-2-clopez@suse.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In KVM guests with Hyper-V hypercalls enabled, the hypercalls
HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX
allow a guest to request invalidation of portions of a virtual TLB.
For this, the hypercall parameter includes a list of GVAs that are supposed
to be invalidated.

Currently, only the base GVA is checked to be canonical. In reality, this
check needs to be performed for the entire range of GVAs, as checking only
the base GVA enables guests running on Intel hardware to trigger a
WARN_ONCE in the host (see Fixes commit below).

Move the check for non-canonical addresses to be performed for every GVA
of the supplied range to avoid the splat, and to be more in line with the
Hyper-V specification, since, although unlikely, a range starting with an
invalid GVA may still contain GVAs that are valid.

Fixes: fa787ac ("KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush")
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/00a7a31b-573b-4d92-91f8-7d7e2f88ea48@tum.de
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
maharmstone and others added 11 commits March 18, 2026 08:05
There is a potential use-after-free in move_existing_remap(): we're calling
btrfs_put_block_group() on dest_bg, then passing it to
btrfs_add_block_group_free_space() a few lines later.

Fix this by getting the BG at the start of the function and putting it
near the end. This also means we're not doing a lookup twice for the
same thing.

Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125123908.2096548-1-clm@meta.com/
Fixes: bbea42d ("btrfs: move existing remaps before relocating block group")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…unks()

Fix a potential segfault in balance_remap_chunks(): if we quit early
because btrfs_inc_block_group_ro() fails, all the remaining items in the
chunks list will still have their bg value set to NULL. It's thus not
safe to dereference this pointer without checking first.

Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125120717.1578828-1-clm@meta.com/
Fixes: 81e5a45 ("btrfs: allow balancing remap tree")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce checks for FREE_SPACE_INFO item, which include:

- Key alignment check
  The objectid is the logical bytenr of the chunk/bg, and offset is the
  length of the chunk/bg, thus they should all be aligned to the fs
  block size.

- Item size check
  The FREE_SPACE_INFO should a fix size.

- Flags check
  The flags member should have no other flags than
  BTRFS_FREE_SPACE_USING_BITMAPS.

  For future expansion, introduce a new macro
  BTRFS_FREE_SPACE_FLAGS_MASK for such checks.

  And since we're here, the BTRFS_FREE_SPACE_USING_BITMAPS should not
  use unsigned long long, as the flags is only 32 bits wide.
  So fix that to use unsigned long.

- Extent count check
  That member shows how many free space bitmap/extent items there are
  inside the chunk/bg.

  We know the chunk size (from key->offset), thus there should be at
  most (key->offset >> sectorsize_bits) blocks inside the chunk.
  Use that value as the upper limit and if that counter is larger than
  that, there is a high chance it's a bitflip in high bits.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce FREE_SPACE_EXTENT checks, which include:

- The key alignment check
  The objectid is the logical bytenr of the free space, and offset is the
  length of the free space, thus they should all be aligned to the fs
  block size.

- The item size check
  The FREE_SPACE_EXTENT item should have a size of zero.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce checks for FREE_SPACE_BITMAP item, which include:

- Key alignment check
  Same as FREE_SPACE_EXTENT, the objectid is the logical bytenr of the
  free space, and offset is the length of the free space, so both
  should be aligned to the fs block size.

- Non-zero range check
  A zero key->offset would describe an empty bitmap, which is invalid.

- Item size check
  The item must hold exactly DIV_ROUND_UP(key->offset >> sectorsize_bits,
  BITS_PER_BYTE) bytes.  A mismatch indicates a truncated or otherwise
  corrupt bitmap item; without this check, the bitmap loading path would
  walk past the end of the leaf and trigger a NULL dereference in
  assert_eb_folio_uptodate().

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When recovering relocation at mount time, merge_reloc_root() and
btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against
an impossible state: a non-zero drop_progress combined with a zero
drop_level in a root_item, which can be triggered:

------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1545!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545
Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000
Call Trace:
 merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861
 btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195
 btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130
 open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640
 btrfs_fill_super fs/btrfs/super.c:987 [inline]
 btrfs_get_tree_super fs/btrfs/super.c:1951 [inline]
 btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline]
 btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128
 vfs_get_tree+0x9a/0x370 fs/super.c:1758
 fc_mount fs/namespace.c:1199 [inline]
 do_new_mount_fc fs/namespace.c:3642 [inline]
 do_new_mount fs/namespace.c:3718 [inline]
 path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
 do_mount fs/namespace.c:4041 [inline]
 __do_sys_mount fs/namespace.c:4229 [inline]
 __se_sys_mount fs/namespace.c:4206 [inline]
 __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
 ...
RIP: 0033:0x7f969c9a8fde
Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f
---[ end trace 0000000000000000 ]---

The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic
metadata fuzzing tool that corrupts btrfs metadata at runtime.

[CAUSE]
A non-zero drop_progress.objectid means an interrupted
btrfs_drop_snapshot() left a resume point on disk, and in that case
drop_level must be greater than 0 because the checkpoint is only
saved at internal node levels.

Although this invariant is enforced when the kernel writes the root
item, it is not validated when the root item is read back from disk.
That allows on-disk corruption to provide an invalid state with
drop_progress.objectid != 0 and drop_level == 0.

When relocation recovery later processes such a root item,
merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The
same invalid metadata can also trigger the corresponding BUG_ON() in
btrfs_drop_snapshot().

[FIX]
Fix this by validating the root_item invariant in tree-checker when
reading root items from disk: if drop_progress.objectid is non-zero,
drop_level must also be non-zero. Reject such malformed metadata with
-EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot()
and triggers the BUG_ON.

After the fix, the same corruption is correctly rejected by tree-checker
and the BUG_ON is no longer triggered.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
check.

IS_ERR_OR_NULL() already uses likely(!ptr) internally. checkpatch does
not like nesting it:
> WARNING: nested (un)?likely() calls, IS_ERR_OR_NULL already uses
> unlikely() internally
Remove the explicit use of likely().

Change generated with coccinelle.

Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
read_extent_buffer_pages_nowait() returns immediately when an extent
buffer is already marked uptodate. On that cache-hit path,
the caller supplied btrfs_tree_parent_check is not re-run.

This can let read_tree_root_path() accept a cached tree block whose
actual header level/owner does not match the expected value derived from
the parent.

E.g. a corrupted root item that points to a tree block which doesn't
even belong to that root, and has mismatching level/owner.

But that tree block is already read and cached, later the corrupted tree
root got read from disk and hit the cached tree block.

Fix this by re-validating cached extent buffers against the supplied
btrfs_tree_parent_check on the uptodate path, and make
read_tree_root_path() pass its check to btrfs_buffer_uptodate().

This makes cache hits and fresh reads follow the same tree-parent
verification rules, and turns the corruption into a read failure instead
of constructing an inconsistent root object.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Resolve the conflict with extent_buffer_uptodate() helper, handle
  transid mismatch case ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike other flags used in btrfs, BTRFS_ORDERED_* macros are different as
they cannot be directly used as flags.

They are defined as bit values, thus they should be utilized with
bit operations, not directly with logical operations.

Unfortunately sometimes I forgot this and passed the incorrect flags
to alloc_ordered_extent() and hit weird bugs.

Enhance the type checks in alloc_ordered_extent():

- Make sure there is one and only one bit set for exclusive type flags
  There are four exclusive type flags, REGULAR, NOCOW, PREALLOC and
  COMPRESSED.
  So introduce a new macro, BTRFS_ORDERED_EXCLUSIVE_FLAGS, to cover
  above flags.

  Add an ASSERT() to check one and only one of those exclusive flags can
  be set for alloc_ordered_extent().

- Re-order the type bit numbers to the end of the enum
  This is make it much harder to get a valid false negative.

  E.g., with the old code BTRFS_ORDERED_REGULAR starts at zero, we can
  have the following flags passing the bit uniqueness check:

  * BTRFS_ORDERED_NOCOW
    Be treated as BTRFS_ORDERED_REGULAR (1 == 1UL << 0).

  * BTRFS_ORDERED_PREALLOC
    Be treated as BTRFS_ORDERED_NOCOW (2 == 1UL << 1).

  * BTRFS_ORDERED_DIRECT
    Be treated as BTRFS_ORDERED_PREALLOC (4 == 1UL << 2).

  Now all those types start at 8, passing any of those bit numbers as
  flags directly will not pass the ASSERT().

- Add a static assert to avoid overflow
  To make sure all BTRFS_ORDERED_* flags can fit into an unsigned long.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During development of a new feature, I triggered that btrfs_panic()
inside insert_ordered_extent() and spent quite some unnecessary before
noticing I'm passing incorrect flags when creating a new ordered extent.

Unfortunately the existing error message is not providing much help.

Enhance the output to provide file offset, num bytes and flags of both
existing and new ordered extents.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That parameter was introduced by commit b9fab91 ("Btrfs: avoid
sleeping in verify_parent_transid while atomic").
At that time we needed to lock the extent buffer range inside the io tree
to avoid content changes, thus it could sleep.

But that behavior is no longer there, as later commit 9e2aff9
("btrfs: stop using lock_extent in btrfs_buffer_uptodate") dropped the
io tree lock.

We can remove the @atomic parameter safely now.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since commit 3d74a75 ("btrfs: zlib: introduce zlib_compress_bio()
helper"), there are some reports about different crashes in zlib
compression path. One of the symptoms is list corruption like the
following:

  list_del corruption. next->prev should be fffffbb340204a08, but was ffff8d6517cb7de0. (next=fffffbb3402d62c8)
  ------------[ cut here ]------------
  kernel BUG at lib/list_debug.c:65!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 1 UID: 0 PID: 21436 Comm: kworker/u16:7 Not tainted 7.0.0-rc2-jcg+ #1 PREEMPT
  Hardware name: LENOVO 10VGS02P00/3130, BIOS M1XKT57A 02/10/2022
  Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
  RIP: 0010:__list_del_entry_valid_or_report+0xec/0xf0
  Call Trace:
   <TASK>
   btrfs_alloc_compr_folio+0xae/0xc0 [btrfs]
   zlib_compress_bio+0x39d/0x6a0 [btrfs]
   btrfs_compress_bio+0x2e3/0x3d0 [btrfs]
   compress_file_range+0x2b0/0x660 [btrfs]
   btrfs_work_helper+0xdb/0x3e0 [btrfs]
   process_one_work+0x192/0x3d0
   worker_thread+0x19a/0x310
   kthread+0xdf/0x120
   ret_from_fork+0x22e/0x310
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

Other symptoms include VM_BUG_ON() during folio_put() but it's rarer.

David Sterba firstly reported this during his CI runs but unfortunately
I'm unable to hit it.

Meanwhile zstd/lzo doesn't seem to have the same problem.

[CAUSE]
During zlib_compress_bio() every time the output buffer is full, we
queue the full folio into the compressed bio, and allocate a new folio
as the output folio.

After the input has finished, we loop through zlib_deflate() with
Z_FINISH to flush all output.

And when that is done, we still need to check if the last folio has any
content, and if so we still need to queue that part into the compressed
bio.

The problem is in the final folio handling, if the final folio is full
(for x86_64 the folio size is 4K), the length to queue is calculated by

  u32 cur_len = offset_in_folio(out_folio, workspace->strm.total_out);

But since total_out is 4K aligned, the resulted @cur_len will be 0, then
we hit the bio_add_folio(), which has a quirk that if bio_add_folio()
got an length 0, it will still queue the folio into the bio, but return
false.

In that case we go to out: tag, which calls btrfs_free_compr_folio() to
release @out_folio, which may put the out folio into the btrfs global
pool list.

On the other hand, that @out_folio is already added to the
compressed bio, and will later be released again by
cleanup_compressed_bio(), which results double release.

And if this time we still need to put the folio into the btrfs global
pool list, it will result a list corruption because it's already in the
list.

[FIX]
Instead of offset_inside_folio(), directly use the difference between
strm.total_out and bi_size.
So that if the last folio is completely full, we can still properly
queue the full folio other than queueing zero byte.

Fixes: 3d74a75 ("btrfs: zlib: introduce zlib_compress_bio() helper")
Reported-by: David Sterba <dsterba@suse.com>
Reported-by: Jean-Christophe Guillain <jean-christophe@guillain.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221176
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Goldwyn Rodrigues and others added 16 commits March 18, 2026 12:18
…_sync_file()

If overlay is used on top of btrfs, dentry->d_sb translates to overlay's
super block and fsid assignment will lead to a crash.

Use file_inode(file)->i_sb to always get btrfs_sb.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…o tree

When we are clearing all the bits from the last record that contains the
target range (i.e. the record starts before our target range and ends
beyond it), we are doing a lot of unnecessary work:

1) Allocating a prealloc state if we don't have one already;

2) Adjust that last record's start offset to the end of our range and
   make the prealloc state have a range going from the original start
   offset of that last record to the end offset of our target range and
   the same bits as the last record. Then we insert the prealloc extent
   in the rbtree - this is done in split_state();

3) Remove our prealloc state from the rbtree since all the bits were
   cleared - this is done in clear_state_bit().

This is only wasting time when we can simply trim the last record so
that it's start offset is adjust to the end of the target range. So
optimize for that case and avoid the prealloc state allocation, insertion
and deletion from the rbtree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
…rting

When extent_io_tree_panic() is called we get a stace trace that is not
very useful since the error message reports the location inside the
extent_io_tree_panic() function and not in the caller of the function.

Example:

  [ 7830.424291] BTRFS critical (device sdb): panic in extent_io_tree_panic:334: extent io tree error on add_extent_changeset state start 4083712 end 4112383 (errno=1 unknown)
  [ 7830.426816] ------------[ cut here ]------------
  [ 7830.427581] kernel BUG at fs/btrfs/extent-io-tree.c:334!
  [ 7830.428495] Oops: invalid opcode: 0000 [#1] SMP PTI
  [ 7830.429318] CPU: 5 UID: 0 PID: 1451600 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full)
  [ 7830.430899] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [ 7830.432771] RIP: 0010:extent_io_tree_panic+0x41/0x43 [btrfs]
  [ 7830.433815] Code: 75 0a 48 8b (...)
  [ 7830.436849] RSP: 0018:ffffd2334f4a3b68 EFLAGS: 00010246
  [ 7830.437668] RAX: 0000000000000000 RBX: 00000000003ebfff RCX: 0000000000000000
  [ 7830.438801] RDX: ffffffffc08d4368 RSI: ffffffffbb6ce475 RDI: ffff896501d6b780
  [ 7830.439671] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff
  [ 7830.440575] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
  [ 7830.441458] R13: ffff896547374c08 R14: 00000000003effff R15: ffff896547374c08
  [ 7830.442333] FS:  00007f3e252af0c0(0000) GS:ffff896c6185d000(0000) knlGS:0000000000000000
  [ 7830.443326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 7830.444047] CR2: 00007f3e252ad000 CR3: 0000000113b0a004 CR4: 0000000000370ef0
  [ 7830.444905] Call Trace:
  [ 7830.445229]  <TASK>
  [ 7830.445557]  btrfs_clear_extent_bit_changeset.cold+0x43/0x80 [btrfs]
  [ 7830.446543]  btrfs_clear_record_extent_bits+0x19/0x20 [btrfs]
  [ 7830.447308]  qgroup_free_reserved_data+0xf9/0x170 [btrfs]
  [ 7830.448040]  btrfs_buffered_write+0x368/0x8e0 [btrfs]
  [ 7830.448707]  btrfs_direct_write+0x1a5/0x480 [btrfs]
  [ 7830.449396]  btrfs_do_write_iter+0x18c/0x210 [btrfs]
  [ 7830.450167]  vfs_write+0x21f/0x450
  [ 7830.450662]  ksys_write+0x5f/0xd0
  [ 7830.451092]  do_syscall_64+0xe9/0xf20
  [ 7830.451610]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Change extent_io_tree_panic() to a macro so that we get a report that
gives the exact place where the error happens.

Example after this change:

  [63677.406061] BTRFS critical (device sdc): panic in btrfs_clear_extent_bit_changeset:744: extent io tree error on add_extent_changeset state start 1818624 end 1830911 (errno=1 unknown)
  [63677.410055] ------------[ cut here ]------------
  [63677.410910] kernel BUG at fs/btrfs/extent-io-tree.c:744!
  [63677.411918] Oops: invalid opcode: 0000 [#1] SMP PTI
  [63677.413032] CPU: 0 UID: 0 PID: 13028 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full)
  [63677.415139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [63677.417283] RIP: 0010:btrfs_clear_extent_bit_changeset.cold+0xcd/0x10c [btrfs]
  [63677.418676] Code: 8b 37 48 8b (...)
  [63677.421917] RSP: 0018:ffffd2290a417b30 EFLAGS: 00010246
  [63677.422824] RAX: 0000000000000000 RBX: 00000000001befff RCX: 0000000000000000
  [63677.424320] RDX: ffffffffc0970348 RSI: ffffffffa92ce475 RDI: ffff8897ded9dc80
  [63677.429772] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff
  [63677.430787] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
  [63677.431818] R13: ffff8897966655d8 R14: 00000000001bffff R15: ffff8897966655d8
  [63677.432764] FS:  00007f5c074c50c0(0000) GS:ffff889ef3b1d000(0000) knlGS:0000000000000000
  [63677.433940] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [63677.434787] CR2: 00007f5c074c3000 CR3: 000000014b9de002 CR4: 0000000000370ef0
  [63677.435960] Call Trace:
  [63677.436432]  <TASK>
  [63677.436838]  btrfs_clear_record_extent_bits+0x19/0x20 [btrfs]
  [63677.437980]  qgroup_free_reserved_data+0xf9/0x170 [btrfs]
  [63677.439070]  btrfs_buffered_write+0x368/0x8e0 [btrfs]
  [63677.439889]  btrfs_do_write_iter+0x1a8/0x210 [btrfs]
  [63677.441460]  do_iter_readv_writev+0x145/0x240
  [63677.446309]  vfs_writev+0x120/0x3b0
  [63677.446878]  ? __do_sys_newfstat+0x33/0x60
  [63677.447759]  ? do_pwritev+0x8a/0xd0
  [63677.449119]  do_pwritev+0x8a/0xd0
  [63677.452342]  do_syscall_64+0xe9/0xf20
  [63677.452961]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
It's unexpected to ever call extent_io_tree_panic() so surround with
'unlikely' every if statement condition that leads to it, making it
explicit to a reader and to hint the compiler to potentially generate
better code.

On x86_64, using gcc 14.2.0-19 from Debian, this resulted in a slightly
decrease of the btrfs module's text size.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1999832	 174320	  15592	2189744	 2169b0	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1999768	 174320	  15592	2189680	 216970	fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Currently add_extent_changeset() always returns the return value from its
call to ulist_add(), which can return an error, 0 or 1. There are no
callers that care about the difference between 0 and 1 and all except one
of them, check for negative values and ignore other values, but there is
another caller (btrfs_clear_extent_bit_changeset()) that must set its
'ret' variable to 0 after calling add_extent_changeset(), so that it
does not return an unexpected value of 1 to its caller.

So change add_extent_changeset() to only return errors or 0, avoiding
that caller (and any future callers) from having to deal with a return
value of 1.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
There's no need to call BUG_ON(), instead call extent_io_tree_panic(),
which also calls BUG(), but it prints an additional error message with
some useful information before hitting BUG().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
The argument is used as a boolean but it's defined as an integer.
Switch it to a boolean for better readability.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
There's no need to pass the 'wake' parameter, we can determine if we have
to wake up waiters by checking if EXTENT_LOCK_BITS is set in the bits to
clear. So simplify things and remove the parameter.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Whenever clearing the extent lock bits of an extent state record, we
unconditionally call wake_up() on the state's waitqueue. Most of the
time there are no waiters on the queue so we are just wasting time calling
wake_up(), since that requires locking and unlocking the queue's spinlock,
disable and re-enable interrupts, function calls, and other minor overhead
while we are holding a critical section delimited by the extent io tree's
spinlock.

So call wake_up() only if there are waiters on an extent state's wait
queue.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
There's no need to free the cached extent state record while holding the
io tree's spinlock, it's just making the critical section longer than it
needs to be. So just do it after unlocking the io tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
We are not expected ever to split an extent state record that is not in
the rbtree, as every record we pass to split_state() was found by
iterating the rbtree, so if that ever happens it means we are not holding
the extent io tree's spinlock or we have some memory corruption.

Instead of simply warning in case the extent state record passed to
split_state() is not in the rbtree, panic as this is a serious problem.
Also tag as unlikely the case where the record is not in the rbtree.

This also makes a tiny reduction the btrfs module's text size.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2000080	 174328	  15592	2190000	 216ab0	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2000064	 174328	  15592	2189984	 216aa0	fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
When we are clearing all the bits from the first record that contains the
target range and that record ends at or before our target range but starts
before our target range, we are doing a lot of unnecessary work:

1) Allocating a prealloc state if we don't have one already;

2) Adjust that record's start offset to the start of our range and
   make the prealloc state have a range going from the original start
   offset of that first record to the start offset of our target range,
   and with the same bits as that first record. Then we insert the
   prealloc extent in the rbtree - this is done in split_state();

3) Remove our adjusted first state from the rbtree since all the bits
   were cleared - this is done in clear_state_bit().

This is only wasting time when we can simply trim that first record, so
that it represents the range from its start offset to the start offset of
our target range. So optimize for that case and avoid the prealloc state
allocation, insertion and deletion from the rbtree.

This patch is the last patch of a patchset comprised of the following
patches (in descending order):

  btrfs: optimize clearing all bits from first extent record in an io tree
  btrfs: panic instead of warn when splitting extent state not in the tree
  btrfs: free cached state outside critical section in wait_extent_bit()
  btrfs: avoid unnecessary wake ups on io trees when there are no waiters
  btrfs: remove wake parameter from clear_state_bit()
  btrfs: change last argument of add_extent_changeset() to boolean
  btrfs: use extent_io_tree_panic() instead of BUG_ON()
  btrfs: make add_extent_changeset() only return errors or success
  btrfs: tag as unlikely branches that call extent_io_tree_panic()
  btrfs: turn extent_io_tree_panic() into a macro for better error reporting
  btrfs: optimize clearing all bits from the last extent record in an io tree

The following fio script was used to measure performance before and after
applying all the patches:

  $ cat ./fio-io-uring-2.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS=""

  if [ $# -ne 3 ]; then
      echo "Use $0 NUM_JOBS FILE_SIZE RUN_TIME"
      exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  RUN_TIME=$3

  cat <<EOF > /tmp/fio-job.ini
  [io_uring_rw]
  rw=randwrite
  fsync=0
  fallocate=none
  group_reporting=1
  direct=1
  ioengine=io_uring
  fixedbufs=1
  iodepth=64
  bs=4K
  filesize=$FILE_SIZE
  runtime=$RUN_TIME
  time_based
  filename=foobar
  directory=$MNT
  numjobs=$NUM_JOBS
  thread
  EOF

  echo performance | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  echo
  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  fio /tmp/fio-job.ini

  umount $MNT

When running this script on a 12 cores machine using a 16G null block
device the results were the following:

Before patchset:

  $ ./fio-io-uring-2.sh 12 8G 60
  (...)
  WRITE: bw=74.8MiB/s (78.5MB/s), 74.8MiB/s-74.8MiB/s (78.5MB/s-78.5MB/s), io=4504MiB (4723MB), run=60197-60197msec

After patchset:

  $ ./fio-io-uring-2.sh 12 8G 60
  (...)
  WRITE: bw=82.2MiB/s (86.2MB/s), 82.2MiB/s-82.2MiB/s (86.2MB/s-86.2MB/s), io=4937MiB (5176MB), run=60027-60027msec

Also, using bpftrace to collect the duration (in nanoseconds) of all the
btrfs_clear_extent_bit_changeset() calls done during that fio test and
then making an histogram from that data, held the following results:

Before patchset:

  Count: 6304804
  Range:  0.000 - 7587172.000; Mean: 2011.308; Median: 1219.000; Stddev: 17117.533
  Percentiles:  90th: 1888.000; 95th: 2189.000; 99th: 16104.000
        0.000 -    8.098:        7 |
        8.098 -   40.385:       20 |
       40.385 -  187.254:      146 |
      187.254 -  855.347:   742048 #######
      855.347 - 3894.426:  5462542 #####################################################
     3894.426 - 17718.848:   41489 |
    17718.848 - 80604.558:   46085 |
    80604.558 - 366664.449:  11285 |
   366664.449 - 1667918.122:   961 |
  1667918.122 - 7587172.000:   113 |

After patchset:

  Count: 6282879
  Range:  0.000 - 6029290.000; Mean: 1896.482; Median: 1126.000; Stddev: 15276.691
  Percentiles:  90th: 1741.000; 95th: 2026.000; 99th: 15713.000
        0.000 - 60.014:         12 |
       60.014 - 217.984:        63 |
      217.984 - 784.949:    517515 #####
      784.949 - 2819.823:  5632335 #####################################################
     2819.823 - 10123.127:   55716 #
    10123.127 - 36335.184:   46034 |
    36335.184 - 130412.049:  25708 |
   130412.049 - 468060.350:   4824 |
   468060.350 - 1679903.189:   549 |
  1679903.189 - 6029290.000:    84 |

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
…eeing

They have no more users since commit a649684 ("btrfs: fix start
transaction qgroup rsv double free"), so remove them.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
…a_prealloc()

Since __btrfs_qgroup_free_meta() is only called by
btrfs_qgroup_free_meta_prealloc(), which is a simple inline wrapper, get
rid of the later and rename __btrfs_qgroup_free_meta() to the later.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
…ve_meta_prealloc()

Since __btrfs_qgroup_reserve_meta() is only called by
btrfs_qgroup_reserve_meta_prealloc(), which is a simple inline wrapper,
get rid of the later and rename __btrfs_qgroup_reserve_meta() to the
later.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
There's only one caller outside qgroup.c of btrfs_qgroup_reserve_meta()
and we have btrfs_qgroup_reserve_meta_prealloc() is a wrapper around
that function. Make that caller use btrfs_qgroup_reserve_meta_prealloc()
and unexport btrfs_qgroup_reserve_meta(), simplifying the external API.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.