Test for-next (regular, GH kvm) by kdave · Pull Request #1624 · btrfs/linux

kdave · 2026-03-05T18:37:18Z

Keep this open, the build tests are hosted on github CI.

kdave · 2026-03-05T18:47:56Z

The KVM tests seem to be waiting for the build workflow; they don't even start or show up in the list. Ordering could be possible.

kdave · 2026-03-05T18:48:59Z

Re #1623.

This reverts commit fde0634. This commit is being reverted as part of a series-wide revert. By deferring the net_device allocation to the bind() phase, a single function instance will spawn multiple network devices if it is symlinked to multiple USB configurations. This causes regressions for userspace tools (like the postmarketOS DHCP daemon) that rely on reading the interface name (e.g., "usb0") from configfs. Currently, configfs returns the template "usb%d", causing the userspace network setup to fail. Crucially, because this patch breaks the 1:1 mapping between the function instance and the network device, this naming issue cannot simply be patched. Configfs only exposes a single 'ifname' attribute per instance, making it impossible to accurately report the actual interface name when multiple underlying network devices can exist for that single instance. All configurations tied to the same function instance are meant to share a single network device. Revert this change to restore the 1:1 mapping by allocating the network device at the instance level (alloc_inst). Reported-by: David Heidelberg <david@ixit.cz> Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/ Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind") Cc: stable <stable@kernel.org> Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-2-ea2afbc7d9b2@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

This reverts commit 56a512a. This commit is being reverted as part of a series-wide revert. By deferring the net_device allocation to the bind() phase, a single function instance will spawn multiple network devices if it is symlinked to multiple USB configurations. This causes regressions for userspace tools (like the postmarketOS DHCP daemon) that rely on reading the interface name (e.g., "usb0") from configfs. Currently, configfs returns the template "usb%d", causing the userspace network setup to fail. Crucially, because this patch breaks the 1:1 mapping between the function instance and the network device, this naming issue cannot simply be patched. Configfs only exposes a single 'ifname' attribute per instance, making it impossible to accurately report the actual interface name when multiple underlying network devices can exist for that single instance. All configurations tied to the same function instance are meant to share a single network device. Revert this change to restore the 1:1 mapping by allocating the network device at the instance level (alloc_inst). Reported-by: David Heidelberg <david@ixit.cz> Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/ Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind") Cc: stable <stable@kernel.org> Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-3-ea2afbc7d9b2@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

…_device" This reverts commit 0c09811. This commit is being reverted as part of a series-wide revert. By deferring the net_device allocation to the bind() phase, a single function instance will spawn multiple network devices if it is symlinked to multiple USB configurations. This causes regressions for userspace tools (like the postmarketOS DHCP daemon) that rely on reading the interface name (e.g., "usb0") from configfs. Currently, configfs returns the template "usb%d", causing the userspace network setup to fail. Crucially, because this patch breaks the 1:1 mapping between the function instance and the network device, this naming issue cannot simply be patched. Configfs only exposes a single 'ifname' attribute per instance, making it impossible to accurately report the actual interface name when multiple underlying network devices can exist for that single instance. All configurations tied to the same function instance are meant to share a single network device. Revert this change to restore the 1:1 mapping by allocating the network device at the instance level (alloc_inst). Reported-by: David Heidelberg <david@ixit.cz> Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/ Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind") Cc: stable <stable@kernel.org> Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-4-ea2afbc7d9b2@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

This reverts commit 7a7930c. This commit is being reverted as part of a series-wide revert. By deferring the net_device allocation to the bind() phase, a single function instance will spawn multiple network devices if it is symlinked to multiple USB configurations. This causes regressions for userspace tools (like the postmarketOS DHCP daemon) that rely on reading the interface name (e.g., "usb0") from configfs. Currently, configfs returns the template "usb%d", causing the userspace network setup to fail. Crucially, because this patch breaks the 1:1 mapping between the function instance and the network device, this naming issue cannot simply be patched. Configfs only exposes a single 'ifname' attribute per instance, making it impossible to accurately report the actual interface name when multiple underlying network devices can exist for that single instance. All configurations tied to the same function instance are meant to share a single network device. Revert this change to restore the 1:1 mapping by allocating the network device at the instance level (alloc_inst). Reported-by: David Heidelberg <david@ixit.cz> Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/ Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind") Cc: stable <stable@kernel.org> Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-5-ea2afbc7d9b2@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

This reverts commit e065c6a. This commit is being reverted as part of a series-wide revert. By deferring the net_device allocation to the bind() phase, a single function instance will spawn multiple network devices if it is symlinked to multiple USB configurations. This causes regressions for userspace tools (like the postmarketOS DHCP daemon) that rely on reading the interface name (e.g., "usb0") from configfs. Currently, configfs returns the template "usb%d", causing the userspace network setup to fail. Crucially, because this patch breaks the 1:1 mapping between the function instance and the network device, this naming issue cannot simply be patched. Configfs only exposes a single 'ifname' attribute per instance, making it impossible to accurately report the actual interface name when multiple underlying network devices can exist for that single instance. All configurations tied to the same function instance are meant to share a single network device. Revert this change to restore the 1:1 mapping by allocating the network device at the instance level (alloc_inst). Reported-by: David Heidelberg <david@ixit.cz> Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/ Fixes: 56a512a ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind") Cc: stable <stable@kernel.org> Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-6-ea2afbc7d9b2@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

The network device outlived its parent gadget device during disconnection, resulting in dangling sysfs links and null pointer dereference problems. A prior attempt to solve this by removing SET_NETDEV_DEV entirely [1] was reverted due to power management ordering concerns and a NO-CARRIER regression. A subsequent attempt to defer net_device allocation to bind [2] broke the 1:1 mapping between the function instance and the network device, making it impossible for configfs to report the resolved interface name. This results in a regression where the DHCP server fails on pmOS. Use device_move to reparent the net_device between the gadget device and /sys/devices/virtual/ across bind/unbind cycles. This preserves the network interface across USB reconnection, allowing the DHCP server to retain its binding. Introduce gether_attach_gadget()/gether_detach_gadget() helpers and use the __free(detach_gadget) macro to undo attachment on bind failure. The bind_count ensures device_move executes only on the first bind. [1] https://lore.kernel.org/lkml/f2a4f9847617a0929d62025748384092e5f35cce.camel@crapouillou.net/ [2] https://lore.kernel.org/linux-usb/795ea759-7eaf-4f78-81f4-01ffbf2d7961@ixit.cz/ Fixes: 40d133d ("usb: gadget: f_ncm: convert to new function interface with backward compatibility") Cc: stable <stable@kernel.org> Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-7-ea2afbc7d9b2@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
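The role of bind_count can be pictured with a small userspace sketch. This is not the driver code: the struct, the demo_* helpers and the parent enum are hypothetical, and moving the device back to the virtual parent on the last unbind is an assumption inferred from "across bind/unbind cycles".

```c
/* Hypothetical parent identifiers for the sketch. */
enum demo_parent { PARENT_VIRTUAL, PARENT_GADGET };

struct demo_netdev {
	int bind_count;          /* number of configurations currently bound */
	enum demo_parent parent; /* where the device sits in the hierarchy */
	int moves;               /* how many reparent operations happened */
};

/* Stand-in for device_move(): record the new parent and count the move. */
static void demo_move(struct demo_netdev *nd, enum demo_parent to)
{
	nd->parent = to;
	nd->moves++;
}

/* Reparent under the gadget only when the first configuration binds. */
static void demo_attach_gadget(struct demo_netdev *nd)
{
	if (nd->bind_count++ == 0)
		demo_move(nd, PARENT_GADGET);
}

/* Assumption: move back to the virtual parent when the last one unbinds. */
static void demo_detach_gadget(struct demo_netdev *nd)
{
	if (--nd->bind_count == 0)
		demo_move(nd, PARENT_VIRTUAL);
}
```

With two configurations bound to the same instance, the sketch performs one move on the first attach and one on the last detach, so the single network device is shared throughout.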

This reverts commit 1366cd2. The fwnode_usb_role_switch_get() returns NULL only if no connection is found, returns ERR_PTR(-EPROBE_DEFER) if connection is found but deferred probe is needed, or a valid pointer of usb_role_switch. When switching from a NULL check to IS_ERR_OR_NULL(), usb_role_switch_get() returns NULL and overwrites the ERR_PTR(-EPROBE_DEFER) returned by fwnode_usb_role_switch_get(). This causes the deferred probe indication to be lost, preventing the USB role switch from ever being retrieved. Fixes: 1366cd2 ("tcpm: allow looking for role_sw device in the main node") Cc: stable <stable@kernel.org> Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Tested-by: Arnaud Ferraris <arnaud.ferraris@collabora.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://patch.msgid.link/20260309074313.2809867-2-xu.yang_2@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
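A minimal userspace sketch of how the deferral gets lost. The ERR_PTR helpers are re-implemented here for illustration, and lookup_fwnode(), lookup_fallback() and the two get_role_sw_*() functions are hypothetical stand-ins, not the tcpm or roles code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517 /* kernel-internal errno, absent from userspace headers */
#endif
#define MAX_ERRNO 4095

/* Userspace re-implementations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}
static inline int IS_ERR_OR_NULL(const void *p) { return !p || IS_ERR(p); }

/* Hypothetical stand-ins: the first lookup defers, the fallback finds nothing. */
static void *lookup_fwnode(void) { return ERR_PTR(-EPROBE_DEFER); }
static void *lookup_fallback(void) { return NULL; }

/* Reverted behaviour: an error pointer also triggers the fallback. */
static void *get_role_sw_buggy(void)
{
	void *sw = lookup_fwnode();

	if (IS_ERR_OR_NULL(sw))
		sw = lookup_fallback(); /* overwrites ERR_PTR(-EPROBE_DEFER) */
	return sw;
}

/* Restored behaviour: only a NULL result triggers the fallback. */
static void *get_role_sw_fixed(void)
{
	void *sw = lookup_fwnode();

	if (!sw)
		sw = lookup_fallback();
	return sw;
}
```

In the sketch the buggy variant ends up with NULL, so the caller never learns that it should retry the probe later, while the fixed variant keeps the -EPROBE_DEFER error pointer.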

usb_role_switch_is_parent() was walking up to the parent node and checking for the "usb-role-switch" property regardless of the type of the passed fwnode. This could cause unrelated device nodes to be probed as a potential role switch parent, leading to spurious matches and "-EPROBE_DEFER" being returned indefinitely. So far, only a Type-B connector node has a parent node that may present the "usb-role-switch" property and register the role switch device. For a Type-C connector node, the parent node is always a Type-C chip device, which never registers the role switch device. However, it may still present a non-boolean "usb-role-switch = <&usb_controller>" property for historical compatibility. So restrict the helper to only operate on a Type-B connector when attempting to get the role switch from the parent node. Fixes: 6fadd72 ("usb: roles: get usb-role-switch from parent") Cc: stable <stable@kernel.org> Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Tested-by: Arnaud Ferraris <arnaud.ferraris@collabora.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://patch.msgid.link/20260309074313.2809867-3-xu.yang_2@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

The LPVO USB GPIB adapter apparently uses an FTDI 8U232AM with the default PID, but this device id is already handled by the ftdi_sio serial driver. Stop binding to the default PID to avoid breaking existing setups with FTDI 8U232AM. Anyone using this driver should blacklist the ftdi_sio driver and add the device id manually through sysfs (e.g. using udev rules). Fixes: fce7951 ("staging: gpib: Add LPVO DIY USB GPIB driver") Fixes: e6ab504 ("staging: gpib: Destage gpib") Cc: Dave Penkler <dpenkler@gmail.com> Cc: stable <stable@kernel.org> Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://patch.msgid.link/20260305151729.10501-2-johan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

The DmaGspMem pointer accessor methods (gsp_write_ptr, gsp_read_ptr, cpu_read_ptr, cpu_write_ptr, advance_cpu_read_ptr, advance_cpu_write_ptr) dereference a raw pointer to DMA memory, creating an intermediate reference before calling volatile read/write methods. This is undefined behavior since DMA memory can be concurrently modified by the device. Fix this by moving the implementations into a gsp_mem module in fw.rs that uses the dma_read!() / dma_write!() macros, making the original methods on DmaGspMem thin forwarding wrappers. An alternative approach would have been to wrap the shared memory in Opaque, but that would have required even more unsafe code. Since the gsp_mem module lives in fw.rs (to access firmware-specific binding field names), GspMem, Msgq and their relevant fields are temporarily widened to pub(super). This will be reverted once IoView projections are available. Cc: Gary Guo <gary@garyguo.net> Closes: https://lore.kernel.org/nouveau/DGUT14ILG35P.1UMNRKU93JUM1@kernel.org/ Fixes: 75f6b1d ("gpu: nova-core: gsp: Add GSP command queue bindings and handling") Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Link: https://patch.msgid.link/20260309225408.27714-1-dakr@kernel.org [ Use pub(super) where possible; replace bitwise-and with modulo operator analogous to [1]. - Danilo ] Link: https://lore.kernel.org/all/20260129-nova-core-cmdq1-v3-1-2ede85493a27@nvidia.com/ [1] Signed-off-by: Danilo Krummrich <dakr@kernel.org>

Some bootloaders like recent versions of U-Boot may install some DMI properties with empty values rather than not populate them. This manages to make its way through the validator and cleanup resulting in a rogue hyphen being appended to the card longname. Fixes: 4e01e5d ("ASoC: improve the DMI long card code in asoc-core") Signed-off-by: Casey Connolly <casey.connolly@linaro.org> Link: https://patch.msgid.link/20260306174707.283071-2-casey.connolly@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org>

…l/git/powerpc/linux Pull powerpc fixes from Madhavan Srinivasan: - Correct MSI allocation tracking - Always use 64 bits PTE for powerpc/e500 - Fix inline assembly for clang build on PPC32 - Fixes for clang build issues in powerpc64/ftrace - Fixes for powerpc64/bpf JIT and tailcall support - Cleanup MPC83XX devicetrees - Fix keymile vendor prefix - Fix to use big-endian types for crash variables Thanks to Abhishek Dubey, Christophe Leroy (CS GROUP), Hari Bathini, Heiko Schocher, J. Neuschäfer, Mahesh Salgaonkar, Nam Cao, Nilay Shroff, Rob Herring (Arm), Saket Kumar Bhaskar, Sourabh Jain, Stan Johnson, and Venkat Rao Bagalkote. * tag 'powerpc-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (23 commits) powerpc/pseries: Correct MSI allocation tracking powerpc: dts: mpc83xx: Add unit addresses to /memory powerpc: dts: mpc8315erdb: Add missing #cells properties to SPI bus powerpc: dts: mpc8315erdb: Rename LED nodes to comply with schema powerpc: dts: mpc8315erdb: Use IRQ_TYPE_* macros powerpc: dts: mpc8313erdb: Use IRQ_TYPE_* macros powerpc: 83xx: km83xx: Fix keymile vendor prefix dt-bindings: powerpc: Add Freescale/NXP MPC83xx SoCs powerpc64/bpf: fix kfunc call support powerpc64/bpf: fix handling of BPF stack in exception callback powerpc64/bpf: remove BPF redzone protection in trampoline stack powerpc64/bpf: use consistent tailcall offset in trampoline powerpc64/bpf: fix the address returned by bpf_get_func_ip powerpc64/bpf: do not increment tailcall count when prog is NULL powerpc64/ftrace: workaround clang recording GEP in __patchable_function_entries powerpc64/ftrace: fix OOL stub count with clang powerpc64: make clang cross-build friendly powerpc/crash: adjust the elfcorehdr size powerpc/kexec/core: use big-endian types for crash variables powerpc/prom_init: Fixup missing #size-cells on PowerMac media-bay nodes ...

…rnel/git/remoteproc/linux Pull remoteproc fixes from Bjorn Andersson: - Correct the early return from the i.MX remoteproc prepare operation, which prevented the platform-specific prepare function from being reached - Ensure that the Mediatek SCP clock is released during system suspend after the recent refactoring to avoid issues with the clock framework's prepare lock. - Correct the type of the subsys_name_len field in the sysmon event QMI message, as the recent introduction of big endian support in the QMI encoder highlighted the type mismatch and resulted in a failure to encode the message - Roll back the devm_ioremap_resource_wc() to a devm_ioremap_wc() in the Qualcomm WCNSS remoteproc driver, after reports that requesting this resource fails on some platforms * tag 'rproc-v7.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux: remoteproc: imx_rproc: Fix unreachable platform prepare_ops remoteproc: mediatek: Unprepare SCP clock during system suspend remoteproc: sysmon: Correct subsys_name_len type in QMI request remoteproc: qcom_wcnss: Fix reserved region mapping failure

When refill_sheaf() only partially fills a sheaf (e.g., it fills 5 objects when 10 are needed), it updates sheaf->size and returns -ENOMEM. However, the callers (alloc_full_sheaf() and __pcs_replace_empty_main()) call free_empty_sheaf() directly on failure, which only does kfree(sheaf), so the memory of the partially allocated objects in sheaf->objects[] is leaked. Fix this by calling sheaf_flush_unused() before free_empty_sheaf() to free the objects in sheaf->objects[]. Also add a WARN_ON() in free_empty_sheaf() to catch any future cases where a non-empty sheaf is being freed. Fixes: ed30c4a ("slab: add optimized sheaf refill from partial list") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Link: https://patch.msgid.link/20260311093617.4155965-1-wangqing7171@gmail.com Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
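The leak and its fix can be reproduced in a userspace mock. The function names mirror the changelog, but the bodies, the struct layout, the capacity and the live_objects counter are assumptions made for this sketch, with malloc()/free() standing in for the slab allocator.

```c
#include <stdlib.h>

#define SHEAF_CAPACITY 10

struct sheaf {
	unsigned int size;             /* number of valid entries in objects[] */
	void *objects[SHEAF_CAPACITY];
};

static int live_objects; /* objects allocated and not yet freed */

/* Fill up to 'want' objects but run dry after 'avail'; mimics a partial refill. */
static int refill_sheaf(struct sheaf *s, unsigned int want, unsigned int avail)
{
	while (s->size < want) {
		if (avail == 0)
			return -1; /* stands in for -ENOMEM; s->size stays updated */
		s->objects[s->size] = malloc(16);
		if (!s->objects[s->size])
			return -1;
		s->size++;
		live_objects++;
		avail--;
	}
	return 0;
}

/* Free whatever objects the sheaf still holds. */
static void sheaf_flush_unused(struct sheaf *s)
{
	while (s->size > 0) {
		free(s->objects[--s->size]);
		live_objects--;
	}
}

/* Frees only the container; the caller must have emptied it first. */
static void free_empty_sheaf(struct sheaf *s)
{
	free(s);
}

/* Error path as fixed: flush the partial contents, then free the sheaf. */
static int alloc_full_sheaf_demo(unsigned int avail)
{
	struct sheaf *s = calloc(1, sizeof(*s));
	int ret;

	if (!s)
		return -1;
	ret = refill_sheaf(s, SHEAF_CAPACITY, avail);
	sheaf_flush_unused(s); /* without this call, a partial fill would leak */
	free_empty_sheaf(s);
	return ret;
}
```

The demo always releases the sheaf so that the counter can be inspected afterwards; the point is only that live_objects returns to zero after a partial refill once the flush precedes the free.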

…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.0, take #2 - Fix a couple of low-severity bugs in our S2 fault handling path, affecting the recently introduced LS64 handling and the even more esoteric handling of hwpoison in a nested context - Address yet another syzkaller finding in the vgic initialisation, where we would end up destroying an uninitialised vgic, with nasty consequences - Address an annoying case of pKVM failing to boot when some of the memblock regions that the host is faulting in are not page-aligned - Inject some sanity in the NV stage-2 walker by checking the limits against the advertised PA size, and correctly report the resulting faults - Drop an unnecessary ISB when emulating an EL2 S1 address translation

… into HEAD KVM/riscv fixes for 7.0, take #1 - Prevent speculative out-of-bounds access using array_index_nospec() in APLIC interrupt handling, ONE_REG register access, AIA CSR access, float register access, and PMU counter access - Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(), kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr() - Fix potential null pointer dereference in kvm_riscv_vcpu_aia_rmw_topei() - Fix off-by-one array access in SBI PMU - Skip THP support check during dirty logging - Fix error code returned for Smstateen and Ssaia ONE_REG interface - Check host Ssaia extension when creating AIA irqchip

… into HEAD KVM generic changes for 7.0 - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end. - Document that vcpu->mutex is taken outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive.

…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.0, take #3 - Correctly handle deactivation of out-of-LRs interrupts by starting the EOIcount deactivation walk *after* the last irq that made it into an LR. This avoids deactivating irqs that are in the LRs and that the vcpu hasn't deactivated yet. - Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when pKVM is already enabled -- not only is this not possible (pKVM will reject the call), but it is also useless: this can only happen for a CPU that has already booted once, and the capability will not change.

Increase 'maxnode' when using 'get_mempolicy' syscall in guest_memfd mmap and NUMA policy tests to fix a failure on one Intel GNR platform. On a CXL-capable platform, the memory affinity of CXL memory regions may not be covered by the SRAT. Since each CXL memory region is enumerated via a CFMWS table, at early boot the kernel parses all CFMWS tables to detect all CXL memory regions and assigns a 'faked' NUMA node for each of them, starting from the highest NUMA node ID enumerated via the SRAT. This increases the 'nr_node_ids'. E.g., on the aforementioned Intel GNR platform which has 4 NUMA nodes and 18 CFMWS tables, it increases to 22. This results in the 'get_mempolicy' syscall failure on that platform, because currently 'maxnode' is hard-coded to 8 but the 'get_mempolicy' syscall requires the 'maxnode' to be not smaller than the 'nr_node_ids'. Increase the 'maxnode' to the number of bits of 'nodemask', which is 'unsigned long', to fix this. This may not cover all systems. Perhaps a better way is to always set the 'nodemask' and 'maxnode' based on the actual maximum NUMA node ID on the system, but for now just do the simple way. Reported-by: Yi Lai <yi1.lai@intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221014 Closes: https://lore.kernel.org/all/bug-221014-28872@https.bugzilla.kernel.org%2F Signed-off-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com> Link: https://patch.msgid.link/20260302205158.178058-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
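The arithmetic behind the fix, as a small sketch. The helper names are hypothetical, the selftest code itself is not shown in the changelog, and the "maxnode must not be smaller than nr_node_ids" rule is taken from the description above.

```c
#include <limits.h>

/* maxnode expressed as the bit width of a single 'unsigned long' nodemask. */
static unsigned long nodemask_maxnode(void)
{
	return sizeof(unsigned long) * CHAR_BIT;
}

/* Per the changelog, get_mempolicy() needs maxnode >= nr_node_ids. */
static int maxnode_is_sufficient(unsigned long maxnode, unsigned long nr_node_ids)
{
	return maxnode >= nr_node_ids;
}
```

On the platform described (nr_node_ids of 22), the old hard-coded value of 8 is too small, while the bit width of an unsigned long is large enough. A system with more node IDs than bits in an unsigned long would still fail, which is the limitation the changelog acknowledges.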

… type Fix a build error in kvmppc_e500_tlb_init() that was introduced by the conversion to use kzalloc_objs(), as KVM confusingly uses the size of the structure that is the one and only field in tlbe_priv: arch/powerpc/kvm/e500_mmu.c:923:33: error: assignment to 'struct tlbe_priv *' from incompatible pointer type 'struct tlbe_ref *' [-Wincompatible-pointer-types] 923 | vcpu_e500->gtlb_priv[0] = kzalloc_objs(struct tlbe_ref, | ^ KVM has been flawed since commit 0164c0f ("KVM: PPC: e500: clear up confusion between host and guest entries"), but the issue went unnoticed until kmalloc_obj() came along and enforced types, as "struct tlbe_priv" was just a wrapper of "struct tlbe_ref" (why on earth the two ever existed separately...). Fixes: 69050f8 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") Cc: Kees Cook <kees@kernel.org> Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Link: https://patch.msgid.link/20260303190339.974325-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Complete the ~13 year journey started by commit 47bf379 ("kvm/ppc/e500: eliminate tlb_refs"), and actually remove "struct tlbe_ref". No functional change intended (verified disassembly of e500_mmu.o and e500_mmu_host.o is identical before and after). Link: https://patch.msgid.link/20260303190339.974325-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM incorrectly synthesizes CPUID bits for KVM-only leaves, as the following branch in kvm_cpu_cap_init() is never taken: if (leaf < NCAPINTS) kvm_cpu_caps[leaf] &= kernel_cpu_caps[leaf]; This means that bits set via SYNTHESIZED_F() for KVM-only leaves are unconditionally set. This for example can cause issues for SEV-SNP guests running on Family 19h CPUs, as TSA_SQ_NO and TSA_L1_NO are always enabled by KVM in 80000021[ECX]. When userspace issues a SNP_LAUNCH_UPDATE command to update the CPUID page for the guest, SNP firmware will explicitly reject the command if the page sets these bits on vulnerable CPUs. To fix this, check in SYNTHESIZED_F() that the corresponding X86 capability is set before adding it to kvm_cpu_cap_features. Fixes: 31272ab ("KVM: SVM: Advertise TSA CPUID bits to guests") Link: https://lore.kernel.org/all/20260208164233.30405-1-clopez@suse.de/ Signed-off-by: Carlos López <clopez@suse.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://patch.msgid.link/20260209153108.70667-2-clopez@suse.de Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
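The difference between the old and new behaviour can be sketched with plain bitmasks. This is not the KVM macro: the function names and the bit positions are hypothetical, chosen only to show the gating on the host capability.

```c
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this illustration. */
#define DEMO_TSA_SQ_NO (1u << 1)
#define DEMO_TSA_L1_NO (1u << 2)

/* Old behaviour: a synthesized bit is advertised unconditionally. */
static uint32_t synthesize_unchecked(uint32_t guest_caps, uint32_t bit)
{
	return guest_caps | bit;
}

/* Fixed behaviour: advertise the bit only if the host capability is set. */
static uint32_t synthesize_checked(uint32_t guest_caps, uint32_t host_caps,
				   uint32_t bit)
{
	return (host_caps & bit) ? (guest_caps | bit) : guest_caps;
}
```

On a host where neither bit is set (a vulnerable CPU in the changelog's terms), the unchecked variant still reports both bits to the guest, which is the situation the SNP firmware rejects.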

In KVM guests with Hyper-V hypercalls enabled, the hypercalls HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX allow a guest to request invalidation of portions of a virtual TLB. For this, the hypercall parameter includes a list of GVAs that are supposed to be invalidated. Currently, only the base GVA is checked to be canonical. In reality, this check needs to be performed for the entire range of GVAs, as checking only the base GVA enables guests running on Intel hardware to trigger a WARN_ONCE in the host (see Fixes commit below). Move the check for non-canonical addresses to be performed for every GVA of the supplied range to avoid the splat, and to be more in line with the Hyper-V specification, since, although unlikely, a range starting with an invalid GVA may still contain GVAs that are valid. Fixes: fa787ac ("KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush") Signed-off-by: Manuel Andreas <manuel.andreas@tum.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://patch.msgid.link/00a7a31b-573b-4d92-91f8-7d7e2f88ea48@tum.de [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
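A rough sketch of checking every GVA in a range rather than only the base. It assumes 48-bit virtual addresses and 4 KiB pages, and the function names and the (base, page count) representation are simplifications for illustration, not the KVM implementation or the Hyper-V entry encoding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL

/* Canonical for 48-bit addressing: bits 63..47 are all zero or all one. */
static bool is_canonical_48(uint64_t va)
{
	uint64_t top = va >> 47;

	return top == 0 || top == 0x1FFFF;
}

/* Walk the range page by page and skip each non-canonical GVA individually. */
static size_t count_flushable(uint64_t base, size_t pages)
{
	size_t flushed = 0;

	for (size_t i = 0; i < pages; i++) {
		uint64_t gva = base + i * DEMO_PAGE_SIZE;

		if (!is_canonical_48(gva))
			continue; /* skipped instead of reaching the flush path */
		flushed++;
	}
	return flushed;
}
```

A range that starts on the last canonical page of the lower half crosses into non-canonical space on its second page, which a base-only check would miss. Conversely, a range that starts just below 0xFFFF800000000000 begins with an invalid GVA but continues with valid ones.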

There is a potential use-after-free in move_existing_remap(): we're calling btrfs_put_block_group() on dest_bg, then passing it to btrfs_add_block_group_free_space() a few lines later. Fix this by getting the BG at the start of the function and putting it near the end. This also means we're not doing a lookup twice for the same thing. Reported-by: Chris Mason <clm@fb.com> Link: https://lore.kernel.org/linux-btrfs/20260125123908.2096548-1-clm@meta.com/ Fixes: bbea42d ("btrfs: move existing remaps before relocating block group") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

…unks() Fix a potential segfault in balance_remap_chunks(): if we quit early because btrfs_inc_block_group_ro() fails, all the remaining items in the chunks list will still have their bg value set to NULL. It's thus not safe to dereference this pointer without checking first. Reported-by: Chris Mason <clm@fb.com> Link: https://lore.kernel.org/linux-btrfs/20260125120717.1578828-1-clm@meta.com/ Fixes: 81e5a45 ("btrfs: allow balancing remap tree") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>

Introduce checks for FREE_SPACE_INFO item, which include: - Key alignment check The objectid is the logical bytenr of the chunk/bg, and offset is the length of the chunk/bg, thus they should all be aligned to the fs block size. - Item size check The FREE_SPACE_INFO item should have a fixed size. - Flags check The flags member should have no other flags than BTRFS_FREE_SPACE_USING_BITMAPS. For future expansion, introduce a new macro BTRFS_FREE_SPACE_FLAGS_MASK for such checks. And since we're here, the BTRFS_FREE_SPACE_USING_BITMAPS should not use unsigned long long, as the flags member is only 32 bits wide. So fix that to use unsigned long. - Extent count check That member shows how many free space bitmap/extent items there are inside the chunk/bg. We know the chunk size (from key->offset), thus there should be at most (key->offset >> sectorsize_bits) blocks inside the chunk. Use that value as the upper limit and if that counter is larger than that, there is a high chance it's a bitflip in high bits. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
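The alignment, flags and extent-count checks listed above could look roughly like this in isolation. It is a sketch, not the tree-checker code: the function name, its parameters and the DEMO_* macros are hypothetical, and the item size check is left out because it needs the on-disk structure definition.

```c
#include <stdint.h>

/* Local illustrative copies; the real macros live in the btrfs headers. */
#define DEMO_FREE_SPACE_USING_BITMAPS (1U << 0)
#define DEMO_FREE_SPACE_FLAGS_MASK    (DEMO_FREE_SPACE_USING_BITMAPS)

/* Returns 1 if the FREE_SPACE_INFO fields pass the checks, 0 otherwise. */
static int free_space_info_ok(uint64_t objectid, uint64_t offset,
			      uint32_t extent_count, uint32_t flags,
			      unsigned int sectorsize_bits)
{
	uint64_t blocksize = 1ULL << sectorsize_bits;

	/* Key alignment: start and length are multiples of the block size. */
	if ((objectid | offset) & (blocksize - 1))
		return 0;
	/* Flags: nothing outside the known mask may be set. */
	if (flags & ~DEMO_FREE_SPACE_FLAGS_MASK)
		return 0;
	/* Extent count: cannot exceed the number of blocks in the chunk. */
	if (extent_count > (offset >> sectorsize_bits))
		return 0;
	return 1;
}
```

For a 1 MiB chunk with 4 KiB blocks the upper limit is 256, so an extent count of 257 is rejected as a likely bitflip.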

Introduce FREE_SPACE_EXTENT checks, which include: - The key alignment check The objectid is the logical bytenr of the free space, and offset is the length of the free space, thus they should all be aligned to the fs block size. - The item size check The FREE_SPACE_EXTENT item should have a size of zero. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Introduce checks for FREE_SPACE_BITMAP item, which include: - Key alignment check Same as FREE_SPACE_EXTENT, the objectid is the logical bytenr of the free space, and offset is the length of the free space, so both should be aligned to the fs block size. - Non-zero range check A zero key->offset would describe an empty bitmap, which is invalid. - Item size check The item must hold exactly DIV_ROUND_UP(key->offset >> sectorsize_bits, BITS_PER_BYTE) bytes. A mismatch indicates a truncated or otherwise corrupt bitmap item; without this check, the bitmap loading path would walk past the end of the leaf and trigger a NULL dereference in assert_eb_folio_uptodate(). Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
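The expected item size follows directly from the formula in the changelog. The sketch below re-defines DIV_ROUND_UP and BITS_PER_BYTE for userspace; the two function names are hypothetical.

```c
#include <stdint.h>

#define BITS_PER_BYTE 8
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* One bit per block in the range, rounded up to whole bytes. */
static uint64_t expected_bitmap_bytes(uint64_t key_offset,
				      unsigned int sectorsize_bits)
{
	return DIV_ROUND_UP(key_offset >> sectorsize_bits, BITS_PER_BYTE);
}

/* Returns 1 if the bitmap item passes the range and size checks. */
static int bitmap_item_ok(uint64_t key_offset, unsigned int sectorsize_bits,
			  uint32_t item_size)
{
	if (key_offset == 0)
		return 0; /* an empty bitmap is invalid */
	if (key_offset & ((1ULL << sectorsize_bits) - 1))
		return 0; /* length not aligned to the block size */
	return item_size == expected_bitmap_bytes(key_offset, sectorsize_bits);
}
```

With 4 KiB blocks, a 1 MiB range covers 256 blocks and needs exactly 32 bytes, and a 36 KiB range covers 9 blocks and needs 2 bytes. Any other item size is treated as corruption.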

[BUG]
When recovering relocation at mount time, merge_reloc_root() and btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against an impossible state: a non-zero drop_progress combined with a zero drop_level in a root_item, which can be triggered:
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/relocation.c:1545!
  Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
  CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary)
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
  RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545
  Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000
  Call Trace:
   merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861
   btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195
   btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130
   open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640
   btrfs_fill_super fs/btrfs/super.c:987 [inline]
   btrfs_get_tree_super fs/btrfs/super.c:1951 [inline]
   btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline]
   btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128
   vfs_get_tree+0x9a/0x370 fs/super.c:1758
   fc_mount fs/namespace.c:1199 [inline]
   do_new_mount_fc fs/namespace.c:3642 [inline]
   do_new_mount fs/namespace.c:3718 [inline]
   path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
   do_mount fs/namespace.c:4041 [inline]
   __do_sys_mount fs/namespace.c:4229 [inline]
   __se_sys_mount fs/namespace.c:4206 [inline]
   __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
   ...
  RIP: 0033:0x7f969c9a8fde
  Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f
  ---[ end trace 0000000000000000 ]---
The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic metadata fuzzing tool that corrupts btrfs metadata at runtime.
[CAUSE]
A non-zero drop_progress.objectid means an interrupted btrfs_drop_snapshot() left a resume point on disk, and in that case drop_level must be greater than 0 because the checkpoint is only saved at internal node levels. Although this invariant is enforced when the kernel writes the root item, it is not validated when the root item is read back from disk. That allows on-disk corruption to provide an invalid state with drop_progress.objectid != 0 and drop_level == 0. When relocation recovery later processes such a root item, merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The same invalid metadata can also trigger the corresponding BUG_ON() in btrfs_drop_snapshot().
[FIX]
Fix this by validating the root_item invariant in tree-checker when reading root items from disk: if drop_progress.objectid is non-zero, drop_level must also be non-zero. Reject such malformed metadata with -EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot() and triggers the BUG_ON. After the fix, the same corruption is correctly rejected by tree-checker and the BUG_ON is no longer triggered.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
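The invariant itself is small enough to state as a predicate. This is a sketch of the rule described above, not the tree-checker function: the name and the flattened parameters are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; defined here only if the libc lacks it */
#endif

/* A saved resume point (non-zero objectid) requires a non-zero drop_level. */
static int check_root_item_drop_state(uint64_t drop_progress_objectid,
				      uint8_t drop_level)
{
	if (drop_progress_objectid != 0 && drop_level == 0)
		return -EUCLEAN;
	return 0;
}
```

Rejecting the item at read time turns what was a BUG_ON() during relocation recovery into an ordinary error that the mount path can report.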

Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL check. IS_ERR_OR_NULL() already uses unlikely(!ptr) internally. checkpatch does not like nesting it:

> WARNING: nested (un)?likely() calls, IS_ERR_OR_NULL already uses
> unlikely() internally

Remove the explicit use of likely(). Change generated with coccinelle.

Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
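For readers unfamiliar with the error-pointer convention, here is a minimal userspace sketch of why one combined check can replace IS_ERR() plus a manual NULL test. The lowercase helper names are hypothetical; they only mimic the kernel convention of encoding small negative errno values in the top of the pointer range and omit the branch hints.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True when the pointer value falls in the last MAX_ERRNO addresses. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* The combined check: NULL or an encoded error. */
static inline bool is_err_or_null(const void *ptr)
{
	return !ptr || is_err(ptr);
}
```

A caller that previously wrote `if (!p || is_err(p))` can use `is_err_or_null(p)` with the same result.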

read_extent_buffer_pages_nowait() returns immediately when an extent buffer is already marked uptodate. On that cache-hit path, the caller-supplied btrfs_tree_parent_check is not re-run.

This can let read_tree_root_path() accept a cached tree block whose actual header level/owner does not match the expected value derived from the parent. E.g. a corrupted root item points to a tree block that doesn't even belong to that root and has a mismatching level/owner. If that tree block has already been read and cached, then when the corrupted tree root is later read from disk, the lookup hits the cached tree block.

Fix this by re-validating cached extent buffers against the supplied btrfs_tree_parent_check on the uptodate path, and make read_tree_root_path() pass its check to btrfs_buffer_uptodate().

This makes cache hits and fresh reads follow the same tree-parent verification rules, and turns the corruption into a read failure instead of constructing an inconsistent root object.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Resolve the conflict with extent_buffer_uptodate() helper, handle transid mismatch case ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
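The idea of running the same parent check on cache hits and fresh reads can be sketched as follows. Everything here is a simplified assumption: the struct layouts, the function names and the -EIO return value are hypothetical and only model the behaviour the message describes.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down cached tree block. */
struct eb_sketch {
	bool uptodate;
	uint8_t level;
	uint64_t owner;
};

/* Hypothetical expectations derived from the parent. */
struct parent_check_sketch {
	uint8_t level;
	uint64_t owner_root;
};

static int verify_parent(const struct eb_sketch *eb,
			 const struct parent_check_sketch *check)
{
	if (eb->level != check->level || eb->owner != check->owner_root)
		return -EIO;	/* assumed error code for the sketch */
	return 0;
}

static int read_block_sketch(struct eb_sketch *eb,
			     const struct parent_check_sketch *check)
{
	if (!eb->uptodate) {
		/* A real implementation would read the block from disk here. */
		eb->uptodate = true;
	}
	/* Cache hits no longer skip verification. */
	return verify_parent(eb, check);
}
```

The point of the sketch is the placement of the verification: it sits after the uptodate shortcut rather than inside the read path only.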

Unlike other flags used in btrfs, BTRFS_ORDERED_* macros are different as they cannot be directly used as flags. They are defined as bit numbers, thus they should be utilized with bit operations, not directly with logical operations.

Unfortunately sometimes I forgot this and passed the incorrect flags to alloc_ordered_extent() and hit weird bugs.

Enhance the type checks in alloc_ordered_extent():

- Make sure there is one and only one bit set for exclusive type flags
  There are four exclusive type flags, REGULAR, NOCOW, PREALLOC and COMPRESSED. So introduce a new macro, BTRFS_ORDERED_EXCLUSIVE_FLAGS, to cover the above flags.
  Add an ASSERT() to check that one and only one of those exclusive flags is set for alloc_ordered_extent().

- Re-order the type bit numbers to the end of the enum
  This makes it much harder to get a valid false negative.
  E.g., with the old code BTRFS_ORDERED_REGULAR starts at zero, so the following flags can pass the bit uniqueness check:
  * BTRFS_ORDERED_NOCOW
    Be treated as BTRFS_ORDERED_REGULAR (1 == 1UL << 0).
  * BTRFS_ORDERED_PREALLOC
    Be treated as BTRFS_ORDERED_NOCOW (2 == 1UL << 1).
  * BTRFS_ORDERED_DIRECT
    Be treated as BTRFS_ORDERED_PREALLOC (4 == 1UL << 2).
  Now all those types start at 8, so passing any of those bit numbers as flags directly will not pass the ASSERT().

- Add a static assert to avoid overflow
  To make sure all BTRFS_ORDERED_* flags can fit into an unsigned long.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
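The difference between a bit number and a flag, and the "one and only one exclusive bit" check, can be illustrated with a small sketch. The enum names and exact values below are hypothetical; only the fact that the type bits start at 8 and that there are four exclusive types is taken from the message.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the type bits start at 8 as in the message. */
enum {
	ORDERED_REGULAR = 8,
	ORDERED_NOCOW,
	ORDERED_PREALLOC,
	ORDERED_COMPRESSED,
	ORDERED_NR_BITS,
};

#define ORDERED_EXCLUSIVE_FLAGS					\
	((1UL << ORDERED_REGULAR) | (1UL << ORDERED_NOCOW) |	\
	 (1UL << ORDERED_PREALLOC) | (1UL << ORDERED_COMPRESSED))

/* All bit numbers must fit into an unsigned long. */
_Static_assert(ORDERED_NR_BITS <= sizeof(unsigned long) * CHAR_BIT,
	       "ordered extent bits do not fit into unsigned long");

/* True when exactly one of the exclusive type bits is set in @flags. */
static bool ordered_type_valid(unsigned long flags)
{
	unsigned long type = flags & ORDERED_EXCLUSIVE_FLAGS;

	/* Non-zero and a power of two means a single bit is set. */
	return type != 0 && (type & (type - 1)) == 0;
}
```

Passing `1UL << ORDERED_NOCOW` is accepted, while passing the raw bit number `ORDERED_NOCOW` (9) sets no bit in the exclusive mask and is rejected, which is the mistake the ASSERT() is meant to catch.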

During development of a new feature, I triggered the btrfs_panic() inside insert_ordered_extent() and spent quite some unnecessary time before noticing I was passing incorrect flags when creating a new ordered extent. Unfortunately the existing error message does not provide much help. Enhance the output to provide file offset, num bytes and flags of both existing and new ordered extents. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

@atomic

That parameter was introduced by commit b9fab91 ("Btrfs: avoid sleeping in verify_parent_transid while atomic"). At that time we needed to lock the extent buffer range inside the io tree to avoid content changes, thus it could sleep. But that behavior is no longer there, as later commit 9e2aff9 ("btrfs: stop using lock_extent in btrfs_buffer_uptodate") dropped the io tree lock. We can remove the @atomic parameter safely now. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

[BUG]
Since commit 3d74a75 ("btrfs: zlib: introduce zlib_compress_bio() helper"), there are some reports about different crashes in the zlib compression path. One of the symptoms is list corruption like the following:

  list_del corruption. next->prev should be fffffbb340204a08, but was ffff8d6517cb7de0. (next=fffffbb3402d62c8)
  ------------[ cut here ]------------
  kernel BUG at lib/list_debug.c:65!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 1 UID: 0 PID: 21436 Comm: kworker/u16:7 Not tainted 7.0.0-rc2-jcg+ #1 PREEMPT
  Hardware name: LENOVO 10VGS02P00/3130, BIOS M1XKT57A 02/10/2022
  Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
  RIP: 0010:__list_del_entry_valid_or_report+0xec/0xf0
  Call Trace:
   <TASK>
   btrfs_alloc_compr_folio+0xae/0xc0 [btrfs]
   zlib_compress_bio+0x39d/0x6a0 [btrfs]
   btrfs_compress_bio+0x2e3/0x3d0 [btrfs]
   compress_file_range+0x2b0/0x660 [btrfs]
   btrfs_work_helper+0xdb/0x3e0 [btrfs]
   process_one_work+0x192/0x3d0
   worker_thread+0x19a/0x310
   kthread+0xdf/0x120
   ret_from_fork+0x22e/0x310
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

Other symptoms include VM_BUG_ON() during folio_put() but it's rarer.

David Sterba first reported this during his CI runs but unfortunately I'm unable to hit it.

Meanwhile zstd/lzo doesn't seem to have the same problem.

[CAUSE]
During zlib_compress_bio() every time the output buffer is full, we queue the full folio into the compressed bio, and allocate a new folio as the output folio.

After the input has finished, we loop through zlib_deflate() with Z_FINISH to flush all output.

And when that is done, we still need to check if the last folio has any content, and if so we still need to queue that part into the compressed bio.

The problem is in the final folio handling. If the final folio is full (for x86_64 the folio size is 4K), the length to queue is calculated by

  u32 cur_len = offset_in_folio(out_folio, workspace->strm.total_out);

But since total_out is 4K aligned, the resulting @cur_len will be 0. Then we hit bio_add_folio(), which has a quirk: if bio_add_folio() gets a length of 0, it will still queue the folio into the bio, but return false.

In that case we go to the out: label, which calls btrfs_free_compr_folio() to release @out_folio, which may put the out folio into the btrfs global pool list.

On the other hand, that @out_folio is already added to the compressed bio, and will later be released again by cleanup_compressed_bio(), which results in a double release.

And if this time we still need to put the folio into the btrfs global pool list, it will result in list corruption because it's already in the list.

[FIX]
Instead of offset_in_folio(), directly use the difference between strm.total_out and bi_size.

So that if the last folio is completely full, we can still properly queue the full folio rather than queueing zero bytes.

Fixes: 3d74a75 ("btrfs: zlib: introduce zlib_compress_bio() helper")
Reported-by: David Sterba <dsterba@suse.com>
Reported-by: Jean-Christophe Guillain <jean-christophe@guillain.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221176
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
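The arithmetic behind the bug can be checked with a small standalone sketch. The function names and the fixed 4K folio size are assumptions for illustration; the real code works on folios and the bio's bi_size rather than plain integers.

```c
#include <stdint.h>

#define FOLIO_SIZE_SKETCH 4096u	/* assumption: 4K folios, the x86_64 case */

/* Old calculation: offset of total_out inside the current folio. */
static uint32_t last_len_by_offset(uint64_t total_out)
{
	return (uint32_t)(total_out % FOLIO_SIZE_SKETCH);
}

/* New calculation: bytes produced minus bytes already queued in the bio. */
static uint32_t last_len_by_difference(uint64_t total_out, uint32_t queued)
{
	return (uint32_t)(total_out - queued);
}
```

With total_out = 8192 and 4096 bytes already queued, the offset-based length is 0 although the last folio holds 4096 bytes, while the difference-based length is 4096. For a partially filled last folio (total_out = 5000) both give 904.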

…_sync_file() If overlay is used on top of btrfs, dentry->d_sb translates to overlay's super block and fsid assignment will lead to a crash. Use file_inode(file)->i_sb to always get btrfs_sb. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

…o tree

When we are clearing all the bits from the last record that contains the target range (i.e. the record starts before our target range and ends beyond it), we are doing a lot of unnecessary work:

1) Allocating a prealloc state if we don't have one already;

2) Adjusting that last record's start offset to the end of our range and making the prealloc state have a range going from the original start offset of that last record to the end offset of our target range and the same bits as the last record. Then we insert the prealloc extent in the rbtree - this is done in split_state();

3) Removing our prealloc state from the rbtree since all the bits were cleared - this is done in clear_state_bit().

This only wastes time when we can simply trim the last record so that its start offset is adjusted to the end of the target range. So optimize for that case and avoid the prealloc state allocation, insertion and deletion from the rbtree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
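A minimal sketch of the trimming idea, under stated assumptions: ranges are inclusive, the record is assumed to begin inside the range being cleared and to end beyond it, and all of its bits are being cleared. The struct and function names are hypothetical and there is no rbtree here.

```c
#include <stdint.h>

/* Hypothetical record covering the inclusive byte range [start, end]. */
struct state_sketch {
	uint64_t start;
	uint64_t end;
};

/*
 * If the record begins inside [clear_start, clear_end] and ends beyond it,
 * move its start just past the cleared range instead of splitting it.
 * Otherwise return the record unchanged.
 */
static struct state_sketch trim_last_record(struct state_sketch rec,
					    uint64_t clear_start,
					    uint64_t clear_end)
{
	if (rec.start >= clear_start && rec.start <= clear_end &&
	    rec.end > clear_end)
		rec.start = clear_end + 1;
	return rec;
}
```

The trimmed record keeps its remaining range and bits, so no extra state has to be allocated, inserted and then removed again.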

…rting

When extent_io_tree_panic() is called we get a stack trace that is not very useful since the error message reports the location inside the extent_io_tree_panic() function and not in the caller of the function. Example:

  [ 7830.424291] BTRFS critical (device sdb): panic in extent_io_tree_panic:334: extent io tree error on add_extent_changeset state start 4083712 end 4112383 (errno=1 unknown)
  [ 7830.426816] ------------[ cut here ]------------
  [ 7830.427581] kernel BUG at fs/btrfs/extent-io-tree.c:334!
  [ 7830.428495] Oops: invalid opcode: 0000 [#1] SMP PTI
  [ 7830.429318] CPU: 5 UID: 0 PID: 1451600 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full)
  [ 7830.430899] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [ 7830.432771] RIP: 0010:extent_io_tree_panic+0x41/0x43 [btrfs]
  [ 7830.433815] Code: 75 0a 48 8b (...)
  [ 7830.436849] RSP: 0018:ffffd2334f4a3b68 EFLAGS: 00010246
  [ 7830.437668] RAX: 0000000000000000 RBX: 00000000003ebfff RCX: 0000000000000000
  [ 7830.438801] RDX: ffffffffc08d4368 RSI: ffffffffbb6ce475 RDI: ffff896501d6b780
  [ 7830.439671] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff
  [ 7830.440575] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
  [ 7830.441458] R13: ffff896547374c08 R14: 00000000003effff R15: ffff896547374c08
  [ 7830.442333] FS: 00007f3e252af0c0(0000) GS:ffff896c6185d000(0000) knlGS:0000000000000000
  [ 7830.443326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 7830.444047] CR2: 00007f3e252ad000 CR3: 0000000113b0a004 CR4: 0000000000370ef0
  [ 7830.444905] Call Trace:
  [ 7830.445229] <TASK>
  [ 7830.445557] btrfs_clear_extent_bit_changeset.cold+0x43/0x80 [btrfs]
  [ 7830.446543] btrfs_clear_record_extent_bits+0x19/0x20 [btrfs]
  [ 7830.447308] qgroup_free_reserved_data+0xf9/0x170 [btrfs]
  [ 7830.448040] btrfs_buffered_write+0x368/0x8e0 [btrfs]
  [ 7830.448707] btrfs_direct_write+0x1a5/0x480 [btrfs]
  [ 7830.449396] btrfs_do_write_iter+0x18c/0x210 [btrfs]
  [ 7830.450167] vfs_write+0x21f/0x450
  [ 7830.450662] ksys_write+0x5f/0xd0
  [ 7830.451092] do_syscall_64+0xe9/0xf20
  [ 7830.451610] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Change extent_io_tree_panic() to a macro so that we get a report that gives the exact place where the error happens. Example after this change:

  [63677.406061] BTRFS critical (device sdc): panic in btrfs_clear_extent_bit_changeset:744: extent io tree error on add_extent_changeset state start 1818624 end 1830911 (errno=1 unknown)
  [63677.410055] ------------[ cut here ]------------
  [63677.410910] kernel BUG at fs/btrfs/extent-io-tree.c:744!
  [63677.411918] Oops: invalid opcode: 0000 [#1] SMP PTI
  [63677.413032] CPU: 0 UID: 0 PID: 13028 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full)
  [63677.415139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [63677.417283] RIP: 0010:btrfs_clear_extent_bit_changeset.cold+0xcd/0x10c [btrfs]
  [63677.418676] Code: 8b 37 48 8b (...)
  [63677.421917] RSP: 0018:ffffd2290a417b30 EFLAGS: 00010246
  [63677.422824] RAX: 0000000000000000 RBX: 00000000001befff RCX: 0000000000000000
  [63677.424320] RDX: ffffffffc0970348 RSI: ffffffffa92ce475 RDI: ffff8897ded9dc80
  [63677.429772] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff
  [63677.430787] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
  [63677.431818] R13: ffff8897966655d8 R14: 00000000001bffff R15: ffff8897966655d8
  [63677.432764] FS: 00007f5c074c50c0(0000) GS:ffff889ef3b1d000(0000) knlGS:0000000000000000
  [63677.433940] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [63677.434787] CR2: 00007f5c074c3000 CR3: 000000014b9de002 CR4: 0000000000370ef0
  [63677.435960] Call Trace:
  [63677.436432] <TASK>
  [63677.436838] btrfs_clear_record_extent_bits+0x19/0x20 [btrfs]
  [63677.437980] qgroup_free_reserved_data+0xf9/0x170 [btrfs]
  [63677.439070] btrfs_buffered_write+0x368/0x8e0 [btrfs]
  [63677.439889] btrfs_do_write_iter+0x1a8/0x210 [btrfs]
  [63677.441460] do_iter_readv_writev+0x145/0x240
  [63677.446309] vfs_writev+0x120/0x3b0
  [63677.446878] ? __do_sys_newfstat+0x33/0x60
  [63677.447759] ? do_pwritev+0x8a/0xd0
  [63677.449119] do_pwritev+0x8a/0xd0
  [63677.452342] do_syscall_64+0xe9/0xf20
  [63677.452961] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
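Why a macro reports the caller while a function reports itself can be shown with a small userspace sketch. All names below are hypothetical; the only mechanism relied on is that __func__ and __LINE__ are expanded where the macro is used, not where it is defined.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static char report[128];

/* As a function, __func__ names the helper itself. */
static const char *panic_report_fn(const char *opname)
{
	snprintf(report, sizeof(report), "panic in %s:%d: error on %s",
		 __func__, __LINE__, opname);
	return report;
}

/* As a macro, __func__ and __LINE__ expand at the call site. */
#define panic_report_macro(opname)					\
	(snprintf(report, sizeof(report), "panic in %s:%d: error on %s",	\
		  __func__, __LINE__, (opname)), (const char *)report)

static const char *clear_bits_old(void)
{
	return panic_report_fn("add_extent_changeset");
}

static const char *clear_bits_new(void)
{
	return panic_report_macro("add_extent_changeset");
}

/* True when @msg starts with "panic in <func>:". */
static bool reported_in(const char *msg, const char *func)
{
	char prefix[64];

	snprintf(prefix, sizeof(prefix), "panic in %s:", func);
	return strncmp(msg, prefix, strlen(prefix)) == 0;
}
```

Here clear_bits_old() produces a message naming panic_report_fn, while clear_bits_new() produces one naming clear_bits_new, mirroring the before and after reports in the message.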

It's unexpected to ever call extent_io_tree_panic(), so wrap every if statement condition that leads to it in 'unlikely', making this explicit to the reader and hinting the compiler to potentially generate better code.

On x86_64, using gcc 14.2.0-19 from Debian, this resulted in a slight decrease of the btrfs module's text size.

Before:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1999832  174320   15592 2189744  2169b0 fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1999768  174320   15592 2189680  216970 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>

Currently add_extent_changeset() always returns the return value from its call to ulist_add(), which can return an error, 0 or 1. No callers care about the difference between 0 and 1: all except one of them check for negative values and ignore other values, but one caller (btrfs_clear_extent_bit_changeset()) must set its 'ret' variable to 0 after calling add_extent_changeset(), so that it does not return an unexpected value of 1 to its caller. So change add_extent_changeset() to only return errors or 0, so that this caller (and any future callers) does not have to deal with a return value of 1. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
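The return value normalization can be sketched with a toy set. The names, the fixed capacity and the exact meaning assigned to 0 and 1 are assumptions for illustration; the message only states that the underlying call can return an error, 0 or 1.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_CAP 2

static unsigned long sketch_vals[SKETCH_CAP];
static size_t sketch_nr;

/*
 * Hypothetical stand-in for the list insertion: negative on error,
 * 0 if the value was already present, 1 if it was newly inserted.
 */
static int sketch_ulist_add(unsigned long val)
{
	for (size_t i = 0; i < sketch_nr; i++)
		if (sketch_vals[i] == val)
			return 0;
	if (sketch_nr == SKETCH_CAP)
		return -ENOMEM;
	sketch_vals[sketch_nr++] = val;
	return 1;
}

/* Wrapper that only exposes "error" or "success" to its callers. */
static int sketch_add_changeset(unsigned long start)
{
	int ret = sketch_ulist_add(start);

	return ret < 0 ? ret : 0;
}
```

Callers of the wrapper can propagate its return value directly without first resetting a stray 1 to 0.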

There's no need to call BUG_ON(), instead call extent_io_tree_panic(), which also calls BUG(), but it prints an additional error message with some useful information before hitting BUG(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>

The argument is used as a boolean but it's defined as an integer. Switch it to a boolean for better readability. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>

There's no need to pass the 'wake' parameter, we can determine if we have to wake up waiters by checking if EXTENT_LOCK_BITS is set in the bits to clear. So simplify things and remove the parameter. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>

Whenever clearing the extent lock bits of an extent state record, we unconditionally call wake_up() on the state's waitqueue. Most of the time there are no waiters on the queue so we are just wasting time calling wake_up(), since that requires locking and unlocking the queue's spinlock, disabling and re-enabling interrupts, function calls, and other minor overhead, all while we are inside a critical section delimited by the extent io tree's spinlock. So call wake_up() only if there are waiters on an extent state's wait queue. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>

There's no need to free the cached extent state record while holding the io tree's spinlock, it's just making the critical section longer than it needs to be. So just do it after unlocking the io tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>

We are never expected to split an extent state record that is not in the rbtree, as every record we pass to split_state() was found by iterating the rbtree, so if that ever happens it means we are not holding the extent io tree's spinlock or we have some memory corruption.

Instead of simply warning in case the extent state record passed to split_state() is not in the rbtree, panic as this is a serious problem. Also tag as unlikely the case where the record is not in the rbtree.

This also makes a tiny reduction in the btrfs module's text size.

Before:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2000080  174328   15592 2190000  216ab0 fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2000064  174328   15592 2189984  216aa0 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>

When we are clearing all the bits from the first record that contains the target range and that record ends at or before our target range but starts before our target range, we are doing a lot of unnecessary work:

1) Allocating a prealloc state if we don't have one already;

2) Adjusting that record's start offset to the start of our range and making the prealloc state have a range going from the original start offset of that first record to the start offset of our target range, and with the same bits as that first record. Then we insert the prealloc extent in the rbtree - this is done in split_state();

3) Removing our adjusted first state from the rbtree since all the bits were cleared - this is done in clear_state_bit().

This only wastes time when we can simply trim that first record, so that it represents the range from its start offset to the start offset of our target range. So optimize for that case and avoid the prealloc state allocation, insertion and deletion from the rbtree.

This patch is the last patch of a patchset comprised of the following patches (in descending order):

  btrfs: optimize clearing all bits from first extent record in an io tree
  btrfs: panic instead of warn when splitting extent state not in the tree
  btrfs: free cached state outside critical section in wait_extent_bit()
  btrfs: avoid unnecessary wake ups on io trees when there are no waiters
  btrfs: remove wake parameter from clear_state_bit()
  btrfs: change last argument of add_extent_changeset() to boolean
  btrfs: use extent_io_tree_panic() instead of BUG_ON()
  btrfs: make add_extent_changeset() only return errors or success
  btrfs: tag as unlikely branches that call extent_io_tree_panic()
  btrfs: turn extent_io_tree_panic() into a macro for better error reporting
  btrfs: optimize clearing all bits from the last extent record in an io tree

The following fio script was used to measure performance before and after applying all the patches:

  $ cat ./fio-io-uring-2.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS=""

  if [ $# -ne 3 ]; then
      echo "Use $0 NUM_JOBS FILE_SIZE RUN_TIME"
      exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  RUN_TIME=$3

  cat <<EOF > /tmp/fio-job.ini
  [io_uring_rw]
  rw=randwrite
  fsync=0
  fallocate=none
  group_reporting=1
  direct=1
  ioengine=io_uring
  fixedbufs=1
  iodepth=64
  bs=4K
  filesize=$FILE_SIZE
  runtime=$RUN_TIME
  time_based
  filename=foobar
  directory=$MNT
  numjobs=$NUM_JOBS
  thread
  EOF

  echo performance | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  echo
  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  fio /tmp/fio-job.ini

  umount $MNT

When running this script on a 12-core machine using a 16G null block device the results were the following:

Before patchset:

  $ ./fio-io-uring-2.sh 12 8G 60
  (...)
  WRITE: bw=74.8MiB/s (78.5MB/s), 74.8MiB/s-74.8MiB/s (78.5MB/s-78.5MB/s), io=4504MiB (4723MB), run=60197-60197msec

After patchset:

  $ ./fio-io-uring-2.sh 12 8G 60
  (...)
  WRITE: bw=82.2MiB/s (86.2MB/s), 82.2MiB/s-82.2MiB/s (86.2MB/s-86.2MB/s), io=4937MiB (5176MB), run=60027-60027msec

Also, using bpftrace to collect the duration (in nanoseconds) of all the btrfs_clear_extent_bit_changeset() calls done during that fio test and then making a histogram from that data yielded the following results:

Before patchset:

  Count: 6304804
  Range: 0.000 - 7587172.000; Mean: 2011.308; Median: 1219.000; Stddev: 17117.533
  Percentiles: 90th: 1888.000; 95th: 2189.000; 99th: 16104.000
        0.000 -       8.098:       7 |
        8.098 -      40.385:      20 |
       40.385 -     187.254:     146 |
      187.254 -     855.347:  742048 #######
      855.347 -    3894.426: 5462542 #####################################################
     3894.426 -   17718.848:   41489 |
    17718.848 -   80604.558:   46085 |
    80604.558 -  366664.449:   11285 |
   366664.449 - 1667918.122:     961 |
  1667918.122 - 7587172.000:     113 |

After patchset:

  Count: 6282879
  Range: 0.000 - 6029290.000; Mean: 1896.482; Median: 1126.000; Stddev: 15276.691
  Percentiles: 90th: 1741.000; 95th: 2026.000; 99th: 15713.000
        0.000 -      60.014:      12 |
       60.014 -     217.984:      63 |
      217.984 -     784.949:  517515 #####
      784.949 -    2819.823: 5632335 #####################################################
     2819.823 -   10123.127:   55716 #
    10123.127 -   36335.184:   46034 |
    36335.184 -  130412.049:   25708 |
   130412.049 -  468060.350:    4824 |
   468060.350 - 1679903.189:     549 |
  1679903.189 - 6029290.000:      84 |

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>

…eeing They have no more users since commit a649684 ("btrfs: fix start transaction qgroup rsv double free"), so remove them. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>

…a_prealloc() Since __btrfs_qgroup_free_meta() is only called by btrfs_qgroup_free_meta_prealloc(), which is a simple inline wrapper, get rid of the latter and rename __btrfs_qgroup_free_meta() to the latter. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>

…ve_meta_prealloc() Since __btrfs_qgroup_reserve_meta() is only called by btrfs_qgroup_reserve_meta_prealloc(), which is a simple inline wrapper, get rid of the latter and rename __btrfs_qgroup_reserve_meta() to the latter. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>

There's only one caller outside qgroup.c of btrfs_qgroup_reserve_meta(), and btrfs_qgroup_reserve_meta_prealloc() is a wrapper around that function. Make that caller use btrfs_qgroup_reserve_meta_prealloc() and unexport btrfs_qgroup_reserve_meta(), simplifying the external API. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>

kdave changed the title ~~Test for-next (regular, GH kvm) 2~~ Test for-next (regular, GH kvm) Mar 5, 2026

kdave closed this Mar 5, 2026

kdave reopened this Mar 5, 2026

kdave force-pushed the ci-kvm branch 2 times, most recently from 69fc6c9 to 98bf7e7 Compare March 5, 2026 23:30

Kuen-Han Tsai and others added 23 commits March 11, 2026 16:21

maharmstone and others added 11 commits March 18, 2026 08:05

adam900710 force-pushed the for-next branch from 02c8fe4 to 2f7bc14 Compare March 17, 2026 21:35

kdave force-pushed the for-next branch from 2f7bc14 to d390292 Compare March 18, 2026 10:15

Goldwyn Rodrigues and others added 16 commits March 18, 2026 12:18

btrfs: change last argument of add_extent_changeset() to boolean

7796567



Test for-next (regular, GH kvm) #1624

kdave wants to merge 10000 commits into ci-kvm from for-next


20 participants
