Skip to content

Conversation

@syntactically
Copy link
Member

This combines a few commits from the original guest-CoW PR, all of which involve properly CoW'ing the page tables into the scratch region, dealing with this in snapshots, etc.

Previously, this was inconsistent.

Signed-off-by: Lucy Menon <[email protected]>
@syntactically syntactically added kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. Guest-COW PRs that form part of the Guest-COW change labels Jan 28, 2026
whp can now map any region as long as it contains a file
handle (which, presently, only snapshot and scratch regions do---not
user-provided regions).

Signed-off-by: Lucy Menon <[email protected]>
This fixes an error introduced by a merge conflict in a previous PR.

Signed-off-by: Lucy Menon <[email protected]>
@syntactically syntactically force-pushed the lm/pt-scratch branch 2 times, most recently from 1b10f54 to 22173e0 Compare January 28, 2026 06:53
Previously, in 7ca0a78, MAP_SHARED was replaced with MAP_PRIVATE,
primarily to bring the Miri and non-Miri code paths closer together,
since it did not appear to have any detrimental effect.  However, this
seems to cause issues with the scratch mapping on mshv on Linux. It is
not entirely clear what mshv is doing here; possibly something like
"mapping a page which has never been written to" causes the problem.

Signed-off-by: Lucy Menon <[email protected]>
As per the comments preexisting in
src/hyperlight_common/src/layout.rs. This avoids some fairly
cumbersome proliferation of #[cfg(feature = "init-paging")], and may
be developed further in the future.

Signed-off-by: Lucy Menon <[email protected]>
Instead of updating page tables in place, and using the heap region
memory for new page tables, this changes the hyperlight_guest virtual
memory subsystem to treat snapshot page tables as copy-on-write and to
allocate all new page tables in the scratch region.

Signed-off-by: Lucy Menon <[email protected]>
These were already being CoW'd.

Signed-off-by: Lucy Menon <[email protected]>
Now that some important things are being written to the scratch
region, this makes the snapshot process aware of it. In particular,
rather than just copying the current snapshot region into a new
snapshot, the snapshot process now traverses the guest page tables and
picks all pages that are mapped into guest virtual memory, placing
them into the new snapshot along with new page tables which account
for all physical memory having moved to the snapshot region.

Several tests are temporarily disabled in this commit. One,
`snapshot_restore_handles_remapping_correctly`, will be re-enabled in
the next commit; its failure is due to an issue with state in the
guest page mapping process not being reset on snapshot. Several
others, related to the stack guard page, are also disabled, because
the snapshot process loses the context, kept in the host, of which
physical page is the stack guard page.  These will be re-enabled in an
upcoming patch series, which moves stack management fully into the
guest.

This commit also disables an optimisation around restoring to a
just-taken snapshot in certain cases; see the comment at the head of
`MultiUseSandbox::restore` for details.  Note that the optimisation of
taking a new snapshot immediately after a restore remains intact.

Signed-off-by: Lucy Menon <[email protected]>
Previously, the guest tried to infer the addresses of the snapshot
page tables from the values of cr3 that it saw at various times, but
this was brittle and irritating to do without writing to the wrong
address at times.  This replaces that mechanism with an extra word in
the metadata portion of the scratch region that explicitly provides
this information.

Signed-off-by: Lucy Menon <[email protected]>
Remove the unused constant `SCRATCH_TOP_USED_OFFSET`.

Signed-off-by: Lucy Menon <[email protected]>
andreiltd
andreiltd previously approved these changes Jan 28, 2026
Copy link
Member

@andreiltd andreiltd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement! I'm looking forward to removing more special regions from shared memory. Having an extra pair of eyes on the changes would be great also.

@jsturtevant jsturtevant mentioned this pull request Jan 28, 2026
17 tasks
Copy link
Contributor

@dblnz dblnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I left some small comments

pub use arch::{PAGE_SIZE, PAGE_TABLE_SIZE, PageTableEntry, PhysAddr, VirtAddr};
/// This is always the page size that the /guest/ is being compiled
/// for, which may or may not be the same as the host page size.
pub use arch::PAGE_SIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the page size differs? Does the host signal it somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to use each of these in the correct places (although I'm not sure whether or not the host page size is important to the guest---maybe in the future as a hint for zerocopy buffer alignment?), but as with some of the other portability stuff that's incidental to this series it's more aspirational/laying the groundwork for the future than "all issues related to this fixed".

regions.extend(self.get_mapped_regions().cloned());
// Include dynamically mapped regions
// TODO: include the snapshot and scratch regions
let regions: Vec<MemoryRegion> = self.get_mapped_regions().cloned().collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for the crashdump to work, we need to generate the MemoryRegions for snapshot and scratch.
We'd have to walk the page tables for that, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether you would want the crash dump to be virtually indexed or physically indexed. If the former, yeah, this should be replaced by code that does the same page table walk as snapshot.rs does right now, I think.

#[allow(clippy::unwrap_used)]
self.scratch_mem
.write::<u64>(size_offset, scratch_size as u64)
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just return an error and Poison the sandbox?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree that we should not crash the host if we can absolutely avoid it, maybe we need a new error type besides Poisoned that means something along the lines of "something very bad happened and you should consider restarting your process because we'd really like to panic here"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I agree about this. Things that are a clearly impossible should panic because who knows what is wrong / whether something else in either hyperlight or the host has been corrupted. See also this thread on the previous PR: #1182 (comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in the case where this should never happen, I get nervous about being in the failure path as a library that could cause down time. If we poison the sandbox and return an error that is specific to things that should not happen then we provide the consumer the ability to handle it fit to their particular needs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is not the right place to have the discussion about panicking on invariant violations in general, so while I still maintain my position that we should, I have for now added fallibility to this function in 54cd065.

A close inspection of the code has however convinced me that there is one caller (SandboxMemoryManager<ExclusiveSharedMemory>::build) for which panicking (at least from our code) is more completely impossible. I have added a slightly more detailed comment to that function arguing why it is morally infallible, but have added a Result type to its return for now anyway in the interests of expediency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on Lucy's side on this one in general. The tricky part here is that one may modify write function in the future without realizing it is breaking upstream assumption. If we decide to keep an error I think having debug_assert is a good way to document violation of invariant.

}
let (mem, off) = access_gpa(snap, scratch, scratch_size, gpa);
if off + PAGE_SIZE <= mem.as_slice().len() {
Some(&mem.as_slice()[off..off + PAGE_SIZE])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the observation in the comment above happen here also?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure which comment above you are referring to---if it's the one about page-unaligned accesses reading beyond memory, then I don't think so since the if condition checks for that here?

Copy link
Contributor

@simongdavies simongdavies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fwe questions/comments, I think there are a few places where there are types that are pub that could be pub (crate) , didn't look for them all, but possibly worth a pass of the whole PR for this.


pub const MAIN_STACK_TOP_GVA: usize = 0xffff_feff_ffff_f000;

pub fn scratch_size() -> u64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this and the following functions be pub(crate) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following two functions are definitely used in hyperlight_guest_bin. This one isn't currently but I see little downside to making it match.

#[allow(clippy::unwrap_used)]
self.scratch_mem
.write::<u64>(size_offset, scratch_size as u64)
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree that we should not crash the host if we can absolutely avoid it, maybe we need a new error type besides Poisoned that means something along the lines of "something very bad happened and you should consider restarting your process because we'd really like to panic here"

// impossible unless an extremely significant invariant violation
// has occurred.
#[allow(clippy::panic)]
unsafe fn alloc_table(&self) -> u64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can ew split TableOps into a TableOpsRead and TableOpsWrite to avoid this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably could, but I didn't think it was worth the effort at the moment. It is pretty clear semantically that these implementations could not possibly be r/w and I can't see any natural way in which host code would want to use them that way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer compile time checks to make this impossible but OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a downside to splitting the trait into read and write?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, and I agree that it is the correct thing to do, I just wasn't sure if it was worth the hassle. Since it seems like there is more interest in getting right, I went ahead and did it in a2ece56.

// should possibly ever call an update_parent. Therefore, this is
// impossible unless an extremely significant invariant violation
// has occurred.
#[allow(clippy::panic)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above can we split up the TableOps trait into read/write and then avoid this entirely?

Make virt_to_phys use the aligned and un-canonicalised version of its
argument for traversal.

This does not actually matter, because only certain bits are actually
taken from vmin by the (chained) `modify_ptes` iterator(s). However,
since it doesn't matter, this is probably better for clarity.

Signed-off-by: Lucy Menon <[email protected]>
Fix bug that resulted in the first page of scratch memory being
unconditionally copied into the snapshot.

Signed-off-by: Lucy Menon <[email protected]>
Be slightly more careful about handling offsets from the guest in
snapshot.rs: previously, a maliciously-constructed sandbox could
possibly cause underflow panics, and another function could be misused
in a way that would result in a slice out-of-bounds panic (although
it was not currently being misused in such a way).

Signed-off-by: Lucy Menon <[email protected]>
@syntactically
Copy link
Member Author

Thanks for the comments everyone! I have added three fixups that address some of them. Sorry for the commit message summaries---the first line is used by autosquash so has to just be "which commit this is amending", although I did add a few words tying each of them to an issue identified in the review.

I think there are a few places where there are types that are pub that could be pub (crate) , didn't look for them all, but possibly worth a pass of the whole PR for this.

I did a quick grep for added lines starting with pub and pretty much everything I saw was guest library exports for memory layout or paging, which I think will be useful in other guests (e.g. hyperlight-wasm, which has its own page-table manipulations to implement).

ludfjig
ludfjig previously approved these changes Jan 29, 2026
Copy link
Contributor

@ludfjig ludfjig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me :D

Comment on lines -257 to -262
if let Some(snap) = &self.snapshot
&& snap.as_ref() == snapshot.as_ref()
{
// If the snapshot is already the current one, no need to restore
return Ok(());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the self.snapshot or is it used somewhere else? (Maybe in future PR if deemed OK?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still used for the optimisation of taking a snapshot immediately after a restore

andreiltd
andreiltd previously approved these changes Jan 29, 2026
This splits `hyperlight_common::vmem::TableOps` into `TableReadOps`
and `TableOps`, so that implementations like the one in snapshot.rs
that cannot support writes don't have to provide the (extremely
partial!) write functions.  This is slightly involved, because we need
to keep to track of the right information in the type system about
whether writes are possible, and, if so, whether they can move
existing page tables.

The implementation of of writing to cr3 to update page tables is also
moved into `hyperlight_guest_bin::paging`, where it belongs, which is
a simple change.

Signed-off-by: Lucy Menon <[email protected]>
@syntactically syntactically dismissed stale reviews from andreiltd and ludfjig via a2ece56 January 30, 2026 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Guest-COW PRs that form part of the Guest-COW change kind/enhancement For PRs adding features, improving functionality, docs, tests, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants