Storage pipeline

The storage stack carries guest I/O requests from a guest-visible controller to a backing store and back. It's shared between OpenVMM and OpenHCL. Every disk backend implements the DiskIo trait, and frontends hold a Disk wrapper — a cheap, cloneable handle to any backend. For the DiskIo trait surface, method contracts, and error model, see the disk_backend rustdoc.
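The "cheap, cloneable handle" shape can be sketched as a trait object behind an `Arc`. This is a simplified stand-in, not the real `disk_backend` API — the actual trait is async and much richer, and the `DiskIo`/`Disk` signatures below are illustrative only:

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for the real DiskIo trait: a backend
// exposes sector geometry and synchronous sector reads. See the
// disk_backend rustdoc for the real (async) surface.
trait DiskIo {
    fn sector_count(&self) -> u64;
    fn read_sector(&self, sector: u64) -> Vec<u8>;
}

// A Disk-style wrapper: cloning copies an Arc, not backend state,
// so multiple frontends can share one backend cheaply.
#[derive(Clone)]
struct Disk(Arc<dyn DiskIo + Send + Sync>);

// A trivial backend that reads as all zeroes.
struct ZeroDisk {
    sectors: u64,
}

impl DiskIo for ZeroDisk {
    fn sector_count(&self) -> u64 {
        self.sectors
    }
    fn read_sector(&self, _sector: u64) -> Vec<u8> {
        vec![0u8; 512]
    }
}

fn main() {
    let disk = Disk(Arc::new(ZeroDisk { sectors: 8 }));
    let handle = disk.clone(); // cheap: bumps a refcount
    assert_eq!(handle.0.sector_count(), 8);
    assert_eq!(disk.0.read_sector(0).len(), 512);
}
```

The design point is that frontends never name a concrete backend type; they hold the erased handle and any `DiskIo` implementation can sit behind it.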

The pipeline

Every storage I/O flows through the same layered pipeline:

  ┌──────────────────────────────────────────────────────────┐
  │  Guest I/O                                               │
  └────────────────────────┬─────────────────────────────────┘
                           │
  ┌────────────────────────┼─────────────────────────────────┐
  │  Frontend              │                                 │
  │  (NVMe · StorVSP · IDE)│                                 │
  └────┬───────────────────┼────────────────────┬────────────┘
       │                   │                    │
       │ NVMe: direct      │ SCSI / IDE         │
       │                   ▼                    │
       │       ┌────────────────────────┐       │
       │       │ SCSI adapter           │       │
       │       │ (SimpleScsiDisk /      │       │
       │       │  SimpleScsiDvd)        │       │
       │       └───────────┬────────────┘       │
       │                   │                    │
       ▼                   ▼                    ▼
  ┌──────────────────────────────────────────────────────────┐
  │  Disk  (DiskIo trait boundary)                           │
  └────────────────────────┬─────────────────────────────────┘
                           │
  ┌────────────────────────┼─────────────────────────────────┐
  │  Decorator wrappers    │  (optional: crypt · delay · PR) │
  └────────────────────────┼─────────────────────────────────┘
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
  ┌──────────────────┐      ┌──────────────────────────────┐
  │  Backend         │      │  Layered disk                │
  │  (file · block   │      │  (optional: RAM + backing)   │
  │   device · blob  │      │    ├── Layer 0 (RAM/sqlite)  │
  │   · VHD · ...)   │      │    └── Layer 1 (backend)     │
  └──────────────────┘      └──────────────────────────────┘

Key vocabulary:

  • Frontend. Speaks a guest-visible storage protocol and translates requests into DiskIo calls.
  • SCSI adapter. For the SCSI and IDE paths, an intermediate layer (SimpleScsiDisk or SimpleScsiDvd) that parses SCSI CDB opcodes before calling DiskIo.
  • Backend. A DiskIo implementation that reads and writes to a specific backing store.
  • Decorator. A DiskIo implementation that wraps another Disk and transforms I/O in transit (encryption, delay, persistent reservations).
  • Layered disk. A DiskIo implementation composed of ordered layers with per-sector presence tracking.

Frontends

Three frontends exist. Each speaks a different guest-visible protocol, but all of them produce DiskIo calls on the backend side.

  Frontend   Protocol              Transport                  Crate
  NVMe       NVMe 2.0              PCI MMIO + MSI-X           nvme
  StorVSP    SCSI CDB over VMBus   VMBus ring buffers         storvsp
  IDE        ATA / ATAPI           PCI/ISA I/O ports + DMA    ide

NVMe is the simplest path. The NVMe controller's namespace directly holds a Disk. NVM opcodes (READ, WRITE, FLUSH, DSM) map nearly 1:1 to DiskIo methods. The FUA bit from the NVMe write command is forwarded directly.
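The near-1:1 opcode mapping can be shown as a table in code. The opcode values below are from the NVMe NVM command set; the operation names on the right are simplified stand-ins for the corresponding DiskIo methods, not the crate's actual identifiers:

```rust
// Illustrative mapping from NVM command opcodes to DiskIo-style
// operations. Opcode values per the NVMe NVM command set.
fn nvm_opcode_to_diskio(opcode: u8) -> Option<&'static str> {
    match opcode {
        0x00 => Some("flush"),        // Flush → cache sync
        0x01 => Some("write"),        // Write (FUA bit forwarded)
        0x02 => Some("read"),         // Read
        0x09 => Some("unmap"),        // Dataset Management (deallocate)
        _ => None,                    // everything else needs handling
    }
}

fn main() {
    assert_eq!(nvm_opcode_to_diskio(0x02), Some("read"));
    assert_eq!(nvm_opcode_to_diskio(0x01), Some("write"));
    assert_eq!(nvm_opcode_to_diskio(0x7f), None);
}
```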

StorVSP / SCSI has a two-layer design. StorVSP handles the VMBus transport — negotiation, ring buffer management, sub-channel allocation. It dispatches each SCSI request to an AsyncScsiDisk implementation. For hard drives, that's SimpleScsiDisk, which parses the SCSI CDB and translates it to DiskIo calls. For optical drives, it's SimpleScsiDvd.

IDE is the legacy path. ATA commands for hard drives call DiskIo directly. ATAPI commands for optical drives delegate to SimpleScsiDvd through an ATAPI-to-SCSI translation layer — the same DVD implementation that StorVSP uses. IDE also supports enlightened INT13 commands, a Microsoft-specific optimization that collapses the multi-exit register-programming sequence into a single VM exit.

Backends

A backend is a DiskIo implementation that reads and writes to a specific backing store. Backends are interchangeable — swap one for another without changing the frontend. The frontend holds a Disk and doesn't know what's behind it. See the storage backends page for the full catalog and platform details.

Decorators

A decorator is a DiskIo implementation that wraps another Disk and transforms I/O in transit. Features compose by stacking decorators without modifying backends:

  CryptDisk
    └── BlockDeviceDisk

Three decorators exist: CryptDisk (XTS-AES-256 encryption), DelayDisk (injected latency), and DiskWithReservations (in-memory persistent reservation emulation). All three forward metadata (sector count, sector size, disk ID, wait_resize) to the inner disk unchanged. See the storage backends page for the decorator catalog.
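The decorator shape is worth seeing concretely. Below is a minimal sketch in the CryptDisk spirit — transform the payload in transit while forwarding metadata unchanged — with XOR standing in for XTS-AES-256. The trait and type names are illustrative, not the real crate API:

```rust
// Simplified stand-in for the DiskIo trait.
trait DiskIo {
    fn sector_count(&self) -> u64;
    fn read_sector(&self, sector: u64) -> Vec<u8>;
}

// A plain in-memory backend to wrap.
struct MemDisk {
    data: Vec<[u8; 512]>,
}

impl DiskIo for MemDisk {
    fn sector_count(&self) -> u64 {
        self.data.len() as u64
    }
    fn read_sector(&self, sector: u64) -> Vec<u8> {
        self.data[sector as usize].to_vec()
    }
}

// Decorator: XOR "encryption" stands in for XTS-AES. The shape is what
// matters — transform the payload, pass metadata through untouched.
struct XorDisk<D: DiskIo> {
    inner: D,
    key: u8,
}

impl<D: DiskIo> DiskIo for XorDisk<D> {
    fn sector_count(&self) -> u64 {
        self.inner.sector_count() // metadata forwarded unchanged
    }
    fn read_sector(&self, sector: u64) -> Vec<u8> {
        let mut buf = self.inner.read_sector(sector);
        for b in &mut buf {
            *b ^= self.key;
        }
        buf
    }
}

fn main() {
    let inner = MemDisk { data: vec![[0xAA; 512]; 4] };
    let disk = XorDisk { inner, key: 0xFF };
    assert_eq!(disk.sector_count(), 4); // unchanged by the decorator
    assert_eq!(disk.read_sector(0)[0], 0x55); // 0xAA ^ 0xFF
}
```

Because the decorator is itself a `DiskIo`, stacks like `CryptDisk → DelayDisk → BlockDeviceDisk` compose without any backend changes.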

The layered disk model

A LayeredDisk is a DiskIo implementation composed of multiple layers, ordered from top to bottom. Each layer is a block device with per-sector presence tracking. This model powers diff disks, RAM overlays, and caching.

Reads fall through

When a read arrives, the layered disk checks layers top-to-bottom. Each sector is served by the topmost layer that contains it, so a single multi-sector read may be assembled from several layers. Sectors not present in any layer read as zeroes.
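The fall-through rule can be sketched with sparse maps as layers — an illustrative model of the semantics, not the real `disklayer` bitmap implementation:

```rust
use std::collections::BTreeMap;

// Sketch of top-to-bottom read resolution with per-sector presence.
// Each layer is a sparse map from sector number to payload; absent
// sectors fall through, and sectors missing everywhere read as zeroes.
fn read_sector(layers: &[BTreeMap<u64, [u8; 512]>], sector: u64) -> [u8; 512] {
    for layer in layers {
        // layers[0] is the top of the stack
        if let Some(data) = layer.get(&sector) {
            return *data;
        }
    }
    [0u8; 512] // not present in any layer: zero-filled
}

fn main() {
    let mut top = BTreeMap::new();
    top.insert(1, [0x11u8; 512]);
    let mut bottom = BTreeMap::new();
    bottom.insert(1, [0x22u8; 512]); // shadowed by the top layer
    bottom.insert(2, [0x33u8; 512]);
    let layers = vec![top, bottom];
    assert_eq!(read_sector(&layers, 1)[0], 0x11); // top wins
    assert_eq!(read_sector(&layers, 2)[0], 0x33); // falls through
    assert_eq!(read_sector(&layers, 3)[0], 0x00); // zeroed
}
```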

Writes go to the top

Writes always go to the topmost layer. If that layer is configured with write-through, the write also propagates to the next layer.

Read caching

A layer can be configured to cache read misses: when sectors are fetched from a lower layer, they're written back to the cache layer. This uses a write_no_overwrite operation to avoid overwriting sectors that were written between the read and the cache population.
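Why write_no_overwrite matters is easiest to see in miniature. In this sketch (payloads shrunk to one byte; not the real API), a guest write lands on a sector between the read miss and the cache write-back, and the no-overwrite semantics keep the cache from clobbering it with stale data:

```rust
use std::collections::BTreeMap;

// Sketch of read-cache population: only fill sectors that are still
// absent, so an intervening guest write is never overwritten.
fn write_no_overwrite(cache: &mut BTreeMap<u64, u8>, sector: u64, data: u8) {
    cache.entry(sector).or_insert(data);
}

fn main() {
    let mut cache = BTreeMap::new();
    // A read miss on sector 5 fetched 0x22 from a lower layer, but the
    // guest wrote 0x99 to sector 5 before the write-back ran:
    cache.insert(5, 0x99);
    write_no_overwrite(&mut cache, 5, 0x22); // must not overwrite
    assert_eq!(cache[&5], 0x99);
    // Sector 6 saw no intervening write, so the cache fills normally.
    write_no_overwrite(&mut cache, 6, 0x33);
    assert_eq!(cache[&6], 0x33);
}
```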

Layer implementations

Two concrete layers exist today:

  • RamDiskLayer (disklayer_ram) — ephemeral, in-memory. Data is stored in a BTreeMap keyed by sector number. Fast, but lost when the VM stops.
  • SqliteDiskLayer (disklayer_sqlite) — persistent, backed by a SQLite database (.dbhd file). Designed for dev/test scenarios — no stability guarantees on the on-disk format.

A full Disk can appear at the bottom of the stack as a fully-present layer (DiskAsLayer). This is the typical case: a RAM or sqlite layer on top of a file or block device.

Worked example: memdiff:file:disk.vhdx

  Layer 0: RamDiskLayer (empty, writable)
  Layer 1: DiskAsLayer wrapping FileDisk (fully present, read-only
           from the layered disk's perspective)
  • Guest write → sector goes to the RAM layer.
  • Guest read → check RAM; if the sector is present, return it. If absent, fall through to the file.
  • Sectors absent from both layers → zero-filled.

Changes are ephemeral — they live in the RAM layer and are lost when the VM stops. The Running OpenVMM page shows concrete memdiff: examples.

How configuration becomes a concrete stack

The resource resolver connects configuration (CLI flags, VTL2 settings) to concrete backends. A resource handle describes what backend to use; a resolver creates it.

The storage resolver chain is recursive. An NVMe controller resolves each namespace's disk; that disk may be a layered disk, which resolves each of its layers in parallel; and a layer may itself wrap a disk that needs resolving.

Example: --disk memdiff:file:path/to/disk.vhdx

  1. CLI parses this into a LayeredDiskHandle with two layers:
    • Layer 0: RamDiskLayerHandle { len: None, sector_size: None } (RAM diff, inherits size and sector size from backing disk)
    • Layer 1: DiskLayerHandle(FileDiskHandle(...)) (the file)
  2. The layered disk resolver resolves both layers in parallel.
  3. The RAM layer attaches on top of the file layer, inheriting its sector size and capacity.
  4. The resulting LayeredDisk is wrapped in a Disk and handed to the NVMe namespace or SCSI controller.
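The recursive handle-tree shape of steps 1–4 can be sketched as follows. The enum variants and `resolve` function are illustrative stand-ins for the real resource-resolver machinery, and this sketch resolves sequentially rather than in parallel:

```rust
// A handle tree describes the stack; resolving walks it and builds
// concrete objects. Names here are illustrative, not the real API.
enum DiskHandle {
    File(String),
    Layered(Vec<LayerHandle>),
}

enum LayerHandle {
    Ram { len: Option<u64> },
    Disk(Box<DiskHandle>), // a layer may itself be a disk to resolve
}

// Returns a description of the resolved stack.
fn resolve(handle: &DiskHandle) -> String {
    match handle {
        DiskHandle::File(path) => format!("FileDisk({path})"),
        DiskHandle::Layered(layers) => {
            let resolved: Vec<String> = layers
                .iter()
                .map(|l| match l {
                    LayerHandle::Ram { len: Some(n) } => format!("RamLayer({n})"),
                    LayerHandle::Ram { len: None } => "RamLayer(inherit)".to_string(),
                    LayerHandle::Disk(d) => resolve(d), // recurse
                })
                .collect();
            format!("LayeredDisk[{}]", resolved.join(", "))
        }
    }
}

fn main() {
    // memdiff:file:disk.vhdx → RAM layer over a file-backed layer.
    let handle = DiskHandle::Layered(vec![
        LayerHandle::Ram { len: None },
        LayerHandle::Disk(Box::new(DiskHandle::File("disk.vhdx".into()))),
    ]);
    assert_eq!(
        resolve(&handle),
        "LayeredDisk[RamLayer(inherit), FileDisk(disk.vhdx)]"
    );
}
```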

For the OpenHCL settings model (StorageController, Lun, PhysicalDevice), see Storage Translation and Storage Configuration Model.

Backend catalog

  Backend          Crate             Wraps                  Platform         Note
  FileDisk         disk_file         Host file              Cross-platform   Simplest backend
  Vhd1Disk         disk_vhd1         VHD1 fixed file        Cross-platform   Parses VHD footer
  VhdmpDisk        disk_vhdmp        Windows vhdmp driver   Windows          Dynamic/differencing VHD/VHDX
  BlobDisk         disk_blob         HTTP / Azure Blob      Cross-platform   Read-only, HTTP range requests
  BlockDeviceDisk  disk_blockdevice  Linux block device     Linux            io_uring, resize via uevent, PR passthrough
  NvmeDisk         disk_nvme         Physical NVMe (VFIO)   Linux/Windows    User-mode NVMe driver, resize via AEN
  StripedDisk      disk_striped      Multiple Disks         Cross-platform   Data striping

Online disk resize

Disk resize is a cross-cutting concern that spans backends and frontends.

Backend detection

Only two backends detect capacity changes at runtime:

  • BlockDeviceDisk — listens for Linux uevent notifications on the block device. When the host resizes the device, a uevent fires, the backend re-queries the size via ioctl, and wait_resize completes.
  • NvmeDisk — the user-mode NVMe driver monitors Async Event Notifications (AEN) from the physical controller and rescans namespace capacity.

All other backends default to never signaling (wait_resize returns pending()). Decorators and layered disks delegate wait_resize to the inner backend.
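The wait_resize contract — block until a notification supplies a new size, or forever if the backend never signals — can be modeled synchronously with a condvar. This is a sketch of the idea only; the real trait is async and the type below is hypothetical:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Models the resize-signal contract: a listener (e.g. a uevent watcher)
// supplies the new sector count; wait_resize blocks until it does.
struct ResizeSignal {
    state: Mutex<Option<u64>>,
    cond: Condvar,
}

impl ResizeSignal {
    fn new() -> Self {
        ResizeSignal { state: Mutex::new(None), cond: Condvar::new() }
    }

    // Blocks until a notification supplies a new size. A backend that
    // never calls signal() makes this wait forever — the default.
    fn wait_resize(&self) -> u64 {
        let mut guard = self.state.lock().unwrap();
        while guard.is_none() {
            guard = self.cond.wait(guard).unwrap();
        }
        guard.take().unwrap()
    }

    fn signal(&self, new_sectors: u64) {
        *self.state.lock().unwrap() = Some(new_sectors);
        self.cond.notify_all();
    }
}

fn main() {
    let sig = Arc::new(ResizeSignal::new());
    let backend = Arc::clone(&sig);
    // Stand-in for a uevent listener detecting a host-side resize.
    let t = thread::spawn(move || backend.signal(4096));
    assert_eq!(sig.wait_resize(), 4096);
    t.join().unwrap();
}
```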

Warning

FileDisk never signals resize. If you attach a file backend and resize the file at runtime, nothing happens — the guest won't be notified. Use BlockDeviceDisk or NvmeDisk for runtime resize.

Frontend notification

Once a backend detects a resize, the frontend notifies the guest:

  • NVMe (Async Event Notification): a background task per namespace calls wait_resize. On change, it completes a queued AER command with a changed-namespace-list log page, and the guest re-identifies the namespace.
  • StorVSP / SCSI (UNIT_ATTENTION): on the next SCSI command after a resize, SimpleScsiDisk detects the capacity change and returns CHECK_CONDITION with UNIT_ATTENTION / CAPACITY_DATA_CHANGED. The guest retries and re-reads capacity.
  • IDE: not supported. IDE has no capacity-change notification mechanism.

The resize path is the same in OpenHCL and standalone — BlockDeviceDisk detects the uevent from the host, wait_resize completes, and the frontend notifies the guest through the standard mechanism. No special paravisor-level interception.

Virtual optical / DVD

DVD and CD-ROM drives use a different model from disk devices.

SimpleScsiDvd implements AsyncScsiDisk and manages media state: a disk can be Loaded or Unloaded. Optical media always uses a 2048-byte sector size. The implementation handles optical-specific SCSI commands: GET_EVENT_STATUS_NOTIFICATION, GET_CONFIGURATION, START_STOP_UNIT (eject), and media change events.

Eject

Two eject paths exist:

  • Guest-initiated (SCSI START_STOP_UNIT with the load/eject flag): the DVD handler checks the prevent flag, replaces media with Unloaded, and calls disk.eject(). Once ejected via SCSI, the media is permanently removed for the VM lifetime.
  • Host-initiated (change_media via the resolver's background task): can insert new media or remove existing media dynamically.
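The guest-initiated path above is a small state machine: honor the prevent flag, then replace the media with Unloaded. A simplified model (not the SimpleScsiDvd API):

```rust
// Media state per the text: a drive is Loaded or Unloaded.
#[derive(Debug, PartialEq)]
enum Media {
    Loaded,
    Unloaded,
}

struct Dvd {
    media: Media,
    prevent_removal: bool, // set by PREVENT ALLOW MEDIUM REMOVAL
}

impl Dvd {
    // Guest-initiated eject; returns true if the eject happened.
    fn eject(&mut self) -> bool {
        if self.prevent_removal || self.media == Media::Unloaded {
            return false;
        }
        self.media = Media::Unloaded;
        true
    }
}

fn main() {
    let mut dvd = Dvd { media: Media::Loaded, prevent_removal: true };
    assert!(!dvd.eject()); // prevent flag blocks the eject
    dvd.prevent_removal = false;
    assert!(dvd.eject());
    assert_eq!(dvd.media, Media::Unloaded);
    assert!(!dvd.eject()); // already ejected; stays removed
}
```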

Frontend support

  Frontend        DVD support   How
  StorVSP / SCSI  Yes           SimpleScsiDvd implements AsyncScsiDisk directly.
  IDE             Yes           ATAPI wraps SimpleScsiDvd through the ATAPI-to-SCSI layer.
  NVMe            No            NVMe has no removable media concept. Explicitly rejected.

CLI

  • --disk file:my.iso,dvd → SCSI optical drive.
  • --ide file:my.iso,dvd → IDE optical drive (ATAPI).

The dvd flag implicitly sets read_only = true.

mem: and memdiff: CLI mapping

Both CLI options map to the layered disk model:

  • mem:1G creates a single-layer LayeredDisk with a RamDiskLayer sized to 1 GB. No backing disk — the RAM layer is the entire disk.
  • memdiff:file:disk.vhdx creates a two-layer LayeredDisk: a RamDiskLayer (inheriting size from the backing disk) on top of the file. Writes go to the RAM layer; reads fall through to the file for sectors not yet written.

Both use RamDiskLayerHandle under the hood. The difference is len: Some(size) for mem: (standalone RAM disk with explicit size) vs. len: None for memdiff: (inherits from backing disk). The optional sector_size field (default None) lets you override the sector size; when None, it inherits from the lower layer or defaults to 512 bytes. The Running OpenVMM page shows concrete examples.
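The mem:/memdiff: distinction could be sketched as a small parser producing the two handle shapes. Field and type names mirror the text but the parsing logic here is a toy, not the real CLI code:

```rust
// The two RAM-layer shapes: mem: carries an explicit length,
// memdiff: inherits size and sector size from the backing disk.
#[derive(Debug, PartialEq)]
struct RamDiskLayerHandle {
    len: Option<u64>,
    sector_size: Option<u32>,
}

fn parse_ram_layer(arg: &str) -> Option<RamDiskLayerHandle> {
    if arg.starts_with("memdiff:") {
        // Check memdiff: before mem:, since "memdiff:" also begins "mem".
        // Size and sector size are inherited from the backing disk.
        Some(RamDiskLayerHandle { len: None, sector_size: None })
    } else if let Some(size) = arg.strip_prefix("mem:") {
        // Tiny size parser: "1G" → bytes. Real CLI parsing is richer.
        let bytes = match size.strip_suffix('G') {
            Some(n) => n.parse::<u64>().ok()? * 1024 * 1024 * 1024,
            None => size.parse::<u64>().ok()?,
        };
        Some(RamDiskLayerHandle { len: Some(bytes), sector_size: None })
    } else {
        None
    }
}

fn main() {
    assert_eq!(
        parse_ram_layer("mem:1G"),
        Some(RamDiskLayerHandle { len: Some(1 << 30), sector_size: None })
    );
    assert_eq!(
        parse_ram_layer("memdiff:file:disk.vhdx"),
        Some(RamDiskLayerHandle { len: None, sector_size: None })
    );
    assert_eq!(parse_ram_layer("file:disk.vhdx"), None);
}
```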

Controller identity and Azure disk classification

In Azure, which controller a disk sits on is a de facto compatibility boundary. Azure VMs present four SCSI controllers (this may change), each with a distinct instance ID. One controller carries the OS disk, resource (temporary) disk, and related infrastructure disks; a separate controller carries remote data disks. For Gen1 VMs, the IDE controllers logically replace that first SCSI controller, while data disks remain on SCSI.

Guest agents use controller identity to classify disks. The azure-vm-utils udev rules match on SCSI controller instance IDs to create stable symlinks under /dev/disk/azure/. Moving a disk from one StorVSP controller instance to another changes its classification and can break guest-side automation. For SCSI disk mapping details, see the Azure disk mapping docs.

For NVMe, the mapping uses namespace IDs: NSID 1 is the OS disk, NSID 2+ are data disks (portal LUN = NSID − 2). On newer VM sizes (v7+), disks are split across multiple NVMe controllers by caching policy. NVMe is Gen2-only. See the NVMe overview and NVMe disk identification FAQ for the full Azure perspective.
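The NSID arithmetic above is simple enough to capture in a helper. The function name is ours; the convention (NSID 1 = OS disk, portal LUN = NSID − 2) is from the mapping just described:

```rust
// Azure NVMe convention: NSID 1 is the OS disk; NSID 2+ are data
// disks, where the portal LUN is NSID − 2.
fn nsid_to_portal_lun(nsid: u32) -> Option<u32> {
    if nsid >= 2 {
        Some(nsid - 2)
    } else {
        None // NSID 1 (OS disk) has no data-disk LUN
    }
}

fn main() {
    assert_eq!(nsid_to_portal_lun(1), None); // OS disk
    assert_eq!(nsid_to_portal_lun(2), Some(0)); // first data disk
    assert_eq!(nsid_to_portal_lun(5), Some(3));
}
```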

Implementation map

  Component      Why read it                                        Source   Rustdoc
  disk_backend   DiskIo trait, Disk wrapper, error model            source   rustdoc
  disk_layered   Layered disk, LayerIo trait, bitmap tracking       source   rustdoc
  nvme           NVMe controller emulator                           source   rustdoc
  storvsp        VMBus SCSI controller                              source   rustdoc
  scsidisk       SCSI CDB parser (SimpleScsiDisk, SimpleScsiDvd)    source   rustdoc
  ide            IDE controller emulator                            source   rustdoc
  scsi_core      AsyncScsiDisk trait, Request, ScsiResult           source   rustdoc