Storage pipeline
The storage stack carries guest I/O requests from a guest-visible controller to a backing store and back. It's shared between OpenVMM and OpenHCL. Every disk backend implements the DiskIo trait, and frontends hold a Disk wrapper — a cheap, cloneable handle to any backend. For the DiskIo trait surface, method contracts, and error model, see the disk_backend rustdoc.
The pipeline
Every storage I/O flows through the same layered pipeline:
┌──────────────────────────────────────────────────────────┐
│ Guest I/O │
└────────────────────────┬─────────────────────────────────┘
│
┌────────────────────────┼─────────────────────────────────┐
│ Frontend │ │
│ (NVMe · StorVSP · IDE)│ │
└────┬───────────────────┼────────────────────┬────────────┘
│ │ │
│ NVMe: direct │ SCSI / IDE │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ SCSI adapter │ │
│ │ (SimpleScsiDisk / │ │
│ │ SimpleScsiDvd) │ │
│ └───────────┬────────────┘ │
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────────────┐
│ Disk (DiskIo trait boundary) │
└────────────────────────┬─────────────────────────────────┘
│
┌────────────────────────┼─────────────────────────────────┐
│ Decorator wrappers │ (optional: crypt · delay · PR) │
└────────────────────────┼─────────────────────────────────┘
│
┌──────────────┴──────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────────────────┐
│ Backend │ │ Layered disk │
│ (file · block │ │ (optional: RAM + backing) │
│ device · blob │ │ ├── Layer 0 (RAM/sqlite) │
│ · VHD · ...) │ │ └── Layer 1 (backend) │
└──────────────────┘ └──────────────────────────────┘
Key vocabulary:
- Frontend. Speaks a guest-visible storage protocol and translates requests into DiskIo calls.
- SCSI adapter. For the SCSI and IDE paths, an intermediate layer (SimpleScsiDisk or SimpleScsiDvd) that parses SCSI CDB opcodes before calling DiskIo.
- Backend. A DiskIo implementation that reads and writes to a specific backing store.
- Decorator. A DiskIo implementation that wraps another Disk and transforms I/O in transit (encryption, delay, persistent reservations).
- Layered disk. A DiskIo implementation composed of ordered layers with per-sector presence tracking.
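To make the trait boundary concrete, here is a deliberately simplified, synchronous sketch of the shapes involved. It is not the real disk_backend API — the actual trait is async and richer (see the rustdoc linked above) — and every name and signature below is an assumption made for illustration only.

```rust
use std::sync::Arc;

// Illustrative stand-in for the DiskIo trait boundary; not the real API.
trait DiskIoLike: Send + Sync {
    fn sector_count(&self) -> u64;
    fn sector_size(&self) -> u32;
    fn read(&self, sector: u64, buf: &mut [u8]) -> Result<(), String>;
    fn write(&self, sector: u64, buf: &[u8], fua: bool) -> Result<(), String>;
    fn flush(&self) -> Result<(), String>;
}

// Disk is modeled as a cheap, cloneable handle: cloning copies an Arc,
// not the backing storage.
#[derive(Clone)]
struct DiskHandle(Arc<dyn DiskIoLike>);

impl DiskHandle {
    fn new(backend: impl DiskIoLike + 'static) -> Self {
        Self(Arc::new(backend))
    }
}

// A trivial backend that reads zeroes and discards writes.
struct ZeroDisk {
    sectors: u64,
}

impl DiskIoLike for ZeroDisk {
    fn sector_count(&self) -> u64 {
        self.sectors
    }
    fn sector_size(&self) -> u32 {
        512
    }
    fn read(&self, _sector: u64, buf: &mut [u8]) -> Result<(), String> {
        buf.fill(0);
        Ok(())
    }
    fn write(&self, _sector: u64, _buf: &[u8], _fua: bool) -> Result<(), String> {
        Ok(())
    }
    fn flush(&self) -> Result<(), String> {
        Ok(())
    }
}

fn main() {
    // A frontend would hold a clone of the handle and issue DiskIo-style calls.
    let disk = DiskHandle::new(ZeroDisk { sectors: 2048 });
    let frontend_view = disk.clone();
    let mut buf = vec![0u8; 512];
    frontend_view.0.read(0, &mut buf).unwrap();
    println!("{} sectors of {} bytes", disk.0.sector_count(), disk.0.sector_size());
}
```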
Frontends
Three frontends exist. Each speaks a different guest-visible protocol, but all of them produce DiskIo calls on the backend side.
| Frontend | Protocol | Transport | Crate |
|---|---|---|---|
| NVMe | NVMe 2.0 | PCI MMIO + MSI-X | nvme |
| StorVSP | SCSI CDB over VMBus | VMBus ring buffers | storvsp |
| IDE | ATA / ATAPI | PCI/ISA I/O ports + DMA | ide |
NVMe is the simplest path. The NVMe controller's namespace directly holds a Disk. NVM opcodes (READ, WRITE, FLUSH, DSM) map nearly 1:1 to DiskIo methods. The FUA bit from the NVMe write command is forwarded directly.
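As a rough illustration of that mapping (not the nvme crate's actual dispatch code — the opcode values come from the NVMe NVM command set, while the target method names are stand-ins for the DiskIo calls):

```rust
// Illustrative dispatch only. Opcode values are from the NVMe NVM command
// set; the method names on the right are stand-ins, not real DiskIo names.
fn nvm_opcode_to_disk_call(opcode: u8) -> &'static str {
    match opcode {
        0x00 => "flush()",              // FLUSH
        0x01 => "write(..., fua)",      // WRITE — FUA bit forwarded as-is
        0x02 => "read(...)",            // READ
        0x09 => "discard / unmap(...)", // DATASET MANAGEMENT (deallocate)
        _ => "command-specific or unsupported",
    }
}

fn main() {
    for op in [0x00u8, 0x01, 0x02, 0x09] {
        println!("opcode {op:#04x} -> {}", nvm_opcode_to_disk_call(op));
    }
}
```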
StorVSP / SCSI has a two-layer design. StorVSP handles the VMBus transport — negotiation, ring buffer management, sub-channel allocation. It dispatches each SCSI request to an AsyncScsiDisk implementation. For hard drives, that's SimpleScsiDisk, which parses the SCSI CDB and translates it to DiskIo calls. For optical drives, it's SimpleScsiDvd.
IDE is the legacy path. ATA commands for hard drives call DiskIo directly. ATAPI commands for optical drives delegate to SimpleScsiDvd through an ATAPI-to-SCSI translation layer — the same DVD implementation that StorVSP uses. IDE also supports enlightened INT13 commands, a Microsoft-specific optimization that collapses the multi-exit register-programming sequence into a single VM exit.
Backends
A backend is a DiskIo implementation that reads and writes to a specific backing store. Backends are interchangeable — swap one for another without changing the frontend. The frontend holds a Disk and doesn't know what's behind it. See the storage backends page for the full catalog and platform details.
Decorators
A decorator is a DiskIo implementation that wraps another Disk and transforms I/O in transit. Features compose by stacking decorators without modifying backends:
CryptDisk
└── BlockDeviceDisk
Three decorators exist: CryptDisk (XTS-AES-256 encryption), DelayDisk (injected latency), and DiskWithReservations (in-memory persistent reservation emulation). All three forward metadata (sector count, sector size, disk ID, wait_resize) to the inner disk unchanged. See the storage backends page for the decorator catalog.
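The pattern is ordinary trait-object/wrapper delegation. The sketch below is a toy, synchronous model of a delay decorator over an in-memory backend — the trait and method names are simplified stand-ins, not the real CryptDisk/DelayDisk code — but it shows the composition rule: transform I/O in transit, forward metadata unchanged.

```rust
use std::thread::sleep;
use std::time::Duration;

// Toy model of the decorator pattern; names are stand-ins for the real API.
trait DiskIoLike {
    fn sector_count(&self) -> u64;
    fn read(&self, sector: u64, buf: &mut [u8]);
    fn write(&mut self, sector: u64, buf: &[u8]);
}

// A trivial backend: one flat Vec of 512-byte sectors.
struct MemDisk {
    data: Vec<u8>,
}

impl DiskIoLike for MemDisk {
    fn sector_count(&self) -> u64 {
        (self.data.len() / 512) as u64
    }
    fn read(&self, sector: u64, buf: &mut [u8]) {
        let off = sector as usize * 512;
        buf.copy_from_slice(&self.data[off..off + buf.len()]);
    }
    fn write(&mut self, sector: u64, buf: &[u8]) {
        let off = sector as usize * 512;
        self.data[off..off + buf.len()].copy_from_slice(buf);
    }
}

// Decorator: injects latency on I/O, forwards everything else (metadata
// included) to the wrapped disk unchanged.
struct DelayDisk<D: DiskIoLike> {
    inner: D,
    delay: Duration,
}

impl<D: DiskIoLike> DiskIoLike for DelayDisk<D> {
    fn sector_count(&self) -> u64 {
        self.inner.sector_count()
    }
    fn read(&self, sector: u64, buf: &mut [u8]) {
        sleep(self.delay);
        self.inner.read(sector, buf);
    }
    fn write(&mut self, sector: u64, buf: &[u8]) {
        sleep(self.delay);
        self.inner.write(sector, buf);
    }
}

fn main() {
    let backend = MemDisk { data: vec![0u8; 8 * 512] };
    let mut disk = DelayDisk { inner: backend, delay: Duration::from_millis(1) };
    disk.write(3, &[0xAB; 512]);
    let mut buf = [0u8; 512];
    disk.read(3, &mut buf);
    assert_eq!(buf[0], 0xAB);
    println!("{} sectors, delayed I/O round-tripped", disk.sector_count());
}
```

Because a decorator presents the same disk-shaped surface it consumes, stacks like the CryptDisk-over-BlockDeviceDisk example above compose without either side knowing about the other.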
The layered disk model
A LayeredDisk is a DiskIo implementation composed of multiple layers, ordered from top to bottom. Each layer is a block device with per-sector presence tracking. This model powers diff disks, RAM overlays, and caching.
Reads fall through
When a read arrives, the layered disk checks layers top-to-bottom. The first layer that has the requested sectors provides the data. Sectors not present in any layer are zeroed.
Writes go to the top
Writes always go to the topmost layer. If that layer is configured with write-through, the write also propagates to the next layer.
Read caching
A layer can be configured to cache read misses: when sectors are fetched from a lower layer, they're written back to the cache layer. This uses a write_no_overwrite operation to avoid overwriting sectors that were written between the read and the cache population.
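A toy model of these three rules (read fall-through, write-to-top, and write_no_overwrite cache population) is sketched below. It is not the disk_layered implementation, though the per-sector BTreeMap mirrors how RamDiskLayer tracks presence.

```rust
use std::collections::BTreeMap;

const SECTOR: usize = 512;

// Toy model: each sparse layer tracks per-sector presence with a BTreeMap
// keyed by sector number; the bottom layer is fully present, like
// DiskAsLayer over a backend.
struct Layers {
    sparse: Vec<BTreeMap<u64, [u8; SECTOR]>>, // top first
    bottom: Vec<u8>,                          // fully-present backing "disk"
}

impl Layers {
    // Reads fall through: the first layer that has the sector wins.
    fn read(&self, sector: u64) -> [u8; SECTOR] {
        for layer in &self.sparse {
            if let Some(data) = layer.get(&sector) {
                return *data;
            }
        }
        let off = sector as usize * SECTOR;
        self.bottom[off..off + SECTOR].try_into().unwrap()
    }

    // Writes always land in the topmost layer.
    fn write(&mut self, sector: u64, data: [u8; SECTOR]) {
        self.sparse[0].insert(sector, data);
    }

    // Cache population: only insert if nothing was written meanwhile,
    // modeling write_no_overwrite.
    fn write_no_overwrite(&mut self, layer: usize, sector: u64, data: [u8; SECTOR]) {
        self.sparse[layer].entry(sector).or_insert(data);
    }
}

fn main() {
    let mut disk = Layers {
        sparse: vec![BTreeMap::new()],
        bottom: vec![0xFF; 16 * SECTOR], // pretend the backing file is all 0xFF
    };
    assert_eq!(disk.read(5)[0], 0xFF); // falls through to the backing file
    disk.write(5, [0xAA; SECTOR]);     // goes to the RAM layer
    assert_eq!(disk.read(5)[0], 0xAA); // now served from the top layer
    disk.write_no_overwrite(0, 5, [0x00; SECTOR]);
    assert_eq!(disk.read(5)[0], 0xAA); // cache population didn't clobber the write
    println!("layered reads and writes behave as described");
}
```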
Layer implementations
Two concrete layers exist today:
- RamDiskLayer (disklayer_ram) — ephemeral, in-memory. Data is stored in a BTreeMap keyed by sector number. Fast, but lost when the VM stops.
- SqliteDiskLayer (disklayer_sqlite) — persistent, backed by a SQLite database (.dbhd file). Designed for dev/test scenarios — no stability guarantees on the on-disk format.
A full Disk can appear at the bottom of the stack as a fully-present layer (DiskAsLayer). This is the typical case: a RAM or sqlite layer on top of a file or block device.
Worked example: memdiff:file:disk.vhdx
Layer 0: RamDiskLayer (empty, writable)
Layer 1: DiskAsLayer wrapping FileDisk (fully present, read-only from the layered disk's perspective)
- Guest write → sector goes to the RAM layer.
- Guest read → check RAM; if the sector is present, return it. If absent, fall through to the file.
- Sectors absent from both layers → zero-filled.
Changes are ephemeral — they live in the RAM layer and are lost when the VM stops. The Running OpenVMM page shows concrete memdiff: examples.
How configuration becomes a concrete stack
The resource resolver connects configuration (CLI flags, VTL2 settings) to concrete backends. A resource handle describes what backend to use; a resolver creates it.
The storage resolver chain is recursive. An NVMe controller resolves each namespace's disk, which may be a layered disk, which resolves each layer in parallel, which may itself be a disk that needs resolving.
Example: --disk memdiff:file:path/to/disk.vhdx
- The CLI parses this into a LayeredDiskHandle with two layers:
  - Layer 0: RamDiskLayerHandle { len: None, sector_size: None } (RAM diff, inherits size and sector size from the backing disk)
  - Layer 1: DiskLayerHandle(FileDiskHandle(...)) (the file)
- The layered disk resolver resolves both layers in parallel.
- The RAM layer attaches on top of the file layer, inheriting its sector size and capacity.
- The resulting LayeredDisk is wrapped in a Disk and handed to the NVMe namespace or SCSI controller.
For the OpenHCL settings model (StorageController, Lun, PhysicalDevice), see Storage Translation and Storage Configuration Model.
Backend catalog
| Backend | Crate | Wraps | Platform | Note |
|---|---|---|---|---|
| FileDisk | disk_file | Host file | Cross-platform | Simplest backend |
| Vhd1Disk | disk_vhd1 | VHD1 fixed file | Cross-platform | Parses VHD footer |
| VhdmpDisk | disk_vhdmp | Windows vhdmp driver | Windows | Dynamic/differencing VHD/VHDX |
| BlobDisk | disk_blob | HTTP / Azure Blob | Cross-platform | Read-only, HTTP range requests |
| BlockDeviceDisk | disk_blockdevice | Linux block device | Linux | io_uring, resize via uevent, PR passthrough |
| NvmeDisk | disk_nvme | Physical NVMe (VFIO) | Linux/Windows | User-mode NVMe driver, resize via AEN |
| StripedDisk | disk_striped | Multiple Disks | Cross-platform | Data striping |
Online disk resize
Disk resize is a cross-cutting concern that spans backends and frontends.
Backend detection
Only two backends detect capacity changes at runtime:
- BlockDeviceDisk — listens for Linux uevent notifications on the block device. When the host resizes the device, a uevent fires, the backend re-queries the size via ioctl, and wait_resize completes.
- NvmeDisk — the user-mode NVMe driver monitors Async Event Notifications (AEN) from the physical controller and rescans namespace capacity.
All other backends default to never signaling (wait_resize returns pending()). Decorators and layered disks delegate wait_resize to the inner backend.
FileDisk never signals resize. If you attach a file backend and resize the file at runtime, nothing happens — the guest won't be notified. Use BlockDeviceDisk or NvmeDisk for runtime resize.
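Conceptually, the flow is: the backend detects the change, wait_resize completes, and the frontend notifies the guest. The sketch below models just that hand-off with a channel — a stand-in for the real async wait_resize, where a sender that never sends plays the role of a future that stays pending forever.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Conceptual model only: not the real wait_resize API.
fn main() {
    let (tx, rx) = mpsc::channel::<u64>();

    // "Backend": pretend a uevent / AEN reports a larger device after a bit.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let new_sector_count = (4u64 << 30) / 512; // host grew the disk to 4 GiB
        tx.send(new_sector_count).unwrap();
    });

    // "Frontend": blocks until the resize signal arrives, then would notify
    // the guest (NVMe AEN or SCSI UNIT_ATTENTION in the real stack).
    let new_sectors = rx.recv().unwrap();
    println!("capacity changed: {new_sectors} sectors -> notify guest");
}
```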
Frontend notification
Once a backend detects a resize, the frontend notifies the guest:
| Frontend | Mechanism | How it works |
|---|---|---|
| NVMe | Async Event Notification | Background task per namespace calls wait_resize. On change, completes a queued AER command with a changed-namespace-list log page. Guest re-identifies the namespace. |
| StorVSP / SCSI | UNIT_ATTENTION | On the next SCSI command after a resize, SimpleScsiDisk detects the capacity change and returns CHECK_CONDITION with UNIT_ATTENTION / CAPACITY_DATA_CHANGED. Guest retries and re-reads capacity. |
| IDE | Not supported | IDE has no capacity-change notification mechanism. |
The resize path is the same in OpenHCL and standalone — BlockDeviceDisk detects the uevent from the host, wait_resize completes, and the frontend notifies the guest through the standard mechanism. No special paravisor-level interception.
Virtual optical / DVD
DVD and CD-ROM drives use a different model from disk devices.
SimpleScsiDvd implements AsyncScsiDisk and manages media state: a disk can be Loaded or Unloaded. Optical media always uses a 2048-byte sector size. The implementation handles optical-specific SCSI commands: GET_EVENT_STATUS_NOTIFICATION, GET_CONFIGURATION, START_STOP_UNIT (eject), and media change events.
Eject
Two eject paths exist:
- Guest-initiated (SCSI START_STOP_UNIT with the load/eject flag): the DVD handler checks the prevent flag, replaces media with Unloaded, and calls disk.eject(). Once ejected via SCSI, the media is permanently removed for the VM lifetime.
- Host-initiated (change_media via the resolver's background task): can insert new media or remove existing media dynamically.
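A toy model of the media state machine and the two eject paths follows; the names are illustrative, not the scsidisk types.

```rust
// Toy model of optical media state and the two eject paths; illustrative only.
#[derive(Debug)]
enum Media {
    Loaded,
    Unloaded,
}

struct Dvd {
    media: Media,
    prevent_removal: bool, // set by PREVENT ALLOW MEDIUM REMOVAL
}

impl Dvd {
    // Guest-initiated: START_STOP_UNIT with the load/eject flag.
    fn guest_eject(&mut self) -> Result<(), &'static str> {
        if self.prevent_removal {
            return Err("medium removal prevented");
        }
        self.media = Media::Unloaded;
        Ok(())
    }

    // Host-initiated: change_media can insert or remove media at any time.
    fn host_change_media(&mut self, new_media: Option<Media>) {
        self.media = new_media.unwrap_or(Media::Unloaded);
    }
}

fn main() {
    let mut dvd = Dvd { media: Media::Loaded, prevent_removal: false };
    dvd.guest_eject().unwrap();
    dvd.host_change_media(Some(Media::Loaded));
    println!("media state: {:?}", dvd.media);
}
```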
Frontend support
| Frontend | DVD support | How |
|---|---|---|
| StorVSP / SCSI | Yes | SimpleScsiDvd implements AsyncScsiDisk directly. |
| IDE | Yes | ATAPI wraps SimpleScsiDvd through the ATAPI-to-SCSI layer. |
| NVMe | No | NVMe has no removable media concept. Explicitly rejected. |
CLI
- --disk file:my.iso,dvd → SCSI optical drive.
- --ide file:my.iso,dvd → IDE optical drive (ATAPI).
The dvd flag implicitly sets read_only = true.
mem: and memdiff: CLI mapping
Both CLI options map to the layered disk model:
- mem:1G creates a single-layer LayeredDisk with a RamDiskLayer sized to 1 GB. No backing disk — the RAM layer is the entire disk.
- memdiff:file:disk.vhdx creates a two-layer LayeredDisk: a RamDiskLayer (inheriting size from the backing disk) on top of the file. Writes go to the RAM layer; reads fall through to the file for sectors not yet written.
Both use RamDiskLayerHandle under the hood. The difference is len: Some(size) for mem: (standalone RAM disk with explicit size) vs. len: None for memdiff: (inherits from backing disk). The optional sector_size field (default None) lets you override the sector size; when None, it inherits from the lower layer or defaults to 512 bytes. The Running OpenVMM page shows concrete examples.
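The contrast is easiest to see in the handles themselves. The sketch below uses local stand-in structs that mirror the handle names used above — the real handles live in the resource crates and carry more detail.

```rust
// Local stand-in structs mirroring the handle names used above; not the
// real resource-handle definitions.
struct RamDiskLayerHandle {
    len: Option<u64>,
    sector_size: Option<u32>,
}

struct FileDiskHandle(String);

enum LayerHandle {
    Ram(RamDiskLayerHandle),
    Disk(FileDiskHandle),
}

fn main() {
    // mem:1G — a single RAM layer with an explicit size; no backing disk.
    let mem = vec![LayerHandle::Ram(RamDiskLayerHandle {
        len: Some(1u64 << 30),
        sector_size: None, // defaults to 512 with no lower layer to inherit from
    })];

    // memdiff:file:disk.vhdx — a RAM layer that inherits its geometry from
    // the file layer beneath it.
    let memdiff = vec![
        LayerHandle::Ram(RamDiskLayerHandle { len: None, sector_size: None }),
        LayerHandle::Disk(FileDiskHandle("disk.vhdx".into())),
    ];

    for (name, layers) in [("mem:1G", &mem), ("memdiff:file:disk.vhdx", &memdiff)] {
        println!("{name}:");
        for layer in layers {
            match layer {
                LayerHandle::Ram(r) => println!(
                    "  RAM layer, len = {:?}, sector_size = {:?}",
                    r.len, r.sector_size
                ),
                LayerHandle::Disk(f) => println!("  disk layer over file {:?}", f.0),
            }
        }
    }
}
```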
Controller identity and Azure disk classification
In Azure, which controller a disk sits on is a de facto compatibility boundary. Azure VMs present four SCSI controllers (this may change), each with a distinct instance ID. One controller carries the OS disk, resource (temporary) disk, and related infrastructure disks; a separate controller carries remote data disks. For Gen1 VMs, the IDE controllers logically replace that first SCSI controller, while data disks remain on SCSI.
Guest agents use controller identity to classify disks. The azure-vm-utils udev rules match on SCSI controller instance IDs to create stable symlinks under /dev/disk/azure/. Moving a disk from one StorVSP controller instance to another changes its classification and can break guest-side automation. For SCSI disk mapping details, see the Azure disk mapping docs.
For NVMe, the mapping uses namespace IDs: NSID 1 is the OS disk, NSID 2+ are data disks (portal LUN = NSID − 2). On newer VM sizes (v7+), disks are split across multiple NVMe controllers by caching policy. NVMe is Gen2-only. See the NVMe overview and NVMe disk identification FAQ for the full Azure perspective.
Implementation map
| Component | Why read it | Source | Rustdoc |
|---|---|---|---|
| disk_backend | DiskIo trait, Disk wrapper, error model | source | rustdoc |
| disk_layered | Layered disk, LayerIo trait, bitmap tracking | source | rustdoc |
| nvme | NVMe controller emulator | source | rustdoc |
| storvsp | VMBus SCSI controller | source | rustdoc |
| scsidisk | SCSI CDB parser (SimpleScsiDisk, SimpleScsiDvd) | source | rustdoc |
| ide | IDE controller emulator | source | rustdoc |
| scsi_core | AsyncScsiDisk trait, Request, ScsiResult | source | rustdoc |