Overview
Status: stable
BoxOS is a 64-bit kernel for x86_64. It does not implement POSIX: there is no fork, no signals, and no global file descriptor table. Programs talk to the kernel by sending a Pocket (a 128-byte envelope) that carries a Manifest (a chain of named operations to run). The kernel runs each op through the matching Deck and writes a Result back on a paired ring.
On a multi-core boot the system is asymmetric: the boot CPU plays K-Core (scheduler, dispatcher), the rest play App-Core (user work). On a single-core boot the same CPU does both — the data path is identical.
Storage is in TagFS: files are rows with content hashes and tag id lists. There are no paths; files are queried by tag intersection. Identical bytes are stored once.
Cabin — the process
Status: stable
A Cabin is a process. It owns its address space (its own 4-level page table), its own ring-slot reservation for the per-process Pocket and Result rings, and its own security state. Cabins do not share memory or open handles by default.
Lifecycle is explicit: every spawn is a fresh start with a known entry point. There is no fork and no copy-on-write of another process's memory. When a Cabin exits, its memory and its ring pages are reclaimed deterministically.
Pocket — the envelope
Status: stable
A Pocket is a fixed 128-byte struct. It carries the sending pid, an optional target pid (for inter-process messages), a pointer to a Manifest, and a pointer to a list of Crates, the variable-size buffers the ops will read from or write to.
Pockets travel on a PocketRing: a single-producer, single-consumer ring per Cabin. The header lives at a fixed virtual address; the slot pages are mapped on demand the first time userspace writes into them. The kernel side has a paired ResultRing for replies, allocated the same way.
Ordering is straightforward: each ring has one writer and one reader, indexed by 64-bit monotonic counters. Backpressure is observable: pushing into a full ring returns immediately instead of blocking, so the sender can see that the ring is full.
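The ring discipline described above can be sketched in a few lines of C11. This is a minimal illustration, not the BoxOS implementation: the type and function names (`pocket_ring_t`, `ring_push`, `ring_pop`), the 64-slot capacity, and the in-struct slot array are all assumptions made for the example. The real ring maps its slot pages on demand.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS  64u   /* hypothetical capacity, must be a power of two */
#define POCKET_SIZE 128u  /* fixed Pocket size stated in the text */

static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0, "capacity must be a power of two");

typedef struct { uint8_t bytes[POCKET_SIZE]; } pocket_t;

typedef struct {
    _Atomic uint64_t head;  /* total Pockets pushed, written only by the producer */
    _Atomic uint64_t tail;  /* total Pockets popped, written only by the consumer */
    pocket_t slots[RING_SLOTS];
} pocket_ring_t;

/* Producer side: returns false at once when the ring is full. */
static bool ring_push(pocket_ring_t *r, const pocket_t *p)
{
    uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;                       /* full: the caller sees backpressure */
    r->slots[head & (RING_SLOTS - 1u)] = *p;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when there is nothing to drain. */
static bool ring_pop(pocket_ring_t *r, pocket_t *out)
{
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;                       /* empty */
    *out = r->slots[tail & (RING_SLOTS - 1u)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Because the counters only ever grow, `head - tail` is the fill level, and neither side needs a lock: each counter has exactly one writer.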
On the kernel side, the Guide is the loop that drains PocketRings on behalf of every process. It runs on the K-Core and is the single point where Pockets become Manifest executions. Entry into the kernel goes through a custom Notify path that uses SYSCALL/SYSRETQ with per-CPU state accessed through swapgs, plus a canonical-address guard that covers CVE-2012-0217.
Manifest & Crate
Status: stable
A Manifest is a small header followed by a variable-length chain of operations. Each operation names an op (resolved by the OpRegistry), points to its parameter blob, and optionally references one input and one output Crate.
A Crate is a 40-byte descriptor pointing at a chunk of memory. It replaces every fixed *_request_t / *_response_t struct an older design would have, so payload size isn't baked into the notify ABI.
The userspace helper MfCall1 builds a one-op Manifest, attaches optional in/out Crates, and submits via ManifestSubmitTimeout — that's the entire notify surface for a normal call.
OpRegistry
Status: stable
Every named operation in BoxOS is registered in a single kernel-global hash table. Each entry holds a name, a handler function pointer, an authorization level, and a small chunk of metadata. There are no "notify numbers": Manifest ops are looked up by (deck, opname) at execute time.
Decks register their ops at boot through OpRegistryRegister. A request that names an op no Deck has registered comes back as ERR_INVALID_OPCODE.
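As a rough illustration of name-based lookup, here is a small open-addressing table keyed on (deck, opname). Everything in it is hypothetical: `op_register` and `op_dispatch` stand in for the real entry points, the table size and hash function (FNV-1a) are arbitrary choices, and only the error name `ERR_INVALID_OPCODE` comes from the text (its numeric value here is made up).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REG_BUCKETS        128u  /* hypothetical table size */
#define ERR_OK             0
#define ERR_INVALID_OPCODE (-1)  /* name from the text, value assumed */
#define ERR_REGISTRY_FULL  (-2)  /* hypothetical */

typedef int (*op_handler_t)(void *params);

typedef struct {
    const char  *deck;
    const char  *name;       /* NULL marks an empty slot */
    op_handler_t handler;
    int          auth_level;
} op_entry_t;

static op_entry_t g_registry[REG_BUCKETS];

/* FNV-1a over deck, a separator, then the op name. */
static uint32_t op_hash(const char *deck, const char *name)
{
    uint32_t h = 2166136261u;
    for (const char *s = deck; *s; s++) { h ^= (uint8_t)*s; h *= 16777619u; }
    h ^= (uint8_t)'.'; h *= 16777619u;
    for (const char *s = name; *s; s++) { h ^= (uint8_t)*s; h *= 16777619u; }
    return h;
}

/* Linear probe: returns the matching slot, the first empty slot, or NULL if full. */
static op_entry_t *op_find_slot(const char *deck, const char *name)
{
    uint32_t i = op_hash(deck, name) % REG_BUCKETS;
    for (uint32_t n = 0; n < REG_BUCKETS; n++, i = (i + 1u) % REG_BUCKETS) {
        op_entry_t *e = &g_registry[i];
        if (e->name == NULL) return e;
        if (strcmp(e->deck, deck) == 0 && strcmp(e->name, name) == 0) return e;
    }
    return NULL;
}

static int op_register(const char *deck, const char *name, op_handler_t h, int auth_level)
{
    op_entry_t *e = op_find_slot(deck, name);
    if (e == NULL) return ERR_REGISTRY_FULL;
    e->deck = deck; e->name = name; e->handler = h; e->auth_level = auth_level;
    return ERR_OK;
}

/* Look the op up by name at execute time, then call whatever the Deck registered. */
static int op_dispatch(const char *deck, const char *name, void *params)
{
    op_entry_t *e = op_find_slot(deck, name);
    if (e == NULL || e->name == NULL) return ERR_INVALID_OPCODE;
    return e->handler(params);
}

/* Stand-in handler used only to exercise the table. */
static int demo_fill(void *params) { (void)params; return 42; }
```

A Deck would call something like `op_register("operations", "fill", demo_fill, 0)` at boot; a later dispatch of an unregistered name falls through to `ERR_INVALID_OPCODE`.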
For visibility into what the kernel is actually doing, BoxOS compiles in PerfTrace — a 256-slot circular buffer that records the TSC timestamp, pid, deck, opcode and error code of every Pocket dispatch. Single-writer (the Guide loop), no locks on the record path, zero overhead when the feature is compiled out.
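A single-writer trace buffer of this shape might look like the sketch below. The record layout, field widths, and function names are assumptions for illustration; the timestamp is passed in as a parameter so the example does not depend on a TSC intrinsic.

```c
#include <stdint.h>

#define PERFTRACE_SLOTS 256u  /* slot count from the text */

typedef struct {
    uint64_t tsc;     /* timestamp supplied by the caller */
    uint32_t pid;
    uint16_t deck;    /* field widths are assumed */
    uint16_t opcode;
    int32_t  err;
} perf_record_t;

typedef struct {
    uint64_t      next;                    /* total records ever written */
    perf_record_t slots[PERFTRACE_SLOTS];
} perf_trace_t;

/* One writer, no locks: overwrite the oldest slot and advance the counter. */
static void perf_trace_record(perf_trace_t *t, uint64_t tsc, uint32_t pid,
                              uint16_t deck, uint16_t opcode, int32_t err)
{
    perf_record_t *r = &t->slots[t->next % PERFTRACE_SLOTS];
    r->tsc = tsc;
    r->pid = pid;
    r->deck = deck;
    r->opcode = opcode;
    r->err = err;
    t->next++;
}

/* Number of valid records currently held (saturates at the slot count). */
static uint64_t perf_trace_count(const perf_trace_t *t)
{
    return t->next < PERFTRACE_SLOTS ? t->next : PERFTRACE_SLOTS;
}
```

After 300 records the buffer holds the most recent 256; the record written at index 256 has replaced slot 0.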
The five Decks
Status: stable
A Deck is a logically isolated module that owns a vocabulary: a set of named ops it registers with the OpRegistry. Decks are passive: nothing ever calls a Deck directly. The Manifest engine looks the op up, then calls the handler the Deck registered.
decks · what each one owns
Five Decks. Ninety-six ops. One registry.
A Deck is not a place where data lives — it's a logically isolated module that owns a set of named operations. Each Deck registers its ops with the OpRegistry at boot. Userspace never talks to a Deck directly: it submits a Manifest, the engine looks the op up, the matching handler runs.
Operations
10 ops · bytewise primitives
- move
- fill
- xor
- hash
- cmp
- find
- pack
- unpack
- bswap
- vadd
Hardware
54 ops · metal-facing
- timer
- rtc
- port
- irq
- disk
- keyboard
- vga
- usb
- halt
- reboot
System
18 ops · process · ipc · tags
- proc.spawn
- proc.kill
- ipc.route
- ipc.broadcast
- tag.add
- ctx.use
Storage
13 ops · files in TagFS
- read
- write
- query
- create
- delete
- rename
- tag.set
- ctx.set
Execution
1 op · result writer
- result
Total: 96 ops across the five Decks. The Storage Deck is the only one that touches TagFS — but Decks themselves are not the filesystem. They're where the names live.
TagFS — files by tag, not by path
Status: draft
POSIX organises persistence as a tree: directories nested inside directories, each file pinned to one path. TagFS replaces the tree with a small set of on-disk tables (a tag registry, a file table, and a metadata pool) sitting on top of a 4 KiB block layer.
posix
A tree.
One path per file. Renaming or moving rewrites the link.
tagfs
A registry.
File entry · tag ids · 4 KiB blocks. Dedup by content hash.
The same files, looked at differently. Instead of a directory you walk, think of each file as a point that lives inside whichever tag fields it carries. Asking for two tags is asking for the intersection.
tagfs · seen as fields
Files as points. Tags as fields.
A file isn't kept inside a folder — it carries tags, and tags overlap. Asking for vacation & kept is asking for the intersection of two fields. A file with no tags is just a free point.
each ● is a file · each ○ is a tag context
Each file table entry stores the file's content hash, its size, the blocks it occupies, and a list of tag ids. Identical content is stored once — that's where dedup lives. Underneath: BoxHash (per-block content hash and integrity check), BCDC (self-heal on read using those checksums), and copy-on-write for snapshots and ref-counted blocks.
Two more pieces sit alongside the basic file table: SelfHeal — a metadata-side scrubber that mirrors critical metadata blocks and verifies them periodically with CRC32; and Dedup — the same BoxHash content hash is used as a deduplication key, with reference counts and tag-aware locality so blocks with the same tag tend to land together on disk.
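Querying by tag intersection can be sketched as a subset test over sorted tag-id lists. This is an in-memory illustration only: the `file_entry_t` layout, the sorted-list assumption, and the function names are hypothetical, and the on-disk tables are not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint32_t *tags;   /* tag ids, assumed sorted ascending without duplicates */
    size_t          ntags;
} file_entry_t;

/* True if every id in want[] (also sorted) appears in the file's tag list. */
static bool file_has_all_tags(const file_entry_t *f, const uint32_t *want, size_t nwant)
{
    size_t i = 0, j = 0;
    while (i < f->ntags && j < nwant) {
        if (f->tags[i] < want[j]) {
            i++;                       /* file tag not asked for, skip it */
        } else if (f->tags[i] == want[j]) {
            i++; j++;                  /* matched one wanted tag */
        } else {
            return false;              /* want[j] is missing from the file */
        }
    }
    return j == nwant;
}

/* Writes the indices of matching files into out[] and returns how many matched. */
static size_t tag_query(const file_entry_t *files, size_t nfiles,
                        const uint32_t *want, size_t nwant,
                        size_t *out, size_t max_out)
{
    size_t n = 0;
    for (size_t k = 0; k < nfiles && n < max_out; k++)
        if (file_has_all_tags(&files[k], want, nwant))
            out[n++] = k;
    return n;
}
```

An empty query matches every file, including a file with no tags; asking for two tags returns only the files that carry both.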
Braid — tag-aware redundancy
Status: draft
Braid is BoxOS's storage redundancy layer. It sits below TagFS and decides where each block actually lives across multiple disks. Three modes:
- mirror — 2-way mirror, copy on every disk
- stripe — single-disk striping for raw speed
- weave — 3-way mirror for maximum safety
Two unusual things make Braid a Braid and not just RAID. First, every block has a BoxHash checksum tied to its tag context — a read from a mirror that diverges triggers an auto-heal: Braid reads all copies, checksum-votes for the agreed value, and rewrites the diverged copy. Second, tag-aware placement — when a file gets written, its tag context is hashed to pick which disk gets the primary copy, so files that share a tag prefer the same disk and seek less.
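The two ideas can be sketched independently. The code below is an assumption-heavy illustration, not taken from braid.c: it votes purely by majority among the copies' checksums (a real implementation could also compare against the stored BoxHash, which would let a 2-way mirror break ties), and it picks the primary disk with a plain modulo over a precomputed tag-context hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of a copy whose checksum is shared by a strict majority
 * of the copies, or -1 if no such majority exists. */
static int braid_vote(const uint64_t *sums, size_t ncopies)
{
    for (size_t i = 0; i < ncopies; i++) {
        size_t agree = 0;
        for (size_t j = 0; j < ncopies; j++)
            if (sums[j] == sums[i])
                agree++;
        if (agree * 2 > ncopies)
            return (int)i;
    }
    return -1;
}

/* Tag-aware placement: files with the same tag context hash to the same disk. */
static unsigned braid_primary_disk(uint64_t tag_ctx_hash, unsigned ndisks)
{
    assert(ndisks > 0);
    return (unsigned)(tag_ctx_hash % ndisks);
}
```

With three copies, one diverged copy is outvoted and can be rewritten from the winner. With two copies that disagree, majority voting alone cannot decide, which is why this sketch returns -1 there.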
Stats are tracked: checksum errors, auto-heals, tag-driven placements. Source: src/kernel/tagfs/braid/braid.{h,c}.
Storage Deck — talking to TagFS
Status: draft
Thirteen ops, all submitted as Manifest entries: read, write, create, delete, rename, query, getinfo, tag.set, tag.unset, ctx.set, ctx.clear, plus open/close-style helpers. None has a fixed payload size; Crates carry buffers of whatever length the call needs.
The shell's use command is sugar over the context ops. Each argument is a tag name; calling use with no arguments clears the context. The prompt updates to reflect what's in scope; commands that operate on TagFS use the context as a default filter.
# set the context to two tags
use kernel notify
# the prompt now shows the active context
# (commands that operate on TagFS use it as a default filter)
# clear the context
use
Notification — separate from regular IPC
Status: draft
Notify is a small system used for low-rate, process-to-process messages, for example the shell waking the display daemon when a buffer needs redrawing. It travels through its own dedicated path so noisy notifications can't starve a busy IPC ring.
In userspace it lives in its own file (boxlib/notify.c) on purpose, kept out of the regular Pocket-handling code so the two stay easy to read independently.
The kernel side uses a Listen Table: processes register interest in event categories by tag bitmask (e.g. EVENT_TAG_KEYBOARD, EVENT_TAG_STORAGE), and the kernel routes notifications only to the listeners whose tags match. The table holds up to 1024 listeners with a 256-bucket hash index keyed on (pid, required_tags) to keep registration O(1).
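The routing rule can be sketched with plain bitmasks. This example uses a linear table rather than the 256-bucket hash index, and it assumes that "match" means every tag the listener requires is present on the event. The two `EVENT_TAG_*` names come from the text, but their bit positions, the struct layout, and the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define EVENT_TAG_KEYBOARD (1u << 0)  /* bit positions are assumed */
#define EVENT_TAG_STORAGE  (1u << 1)
#define MAX_LISTENERS      1024u      /* capacity from the text */

typedef struct {
    uint32_t pid;
    uint32_t required_tags;
} listener_t;

typedef struct {
    listener_t entries[MAX_LISTENERS];
    size_t     count;
} listen_table_t;

/* Returns 0 on success, -1 when the table is full. */
static int listen_register(listen_table_t *t, uint32_t pid, uint32_t required_tags)
{
    if (t->count == MAX_LISTENERS)
        return -1;
    t->entries[t->count].pid = pid;
    t->entries[t->count].required_tags = required_tags;
    t->count++;
    return 0;
}

/* Collects the pids whose required tags are all set in event_tags. */
static size_t listen_route(const listen_table_t *t, uint32_t event_tags,
                           uint32_t *pids, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < t->count && n < max; i++) {
        uint32_t need = t->entries[i].required_tags;
        if ((event_tags & need) == need)
            pids[n++] = t->entries[i].pid;
    }
    return n;
}
```

A keyboard-only event reaches the keyboard listener but not one that requires both keyboard and storage tags.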
Drivers
Status: stable
All hardware-facing ops live in the Hardware Deck: 54 ops across timer, RTC, port I/O, IRQ, disk, keyboard, video and USB. The actual device code lives under src/kernel/drivers/; the Deck is just the names.
The sections below describe each driver one at a time.
Timer & RTC
Status: stable
The PIT, running at 100 Hz, drives the global tick (g_global_tick in kernel_clock.h), which is used for scheduling decisions and timeouts. The RTC is read on demand for wall-clock time. The Hardware Deck exposes three timer ops and three RTC ops.
IRQ · IDT · APIC
Status: stable
The IDT has 256 vectors. Hardware interrupts are routed through an IRQChip abstraction with two backends: legacy 8259 PIC (for early boot) and IO-APIC (for the rest). Each CPU has a Local APIC for IPIs and the per-core timer interrupt.
Storage I/O — ATA, AHCI, DiskBook
Status: draft
Three controller paths: ATA PIO (always available), ATA DMA (with an async I/O queue), and AHCI (with NCQ). Selection is done at boot: if the platform exposes an AHCI controller, AHCI wins.
On top of the raw block layer sits DiskBook — a write-ahead log used to make filesystem updates crash-consistent. TagFS metadata changes go through the WAL before being applied.
USB — xHCI host, HID class
Status: draft
The only USB host controller supported is xHCI. Enumeration brings up devices, and the HID class driver attaches to keyboards. The Hardware Deck has nine USB ops covering controller status and HID input. Other USB controller flavours (UHCI / EHCI / OHCI) are not implemented.
Display — VGA text, GOP framebuffer
Status: stable
Two paths. On legacy boots BoxOS uses an 80×25 VGA text mode. On UEFI / TagBoot it uses a GOP framebuffer with an embedded ASCII font. A small userspace process, the display daemon, owns the framebuffer and arbitrates redraws on behalf of other processes, so user code doesn't fight over scroll state.
Twelve VGA ops are exposed through the Hardware Deck: putchar, putstring, clear, set_cursor, attribute control, and the like.
Keyboard
Status: stable
Two input paths. PS/2 is the default, with software-side key repeat: a key held down without a key-up triggers a rate-limited synthetic stream. USB-HID through the xHCI driver provides the same byte stream, so userspace doesn't care which one is plugged in.
PCI · ACPI · serial
Status: stable
PCI is fully enumerated at boot: config-space access is done via the legacy port pair, and the bus walk feeds every other driver that needs to find a device. ACPI parsing covers RSDP, RSDT, FADT and MADT, enough to learn IO-APIC addresses and the CPU topology. COM1 is wired up as the kernel debug log.
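For readers unfamiliar with the legacy PCI configuration mechanism, the address written to the first port of the pair is built as follows. The port numbers and bit layout are the standard ones from the PCI specification; the function name is illustrative, and the actual port writes are left out because they need kernel privilege.

```c
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8u  /* write the address dword here */
#define PCI_CONFIG_DATA    0xCFCu  /* then read or write the register here */

/* Builds the CONFIG_ADDRESS value for a bus/device/function/register offset.
 * Bit 31 is the enable bit; the offset is rounded down to a dword boundary. */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1Fu) << 11)
         | ((uint32_t)(func & 0x07u) << 8)
         | (uint32_t)(offset & 0xFCu);
}
```

A bus walk loops over bus, device and function, writes this value to the address port, and reads the vendor/device id dword at offset 0 from the data port; a vendor id of 0xFFFF means nothing is there.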
There is no network stack.
Memory — PMM and VMM
Status: stable
The physical memory manager is a buddy allocator with three zones (DMA32, USER, and HIGH), so a request for a DMA-able page doesn't collide with a normal allocation. The virtual memory manager uses 4-level paging with a kernel half mapped into every Cabin (so a notify doesn't need a page-table switch) and a pull-map for physical RAM beyond 4 GiB.
virtual address space · per process
Where things live in memory.
x86_64 with 4-level paging. The lower half is private to the process; the upper half is the kernel, mapped into every process so syscalls don't need a page-table switch.
Two tag-flavoured layers ride on top of the raw allocator. PhysTag attaches up to 6 reserved tags (DMA32, USER, HIGH, KERNEL, MMIO, SHARED) and 48 user-definable tags to physical pages, with a 64-element band index for fast lookup. MemTag does the same for virtual regions: processes can ask for memory matching a set of tags (e.g. "dma-safe AND shared") and the allocator filters.
ASLR randomises stack, heap and buffer-heap positions per process at spawn. Offsets are page-aligned and sourced from RDRAND, with a TSC fallback. The stack range is 8 MiB (2048 page-aligned positions); the heap and buffer-heap ranges are 16 MiB each (4096 positions).
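The position counts follow directly from dividing each range by the 4 KiB page size, which gives 11 bits of entropy for the stack and 12 bits for each heap. A small sketch of that arithmetic, with hypothetical names and the random value passed in rather than read from RDRAND:

```c
#include <stdint.h>

#define ASLR_PAGE_SIZE   4096u
#define ASLR_STACK_RANGE (8u * 1024u * 1024u)   /* 8 MiB */
#define ASLR_HEAP_RANGE  (16u * 1024u * 1024u)  /* 16 MiB */

/* Number of distinct page-aligned start positions inside a range. */
static uint32_t aslr_positions(uint32_t range_bytes)
{
    return range_bytes / ASLR_PAGE_SIZE;
}

/* Picks a page-aligned address in [base, base + range_bytes) from a random value. */
static uint64_t aslr_pick(uint64_t base, uint32_t range_bytes, uint64_t random)
{
    return base + (random % aslr_positions(range_bytes)) * ASLR_PAGE_SIZE;
}
```

`aslr_positions(ASLR_STACK_RANGE)` is 2048 and `aslr_positions(ASLR_HEAP_RANGE)` is 4096, matching the figures above.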
Scheduler — multi-level feedback queue
Status: draft
Each core has its own run-queue. The scheduler is a multi-level feedback queue: a process that just woke up sits in a high-priority level; a process that has been running for a while drifts to a lower one. Idle cores can steal work from busier cores.
On a single-core boot the same CPU still runs this scheduler — there's just no peer to steal from. The K-Core role is logical; nothing forces it to be a separate piece of silicon.
On top of MLFQ sits UseContext — a kernel-wide tag set that grants a small priority boost to processes whose tag list matches all of the active context tags. A 64-bit fast path covers the first 64 tags; an overflow array handles the rest. It's the scheduler-side counterpart to the shell's use command.
Security — five auth levels
Status: draft
Each registered op carries a required authorization level, one of NONE, APP, UTILITY, SYSTEM, NETWORK. Each Cabin is tagged with a level. Before dispatching, the Manifest engine asks ManifestOpAuthorize whether the calling Cabin is allowed to run that op.
Two special tags override the levels: a Cabin tagged god passes every check; a Cabin tagged stopped fails every check. Kernel-internal Manifests bypass the gate entirely because they're not run on behalf of any process.
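A check of this shape might look like the sketch below. It is not the body of ManifestOpAuthorize. Two things in it are assumptions the text does not confirm: that the five levels are ordered as listed and compared with "at least", and that stopped takes precedence when a Cabin carries both special tags. A NULL Cabin stands in for a kernel-internal Manifest.

```c
#include <stdbool.h>
#include <stddef.h>

/* Level names from the text; their relative order is an assumption. */
typedef enum {
    AUTH_NONE,
    AUTH_APP,
    AUTH_UTILITY,
    AUTH_SYSTEM,
    AUTH_NETWORK
} auth_level_t;

typedef struct {
    auth_level_t level;
    bool tag_god;      /* passes every check */
    bool tag_stopped;  /* fails every check */
} cabin_sec_t;

static bool op_authorize(const cabin_sec_t *cabin, auth_level_t required)
{
    if (cabin == NULL)
        return true;               /* kernel-internal: no calling process, no gate */
    if (cabin->tag_stopped)
        return false;              /* assumed to win over god */
    if (cabin->tag_god)
        return true;
    return cabin->level >= required;   /* assumed ordering of levels */
}
```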
Boxlib — the userspace runtime
Status: stable
Boxlib is the small library every userspace program links against. It hides the ring mechanics so user code looks like normal C: call a function, get a value back.
The pieces, by file:
- pocket.c: build a Pocket, push it on the ring
- manifest.c: the MfCall1 helper and friends
- result.c: read replies off the ResultRing
- notify.c: the separate notification path
- vga.c · file.c · keyboard.c · system.c · time.c: typed wrappers around the matching Deck ops
- print.c · string.c · convert.c · memory.c · ipc.c: shared C utilities
- arch/: the bits of inline assembly that need to know about x86-64
Shell
Status: stable
The shell starts as PID 1. It owns a line editor, a small parser, and a handful of builtins (including use, the tag-context command). External commands are looked up among the shipped apps and spawned as fresh Cabins.
Source: src/userspace/shell/ — shell.c, executor.c, line_edit.c, plus the commands/ directory for builtins.
Apps
Status: stable
The kernel image ships with a small set of userspace ELFs:
- shell: interactive shell, PID 1
- display daemon: owns the framebuffer and serializes redraws
- today: prints wall-clock time via the RTC ops
- proca · procb: small two-process samples used to demo IPC
- chain: exercises a multi-op Manifest in one call
- memtest · mtest: memory and PMM stress tests
- decks: lists every registered op in every Deck
- bench: RDTSC microbenchmark across notify, IPC and storage paths
TagBoot — the UEFI loader
Status: stable
BoxOS ships its own UEFI loader, TagBoot, written in C using the EFIAPI calling convention. It maps the kernel image, queries the GOP framebuffer, converts the UEFI memory map into the e820 format the kernel expects, locates the ACPI RSDP, and jumps into the kernel entry point in long mode.
On legacy / BIOS boots the kernel loads through Stage 1 and Stage 2 instead — the same shape, just without the firmware services.
Build & run
Status: stable
From the project root:
make clean && make
make run # single core
make run CORES=4 MEM=16G # multi-core test
You'll need a recent x86_64-elf-gcc cross-toolchain and QEMU 7+. The tools/ directory has a few wrappers for headless runs and scripted input.
Found a bug or have a question that doesn't fit the docs? Open a thread in the forum, ideally in the Kernel internals category.
