Distributed Unix

Goal

Provide a POSIX-compatible system that can be distributed across thousands of machines.

Or: What if all of Google's servers were a single usable system?

Or: Timesharing in the cloud

Unsolved Problems

The big problem is: how do you communicate changes in system state (e.g., process status) to interested nodes without swamping the entire network with status messages?

Robustness and Availability

The overall system should be robust against node failures. That is, if a single node suffers a software error, power failure, or catastrophic disaster, the overall system should be able to recover. The particular resources (hardware, processes) backed by that node may become unavailable, but the failure should not cascade into a system-wide crash.

The file system should be replicated so that n node failures do not prevent resource access. In particular, virtual inodes (devices, sockets, FIFOs) should never be lost, given their tendency to be critical to system operation. VFS mounts are backed by daemons, so their availability is tied to the robustness of those daemons.

Compute Fabric

The unit of computation is the process. The fabric does not recognize threads.

Hardware

The compute fabric is made of a series of nodes sharing a LAN.

Each node is an independent PC (x86- or ARM-based). The entire fabric shares a single architecture.

(Should differing processor features be allowed? Is it feasible to move a process from a processor with fewer features to a processor with more?)

(Not sure if dunix should be a stand-alone operating system or a layer on top of a non-distributed kernel. I’m writing as if it’s a bare-metal OS, but it’s feasible to adapt it into a Wine-like layer.)

CPU Features

On start, a process is tagged with the CPU features it uses (e.g., ARM instruction sets, x86 extensions). A process can only move to a node with a superset of its CPU features.

Strictly speaking, this could allow ARM and x86 nodes to coexist in the same system, but it isn’t terribly practical: a process tagged with one architecture’s features could never move to a node of the other, so the fabric would split into two disjoint pools.

These tags are stored in a bitfield, with one flag per optional feature.
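A minimal sketch of what that bitfield might look like in C; the type name, the specific flags, and the superset check are illustrative assumptions, not a committed layout:

    /* Hypothetical CPU-feature tags: one bit per optional feature. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t cpu_features_t;

    #define FEAT_X86_SSE2   ((cpu_features_t)1 << 0)
    #define FEAT_X86_AVX    ((cpu_features_t)1 << 1)
    #define FEAT_X86_AVX512 ((cpu_features_t)1 << 2)
    #define FEAT_ARM_NEON   ((cpu_features_t)1 << 3)
    #define FEAT_ARM_SVE    ((cpu_features_t)1 << 4)

    /* A node can host a process iff the node's feature mask is a
       superset of the mask the process was tagged with at start. */
    static bool node_can_host(cpu_features_t node_feats,
                              cpu_features_t proc_feats)
    {
        return (proc_feats & ~node_feats) == 0;
    }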

Node Management

To Move Between Nodes

Processes may move arbitrarily between nodes, depending on load, file system concerns, etc.

Unmovable Processes

(mmap() of files is pretty difficult with a distributed FS.)

FIXME: Which devices affect process movement? Can certain device handles be swapped between actual nodes (e.g., GPUs)?

On fork

  1. Fork locally
  2. If the local machine can handle the additional load, stop here
  3. Otherwise, immediately send SIGSTOP to the child before it is scheduled
  4. From the list of peers, select a node to send the child to
  5. Move the child to the new node, following the To Move Between Nodes procedure (see the sketch below)
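A sketch of this policy in C, assuming hypothetical fabric primitives (local_load_ok(), pick_peer_node(), migrate_process()) that stand in for the real mechanisms:

    /* Hypothetical distributed fork: fork locally, and if the local
       node is overloaded, stop the child before it runs and hand it
       to a peer. The helpers below are assumptions, not a real API. */
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef int node_id_t;

    int       local_load_ok(void);                /* step 2 policy  */
    node_id_t pick_peer_node(void);               /* step 4 choice  */
    void      migrate_process(pid_t, node_id_t);  /* step 5; resumes
                                                     the child with
                                                     SIGCONT on the
                                                     new node */

    pid_t fabric_fork(void)
    {
        pid_t child = fork();                 /* 1. fork locally */
        if (child <= 0)
            return child;                     /* in child, or error */

        if (local_load_ok())                  /* 2. room here: done */
            return child;

        kill(child, SIGSTOP);                 /* 3. stop it before
                                                 it is scheduled */
        migrate_process(child, pick_peer_node()); /* 4-5. pick a peer
                                                     and move the
                                                     child there */
        return child;
    }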

Streams

Because of process shuffling, part of the fabric infrastructure is the ability to move streams (file descriptors) between machines. Processes hold FDs connecting them to one another, and a given stream may be local to one node or span two.
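One way to make that possible is to name streams fabric-wide rather than by local table slot. A hypothetical handle, purely illustrative:

    /* Hypothetical fabric-wide stream handle. The per-process FD
       table maps small integers to these, so a process's integer FD
       stays stable while the endpoints move between nodes. */
    #include <stdint.h>

    struct fabric_stream {
        uint64_t stream_id;   /* fabric-unique stream identifier  */
        uint32_t node_a;      /* node currently holding one end   */
        uint32_t node_b;      /* node holding the other end       */
        uint32_t flags;       /* e.g., a local-only fast path     */
    };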

Networking

(Alternative: Dedicated edge routers)

Stream Daemons

(Alternative: inetd)

This was chosen on the assumption that fork overhead is much lower than reinitialization, configuration reload, etc. A daemon has the flexibility to do initialization and management (through signals) from a central process while still taking advantage of the load-balancing compute fabric.

On Listen

  1. Registration is sent to edge routers, which are responsible for load balancing

On Connection (TCP)

  1. Edge router accepts the connection and creates an FD
  2. Edge router hands the connection FD to the node hosting the parent (template) process
  3. Parent node forks the template process
  4. The child process resumes from the listen call, receiving the connection FD
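A sketch of the edge router’s side of this handoff; registry_lookup() and fabric_send_fd() are assumed fabric primitives, declared here only for illustration:

    /* Hypothetical edge-router accept loop: accept the TCP
       connection, find the node holding the listening template
       process, and ship the connection FD to it. */
    #include <sys/socket.h>
    #include <unistd.h>

    typedef int node_id_t;

    node_id_t registry_lookup(int listen_fd);   /* from On Listen */
    void      fabric_send_fd(node_id_t node, int fd);

    void edge_accept_loop(int listen_fd)
    {
        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);   /* step 1 */
            if (conn < 0)
                continue;
            node_id_t parent = registry_lookup(listen_fd); /* step 2 */
            fabric_send_fd(parent, conn);  /* steps 3-4 happen on the
                                              parent node: fork, then
                                              the child resumes from
                                              listen with this FD */
            close(conn);                   /* drop the local copy
                                              after the handoff */
        }
    }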

On Connection (Unix, FIFO)

The inode stores a reference to the listening (parent/template) PID.

  1. Client process creates an FD
  2. Client node sends the FD to the parent node
  3. Parent node forks the template process
  4. Child process resumes from listen call, receiving the connection FD
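The client side of this handoff might look like the following sketch; node_of_pid() and fabric_send_fd() are assumed fabric primitives, not a defined API:

    /* Hypothetical client side of a Unix-socket or FIFO connection:
       the listener's node is resolved through the PID stored in the
       inode, then the client's FD is shipped there. */
    #include <sys/types.h>

    typedef int node_id_t;

    node_id_t node_of_pid(pid_t pid);   /* fabric PID tracking */
    void      fabric_send_fd(node_id_t node, int fd);

    void unix_connect_handoff(int conn_fd, pid_t listener_pid)
    {
        /* Steps 1-2: the caller created conn_fd; send it to the
           parent node. */
        fabric_send_fd(node_of_pid(listener_pid), conn_fd);
        /* Steps 3-4: the parent node forks its template, and the
           child resumes from listen with conn_fd. */
    }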

Scaling

Filesystem

This is largely undetermined.

Management Goals

net0

A prototype filesystem for small clusters. It’s layered on top of a traditional disk FS.

On Open

Workstations

Graphical workstations can work in this context.

I haven’t figured out how display/login managers get local hardware FDs.

Databases

Numeric Identifiers

uid_t and gid_t

32-bit numbers are likely sufficient; a single system is unlikely to hold 4 billion users or groups.

dev_t

This is used to identify both node devices (keyboards, displays, etc.) and file sources (disk drives, storage clusters). Therefore, it will likely need to be a structured number encoding whether the device is real or virtual, the node providing it (real devices only), and some other identifiers.

The idea of major and minor device numbers is all but lost in this system.
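One possible encoding, sketched in C; the field widths and macro names are arbitrary assumptions, not a committed layout:

    /* Hypothetical structured dev_t: a virtual/real flag, the
       providing node (real devices only), and a per-node device
       index. */
    #include <stdint.h>

    typedef uint64_t ddev_t;

    #define DDEV_VIRTUAL  ((ddev_t)1 << 63)               /* virtual?  */
    #define DDEV_NODE(d)  (((d) >> 32) & 0x7fffffffull)   /* provider  */
    #define DDEV_INDEX(d) ((d) & 0xffffffffull)           /* per-node  */

    static ddev_t ddev_make(int virt, uint32_t node, uint32_t index)
    {
        return (virt ? DDEV_VIRTUAL : 0)
             | ((ddev_t)(node & 0x7fffffff) << 32)
             | index;
    }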

pid_t

A process is only on one node at a time. It may be useful to assign an ‘authoritative’ node (probably the origin) to the process and encode it in the PID. However, this requires that node to always service the process, even after handing it to another node. It may also act as a bottleneck and, in the event of its failure, cause the process to be orphaned and lost.

However, we do not want every node tracking every process, either.

We must also prevent two processes from ever getting the same ID.
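A common way to satisfy both constraints is to partition the ID space by minting node, using the node bits only for uniqueness rather than for routing (which sidesteps the bottleneck above). A sketch with arbitrary field widths:

    /* Hypothetical PID layout: high bits name the minting node, low
       bits are that node's private counter, so two nodes can never
       issue the same PID and no global lock is needed. The node
       bits are not used to locate the process afterwards. */
    #include <stdint.h>

    typedef uint64_t dpid_t;

    #define DPID_SEQ_BITS 48

    static dpid_t dpid_next(uint16_t node_id, uint64_t *counter)
    {
        uint64_t seq = (*counter)++ & (((uint64_t)1 << DPID_SEQ_BITS) - 1);
        return ((dpid_t)node_id << DPID_SEQ_BITS) | seq;
    }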

Further Reading

To Do & Ideas