Administrator documentation

This section contains documentation about running PrivateStorageio.

Morph

This directory contains the Nix-based configuration for the grid, in the form of Nix expressions in .nix files and some JSON-based configuration in .json files.

This configuration is fed to morph to make changes to the deployment.

Deploying

The deployment consists of the public software packages and the private secrets. You can deploy these together:

morph deploy --upload-secrets morph/grid/<testing|production|...>/grid.nix test

Or separately:

morph deploy morph/grid/<testing|production|...>/grid.nix test
morph upload-secrets morph/grid/<testing|production|...>/grid.nix

Separate deployment is useful when the software deploy is done from a system which may not be sufficiently secure to host the secrets (such as a cloud build machine). Secrets should only be hosted on an extremely secure system (XXX write the document for what this means).

Note that secrets only need to be uploaded after a host in the grid has been rebooted or when the secrets have changed.

See the morph and nixos-rebuild documentation for more details about these commands.

Filesystem Layout

lib

This contains Nix library code for defining the grids.

grid

Specific grid definitions live in subdirectories beneath this directory.

config.json

As much of the static configuration for the PrivateStorage.io application as possible is provided in this file. It is read by grid.nix.
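
The exact keys vary from grid to grid, but the usual way for grid.nix to consume a JSON file like this is builtins.fromJSON. A minimal sketch (the domain attribute is a hypothetical example, not a real key):

# Sketch: reading static configuration from config.json into Nix.
# The attribute accessed here ("domain") is a hypothetical example.
let
  cfg = builtins.fromJSON (builtins.readFile ./config.json);
in
  cfg.domain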

grid.nix

This is the morph entrypoint for the grid. It defines all of the servers that are part of the grid.

The actual configuration is split into separate files that are imported from this one. You can do things like build the network:

morph build grid.nix
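
The exact layout differs between grids, but a morph network file generally has the shape sketched below: a network attribute describing the deployment as a whole, plus one attribute per server whose value is an ordinary NixOS module. The host name and imported file names here are illustrative, not taken from a real grid.

# Sketch of a morph network definition; the host name is illustrative.
{
  network = {
    description = "PrivateStorageio grid";
    # Pin the package set used to build every host in the grid.
    pkgs = import <nixpkgs> { };
  };

  # One attribute per server; each value is a NixOS module.
  "storage001" = { config, pkgs, ... }: {
    imports = [
      ./storage001-hardware.nix
      ./storage001-config.nix
    ];
  };
}
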
<hostname>-hardware.nix

These are the generated hardware-related configuration files for servers in the grid. These files are referenced from the corresponding <hostname>.nix files.

<hostname>-config.nix

Each such file contains a minimal Nix expression supplying critical system configuration details. “Critical” roughly corresponds to anything which must be specified to have a bootable system. These files are referenced by the corresponding <hostname>.nix files.
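
What counts as critical differs from machine to machine, but such a file might look roughly like the following sketch. The boot device, host name and addresses below are placeholders rather than values from any real host.

# Sketch of a <hostname>-config.nix; all concrete values are placeholders.
{ config, pkgs, ... }: {
  # Enough to produce a bootable system that is reachable on the network.
  boot.loader.grub.device = "/dev/sda";
  networking.hostName = "storage001";
  networking.interfaces.eth0.ipv4.addresses = [
    { address = "192.0.2.10"; prefixLength = 24; }
  ];
  networking.defaultGateway = "192.0.2.1";
  networking.nameservers = [ "192.0.2.53" ];
}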

Configuring New Storage Nodes

Storage nodes are brought into the grid in a multi-step process. Here are the steps to configure a new node, starting from a minimal NixOS 19.03 or 19.09 installation.

  1. Copy the remote file /etc/nixos/hardware-configuration.nix to the local file storageNNN-hardware.nix. In the case of an EC2 instance, copy the remote file /etc/nixos/configuration.nix instead.

  2. Add "zfs" to boot.supportedFilesystems in storageNNN-hardware.nix.

  3. Add a unique value for networking.hostId in storageNNN-hardware.nix.

  4. Copy storageNNN-hardware.nix back to /etc/nixos/hardware-configuration.nix.

  5. Run nixos-rebuild test.

  6. Manually create a storage zpool:

    zpool create -m legacy -o ashift=12 root raidz /dev/disk/by-id/{...}
    
  7. Mount the new ZFS filesystem to verify it is working:

    mkdir /storage
    mount -t zfs root /storage
    
  8. Add a new filesystem entry to storageNNN-hardware.nix:

    # Manually created using:
    #   zpool create -f -m legacy -o ashift=12 root raidz ...
    fileSystems."/storage" = {
      device = "root";
      fsType = "zfs";
    };
    
  9. Create a storageNNN-config.nix containing further configuration for the new host.

  10. Add an entry for the new host to grid.nix referencing the new files.

  11. Deploy to the new host with morph deploy morph/.../grid.nix --on <identifier> boot --upload-secrets --reboot.
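
Pulling steps 2, 3 and 8 together, the ZFS-related additions to storageNNN-hardware.nix end up looking roughly like the sketch below. The hostId shown is only a placeholder; it must be unique per host (the networking.hostId documentation suggests, for example, head -c 8 /etc/machine-id).

# Sketch of the ZFS-related additions to storageNNN-hardware.nix
# (steps 2, 3 and 8); the hostId is a placeholder and must be unique per host.
{
  boot.supportedFilesystems = [ "zfs" ];
  networking.hostId = "8425e349";

  # Manually created using:
  #   zpool create -f -m legacy -o ashift=12 root raidz ...
  fileSystems."/storage" = {
    device = "root";
    fsType = "zfs";
  };
}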

Monitoring

This section gives a high-level overview of the PrivateStorageio monitoring efforts.

Goals

Alerting
Something might break soon, so somebody needs to do something.
Comparing over time or experiment groups
Is our service slower than it was last week? Does database B answer queries faster than database A?
Analyzing long-term trends
How big is my database and how fast is it growing? How quickly is my daily-active user count growing?

Introduction to our dashboards

We have two groups of dashboards: Requests (external view, RED method) and Resources (internal view, USE method).

Resources like CPU and memory exist independently of one another (at least in theory) and their corresponding dashboards are listed in arbitrary order.

Services, on the other hand, often directly depend on other services: A request might cause sub-requests, which in turn might call other services. These dependencies can be visualized as a DAG (directed acyclic graph; like a tree, except that a node may have more than one parent) from external-facing to internal systems.

When a service fails and an alert is triggered, the services which depend on the failing service will often fail and trigger alerts as well. This can cause confusion and cost valuable time, especially when the current on-call staff is not familiar with the inner workings of the particular systems involved.

To mitigate this problem, we order our dashboards to resemble these dependencies according to a breadth-first-search of the service dependency DAG:

digraph {
	subgraph cluster01 {
		label = "DAG of service dependencies";

		1->2;
		1->3;

		3->4;
		3->5;
	}

	subgraph cluster02 {
		label = "Resulting order of dashboards";
		node [ shape = box ];
		edge [ style = invis ];

		d1 [ label = 1 ];
		d2 [ label = 2 ];
		d3 [ label = 3 ];
		d4 [ label = 4 ];
		d5 [ label = 5 ];

		d1->d2;
		d2->d3;
		d3->d4;
		d4->d5;
	}
}

DAG of services to resulting order of corresponding dashboards

This makes finding the first failing link, and thus the cause of the problem, quicker: problems of a failing service low in the DAG bubble “upwards” to the services that depend on it. Therefore, the “lowest” dashboard that indicates a problem has a high probability of highlighting the origin of the cascading failures.

Meaning of our metrics

The Monitoring Distributed Systems chapter of Google’s SRE book has a great explanation and definition of useful metrics, which it calls the Four Golden Signals:

Latency
Requests take time, but so do errors, so don’t discard failed requests when measuring latency.
Traffic
What constitutes “Traffic” depends on the nature of your system.
Errors
(The rate of) failed requests.
Saturation
How “full” your service is. Take action (e.g. page a human) before the service degrades.

“If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.”

RED method for services (“request-scoped”, “external view”)

Request rate, Errors, Duration (+ Saturation?)

  • Instrument everything that takes time and could fail
    • “In contrast to logging, services should instrument every meaningful number available for capture.” (Peter Bourgon)
  • Plot the 99th percentile, the 50th percentile (median) and the average
    • The 50th percentile and the average should be close - else something is wrong
    • Averages sum neatly - a service’s average latency should be the sum of its child services’ latencies

USE method for resources (“resource-scoped”, “internal view”)

Utilization, Saturation, Errors:

  • CPU saturation (Idea: max saturation value per machine, since our load is mostly single-core)
  • Memory saturation
  • Network saturation
  • Disks
    • Storage capacity
    • I/O saturation
  • Software resources
    • File descriptors

Logging

Peter Bourgon has a lot of wise things to say about logging in his brilliant article Logging v. instrumentation.

  • “[S]ervices should only log actionable information. That includes serious, panic-level errors that need to be consumed by humans, or structured data that needs to be consumed by machines.”
  • “Logs read by humans should be sparse, ideally silent if nothing is going wrong. Logs read by machines should be well-defined, ideally with a versioned schema.”
  • “A [service] never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout.”
  • “Finally, understand that logging is expensive.” and “Resist the urge to log any information that doesn’t meet the above criteria. As a concrete example, logging each incoming HTTP request is almost certainly a mistake.”

Alerts

Nobody likes being alerted needlessly. Don’t give Alert Fatigue a chance!

Rob Ewaschuk gives some great advice in Google’s Monitoring Distributed Systems chapter, offered as “a good starting point for writing or reviewing a new alert”:

  • Only alert on actionable and urgent events that consistently affect users negatively, that cannot wait, and that cannot be automated away.
  • Page one person at a time.
  • Only page on novel problems.