Building a Cloud™-Free Hashistack Cluster 🌥
Table of Contents
- Preface
- Getting Started
- Safety First: TLS
- Behind the Firewall
- Provisioning With Terraform
- Running a Website
- Final Thoughts
Preface
“Hashistack” refers to a network cluster built on HashiCorp tools, and after spending a considerable amount of time on it, off and on, the architecture of my own cluster (on which this blog is running, among other personal projects) has finally (mostly) stabilized. In this post I will walk you through its high-level structure and some of the benefits it provides, and hopefully show you how you can build a similar cluster for your personal or professional projects.
Everything I discuss here is available publicly in my infrastructure repo. I reference it frequently, and some may find it helpful simply to browse through that code.
The Cloud™
The term “cloud” is not very well-defined, but usually refers to those platforms that provide managed services beyond Virtual Private Server (VPS) provisioning. In this post, I want to draw a distinction between VPS providers and Cloud™ providers. While there are many different VPS providers, here I use the term Cloud™ to refer to the big 3: Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
There is nothing inherently wrong with building applications on the Cloud™, but it’s an area that is already heavily discussed and supported, especially with the recent launch of HashiCorp Cloud Platform. The benefit of these services is that they let you get up and running quickly, but they also often come with a hefty price tag, and can make it harder to switch providers in the future for cost or other reasons. Using VPS providers saves you money, makes it easier to switch, and is, in my opinion, more fun.
My cluster is built on Linode because they offer openSUSE VPS images and DNS management. However, this guide should still be relevant no matter what distribution you’re using, though with some extra steps if you do not have systemd or firewalld available.
Getting Started
Create a Support Server
A primary fixture in my setup is the use of a “support” server, which is a single VPS instance that acts as the entrypoint for the rest of the cluster. Most of the infrastructure is provisioned with Terraform and is designed to be easily replaceable; the support server is the lone instance which is cared for as a pet rather than cattle. This is very similar in concept to a bastion server, but with less of a focus on security, and more on cost savings and functionality.
The support server’s functions include:
- Cluster members are provisioned with a random root password which is never exposed; access is only granted via SSH public keys, and never to root (after provisioning has finished). Restricting authorized keys to only what is available on the support server is an easy way to tighten your security. (My setup is actually slightly different in that servers only allow access with the public keys defined in Linode, and I always forward my SSH agent to the support server, but I still do all cluster operations on the support server.)
- The support server acts as the guardian of the Certificate Authorities, and new certificates are only issued by making a request to the support server.
- The support server maintains Terraform state. Setting up a backend is an option here as well, but for relatively simple uses like mine, it’s easier to stick with the “local” backend on the support server.
- Cheap artifact hosting. As long as you have a server running with a known address, you can have your support server host all your artifacts and serve them with minio or even a plain HTTP server.
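As a rough sketch of that last point (the directory and port here are illustrative, not taken from my repo), serving an artifacts directory with MinIO looks something like:

```bash
# Hypothetical example: serve /srv/artifacts as S3-compatible storage with MinIO.
# Nomad's artifact stanza can then download from it, given an appropriate access policy.
minio server /srv/artifacts --address ":9000"
```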
A Note on IPv6
Where possible, everything is configured to communicate over IPv6. Despite its slow adoption, IPv6 is a good choice here: it is more efficient, it opens up another possible route for cost savings given the scarcity of IPv4 addresses, and VPS providers are more likely to support it than Internet Service Providers anyway.
Safety First: TLS
In order to safely restrict access to cluster resources, the first step you’ll want to take with your support server is to generate the Certificate Authorities that will be used to configure TLS for each of the services. My setup largely follows the approach outlined in HashiCorp’s guide to enabling TLS for Nomad, which goes more in-depth on how to use cfssl to get set up.
It might be overkill, but I use a different CA for each service, and they are stored on the support server under /etc/ssl:
/etc/ssl
├── consul
│ ├── ca-key.pem
│ └── ca.pem
├── nomad
│ ├── ca-key.pem
│ └── ca.pem
└── vault
├── ca-key.pem
└── ca.pem
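For reference, generating one of these CAs with cfssl might look something like the sketch below; the CSR values (CN, key algorithm, expiry) are my own illustrative choices, not taken from the repo.

```bash
# Sketch: create the Consul CA under /etc/ssl/consul.
mkdir -p /etc/ssl/consul && cd /etc/ssl/consul

cat > ca-csr.json <<'EOF'
{
  "CN": "Consul CA",
  "key": { "algo": "ecdsa", "size": 256 },
  "ca": { "expiry": "87600h" }
}
EOF

# `cfssl gencert -initca` emits the new CA as JSON; `cfssljson -bare ca`
# splits it into ca.pem and ca-key.pem.
cfssl gencert -initca ca-csr.json | cfssljson -bare ca
```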
Another important security note is that the key permissions should be as restrictive as possible:
-r-------- 1 root root 227 Jul 23 2019 /etc/ssl/consul/ca-key.pem
-r--r--r-- 1 root root 1249 Jul 23 2019 /etc/ssl/consul/ca.pem
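If you generate them as root, something like this gets the permissions there:

```bash
# Only root can read the private key; the CA certificate itself is public.
chmod 400 /etc/ssl/consul/ca-key.pem
chmod 444 /etc/ssl/consul/ca.pem
```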
CFSSL Configuration
CFSSL is a general-purpose CLI tool for managing TLS files, but it also has the ability to run a server process for handling new certificate requests. That requires defining a configuration file at /etc/ssl/cfssl.json on the support server:
{
  "signing": {
    "default": {
      "expiry": "87600h",
      "usages": [
        "signing",
        "key encipherment",
        "server auth",
        "client auth"
      ],
      "auth_key": "primary"
    }
  },
  "auth_keys": {
    "primary": {
      "type": "standard",
      "key": "..."
    }
  }
}
The primary auth key here must be a hex-encoded secret, and it is used to prevent unauthorized parties from requesting certificates. All new certificate requests effectively use that key as a password, so treat it just like you would treat your private keys by never checking it into source control. For more details on CFSSL configuration, see CloudFlare’s post on building your own public key infrastructure.
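One way to generate a suitable key (assuming 16 random bytes of entropy is enough for your purposes):

```bash
# Produce a random hex string for the "primary" auth key (32 hex characters).
openssl rand -hex 16
```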
There is one other tool that CFSSL provides but isn’t mentioned in the article: multirootca, which is effectively a multiplexer for CFSSL. By default, the CFSSL server will only issue certificates for a single Certificate Authority; multirootca lets you run the server in a way that supports multiple authorities. It requires its own configuration file, but a very simple one:
[ consul ]
private = file:///etc/ssl/consul/ca-key.pem
certificate = /etc/ssl/consul/ca.pem
config = /etc/ssl/cfssl.json
[ vault ]
private = file:///etc/ssl/vault/ca-key.pem
certificate = /etc/ssl/vault/ca.pem
config = /etc/ssl/cfssl.json
[ nomad ]
private = file:///etc/ssl/nomad/ca-key.pem
certificate = /etc/ssl/nomad/ca.pem
config = /etc/ssl/cfssl.json
The multirootca service is then run under systemd so that it can keep running in the background, serving incoming certificate requests.
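The unit file itself lives in the repo, but the command it runs boils down to something like the following sketch; the listen address and config path are hypothetical, so check `multirootca -help` for the exact flags in your version.

```bash
# Serve signing requests for all of the CAs defined in the multirootca config.
multirootca -a "[::]:8888" -roots /etc/ssl/multiroot.conf
```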
Issuing new certificates is done from every cluster member via this script, which uses the CFSSL CLI to make a gencert request to the running multirootca service on the support server.
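A hedged sketch of the kind of request that script makes (the hostnames and file paths here are hypothetical, and the client-side config needs to contain the matching auth key so the request is accepted):

```bash
# Request a Consul certificate for this node from the support server's
# multirootca endpoint, then split the response into consul.pem / consul-key.pem.
cat > /tmp/csr.json <<EOF
{
  "CN": "$(hostname)",
  "hosts": ["localhost"],
  "key": { "algo": "ecdsa", "size": 256 }
}
EOF

cfssl gencert \
  -remote "support.example.com:8888" \
  -label  consul \
  -config /etc/ssl/cfssl-client.json \
  /tmp/csr.json | cfssljson -bare /etc/ssl/consul/consul
```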
Like the support server, certs and keys on cluster members all live under /etc/ssl, grouped by application name, including the certificate authority’s public certificate.
One thing to note is how Consul, Nomad, and Vault interact with each other, since that affects which certificates you need to issue. Vault depends on Consul, and Nomad depends on both Consul and Vault, so an instance running a Nomad agent will have a lot of certificates in /etc/ssl/nomad:
-rw-r--r-- 1 nomad nomad 692 Jun 14 21:59 ca.pem
-r--r----- 1 nomad nomad 228 Jun 14 22:00 cli-key.pem
-rw-r--r-- 1 nomad nomad 714 Jun 14 22:00 cli.pem
-r-------- 1 nomad nomad 228 Jun 14 22:00 consul-key.pem
-rw-r--r-- 1 nomad nomad 970 Jun 14 22:00 consul.pem
-r-------- 1 nomad nomad 228 Jun 15 19:07 nomad-key.pem
-rw-r--r-- 1 nomad nomad 803 Jun 15 19:07 nomad.pem
-r-------- 1 nomad nomad 228 Jun 14 22:00 vault-key.pem
-rw-r--r-- 1 nomad nomad 714 Jun 14 22:00 vault.pem
A Note on Hostnames
While working on this project, the most common TLS-related errors I encountered were “unknown
certificate authority” and “bad hostname.” The former is usually pretty easy to fix; just ensure
ca.pem
is available on every node and that it’s being used as the relevant CA in the configs; but
the latter requires a little more thought.
Every node needs to consider how it is going to be queried. By default, issue-cert.sh considers only localhost to be a valid hostname, which means that only API requests to localhost will be accepted; requests from anywhere else (like the support server) will be rejected. If you want to query your node using another name, it needs to be included as a valid hostname when the certificate is issued.
For all nodes, the public IP address is a common alternative hostname to specify. This will let you query the node from anywhere as long as your CLI is configured with its own valid certificate (a separate script makes this pretty easy; it’s very similar to the one used during node provisioning, but it operates directly on the CA private key instead of using the remote).
In addition, there are a couple of special cases to consider (the certificate-request sketch after this list shows how the extra names get added):
- Consul services should add <name>.service.consul as a valid hostname. Both Nomad and Vault servers register their own services, so they should add nomad.service.consul and vault.service.consul respectively.
- All Nomad agents, both servers and clients, should add their special hostname, which is constructed from the agent’s role and region. All Nomad agents in my cluster stick with the default region global, so Nomad servers use server.global.nomad and clients use client.global.nomad.
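Concretely, these extra names just get added to the certificate request; cfssl’s -hostname flag is one way to do it. Here is a sketch with a hypothetical public IP, reusing the request file from earlier:

```bash
# Issue a Nomad server certificate that is valid for localhost, the node's
# public IP, its Nomad hostname, and its Consul service name.
cfssl gencert \
  -remote   "support.example.com:8888" \
  -label    nomad \
  -config   /etc/ssl/cfssl-client.json \
  -hostname "localhost,203.0.113.10,server.global.nomad,nomad.service.consul" \
  /tmp/csr.json | cfssljson -bare /etc/ssl/nomad/nomad
```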
Behind the Firewall
With any cluster, a properly configured firewall is a must. I use firewalld, which is the new default for openSUSE, and it’s not too difficult to configure.
firewalld defines two important concepts for classifying incoming connections: services and zones. Services simply define a list of protocol/port pairs identified by a name; for example, the ssh service would be defined as tcp/22, because it requires TCP connections on port 22. Zones, roughly speaking, are used to classify where a connection is coming from and what should be done with it, such as “for any connection to one of these services, from one of these IP addresses, accept it.” Connections that aren’t explicitly given access will be dropped by default.
The full list of features firewalld provides for zones is outside the scope of this post, and if you plan to use firewalld, it’s probably good to read more. However, it is still useful even with a very simple configuration.
One benefit of having TLS configured for Consul, Nomad, and Vault is that it is perfectly safe to open their ports to any incoming connection regardless of source IP, since connections will be rejected if they do not have a valid client certificate anyway. There is a lot of room for flexibility here though, and further restrictions may be wanted if you expect sensitive information to go through your cluster.
Creating a Cluster-Only Zone
The natural fit for a more secure zone is one that only processes requests coming from other nodes inside your cluster. While my setup leaves many ports open to the world, there is one exception: Nomad client dynamic ports. While connections to Nomad directly require a client certificate, I wanted my applications running on Nomad to be able to communicate with each other (more on that below), and that requires opening up the dynamic port range to the other Nomad clients.
To do this, I created a new service called nomad-dynamic-ports that grants access to the port range used by Nomad. All applications running on Nomad that request a port will be assigned a random one from this range, so we want to open up the whole range, but only to other Nomad clients.
Each Nomad client is provisioned with a zone called nomad-clients, which allows access to the nomad-dynamic-ports service but specifies no sources, so by default no connections will land in this zone. In order for it to work, we need to add the IP address of every other Nomad client as a source to this zone, and to do this on every client.
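In terms of raw firewall-cmd invocations, the setup described above amounts to something like this sketch (20000-32000 is Nomad’s default dynamic port range, and the IPv6 source is a placeholder):

```bash
# Define the service covering Nomad's dynamic port range.
firewall-cmd --permanent --new-service=nomad-dynamic-ports
firewall-cmd --permanent --service=nomad-dynamic-ports --add-port=20000-32000/tcp

# Create the zone and allow that service within it.
firewall-cmd --permanent --new-zone=nomad-clients
firewall-cmd --permanent --zone=nomad-clients --add-service=nomad-dynamic-ports

# Run once per *other* Nomad client, on every client, to admit its address.
firewall-cmd --permanent --zone=nomad-clients --add-source="2001:db8::1234"

firewall-cmd --reload
```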
To do this, I wrote a script that uses Terraform output to get a list of all the Nomad client IP addresses, then SSHes on to each one to make the necessary updates. This script can be run automatically by Terraform with a null_resource, which helps keep things in sync.
Provisioning With Terraform
Terraform was not actually my go-to solution for provisioning. Initially my plan was to stay simple
and stick with scripts like deploy-new-server.sh
using Linode’s API, but I ended up moving over to
Terraform for one big reason: state management. Terraform’s big win is keeping track of what you’ve
already deployed, which makes cluster management much easier. In particular, you can provision your
client nodes with existing
knowledge
of your Consul servers, and write
scripts
that can use that knowledge after-the-fact to make additional changes. All of these operations are
much easier with a state management tool than they would be if you had to query your VPS’ API every
time you wanted to know a node’s IP address.
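For example, once the state lives on the support server, a follow-up script can ask Terraform for addresses directly (the output name here is hypothetical):

```bash
# List the Nomad client IPs recorded in Terraform state, one per line.
terraform output -json nomad_client_ips | jq -r '.[]'
```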
Overall Structure
How to organize Terraform code is a question of constant debate, and the right answer is that there is no right answer. A lot of it depends on how you organize your teams, so bear in mind that my cluster is maintained by a team of one.
My module structure has one top-level module, with one module for each “role” that my nodes will play:
├── main.tf
├── outputs.tf
├── secrets.tfvars
├── consul-server
│ ├── main.tf
│ ├── outputs.tf
│ └── variables.tf
├── nomad-client
│ ├── main.tf
│ ├── outputs.tf
│ └── variables.tf
├── nomad-server
│ ├── main.tf
│ ├── outputs.tf
│ └── variables.tf
└── vault-server
├── main.tf
├── outputs.tf
└── variables.tf
Each of these modules has a number of variables in common, including how many instances to create, which image to use when creating the node, and a couple of other values. Most of their inputs are the same, but this structure provides a lot of flexibility, and common values are usually sourced from a block of shared locals.
This setup has several advantages, primarily flexibility and a top-level main.tf that describes the makeup of the cluster very cleanly, but the downside is that the module definitions are fairly verbose. Terraform doesn’t appear to provide any utilities for defining a set of provisioners that can be shared across resources, which would help quite a bit.
Division of Provision
The provisioning of a new node is split between a custom stackscript and Terraform provisioners. The stackscript installs packages and does other configuration that is common across nodes, while the Terraform provisioners are used to copy configuration files up from the infra repo directly and to write configuration files that are dependent on cluster knowledge, such as the addresses of the Consul servers.
An alternative, and arguably better, setup would be to use Packer to define the node images, leaving nothing for Terraform to do except deploy instances and do the little configuration that requires cluster knowledge. Unfortunately, this is an area where Linode may not be a great choice; while Packer does support a Linode image builder, custom Linode images don’t appear to be compatible with their network helper tool, which causes basic networking to be broken by default.
Naming Things
Initially, I took a very simple approach to naming nodes by their role, region, and an index, such as nomad-server-ca-central-1. However, this approach lacks flexibility when it comes to upgrading your cluster. If you want to replace a node, it is safest to create a new one and make sure it’s up and running before destroying the old one, but now your carefully numbered servers are no longer in order.
Fortunately, Terraform provides a random provider that can instead be used to generate random identifiers for node names. I use something similar to this:
resource "random_id" "servers" {
count = var.servers
keepers = {
datacenter = var.datacenter
image = var.image
instance_type = var.instance_type
consul_version = var.consul_version
nomad_version = var.nomad_version
}
byte_length = 4
}
resource "linode_instance" "servers" {
count = var.servers
label = "nomad-server-${var.datacenter}-${replace(random_id.servers[count.index].b64_url, "-", "_")}"
}
This gives each Nomad server a name like nomad-server-ca-central-XXXXXX, where XXXXXX is a URL-safe base64-encoded random string. Because the label already ends in a dash, a random id that happens to begin with a dash would produce two consecutive dashes, which Linode doesn’t allow in instance labels; the replace() function swaps dashes for underscores to prevent that provisioning failure. (It’s happened to me once already, and it’s not a fun reason for an apply to fail.)
Running a Website
At this point, we’ve covered pretty much everything you need to spin up a functional cluster. However, as I mentioned before, this blog is currently running on my own cluster, and supporting a website takes a number of extra steps. In this section I will cover the points that are specific to running a website on this setup; web servers are similar to any other job type in many respects, but a few additional concerns bear special mention.
Load Balancing
Running a website on Nomad makes it easy to scale up, but running more than one instance of a web server requires some form of load balancer. The big name in load balancers is HAProxy, but a few newer ones can take advantage of Consul’s service-registration features in order to “just work” with no or minimal configuration. For this website I chose Fabio, but Traefik is another good option.
Regardless of which you choose, you will then have to decide how to run it. Naturally, I decided to run Fabio as a Nomad job too, but due to the nature of load balancing, it has tighter restrictions for how it can run. Most jobs, including the web server itself, don’t actually care which nodes they run on, but load balancers need their host to be registered with DNS. This means that we need the nodes themselves to know whether they are intended to run a load balancer or not.
Nomad provides a number of filtering options for jobs, including custom metadata, but I decided to go with the node_class attribute. This is a custom value that you can explicitly assign to each Nomad client for filtering purposes, and it has the added benefit over custom metadata of being included in node status output:
damien@support:~> nomad node status
ID DC Name Class Drain Eligibility Status
e9d5cdfe ca-central nomad-client-ca-central-UgcT5Q load-balancer false eligible ready
67d7b064 ca-central nomad-client-ca-central-4XMmYQ <none> false eligible ready
Fabio jobs can then be specified to run exclusively on load-balancer nodes with:
constraint {
attribute = "${node.class}"
value = "load-balancer"
}
DNS Management
Once the load-balancer node is up and running an instance of Fabio, everything should technically be available on the internet, but it won’t be very easy to reach without a domain name. However, it would also be a pain to manually update a DNS management system with new records every time your cluster changes.
Fortunately, DNS records can be considered just another part of your infrastructure, and can therefore be provisioned with Terraform! This means that any time a load-balancer node is created or destroyed, a DNS record is created or destroyed along with it, automatically keeping your domain name in sync with the available load balancers.
To support this, I defined a Terraform module called domain-address, which takes as input the domain, a name for the record, and a list of Linode instances. The linode_domain_record resource can then be used to define A and/or AAAA records pointing to the IPv4 and/or IPv6 addresses respectively:
data "linode_domain" "d" {
domain = var.domain
}
resource "linode_domain_record" "a" {
for_each = toset(terraform.workspace == "default" ? var.instances[*].ip_address : [])
domain_id = data.linode_domain.d.id
name = var.name
record_type = "A"
target = each.value
}
resource "linode_domain_record" "aaaa" {
for_each = toset(terraform.workspace == "default" ? [for ip in var.instances[*].ipv6 : split("/", ip)[0]] : [])
domain_id = data.linode_domain.d.id
name = var.name
record_type = "AAAA"
target = each.value
}
One thing to note here is the terraform.workspace check within the for_each line. This is to support development flows that use Terraform workspaces, which can be useful for testing cluster changes (such as OS upgrades) without affecting the existing deployment. DNS records are global, so this check ensures that they are only created within the default workspace and aren’t overwritten to point to a non-production cluster.
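A typical test flow under that setup looks roughly like this sketch (the workspace name is just an example):

```bash
# Spin up a parallel test cluster in its own workspace; DNS records are skipped
# because the workspace is not "default".
terraform workspace new os-upgrade-test
terraform apply

# ...verify the new cluster, then tear it down and return to the production state.
terraform destroy
terraform workspace select default
```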
Cert Renewals
The last step is to set up automatic SSL certificate renewal. If you don’t need or want to serve your website over HTTPS, then you can skip this step, but most websites should probably be served securely and therefore will need SSL.
In addition to providing orchestration for always-on services, Nomad supports something akin to cron jobs in the form of the periodic stanza. With this, we can write a Nomad job that runs our SSL renewal regularly so that the certificate’s validity never lapses.
Getting the Certificate
The first step is deciding which SSL service and renewal tool to go with. Let’s Encrypt is the big name in this space because it’s free and run by a nonprofit, but it isn’t a hard requirement: any service you choose will work as long as it has an API for automatic renewal.
For the tool, I decided to go with acme.sh because it provides a nice interface with minimal dependencies, though there are a number of other clients available for any ACME-compatible service.
The Challenge
The ACME protocol requires you to prove that you own the domain being renewed by completing a challenge, with the two main options being HTTP and DNS. HTTP challenges work by giving you some data and verifying its existence under http://<domain>/.well-known/acme-challenge/; DNS challenges work similarly, but expect the data to be available as a TXT record on the domain.
Due to the distributed nature of jobs running on Nomad, the HTTP challenge is not really viable, so I recommend using the DNS challenge along with your DNS provider’s API, such as dns_linode_v4.
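With acme.sh, a DNS-challenge issuance against Linode looks roughly like the sketch below; the token variable name comes from acme.sh’s dnsapi documentation, so double-check it against your version, and the domain is obviously a placeholder.

```bash
# Token for Linode's APIv4, read by the dns_linode_v4 hook.
export LINODE_V4_API_KEY="<linode-api-token>"

# Issue (or renew) a certificate by publishing the challenge as a TXT record.
acme.sh --issue --dns dns_linode_v4 -d example.com -d 'www.example.com'
```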
Storing Certificates in Vault
Fabio supports loading SSL certificates from Vault, so after the challenge succeeds, that’s where we will want to save the cert and key. However, this also becomes a little tricky due to the distributed nature of this job, since Vault will be running on a different server.
Fortunately, Vault has an HTTP API, so as long as we have the address of at least one available Vault server and a valid token, the job can send an HTTPS request with the new cert and key contents. With acme.sh, this takes the form of specifying --reloadcmd with a script like vault-write-certs.sh, which can be made available as a downloadable artifact for the renewal job.
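As a sketch of what such a reload script might do (the KV path, field names, and environment variables here are assumptions; check Fabio’s certificate-store documentation for the exact layout it expects):

```bash
# Push the renewed certificate and key into Vault over its HTTP API.
# Assumes VAULT_ADDR, VAULT_TOKEN, CERT_FILE, and KEY_FILE are set by the job.
payload=$(jq -n --rawfile cert "$CERT_FILE" --rawfile key "$KEY_FILE" \
  '{cert: $cert, key: $key}')

curl --fail --silent --show-error \
  --header "X-Vault-Token: ${VAULT_TOKEN}" \
  --request POST \
  --data "$payload" \
  "${VAULT_ADDR}/v1/secret/fabio/example.com"
```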
Final Thoughts
The architecture described in this post was born out of a desire to better understand how cluster networking works, a general interest in HashiCorp tools and philosophies, and a purely pragmatic desire to be able to avoid cloud vendor lock-in when needed.
This post was written over the span of several months and ended up being pretty long, so it may not be the easiest to read, and some of its links may also fall out of date as I continue making updates to my architecture.
If you find anything that’s broken, or you just have a question or comment, feel free to shoot me an ✉️ email.