Elixir Cluster with libcluster and swarm
Recently I’ve been toying with automatically clustering Elixir nodes.
I wanted to be able to dynamically add and remove Kubernetes pods that would automatically join a cluster.
libcluster provides this functionality, but the docs could use some love
(which they will get, if I find some extra time).
In this cluster, I needed a global process registry.
Having done Erlang in the past, I reached for :gproc without much thought,
but while toying with libcluster I stumbled upon swarm, which I am now trying out as well.
- I use Docker for Mac and its built-in Kubernetes feature for testing, but you can use a regular Minikube with minor modifications.
- I assume you already know Docker and Kubernetes basics.
- I am using Elixir 1.9.1 and the new built-in release tooling.
You can find the example project here: https://github.com/tudborg/elixir_cluster_demo
§Goal
Create a demo project that auto-clusters with libcluster on Kubernetes. We’ll start by using Kubernetes DNS to discover our pods.
§Creating the project
$ mix new elixir_cluster_demo --sup
Then add libcluster and swarm to the dependencies. My MixProject now looks like this:
defmodule ElixirClusterDemo.MixProject do
  use Mix.Project

  def project do
    [
      app: :elixir_cluster_demo,
      version: "0.1.0",
      elixir: "~> 1.9",
      start_permanent: Mix.env() == :prod,
      deps: deps()
    ]
  end

  def application do
    [
      extra_applications: [:logger],
      mod: {ElixirClusterDemo.Application, []}
    ]
  end

  defp deps do
    [
      {:libcluster, "~> 3.1"}, # added
      {:swarm, "~> 3.0"}       # added
    ]
  end
end
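Fetch the dependencies into deps/ before building anything else; the Dockerfile we write later copies the local deps/ directory into the build image:
$ mix deps.get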
I then modify my application module to start libcluster’s Cluster.Supervisor:
defmodule ElixirClusterDemo.Application do
  use Application

  def start(_type, _args) do
    children = [
      {Cluster.Supervisor, [
        Application.get_env(:libcluster, :topologies),
        [name: ElixirClusterDemo.ClusterSupervisor]
      ]}
      # ... your own children here
    ]
    Supervisor.start_link(children, strategy: :one_for_one, name: ElixirClusterDemo.Supervisor)
  end
end
And add a couple of config files to configure my topology:
# config/config.exs
import Config
import_config "#{Mix.env()}.exs"


# config/prod.exs
import Config

config :libcluster,
  topologies: [
    topology: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "elixir-cluster-demo",
        application_name: "elixir_cluster_demo"
      ]
    ]
  ]

# This will exclude all of our remote shells, observers, etc:
config :swarm,
  node_whitelist: [~r/^elixir_cluster_demo@.*$/]
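Note that import_config "#{Mix.env()}.exs" expects a config file for every Mix environment, so you’ll also need (at minimum) a config/dev.exs and config/test.exs that do nothing but import Config:
# config/dev.exs (and similarly config/test.exs)
import Config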
§Setting up release
First we’ll use mix to generate some release configuration files:
$ mix release.init
And then we customize the rel/env.sh.eex file:
# rel/env.sh.eex
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=<%= @release.name %>@$(hostname -i)
See https://hexdocs.pm/libcluster/Cluster.Strategy.Kubernetes.DNS.html for why we use hostname -i instead of the FQDN: the DNS strategy connects to nodes named <application_name>@<pod-ip>, so the node name has to be based on the pod IP.
We’ll need an image to spawn in kubernetes, so let’s create a Dockerfile:
FROM elixir:1.9.1-alpine AS build
WORKDIR /app
ENV MIX_ENV=prod
RUN mix local.hex --force && \
    mix local.rebar --force
# Copy mix.exs and the deps in first
# to cache dependency compilation
COPY mix.exs ./
COPY deps ./deps
RUN mix deps.compile
COPY . .
RUN mix release

FROM elixir:1.9.1-alpine
WORKDIR /app
COPY --from=build /app/_build/prod/rel/elixir_cluster_demo /app
CMD ["/app/bin/elixir_cluster_demo", "start"]
I also have a .dockerignore file that looks like this:
_build/
just to avoid copying our local _build artifacts into the image each time.
Build the image with
$ docker build -t elixir-cluster-demo:latest .
We should now have our image.
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
elixir-cluster-demo latest 21cc505759db About an hour ago 98.1MB
Since this image is already available inside Docker for Mac’s Kubernetes, I don’t need to do anything else.
If your Kubernetes cluster is located elsewhere, you’ll need to push the image to a container registry that the cluster can pull from.
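For example (the registry host below is just a placeholder, substitute your own):
$ docker tag elixir-cluster-demo:latest registry.example.com/elixir-cluster-demo:latest
$ docker push registry.example.com/elixir-cluster-demo:latest
You’d then point the image: field in the Kubernetes deployment at that tag and drop imagePullPolicy: Never.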
§Kubernetes Configuration
The Kubernetes config is very simple.
We are going to create a deployment to manage the replica set and pods for us, and a “headless” service that lets us discover our cluster nodes via DNS (I’m using the default CoreDNS).
Here are the two objects we need (defined in the same file):
# k8s.yml
apiVersion: v1
kind: Service
metadata:
  name: elixir-cluster-demo
spec:
  selector:
    app: elixir-cluster-demo
  clusterIP: None # "headless" service
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elixir-cluster-demo
  labels:
    app: elixir-cluster-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: elixir-cluster-demo
  template:
    metadata:
      labels:
        app: elixir-cluster-demo
    spec:
      containers:
      - name: elixir-cluster-demo
        image: elixir-cluster-demo:latest
        imagePullPolicy: Never # use the locally built image (Docker for Mac)
And to apply the objects:
$ kubectl apply -f k8s.yml
We should now have 3 pods (and containers) running our image, and a headless service that maintains DNS A records under the service name.
You can check that the DNS works:
$ kubectl run my-dns-test-pod -ti --restart=Never --rm --image=alpine -- sh
/ # apk add bind-tools
/ # dig +short elixir-cluster-demo.default.svc.cluster.local
10.1.0.66
10.1.0.65
10.1.0.67
/ # ^D
/ # pod "my-dns-test-pod" deleted
pod default/my-dns-test-pod terminated (Error)
You can try deleting some pods and check again to see how the DNS changes (but not instantly) over time.
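For example (the pod name suffix is a placeholder, use one of your own):
$ kubectl get pods -l app=elixir-cluster-demo
$ kubectl delete pod elixir-cluster-demo-<pod-suffix>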
§Cluster Node Output
If all went well you should see something like this in the pod logs:
16:15:25.758 [info] [swarm on elixir_cluster_demo@10.1.0.69] [tracker:init] started
16:15:25.775 [info] [libcluster:topology] connected to :"elixir_cluster_demo@10.1.0.65"
16:15:25.776 [info] [swarm on elixir_cluster_demo@10.1.0.69] [tracker:ensure_swarm_started_on_remote_node] nodeup elixir_cluster_demo@10.1.0.65
16:15:25.793 [info] [libcluster:topology] connected to :"elixir_cluster_demo@10.1.0.67"
16:15:25.806 [info] [swarm on elixir_cluster_demo@10.1.0.69] [tracker:ensure_swarm_started_on_remote_node] nodeup elixir_cluster_demo@10.1.0.67
16:15:30.724 [info] [swarm on elixir_cluster_demo@10.1.0.69] [tracker:cluster_wait] joining cluster..
16:15:30.724 [info] [swarm on elixir_cluster_demo@10.1.0.69] [tracker:cluster_wait] found connected nodes: [:"elixir_cluster_demo@10.1.0.67", :"elixir_cluster_demo@10.1.0.65"]
16:15:30.724 [info] [swarm on elixir_cluster_demo@10.1.0.69] [tracker:cluster_wait] selected sync node: elixir_cluster_demo@10.1.0.67
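To tail these yourself, pick a pod name from kubectl get pods and run something like (the suffix is a placeholder):
$ kubectl logs -f elixir-cluster-demo-<pod-suffix>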
Great, we now have a cluster.
§Swarm as a process registry
Swarm can be used as a regular process registry. Using this Counter example:
defmodule ElixirClusterDemo.Counter do
  use Agent

  def start_link(name, val) do
    Agent.start_link(fn -> val end, name: via_swarm(name))
  end

  def value(name) do
    Agent.get(via_swarm(name), &(&1))
  end

  def increment(name) do
    Agent.update(via_swarm(name), &(&1 + 1))
  end

  defp via_swarm(name) do
    {:via, :swarm, name}
  end
end
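To try it out, open an iex session on one of the pods. One way is the release’s remote command via kubectl exec (the pod name suffix is a placeholder):
$ kubectl exec -it elixir-cluster-demo-<pod-suffix> -- /app/bin/elixir_cluster_demo remote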
On one of our nodes:
iex(elixir_cluster_demo@10.1.0.80)1> ElixirClusterDemo.Counter.start_link(:my_proc, 0)
{:ok, #PID<0.881.0>}
On a different node, trying to start an Agent under the same name fails:
iex(elixir_cluster_demo@10.1.0.81)1> ElixirClusterDemo.Counter.start_link(:my_proc, 0)
{:error, {:already_started, #PID<28548.881.0>}}
And we can call our process from any of the nodes:
iex(elixir_cluster_demo@10.1.0.82)1> ElixirClusterDemo.Counter.value(:my_proc)
0
iex(elixir_cluster_demo@10.1.0.82)2> ElixirClusterDemo.Counter.increment(:my_proc)
:ok
iex(elixir_cluster_demo@10.1.0.82)3> ElixirClusterDemo.Counter.value(:my_proc)
1
iex(elixir_cluster_demo@10.1.0.81)1> ElixirClusterDemo.Counter.value(:my_proc)
1
iex(elixir_cluster_demo@10.1.0.81)2> ElixirClusterDemo.Counter.increment(:my_proc)
:ok
iex(elixir_cluster_demo@10.1.0.81)3> ElixirClusterDemo.Counter.value(:my_proc)
2
So there you have it. Cluster-wide process registry.
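Besides the :via tuple, Swarm also exposes the registry directly. A quick sketch using two Swarm 3.x functions:
# Look up the pid registered under a name anywhere in the cluster
# (returns :undefined if nothing is registered under that name)
Swarm.whereis_name(:my_proc)

# List all {name, pid} registrations tracked by Swarm
Swarm.registered()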
§Notes
I’ve had swarm deadlock on me multiple times in its :syncing state.
Here is a state dump:
{:syncing,
 %Swarm.Tracker.TrackerState{
   clock: {1, 0},
   nodes: [:"elixir_cluster_demo@10.1.0.79", :"elixir_cluster_demo@10.1.0.78"],
   pending_sync_reqs: [#PID<28596.845.0>],
   self: :"elixir_cluster_demo@10.1.0.77",
   strategy: #<Ring[:"elixir_cluster_demo@10.1.0.79", :"elixir_cluster_demo@10.1.0.78", :"elixir_cluster_demo@10.1.0.77"]>,
   sync_node: :"elixir_cluster_demo@10.1.0.78",
   sync_ref: #Reference<0.4085255612.672137219.185792>
 }}
The pending_sync_reqs entry is never resolved for some reason; I haven’t dug into why yet.
Killing the node that the pending pid (#PID<28596.845.0>) lives on resolves the deadlock, but it never resolves on its own.
This seems to happen when several nodes (3 in this case) join at the same time, e.g. after killing all pods in a replica set.
I probably won’t be using Swarm until I figure out why this happens,
but I haven’t had any problems with libcluster (unless this turns out to be one),
so I’ll keep using it for auto-clustering my nodes from now on.