Or how to (ab)use Kubernetes to get your way.
What I want to achieve
I want to forward a software's traffic so that its public IP is not the server's IP.
Preferably I'd like to use Wireguard, which is a modern, secure, very-low-overhead, kernel-based L3 VPN. In short, Wireguard has very few lines of code compared to other VPN solutions and uses simple, modern cryptography. It's integrated in the Linux kernel (as well as OpenBSD, and apparently soon FreeBSD). Plus the documentation is awesome :).
My requirements are:
- As setup-agnostic as possible
- Only route internet traffic through it (i.e. keep local network access working!)
- Simple. Probably the most difficult part :).
How my setup is done
More details in an upcoming blog post! Subscribe to RSS for more :D.
My setup is basically a weird beast made of K3s (a lightweight Kubernetes distribution) running on top of a NixOS host. I won't delve into why I have this setup as it deserves a blog post of its own.
Most of my applications are on Kubernetes, so that's what I'll use! Unless there are strong reasons to go with host-based services instead.
Easy testing setup
wg-quick is a shell wrapper that makes deploying Wireguard easier.
Let's take the Wireguard configuration provided by my VPN provider and enable it using wg-quick up.

What does it do? It creates a network interface named wg0 (see ip a) and adds some ip rules. More on this later!
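To make this concrete, here's a sketch of that test session, assuming the provider's configuration was saved as /etc/wireguard/wg0.conf (path hypothetical):

```shell
# Assumes a valid provider config at /etc/wireguard/wg0.conf (root required)
wg-quick up wg0

ip a show wg0    # the freshly created wireguard interface
ip rule          # the policy-routing rules wg-quick added
```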
That's nice, but I only want one process' traffic routed through the VPN, not everyone's. So how do I do that?
Well since I use Kubernetes, I'm using containers (also known as Docker or OCI containers). So let's try to work with that.
In-Docker setup
Let's try running Wireguard inside a container, shall we?
docker run -it ubuntu bash
We need to install iproute2 (the ip tool we saw above ;)), vim and a couple of useful tools. Note that on Ubuntu, wg-quick ships in the wireguard-tools package.

apt update && apt install -y vim wireguard-tools iproute2

Let's try to run wg-quick!
RTNETLINK answers: Operation not permitted
It errors out with a Netlink error. This is a permission issue: wg-quick is trying to access privileged kernel interfaces, but by default Docker containers run with reduced privileges.
We can use --privileged to run the container, which will solve our error, for now. (DO NOT USE THIS AT HOME, IT'S NOT THE END SOLUTION!)
Setup works using wg-quick! We can test this using curl https://ifconfig.me
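Putting the whole throwaway experiment together; a sketch, assuming your provider's config sits in the current directory as wg0.conf:

```shell
# Throwaway test only -- --privileged is NOT the end solution!
docker run -it --rm --privileged \
  -v "$PWD/wg0.conf:/etc/wireguard/wg0.conf:ro" \
  ubuntu bash

# Then, inside the container:
apt update && apt install -y wireguard-tools iproute2 curl
wg-quick up wg0
curl https://ifconfig.me   # should now print the VPN's exit IP
```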
PoLP (Principle of Least Privilege)
As a security-aware person I don't really like having a --privileged laying around. So how can we fix this? Let's discuss a bit what --privileged actually does (or rather, what its absence does).

Running without --privileged restricts two things which are important in our case: capabilities (see man capabilities or online) and syscalls.

Capabilities are Linux's answer to splitting root into separate privileges. So when you're root on Linux, you're not necessarily the big boss on the machine. The other restriction --privileged lifts is syscall filtering via seccomp, but we don't really care about that here!

The capability we want is network related, and CAP_NET_ADMIN gives us exactly that!
Quick googling shows the answer: https://stackoverflow.com/questions/27708376/why-am-i-getting-an-rtnetlink-operation-not-permitted-when-using-pipework-with-d
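As that answer suggests, we can drop --privileged and grant only the single capability we need; same sketch as before, minus the blanket privileges:

```shell
# Only CAP_NET_ADMIN instead of full --privileged
docker run -it --rm --cap-add NET_ADMIN \
  -v "$PWD/wg0.conf:/etc/wireguard/wg0.conf:ro" \
  ubuntu bash
```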
Okay, that's good and all, but I haven't achieved much yet: I've just made a way to tunnel a container's traffic. It's also a bit constraining, because if I install the software inside this same container, its version gets tied to wg-quick's, which is annoying. As a programmer I like good abstractions, and since this software shouldn't care what public IP it has, I want to abstract this "setup part" away from the original software.
The Kubernetes way
Remember I said "I use a Kubernetes distribution"? Let's get on the same page terminology-wise.
What are pods
K8s represents things as resources, and the most basic one, the Pod, is one (or more) containers sharing namespaces. One interesting characteristic of a pod is that it runs all its containers in one networking namespace. See where this is going? :D
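A quick way to convince yourself of that shared namespace (pod and container names here are made up):

```shell
# Every container in a pod sees the same interfaces and IPs
kubectl exec mypod -c app     -- ip a
kubectl exec mypod -c sidecar -- ip a   # identical output: one shared netns
```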
Let's put what we had before in the Docker container into a pod!
We still have the problem of running the software separately with this unusual networking setup.
Turns out, pods have something called initContainers, which are containers that run during the initialization part of a workload. Since the network namespace is shared, we can set up the interface there, and then run our workload transparently!
A couple of knobs to tune for this to work:
- DNS setup for K8s. We get an error while setting the DNS since it's K8s-provided and we're messing with the underlying network with (unfortunately) overlapping ranges, so let's hardcode a public one (1.1.1.1) and be done with it. I don't use K8s' service discovery there anyway, but that's something to be aware of.
Local networking access
Wait, not everything is working properly yet!
We still need to access the web interface of that application. The container listens on port 80, but the return traffic is forwarded via the default route to our upstream VPN provider, so trying to curl it just hangs forever.
So we need to add ip rules for local traffic. Since our upstream seems to use IPs in 10.0.0.0/8, and I'd prefer a generic solution there rather than hardcoding the upstream IP :), I added the following to the Wireguard configuration. It adds a route for 10.0.0.0/8 specifically, overriding the "default" route so that local traffic uses the local interface instead of the Wireguard one. This lets local traffic go back to its origin.
PostUp = ip route add 10.0/8 dev eth0 table 51820
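With that PostUp rule in place, we can sanity-check from inside the pod which way packets will go (assuming, as above, that 10.0.0.0/8 is the local range; addresses are examples):

```shell
# Local range should resolve via eth0, everything else via the tunnel
ip route get 10.0.0.42       # expect: dev eth0 (our PostUp override)
ip route get 93.184.216.34   # expect: dev wg0 (the VPN default route)
```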
Here is the final manifest!
apiVersion: apps/v1
kind: Deployment
metadata:
  name: something
spec:
  selector:
    matchLabels:
      app: something
  template:
    metadata:
      labels:
        app: something
    spec:
      # We change this from the normal configuration which points to CoreDNS
      # because we're using ~wg-quick~ which makes all traffic go to the ~wg0~
      # interface, thus making the default DNS inaccessible. So we need
      # to provide an alternative, public DNS server.
      #
      # Note that this works because this container doesn't contact any other
      # containers, so it's perfectly fine not to have internal DNS resolution.
      #
      # dnsPolicy None allows pods to ignore K8s' default resolv.conf.
      dnsPolicy: "None"
      dnsConfig:
        # We could use DNSCrypt / DoH / DoT here but well, we won't leak much
        # information anyway.
        nameservers:
          # Cloudflare
          - "1.1.1.1"
          # Google
          - "8.8.8.8"
        # edns0 for 512 bytes+ UDP DNS requests.
        options:
          - name: edns0
            value: ""
      # The way this works is the following:
      # Pods in K8s share the same network namespace (they are on the same
      # node). So we can set up the network namespace to have a wireguard
      # interface and add ip rules / iptables options to make everything go
      # through it by default.
      #
      # Once setup is finished, we continue to the "normal containers".
      #
      # Cleanup is automatic when the container exits via the Linux Garbage
      # Collector TM.
      #
      # The more intelligent way (forward) would be to have a way to patch
      # containers on the fly to make them "remote" by default. We can do that
      # by means of MutatingWebhook from K8s' dynamic admission controllers.
      initContainers:
        - name: wg-init
          image: wg-init:some-tag
          # We need the CAP_NET_ADMIN capability here:
          # - to create the ~wg0~ network interface;
          # - to create ip rules / iptables rules to redirect all traffic but
          #   wireguard's own through ~wg0~.
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
          args:
            - wg-quick
            - up
            - wg0
          volumeMounts:
            - name: wg-key
              mountPath: /etc/wireguard
      containers:
        - name: some-container
          image: docker.io/some-container:some-tag
          # Web UI
          ports:
            - name: http
              containerPort: 8080
          volumeMounts:
            - name: content
              mountPath: /content
      volumes:
        # NB: we need to remove the `DNS =` line from the configuration,
        # otherwise wg-quick tries calling resolvconf, which doesn't work in a
        # container. We patch the resolv.conf configuration from the PodSpec
        # instead; otherwise it would fail for all DNS requests.
        - name: wg-key
          secret:
            secretName: wg-key
            defaultMode: 0420
        - name: content
          persistentVolumeClaim:
            claimName: content
Going further
We can go further by creating a mutating webhook server in Kubernetes to provide this automatically to containers carrying an annotation, for instance. This would use MutatingWebhookConfiguration to auto-patch the pod(s) instead of having a static configuration like this. See this for an example of a patching server.
Probably something for yet another blog post!
On a pure Docker setup, we can use something called OCI hooks: executables launched at certain points in a container's lifecycle (like prestart) to do "things". I've used that in the past to move an interface from the host into a container.