How A Cryptocurrency Miner Made Its Way onto Our Internal Kubernetes Clusters

09-20-2018 2587 words 13 minutes

/posts/2018-09-20-kubernetes-crypto-miner/featured-image.png

Contents

Medium post: https://medium.com/jw-player-engineering/how-a-cryptocurrency-miner-made-its-way-onto-our-internal-kubernetes-clusters-9b09c4704205

The explosion of cryptocurrency in recent years spurred a wave of exploits targeting unsuspecting machines to mine cryptocurrency for the attackers. Earlier in the year, the JW Player DevOps team discovered one of the aforementioned miners running on our development and staging Kubernetes clusters.

To be clear, our production cluster was not affected, no JW Player customer data was accessed or exposed, and service was uninterrupted. Malicious actors are not always intent on stealing information or taking a website down, they can be just as content (or more so) in stealing your compute power. We take any intrusion very seriously though, and wanted to share our findings to help other DevOps teams harden their systems.

This blog post is broken up into several parts detailing — discovery and diagnosis, our immediate response, discovering and replicating the attack vector, damage assessment, and plans for preventative measures to further protect our systems.

Discovery

Day 0, 21:06 EST: Datadog alerted us (the DevOps team) to a high normalised load average on our staging environment. The cause of the high load averages was determined to be an increased load on one of our legitimate services. This was normal behaviour.

Day 1, 2018, 16:53 EST: Another Datadog alert was triggered with the same high normalised load average issue, this time on our development environment. That same service repeatedly triggered alerts from it constantly scaling up and scaling down on both development and staging environments. Due to the initial triage of the previous alert and the volume of incoming alerts of the same type, those alerts were muted until the flux stabilised.

Day 3, 17:40 EST: The increased load over the course of 4 days across both clusters was no longer considered normal. Further investigation was necessary to either address the increased load or tweak the Datadog monitors. I logged onto one of the Kubernetes instances via SSH and examined resource consumption using top. A gcc process ran at 100% CPU utilisation and was an immediate suspect for high load averages on the machine. This process was found to be running across every machine in the development and staging clusters.

Diagnosis

Note: All terminal output has been truncated to only show relevant information. A truncated output is denoted with an ellipsis ... .

Initially, I believed a defective container was launched through our internal deployment tool and ran gcc. However since the master nodes were affected, that hypothesis seemed incorrect. Since JW Player follows the practice of keeping infrastructure as code, I double checked our repositories for any applied yaml configurations or deployments that ran gcc and found nothing suspicious.

For further visibility into why gcc would be running at all, I inspected the process:

admin@ip-10-10-201-13:~$ cat /proc/29254/status
Name:    gcc
Umask:    0022
State:    S (sleeping)
Tgid:    29254
Ngid:    0
Pid:    29254
PPid:    3391
...

admin@ip-10-10-201-13:~$ ps aux | grep 3391
root      3391  0.0  0.0 413632  3740 ?        Sl   Sep05   0:00 docker-containerd-shim 7f8f77e3ad08b863841cadc833577f27062ed398546f663a8a77f778ba6c8d3d /var/run/docker/libcontainerd/7f8f77e3ad08b863841cadc833577f27062ed398546f663a8a77f778ba6c8d3d docker-runc

Strange. A Docker container is running gcc despite our code base stating otherwise. Inspecting the Docker container reveals that Weave Scope was the parent process:

admin@ip-10-10-201-13:~$ sudo docker inspect 7f8f77e3ad08b863841cadc833577f27062ed398546f663a8a77f778ba6c8d3d
[
    {
        "Id": "7f8f77e3ad08b863841cadc833577f27062ed398546f663a8a77f778ba6c8d3d",
        "Created": "2018-09-05T17:45:20.073800454Z",
        "Path": "/home/weave/scope",
        "Args": [
            "--mode=probe",
            "--probe-only",
            "--probe.docker.bridge=docker0",
            "--probe.docker=true",
            "--probe.kubernetes=true",
            "weave-scope-app.weave.svc.cluster.local:80"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 3408,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2018-09-05T17:45:20.19489647Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
        "Image": "sha256:4b07159e407beba7d74f1986d3b90e3103f33b87269991400ca2fd1eedf1f4eb",

Weave Scope is a tool used to monitor Kubernetes in real time, and there was no reason for it to be running gcc. At this point I found that the "gcc" binary was actually a cryptocurrency miner with the filename gcc.

objdump:

admin@ip-10-10-201-13:/$ objdump -sj .rodata /gcc
...
 f0c00 584d5269 6720322e 362e320a 20627569  XMRig 2.6.2. bui
 f0c10 6c74206f 6e204d61 79203330 20323031  lt on May 30 201
 f0c20 38207769 74682047 43430000 00000000  8 with GCC......
...

Furthermore, the binary was running from the host machine’s root directory and not from a container. Using Weave Scope to gain more insight into what this container was doing, an outbound connection to a mining pool further confirmed my suspicions.

/posts/2018-09-20-kubernetes-crypto-miner/images/weave-scope-gcc-outbound-connection.png — Weave Scope GCC Output Connection

Immediate Action

Day 3, 19:03 EST: After identifying Weave Scope to be the source of spawning the miner masquerading as gcc, the team was immediately notified. Working in parallel,

Weave Scope was stopped and its deployment was removed from all our Kubernetes clusters.
Access logs were checked for any signs of unauthorised access to our instances.
The gcc process outbound connections were inspected and found to only communicate with a mining pool.
A Google search for XMRig led to a GitHub repository for a Monero miner. Combing through the source confirmed that its only function is to mine Monero.
One of the affected nodes was isolated from the cluster for future investigation.
All nodes in each cluster were rotated out to ensure all affected entities were destroyed and rebuilt.

Discovering the Attack Vector

Finding a cryptocurrency miner on our internal clusters was alarming and indicative of a vulnerability in the software we were running or an issue with our setup. Because Weave Scope was the parent process that spawned the miner, I checked for CVEs related to Weave and Weave Scope, sifted through the GitHub issues, and looked to see if any similar cases existed. No known vulnerabilities were published, no traces of DNS tampering or unauthorised access into our clusters was found, and the Docker image hash for Weave Scope matched the published image on DockerHub.

The next step I took to determine an attack vector was launching a new barebones, isolated Kubernetes cluster with our existing deployment method. Watching over the entire process exposed the first issue — the Weave Scope load balancer security group was public facing and exposed to the world.

/posts/2018-09-20-kubernetes-crypto-miner/images/exposed-security-group.png — Exposed Security Group

Anyone with our load balancer URL can access the Weave Scope dashboard without any authentication. Having metrics exposed to prying eyes will provide attackers information to work with, however one Weave Scope feature, in particular, was abused. The documentation on Weave Scope’s Github repository advertises that one of the features included is the ability to launch a command line on running containers:

Interact with and manage containers Launch a command line.
Interact with your containers directly: pause, restart and stop containers. Launch a command line. All without leaving the scope browser window.

/posts/2018-09-20-kubernetes-crypto-miner/images/weave-scope-shell-access.png — Weave Scope Shell Access

In the GUI, the terminal prompt icon presents a user with an interactive shell to the container. Even so, by design containers have a certain amount of separation from its host. An exposed load balancer along with our specific Kubernetes cluster configuration allowed arbitrary code to break out of the container and run on the host instance. Looking at the default Weave Scope configuration file for load balancers for reference,

k8s-service-type - Kubernetes service type (for running Scope in Standalone mode), can be either LoadBalancer or NodePort, by default this is unspecified (only internal access)

Our deployment was missing the annotation to make the load balancer internal:

- apiVersion: v1
  kind: Service
  metadata:
    name: weave-scope-app
    annotations:
      cloud.weave.works/launcher-info: |-
        {
          "original-request": {
            "url": "/k8s/v1.8/scope.yaml?k8s-service-type=LoadBalancer",
            "date": "Thu Sep 20 2018 17:37:19 GMT+0000 (UTC)"
          },
          "email-address": "support@weave.works"
        }        
  #--TRUNCATED--
  spec:
    ports:
      - name: app
        port: 80
        protocol: TCP
        targetPort: 4040
    #--TRUNCATED--
    type: LoadBalancer

The weave-scope container is running with the --privileged flag:

spec:
  containers:
    - name: scope-agent
      #--TRUNCATED--
      image: "docker.io/weaveworks/scope:1.9.1"
      imagePullPolicy: IfNotPresent
      securityContext:
        privileged: true
  #--TRUNCATED--

Files on the root file system were mounted onto the container:

#--TRUNCATED--
volumeMounts:
  - name: scope-plugins
    mountPath: /var/run/scope/plugins
  - name: sys-kernel-debug
    mountPath: /sys/kernel/debug
  - name: docker-socket
    mountPath: /var/run/docker.sock
#--TRUNCATED--

Containers are run as the root user.

Replicating the Attack

Simulating the attacker, a scope-agent container would be ideal to run commands on due to having elevated privileges.

images/weave-scope-exec-shell-access — Weave Scope Exec Shell Access

Demonstrated above, the host volume can be mounted onto the Docker container. This matches the listing on the underlying host. I created a simple bash script to run in the background to show that it can run on the host machine as root.

/posts/2018-09-20-kubernetes-crypto-miner/images/bash-script-running-root.png — Bash Script Running On Host Machine As Root

Through SSH on the host machine, the gcc file created through the scope-agent container is visible and running.

/posts/2018-09-20-kubernetes-crypto-miner/images/host-machine-terminal-output.png — Host Machine Terminal Output

Using the same method as in our original diagnosis, we see that gcc is running and identify the parent:

root@ip-10-30-184-17:/# ps aux | grep gcc
root      8864  0.0  0.0   1524     4 pts/7    S    16:38   0:00 ash ./gcc
root     14730  0.0  0.0  12784   948 pts/4    S+   16:39   0:00 grep gcc
root@ip-10-30-184-17:/# cat /proc/8864/status
Name:   busybox
Umask:  0022
State:  S (sleeping)
Tgid:   8864
Ngid:   0
Pid:    8864
PPid:   21594

One slight difference here is the parent PID of the gcc bash script points to /bin/ash instead of the Docker container that spawned the process:

root     21284  0.0  0.0  12784  1028 pts/4    S+   16:40   0:00 grep 21594
root     21594  0.0  0.0   1536   960 pts/7    Ss+  16:30   0:00 /bin/ash
root@ip-10-30-184-17:/# cat /proc/21594/status
Name:   ash
Umask:  0022
State:  S (sleeping)
Tgid:   21594
Ngid:   0
Pid:    21594
PPid:   21578

The parent of that process is a Docker container:

root@ip-10-30-184-17:/# ps aux | grep 21578
root     21578  0.0  0.0 199064  3756 ?        Sl   16:30   0:00 docker-containerd-shim b9652096ff9a8b524f7d9ce7688e9709fb24bcc31ffa7e7c3484c8cf117cc56a /var/run/docker/libcontainerd/b9652096ff9a8b524f7d9ce7688e9709fb24bcc31ffa7e7c3484c8cf117cc56a docker-runc
root     24923  0.0  0.0  12784   988 pts/4    S+   16:40   0:00 grep 21578
root@ip-10-30-184-17:/# docker inspect b9652096ff9a8b524f7d9ce7688e9709fb24bcc31ffa7e7c3484c8cf117cc56a
[
    {
        "Id": "b9652096ff9a8b524f7d9ce7688e9709fb24bcc31ffa7e7c3484c8cf117cc56a",
        "Created": "2018-09-20T15:59:01.868162531Z",
        "Path": "/home/weave/scope",
        "Args": [
            "--mode=probe",
            "--probe-only",
            "--probe.docker.bridge=docker0",
            "--probe.docker=true",
            "--probe.kubernetes=true",
            "weave-scope-app.weave.svc.cluster.local:80"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 7338,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2018-09-20T16:22:57.100773796Z",
            "FinishedAt": "2018-09-20T16:22:56.994841933Z"
        },
        "Image": "sha256:4b07159e407beba7d74f1986d3b90e3103f33b87269991400ca2fd1eedf1f4eb",

The slight difference may be attributed to the attack being automated and more sophisticated than the manual reproduction described. This example is also only one of several ways to break out of a privileged container. If applied to all Weave scope-agent containers, the miner can be executed on all instances since scope-agent is running on each instance to gather metrics.

Damage Assessment

An unwanted application gaining access to the root directory on all of our Kubernetes nodes is alarming. In addition to our immediate steps, further analysis was imperative to ensure that our data has not been accessed or compromised in any way.

Access logs show there were no outside logins into the host machines.
The Kubernetes API was not accessed by the intruder.
The load balancer URL was not shared and access to the Weave Scope dashboard was discovered by an automated crawler. We verified that running a cryptocurrency miner was the sole purpose of this automated attack.
The entire extent of damage done was having 100% CPU usage on one core on each of our Kubernetes instances in our development and staging environments. Pod scheduling and service was uninterrupted.
The production cluster was completely unaffected.

When designing our Kubernetes 1.10 cluster, we wanted to take advantage of the new and improved RBAC changes since 1.7.4. In the event of a more malicious attack, our current design already contains measures in place to limit access.

Our RBAC permissions restricted Weave Scope’s access, scoping it to only the weave namespace. If the Kubernetes API was queried for sensitive information such as our Kubernetes secrets, those requests would be denied.
Despite being able to break out of the container and access the underlying host filesystem, no sensitive information is stored on our Kubernetes nodes.
No published exploits or CVEs have been reported for our Kubernetes and Docker versions that would allow an attacker to retrieve output from commands run on the underlying host. Even if an attacker were to install and run a listener on the host to run arbitrary code, they would not be able to connect back due to the way our load balancer listeners and networking are set up.

To wrap up our damage assessment, an audit was done on all our Kubernetes and custom deployment yaml configuration files, and security groups, to ensure our services were not unintendedly public facing or misconfigured.

Consequences of Manual Modifications

The oversight of creating a public load balancer open to the world for an internal dashboard is not normal for our team. Our AWS CloudTrail logs showed that Kubernetes initially attached a security group defaulting to 0.0.0.0/0 access to port 80 when the load balancer was created. The security group was then properly configured to block all ingress traffic from non white-listed IPs through a manual edit on the AWS Console. However that voids my claim on how the attack was performed.

After pinpointing the deleted load balancer, its CloudTrail history contains the creation date along with a detailed event containing the security group’s identifier. Looking up that security group’s history shows that the manual edit made to firewall off unwanted traffic was reverted by Kubernetes shortly afterward.

/posts/2018-09-20-kubernetes-crypto-miner/images/cloudtrail-logs-deleted-lb.png — CloudTrail Logs For Deleted Load Balancers

Questions regarding Kubernetes reverting security group edits are present in the project’s GitHub issues. In this particular issue, a Kubernetes contributor explains:

…the way Kubernetes works, the ELB is owned by Kubernetes and you should never be forced to modify the resource manually…

We normally strictly adhere to the practice of having infrastructure as code and the load balancer should have been defined as internal, or the security group rules should have been defined in our Kubernetes yaml configuration files. In hindsight, a redeploy of Weave Scope would have reverted the manual change and needed to be manually edited back in.

Recap

At JW Player, the DevOps team has frequent code reviews and weekly architectural discussions. Prior to this incident, we had many planning sessions and discussions around upgrading our existing Kubernetes 1.7.4 cluster to 1.10.x. A mix of untimely events and decisions allowed this miner to make its way onto our clusters.

We recently migrated to a new Kubernetes version onto a new cluster with different instance sizes. Unprecedented behaviour was expected during this period.
Response time to this incident was slightly dulled by untimely alerts for a legitimate service creating noise and almost masking this issue.
Some decisions were inherited from older infrastructure and stuck around, such as running containers as the root user.
A manual change was made to a Kubernetes managed resource.

Next Steps

Learning from this lesson and moving forward, we have plans in place to harden our security. First, we learned monitoring on load is not the most effective measurement to determine if there is an issue with the Kubernetes clusters. Our first step is to improve alerting and diagnosis by first determining more insightful monitors, then tweaking them to give us assurance that there is, in fact, an anomaly present instead of being dulled to multiple alarms from an influx in workload.

To prevent this particular attack and future attacks from happening, we are researching use of tools such as falcon or sysdig for behavioural monitoring, and anomaly and intrusion detection. Istio and Linkerd may be useful in capturing and controlling end to end network traffic to observe and prevent unauthorised access. We are also analysing the computational cost of scanning Docker images and containers for known vulnerabilities.

To improve our process as a team, some time in our architectural discussions has been partitioned to revisit choices made in the past such as containers running as root. We also acknowledge that some information has been siloed off to certain team members, and embracing DevOps means this information should be shared. Communication is integral to our team, and being aware of these faults enables us s to make time to shift focus onto giving each engineer more visibility on big projects such as a large scale Kubernetes version migration.