Beyond the AKS Basics: Practical Tips for Your Kubernetes Journey

I recently completed Microsoft’s Kubernetes on Azure course (here is an archived version), and while it provided a solid foundation, I wanted to share some practical insights and debugging techniques that weren’t covered. This post dives into real-world scenarios with Azure Kubernetes Service (AKS), offering tips for debugging containers and nodes, tackling tricky issues like Posit Workbench session failures, and leveraging tools like Packer. Plus, we’ll walk through how we debugged an initially perplexing and frustrating development issue using a few useful commands.
Following this course, I built out a large Azure-hosted Posit Workbench deployment: VMs coordinating Kubernetes jobs for user development environments (like RStudio), all behind a reverse proxy.
Azure Free Trial: A Great Starting Point
First things first: if you’re new to Azure, don’t forget to take advantage of the free trials on offer. One-month and twelve-month trials are available from time to time, and they’re an excellent way to get hands-on experience with AKS. See here.
Level Up Your Debugging Skills
Peeking Inside Containers
Ever had a Kubernetes session refuse to start and wondered what’s going on under the hood? A super useful command is:
kubectl run -it --image <your_image> <your_container_name> -- /bin/bash
This lets you spin up a temporary container based on your image and get a shell inside. For example, when troubleshooting why some Kubernetes jobs for Posit Workbench weren’t starting, this command came in handy:
kubectl run -it --image ${AZURE_CONTAINER_REPOSITORY_NAME}.azurecr.io/${IMAGE_NAME} testme -- /bin/bash
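One tidy-up note: the pod sticks around after you exit (clean it up with kubectl delete pod testme). Alternatively, pass --rm and --restart=Never so kubectl removes the pod as soon as your shell exits:

kubectl run -it --rm --restart=Never --image ${AZURE_CONTAINER_REPOSITORY_NAME}.azurecr.io/${IMAGE_NAME} testme -- /bin/bash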
In the above example, we’re using custom-built container images that we pushed to Azure Container Registry; they’re based on images published by Posit, with extra customizations baked in via Packer. Read on for more on Packer!
Diving into Nodes
Surprisingly, even with a managed service like AKS, you can debug the underlying nodes!
This proved invaluable when I needed to check the software version in use for implementing an NFS share. The following command gives you a shell on the node itself:
kubectl debug node/<your_node_name> -it --image=ubuntu
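Under the hood, this schedules a debugging pod on the node with the node’s root filesystem mounted at /host, so once you have a shell you can chroot into it and inspect the node using its own tooling. A minimal sketch (the nfs-common package name is an assumption for Ubuntu-based nodes):

chroot /host  # inside the debugging pod: act as if logged onto the node
dpkg -l | grep nfs-common  # e.g. check the installed NFS client version

Note that the debugging pod isn’t removed automatically; delete it with kubectl delete pod when you’re done.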
Learn more about this powerful technique in the Kubernetes documentation.
Logs
Kubernetes logs are essential, but don’t forget the logs of other components in your system outside of Kubernetes. Many “always-on” Linux-based applications rely on systemctl and journalctl. The latter lets you view logs filtered by service unit (your application), time range, and specific keywords:
sudo journalctl -u $SERVICE_UNIT_NAME --since "$TIME_RANGE" -g "$SEARCH_TERM"
For example, when a certain Posit Workbench session (corresponding to a Kubernetes job) was having issues earlier that day, I could quickly find relevant events on the application’s virtual machine using this Linux command:
sudo journalctl -u rstudio-launcher --since today -g $SESSION_ID
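If you’re reproducing an issue interactively, it can also be handy to follow the unit’s logs live rather than querying after the fact:

sudo journalctl -u rstudio-launcher -f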
This can often provide valuable context that complements your Kubernetes logs.
The Unexpected Culprit: Looking Beyond Kubernetes
Here’s a crucial lesson I learned the hard way: sometimes, issues aren’t within your Kubernetes cluster at all. We had a setup with a reverse proxy sitting in front of applications on virtual machines with a Kubernetes backend. We anticipated users might experience some initial delay when launching jobs while system resources were allocated. However, we were caught off guard when users started reporting 504 Gateway Timeout errors after exactly two minutes.
Our initial instinct was to deep-dive into the Kubernetes configurations. But after some head-scratching, our client pointed out the consistent two-minute interval. This was the key! It forced us to broaden our investigation to all components in the request path, even those outside the Kubernetes cluster.
Our troubleshooting process involved meticulously listing every component from the Kubernetes node all the way to the user’s browser. We then started checking timeout settings on each. Guess what? The reverse proxy (more specifically, an Azure Application Gateway), sitting innocently in front of our VMs and the rest of our system, had a default two-minute connection timeout. If allocating a job to a node took longer than that, the proxy would prematurely close the connection, resulting in the dreaded 504 error.
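The fix was simply to raise that timeout on the gateway’s backend HTTP settings. A hedged sketch with the Azure CLI (the gateway, resource group, and settings names below are placeholders):

az network application-gateway http-settings update --gateway-name myAppGateway --resource-group myResourceGroup --name myHttpSettings --timeout 600  # request timeout in seconds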
This experience underscored the importance of considering the entire system architecture when debugging. Don’t just focus on Kubernetes – think about load balancers, proxies, firewalls, and any other piece of infrastructure that might be interacting with your cluster. We were lucky the problematic component was one of the first we checked!
Automating Image Creation with Packer
Packer is a fantastic tool for building identical machine images for multiple platforms from a single source configuration. These images can then be pushed to Azure Container Registry for use in Azure Kubernetes Service, for example, or used on VMs in Azure.
The real power comes from the ability to then run Ansible playbooks on top of a base image. This allows us to automate software installation and configuration, leveraging existing Ansible roles we developed in-house that weren’t necessarily written with Kubernetes sessions in mind.
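To make that concrete, here’s a minimal sketch of a Packer template pairing the Docker builder with the Ansible provisioner. The file name and playbook path are hypothetical, and rstudio/rstudio-workbench is Posit’s public base image; treat this as an illustration rather than our exact setup:

cat > workbench.pkr.hcl <<'EOF'
packer {
  required_plugins {
    docker  = { source = "github.com/hashicorp/docker", version = ">= 1.0.0" }
    ansible = { source = "github.com/hashicorp/ansible", version = ">= 1.0.0" }
  }
}

# Start from a Posit-published base image.
source "docker" "workbench" {
  image  = "rstudio/rstudio-workbench:latest"
  commit = true
}

build {
  sources = ["source.docker.workbench"]

  # Layer our in-house customizations onto the base image.
  provisioner "ansible" {
    playbook_file = "./playbooks/workbench.yml"
  }
}
EOF
packer init . && packer build workbench.pkr.hcl

From there, the docker-tag and docker-push post-processors (or a plain docker push) get the finished image into Azure Container Registry.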
Summary
Kubernetes success goes beyond cluster configs. From debugging containers and nodes to tracing issues through proxies, real-world AKS work demands a full-system view. With the right tools and mindset, you’ll turn tricky problems into valuable lessons.
