
The state of Docker on popular RISC-V platforms

I've been testing a Milk-V Jupiter this week, and have tested a number of other RISC-V development boards over the past two years.

As with any new CPU architecture, software support and ease of adoption are extremely important if you want to reach a wider audience. I wouldn't expect every developer and SBC hobbyist to be able to compile the Linux kernel, and the need to compile much of anything these days is getting rare. So any situation where one has to know how to tweak a Makefile or pass different flags to a compiler is a bit of a turn-off for platform adoption.

So one thing I've followed closely is how easy it is for me to get my own software running on RISC-V boards. It's one thing to run some vendor-provided demos. It's another entirely to take my real-world applications and infrastructure apps, and get them to work without hassle.

And to that end, Docker and Ansible, two tools I use extensively for dev/ops work, both run stably—though with plenty of caveats since RISC-V is still so new.

Resolving 'Temporary failure in name resolution' on Pi OS 12 Bookworm

Raspberry Pi OS version 12 (based on Debian 12 Bookworm) uses NetworkManager instead of dhcpcd for managing network connections, DNS resolution settings, DHCP, etc.

I've already mentioned using nmcli and nmtui for managing WiFi settings, but I ran into a strange issue after installing Docker on a fresh Raspberry Pi OS installation today. Suddenly DNS stopped working.

Trying to ping anything on the Internet gave me:

$ ping www.google.com
ping: www.google.com: Temporary failure in name resolution

As always, it was DNS. It was like DNS just gave up the ghost! Trying to change settings via nmtui didn't work either (I tried DHCP for IPv4 with manual DNS, and that made no difference).

Luckily, I found this post and follow-up comments mentioning the proper nmcli incantation to override DNS settings for an interface, so here it is (assuming built-in Ethernet):
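
Something along these lines (the connection name "Wired connection 1" is NetworkManager's usual default for built-in Ethernet, and the resolver addresses are just examples; check nmcli con show for your connection's actual name):

# Set explicit DNS servers and ignore the DHCP-provided ones.
$ sudo nmcli con mod "Wired connection 1" ipv4.dns "8.8.8.8 1.1.1.1"
$ sudo nmcli con mod "Wired connection 1" ipv4.ignore-auto-dns yes

# Bounce the connection so the new settings take effect.
$ sudo nmcli con up "Wired connection 1"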

Docker and systemd, getting rid of the dreaded 'Failed to connect to bus' error

The following error has been the bane of my existence for the past few months:

TASK [geerlingguy.containerd : Ensure containerd is started and enabled at boot.] ***
fatal: [instance]: FAILED! => {
  "changed": false,
  "cmd": "/bin/systemctl",
  "msg": "Failed to connect to bus: No such file or directory",
  "rc": 1,
  "stderr": "Failed to connect to bus: No such file or directory",
  "stderr_lines": [
    "Failed to connect to bus: No such file or directory"
  ],
  "stdout": "",
  "stdout_lines": []
}

Since I use Molecule with my Ansible roles and playbooks, I can run tests in an identical environment both locally and in GitHub Actions CI. And many of my roles and playbooks need to verify that systemd services are configured and running correctly.

But Docker recently switched from cgroups v1 to cgroups v2, and that's what started this 'Failed to connect to bus' business. systemd relies on some cgroup configuration that was easy enough to add in the past: just run your containers with these options:
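
On a cgroups v2 host, that means something like the following (a sketch, not the exact command from the post; the image name and systemd path are illustrative):

# Share the host's cgroup namespace and mount the cgroup hierarchy read-write,
# so systemd can run as PID 1 in the container and systemctl can reach its bus.
$ docker run -d --name test-instance \
    --privileged \
    --cgroupns=host \
    --volume=/sys/fs/cgroup:/sys/fs/cgroup:rw \
    geerlingguy/docker-debian11-ansible:latest \
    /lib/systemd/systemd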

Using a reverse-NFS mount to access a Docker container's data from macOS

For years, Mac users have dealt with slow filesystem performance for Docker volumes when using Docker for Mac. This is because of the virtualized filesystem layer, which used osxfs for a while and will soon be upgraded to use VirtioFS.

But if you need to do large operations on huge codebases inside a shared directory, even using NFS to share from the Mac into Docker is a lot slower than running a native Docker volume or just using files inside the container's own filesystem.
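
The "reverse" part flips the direction of the share: keep the files on a fast native Docker volume, serve that volume over NFS from inside the Linux VM, and mount the export on the Mac for editing. A rough sketch, assuming a community NFS server image like erichough/nfs-server and an NFSv4 export (the volume name and export options here are assumptions):

# Serve a native Docker volume over NFS from inside the Docker VM.
$ docker run -d --name nfs-server --privileged -p 2049:2049 \
    -v my_project_data:/data \
    -e NFS_EXPORT_0='/data *(rw,fsid=0,no_subtree_check)' \
    erichough/nfs-server

# Mount the export on the Mac; fsid=0 makes /data the NFSv4 root, hence "localhost:/".
$ mkdir -p ~/docker-nfs
$ sudo mount -t nfs -o vers=4,port=2049 localhost:/ ~/docker-nfs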

[Image: macOS Disk Utility showing an APFS case-insensitive filesystem]

New Docker for Mac VirtioFS file sync is 4x faster

Docker for Mac's shared volume performance saga continues!

After monitoring the issue "File system performance improvements" for years (discussion has moved to this issue now), it seems like the team behind Docker Desktop for Mac has finally settled on the next generation of filesystem sync.

For years, the built-in osxfs sync performance has been abysmal. For a Drupal developer like me, running a default shared volume could lead to excruciating slowdowns as PHP applications like Symfony and Drupal scan thousands of files when building app caches.

Or God forbid you ever have to install dependencies using Composer or NPM over a shared volume!

It got to the point where I started using NFS to speed up volume performance. Heck, the Docker team almost added Mutagen sync, which I tested successfully, but it caused problems for too many projects.

Allowing Ansible playbooks to work with new user groups on first run

For a long time, I've had some Ansible playbooks (most notably ones that install Docker, then start some Docker containers) that I had to split into two parts, or at least run twice, because they relied on the remote user having a new group assigned for some later tasks.

The problem is, Ansible would connect over SSH to a server, and use that connection for subsequent tasks. If you add a group to the user (e.g. docker), then keep running more tasks, that new group assignment won't be picked up until the SSH connection is reset (similar to how if you're logged in, you'd have to log out and log back in to see your new groups).

The easy fix for this? Add a reset_connection meta task in your play to force Ansible to drop its persistent SSH connection and reconnect to the server:
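
A minimal sketch of the pattern (the docker group and the ansible_user variable are illustrative):

- name: Add the remote user to the docker group.
  ansible.builtin.user:
    name: "{{ ansible_user }}"
    groups: docker
    append: true
  become: true

- name: Reset the SSH connection so the new group membership is picked up.
  ansible.builtin.meta: reset_connection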

Resolving intermittent Fedora DNF error "No such file or directory: '/var/lib/dnf/rpmdb_lock.pid'"

For many of my Ansible playbooks and roles, I have CI tests which run over various distributions, including CentOS, Ubuntu, Debian, and Fedora. Many of my Docker Hub images for Ansible testing include systemd so I can test services that are installed inside. For the most part, systemd-related issues are rare, but it seems with Fedora and DNF, I often encounter random test failures which invariably have an error message like:

No such file or directory: '/var/lib/dnf/rpmdb_lock.pid'

The full Ansible traceback is:
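
Whatever the full traceback shows, one common mitigation for intermittent lock errors like this (a sketch with an illustrative package, not necessarily the fix from the post) is to let Ansible retry the task until the lock clears:

- name: Ensure test packages are installed.
  ansible.builtin.dnf:
    name: httpd
    state: present
  register: dnf_result
  until: dnf_result is succeeded
  retries: 3
  delay: 5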

Be careful, Docker might be exposing ports to the world

Recently, I noticed logs for one of my web services had strange entries that looked like a bot trying to perform scripted attacks on an application endpoint. I was surprised, because all the endpoints that were exposed over the public Internet were protected by some form of authentication, or were locked down to specific IP addresses—or so I thought.

I had re-architected the service using Docker in the past year, and in the process of doing so, I changed the way the application ran—instead of having one server per process, I ran a group of processes on one server, and routed traffic to them using DNS names (one per process) and Nginx to proxy the traffic.

In this new setup, I built a custom firewall using iptables rules (since I had to control for a number of legacy services that I have yet to route through Docker—someday it will all be in Kubernetes), installed Docker, and set up a Docker Compose file (one per server) that ran all the processes in containers, using ports like 1234, 1235, etc.

The Docker Compose port declaration for each service looked like this:
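
Something like this (service names and ports are illustrative). The crucial detail is that a bare host port binds to 0.0.0.0, and Docker publishes it by adding its own iptables rules in the DOCKER chain, so custom INPUT rules never see the traffic:

services:
  app1:
    image: example/app:latest
    ports:
      - "1234:8080"            # binds 0.0.0.0:1234, reachable from the world
  app2:
    image: example/app:latest
    ports:
      - "127.0.0.1:1235:8080"  # only reachable from the host itself

Prefixing the host IP (127.0.0.1) keeps each process private, with Nginx as the only public entry point.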

Revisiting Docker for Mac's performance with NFS volumes

tl;dr: Docker's default bind mount performance for projects requiring lots of I/O on macOS is abysmal. It's acceptable (but still very slow) if you use the cached or delegated option. But it's actually fairly performant using the barely-documented NFS option!

July 2020 Update: Docker for Mac may soon offer built-in Mutagen sync via the :delegated sync option, and I did some benchmarking here. Hopefully that feature makes it to the standard Docker for Mac version soon.

September 2020 Update: Alas, Docker for Mac will not be getting built-in Mutagen support at this time. So, read on.
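
For reference, the NFS volume pattern looks something like this in a docker-compose.yml (a sketch; the export path and mount options are assumptions, and the Mac side also needs the directory exported in /etc/exports, with reserved-port checking relaxed in /etc/nfs.conf):

version: '3'
services:
  app:
    image: example/app:latest
    volumes:
      - nfsmount:/var/www/html
volumes:
  nfsmount:
    driver: local
    driver_opts:
      type: nfs
      o: addr=host.docker.internal,rw,nolock,hard,nointr,nfsvers=3
      device: ":/System/Volumes/Data/Users/myuser/Sites/myproject"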

Molecule fails on converge and says test instance was already 'created' and 'prepared'

I hit this problem every once in a while: I run molecule test or molecule converge (in this case it was for a Kubernetes Operator I was building with Ansible), and Molecule says the instance is already created/prepared, even though it is not, and then it fails on the 'Gathering Facts' portion of the converge step:
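
When Molecule wedges itself like this, one likely escape hatch (an assumption based on Molecule caching per-scenario state under ~/.cache/molecule, not necessarily the fix from the post) is to clear that cached state and start fresh:

# Reset Molecule's temporary folders for the current scenario...
$ molecule reset

# ...or, more bluntly, remove the cached state for the role entirely.
$ rm -rf ~/.cache/molecule/<role_name>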