Friday, September 29, 2023

Same-node communication

For containers/VMs running on the same physical machine - including containers in the same Pod or in different Pods scheduled using affinity - it would be highly useful to use modern inter-process communication based on shared memory, DMA or virtio instead of copying bytes from a user buffer to a kernel buffer to yet another buffer (3 copies is the best case, usually far more).

We have the tools - Istio CNI (and others) can inject abstract Unix sockets, and there are CSI providers that can inject real Unix sockets.
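To make the "abstract" versus "real" distinction concrete, here is a minimal sketch of an abstract Unix socket on Linux: the address starts with a NUL byte and never appears in the filesystem, so it is visible to any process in the same network namespace. The function name `abstract_echo` and the socket name are made up for illustration, and nothing here is Istio-specific.

```python
import os
import socket


def abstract_echo(name: str, payload: bytes) -> bytes:
    """Send payload over a Linux abstract Unix socket and return what arrives."""
    # A leading NUL byte selects the abstract namespace (Linux only):
    # no file is created, so there is nothing to mount or clean up.
    address = "\0" + name
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(address)
        server.listen(1)
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            # connect() completes against the listen backlog before accept().
            client.connect(address)
            conn, _ = server.accept()
            with conn:
                client.sendall(payload)
                return conn.recv(len(payload))


if __name__ == "__main__":
    print(abstract_echo(f"demo-{os.getpid()}", b"ping"))
```

A "real" Unix socket would use a filesystem path instead, which is why it needs something like a volume mount to show up inside a Pod.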

Unix sockets - just like Android Binder - can pass file descriptors and shared memory blocks to a trusted per-node component, which can then pass them on to the destination after applying security policies.
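As a rough sketch of what descriptor passing looks like in practice, the snippet below creates a memory-backed buffer, writes into it, sends the descriptor over a Unix socket pair, and reads the same bytes through a second mapping without copying them through the socket. It assumes Python 3.9+ (for `socket.send_fds` / `socket.recv_fds`); `make_shared_buffer` and `demo` are illustrative names, and both ends live in one process only to keep the example short.

```python
import mmap
import os
import socket
import tempfile

BUF_SIZE = 4096


def make_shared_buffer(size: int) -> int:
    """Return a descriptor for a shareable buffer of the given size."""
    if hasattr(os, "memfd_create"):
        # Linux: anonymous, memory-backed file.
        fd = os.memfd_create("demo-buffer")
    else:
        # Fallback for other systems: an unlinked temporary file.
        tmp = tempfile.TemporaryFile()
        fd = os.dup(tmp.fileno())
        tmp.close()
    os.ftruncate(fd, size)
    return fd


def demo() -> bytes:
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    fd = make_shared_buffer(BUF_SIZE)
    with mmap.mmap(fd, BUF_SIZE) as view:
        view[:5] = b"hello"  # written once, into shared pages
    # Only the descriptor (plus a tiny tag) crosses the socket.
    socket.send_fds(sender, [b"buf"], [fd])
    os.close(fd)
    _msg, fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
    with mmap.mmap(fds[0], BUF_SIZE) as view:
        data = bytes(view[:5])
    os.close(fds[0])
    sender.close()
    receiver.close()
    return data


if __name__ == "__main__":
    print(demo())
```

Between two real containers the sender and receiver would be separate processes, but the kernel mechanism (SCM_RIGHTS ancillary data) is the same.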

I have been looking into this for some time - I worked for many years on Android, so I started in the wrong direction, attempting to use Binder (which is now included in many kernels). But I realized Wayland is already there, and it's not a bad generic protocol if you ignore the display parts and the XML.

Both X11 and Wayland use shared buffers on the local machine - but X11 is a monster with an antiquated protocol focused on rendering on the client - and browsers are doing this far better. Wayland was designed for local display and security - but underneath there is a very clean IPC protocol based on buffer passing. 
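To give a feel for how small that underlying IPC layer is, here is a sketch of the Wayland wire header as I understand it from the protocol documentation: a 32-bit object ID, then a 32-bit word with the message size in the upper 16 bits and the opcode in the lower 16 bits, in host byte order, with arguments padded to 32 bits. File descriptors travel as ancillary data on the Unix socket, not in the message body. The helper names `pack_message` and `unpack_header` are mine, not part of any Wayland library.

```python
import struct

HEADER = struct.Struct("=II")  # host byte order, two 32-bit words


def pack_message(object_id: int, opcode: int, payload: bytes = b"") -> bytes:
    """Build one wire message: header followed by already-encoded arguments."""
    if len(payload) % 4:
        raise ValueError("arguments must be padded to 32-bit boundaries")
    size = HEADER.size + len(payload)  # size includes the 8-byte header
    if size > 0xFFFF or not 0 <= opcode <= 0xFFFF:
        raise ValueError("size and opcode must each fit in 16 bits")
    return HEADER.pack(object_id, (size << 16) | opcode) + payload


def unpack_header(data: bytes) -> tuple[int, int, int]:
    """Return (object_id, size, opcode) from the start of a message."""
    object_id, word = HEADER.unpack_from(data)
    return object_id, word >> 16, word & 0xFFFF


if __name__ == "__main__":
    msg = pack_message(3, 1, struct.pack("=I", 42))
    print(unpack_header(msg))
```

A display-independent proxy only needs to understand this framing and the descriptors attached to each message; it does not have to interpret the display interfaces.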

What would it look like in Istio or other cloud meshes? Ztunnel (or another per-node daemon) would act as a CSI or as a CNI, injecting a Unix socket in each Pod. It could use the Wayland binary protocol - but not implement any of the display protocols, just act as a proxy. If it receives a TCP connection, it can just pass the file descriptor after reading the header, but it would mainly act as a proxy for messages containing file/buffer descriptors. Like Android, it can also pass open UDS file descriptors from one container to another, after checking permissions - allowing direct communication.
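The core of such a per-node proxy could be sketched roughly as below: receive a descriptor from one workload, check who sent it, and forward it to the destination only if policy allows. This is not how ztunnel works today; the uid allow-list is a hypothetical stand-in for a real policy, `peer_uid` relies on the Linux-only `SO_PEERCRED` option, and `forward_fd` and `demo` are illustrative names.

```python
import os
import socket
import struct

UCRED = struct.Struct("iII")  # Linux struct ucred: pid, uid, gid


def peer_uid(sock: socket.socket) -> int:
    """Return the uid of the process on the other end (Linux SO_PEERCRED)."""
    creds = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, UCRED.size)
    _pid, uid, _gid = UCRED.unpack(creds)
    return uid


def forward_fd(src: socket.socket, dst: socket.socket, allowed_uids: set) -> bool:
    """Relay one message and its descriptor from src to dst if policy allows."""
    msg, fds, _flags, _addr = socket.recv_fds(src, 64, 1)
    try:
        # Hypothetical policy: only allow-listed uids may hand over descriptors.
        if not fds or peer_uid(src) not in allowed_uids:
            return False
        socket.send_fds(dst, [msg], fds)
        return True
    finally:
        for fd in fds:
            os.close(fd)  # the proxy keeps no reference after forwarding


def demo():
    client_a, proxy_a = socket.socketpair()
    proxy_b, client_b = socket.socketpair()
    read_end, write_end = os.pipe()
    # Workload A offers the write end of a pipe to workload B via the proxy.
    socket.send_fds(client_a, [b"pipe"], [write_end])
    os.close(write_end)
    ok = forward_fd(proxy_a, proxy_b, {os.getuid()})
    _msg, fds, _flags, _addr = socket.recv_fds(client_b, 64, 1)
    os.write(fds[0], b"direct")  # B now talks to A without the proxy
    os.close(fds[0])
    data = os.read(read_end, 16)
    os.close(read_end)
    for s in (client_a, proxy_a, proxy_b, client_b):
        s.close()
    return ok, data


if __name__ == "__main__":
    print(demo())
```

Once the descriptor has been handed over, the proxy is out of the data path, which is the point of passing descriptors instead of relaying bytes.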

The nice thing is that this holds even when using VMs instead of containers: there is now support for virtwl in the kernel and in sommelier. The same approach would also work for adding stronger policies on a desktop or when communicating with a GPU.

Modern computers have a lot of cores and memory - running K8S clusters with fewer but larger nodes and taking advantage of affinity can allow co-location of the entire stack, avoiding the slower network and TCP traffic for most communications, while keeping 'least privilege' and isolation. Of course, a monolith can be slightly faster - but shared memory comes far closer to that speed than TCP does.

I've been looking at this for a few years in my spare time - most of the code and experiments are obsolete now, but I think using Wayland as a base (with a clean, display-independent proxy) is the right pragmatic solution. And simpler is better - I still like Binder and the Android model, and I wish clouds would add it to their kernels...
