Saturday, November 19, 2022

K8S Events

A few notes on K8S Events.  

K8S at its core is a database of configs - with a stable and well-defined schema. Different applications (controllers) use the database to perform actions - run workloads, set up networking and storage. The interface to the database is NoSQL-style - with a 'watch' interface similar to pubsub/MQTT that allows controllers to operate with very low latency, on every change.
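
The 'watch' behavior is easy to see even from kubectl - a quick illustration (the raw API also exposes the same semantics on any resource list via ?watch=true):

```
# Stream changes to pods as they happen, instead of polling.
kubectl get pods -A --watch -o wide

# Same watch semantics on the raw API - a stream of ADDED/MODIFIED/DELETED events.
kubectl get --raw '/api/v1/namespaces/default/pods?watch=true'
```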

Most features are defined in terms of CRDs - the database objects, with metadata (name, namespace, labels, version), data and status. The status is used by controllers to write info about how the object was actuated, and by users to find that out. For example, a Pod represents a workload - controllers will write the IP of the pod and 'Running' in status. Other controllers will use this information to update other objects - like EndpointSlice.
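
For example, reading the actuation info that controllers wrote into a Pod's status (the pod name is a placeholder):

```
# Phase and IP are written into status by the kubelet/controllers.
kubectl get pod my-pod -o jsonpath='{.status.phase} {.status.podIP}{"\n"}'
```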

K8S also has a less used and more generic pubsub mechanism - the Event, for 'general purpose' events.
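
An Event is itself a regular API object that any client can create - a minimal sketch (names, namespace and reason are made up, and stricter setups may want more fields):

```
kubectl create -f - <<'EOF'
apiVersion: v1
kind: Event
metadata:
  name: my-pod.demo-event
  namespace: default
involvedObject:
  apiVersion: v1
  kind: Pod
  name: my-pod
  namespace: default
reason: Demo
message: Example event attached to a pod
type: Normal
EOF
```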

Events, logs and traces are similar in structure and use - but different in persistence and in how the user interacts with them. While 'debugging' is the most obvious use case, analyzing them and using them in code - to extract information and trigger actions - is where the real power lies.

The CRD 'status' is persistent and treated as a write to the object - all watchers will be notified, so the write is quite expensive. Logs are batched and generally written to specialized storage, and deleted after some time - far cheaper, but harder to use programmatically, since each log system has a different query API.

In K8S, events have a 1h default retention - far less than logs, which are typically stored for weeks, or status, which is stored as long as the object lives. K8S implementations may also optimize the storage - keeping events in RAM longer or using optimized storage mechanisms. In GKE (and likely other providers) they are also logged to Stackdriver - and may have longer persistence.
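
The retention comes from the API server's --event-ttl flag (default 1h). A way to check it on a kubeadm-style install - the manifest path is an assumption and will differ on other setups:

```
# kube-apiserver --event-ttl controls event retention (defaults to 1h).
grep -- '--event-ttl' /etc/kubernetes/manifests/kube-apiserver.yaml \
  || echo "flag not set - using the default (1h)"
```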

Events are associated with other objects using the 'involvedObject' field, which links the event to an object and is used by 'kubectl describe' (see the examples below). This pattern is similar to the new Gateway 'policy attachment' - where config, overrides or defaults can be attached to other resources.

```
# Selectors filter on server side.
kubectl get events -A --field-selector involvedObject.kind!=Pod

kubectl get events -A --watch
```
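
Filtering on involvedObject.name shows roughly what 'kubectl describe' prints at the end for an object (the pod name is a placeholder):

```
kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=my-pod
```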

Watching the events can be extremely instructive and reveals a lot of internal problems - status also includes errors, but you need to know which particular object to watch.

As a 'pubsub' system the Events are far from ideal - in storage, API and feature set - but they are close in semantics and easy to bridge to a real pubsub, and for K8S they are very useful.
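
A minimal sketch of such a bridge - assuming jq is installed and a 'k8s-events' Pub/Sub topic exists (both are assumptions); a real bridge would use the watch API directly and handle reconnects:

```
# Stream events and republish a trimmed-down version to a pubsub topic.
kubectl get events -A --watch -o json \
  | jq -c --unbuffered '{type, reason, message, involvedObject, lastTimestamp}' \
  | while read -r ev; do
      gcloud pubsub topics publish k8s-events --message="$ev"
    done
```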

In the past I tried to add more Events to Istio - there was some interest but I never got to finish the PR; maybe with Ambient we can try again. The real power of Events is not debugging, but synchronizing between applications in real time - for example propagating the IP address and info about a node as soon as it connects to the control plane.

CNCF CloudEvents provides an API and integrations with various messaging and pubsub systems - it is a bit over-designed and more complex than it needs to be, but the integrations make it useful and it provides a basic HTTP-based interface that is easy to work with.
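
For illustration, the HTTP 'binary' mode is just a POST with ce-* headers - the endpoint below is a placeholder:

```
# CloudEvents HTTP binary mode: context attributes in ce-* headers, data in the body.
curl -X POST http://example.com/events \
  -H 'ce-specversion: 1.0' \
  -H 'ce-type: io.k8s.event' \
  -H 'ce-source: /clusters/demo' \
  -H "ce-id: $(uuidgen)" \
  -H 'content-type: application/json' \
  -d '{"reason":"Started","message":"container started"}'
```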

Istio also provides some events over XDS - and can also act as a bridge, to allow components using a control plane to get both configs and events. 

Links:

  • https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#event-v1-core
  • https://www.bluematador.com/blog/kubernetes-events-explained - how to watch and filter
  • https://www.cncf.io/blog/2021/12/21/extracting-value-from-the-kubernetes-events-feed/

TODO:

  • Evaluate CloudEvents integrations with K8S Events and 'real' pubsub
  • Extend Istio XDS 'debug' bridge to Events, evaluate use for sync and ambient info if Events are as reliable as pubsub.
  • Generate events from Istiod - connect/disconnect are clear. Warnings about bad configs are unlikely to be good unless frequency can be controlled.

Posts and comments

Publishing content is very easy - github, blogger, personal pages, countless social sites and fancy P2P networks. 

Reading content is harder - Reddit, Twitter, Feedly and a few others are attempting to identify 'interesting' content and organize it. There is too much content published, too little diversity if you stick with a list of subscriptions (old-style blog readers), and too much garbage and propaganda if you want broader sources.

Comments are the third problem - they usually are more interesting than the original post, and are valuable both to the reader and to the original poster.  Just posting some content on a page with no way to get any feedback is a waste of time.

I've started to look for an alternative to Twitter for reading - and while doing so I also started looking for a place to post/blog/rant. I rarely post on Twitter - but I keep a lot of notes on various projects and experiments - I'm just too lazy to publish them except as comments and READMEs in the git repos I work on.

The options so far:

  • blogger - I've been using it for a very long time and it's easy - but very unfriendly to code and markdown.
  • reddit - for posting in specific topics. Great comments and community associated with the topic. I've seen many posting a 'blog' or page on github and linking it to reddit for discussions/comments.
  • https://github.com/utterance/utterances - comments become github issues. Interesting idea - using search in the issue tracker to hold the post comments. Best associated with blogs hosted on github.
  • disqus - ads on the free edition, $11 for ad-free. Some work to set it up.
  • cusdis - open source, self-hosted option - supports sqlite.
  • ...

For now I'm (re)starting to use blogger - it is by far the least effort, since my goal is to publish my tech notes and rants and maybe get a bit of feedback if anyone stumbles on them - but I'm also checking Reddit. I wouldn't spam Reddit with my rants - but most communities seem high quality, and per-community moderation seems like a far better way to keep the noise out than even old Twitter (a very low bar, but that's what I used to read with, back when I somehow trusted they had a team fighting garbage content and disinformation).

If I get bored - I would try 'utterances' with github and cusdis.

Wednesday, November 16, 2022

Egress capture using TPROXY

Very low-level notes on intercepting traffic for Istio and similar apps. iptables provides two mechanisms for capture: REDIRECT and TPROXY. The first is buggy and not recommended by the kernel docs. TPROXY unfortunately requires NET_ADMIN (or root) and is only available in the `PREROUTING` chain, i.e. it can only be used for packets received on an interface - not for packets sent by local apps (OUTPUT).

I've been playing with this for some time, and this is what I've found:

  1. Use an OUTPUT chain to mark packets - just like we do for REDIRECT interception
  2. Use a routing table with 'dev lo' to route all marked packets to loopback. 
  3. Apply TPROXY capture on the loopback PREROUTING - if the dest IP is not 127.0.0.0/8

It looks like this:

```
# Anything with the mark 15001 will be sent to loopback
ip -4 rule add fwmark 15001 lookup 15001
ip -4 route add local default dev lo table 15001

# Calling this chain will set the mark, resulting in a route to lo
iptables -t mangle -N ZT_CAPTURE_EGRESS
iptables -t mangle -A ZT_CAPTURE_EGRESS -j MARK --set-mark 15001

# PREROUTING on loopback - anything routed by table 15001, based on the OUTPUT mark.
# Ignore local source or dst - it's not egress.
iptables -t mangle -N ZT_TPROXY
iptables -t mangle -A ZT_TPROXY -s 127.0.0.0/8 -j RETURN
iptables -t mangle -A ZT_TPROXY -d 127.0.0.0/8 -j RETURN
iptables -t mangle -A ZT_TPROXY --match mark --mark 15001 -p tcp -j TPROXY --tproxy-mark 15001/0xffffffff --on-port 15001
iptables -t mangle -A PREROUTING -i lo -j ZT_TPROXY

# Chain that determines which traffic gets redirected
iptables -t mangle -N ZT_EGRESS
iptables -t mangle -A OUTPUT -j ZT_EGRESS
```
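
A few commands to check the wiring (purely inspection, not part of the setup):

```
# Verify the policy route and the mangle chains are in place.
ip rule show
ip route show table 15001
iptables -t mangle -S ZT_TPROXY
iptables -t mangle -L ZT_EGRESS -v -n   # packet counters show what is actually matched
```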

The OUTPUT chain (ZT_EGRESS) is similar to regular Istio:

```
# Exclude a few ports that should not be captured
iptables -t mangle -A ZT_EGRESS -p tcp --dport 15001 -j RETURN
iptables -t mangle -A ZT_EGRESS -p tcp --dport 15009 -j RETURN
iptables -t mangle -A ZT_EGRESS -p tcp --dport 15008 -j RETURN

# UID or GID of the app doing the capture - so it can originate egress without
# getting captured again.
# Best is to use GID - so the root user is also captured. However when debugging
# in an IDE like CLion/GoLand it is very easy to set 'run as root' but not
# 'using group id' - so using uid-owner.
iptables -t mangle -A ZT_EGRESS -m owner --uid-owner 0 -j RETURN

# For now capture only 10.0.0.0/8, the private range; can be changed to 0.0.0.0/0
# to capture everything.
iptables -t mangle -A ZT_EGRESS -d 10.0.0.0/8 -j ZT_CAPTURE_EGRESS
```

This works for sidecars - and avoids the problems with REDIRECT; however it does require the sidecar to run with the NET_ADMIN cap, which is not always possible. For Ambient Istio (ZTunnel) it may not be needed, since eBPF or veth can be used instead.

The other major benefit of TPROXY is that it also allows UDP capture - REDIRECT 'original DST' does not work for UDP. I did a bit of testing with UDP and IPv6 - all seems to be working.
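
Note the TPROXY rule above only matches TCP; to capture UDP a similar rule is needed - a sketch, reusing the same mark and redirect port:

```
# UDP variant of the TPROXY rule - same mark and redirect port as the TCP rule.
iptables -t mangle -A ZT_TPROXY --match mark --mark 15001 -p udp \
  -j TPROXY --tproxy-mark 15001/0xffffffff --on-port 15001
```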