Friday, 18 May 2018

Centralised Logging with EFK and hot/warm in Kubernetes

Everybody is talking about centralised logging these days, and most seem to agree that EFK (Elasticsearch, Fluentd, Kibana) is a good combination for accomplishing this. The Kubernetes repo on GitHub contains something to start and play with, but it is far from production-ready:
  1. The Kibana version used there is rather old
  2. Elasticsearch is not production-ready (single-node instance, no decoupling of resource-hungry indexing from long-term, read-only storage)
While (1) can be overcome rather easily, (2) poses a bit more of a challenge - how can we create a production-ready Elasticsearch service using Kubernetes? The Elasticsearch folks propose the so-called "hot/warm" architecture for addressing this:
  • "hot" nodes running on fast and expensive hardware (fast CPUs, lots of memory, SSDs) do all the indexing of anything coming in. 
  • All data older than a configurable period of time is moved to so-called "warm" nodes running on potentially slower and less expensive hardware with large disks (usually HDDs). No indexing takes place here, data is kept read-only for queries only.
To my knowledge there is no ready-to-run "hot/warm" Elasticsearch setup for Kubernetes. Hence I had to roll my own. 

A good starting point for this was the more or less ready-to-run setup of Elasticsearch and Kibana by Paulo Pires (without Fluentd, but with a nice-to-have recent version of Kibana). 

Here's what I did to turn this into a centralised logging setup with Fluentd, a "hot/warm" Elasticsearch cluster and a recent version of Kibana.

1. Fluentd setup from the EFK setup "to start and play with" above.

This is straightforward.

2. Elasticsearch and Kibana from Paulo Pires' repository.

Namespaces and service names need to match, otherwise Fluentd will not be able to talk to Elasticsearch. Since I want centralised logging, I stick to what I found in the Fluentd setup (namespace kube-system and the term "logging" in the service names) and adapt the service names and namespaces in Paulo Pires' yamls accordingly.
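A rough sketch of that renaming (the file name, the original namespace and the original service name below are hypothetical and depend on the manifests you actually downloaded):

```shell
# Demo file standing in for one of the downloaded manifests;
# the original values ("default", "elasticsearch") are assumptions.
cat > es-service.yaml <<'EOF'
metadata:
  name: elasticsearch
  namespace: default
EOF

# Rewrite namespace and service name to match the Fluentd setup:
sed -i -e 's/namespace: default/namespace: kube-system/' \
       -e 's/name: elasticsearch$/name: elasticsearch-logging/' es-service.yaml
```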

I use the yamls from the stateful subdirectory for the Elasticsearch data nodes and set up persistent volumes as needed.

This gave me an up-to-date version of Kibana plus a client/master/data node setup for Elasticsearch - quite good for a start.

3. Test this setup, make the necessary adaptations until it works.

Let's be realistic, this always takes a while. To keep things simple, I got everything running without "hot/warm" changes before proceeding.

4. ES setup for the "hot/warm" architecture

From the single Elasticsearch data node StatefulSet in the stateful directory, I create a template that eventually yields two separate StatefulSets: one for "hot", one for "warm".

I found that the Elasticsearch image used by Paulo did not support passing command line arguments to the elasticsearch command, so I had to extend it and create a PR. The result is quay.io/pires/docker-elasticsearch-kubernetes:6.2.3, i.e. 6.2.3 is the minimum version with which my setup works. Subsequent versions will contain the necessary change, too.

In a nutshell, this is what I had to change for my StatefulSet template:
  1. Templatize the name:
    metadata:
      name: elasticsearch-logging-data-@ES_NODE_TYPE@
  2. Change the version of the Elasticsearch image to 6.2.3:
          containers:
          - name: elasticsearch-logging-data
            image: quay.io/pires/docker-elasticsearch-kubernetes:6.2.3
  3. Add the following environment variable to the list of environment variables (ES_EXTRA_ARGS adds command line arguments):
            - name: ES_EXTRA_ARGS
              value: -Enode.attr.box_type=@ES_NODE_TYPE@
  4. Set up node labels for "hot" and "warm", and assign the ES data nodes to the respective node types:
          nodeSelector:
            node/role: elasticsearch-@ES_NODE_TYPE@
  5. Make sure that there is never more than one data pod running on a single host:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                    - key: role
                      operator: In
                      values:
                      - data
                  topologyKey: kubernetes.io/hostname
  6. You will most likely want to assign different CPU and memory resources to your hot and warm nodes respectively:
            resources:
              limits:
                cpu: @ES_CPU_LIMIT@
                memory: @ES_MEM_LIMIT@
              requests:
                cpu: @ES_CPU_REQUEST@
                memory: @ES_MEM_REQUEST@

It goes without saying that expressions like @...@ are placeholders that need to be expanded and that @ES_NODE_TYPE@ will stand for either "hot" or "warm". In the end, a simple call to sed can be used to generate the two StatefulSet yamls.
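For illustration, here is a minimal sketch of that generation step. The template file name and the stand-in content created below are made up for the example; in reality the template is the full StatefulSet yaml described above:

```shell
# Stand-in for the real StatefulSet template (file name is hypothetical):
cat > es-data.yaml.template <<'EOF'
metadata:
  name: elasticsearch-logging-data-@ES_NODE_TYPE@
EOF

# Expand the placeholders once per node type; the resource values
# are examples, not recommendations.
for type in hot warm; do
  sed -e "s/@ES_NODE_TYPE@/${type}/g" \
      -e "s/@ES_CPU_LIMIT@/2/g" \
      -e "s/@ES_MEM_LIMIT@/4Gi/g" \
      -e "s/@ES_CPU_REQUEST@/1/g" \
      -e "s/@ES_MEM_REQUEST@/2Gi/g" \
      es-data.yaml.template > "es-data-${type}.yaml"
done
```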

5. Configure Fluentd to deliver data to "hot" ES nodes only

This can be accomplished by setting up a so-called "template file" for the Elasticsearch plugin in the Fluentd configuration. For this, the Fluentd configmap yaml file needs to be extended. First, I add the template code (indentation is important, because we're in yaml):

  logstash.json: |-
    {
      "index_patterns": "logstash-*",
      "settings":
        {
         "number_of_shards": 3,
         "number_of_replicas": 2,
         "index.routing.allocation.require.box_type": "hot"
        }
    }
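A typo in this JSON body is easy to make and tedious to debug once it is buried in the configmap, so it can be worth running it through a JSON parser before rolling it out, e.g.:

```shell
# The template body from the configmap, written to a scratch file;
# piping it through a JSON parser catches syntax errors early.
cat > logstash.json <<'EOF'
{
  "index_patterns": "logstash-*",
  "settings":
    {
     "number_of_shards": 3,
     "number_of_replicas": 2,
     "index.routing.allocation.require.box_type": "hot"
    }
}
EOF
python3 -m json.tool < logstash.json
```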

In order to "activate" this, the following two lines are added to the "@id elasticsearch" block above:

      template_name logstash
      template_file /etc/fluent/config.d/logstash.json

Now every log line produced by Fluentd goes to the "hot" ES nodes only.

6. Set up curator job for moving old data to "warm" nodes

I use my own yaml for the CronJob and extend the curator configuration I found in Paulo Pires' repository. Like above, namespaces and service names need to be adapted.

The CronJob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: curator
  namespace: kube-system
  labels:
    app: curator
spec:
  schedule: "0 1 * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 3
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 120
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - image: bobrik/curator:5.4.0
            name: curator
            args: ["--config", "/etc/config/config.yml", "/etc/config/action_file.yml"]
            volumeMounts:
            - name: config
              mountPath: /etc/config
          volumes:
          - name: config
            configMap:
              name: curator-config
          restartPolicy: OnFailure

In the ConfigMap for curator I replaced the contents of action_file.yml with what I found in this Graylog blog post:

  action_file.yml: |-
    ---
    # Remember, leave a key empty if there is no value.  None will be a string,
    # not a Python "NoneType"
    #
    # Also remember that all examples have 'disable_action' set to True.  If you
    # want to use this action as a template, be sure to set this to False after
    # copying it.
    actions:
      1:
        action: allocation
        description: "Apply shard allocation filtering rules to the specified indices"
        options:
          key: box_type
          value: warm
          allocation_type: require
          wait_for_completion: true
          timeout_override:
          continue_if_exception: false
          disable_action: false
        filters:
        - filtertype: pattern
          kind: prefix
          value: logstash-
        - filtertype: age
          source: name
          direction: older
          timestring: '%Y.%m.%d'
          unit: days
          unit_count: 3
      2:
        action: forcemerge
        description: "Perform a forceMerge on selected indices to 'max_num_segments' per shard"
        options:
          max_num_segments: 1
          delay:
          timeout_override: 21600 
          continue_if_exception: false
          disable_action: false
        filters:
        - filtertype: pattern
          kind: prefix
          value: logstash-
        - filtertype: age
          source: name
          direction: older
          timestring: '%Y.%m.%d'
          unit: days
          unit_count: 3

Now old data is moved from the hot to the warm nodes automatically after 3 days.

7. Finetuning #1: docker logrotate

Since I am running on bare metal using docker underneath, I need to make sure that the log files in /var/lib/docker/containers/* don't grow indefinitely. The docker daemon can take care of this just fine when you add the following lines to its command line arguments, e.g. by setting the OPTIONS variable in /etc/sysconfig/docker (or wherever else the configuration resides on all the different distros):

OPTIONS="--log-driver json-file --log-opt max-size=100m [...]
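Alternatively, and assuming a docker version recent enough to support it, the same settings can go into /etc/docker/daemon.json, which avoids the distro-specific OPTIONS handling (the max-file value here is just an example):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}
```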

8. Finetuning #2: get stack trace detection working

The docker logging driver does not know about Java (or any other language's) stack traces. Hence they are split across multiple log lines, which is not nice for later analysis (e.g. in Kibana).

Fluentd is configured to use the detect_exceptions plugin, which is supposed to join stack trace lines and add "\n" as needed in the JSON entries sent to Elasticsearch in the end. However, that plugin currently does not work with what the docker JSON log driver produces, because the latter escapes tab characters as "\u0009", which the plugin does not handle. I filed an issue on that in which I proposed a change to the regular expression for detecting stack traces.

Because that issue has not been fixed so far, I derived my own Fluentd image from the one above, in which I copy over my patched version of exception_detector.rb. Here's what I changed in the plugin (I'm dealing with Java stack traces only, hence this is not a generic solution, and that's why I did not turn this into a PR to fix that issue):

--- a/lib/fluent/plugin/exception_detector.rb
+++ b/lib/fluent/plugin/exception_detector.rb
@@ -53,9 +53,9 @@ module Fluent
       rule(:start_state,
            /(?:Exception|Error|Throwable|V8 errors stack trace)[:\r\n]/,
            :java),
-      rule(:java, /^[\t ]+(?:eval )?at /, :java),
-      rule(:java, /^[\t ]*(?:Caused by|Suppressed):/, :java),
-      rule(:java, /^[\t ]*... \d+\ more/, :java)
+      rule(:java, /^(\\u0009|[\t ])+(?:eval )?at /, :java),
+      rule(:java, /^(\\u0009|[\t ])*(?:Caused by|Suppressed):/, :java),
+      rule(:java, /^(\\u0009|[\t ])*... \d+\ more/, :java)
     ].freeze
     PYTHON_RULES = [

Closing remarks

I hope this post helps others in setting up their own centralised EFK logging environment in Kubernetes. I understand that people might be interested in looking into the full set of yamls I am using. Unfortunately they contain a lot of stuff that is rather specific to my environment and which I am not allowed to share. Looking back on this work, I may say that doing all of it on my own was a good investment of time, making what I learned far more sustainable for me :)

[edit] I have now made an Ansible-based version of most of what is documented here available as part of my Kubernetes installer on GitHub.