L7ESP K8s Monitoring with Grafana
Introduction
This document provides instructions for the DevOps support team to onboard the monitoring tool and enhance observability across internal and customer assets.
Grafana Architecture
The Grafana architecture for multiple-cluster monitoring consists of one master Thanos deployment on the master cluster together with the Grafana front-end application; this master cluster serves as the single pane of glass for monitoring. Each client cluster needs a deployment of the Thanos stack to ship metrics and logs to the master cluster. If the assets to be monitored are not on Kubernetes, we leverage the numerous data sources Grafana provides to fetch metrics.
For single-cluster monitoring, you can simply run a Prometheus agent that scrapes metrics from node-exporter and kube-state-metrics on your nodes. Metrics are then shipped to Grafana.
Application logs are collected by Promtail jobs running on the nodes and shipped to Grafana through Loki.
Finally, Grafana can also fetch metrics from non-Kubernetes assets, such as Azure or AWS cloud environments, through their respective API data sources.
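Once the stack is deployed, a quick way to confirm the moving parts described above are running is to list the monitoring pods. A minimal sketch, assuming the grafana namespace used later in this document:

    # Sketch: list the monitoring pods (Grafana front end, Prometheus,
    # Thanos components, Loki/Promtail). The "grafana" namespace matches
    # the one used later in this document; adjust to your environment.
    kubectl get pods -n grafana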
Requirements
To install the Grafana stack on your Kubernetes cluster, you must have:
A Kubernetes cluster to deploy your Helm charts.
The kubectl and helm command-line tools, used respectively to programmatically access the Kubernetes cluster and to manage the Helm releases. To install the CLI tools, follow the links below and choose the correct operating system:
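Once installed, you can verify that both tools are available from a terminal:

    # Verify the CLI tools are installed and on your PATH.
    kubectl version --client
    helm version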
Grafana
The Grafana application can be installed using its Helm chart. Download the Helm chart from the following website: Grafana helm chart, then install it using helm.
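A minimal install sketch, assuming the upstream grafana chart repository and the release and namespace names (l7-devops-monitoring, grafana) that appear elsewhere in this document:

    # Sketch only: the release and namespace names are taken from the
    # alertmanager URL used later in this document; adjust as needed.
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update
    helm install l7-devops-monitoring grafana/grafana \
      --namespace grafana --create-namespace \
      --values values.yaml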
The following values.yaml parameters manage your dashboards in Grafana.
Enable automated dashboard uploads so users can add a new dashboard.json to the data folder of your ConfigMap. The ConfigMap's label should match the label specified in the values file below.
dashboards:
  enabled: true
  SCProvider: true
  # label that the configmaps with dashboards are marked with
  label: grafana_dashboard
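For example, a dashboard ConfigMap carrying that label can be created from a JSON file. A sketch, using the k8s-dashboard.json referenced below:

    # Sketch: package a dashboard JSON as a ConfigMap and mark it with the
    # grafana_dashboard label so the sidecar picks it up. Names are examples.
    kubectl create configmap k8s-dashboard -n grafana \
      --from-file=k8s-dashboard.json
    kubectl label configmap k8s-dashboard -n grafana grafana_dashboard="1"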
Enable a default home dashboard at login by updating the following values parameters.
dashboards:
  default_home_dashboard_path: /tmp/dashboards/k8s-dashboard.json
Set the initial Grafana login username and password:
grafana:
  enabled: true
  # Pass values to Grafana child chart
  adminUser: admin
  adminPassword: admin
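If you leave the password unset instead, the chart generates one. A sketch for retrieving it from the release secret (the <release>-grafana secret name is an assumption based on the chart's convention):

    # Sketch: read the auto-generated admin password from the chart's
    # secret; adjust the secret name to match your release name.
    kubectl get secret -n grafana l7-devops-monitoring-grafana \
      -o jsonpath="{.data.admin-password}" | base64 --decode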
Datasources
Data sources are back-end query APIs that pull data from different assets. Below is a list of the data sources currently used by ESP Grafana.
blackbox-exporter: A Prometheus-driven probe that checks endpoint health and SSL certificate metrics
thanos: High availability master-client architecture to scrape metrics from other clusters
prometheus: Time series data querier for metrics
aws cloudwatch: Data source to ingest AWS cloudwatch data and logs
azure monitor: Data source to ingest Azure data and logs
Loki: Data source to ingest knative application logs.
Install data source dependency Helm charts
The data source Helm charts can be installed from the following websites:
* Prometheus
* prometheus-blackbox-exporter
* Thanos
* Loki
* promtail
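A combined install sketch, assuming the public prometheus-community, bitnami, and grafana chart repositories and the grafana namespace used elsewhere in this document:

    # Sketch only: repository and chart names are the public upstream ones;
    # the release names and the "grafana" namespace are examples.
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update
    helm install prometheus prometheus-community/prometheus -n grafana
    helm install blackbox prometheus-community/prometheus-blackbox-exporter -n grafana
    helm install thanos bitnami/thanos -n grafana
    helm install loki grafana/loki -n grafana
    helm install promtail grafana/promtail -n grafana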
Values parameter configuration for each data source
Enable and configure each data source from the values.yaml:
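After editing values.yaml, re-apply it to the running release. A sketch, assuming the release and namespace names used elsewhere in this document and a hypothetical local chart directory:

    # Sketch: roll the updated values out to the existing release.
    # The chart path ./chart is a hypothetical placeholder.
    helm upgrade l7-devops-monitoring ./chart -n grafana -f values.yaml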
Configure Prometheus
url: http://{{ .Release.Name }}-prometheus-server:80
prometheus:
  server:
    extraArgs:
      log.level: debug
      storage.tsdb.min-block-duration: 2h  # Don't change this, see docs/components/sidecar.md
      storage.tsdb.max-block-duration: 2h  # Don't change this, see docs/components/sidecar.md
    retention: 4h
    service:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    statefulSet:
      enabled: true
    podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "10902"
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - prometheus
                - key: component
                  operator: In
                  values:
                    - server
            topologyKey: "kubernetes.io/hostname"
Configure Thanos
url: http://{{ .Release.Name }}-thanos-query-frontend:9090
thanos:
  objstoreConfig: |-
    type: s3
    config:
      bucket:
      endpoint:
      access_key:
      secret_key:
      insecure: true
  query:
    stores:
      -
      # - SIDECAR-SERVICE-IP-ADDRESS-2:10901
  bucketweb:
    enabled: true
  compactor:
    enabled: true
  storegateway:
    enabled: true
  ruler:
    enabled: true
    alertmanagers:
      - http://l7-devops-monitoring-prometheus-alertmanager.grafana.svc.cluster.local:9093
    config: |-
      groups:
        - name: "metamonitoring"
          rules:
            - alert: "PrometheusDown"
              expr: absent(up{prometheus="monitoring/prometheus-operator"})
Configure prometheus-blackbox
prometheus.yml:
  scrape_configs:
    - job_name: "healthchecks"
      scrape_interval: 60s
      scheme: https
      metrics_path: projects/446fab8e-5a9f-4f41-a90b-f086581f64a5/metrics/lthdUEmLy2xTIbPH5Jngrm_BPMGopHT7
      static_configs:
        - targets: ["healthchecks.io"]
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: (.+):(?:\d+);(\d+)
          replacement: ${1}:${2}
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
    - job_name: "snykmetrics"
      static_configs:
        - targets: ["localhost:9090"]
        - targets: ["localhost:9532"]
    - job_name: 'blackbox'
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
        - targets:
            - https://clinops-stage.preview.l7esp.com
            - https://stage.preview.l7esp.com
            - https://250.preview.l7esp.com
            - https://qa-clinops300.preview.l7esp.com
            - https://preview.l7esp.com
            - https://inspiring-jepsen-5082.edgestack.me
            - https://lab7io.atlassian.net
            - https://scipher-qa.l7esp.com
            - https://cdn.l7esp.com
            - https://lab7io.slack.com
            - https://ci.l7informatics.com
            - https://registry.l7informatics.com
            - https://l7devopstest.azurecr.io
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: l7-devops-monitoring-prometheus-blackbox-exporter:9115  # Blackbox exporter scraping address
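To test a probe by hand, you can port-forward the exporter service and call its /probe endpoint directly. A sketch, using the service name from the relabel config above:

    # Sketch: exercise the blackbox exporter's /probe endpoint manually.
    # Service and namespace names are taken from this document's examples.
    kubectl -n grafana port-forward \
      svc/l7-devops-monitoring-prometheus-blackbox-exporter 9115:9115 &
    curl "http://localhost:9115/probe?module=http_2xx&target=https://preview.l7esp.com"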
Configure Loki/Promtail
url: http://{{ .Release.Name }}-loki.{{ .Release.Name }}.svc.cluster.local:3100
loki:
  enabled: true
promtail:
  enabled: true
  scrapeConfigs: |
    # See also https://github.com/grafana/loki/blob/master/production/ksonnet/promtail/scrape_config.libsonnet for reference
    # Pods with a label 'app.kubernetes.io/name'
    - job_name: kubernetes-pods-app-kubernetes-io-name
      pipeline_stages:
        {{- toYaml .Values.config.snippets.pipelineStages | nindent 4 }}
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - action: replace
          source_labels:
            - __meta_kubernetes_pod_label_app_kubernetes_io_name
          target_label: app
        - action: drop
          regex: ''
          source_labels:
            - app
        - action: replace
          source_labels:
            - __meta_kubernetes_pod_label_app_kubernetes_io_component
          target_label: component
        {{- if .Values.config.snippets.addScrapeJobLabel }}
        - action: replace
          replacement: kubernetes-pods-app-kubernetes-io-name
          target_label: scrape_job
        {{- end }}
        {{- toYaml .Values.config.snippets.common | nindent 4 }}
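To confirm that logs are arriving, you can query Loki's HTTP API directly. A sketch, assuming the l7-devops-monitoring release name in the service URL above and the grafana namespace used elsewhere in this document:

    # Sketch: port-forward Loki and run a LogQL query over its HTTP API.
    # The "app" label is set by the relabel config above; the service name
    # assumes the l7-devops-monitoring release.
    kubectl -n grafana port-forward svc/l7-devops-monitoring-loki 3100:3100 &
    curl -G "http://localhost:3100/loki/api/v1/query" \
      --data-urlencode 'query={app="grafana"}'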
Azure Monitor
- name: Azure Monitor-1
  type: grafana-azure-monitor-datasource
  access: proxy
  jsonData:
    azureAuthType: clientsecret
    cloudName: azuremonitor
    tenantId:
    clientId:
    subscriptionId: l7 informatics
  secureJsonData:
    clientSecret:
  version: 1
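The tenantId, clientId, and clientSecret above typically come from an Azure service principal with read access to the subscription. A sketch, with a hypothetical name and an example role:

    # Sketch: create a service principal whose credentials fill tenantId,
    # clientId, and clientSecret above. The name "grafana-monitor" and the
    # Monitoring Reader role are example choices.
    az ad sp create-for-rbac --name grafana-monitor \
      --role "Monitoring Reader" \
      --scopes /subscriptions/<SUBSCRIPTION_ID>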
AWS CloudWatch
- name: CloudWatch
  type: cloudwatch
  jsonData:
    authType: keys
    defaultRegion: us-east-1
  secureJsonData:
    accessKey: ''
    secretKey: ''
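The accessKey/secretKey pair should belong to an IAM user with CloudWatch read permissions. A sketch, with a hypothetical user name:

    # Sketch: generate keys for a dedicated IAM user. The user name
    # "grafana-cloudwatch" is a hypothetical example; attach a CloudWatch
    # read-only policy to it before use.
    aws iam create-access-key --user-name grafana-cloudwatch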