
Prometheus is an open-source monitoring and alerting toolkit that runs locally on your own infrastructure.

What can Zenduty do for Prometheus users?

With the Prometheus integration, Zenduty sends new Prometheus alerts to the right team and notifies them based on on-call schedules via email, text messages (SMS), phone calls (voice), Slack, Microsoft Teams, and iOS & Android push notifications, and escalates alerts until they are acknowledged or closed. Zenduty provides your NOC, SRE, and application engineers with detailed context around each Prometheus alert, along with playbooks and a complete incident command framework to triage, remediate, and resolve incidents with speed.

Whenever a Prometheus alert rule’s condition is triggered, an alert is sent to Zenduty, which creates an incident. When the condition returns to normal, Zenduty will auto-resolve the incident.
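A Prometheus alert rule that drives this lifecycle looks like the sketch below (the metric, threshold, and names are illustrative, not part of this integration):

```
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0        # condition: the scrape target is unreachable
    for: 5m              # condition must hold for 5 minutes before firing
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
```

While the expression matches, Alertmanager keeps the alert firing; once it stops matching, Alertmanager sends a “resolved” notification (provided the receiver has `send_resolved: true`, its default), which is what lets Zenduty auto-resolve the incident.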

You can also use Alert Rules to custom-route specific Prometheus alerts to specific users, teams, or escalation policies, write suppression rules, and automatically add notes, responders, and incident tasks.

To integrate Prometheus with Zenduty, complete the following steps:

In Zenduty:

  1. To add a new Prometheus integration, go to “Teams” on Zenduty and click on the “Manage” button corresponding to the team you want to add the integration to.

  2. Next, go to “Services” and click on the “Manage” button corresponding to the relevant Service.

  3. Go to “Integrations” and then “Add New Integration”. Give it a name and select the application “Prometheus” from the dropdown menu.

  4. Go to “Configure” under your integration and copy the generated webhook URL.

In Prometheus:

  1. Ensure that both Prometheus and Prometheus Alertmanager are downloaded and accessible locally on your system. You can download both from the official Prometheus downloads page at https://prometheus.io/download/

  2. Go to the Alertmanager folder and open “alertmanager.yml”. Add the webhook URL (copied in the earlier steps) under `webhook_configs`. Your “alertmanager.yml” file should now look like this:

     ```
     global:
       resolve_timeout: 5m
     route:
       group_by: ['alertname', 'cluster', 'service']
       group_wait: 30s
       group_interval: 5m
       repeat_interval: 3h
       receiver: 'web.hook'
     receivers:
     - name: 'web.hook'
       webhook_configs:
       - url: 'https://www.zenduty.com/api/integration/prometheus/8a02aa3b-4289-4360-9ad4-f31f40aea5ed/'
     inhibit_rules:
       - source_match:
           severity: 'critical'
         target_match:
           severity: 'warning'
         equal: ['alertname', 'dev', 'instance']
     ```
    
  3. Tip: If you want to generate alerts across multiple Zenduty services, you can define your “Alert Rules” in separate files, for example “first_rules.yml”, “second_rules.yml”, and so on, and route each group of alerts to a different integration endpoint.

  4. In the Prometheus folder, open “prometheus.yml”. Add the rule files you just created under `rule_files` and set the Alertmanager target. Zenduty groups Prometheus alerts based on the `alertname` parameter. Your “prometheus.yml” file should look like this:

     ```
     # my global config
     global:
       scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
       evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
       # scrape_timeout is set to the global default (10s).
     # Alertmanager configuration
     alerting:
       alertmanagers:
       - static_configs:
         - targets: ["localhost:9093"]
     # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
     rule_files:
       - "first_rules.yml"
       # - "second_rules.yml"
     # A scrape configuration containing exactly one endpoint to scrape:
     # Here it's Prometheus itself.
     scrape_configs:
       # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
       - job_name: 'prometheus'
         # metrics_path defaults to '/metrics'
         # scheme defaults to 'http'.
         static_configs:
         - targets: ['localhost:9090']
     ```
    
  5. Run Prometheus and Alertmanager:

     ```
     ./prometheus --config.file=prometheus.yml
     ./alertmanager --config.file=alertmanager.yml
     ```

  6. Once Prometheus is running, you will be able to see the alert rules you configured.

     When an alert fires, Zenduty will automatically create an incident.

  7. Prometheus is now integrated.
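Before relying on the pipeline end to end, you can sanity-check both configuration files and fire a synthetic alert through Alertmanager’s HTTP API (the payload below is illustrative; `promtool` ships with Prometheus and `amtool` with Alertmanager):

```
# Validate both configuration files before starting the daemons
./promtool check config prometheus.yml
./amtool check-config alertmanager.yml

# Fire a synthetic alert at Alertmanager's v2 API; if routing is correct,
# it reaches the Zenduty webhook receiver and creates a test incident.
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ZendutyTestAlert", "severity": "warning"},
        "annotations": {"summary": "Synthetic alert to verify the Zenduty integration"}}]'
```

Remember to acknowledge or resolve the resulting test incident in Zenduty afterwards.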

For Prometheus Docker/Kubernetes installations

To scrape data from multiple services or pods, you have to write custom scrape configurations in Prometheus. Refer to the example below.

```
prometheus.yml: |-
    global:
      scrape_interval: 10s
      evaluation_interval: 10s
    rule_files:
      - /etc/prometheus/prometheus.rules
    alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets:
          - "alertmanager.monitoring.svc:9093"

    scrape_configs:

      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
```

In the above example, `scrape_configs` defines where the data is scraped from, in this case the Kubernetes API server. You can define multiple jobs to scrape data from different services or pods. For Prometheus to scrape a service or pod, you need to define `prometheus.io/scrape: 'true'` and `prometheus.io/port: '9100'` within the `annotations` section of that service or pod.
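For example, a Kubernetes Service that should be scraped on port 9100 might be annotated like this (the service name and namespace are illustrative):

```
apiVersion: v1
kind: Service
metadata:
  name: node-exporter          # illustrative name
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true'   # opt this service in to scraping
    prometheus.io/port: '9100'     # port Prometheus should scrape
spec:
  selector:
    app: node-exporter
  ports:
  - name: metrics
    port: 9100
```

Note that these annotations only take effect if your scrape configuration includes relabel rules keyed on `__meta_kubernetes_service_annotation_prometheus_io_scrape` (and the corresponding port annotation).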

/etc/prometheus/prometheus.rules is the location of the Prometheus rule file, an example of which is shown below:

```
prometheus.rules: |-
    groups:
    - name: Host-related-AZ1
      rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
        for: 10m
        labels:
          slack: "true"
          zenduty: "true"
          severity: warning
          team: devops
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 20% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

In the above example, the different resource-related partitions are defined as groups, and each group has its own alerting rules. Make sure you add the appropriate labels to your rules, because Zenduty routing depends on matching these labels in the Alertmanager settings.

Now, if a rule fires and Prometheus sends the alert to Alertmanager, Alertmanager must have the appropriate channel to notify. To configure Alertmanager with Zenduty or Slack, see the example below:

```
config.yml: |-
    global:
      resolve_timeout: 5m
    templates:
    - '/etc/alertmanager-templates/*.tmpl'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 20s
      group_interval: 2m
      repeat_interval: 5m
      receiver: default  # the default receiver
      routes:
      - receiver: zen_hook  # condition-based: alerts reach zen_hook only when these labels match
        match:
          team: devops
          zenduty: "true"
        group_wait: 20s
        repeat_interval: 2m
    receivers:
    - name: zen_hook    # zen_hook receiver definition
      webhook_configs:
      - url: <Zenduty_integration_url>
        send_resolved: true
    - name: 'default'   # default receiver definition
      slack_configs:
      - channel: '#default-infra-logs'
        send_resolved: true
        title: "\n"
        text: "\n"
```
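You can verify the routing tree with `amtool` before deploying (assuming the configuration above is saved as `config.yml`):

```
# Print the configured routing tree
./amtool config routes show --config.file=config.yml

# Check which receiver an alert with these labels would be routed to;
# with the route above, this should resolve to zen_hook.
./amtool config routes test --config.file=config.yml team=devops zenduty="true"
```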

You can add a proxy in the global settings if needed, as in the snippet below.

```
config.yml: |-
    global:
      resolve_timeout: 5m
      http_config:
        proxy_url: 'http://127.0.0.1:1025'
```

For more information, visit the official Alertmanager documentation at https://prometheus.io/docs/alerting/latest/alertmanager/


Respond to Prometheus Integration alerts faster

Looking for a better way to get real-time alerts from Prometheus, set up a solid incident escalation and response pipeline, and minimize response and resolution times for Prometheus incidents?

Sign up for a free trial.


Copyright Zenduty 2020. Product of YellowAnt