Prometheus is an open-source monitoring and alerting toolkit that you run on your own infrastructure.
With the Prometheus integration, Zenduty sends new Prometheus alerts to the right team and notifies them based on on-call schedules via email, text messages (SMS), phone calls (voice), Slack, Microsoft Teams, and iOS and Android push notifications, and escalates alerts until they are acknowledged or closed. Zenduty gives your NOC, SRE, and application engineers detailed context around each Prometheus alert, along with playbooks and a complete incident command framework to triage, remediate, and resolve incidents quickly.
Whenever a Prometheus alert rule condition is triggered, an alert is created in Zenduty, which in turn creates an incident. When the condition returns to normal levels, Zenduty auto-resolves the incident.
You can also use Alert Rules to route specific Prometheus alerts to specific users, teams, or escalation policies, write suppression rules, and automatically add notes, responders, and incident tasks.
To add a new Prometheus integration, go to “Teams” on Zenduty and click on the “Manage” button corresponding to the team you want to add the integration to.
Next, go to “Services” and click on the “Manage” button corresponding to the relevant Service.
Go to “Integrations” and then “Add New Integration”. Give it a name and select the application “Prometheus” from the dropdown menu.
Go to “Configure” under your integrations and copy the generated webhook URL.
Ensure that both Prometheus and Prometheus Alertmanager are downloaded and accessible locally on your system. Both are available from the Prometheus downloads page.
Go to the Alertmanager folder and open “alertmanager.yml”. Add the webhook URL (copied in the earlier steps) under webhook_configs. Your “alertmanager.yml” file should now look like this:
```
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'https://www.zenduty.com/api/integration/prometheus/8a02aa3b-4289-4360-9ad4-f31f40aea5ed/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
Tip: If you’re trying to generate alerts across multiple Zenduty Services, you can define your “Alert Rules” in different files. For example: “first_rules.yml”, “second_rules.yml”, and so on, each with a different integration endpoint.
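As a rough illustration of the tip above, a rule file such as “first_rules.yml” might look like the sketch below. The group name, alert name, `for` duration, and the `team` label are illustrative assumptions, not values required by Zenduty. The expression `up == 0` is the standard Prometheus check for a scrape target that cannot be reached.

```yaml
# first_rules.yml (illustrative sketch; names and thresholds are assumptions)
groups:
  - name: example-availability        # hypothetical group name
    rules:
      - alert: InstanceDown           # becomes the `alertname` label on the alert
        expr: up == 0                 # `up` is 0 when Prometheus fails to scrape a target
        for: 5m                       # condition must hold for 5 minutes before the alert fires
        labels:
          severity: critical
          team: devops                # hypothetical label, usable for Alertmanager routing
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes."
```

A file like this can be checked before loading it with `promtool check rules first_rules.yml`; promtool ships alongside the Prometheus binary.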
In the Prometheus folder, open “prometheus.yml”. Add the rule files you just created under rule_files and set the targets (the Alertmanager address and the endpoints to scrape). Zenduty groups Prometheus alerts based on the alertname parameter. Your “prometheus.yml” file should look like this:
```
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']
```
Run Prometheus and Alertmanager using commands like:
Run Prometheus: ./prometheus --config.file=prometheus.yml
Run Alertmanager: ./alertmanager --config.file=alertmanager.yml
Once Prometheus is running, you will be able to see the alert rules you configured.
When an alert fires, Zenduty will automatically create an incident.
Prometheus is now integrated.
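To confirm the pipeline end to end, one option is a temporary rule that fires unconditionally. The sketch below assumes a hypothetical file named “test_rules.yml” that has been added under rule_files; the group and alert names are likewise made up for illustration. `vector(1)` always returns a sample, so the alert should fire shortly after the rule is loaded.

```yaml
# test_rules.yml (hypothetical file, intended for verification only)
groups:
  - name: zenduty-integration-test      # hypothetical group name
    rules:
      - alert: ZendutyIntegrationTest   # hypothetical alert name
        expr: vector(1)                 # always returns a value, so the alert always fires
        labels:
          severity: warning
        annotations:
          summary: "Test alert to verify the Prometheus to Zenduty integration"
```

Removing the rule and restarting or reloading Prometheus should let the alert resolve, which, per the auto-resolve behavior described earlier, is expected to close the corresponding incident.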
To scrape data from multiple services or pods, you need to write custom scrape configurations in Prometheus. Refer to the example below.
```
prometheus.yml: |-
  global:
    scrape_interval: 10s
    evaluation_interval: 10s
  rule_files:
    - /etc/prometheus/prometheus.rules
  alerting:
    alertmanagers:
      - scheme: http
        static_configs:
          - targets:
              - "alertmanager.monitoring.svc:9093"
  scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
        - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
```
In the above example, scrape_configs defines where the data is scraped from, which in this case is the Kubernetes API server. You can define multiple jobs to scrape data from different services or pods. For Prometheus to scrape a service or pod, define prometheus.io/port: '9100' within the annotations section of that service or pod.
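For illustration, the annotation could be attached to a Kubernetes Service as sketched below. The Service name, namespace, selector, and the companion prometheus.io/scrape annotation are assumptions made for this example. These annotations are only a convention: they take effect when a scrape job uses relabel_configs that read the corresponding `__meta_kubernetes_service_annotation_prometheus_io_*` labels.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: node-exporter              # hypothetical service name
  namespace: monitoring            # hypothetical namespace
  annotations:
    prometheus.io/scrape: 'true'   # conventional opt-in flag (assumes a relabel rule that keeps only 'true')
    prometheus.io/port: '9100'     # port Prometheus should scrape, as described above
spec:
  selector:
    app: node-exporter             # hypothetical pod label
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
```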
/etc/prometheus/prometheus.rules is the location of the Prometheus rule file, an example of which is shown below:
```
prometheus.rules: |-
  groups:
    - name: Host-related-AZ1
      rules:
        - alert: HostOutOfMemory
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
          for: 10m
          labels:
            slack: "true"
            zenduty: "true"
            severity: warning
            team: devops
          annotations:
            summary: Host out of memory (instance {{ $labels.instance }})
            description: "Node memory is filling up (< 20% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
In the above example, the different resource-related partitions are defined as groups, and each group has its own alerting rules. Make sure you add the appropriate labels to your rules, because the Alertmanager route that forwards alerts to Zenduty matches on these labels.
When a rule fires and Prometheus sends the alert to Alertmanager, Alertmanager must have an appropriate receiver to notify. To configure Alertmanager with Zenduty or Slack, see the example below:
```
config.yml: |-
  global:
    resolve_timeout: 5m
  templates:
    - '/etc/alertmanager-templates/*.tmpl'
  route:
    group_by: ['alertname', 'cluster', 'service']
    group_wait: 20s
    group_interval: 2m
    repeat_interval: 5m
    receiver: default # this is the default receiver
    routes:
      - receiver: zen_hook # condition-based receiver: zen_hook is alerted only if the labels below match
        match:
          team: devops
          zenduty: "true"
        group_wait: 20s
        repeat_interval: 2m
  receivers:
    - name: zen_hook # zen_hook receiver definition
      webhook_configs:
        - url: <Zenduty_integration_url>
          send_resolved: true
    - name: 'default' # default receiver definition
      slack_configs:
        - channel: '#default-infra-logs'
          send_resolved: true
          title: "\n"
          text: "\n"
```
If needed, you can add a proxy in the global settings, as in the snippet below.
```
config.yml: |-
  global:
    resolve_timeout: 5m
    http_config:
      proxy_url: 'http://127.0.0.1:1025'
```
For more information, see the Alertmanager documentation.