Operations Architecture

The following tools are used to operate the infrastructure. Items marked "coming soon" are planned but not yet deployed:

  • Prometheus - on Kubernetes
  • Grafana - on Kubernetes
  • node_exporter - on metal (to be replaced by CollectD once CollectD is feature complete, including top processes)
  • Kibana - on Kubernetes (coming soon!)
  • CollectD - on metal (coming soon!)
  • SNMP - on CollectD (coming soon!)

What and why

The following table shows what is being monitored and why:

Monitoring item name | Implementation Status | Data Source | Reason for Monitoring | Dashboard Link | Associated Alert with Threshold(s)
Non-running Kubernetes Pods in replicaSet | | CollectD | Will show if deployment is healthy | | If > 0
Node CPU | | node_exporter | Will show if node has sufficient resources | | If > 80% over last hour
Node Memory | | node_exporter | Will show if node has sufficient resources | | If > 80% over last hour
Node Network | | node_exporter | Will show if node has sufficient resources | | If > 80% over last hour
Node Disk IO | | node_exporter | Will show if node has sufficient resources | | If > 80% over last hour
Node Uptime | | node_exporter | Will show if node needs restart | | -
Node Top Processes | | node_exporter | Will show if there are very heavy processes | | -
Network usage spikes from any device on LAN | | SNMP | Will show if a network device is using a significant amount of bandwidth | | If > 50% over last 15 minutes
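
The alert thresholds in the table can be read as simple rules over recent samples. The Python sketch below is illustrative only and is not how the alerts are actually configured. The names AlertRule, RULES and should_alert are hypothetical, and it assumes that "over last hour" means the mean of the samples collected in that window, which the table does not state. Rules without a window are checked against the latest sample, and rules marked "-" never fire.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class AlertRule:
    """One row of the monitoring table (hypothetical structure)."""
    item: str
    threshold: Optional[float]      # None: no alert defined ("-" in the table)
    window_minutes: Optional[int]   # None: compare the latest sample only


# Thresholds copied from the table; percentages are plain numbers (80 = 80%).
RULES = {
    rule.item: rule
    for rule in [
        AlertRule("Non-running Kubernetes Pods in replicaSet", 0, None),
        AlertRule("Node CPU", 80, 60),
        AlertRule("Node Memory", 80, 60),
        AlertRule("Node Network", 80, 60),
        AlertRule("Node Disk IO", 80, 60),
        AlertRule("Node Uptime", None, None),
        AlertRule("Node Top Processes", None, None),
        AlertRule("Network usage spikes from any device on LAN", 50, 15),
    ]
}


def should_alert(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Return True if the samples breach the rule's threshold.

    `samples` holds the values collected within the rule's window,
    oldest first. Assumption: a windowed rule compares the mean of the
    window against the threshold.
    """
    if rule.threshold is None or not samples:
        return False
    value = samples[-1] if rule.window_minutes is None else mean(samples)
    return value > rule.threshold
```

For example, should_alert(RULES["Node CPU"], [90, 85, 70]) evaluates the mean of the three samples against the 80% threshold. In a real deployment the same conditions would more likely be expressed as alerting rules in Prometheus itself.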