Cluster Metrics

Many metrics are available to help you monitor a running cluster. Many of them, and the metrics system itself, are still a work in progress, so any of them may change, even between minor version releases. We will try to keep them as stable as possible, but all of them should be considered somewhat unstable. Some metrics also cover experimental or incomplete features, so please read a metric's description before using it for monitoring or alerting.

Also be aware that, depending on the metrics system you use, the names are likely to be translated into a format that is compatible with that system. Typically this means that the ':' separator is replaced with a '.' character.
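
As a rough illustration of that translation, the sketch below maps a name from this page to the form a dot-delimited reporting system might show. The function name `to_reporter_name` and the optional `prefix` argument are hypothetical, and the real rewriting rules depend on the reporter you configure.

```python
def to_reporter_name(metric_name: str, prefix: str = "") -> str:
    """Translate a name from this page into a dot-separated form.

    Assumption: the reporter replaces ':' with '.', as described above.
    The prefix is hypothetical; some setups prepend a host or cluster
    identifier and others do not.
    """
    translated = metric_name.replace(":", ".")
    return f"{prefix}.{translated}" if prefix else translated


print(to_reporter_name("cluster:num-supervisors"))
print(to_reporter_name("nimbus:total-cpu", prefix="storm.prod"))
```

When you search your reporting system for a metric listed here, trying the translated form first is usually a reasonable starting point.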

Most metrics state the units they are reported in as part of the description. For timers, the unit is often configured by the reporter that uploads them to your system. Pay attention: even if the metric name contains a time unit, that unit may not be the one actually reported.
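
Because the unit comes from the reporter rather than from the metric name, it can help to normalise timer values explicitly before comparing or alerting on them. The following is a minimal sketch; the unit strings and the helper name `to_millis` are assumptions and should be replaced with whatever your reporter is actually configured to use.

```python
# Conversion factors from a reporter's configured duration unit to milliseconds.
# The unit names are illustrative, not taken from any particular reporter.
MILLIS_PER_UNIT = {
    "nanoseconds": 1e-6,
    "microseconds": 1e-3,
    "milliseconds": 1.0,
    "seconds": 1000.0,
}


def to_millis(value: float, reporter_unit: str) -> float:
    """Convert a reported timer value to milliseconds.

    The unit is looked up from the reporter configuration, not from
    any '-ms' or '-ns' suffix in the metric name.
    """
    try:
        return value * MILLIS_PER_UNIT[reporter_unit]
    except KeyError:
        raise ValueError(f"unknown duration unit: {reporter_unit!r}") from None


# A timer whose name ends in '-ms', uploaded by a reporter configured for seconds:
print(to_millis(2.5, "seconds"))
```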

Most metrics other than gauges and counters are a collection of numbers rather than a single value. These often result in multiple metrics being uploaded to a reporting system, such as percentiles for a histogram or rates for a meter. How this happens, and how a name listed here corresponds to the metric in your reporting system, depends on the configured metrics reporter.
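
To make the fan-out concrete, the sketch below expands one name into the series a reporter might upload for it. The suffixes are an assumption, modelled on what Dropwizard-style reporters commonly emit; your reporter may use different names or a different set, so treat the table in the code as a placeholder to adjust.

```python
from typing import Dict, List

# Hypothetical suffixes per metric type; check your own reporter's output.
SUFFIXES_BY_TYPE: Dict[str, List[str]] = {
    "gauge": [],
    "counter": [],
    "meter": ["count", "mean_rate", "m1_rate", "m5_rate", "m15_rate"],
    "histogram": ["count", "min", "max", "mean", "stddev",
                  "p50", "p75", "p95", "p98", "p99", "p999"],
}
# Assumption: a timer is reported as a histogram of durations plus the
# rate series of a meter (the meter's own 'count' is not repeated).
SUFFIXES_BY_TYPE["timer"] = (
    SUFFIXES_BY_TYPE["histogram"] + SUFFIXES_BY_TYPE["meter"][1:]
)


def expand(metric_name: str, metric_type: str) -> List[str]:
    """Return the series names one metric may turn into."""
    suffixes = SUFFIXES_BY_TYPE[metric_type]
    if not suffixes:
        # Gauges and counters are single values, so the name is unchanged.
        return [metric_name]
    return [f"{metric_name}.{suffix}" for suffix in suffixes]


for series in expand("nimbus:num-activate-calls", "meter"):
    print(series)
```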

Cluster Metrics (From Nimbus)

These are metrics that come from the active nimbus instance and report the state of the cluster as a whole, as seen by nimbus.

Metric Name	Type	Description
cluster:num-nimbus-leaders	gauge	Number of nimbuses marked as a leader. This should really only ever be 1 in a healthy cluster, or 0 for a short period of time while a failover happens.
cluster:num-nimbuses	gauge	Number of nimbuses, leader or standby.
cluster:num-supervisors	gauge	Number of supervisors.
cluster:num-topologies	gauge	Number of topologies.
cluster:num-total-used-workers	gauge	Number of used workers/slots.
cluster:num-total-workers	gauge	Number of workers/slots.
cluster:total-fragmented-cpu-non-negative	gauge	Total fragmented CPU (% of a core). This is CPU that the system thinks it cannot use because other resources on the node are used up.
cluster:total-fragmented-memory-non-negative	gauge	Total fragmented memory (MB). This is memory that the system thinks it cannot use because other resources on the node are used up.
topologies:assigned-cpu	histogram	CPU scheduled per topology (% of a core).
topologies:assigned-mem-off-heap	histogram	Off heap memory scheduled per topology (MB).
topologies:assigned-mem-on-heap	histogram	On heap memory scheduled per topology (MB).
topologies:num-executors	histogram	Number of executors per topology.
topologies:num-tasks	histogram	Number of tasks per topology.
topologies:num-workers	histogram	Number of workers per topology.
topologies:replication-count	histogram	Replication count per topology.
topologies:requested-cpu	histogram	CPU requested per topology (% of a core).
topologies:requested-mem-off-heap	histogram	Off heap memory requested per topology (MB).
topologies:requested-mem-on-heap	histogram	On heap memory requested per topology (MB).
topologies:uptime-secs	histogram	Uptime per topology (seconds).
nimbus:available-cpu-non-negative	gauge	Available CPU on the cluster (% of a core).
nimbus:total-cpu	gauge	Total CPU on the cluster (% of a core).
nimbus:total-memory	gauge	Total memory on the cluster (MB).
supervisors:fragmented-cpu	histogram	Fragmented CPU per supervisor (% of a core).
supervisors:fragmented-mem	histogram	Fragmented memory per supervisor (MB).
supervisors:num-used-workers	histogram	Workers used per supervisor.
supervisors:num-workers	histogram	Number of workers per supervisor.
supervisors:uptime-secs	histogram	Uptime of supervisors.
supervisors:used-cpu	histogram	CPU used per supervisor (% of a core).
supervisors:used-mem	histogram	Memory used per supervisor (MB).
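
The gauges above lend themselves to simple health checks. The sketch below evaluates one snapshot of values: it flags more than one leader, a leaderless period that lasts longer than a short failover should, and high slot usage. The `snapshot` dictionary, the function name, and both thresholds are hypothetical; how you fetch the values depends on your reporting system.

```python
def cluster_warnings(snapshot, leaderless_secs=0.0,
                     max_leaderless_secs=60.0, max_slot_utilization=0.9):
    """Return human-readable warnings for one snapshot of cluster gauges.

    snapshot maps metric names from the table above to current values.
    leaderless_secs is how long the caller has observed zero leaders.
    Both thresholds are hypothetical defaults, not recommended values.
    """
    warnings = []

    leaders = snapshot["cluster:num-nimbus-leaders"]
    if leaders > 1:
        warnings.append(f"{leaders} nimbus leaders reported; expected 1")
    elif leaders == 0 and leaderless_secs > max_leaderless_secs:
        warnings.append(f"no nimbus leader for {leaderless_secs:.0f}s")

    total = snapshot["cluster:num-total-workers"]
    used = snapshot["cluster:num-total-used-workers"]
    if total > 0 and used / total > max_slot_utilization:
        warnings.append(f"{used}/{total} worker slots in use")

    return warnings


healthy = {
    "cluster:num-nimbus-leaders": 1,
    "cluster:num-total-workers": 40,
    "cluster:num-total-used-workers": 12,
}
print(cluster_warnings(healthy))  # no warnings for this snapshot
```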

Nimbus Metrics

These are metrics that are specific to a nimbus instance. In many instances only the active nimbus will be reporting these metrics, but they could come from standby nimbus instances as well.

Metric Name	Type	Description
nimbus:files-upload-duration-ms	timer	Time it takes to upload a file from start to finish (Not Blobs, but this may change)
nimbus:longest-scheduling-time-ms	gauge	Longest time ever taken so far to schedule. This includes the current scheduling run, which is intended to detect if scheduling is stuck for some reason.
nimbus:num-activate-calls	meter	calls to the activate thrift method.
nimbus:num-added-executors-per-scheduling	histogram	number of executors added after a scheduling run.
nimbus:num-added-slots-per-scheduling	histogram	number of slots added after a scheduling run.
nimbus:num-beginFileUpload-calls	meter	calls to the beginFileUpload thrift method.
nimbus:num-blacklisted-supervisor	gauge	Number of supervisors currently marked as blacklisted because they appear to be somewhat unstable.
nimbus:num-deactivate-calls	meter	calls to deactivate thrift method.
nimbus:num-debug-calls	meter	calls to debug thrift method.
nimbus:num-downloadChunk-calls	meter	calls to downloadChunk thrift method.
nimbus:num-finishFileUpload-calls	meter	calls to finishFileUpload thrift method.
nimbus:num-gained-leadership	meter	number of times this nimbus gained leadership.
nimbus:num-getClusterInfo-calls	meter	calls to getClusterInfo thrift method.
nimbus:num-getComponentPageInfo-calls	meter	calls to getComponentPageInfo thrift method.
nimbus:num-getComponentPendingProfileActions-calls	meter	calls to getComponentPendingProfileActions thrift method.
nimbus:num-getLeader-calls	meter	calls to getLeader thrift method.
nimbus:num-getLogConfig-calls	meter	calls to getLogConfig thrift method.
nimbus:num-getNimbusConf-calls	meter	calls to getNimbusConf thrift method.
nimbus:num-getOwnerResourceSummaries-calls	meter	calls to getOwnerResourceSummaries thrift method.
nimbus:num-getSupervisorPageInfo-calls	meter	calls to getSupervisorPageInfo thrift method.
nimbus:num-getTopology-calls	meter	calls to getTopology thrift method.
nimbus:num-getTopologyConf-calls	meter	calls to getTopologyConf thrift method.
nimbus:num-getTopologyInfo-calls	meter	calls to getTopologyInfo thrift method.
nimbus:num-getTopologyInfoWithOpts-calls	meter	calls to getTopologyInfoWithOpts thrift method. This count includes calls to getTopologyInfo.
nimbus:num-getTopologyPageInfo-calls	meter	calls to getTopologyPageInfo thrift method.
nimbus:num-getUserTopology-calls	meter	calls to getUserTopology thrift method.
nimbus:num-isTopologyNameAllowed-calls	meter	calls to isTopologyNameAllowed thrift method.
nimbus:num-killTopology-calls	meter	calls to killTopology thrift method.
nimbus:num-killTopologyWithOpts-calls	meter	calls to killTopologyWithOpts thrift method. This count includes calls to killTopology.
nimbus:num-launched	meter	number of times a nimbus was launched
nimbus:num-lost-leadership	meter	number of times this nimbus lost leadership
nimbus:num-negative-resource-events	meter	Any time a resource goes negative (either CPU or Memory). This metric is not ideal as it is measured in a data structure that is used for internal calculations that may go negative and not actually represent over scheduling of a resource.
nimbus:num-net-executors-increase-per-scheduling	histogram	added executors minus removed executors after a scheduling run
nimbus:num-net-slots-increase-per-scheduling	histogram	added slots minus removed slots after a scheduling run
nimbus:num-rebalance-calls	meter	calls to rebalance thrift method.
nimbus:num-removed-executors-per-scheduling	histogram	number of executors removed after a scheduling run
nimbus:num-removed-slots-per-scheduling	histogram	number of slots removed after a scheduling run
nimbus:num-setLogConfig-calls	meter	calls to setLogConfig thrift method.
nimbus:num-setWorkerProfiler-calls	meter	calls to setWorkerProfiler thrift method.
nimbus:num-shutdown-calls	meter	times nimbus is shut down (this may not actually be reported as nimbus is in the middle of shutting down)
nimbus:num-submitTopology-calls	meter	calls to submitTopology thrift method.
nimbus:num-submitTopologyWithOpts-calls	meter	calls to submitTopologyWithOpts thrift method. This count includes calls to submitTopology.
nimbus:num-uploadChunk-calls	meter	calls to uploadChunk thrift method.
nimbus:num-uploadNewCredentials-calls	meter	calls to uploadNewCredentials thrift method.
nimbus:process-worker-metric-calls	meter	calls to processWorkerMetrics thrift method.
nimbus:mkAssignments-Errors	meter	tracks exceptions from mkAssignments
nimbus:topology-scheduling-duration-ms	timer	time it takes to do a scheduling run.
nimbus:total-available-memory-non-negative	gauge	available memory on the cluster (MB)
nimbuses:uptime-secs	histogram	uptime of nimbuses
MetricsCleaner:purgeTimestamp	gauge	last time metrics were purged (Unfinished Feature)
RocksDB:metric-failures	meter	generally any failure that happens in the rocksdb metrics store. (Unfinished Feature)
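
Two relationships in the table above can be sketched in a few lines: the "net increase" histograms record added minus removed for a scheduling run, and `nimbus:longest-scheduling-time-ms` includes the run in progress, so a value that is past some limit and still rising between samples may indicate a stuck run. The function names and the default threshold are hypothetical, and the "still rising" heuristic is an interpretation of the description, not a documented rule.

```python
def net_increase(added, removed):
    """Added minus removed for one scheduling run (executors or slots)."""
    return added - removed


def scheduling_may_be_stuck(previous_ms, current_ms, threshold_ms=300_000):
    """Flag a gauge value that is past the threshold and still growing.

    previous_ms and current_ms are two consecutive samples of
    nimbus:longest-scheduling-time-ms. The threshold is a placeholder.
    """
    return current_ms > threshold_ms and current_ms > previous_ms


print(net_increase(added=12, removed=5))
print(scheduling_may_be_stuck(previous_ms=310_000, current_ms=370_000))
```

Comparing two samples rather than one avoids alerting forever on a single slow run from the past, since the gauge reports the longest time seen so far.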

DRPC Metrics

Metrics related to DRPC servers.

Metric Name	Type	Description
drpc:HTTP-request-response-duration	timer	how long it takes to execute an http drpc request
drpc:num-execute-calls	meter	calls to execute a DRPC request
drpc:num-execute-http-requests	meter	http requests to the DRPC server
drpc:num-failRequest-calls	meter	calls to failRequest
drpc:num-fetchRequest-calls	meter	calls to fetchRequest
drpc:num-result-calls	meter	calls to returnResult
drpc:num-server-timedout-requests	meter	times a DRPC request timed out without a response
drpc:num-shutdown-calls	meter	number of times shutdown is called on the drpc server

Logviewer Metrics

Metrics related to the logviewer process. This process currently also handles cleaning up worker logs when they get too large or too old.

Metric Name	Type	Description
logviewer:cleanup-routine-duration-ms	timer	how long it takes to run the log cleanup routine
logviewer:deep-search-request-duration-ms	timer	how long it takes for /deepSearch/{topoId}
logviewer:disk-space-freed-in-bytes	histogram	number of bytes cleaned up each time through the cleanup routine.
logviewer:download-file-size-rounded-MB	histogram	size in MB of files being downloaded
logviewer:num-daemonlog-page-http-requests	meter	calls to /daemonlog
logviewer:num-deep-search-no-result	meter	number of deep search requests that did not return any results
logviewer:num-deep-search-requests-with-archived	meter	calls to /deepSearch/{topoId} with ?search-archived=true
logviewer:num-deep-search-requests-without-archived	meter	calls to /deepSearch/{topoId} with ?search-archived=false
logviewer:num-download-daemon-log-exceptions	meter	num errors in calls to /daemondownload
logviewer:num-download-dump-exceptions	meter	num errors in calls to /dumps/{topo-id}/{host-port}/{filename}
logviewer:num-download-log-daemon-file-http-requests	meter	calls to /daemondownload
logviewer:num-download-log-exceptions	meter	num errors in calls to /download
logviewer:num-download-log-file-http-requests	meter	calls to /download
logviewer:num-file-download-exceptions	meter	errors while trying to download files, including exceptions while downloading a log file.
logviewer:num-file-open-exceptions	meter	errors trying to open a file, either when deleting logs or when opening a log file for serving.
logviewer:num-file-read-exceptions	meter	number of exceptions trying to read from a log file for serving
logviewer:num-file-removal-exceptions	meter	number of exceptions trying to cleanup files.
logviewer:num-files-cleaned-up	histogram	number of files cleaned up each time through the cleanup routine.
logviewer:num-files-scanned-per-deep-search	histogram	number of files scanned per deep search
logviewer:num-list-dump-files-exceptions	meter	num errors in calls to /dumps/{topo-id}/{host-port}
logviewer:num-list-logs-http-request	meter	calls to /listLogs
logviewer:num-log-page-http-requests	meter	calls to /log
logviewer:num-other-cleanup-exceptions	meter	number of exceptions in the cleanup loop that are not directly related to deleting files.
logviewer:num-page-read	meter	number of pages (parts of a log file) that are served up
logviewer:num-read-daemon-log-exceptions	meter	num errors in calls to /daemonlog
logviewer:num-read-log-exceptions	meter	num errors in calls to /log
logviewer:num-search-exceptions	meter	num errors in calls to /search
logviewer:num-search-log-exceptions	meter	num errors in calls to /listLogs
logviewer:num-search-logs-requests	meter	calls to /search
logviewer:num-search-request-no-result	meter	number of regular search results that were empty
logviewer:num-set-permission-exceptions	meter	num errors running set permissions to open up files for reading.
logviewer:num-shutdown-calls	meter	number of times shutdown was called on the logviewer
logviewer:search-requests-duration-ms	timer	how long it takes for /search
logviewer:worker-log-dir-size	gauge	size in bytes of the worker logs directory.

Supervisor Metrics

Metrics associated with the supervisor, which launches the workers for a topology. The supervisor also has a state machine for each slot. Some of the metrics are associated with that state machine and can be confusing if you do not understand the state machine.

Metric Name	Type	Description
supervisor:blob-cache-update-duration	timer	how long it takes to update all of the blobs in the cache (frequently this just checks whether they have changed, but it may also include downloading them).
supervisor:blob-fetching-rate-MB/s	histogram	Download rate of a blob in MB/sec. Blobs are downloaded rarely so it is very bursty.
supervisor:blob-localization-duration	timer	Approximately how long it takes to get the blob we want after it is requested.
supervisor:current-reserved-memory-mb	gauge	total amount of memory reserved for workers on the supervisor (MB)
supervisor:current-used-memory-mb	gauge	memory currently used as measured by the supervisor (this typically requires cgroups) (MB)
supervisor:num-blob-update-version-changed	meter	number of times a version of a blob changes.
supervisor:num-cleanup-exceptions	meter	exceptions thrown during container cleanup.
supervisor:num-force-kill-exceptions	meter	exceptions thrown during force kill.
supervisor:num-kill-exceptions	meter	exceptions thrown during kill.
supervisor:num-launched	meter	number of times the supervisor is launched.
supervisor:num-shell-exceptions	meter	number of exceptions calling shell commands.
supervisor:num-slots-used-gauge	gauge	number of slots used on the supervisor.
supervisor:num-worker-transitions-into-empty	meter	number of transitions into empty state.
supervisor:num-worker-transitions-into-kill	meter	number of transitions into kill state.
supervisor:num-worker-transitions-into-kill-and-relaunch	meter	number of transitions into kill-and-relaunch state
supervisor:num-worker-transitions-into-kill-blob-update	meter	number of transitions into kill-blob-update state
supervisor:num-worker-transitions-into-running	meter	number of transitions into running state
supervisor:num-worker-transitions-into-waiting-for-blob-localization	meter	number of transitions into waiting-for-blob-localization state
supervisor:num-worker-transitions-into-waiting-for-blob-update	meter	number of transitions into waiting-for-blob-update state
supervisor:num-worker-transitions-into-waiting-for-worker-start	meter	number of transitions into waiting-for-worker-start state
supervisor:num-workers-force-kill	meter	number of times a worker was force killed. This may mean that the worker did not exit cleanly/quickly.
supervisor:num-workers-killed-assignment-changed	meter	workers killed because the assignment changed.
supervisor:num-workers-killed-blob-changed	meter	workers killed because the blob changed and they needed to be relaunched.
supervisor:num-workers-killed-hb-null	meter	workers killed because there was no heartbeat (hb) at all from the worker. This would typically only happen when a worker is launched for the first time.
supervisor:num-workers-killed-hb-timeout	meter	workers killed because the heartbeat (hb) from the worker was too old. This often happens because GC issues in the worker prevent it from sending a heartbeat, but it could also mean that the worker process exited and the supervisor, not being the parent of the process, did not know that it exited.
supervisor:num-workers-killed-memory-violation	meter	workers killed because the worker was using too much memory. If the supervisor can monitor the memory usage of the worker (typically through cgroups) and the worker goes over the limit, it may be killed.
supervisor:num-workers-killed-process-exit	meter	workers killed because the process exited and the supervisor was the parent process
supervisor:num-workers-launched	meter	number of workers launched
supervisor:single-blob-localization-duration	timer	how long it takes for a blob to be updated (downloaded, unzipped, inform slots, and make the move)
supervisor:time-worker-spent-in-state-empty-ms	timer	time spent in empty state as it transitions out. Not necessarily in ms.
supervisor:time-worker-spent-in-state-kill-and-relaunch-ms	timer	time spent in kill-and-relaunch state as it transitions out. Not necessarily in ms.
supervisor:time-worker-spent-in-state-kill-blob-update-ms	timer	time spent in kill-blob-update state as it transitions out. Not necessarily in ms.
supervisor:time-worker-spent-in-state-kill-ms	timer	time spent in kill state as it transitions out. Not necessarily in ms.
supervisor:time-worker-spent-in-state-running-ms	timer	time spent in running state as it transitions out. Not necessarily in ms.
supervisor:time-worker-spent-in-state-waiting-for-blob-localization-ms	timer	time spent in waiting-for-blob-localization state as it transitions out. Not necessarily in ms.
supervisor:time-worker-spent-in-state-waiting-for-blob-update-ms	timer	time spent in waiting-for-blob-update state as it transitions out. Not necessarily in ms.
supervisor:time-worker-spent-in-state-waiting-for-worker-start-ms	timer	time spent in waiting-for-worker-start state as it transitions out. Not necessarily in ms.
supervisor:worker-launch-duration	timer	Time taken for a worker to launch.
supervisor:worker-per-call-clean-up-duration-ns	meter	how long it takes to cleanup a worker (ns).
supervisor:worker-shutdown-duration-ns	meter	how long it takes to shutdown a worker (ns).
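
When workers restart often, ranking the `num-workers-killed-*` meters is one way to see which cause dominates. The sketch below does that for a dictionary of counts; the hints are paraphrased from the descriptions above, while the function name and the sample counts are hypothetical.

```python
# Short hints paraphrased from the metric descriptions in the table above.
KILL_REASON_HINTS = {
    "supervisor:num-workers-killed-assignment-changed": "assignment changed",
    "supervisor:num-workers-killed-blob-changed": "blob changed, relaunch needed",
    "supervisor:num-workers-killed-hb-null": "no heartbeat at all",
    "supervisor:num-workers-killed-hb-timeout": "heartbeat too old, often GC issues",
    "supervisor:num-workers-killed-memory-violation": "worker used too much memory",
    "supervisor:num-workers-killed-process-exit": "worker process exited",
}


def rank_kill_reasons(counts):
    """Return (metric, count, hint) tuples for non-zero kill meters.

    counts maps metric names to meter counts. Names that are not
    kill-reason meters are ignored. Results are sorted largest first.
    """
    ranked = [
        (name, count, KILL_REASON_HINTS[name])
        for name, count in counts.items()
        if name in KILL_REASON_HINTS and count > 0
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up sample counts, for illustration only.
sample = {
    "supervisor:num-workers-killed-hb-timeout": 14,
    "supervisor:num-workers-killed-memory-violation": 3,
    "supervisor:num-workers-killed-process-exit": 0,
    "supervisor:num-workers-launched": 20,
}
for name, count, hint in rank_kill_reasons(sample):
    print(f"{count:>4}  {name}  ({hint})")
```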

UI Metrics

Metrics associated with a single UI daemon.

Metric Name	Type	Description
ui:num-activate-topology-http-requests	meter	calls to /topology/{id}/activate
ui:num-all-topologies-summary-http-requests	meter	calls to /topology/summary
ui:num-build-visualization-http-requests	meter	calls to /topology/{id}/visualization
ui:num-cluster-configuration-http-requests	meter	calls to /cluster/configuration
ui:num-cluster-summary-http-requests	meter	calls to /cluster/summary
ui:num-component-op-response-http-requests	meter	calls to /topology/{id}/component/{component}/debug/{action}/{spct}
ui:num-component-page-http-requests	meter	calls to /topology/{id}/component/{component}
ui:num-deactivate-topology-http-requests	meter	calls to /topology/{id}/deactivate
ui:num-debug-topology-http-requests	meter	calls to /topology/{id}/debug/{action}/{spct}
ui:num-get-owner-resource-summaries-http-request	meter	calls to /owner-resources or /owner-resources/{id}
ui:num-log-config-http-requests	meter	calls to /topology/{id}/logconfig
ui:num-main-page-http-requests	meter	number of requests to /index.html
ui:num-mk-visualization-data-http-requests	meter	calls to /topology/{id}/visualization-init
ui:num-nimbus-summary-http-requests	meter	calls to /nimbus/summary
ui:num-supervisor-http-requests	meter	calls to /supervisor
ui:num-supervisor-summary-http-requests	meter	calls to /supervisor/summary
ui:num-topology-lag-http-requests	meter	calls to /topology/{id}/lag
ui:num-topology-metric-http-requests	meter	calls to /topology/{id}/metrics
ui:num-topology-op-response-http-requests	meter	calls to /topology/{id}/logconfig or /topology/{id}/rebalance/{wait-time} or /topology/{id}/kill/{wait-time}
ui:num-topology-page-http-requests	meter	calls to /topology/{id}
num-web-requests	meter	nominally the total number of web requests being made.

Pacemaker Metrics (Deprecated)

The pacemaker process is deprecated and only still exists for backwards compatibility.

Metric Name	Type	Description
pacemaker:get-pulse=count	meter	number of times getPulse was called. Yes, the '=' is in the name, but metrics reporters typically map it to a '-'.
pacemaker:heartbeat-size	histogram	size in bytes of heartbeats
pacemaker:send-pulse-count	meter	number of times sendPulse was called
pacemaker:size-total-keys	gauge	total number of keys in this pacemaker instance
pacemaker:total-receive-size	meter	total size in bytes of heartbeats received
pacemaker:total-sent-size	meter	total size in bytes of heartbeats read