
    Observe system and service health on edge devices

    With m87, deployments can include observation jobs that continuously check whether services and systems are alive and healthy.
    These checks run on the device, report their results centrally, and attach evidence when something goes wrong.

    Observation works independently of how something was deployed — Docker, systemd, or one-shot jobs.


    The problem

    On edge devices, failures are rarely clean:

    • Containers restart silently
    • Services stay up but stop behaving correctly
    • CPU or memory pressure builds up over time
    • Devices are unreachable when issues happen

    Logs alone are insufficient unless they are collected at the moment of failure.


    The solution: observe jobs

    m87 provides observe jobs that:

    • Run checks on a schedule
    • Distinguish between liveness and health
    • Execute custom data-capture logic on failure
    • Record and report diagnostic evidence automatically
    • Expose aggregated status via CLI and UI

    Observation is declarative and runs continuously on the device.


    Core concepts

    Observe job

    A job of type observe, or a service job with an observe section.
    It never deploys anything — it only evaluates state and reports results.
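
    For instance, a minimal standalone observe job might look like this (the job id, path, and check command are placeholders, not taken from the guide):

```yaml
jobs:
  - id: disk_watch          # hypothetical job id
    type: observe           # evaluates state only; deploys nothing
    enabled: true
    observe:
      liveness:
        every: 30s
        observe: test -w /data   # fails when /data is not writable
```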


    Liveness

    Answers: Is this thing still running?

    • Binary checks
    • Fast execution
    • Failure usually means the service is down or stuck

    Health

    Answers: Is this thing behaving correctly?

    • Tolerates short spikes or transient errors
    • Uses fails_after to avoid flapping
    • Often based on logs, metrics, or heuristics

    Record

    A command executed when a check fails to capture diagnostic evidence, such as:

    • Logs
    • System metrics
    • Process state

    The output is attached to the observation result.
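
    A record command can bundle several evidence sources into a single capture. A sketch, with the service name and limits purely illustrative:

```yaml
record: |
  date
  echo '--- service logs ---'
  journalctl -u my-service --since "-5 min" --no-pager | tail -n 200
  echo '--- memory ---'
  free -m
  echo '--- top cpu ---'
  ps aux --sort=-%cpu | head -n 10
```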


    Report

    Both liveness and health checks support a report command.

    report is executed on failure, just like record, but is intended for active reactions rather than passive data capture.

    Typical uses:

    • Start recording ROS bags on a robot
    • Enable verbose logging or tracing
    • Dump domain-specific state
    • Trigger external scripts or signals

    This allows you to collect richer, context-aware data at the exact moment a failure is detected.
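
    As a sketch, a robot-side report command might start a time-boxed ROS bag capture in the background so the check itself returns quickly (assumes a ROS 2 installation; the output path is illustrative):

```yaml
report: timeout 60 ros2 bag record -a -o "/data/incident_$(date +%s)" &
```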


    Observation structure (overview)

    observe:
      logs:
        follow: <command>
    
      liveness:
        every: 5s
        observe: <command>
        record: <command>
        report: <command>
    
      health:
        every: 10s
        observe: <command>
        record: <command>
        report: <command>
        fails_after: 3
    

    Only the sections you define are active.


    Logs follow and unified log streaming

    The logs.follow command is used when you run:

    m87 <device> logs
    

    This means you receive:

    • Runtime-level logs
    • Logs from any services or commands you explicitly follow via logs.follow

    This provides a unified log stream across:

    • The runtime
    • Deployed services
    • Custom log sources defined in observe jobs
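
    For a Compose-based service, a plausible logs.follow definition is simply the Compose log stream:

```yaml
observe:
  logs:
    follow: docker compose logs --follow --no-color
```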

    Example: Docker Compose liveness

    Check whether any containers are in a bad state.

    observe:
      liveness:
        every: 5s
        observe: >
          docker compose ps
          --status exited
          --status restarting
          --status dead
          --status paused
          --status removing
          --status created
          | sed '1d'
          | grep -q . && exit 1 || exit 0
        record: docker compose logs --timestamps --tail 200
        report: echo "Container liveness failure detected"
    

    Behavior:

    • Fails immediately if any container is not running
    • Captures recent logs as evidence
    • Executes custom reporting logic
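
    The core of the pipeline is the header-strip-and-invert idiom, sketched here with a plain string standing in for the `docker compose ps` output:

```shell
# After sed '1d' drops the header line, grep -q . succeeds only if some
# container row remains -- and a remaining row means the check must fail.
# The && / || pair inverts grep's exit status into that convention.
containers_ok() {
  printf '%s\n' "$1" | sed '1d' | grep -q . && return 1 || return 0
}
```

    With only a header line the function returns 0 (alive); with any listed container it returns 1, which observe treats as a liveness failure.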

    Example: Log-based health check

    Detect error patterns in logs without immediately marking the service unhealthy.

    observe:
      health:
        every: 10s
        observe: >
          docker compose logs --no-color --tail=200
          | grep -Ei 'error|panic|fatal|crash'
          && exit 1 || exit 0
        record: docker compose logs --timestamps --tail 200
        report: echo "Health degradation detected"
        fails_after: 3
    

    Behavior:

    • Allows transient errors
    • Marks unhealthy only after repeated failures
    • Triggers reporting and evidence capture
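
    The inversion in the observe command is easy to misread: grep exits 0 on a match, but a match here means trouble. Isolated as a function, with a string standing in for the real log tail:

```shell
# Return 0 (healthy) when no error pattern is present, 1 otherwise.
logs_healthy() {
  printf '%s\n' "$1" | grep -Eqi 'error|panic|fatal|crash' && return 1 || return 0
}
```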

    Example: Host CPU and memory monitoring

    A pure system observer that checks for sustained resource pressure.

    jobs:
      - id: system_monitor
        type: observe
        enabled: true
    
        workdir:
          mode: ephemeral
    
        observe:
          health:
            every: 10s
            observe: |
              sh -lc '
              CPU_T=${CPU_THRESHOLD:-95}
              MEM_T=${MEM_THRESHOLD:-90}
    
              PREV=/tmp/m87_cpu_prev
              CUR=$(awk "/^cpu /{print \$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9}" /proc/stat)
    
              if [ ! -f "$PREV" ]; then echo "$CUR" > "$PREV"; exit 0; fi
              PREVLINE=$(cat "$PREV")
              echo "$CUR" > "$PREV"
    
              set -- $PREVLINE
              p_user=$1; p_nice=$2; p_sys=$3; p_idle=$4; p_iowait=$5; p_irq=$6; p_softirq=$7; p_steal=$8
              set -- $CUR
              c_user=$1; c_nice=$2; c_sys=$3; c_idle=$4; c_iowait=$5; c_irq=$6; c_softirq=$7; c_steal=$8
    
              p_idle_all=$((p_idle + p_iowait))
              c_idle_all=$((c_idle + c_iowait))
              p_non_idle=$((p_user + p_nice + p_sys + p_irq + p_softirq + p_steal))
              c_non_idle=$((c_user + c_nice + c_sys + c_irq + c_softirq + c_steal))
              p_total=$((p_idle_all + p_non_idle))
              c_total=$((c_idle_all + c_non_idle))
    
              d_total=$((c_total - p_total))
              d_idle=$((c_idle_all - p_idle_all))
    
              CPU_OK=1
              if [ "$d_total" -gt 0 ]; then
                usage=$(( (100 * (d_total - d_idle)) / d_total ))
                [ "$usage" -ge "$CPU_T" ] && CPU_OK=0
              fi
    
              TOTAL=$(awk "/^MemTotal:/{print \$2}" /proc/meminfo)
              AVAIL=$(awk "/^MemAvailable:/{print \$2}" /proc/meminfo)
              USED_PCT=$(( (100 * (TOTAL - AVAIL)) / TOTAL ))
    
              MEM_OK=1
              [ "$USED_PCT" -ge "$MEM_T" ] && MEM_OK=0
    
              [ "$CPU_OK" -eq 1 ] && [ "$MEM_OK" -eq 1 ] && exit 0
              exit 1
              '
            record: |
              date
              echo '--- load ---'
              cat /proc/loadavg
              echo '--- mem ---'
              grep -E '^(MemTotal|MemAvailable|MemFree|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
              echo '--- top cpu ---'
              ps -eo pid,ppid,comm,%cpu,%mem --sort=-%cpu | head -n 15
              echo '--- top mem ---'
              ps -eo pid,ppid,comm,%cpu,%mem --sort=-%mem | head -n 15
            report: echo "System resource pressure detected"
            fails_after: 18
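
    The arithmetic inside the script is the classic /proc/stat delta formula, usage = 100 * (d_total - d_idle) / d_total, with iowait counted as idle. The same computation, isolated as a standalone function for readability (field order: user nice system idle iowait irq softirq steal):

```shell
# CPU usage percent between two /proc/stat "cpu" samples, each passed as
# a space-separated string: "user nice system idle iowait irq softirq steal".
cpu_usage() {
  awk -v prev="$1" -v cur="$2" 'BEGIN {
    split(prev, p, " "); split(cur, c, " ")
    p_idle = p[4] + p[5]                      # idle + iowait
    c_idle = c[4] + c[5]
    p_busy = p[1]+p[2]+p[3]+p[6]+p[7]+p[8]
    c_busy = c[1]+c[2]+c[3]+c[6]+c[7]+c[8]
    d_total = (c_idle + c_busy) - (p_idle + p_busy)
    d_idle  = c_idle - p_idle
    if (d_total > 0) printf "%d\n", 100 * (d_total - d_idle) / d_total
    else print 0
  }'
}
```

    Feeding it two successive samples of the `cpu` line from /proc/stat reproduces the usage value the job compares against CPU_THRESHOLD.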
    

    Viewing observation status (CLI)

    You can inspect aggregated observation results using:

    m87 atlas-node status
    

    Example output:

    Device atlas-node
    Observations
      NAME             LIVENESS     HEALTH      CRASHES   UNHEALTHY_CHECKS
      app_service      DEAD         UNHEALTHY         1                  2
      system_monitor   DEAD         UNHEALTHY         0                  3
    

    This gives you a quick, high-level view of the device’s operational state.


    Web UI and logs

    In the web UI, m87 provides a unified view of all collected logs, including:

    • Runtime logs
    • Followed service logs
    • Evidence captured during failures

    CLI display of failure logs and unhealthy records is coming soon.


    Operational tips

    • Use liveness for binary conditions (process exited, container stopped)
    • Use health for noisy signals (logs, load, latency)
    • Use report to trigger domain-specific diagnostics
    • Keep record commands informative but cheap
    • Prefer multiple focused checks over one complex script
