
    Observe system and service health on edge devices

    With m87, deployments can include observation jobs that continuously check whether services and systems are alive and healthy.
    These checks run on the device, report their results centrally, and attach evidence when something goes wrong.

    Observation works independently of how something was deployed — Docker, systemd, or one-shot jobs.


    The problem

    On edge devices, failures are rarely clean:

    • Containers restart silently
    • Services stay up but stop behaving correctly
    • CPU or memory pressure builds up over time
    • Devices are unreachable when issues happen

    Logs alone are insufficient unless they are collected at the moment of failure.


    The solution: observe jobs

    m87 provides observe jobs that:

    • Run checks on a schedule
    • Distinguish between liveness and health
    • Execute custom data-capture logic on failure
    • Record and report diagnostic evidence automatically
    • Expose aggregated status via CLI and UI

    Observation is declarative and runs continuously on the device.


    Core concepts

    Observe job

    A job of type observe, or a service job with an observe section.
    It never deploys anything — it only evaluates state and reports results.
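
    For instance, a minimal standalone observe job might look like this (the job id, path, and check command are placeholders, not taken from the guide):

```yaml
jobs:
  - id: disk_watch          # hypothetical job id
    type: observe           # evaluates state only; deploys nothing
    enabled: true
    observe:
      liveness:
        every: 30s
        observe: test -w /data   # fails when /data is not writable
```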


    Liveness

    Answers: Is this thing still running?

    • Binary checks
    • Fast execution
    • Failure usually means the service is down or stuck

    Health

    Answers: Is this thing behaving correctly?

    • Tolerates short spikes or transient errors
    • Uses fails_after to avoid flapping
    • Often based on logs, metrics, or heuristics

    Record

    A command executed when a check fails to capture diagnostic evidence, such as:

    • Logs
    • System metrics
    • Process state

    The output is attached to the observation result.
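
    A record command can bundle several evidence sources into a single capture. A sketch, with the service name and limits purely illustrative:

```yaml
record: |
  date
  echo '--- service logs ---'
  journalctl -u my-service --since "-5 min" --no-pager | tail -n 200
  echo '--- memory ---'
  free -m
  echo '--- top cpu ---'
  ps aux --sort=-%cpu | head -n 10
```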


    Report

    Both liveness and health checks support a report command.

    report is executed on failure, just like record, but is intended for active reactions rather than passive data capture.

    Typical uses:

    • Start recording ROS bags on a robot
    • Enable verbose logging or tracing
    • Dump domain-specific state
    • Trigger external scripts or signals

    This allows you to collect richer, context-aware data at the exact moment a failure is detected.
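
    As a sketch, a robot-side report command might start a time-boxed ROS bag capture in the background so the check itself returns quickly (assumes a ROS 2 installation; the output path is illustrative):

```yaml
report: timeout 60 ros2 bag record -a -o "/data/incident_$(date +%s)" &
```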


    Observation structure (overview)

    observe:
      logs:
        follow: <command>
    
      liveness:
        every: 5s
        observe: <command>
        record: <command>
        report: <command>
    
      health:
        every: 10s
        observe: <command>
        record: <command>
        report: <command>
        fails_after: 3
    

    Only the sections you define are active.


    Logs follow and unified log streaming

    The logs.follow command is used when you run:

    m87 <device> logs
    

    This means you receive:

    • Runtime-level logs
    • Logs from any services or commands you explicitly follow via logs.follow

    This provides a unified log stream across:

    • The runtime
    • Deployed services
    • Custom log sources defined in observe jobs
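
    For a Compose-based service, a plausible logs.follow definition is simply the Compose log stream:

```yaml
observe:
  logs:
    follow: docker compose logs --follow --no-color
```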

    Example: Docker Compose liveness

    Check whether any containers are in a bad state.

    observe:
      liveness:
        every: 5s
        observe: >
          docker compose ps
          --status exited
          --status restarting
          --status dead
          --status paused
          --status removing
          --status created
          | sed '1d'
          | grep -q . && exit 1 || exit 0
        record: docker compose logs --timestamps --tail 200
        report: echo "Container liveness failure detected"
    

    Behavior:

    • Fails immediately if any container is not running
    • Captures recent logs as evidence
    • Executes custom reporting logic
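
    The core of the pipeline is the header-strip-and-invert idiom, sketched here with a plain string standing in for the `docker compose ps` output:

```shell
# After sed '1d' drops the header line, grep -q . succeeds only if some
# container row remains -- and a remaining row means the check must fail.
# The && / || pair inverts grep's exit status into that convention.
containers_ok() {
  printf '%s\n' "$1" | sed '1d' | grep -q . && return 1 || return 0
}
```

    With only a header line the function returns 0 (alive); with any listed container it returns 1, which observe treats as a liveness failure.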

    Example: Log-based health check

    Detect error patterns in logs without immediately marking the service unhealthy.

    observe:
      health:
        every: 10s
        observe: >
          docker compose logs --no-color --tail=200
          | grep -Ei 'error|panic|fatal|crash'
          && exit 1 || exit 0
        record: docker compose logs --timestamps --tail 200
        report: echo "Health degradation detected"
        fails_after: 3
    

    Behavior:

    • Allows transient errors
    • Marks unhealthy only after repeated failures
    • Triggers reporting and evidence capture
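
    The inversion in the observe command is easy to misread: grep exits 0 on a match, but a match here means trouble. Isolated as a function, with a string standing in for the real log tail:

```shell
# Return 0 (healthy) when no error pattern is present, 1 otherwise.
logs_healthy() {
  printf '%s\n' "$1" | grep -Eqi 'error|panic|fatal|crash' && return 1 || return 0
}
```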

    Example: Host CPU and memory monitoring

    A pure system observer that checks for sustained resource pressure.

    jobs:
      - id: system_monitor
        type: observe
        enabled: true
    
        workdir:
          mode: ephemeral
    
        observe:
          health:
            every: 10s
            observe: |
              sh -lc '
              CPU_T=${CPU_THRESHOLD:-95}
              MEM_T=${MEM_THRESHOLD:-90}
    
              PREV=/tmp/m87_cpu_prev
              CUR=$(awk "/^cpu /{print \$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9}" /proc/stat)
    
              if [ ! -f "$PREV" ]; then echo "$CUR" > "$PREV"; exit 0; fi
              PREVLINE=$(cat "$PREV")
              echo "$CUR" > "$PREV"
    
              set -- $PREVLINE
              p_user=$1; p_nice=$2; p_sys=$3; p_idle=$4; p_iowait=$5; p_irq=$6; p_softirq=$7; p_steal=$8
              set -- $CUR
              c_user=$1; c_nice=$2; c_sys=$3; c_idle=$4; c_iowait=$5; c_irq=$6; c_softirq=$7; c_steal=$8
    
              p_idle_all=$((p_idle + p_iowait))
              c_idle_all=$((c_idle + c_iowait))
              p_non_idle=$((p_user + p_nice + p_sys + p_irq + p_softirq + p_steal))
              c_non_idle=$((c_user + c_nice + c_sys + c_irq + c_softirq + c_steal))
              p_total=$((p_idle_all + p_non_idle))
              c_total=$((c_idle_all + c_non_idle))
    
              d_total=$((c_total - p_total))
              d_idle=$((c_idle_all - p_idle_all))
    
              CPU_OK=1
              if [ "$d_total" -gt 0 ]; then
                usage=$(( (100 * (d_total - d_idle)) / d_total ))
                [ "$usage" -ge "$CPU_T" ] && CPU_OK=0
              fi
    
              TOTAL=$(awk "/^MemTotal:/{print \$2}" /proc/meminfo)
              AVAIL=$(awk "/^MemAvailable:/{print \$2}" /proc/meminfo)
              USED_PCT=$(( (100 * (TOTAL - AVAIL)) / TOTAL ))
    
              MEM_OK=1
              [ "$USED_PCT" -ge "$MEM_T" ] && MEM_OK=0
    
              [ "$CPU_OK" -eq 1 ] && [ "$MEM_OK" -eq 1 ] && exit 0
              exit 1
              '
            record: |
              date
              echo '--- load ---'
              cat /proc/loadavg
              echo '--- mem ---'
              grep -E '^(MemTotal|MemAvailable|MemFree|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
              echo '--- top cpu ---'
              ps -eo pid,ppid,comm,%cpu,%mem --sort=-%cpu | head -n 15
              echo '--- top mem ---'
              ps -eo pid,ppid,comm,%cpu,%mem --sort=-%mem | head -n 15
            report: echo "System resource pressure detected"
            fails_after: 18
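
    The arithmetic inside the script is the classic /proc/stat delta formula, usage = 100 * (d_total - d_idle) / d_total, with iowait counted as idle. The same computation, isolated as a standalone function for readability (field order: user nice system idle iowait irq softirq steal):

```shell
# CPU usage percent between two /proc/stat "cpu" samples, each passed as
# a space-separated string: "user nice system idle iowait irq softirq steal".
cpu_usage() {
  awk -v prev="$1" -v cur="$2" 'BEGIN {
    split(prev, p, " "); split(cur, c, " ")
    p_idle = p[4] + p[5]                      # idle + iowait
    c_idle = c[4] + c[5]
    p_busy = p[1]+p[2]+p[3]+p[6]+p[7]+p[8]
    c_busy = c[1]+c[2]+c[3]+c[6]+c[7]+c[8]
    d_total = (c_idle + c_busy) - (p_idle + p_busy)
    d_idle  = c_idle - p_idle
    if (d_total > 0) printf "%d\n", 100 * (d_total - d_idle) / d_total
    else print 0
  }'
}
```

    Feeding it two successive samples of the `cpu` line from /proc/stat reproduces the usage value the job compares against CPU_THRESHOLD.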
    

    Viewing observation status (CLI)

    You can inspect aggregated observation results using:

    m87 atlas-node status
    

    Example output:

    Device atlas-node
    Observations
      NAME             LIVENESS     HEALTH      CRASHES   UNHEALTHY_CHECKS
      app_service      DEAD         UNHEALTHY         1                  2
      system_monitor   DEAD         UNHEALTHY         0                  3
    

    This gives you a quick, high-level view of the device’s operational state.


    Web UI and logs

    In the web UI, m87 provides a unified view of all collected logs, including:

    • Runtime logs
    • Followed service logs
    • Evidence captured during failures

    CLI display of failure logs and unhealthy records is coming soon.


    Operational tips

    • Use liveness for binary conditions (process exited, container stopped)
    • Use health for noisy signals (logs, load, latency)
    • Use report to trigger domain-specific diagnostics
    • Keep record commands informative but cheap
    • Prefer multiple focused checks over one complex script
