Observe system and service health on edge devices
With m87, deployments can include observation jobs that continuously check whether services and systems are alive and healthy.
These checks run on the device, report their results centrally, and attach evidence when something goes wrong.
Observation works independently of how something was deployed — Docker, systemd, or one-shot jobs.
The problem
On edge devices, failures are rarely clean:
- Containers restart silently
- Services stay up but stop behaving correctly
- CPU or memory pressure builds up over time
- Devices are unreachable when issues happen
Logs alone are insufficient unless they are collected at the moment of failure.
The solution: observe jobs
m87 provides observe jobs that:
- Run checks on a schedule
- Distinguish between liveness and health
- Execute custom data-capture logic on failure
- Record and report diagnostic evidence automatically
- Expose aggregated status via CLI and UI
Observation is declarative and runs continuously on the device.
Core concepts
Observe job
A job of type observe, or a service job with an observe section.
It never deploys anything — it only evaluates state and reports results.
Liveness
Answers: Is this thing still running?
- Binary checks
- Fast execution
- Failure usually means the service is down or stuck
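A minimal liveness check can be a single command whose exit status is the entire verdict (a sketch; is_alive and the service name are placeholders, not m87 API):

```shell
# Liveness is binary: exit 0 means alive, anything else means down.
# "my-service" is a placeholder process name.
is_alive() { pgrep -x "$1" >/dev/null; }

if is_alive my-service; then
  echo "alive"
else
  echo "down"
fi
```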
Health
Answers: Is this thing behaving correctly?
- Tolerates short spikes or transient errors
- Uses fails_after to avoid flapping
- Often based on logs, metrics, or heuristics
Record
A command executed when a check fails to capture diagnostic evidence, such as:
- Logs
- System metrics
- Process state
The output is attached to the observation result.
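A record command is typically a short script whose stdout becomes the attached evidence. A sketch using generic Linux tools (assumed present on the device; capture_evidence is an illustrative name, not part of m87):

```shell
# Cheap, bounded evidence capture; everything printed here is
# attached to the observation result.
capture_evidence() {
  echo '--- time ---'
  date -u
  echo '--- load ---'
  cat /proc/loadavg
  echo '--- top cpu ---'
  ps -eo pid,comm,%cpu --sort=-%cpu | head -n 5
}
capture_evidence
```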
Report
Both liveness and health checks support a report command.
report is executed on failure, just like record, but is intended for active reactions rather than passive data capture.
Typical uses:
- Start recording ROS bags on a robot
- Enable verbose logging or tracing
- Dump domain-specific state
- Trigger external scripts or signals
This allows you to collect richer, context-aware data at the exact moment a failure is detected.
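As a sketch, a report command might flip the system into a richer capture mode and leave a marker behind; on a robot the same slot could start a ROS bag recording instead. The marker path and wording here are placeholders:

```shell
# Sketch of a report command: an active reaction to the failure,
# not passive evidence capture. Paths are placeholders.
marker="/tmp/observe-failure-$(date +%s)"
echo "failure at $(date -u '+%Y-%m-%dT%H:%M:%SZ')" > "$marker"
echo "started extended capture, marker: $marker"
```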
Observation structure (overview)
observe:
  logs:
    follow: <command>
  liveness:
    every: 5s
    observe: <command>
    record: <command>
    report: <command>
  health:
    every: 10s
    observe: <command>
    record: <command>
    report: <command>
    fails_after: 3
Only the sections you define are active.
Logs follow and unified log streaming
The logs.follow command is used when you run:
m87 <device> logs
This means you receive:
- Runtime-level logs
- Logs from any services or commands you explicitly follow via logs.follow
This provides a unified log stream across:
- The runtime
- Deployed services
- Custom log sources defined in observe jobs
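For a Docker Compose deployment, a logs.follow entry could stream container logs into that unified view (a sketch; the exact follow command is one reasonable choice, not prescribed by m87):

```
observe:
  logs:
    follow: docker compose logs --follow --no-color
```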
Example: Docker Compose liveness
Check whether any containers are in a bad state.
observe:
  liveness:
    every: 5s
    observe: >
      docker compose ps
        --status exited
        --status restarting
        --status dead
        --status paused
        --status removing
        --status created
      | sed '1d'
      | grep -q . && exit 1 || exit 0
    record: docker compose logs --timestamps --tail 200
    report: echo "Container liveness failure detected"
Behavior:
- Fails immediately if any container is not running
- Captures recent logs as evidence
- Executes custom reporting logic
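The exit-code inversion at the end of the check is worth seeing in isolation: grep -q . succeeds when its input has at least one line, so && exit 1 || exit 0 turns "non-empty output" into "check failed". A sketch (check_empty is an illustrative name):

```shell
# Non-empty input means something is in a bad state -> FAIL;
# empty input means nothing matched -> OK.
check_empty() {
  printf '%s' "$1" | grep -q . && echo FAIL || echo OK
}

check_empty ""              # no containers in a bad state
check_empty "web exited"    # at least one bad container listed
```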
Example: Log-based health check
Detect error patterns without killing the service immediately.
observe:
  health:
    every: 10s
    observe: >
      docker compose logs --no-color --tail=200
      | grep -Ei 'error|panic|fatal|crash'
      && exit 1 || exit 0
    record: docker compose logs --timestamps --tail 200
    report: echo "Health degradation detected"
    fails_after: 3
Behavior:
- Allows transient errors
- Marks unhealthy only after repeated failures
- Triggers reporting and evidence capture
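One way to picture fails_after is as a consecutive-failure counter that a healthy check resets. This is a sketch of that reading of the semantics, not m87 internals, and simulate is an illustrative name:

```shell
# Mark unhealthy only after N consecutive failed checks;
# any passing check resets the counter.
simulate() {
  fails_after=$1; shift
  count=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then count=$((count + 1)); else count=0; fi
    if [ "$count" -ge "$fails_after" ]; then echo UNHEALTHY; return; fi
  done
  echo HEALTHY
}

simulate 3 ok fail fail ok fail fail fail   # transient errors tolerated, then unhealthy
```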
Example: Host CPU and memory monitoring
A pure system observer that checks sustained resource pressure.
jobs:
  - id: system_monitor
    type: observe
    enabled: true
    workdir:
      mode: ephemeral
    observe:
      health:
        every: 10s
        observe: |
          sh -lc '
            CPU_T=${CPU_THRESHOLD:-95}
            MEM_T=${MEM_THRESHOLD:-90}
            PREV=/tmp/m87_cpu_prev
            CUR=$(awk "/^cpu /{print \$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9}" /proc/stat)
            if [ ! -f "$PREV" ]; then echo "$CUR" > "$PREV"; exit 0; fi
            PREVLINE=$(cat "$PREV")
            echo "$CUR" > "$PREV"
            set -- $PREVLINE
            p_user=$1; p_nice=$2; p_sys=$3; p_idle=$4; p_iowait=$5; p_irq=$6; p_softirq=$7; p_steal=$8
            set -- $CUR
            c_user=$1; c_nice=$2; c_sys=$3; c_idle=$4; c_iowait=$5; c_irq=$6; c_softirq=$7; c_steal=$8
            p_idle_all=$((p_idle + p_iowait))
            c_idle_all=$((c_idle + c_iowait))
            p_non_idle=$((p_user + p_nice + p_sys + p_irq + p_softirq + p_steal))
            c_non_idle=$((c_user + c_nice + c_sys + c_irq + c_softirq + c_steal))
            p_total=$((p_idle_all + p_non_idle))
            c_total=$((c_idle_all + c_non_idle))
            d_total=$((c_total - p_total))
            d_idle=$((c_idle_all - p_idle_all))
            CPU_OK=1
            if [ "$d_total" -gt 0 ]; then
              usage=$(( (100 * (d_total - d_idle)) / d_total ))
              [ "$usage" -ge "$CPU_T" ] && CPU_OK=0
            fi
            TOTAL=$(awk "/^MemTotal:/{print \$2}" /proc/meminfo)
            AVAIL=$(awk "/^MemAvailable:/{print \$2}" /proc/meminfo)
            USED_PCT=$(( (100 * (TOTAL - AVAIL)) / TOTAL ))
            MEM_OK=1
            [ "$USED_PCT" -ge "$MEM_T" ] && MEM_OK=0
            [ "$CPU_OK" -eq 1 ] && [ "$MEM_OK" -eq 1 ] && exit 0
            exit 1
          '
        record: |
          date
          echo '--- load ---'
          cat /proc/loadavg
          echo '--- mem ---'
          egrep '^(MemTotal|MemAvailable|MemFree|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
          echo '--- top cpu ---'
          ps -eo pid,ppid,comm,%cpu,%mem --sort=-%cpu | head -n 15
          echo '--- top mem ---'
          ps -eo pid,ppid,comm,%cpu,%mem --sort=-%mem | head -n 15
        report: echo "System resource pressure detected"
        fails_after: 18
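The CPU side of that script compares two /proc/stat samples and derives usage from the deltas. With fabricated sample values, the arithmetic works out like this:

```shell
# Worked example of the delta-based CPU usage formula (made-up sample values).
p_total=1000; p_idle=800    # previous sample: total jiffies, idle jiffies
c_total=1100; c_idle=820    # current sample

d_total=$((c_total - p_total))                      # jiffies elapsed between samples
d_idle=$((c_idle - p_idle))                         # how many of them were idle
usage=$(( (100 * (d_total - d_idle)) / d_total ))   # busy fraction as a percentage

echo "cpu usage: ${usage}%"
```

With 100 jiffies elapsed and 20 idle, 80 were busy, so the check compares 80 against CPU_THRESHOLD.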
Viewing observation status (CLI)
You can inspect aggregated observation results using:
m87 atlas-node status
Example output:
Device atlas-node

Observations
NAME             LIVELINESS  HEALTH     CRASHES  UNHEALTHY_CHECKS
app_service      DEAD        UNHEALTHY  1        2
system_monitor   DEAD        UNHEALTHY  0        3
This gives you a quick, high-level view of the device’s operational state.
Web UI and logs
In the web UI, m87 provides a unified view of all collected logs, including:
- Runtime logs
- Followed service logs
- Evidence captured during failures
CLI display of failure logs and unhealthy records is coming soon.
Operational tips
- Use liveness for binary conditions (process exited, container stopped)
- Use health for noisy signals (logs, load, latency)
- Use report to trigger domain-specific diagnostics
- Keep record commands informative but cheap
- Prefer multiple focused checks over one complex script