Smartmontools – Comprehensive Monitoring Guide

Overview

Smartmontools is an essential tool for monitoring SSDs and HDDs in modern server environments. It identifies early signs of hardware degradation, provides detailed diagnostics, and integrates seamlessly into automated maintenance workflows. This guide delivers a production‑ready configuration, clear threshold recommendations, structured troubleshooting, operational best practices, and an automated daily health‑check script for the server at1.


1. Recommended smartd.conf (NVMe + SATA)

# NVMe (Samsung 980)
/dev/nvme0 -a -d nvme -m root -M diminishing -W 4,45,55 -s (S/../.././03|L/../01/./04)

# SATA SSDs (sda/sdb)
/dev/sda -a -n standby -m root -M diminishing -W 4,45,55 -s (S/../.././02|L/../01/./05)
/dev/sdb -a -n standby -m root -M diminishing -W 4,45,55 -s (S/../.././02|L/../01/./06)

Key Option Summary

  • -a – monitor all SMART attributes
  • -d nvme – NVMe driver mode
  • -n standby – skip checks while a drive is in standby so it is not spun up (SATA drives)
  • -m root – send alerts to the root mailbox
  • -M diminishing – suppresses repeated notifications
  • -W 4,45,55 – temperature tracking: report changes of 4°C or more, log at 45°C, send an alert at 55°C
  • -s (…) – scheduled self‑tests (daily short test, monthly long test)
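
After editing /etc/smartd.conf, the configuration can be validated and the daemon reloaded. A minimal sketch, assuming a systemd host where the unit is named smartmontools (on some distributions it is smartd):

# Parse the configuration, run one check cycle in the foreground and exit
smartd -q onecheck

# Apply the new directives
systemctl restart smartmontools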

2. Recommended SMART Thresholds for 24/7 Operation

Note: Many SMART attributes are vendor‑specific. Interpret them cautiously and observe long‑term trends rather than isolated values.

Temperature Guidelines

  • Info/log threshold (smartd -W): 45°C
  • Critical threshold (mail alert): 55°C
  • Ideal NVMe range: 40–50°C

NVMe Wear Indicators

  • Percentage Used > 80% → plan replacement
  • Media/Data Integrity Errors > 0 → critical
  • Critical Warning != 0 → immediate replacement

SATA Attributes That Matter

  • Reallocated_Sector_Ct > 0 → monitor closely
  • Uncorrectable Error Count > 0 → serious issue
  • CRC_Error_Count > 0 → often cable‑related
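
These counters can be spot‑checked from the shell. A small sketch using the device names from section 1; the grep patterns are assumptions that simply match the relevant lines of smartctl output:

# NVMe wear and integrity counters
smartctl -A /dev/nvme0 | grep -Ei 'percentage used|media and data integrity errors|critical warning'

# SATA reallocation, uncorrectable and CRC counters
for DEV in /dev/sda /dev/sdb; do
  echo "== ${DEV} =="
  smartctl -A "${DEV}" | grep -Ei 'reallocated_sector_ct|uncorrect|crc_error'
done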

3. Monitoring Strategy for Continuous Server Operation

Tip: Test email alerts by temporarily adding -M test to a device line in smartd.conf (for example -m root -M test) and restarting smartd; a single test message is sent to root when the daemon starts.

smartd (Automatic Monitoring)

  • Performs continuous attribute monitoring
  • Executes scheduled tests
  • Sends pre‑fail warnings

Automated Daily Health Check

Covers:

  • SMART summaries
  • Btrfs device statistics
  • Btrfs scrub status
  • Filesystem usage
  • Recent high‑priority journal logs

Monthly Manual Long Tests

smartctl -t long /dev/nvme0
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
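
Self‑test progress and results can be reviewed once the tests have finished. A short sketch; the NVMe self‑test log needs a reasonably recent smartmontools release, so the generic -a report is used there:

# Self-test log for the SATA drives
smartctl -l selftest /dev/sda
smartctl -l selftest /dev/sdb

# Full report (health, attributes, logs) for the NVMe drive
smartctl -a /dev/nvme0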

Log Integration

  • System logs: /var/log/syslog
  • Service logs: journalctl -u smartmontools.service
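
A quick way to review recent smartd messages, assuming the Debian‑style unit name used above (it may be smartd.service elsewhere):

# Warnings and errors reported by smartd over the last week
journalctl -u smartmontools.service -p warning --since "7 days ago"

# Or search the classic syslog file
grep -i smartd /var/log/syslog | tail -n 50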

4. Diagnostic Guide (Handling SMART Warnings)

SMART warnings typically appear as pre‑failure flags, temperature alerts, or steadily increasing error counters. The following list highlights the most relevant indicators.

Reallocated_Sector_Ct > 0

Indicates physical degradation.

  • Action: plan replacement.

Current_Pending_Sector > 0

Read‑error candidates waiting for reallocation.

  • Action: run a long test and ensure backups.
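
A typical sequence for this case, sketched with /dev/sda as the affected drive:

smartctl -t long /dev/sda                                      # start the long self-test
smartctl -l selftest /dev/sda                                  # check progress / final result later
smartctl -A /dev/sda | grep -Ei 'current_pending|reallocated'  # re-read the counters afterwards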

CRC_Error_Count Increasing

Usually a cable or connection problem.

  • Action: replace/reattach cable.

NVMe Media/Data Integrity Errors > 0

Critical reliability failure.

  • Action: back up immediately and replace.

Temperature Warnings

  • Improve airflow
  • Adjust NVMe heatsink
  • Remove dust
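
When a temperature warning arrives, the current values can be read directly; a small sketch for the drives used in this guide:

for DEV in /dev/nvme0 /dev/sda /dev/sdb; do
  echo "== ${DEV} =="
  smartctl -A "${DEV}" | grep -i temperature
done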

5. Best Practices for Reliable 24/7 Operation

  • Use realistic SMART thresholds
  • NVMe drives do not require daily long self‑tests
  • Keep temperatures below 55°C
  • Run a monthly Btrfs scrub (see the example after this list)
  • Replace SSDs at 80%+ wear
  • Use Btrfs RAID1 to stay operational through a single‑drive failure; it is not a substitute for backups
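
A minimal monthly scrub routine for the RAID1 mount used later in the health‑check script (/mnt/raid1) could look like this:

btrfs scrub start /mnt/raid1     # runs in the background
btrfs scrub status /mnt/raid1    # progress and any corrected/uncorrectable errors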

6. Automated Health‑Check Script for at1

Tip: Run /root/scripts/at1_healthcheck.sh manually for testing. Logs are stored in /var/log/healthcheck/.

File: /root/scripts/at1_healthcheck.sh

#!/bin/bash
# Script Version: 01
# Description: Daily health check for at1 (SMART, Btrfs, disk usage, basic log summary)

set -euo pipefail

# Set variables
# ========
LOG_DIR=/var/log/healthcheck
LOG_FILE="${LOG_DIR}/health_$(date +%F).log"
HOSTNAME=$(hostname)
BTRFS_MOUNT=/mnt/raid1
NVME_DEV=/dev/nvme0
SATA1_DEV=/dev/sda
SATA2_DEV=/dev/sdb

DEBUG=1

log() {
  local LEVEL="$1"; shift
  echo "$(date '+%F %T') [${HOSTNAME}] [${LEVEL}] $*" | tee -a "${LOG_FILE}"
}

_debug() {
  if [ "${DEBUG}" -eq 1 ]; then
    log "DEBUG" "$*"
  fi
}

# Functions
# ========
init_log() {
  mkdir -p "${LOG_DIR}"
  touch "${LOG_FILE}"
}

check_smart() {
  log INFO "==== SMART Status (${NVME_DEV}, ${SATA1_DEV}, ${SATA2_DEV}) ===="
  for DEV in "${NVME_DEV}" "${SATA1_DEV}" "${SATA2_DEV}"; do
    if [ -b "${DEV}" ]; then
      _debug "Checking SMART for ${DEV}"
      echo "---- ${DEV} ----" | tee -a "${LOG_FILE}"
      # smartctl uses non-zero bitmask exit codes even for warnings; don't abort under set -e/pipefail
      smartctl -H -A "${DEV}" 2>&1 | tee -a "${LOG_FILE}" || true
    else
      log WARN "Device ${DEV} does not exist, skipping"
    fi
  done
}

check_btrfs_device_stats() {
  log INFO "==== Btrfs Device Stats (${BTRFS_MOUNT}) ===="
  if mount | grep -q " ${BTRFS_MOUNT} "; then
    btrfs device stats "${BTRFS_MOUNT}" 2>&1 | tee -a "${LOG_FILE}"
  else
    log WARN "${BTRFS_MOUNT} is not mounted, skipping device stats"
  fi
}

check_btrfs_scrub_status() {
  log INFO "==== Btrfs Scrub Status (${BTRFS_MOUNT}) ===="
  if mount | grep -q " ${BTRFS_MOUNT} "; then
    btrfs scrub status "${BTRFS_MOUNT}" 2>&1 | tee -a "${LOG_FILE}"
  else
    log WARN "${BTRFS_MOUNT} is not mounted, skipping scrub status"
  fi
}

check_df() {
  log INFO "==== Filesystem Usage (df -h) ===="
  df -h | tee -a "${LOG_FILE}"
}

check_journal_errors() {
  log INFO "==== Journal Summary (last 15 minutes, priority <= 3) ===="
  journalctl --since "15 min ago" -p 0..3 -o short-precise 2>/dev/null | tee -a "${LOG_FILE}" || true
}

# Main Process
# ========
init_log
log INFO "Starting health check for ${HOSTNAME}"

check_smart
check_btrfs_device_stats
check_btrfs_scrub_status
check_df
check_journal_errors

log INFO "Health check completed successfully"

Your monitoring setup is now clean, consistent, and ready for long‑term reliable operation.