
SLURM Memory-Kill Detective

Turn "Exceeded job memory limit" into a parameterized fix.

Diagnose SLURM OOM/memory-limit terminations, align requests with actual usage, and update job scripts to prevent repeated queue waste.

Submitted by Community · 12 min

INGREDIENTS

🔍Web

PROMPT

You are OpenClaw. Ask for the sbatch script, job ID, memory flags, and the error log. Explain how SLURM enforces memory, then propose updated resource requests and code-level mitigations (chunking, reducing workers, streaming IO). Provide a short template sbatch block for the user's scenario.

Pain point

Jobs fail with slurmstepd messages indicating memory limits were exceeded, sometimes with confusing "step" vs. "job" memory semantics.

Repro/diagnostic steps

  1. Collect job stderr/stdout showing the memory error and the job's memory parameters (--mem, --mem-per-cpu, etc.).
  2. Identify whether failure is immediate (startup) or late (peak usage).
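When accounting is enabled on the cluster, `sacct` can compare the memory request against actual peak usage (JOBID below is a placeholder; MaxRSS is reported per step, so check the `.batch`/`.0` rows). `sacct` prints MaxRSS with a K/M/G unit suffix, so a small POSIX-shell helper to normalize values to megabytes can make the comparison against `--mem` direct. This is a sketch, not cluster-specific advice:

```shell
# Query SLURM accounting for requested vs. peak memory (JOBID is a placeholder):
#   sacct -j JOBID --format=JobID,State,ReqMem,MaxRSS,Elapsed
# If the seff utility is installed, it summarizes memory efficiency per job:
#   seff JOBID

# Normalize a sacct-style MaxRSS value (e.g. "2048K", "3G", or raw bytes)
# to megabytes, for comparison with the --mem request.
to_mb() {
  case $1 in
    *K|*k) echo $(( ${1%?} / 1024 ));;
    *M|*m) echo "${1%?}";;
    *G|*g) echo $(( ${1%?} * 1024 ));;
    *)     echo $(( $1 / 1024 / 1024 ));;  # plain bytes
  esac
}

to_mb 2048K   # prints 2
to_mb 3G      # prints 3072
```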

Root causes (common)

  • Requested memory below peak consumption.
  • Memory accounted and enforced via cgroups, where RSS plus page-cache effects can push a job over its limit even when the application's own allocations look safe.
  • Containerized jobs with additional overhead.
  • Memory blow-ups from parallelism or unbounded data structures.
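With cgroup enforcement, the limit actually applied to a running job step can be inspected from inside the job. The exact cgroup layout varies by SLURM version and by cgroup v1 vs. v2, so the paths below are illustrative, not universal:

```shell
# cgroup v2 (unified hierarchy): /proc/self/cgroup has one "0::/path" line;
# memory.max in that cgroup is the enforced limit ("max" means unlimited).
cat /sys/fs/cgroup/$(awk -F: '{print $3}' /proc/self/cgroup)/memory.max

# cgroup v1 memory controller (older clusters) — path layout differs per site:
#   cat /sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
```

Comparing this number against the sbatch request confirms whether the enforced limit matches what was asked for, which is useful when "step" vs. "job" semantics are in doubt.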

Fix workflow

  1. Estimate peak memory from prior runs or profiling.
  2. Adjust memory request strategy (per-cpu vs per-node) and parallelism.
  3. Add checkpoints and partial outputs so a kill doesn't lose all progress.
  4. Verify with a smaller test case, then scale.
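The workflow above can be captured in a minimal sbatch template. Values here are placeholders to be sized from the measured peak (with headroom), and `process.py` with its `--workers`/`--checkpoint-dir` flags stands in for the user's own script:

```shell
#!/bin/bash
#SBATCH --job-name=mem-sized
#SBATCH --mem=16G                 # per-node request; use --mem-per-cpu instead
#SBATCH --cpus-per-task=4         # if memory should scale with core count
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out

# Fail fast, and log effective limits for post-mortem comparison with sacct.
set -euo pipefail
echo "Job $SLURM_JOB_ID: mem(MB)=$SLURM_MEM_PER_NODE cpus=$SLURM_CPUS_PER_TASK"

# Bound parallelism by the allocation and checkpoint so a kill loses little work.
srun python process.py --workers "$SLURM_CPUS_PER_TASK" --checkpoint-dir ckpt/
```

Keeping the worker count tied to `$SLURM_CPUS_PER_TASK` prevents the common blow-up where the code spawns more parallel workers than the memory request was sized for.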

Expected result

  • Jobs complete without memory-limit termination and resource requests are right-sized.

References

  • https://docs.hpc.ut.ee/public/cluster/monitoring_and_managing_jobs/investigate_job_failure/
  • https://stackoverflow.com/questions/45993739/slurmstepd-error-exceeded-step-memory-limit-at-some-point
  • https://hpcc.umd.edu/faq/slurm/
Tags: #hpc #slurm #reproducibility #performance