SLURM Memory-Kill Detective
Turn "Exceeded job memory limit" into a parameterized fix.
Diagnose SLURM OOM/memory-limit terminations, align requests with actual usage, and update job scripts to prevent repeated queue waste.
Submitted by Community · 12 min
INGREDIENTS
🔍Web
PROMPT
You are OpenClaw. Ask for the sbatch script, job ID, memory flags, and the error log. Explain how SLURM enforces memory, then propose updated resource requests and code-level mitigations (chunking, reducing workers, streaming IO). Provide a short template sbatch block for the user's scenario.
Pain point
Jobs fail with slurmstepd messages indicating memory limits were exceeded, sometimes with confusing "step" vs "job" memory semantics: the limit may be enforced on an individual job step (an srun invocation) or on the allocation as a whole.
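A quick way to tell the two apart is to search the job's stderr for which variant of the message appeared. The sketch below is self-contained for illustration: the log filename and its contents are hypothetical samples that mirror the usual slurmstepd wording.

```shell
#!/bin/sh
# Sketch: distinguish a step-level kill from a job-level kill by the
# wording of the slurmstepd error. The sample log below is hypothetical.
cat > slurm-1234567.err <<'EOF'
slurmstepd: error: Exceeded step memory limit at some point.
EOF

if grep -q "Exceeded step memory limit" slurm-1234567.err; then
  echo "step-level kill: one job step (srun) hit its share of the allocation"
elif grep -q "Exceeded job memory limit" slurm-1234567.err; then
  echo "job-level kill: the allocation as a whole exceeded --mem/--mem-per-cpu"
fi
```

In real use, replace the heredoc with your actual `slurm-<jobid>.err` file.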
Repro/diagnostic steps
- Collect the job's stderr/stdout showing the memory error, along with the job's memory parameters (--mem, --mem-per-cpu, etc.).
- Identify whether failure is immediate (startup) or late (peak usage).
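For completed or killed jobs, SLURM's accounting records usually show how close actual peak usage came to the request. A minimal diagnostic fragment, assuming accounting (`sacct`) is enabled on the cluster and `seff` is installed; these commands require a SLURM cluster, so they are shown as a sketch rather than a runnable test:

```shell
# Compare requested memory (ReqMem) against observed peak RSS (MaxRSS)
# for job 1234567 (hypothetical job ID):
sacct -j 1234567 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed

# If the seff contrib script is installed, it summarizes memory
# efficiency directly:
seff 1234567
```

If MaxRSS is close to or above ReqMem, the request was simply too small; if the job died early with MaxRSS far below ReqMem, look for a startup spike or a per-step limit instead.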
Root causes (common)
- Requested memory below peak consumption.
- Memory is accounted and enforced via cgroups, so RSS plus page-cache usage can count against the limit.
- Containerized jobs with additional overhead.
- Memory blow-ups from parallelism or unbounded data structures.
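The parallelism blow-up is often just arithmetic: with `--mem-per-cpu`, the job's total cap scales with CPU count, and each extra worker also multiplies per-worker peak usage. A back-of-envelope sizing sketch (all numbers hypothetical):

```shell
#!/bin/sh
# Sizing sketch: with --mem-per-cpu, the job's memory cap is
# mem-per-cpu x CPUs. Adding workers raises both the cap and,
# typically, the actual peak (one working set per worker).
MEM_PER_CPU_MB=4000    # e.g. --mem-per-cpu=4000
CPUS_PER_TASK=8        # e.g. --cpus-per-task=8
echo "job memory cap: $(( MEM_PER_CPU_MB * CPUS_PER_TASK )) MB"
```

If each worker peaks above `MEM_PER_CPU_MB`, the job will be killed no matter how many CPUs it has; reduce workers or raise the per-CPU request.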
Fix workflow
- Estimate peak memory from prior runs or profiling.
- Adjust memory request strategy (per-cpu vs per-node) and parallelism.
- Add checkpoints and partial outputs so a kill doesn't lose all progress.
- Verify with a smaller test case, then scale.
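The workflow above can be captured in a right-sized sbatch template. All values (job name, memory, CPUs, time, application command) are hypothetical placeholders to adapt; this is a config fragment, not a runnable script outside a SLURM cluster:

```shell
#!/bin/bash
#SBATCH --job-name=mem_sized
#SBATCH --mem=16G              # whole-allocation cap; alternatively use --mem-per-cpu
#SBATCH --cpus-per-task=4
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out     # %x = job name, %j = job ID
#SBATCH --error=%x-%j.err

# Hypothetical application command; replace with your own workload.
# Size --mem at measured peak (sacct MaxRSS) plus ~20% headroom.
srun ./my_app --input data.bin
```

Start with a scaled-down input to confirm the request holds, then raise `--mem` proportionally when scaling up.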
Expected result
- Jobs complete without memory-limit termination and resource requests are right-sized.
References
- https://docs.hpc.ut.ee/public/cluster/monitoring_and_managing_jobs/investigate_job_failure/
- https://stackoverflow.com/questions/45993739/slurmstepd-error-exceeded-step-memory-limit-at-some-point
- https://hpcc.umd.edu/faq/slurm/
Tags: #hpc #slurm #reproducibility #performance