
SLURM Memory-Kill Detective

Turn "Exceeded job memory limit" into a parameterized fix.

Diagnose SLURM OOM/memory-limit terminations, align requests with actual usage, and update job scripts to prevent repeated queue waste.

Submitted by Community · 12 min

INGREDIENTS

🔍Web

PROMPT

You are OpenClaw. Ask for the sbatch script, job ID, memory flags, and the error log. Explain how SLURM enforces memory, then propose updated resource requests and code-level mitigations (chunking, reducing workers, streaming IO). Provide a short template sbatch block for the user's scenario.

Pain point

Jobs fail with slurmstepd messages indicating memory limits were exceeded, sometimes with confusing "step" vs. "job" memory semantics.

Repro/diagnostic steps

  1. Collect job stderr/stdout showing the memory error and the job's memory parameters (--mem, --mem-per-cpu, etc.).
  2. Identify whether failure is immediate (startup) or late (peak usage).
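When accounting is enabled on the cluster, `sacct` can compare the memory request against actual peak usage (JOBID below is a placeholder; MaxRSS is reported per step, so check the `.batch`/`.0` rows). `sacct` prints MaxRSS with a K/M/G unit suffix, so a small POSIX-shell helper to normalize values to megabytes can make the comparison against `--mem` direct. This is a sketch, not cluster-specific advice:

```shell
# Query SLURM accounting for requested vs. peak memory (JOBID is a placeholder):
#   sacct -j JOBID --format=JobID,State,ReqMem,MaxRSS,Elapsed
# If the seff utility is installed, it summarizes memory efficiency per job:
#   seff JOBID

# Normalize a sacct-style MaxRSS value (e.g. "2048K", "3G", or raw bytes)
# to megabytes, for comparison with the --mem request.
to_mb() {
  case $1 in
    *K|*k) echo $(( ${1%?} / 1024 ));;
    *M|*m) echo "${1%?}";;
    *G|*g) echo $(( ${1%?} * 1024 ));;
    *)     echo $(( $1 / 1024 / 1024 ));;  # plain bytes
  esac
}

to_mb 2048K   # prints 2
to_mb 3G      # prints 3072
```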

Root causes (common)

  • Requested memory below peak consumption.
  • Memory accounted and enforced via cgroups, where RSS plus page-cache effects can push a job over its limit even when the application's own allocations look safe.
  • Containerized jobs with additional overhead.
  • Memory blow-ups from parallelism or unbounded data structures.
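With cgroup enforcement, the limit actually applied to a running job step can be inspected from inside the job. The exact cgroup layout varies by SLURM version and by cgroup v1 vs. v2, so the paths below are illustrative, not universal:

```shell
# cgroup v2 (unified hierarchy): /proc/self/cgroup has one "0::/path" line;
# memory.max in that cgroup is the enforced limit ("max" means unlimited).
cat /sys/fs/cgroup/$(awk -F: '{print $3}' /proc/self/cgroup)/memory.max

# cgroup v1 memory controller (older clusters) — path layout differs per site:
#   cat /sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
```

Comparing this number against the sbatch request confirms whether the enforced limit matches what was asked for, which is useful when "step" vs. "job" semantics are in doubt.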

Fix workflow

  1. Estimate peak memory from prior runs or profiling.
  2. Adjust memory request strategy (per-cpu vs per-node) and parallelism.
  3. Add checkpoints and partial outputs so a kill doesn't lose all progress.
  4. Verify with a smaller test case, then scale.
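The workflow above can be captured in a minimal sbatch template. Values here are placeholders to be sized from the measured peak (with headroom), and `process.py` with its `--workers`/`--checkpoint-dir` flags stands in for the user's own script:

```shell
#!/bin/bash
#SBATCH --job-name=mem-sized
#SBATCH --mem=16G                 # per-node request; use --mem-per-cpu instead
#SBATCH --cpus-per-task=4         # if memory should scale with core count
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out

# Fail fast, and log effective limits for post-mortem comparison with sacct.
set -euo pipefail
echo "Job $SLURM_JOB_ID: mem(MB)=$SLURM_MEM_PER_NODE cpus=$SLURM_CPUS_PER_TASK"

# Bound parallelism by the allocation and checkpoint so a kill loses little work.
srun python process.py --workers "$SLURM_CPUS_PER_TASK" --checkpoint-dir ckpt/
```

Keeping the worker count tied to `$SLURM_CPUS_PER_TASK` prevents the common blow-up where the code spawns more parallel workers than the memory request was sized for.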

Expected result

  • Jobs complete without memory-limit termination and resource requests are right-sized.

References

  • https://docs.hpc.ut.ee/public/cluster/monitoring_and_managing_jobs/investigate_job_failure/
  • https://stackoverflow.com/questions/45993739/slurmstepd-error-exceeded-step-memory-limit-at-some-point
  • https://hpcc.umd.edu/faq/slurm/
Tags: #hpc #slurm #reproducibility #performance