Chromosome Naming Harmonizer for Genomics Pipelines
Prevent silent "no overlap" results from chr/contig naming mismatches.
Detect and fix mismatched chromosome naming conventions (UCSC chr1 vs Ensembl 1, NCBI accessions, etc.) across FASTA/BAM/BED/GTF/VCF inputs.
Submitted by Community · Work · 10 min
INGREDIENTS
🐙 GitHub, 🔍 Web
PROMPT
You are OpenClaw. Ask for the file types involved (BAM/BED/GTF/FASTA/VCF) and show how to compare BAM headers and FASTA .fai entries. Provide a safe renaming strategy (with a mapping table) and verification tests so the user can prove the names are aligned before re-running full pipelines.
Pain point
Workflows built on bedtools or samtools may return empty intersections, or fail outright, when chromosome names differ between inputs (e.g., "chr1" vs "1").
Repro/diagnostic steps
- Compare sequence names: BAM @SQ lines (SN field), the first column of the FASTA .fai index, and the first column of BED/GTF/VCF.
- Run a quick intersection on a known region and confirm expected overlap.
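The name comparison above can be sketched in plain Python. This is a minimal illustration, not a definitive tool: it assumes the BAM header has already been dumped to text (for example with `samtools view -H`), and the function names (`names_from_fai`, `names_from_sam_header`, `names_from_first_column`, `compare`) and the toy inputs are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def names_from_fai(lines: Iterable[str]) -> Set[str]:
    """Sequence names are the first tab-separated column of a .fai index."""
    return {line.split("\t", 1)[0].strip() for line in lines if line.strip()}


def names_from_sam_header(lines: Iterable[str]) -> Set[str]:
    """Collect SN: values from @SQ lines of a text SAM/BAM header."""
    names: Set[str] = set()
    for line in lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def names_from_first_column(lines: Iterable[str]) -> Set[str]:
    """First column of BED/GTF/VCF records; skips comments and BED track/browser lines."""
    names: Set[str] = set()
    for line in lines:
        if not line.strip() or line.startswith(("#", "track ", "browser ")):
            continue
        names.add(line.split(None, 1)[0])
    return names


def compare(a: Set[str], b: Set[str]) -> Dict[str, List[str]]:
    """Report which names are shared and which appear on one side only."""
    return {"shared": sorted(a & b), "only_a": sorted(a - b), "only_b": sorted(b - a)}


# Toy inputs (assumed for illustration): a UCSC-style .fai and an Ensembl-style BAM header.
fai_lines = ["chr1\t248956422\t6\t60\t61\n", "chrM\t16569\t253105708\t60\t61\n"]
sam_header = ["@HD\tVN:1.6\tSO:coordinate\n", "@SQ\tSN:1\tLN:248956422\n", "@SQ\tSN:MT\tLN:16569\n"]

report = compare(names_from_fai(fai_lines), names_from_sam_header(sam_header))
print(report)
```

An empty `shared` list is the signature of a pure naming mismatch. A partially filled one more often points to extra scaffolds or a different reference build.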
Root causes (common)
- Mixing UCSC- and Ensembl-style references/annotations.
- Files derived from different reference builds or naming standards.
- Hidden scaffolds/unplaced contigs present in one file but not another.
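To see which of these causes applies, it can help to tally the naming styles present in each file. The sketch below uses rough heuristics that are assumptions, not a standard: a `chr` prefix is treated as UCSC-style, bare numbers and `X`/`Y`/`MT` as Ensembl-style, and `NC_`/`NT_`/`NW_` accessions as NCBI RefSeq-style. Everything else (scaffolds, decoys, and so on) falls into `other`. The names `classify` and `style_counts` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Heuristic patterns (assumptions for illustration only).
_REFSEQ = re.compile(r"^(NC|NT|NW)_\d+\.\d+$")
_ENSEMBL_PRIMARY = re.compile(r"^(\d+|X|Y|MT)$")


def classify(name: str) -> str:
    """Guess the naming convention of a single sequence name."""
    if name.startswith("chr"):
        return "ucsc"
    if _REFSEQ.match(name):
        return "ncbi_accession"
    if _ENSEMBL_PRIMARY.match(name):
        return "ensembl"
    return "other"


def style_counts(names: Iterable[str]) -> Counter:
    """Count how many names fall into each style."""
    return Counter(classify(n) for n in names)


# Toy example: a file mixing conventions plus one unplaced scaffold.
print(style_counts(["chr1", "chr2", "1", "NC_000001.11", "GL000192.1"]))
```

A file whose counts are split across styles, or two files whose dominant styles differ, deserves a closer look before any intersection is trusted.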
Fix workflow
- Standardize naming at the earliest reproducible stage (preferably regenerate from original reference).
- If transformation is unavoidable, apply one rename map to every affected file, then rebuild the indexes (e.g., .fai, .bai, .tbi).
- Add a "header compatibility check" step to pipelines before heavy compute.
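A rename map can be applied to tab-separated text formats (BED, GTF) along the following lines. This is a sketch under stated assumptions: the mapping is a two-column `old<TAB>new` table, the helper names `load_map` and `rename_first_column` are hypothetical, and unmapped names are treated as errors rather than silently passed through. BAM and VCF headers also carry contig names and usually call for format-aware tools (for example `samtools reheader` or `bcftools annotate --rename-chrs`) rather than a text rewrite.

```python
from typing import Dict, Iterable, Iterator


def load_map(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'old<TAB>new' lines; reject maps that merge two names into one."""
    mapping: Dict[str, str] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        old, new = line.rstrip("\n").split("\t")[:2]
        mapping[old] = new
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("rename map is not one-to-one")
    return mapping


def rename_first_column(lines: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    """Rewrite column 1 of each record; pass comment and track lines through unchanged."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track ", "browser ")):
            yield line
            continue
        parts = line.split("\t", 1)
        if parts[0] not in mapping:
            # Fail loudly: a silently skipped contig recreates the original problem.
            raise KeyError(f"no mapping for sequence name {parts[0]!r}")
        parts[0] = mapping[parts[0]]
        yield "\t".join(parts)


# Toy example (assumed data): Ensembl-style BED records renamed to UCSC style.
mapping = load_map(["1\tchr1\n", "MT\tchrM\n"])
bed = ["# genes\n", "1\t100\t200\tgeneA\n", "MT\t5\t50\tgeneB\n"]
print("".join(rename_first_column(bed, mapping)), end="")
```

Raising on an unmapped name is a deliberate choice here: it makes the mapping table the single source of truth and surfaces scaffolds that exist in one file but not in the map.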
Expected result
- Intersections and annotations return non-empty results that match the overlap expected for a known region.
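A tiny smoke test can confirm this before a full pipeline run. The sketch below is an assumption-laden illustration, not a replacement for bedtools: intervals are `(chrom, start, end)` tuples with half-open coordinates, the coordinates are made up, and `count_overlaps` is a hypothetical helper that compares every pair, which is only sensible for a handful of records.

```python
from typing import Iterable, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def count_overlaps(a: Iterable[Interval], b: Iterable[Interval]) -> int:
    """Count pairs that share a sequence name and overlap by at least one base."""
    b_list = list(b)
    return sum(
        1
        for chrom_a, start_a, end_a in a
        for chrom_b, start_b, end_b in b_list
        if chrom_a == chrom_b and start_a < end_b and start_b < end_a
    )


# Toy known region (made-up coordinates): one read and one gene that should overlap.
reads = [("1", 1000, 1100)]
genes_ucsc = [("chr1", 900, 1050)]
genes_fixed = [("1", 900, 1050)]

print("before fix:", count_overlaps(reads, genes_ucsc))
print("after fix:", count_overlaps(reads, genes_fixed))
```

If the count for a region known to overlap stays at zero after renaming, the problem is more likely a build or coordinate difference than a naming one.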
References
- https://www.biostars.org/p/138011/
- https://www.reddit.com/r/bioinformatics/comments/lxqssk/bedtools_intersect/
- https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/69299-unable-to-intersect-bam-file-with-bedtools
- https://github.com/nf-core/sarek/blob/master/docs/usage.md
Tags: #bioinformatics #genomics #reproducibility #data-integrity