Dedup Detective
Find and resolve duplicate records with fuzzy matching
Goes beyond exact matches to find duplicates that differ by typos, formatting, abbreviations, or missing fields. "Jon Smith" at "123 Main St" and "John Smith" at "123 Main Street" — caught and grouped for your review.
PROMPT
Create a skill called "Dedup Detective". When I give you a dataset and specify the columns to match on, find duplicate records using: (1) Exact matching on normalized values (trim whitespace, lowercase, remove punctuation). (2) Phonetic matching (Soundex, Metaphone) for name columns. (3) Edit distance (Levenshtein) for strings that might have typos. (4) Token overlap for address-style fields ("123 Main Street" vs "123 Main St"). (5) Combination scoring across multiple columns. For each candidate duplicate group, assign a confidence score (0-1). Present groups for my review, sorted by confidence. Let me approve or reject each group. For approved groups, ask which record to keep as the master and how to combine fields from duplicates. Save the matching rules so I can rerun on future data without re-approving the same patterns.
How It Works
Exact dedup is easy. Fuzzy dedup is where analysts lose days. This skill
uses multiple matching strategies (phonetic, edit distance, token overlap)
to find records that are probably the same entity, groups them, and lets
you review before merging.
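To make these strategies concrete, here is a minimal Python sketch of how normalization, Soundex, Levenshtein distance, and token overlap might be combined into one score per record pair. It is an illustration, not the skill's actual implementation. The function names, the `ABBREVIATIONS` table, the 0.9 floor for phonetic matches, and the equal column weights are all assumptions chosen for the example.

```python
import re
import string

# Hypothetical abbreviation table; extend it for your own data.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}
SOUNDEX_CODES = {c: d for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                         "4": "l", "5": "mn", "6": "r"}.items()
                 for c in letters}

def normalize(value: str) -> str:
    """Lowercase, remove punctuation, collapse whitespace."""
    no_punct = value.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()

def soundex(word: str) -> str:
    """American Soundex code: first letter plus three digits."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    out, prev = word[0].upper(), SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (out + "000")[:4]

def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def name_similarity(a: str, b: str) -> float:
    """Edit similarity, floored at 0.9 (an assumed value) when every token sounds alike."""
    ta, tb = normalize(a).split(), normalize(b).split()
    same_sound = bool(ta) and [soundex(t) for t in ta] == [soundex(t) for t in tb]
    return max(edit_similarity(a, b), 0.9 if same_sound else 0.0)

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of tokens after expanding abbreviations."""
    ta = {ABBREVIATIONS.get(t, t) for t in normalize(a).split()}
    tb = {ABBREVIATIONS.get(t, t) for t in normalize(b).split()}
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def record_score(r1: dict, r2: dict, rules: dict) -> float:
    """Weighted average over the columns that are filled in on both records."""
    usable = [(fn, w, col) for col, (fn, w) in rules.items() if r1.get(col) and r2.get(col)]
    total = sum(w for _, w, _ in usable)
    if total == 0:
        return 0.0
    return sum(w * fn(r1[col], r2[col]) for fn, w, col in usable) / total

# Column -> (similarity function, weight); the weights are illustrative.
RULES = {"name": (name_similarity, 0.5), "address": (token_overlap, 0.5)}

a = {"name": "Jon Smith", "address": "123 Main St"}
b = {"name": "John Smith", "address": "123 Main Street"}
score = record_score(a, b, RULES)
print(round(score, 2))
```

With these assumed weights, the "Jon Smith" and "John Smith" records from the description score about 0.95: the names differ by one edit and share Soundex codes, and the addresses match fully once "St" is expanded to "Street".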
What You Get
- Exact duplicate detection and removal
- Fuzzy duplicate detection using configurable match strategies
- Match confidence scores (0-1) for each candidate duplicate group
- Grouped duplicates for review (not auto-merged without approval)
- Merge rules: which record to keep, how to combine fields
- A dedup report showing how many records were consolidated
- Reusable match rules for recurring deduplication jobs
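The grouping and merge items above could work roughly as in the following sketch. It assumes scored pairs arrive as `(id_a, id_b, score)` tuples, joins pairs into groups with a union-find structure, and takes each group's confidence to be its weakest link. The merge rule shown (keep the master's values, fill its blanks from the duplicates) is only one possible rule. All names here are hypothetical.

```python
from collections import defaultdict

def group_pairs(scored_pairs, threshold):
    """Join pairs scoring at or above threshold into groups, highest confidence first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = [(a, b, s) for a, b, s in scored_pairs if s >= threshold]
    for a, b, _ in kept:
        parent[find(a)] = find(b)

    members = defaultdict(set)
    weakest = {}
    for a, b, s in kept:
        root = find(a)
        members[root].update((a, b))
        # Assumed convention: a group is only as confident as its weakest pair.
        weakest[root] = min(s, weakest.get(root, 1.0))

    groups = [(sorted(ids), weakest[root]) for root, ids in members.items()]
    return sorted(groups, key=lambda g: g[1], reverse=True)

def merge_group(records, master_id, member_ids):
    """Keep the master record and fill its empty fields from the other members."""
    merged = dict(records[master_id])
    for rid in member_ids:
        if rid == master_id:
            continue
        for field, value in records[rid].items():
            if merged.get(field) in (None, "") and value not in (None, ""):
                merged[field] = value
    return merged

pairs = [(1, 2, 0.95), (2, 3, 0.88), (4, 5, 0.97), (6, 7, 0.40)]
groups = group_pairs(pairs, threshold=0.8)
print(groups)

records = {
    1: {"name": "Jon Smith", "phone": ""},
    2: {"name": "John Smith", "phone": "555-0100"},
    3: {"name": "John Smith", "phone": "", "email": "john@example.com"},
}
merged = merge_group(records, master_id=1, member_ids=[1, 2, 3])
print(merged)
```

Nothing here merges automatically: `group_pairs` only proposes groups, and `merge_group` is meant to run after a group has been approved and a master chosen.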
Setup Steps
- Ask your Claw to create a "Dedup Detective" skill with the prompt above
- Give it a dataset and specify which columns to match on
- Review candidate duplicate groups with confidence scores
- Approve merges and save the rules for next time
Tips
- Start with high-confidence matches (>0.95) and work down
- The review step is critical — never auto-merge without human review on the first run
- Save your merge rules to handle recurring data from the same sources
- Works on customer records, vendor lists, contact databases, product catalogs, and more
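One way to picture saved rules and the "start high, work down" review order is the sketch below. It assumes rules live in a JSON file holding the match columns, a minimum confidence, and earlier approve/reject decisions keyed by the normalized values in a group. The file layout and every name in it are hypothetical, not the format the skill actually writes.

```python
import json
import tempfile
from pathlib import Path

def pattern_key(values):
    """Order-independent key for a group of normalized values."""
    return "|".join(sorted(values))

def save_rules(path, rules):
    Path(path).write_text(json.dumps(rules, indent=2, sort_keys=True))

def load_rules(path):
    return json.loads(Path(path).read_text())

def triage(groups, rules):
    """Split groups into already-decided ones and a review queue, highest confidence first."""
    decided, review = [], []
    for values, confidence in groups:
        decision = rules["decisions"].get(pattern_key(values))
        if decision is not None:
            decided.append((values, decision))      # seen before: reuse the decision
        elif confidence >= rules["min_confidence"]:
            review.append((values, confidence))     # new pattern: needs a human
    review.sort(key=lambda g: g[1], reverse=True)
    return decided, review

rules = {
    "match_columns": ["name", "address"],
    "min_confidence": 0.6,
    "decisions": {pattern_key(["john smith", "jon smith"]): "approve"},
}

with tempfile.TemporaryDirectory() as tmp:
    rules_path = Path(tmp) / "dedup_rules.json"
    save_rules(rules_path, rules)
    reloaded = load_rules(rules_path)

candidate_groups = [
    (["jon smith", "john smith"], 0.95),
    (["acme inc", "acme incorporated"], 0.82),
    (["a", "b"], 0.30),
    (["bob lee", "rob lee"], 0.91),
]
decided, review = triage(candidate_groups, reloaded)
print(decided)
print(review)
```

On a rerun, the previously approved "jon smith" and "john smith" pattern is not queued again, and the remaining groups come back sorted so the most confident ones are reviewed first.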