Dedup Detective

Find and resolve duplicate records with fuzzy matching

Goes beyond exact matches to find duplicates that differ by typos, formatting, abbreviations, or missing fields. For example, "Jon Smith" at "123 Main St" and "John Smith" at "123 Main Street" are caught and grouped for your review.

Community · Submitted by Community · Work · 2 min

PROMPT

Create a skill called "Dedup Detective". When I give you a dataset and specify the columns to match on, find duplicate records using: (1) Exact matching on normalized values (trim whitespace, lowercase, remove punctuation). (2) Phonetic matching (Soundex, Metaphone) for name columns. (3) Edit distance (Levenshtein) for strings that might have typos. (4) Token overlap for address-style fields ("123 Main Street" vs "123 Main St"). (5) Combination scoring across multiple columns. For each candidate duplicate group, assign a confidence score (0-1). Present groups for my review, sorted by confidence. Let me approve or reject each group. For approved groups, ask which record to keep as the master and how to combine fields from duplicates. Save the matching rules so I can rerun on future data without re-approving the same patterns.

How It Works

Exact dedup is easy. Fuzzy dedup is where analysts lose days. This skill uses multiple matching strategies (phonetic, edit distance, token overlap) to find records that are probably the same entity, groups them, and lets you review before merging.
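
To make the strategies concrete, here is a minimal Python sketch of how normalization, Soundex, Levenshtein distance, token overlap, and a weighted combined score could fit together. It is an illustration under stated assumptions, not the skill's actual implementation: the function names, the abbreviation map, the way the name score averages the phonetic and edit signals, and the 0.6/0.4 column weights are all hypothetical choices.

```python
import re
import string

# Illustrative abbreviation map (an assumption); extend it for your own data.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "apt": "apartment"}

SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}

def normalize(value):
    """Lowercase, remove punctuation, collapse and trim whitespace."""
    value = value.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", value).strip()

def soundex(word):
    """American Soundex code for a single word, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    last = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "hw":  # h and w do not separate repeated codes
            last = digit
    return (code + "000")[:4]

def levenshtein(a, b):
    """Edit distance computed with a rolling dynamic-programming row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

def name_similarity(a, b):
    """Average of a phonetic signal (0 or 1) and normalized edit similarity."""
    na, nb = normalize(a), normalize(b)
    codes_a = [soundex(t) for t in na.split()]
    codes_b = [soundex(t) for t in nb.split()]
    phonetic = 1.0 if codes_a and codes_a == codes_b else 0.0
    longest = max(len(na), len(nb))
    edit = 1.0 - levenshtein(na, nb) / longest if longest else 0.0
    return (phonetic + edit) / 2

def token_overlap(a, b):
    """Jaccard overlap of tokens after expanding known abbreviations."""
    ta = {ABBREVIATIONS.get(t, t) for t in normalize(a).split()}
    tb = {ABBREVIATIONS.get(t, t) for t in normalize(b).split()}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def score_pair(rec_a, rec_b, column_rules):
    """Weighted average over columns; columns blank in either record are skipped."""
    total, weight_sum = 0.0, 0.0
    for column, (compare, weight) in column_rules.items():
        va, vb = rec_a.get(column, ""), rec_b.get(column, "")
        if not va or not vb:
            continue
        total += weight * compare(va, vb)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

a = {"name": "Jon Smith", "address": "123 Main St"}
b = {"name": "John Smith", "address": "123 Main Street"}
rules = {"name": (name_similarity, 0.6), "address": (token_overlap, 0.4)}
score = score_pair(a, b, rules)
print(round(score, 2))  # 0.97
```

Skipping blank columns and renormalizing the weights is one way to handle missing fields. It keeps a record with no address from being penalized, at the cost of scoring it on less evidence.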

What You Get

  • Exact duplicate detection and removal
  • Fuzzy duplicate detection using configurable match strategies
  • Match confidence scores for each candidate pair
  • Grouped duplicates for review (not auto-merged without approval)
  • Merge rules: which record to keep, how to combine fields
  • A dedup report showing how many records were consolidated
  • Reusable match rules for recurring deduplication jobs
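
The grouping and merge steps in the list above can be sketched as follows. This is a hedged illustration, not the skill's own code: scored pairs are linked into groups with a small union-find, each group takes the score of its weakest pair as its confidence (an assumption), and the merge rule shown, "keep the master and fill its blank fields from the duplicates", is only one of many possible rules. The names `group_pairs` and `merge_group` and the sample records are hypothetical.

```python
from collections import defaultdict

def group_pairs(pairs, threshold):
    """Link (id_a, id_b, score) pairs at or above threshold into groups.

    Returns [(sorted member ids, group confidence)], highest confidence first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = [(a, b, s) for a, b, s in pairs if s >= threshold]
    for a, b, _ in kept:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    members = defaultdict(set)
    confidence = {}
    for a, b, s in kept:
        root = find(a)
        members[root].update((a, b))
        # Assumption: a group is only as trustworthy as its weakest pair.
        confidence[root] = min(confidence.get(root, 1.0), s)

    groups = [(sorted(ids), confidence[root]) for root, ids in members.items()]
    return sorted(groups, key=lambda g: g[1], reverse=True)

def merge_group(records, master_id, member_ids):
    """Keep the master record and fill its blank fields from the duplicates."""
    merged = dict(records[master_id])
    for record_id in member_ids:
        if record_id == master_id:
            continue
        for field, value in records[record_id].items():
            if not merged.get(field) and value:
                merged[field] = value
    return merged

pairs = [(1, 2, 0.97), (2, 3, 0.91), (4, 5, 0.99), (6, 7, 0.40)]
groups = group_pairs(pairs, threshold=0.8)
print(groups)  # [([4, 5], 0.99), ([1, 2, 3], 0.91)]

records = {
    4: {"name": "John Smith", "phone": "", "email": "john@example.com"},
    5: {"name": "Jon Smith", "phone": "555-0100", "email": ""},
}
merged = merge_group(records, master_id=4, member_ids=[4, 5])
consolidated = sum(len(ids) - 1 for ids, _ in groups)  # records removed by merging
```

Nothing here merges automatically: `group_pairs` only proposes groups, and `merge_group` is meant to run after a group has been approved and a master chosen.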

Setup Steps

  1. Ask your Claw to create a "Dedup Detective" skill with the prompt above
  2. Give it a dataset and specify which columns to match on
  3. Review candidate duplicate groups with confidence scores
  4. Approve merges and save the rules for next time
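
As an illustration of what "saving the rules" in step 4 might look like, the sketch below stores match settings and past review decisions as JSON so that a later run can skip pairs that were already approved or rejected. The file layout, every key name, and the record ids are assumptions made for this example; the skill may store its rules differently.

```python
import json
from pathlib import Path

# Hypothetical rule file contents; every key here is an illustrative assumption.
example_rules = {
    "match_columns": {
        "name": {"strategy": "phonetic+edit_distance", "weight": 0.6},
        "address": {"strategy": "token_overlap", "weight": 0.4},
    },
    "review_threshold": 0.8,
    "merge": {"master": "most_recently_updated", "fill_blanks_from_duplicates": True},
    "approved_pairs": [["c-001", "c-002"]],
    "rejected_pairs": [["c-010", "c-011"]],
}

def save_rules(path, rules):
    """Write the rules as readable, stably ordered JSON."""
    Path(path).write_text(json.dumps(rules, indent=2, sort_keys=True),
                          encoding="utf-8")

def load_rules(path):
    """Read rules previously written by save_rules."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def needs_review(rules, id_a, id_b):
    """True if this pair has not been approved or rejected in an earlier run."""
    pair = sorted([id_a, id_b])
    decided = rules["approved_pairs"] + rules["rejected_pairs"]
    return pair not in [sorted(p) for p in decided]
```

On a rerun, a pair such as `("c-002", "c-001")` would be recognized as already approved regardless of order, while a new pair would still go to review.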

Tips

  • Start with high-confidence matches (>0.95) and work down
  • The review step is critical — never auto-merge without human review on the first run
  • Save your merge rules to handle recurring data from the same sources
  • Works on customer records, vendor lists, contact databases, product catalogs, and more

Tags: #data-cleaning #deduplication #data-quality #automation