Dedup a folder in 3 easy steps

Imagine we want to dedup these files:

example01
├── beethoven.mp3
├── haydn.mp3
├── mozart (copy 1).mp3
├── mozart (copy 2).mp3
├── mozart (copy 3).mp3
├── mozart.mp3
├── pdf_1
│   ├── different-dimensions.pdf
│   ├── ocr.pdf
│   └── scans.pdf
├── pdf_2
│   ├── different-dimensions.pdf
│   ├── ocr.pdf
│   └── scans.pdf
└── pdf_3
    ├── different-dimensions.pdf
    ├── ocr.pdf
    └── scans.pdf

Step 1: Tell BigTrees to scan them and generate a dedup script.

bigtrees dupes example01 \
    --output dedup.sh \
    --dupes-out-fmt dedup-script

Step 2: Skim the script to make sure it has grouped the duplicates the way you want. BigTrees sorts each set of dupes and puts its best guess at the copy you’d prefer to keep at the top. For each set, the script confirms that this first copy still exists, then deletes the others.

Here are the relevant lines for our example:

# 3 duplicate directories with hash yefChv7y4LiL0Lw+vjmXOm
keep 'example01/pdf_1'
rm_d 'example01/pdf_2'
rm_d 'example01/pdf_3'

# 4 duplicate files with hash xRddK/EUyJ+AdIJCRZM2ib
keep 'example01/mozart.mp3'
rm_f 'example01/mozart (copy 1).mp3'
rm_f 'example01/mozart (copy 2).mp3'
rm_f 'example01/mozart (copy 3).mp3'
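
To make the keep-then-delete behavior concrete, here is a minimal sketch of how helpers with these names could work. It is an illustration only, not the code BigTrees actually generates: the function bodies, the error message, and the throwaway demo directory are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, NOT BigTrees' real implementation.
# They illustrate "confirm the keeper exists, then delete the rest".

keep() {
    # Abort the whole script if the copy we intend to keep is missing,
    # so that none of its duplicates get deleted.
    if [ ! -e "$1" ]; then
        echo "ERROR: keeper is missing, refusing to continue: '$1'" >&2
        exit 1
    fi
    echo "KEEP    '$1'"
}

rm_f() {
    # Delete one duplicate file.
    echo "rm file '$1'"
    rm -f -- "$1"
}

rm_d() {
    # Delete one duplicate directory and everything in it.
    echo "rm dir  '$1'"
    rm -rf -- "$1"
}

# Demo in a temporary directory (assumed layout, mirrors the example above).
demo=$(mktemp -d)
mkdir "$demo/example01"
echo "music" > "$demo/example01/mozart.mp3"
cp "$demo/example01/mozart.mp3" "$demo/example01/mozart (copy 1).mp3"

keep "$demo/example01/mozart.mp3"
rm_f "$demo/example01/mozart (copy 1).mp3"
```

The point of the sketch is the ordering: because the keeper is checked first, a group whose keeper has disappeared stops the run before anything in that group is removed.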

You can edit the script now to keep different copies if you want. Just be sure each keep line ends up at the top of its group, since that is the copy the script checks before deleting the rest.
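
After hand-editing, a quick check that every group still starts with a keep line can catch mistakes before anything is deleted. The sketch below is one way to do that with awk. It assumes the script follows the layout in the excerpt above (a "# N duplicate ..." comment, then keep and rm_f/rm_d lines); check_keep_first is a hypothetical helper, not part of BigTrees.

```shell
#!/usr/bin/env bash
# Sketch: verify that each duplicate group in a dedup script begins with `keep`.
# check_keep_first is a hypothetical helper; the group layout is assumed
# to match the excerpt shown earlier.

check_keep_first() {
    awk '
        BEGIN                   { bad = 0 }
        # A header comment starts a new group.
        /^# [0-9]+ duplicate /  { group = $0; ingroup = 1; seen = 0; next }
        # Ignore everything before the first group header.
        !ingroup                { next }
        # A keep line must be the first command in its group.
        /^keep /                { if (seen) { print "keep is not first: " group; bad = 1 }
                                  seen = 1; next }
        # A delete line must come after a keep line.
        /^rm_[df] /             { if (!seen) { print "delete before keep: " group; bad = 1 }
                                  seen = 1; next }
        END                     { exit bad }
    ' "$1"
}

# Demo: one correctly edited group and one where the keep line was not moved up.
good=$(mktemp)
bad=$(mktemp)

cat > "$good" <<'EOF'
# 4 duplicate files with hash xRddK/EUyJ+AdIJCRZM2ib
keep 'example01/mozart (copy 1).mp3'
rm_f 'example01/mozart.mp3'
rm_f 'example01/mozart (copy 2).mp3'
rm_f 'example01/mozart (copy 3).mp3'
EOF

cat > "$bad" <<'EOF'
# 4 duplicate files with hash xRddK/EUyJ+AdIJCRZM2ib
rm_f 'example01/mozart.mp3'
keep 'example01/mozart (copy 1).mp3'
rm_f 'example01/mozart (copy 2).mp3'
rm_f 'example01/mozart (copy 3).mp3'
EOF

check_keep_first "$good" && echo "good script: every group starts with keep"
check_keep_first "$bad"  || echo "bad script: fix the groups listed above"
```

On the real file this would be run as check_keep_first dedup.sh. A non-zero exit status means at least one group needs its keep line moved to the top.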

Step 3: When you’re ready, go ahead and run it:

bash dedup.sh

KEEP    'example01/pdf_1'
rm dir  'example01/pdf_2'
rm dir  'example01/pdf_3'
KEEP    'example01/mozart.mp3'
rm file 'example01/mozart (copy 1).mp3'
rm file 'example01/mozart (copy 2).mp3'
rm file 'example01/mozart (copy 3).mp3'

Afterward, the folder contains just one copy of each file:

example01
├── beethoven.mp3
├── haydn.mp3
├── mozart.mp3
└── pdf_1
    ├── different-dimensions.pdf
    ├── ocr.pdf
    └── scans.pdf