Save hashes to a .bigtree file

This is useful any time you want to scan something now and use the results later, for example to avoid re-hashing a slow drive every time you adjust the dupe-finding parameters, or to reuse one scan across several commands.

For now, let’s say we want to dedup these files:

example03
├── beethoven.mp3
├── haydn.mp3
├── mozart (copy 1).mp3
├── mozart (copy 2).mp3
├── mozart (copy 3).mp3
├── mozart.mp3
├── pdf_1
│   ├── different-dimensions.pdf
│   ├── ocr.pdf
│   └── scans.pdf
├── pdf_2
│   ├── different-dimensions.pdf
│   ├── ocr.pdf
│   └── scans.pdf
└── pdf_3
    ├── different-dimensions.pdf
    ├── ocr.pdf
    └── scans.pdf

We’ll pretend there are a lot more of them, so that it makes sense to avoid hashing them multiple times. Here’s how we can do only the hashing step.

bigtrees hash example03 \
    --output example03.bigtree

The .bigtree file is a flattened version of the tree structure BigTrees works with internally. It’s mostly a list of lines (or a long table, if you prefer) where each line holds everything we need to know about one file or folder: its type, depth in the tree, hash, modification time, size in bytes, file count, and name.

# type	depth	hash	modtime	nbytes	nfiles	name
F	2	tI3o1uovy6ruP0NQVdubqK	1763783650	197597	1	scans.pdf
F	2	ECxxis6WQ/mK8R01yWrY79	1763783650	890714	1	ocr.pdf
F	2	etwnD5VGdg7T2UP2l48bFs	1763783650	4991	1	different-dimensions.pdf
D	1	yefChv7y4LiL0Lw+vjmXOm	1763783650	1097398	4	pdf_3
F	2	tI3o1uovy6ruP0NQVdubqK	1763783650	197597	1	scans.pdf
F	2	ECxxis6WQ/mK8R01yWrY79	1763783650	890714	1	ocr.pdf
F	2	etwnD5VGdg7T2UP2l48bFs	1763783650	4991	1	different-dimensions.pdf
D	1	yefChv7y4LiL0Lw+vjmXOm	1763783650	1097398	4	pdf_2
F	2	tI3o1uovy6ruP0NQVdubqK	1763783650	197597	1	scans.pdf
F	2	ECxxis6WQ/mK8R01yWrY79	1763783650	890714	1	ocr.pdf
F	2	etwnD5VGdg7T2UP2l48bFs	1763783650	4991	1	different-dimensions.pdf
D	1	yefChv7y4LiL0Lw+vjmXOm	1763783650	1097398	4	pdf_1
F	1	xRddK/EUyJ+AdIJCRZM2ib	1542057367	222794	1	mozart.mp3
F	1	xRddK/EUyJ+AdIJCRZM2ib	1763783650	222794	1	mozart (copy 3).mp3
F	1	xRddK/EUyJ+AdIJCRZM2ib	1763783650	222794	1	mozart (copy 2).mp3
F	1	xRddK/EUyJ+AdIJCRZM2ib	1763783650	222794	1	mozart (copy 1).mp3
F	1	IdF2PaNu2wsEJyuGXiKvBE	1542057367	163047	1	haydn.mp3
F	1	Y9WRQPWD5N0ofEyXqklqsd	1542057367	144546	1	beethoven.mp3
D	0	FyvV3pYJ/+gp0rMXHhLtgp	1763783650	4495059	19	example03
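Because the format is plain tab-separated text, you can also poke at it with ordinary shell tools. Here’s a sketch that pulls out duplicated hashes with awk, assuming the column layout shown above (type, depth, hash, modtime, nbytes, nfiles, name). The sample lines are inlined so the snippet runs on its own; point awk at your real .bigtree file instead.

```shell
# Inline a tiny stand-in for a .bigtree file (tab-separated, as above).
printf 'F\t1\thashA\t0\t10\t1\ta.mp3\nF\t1\thashA\t0\t10\t1\tb.mp3\nF\t1\thashB\t0\t20\t1\tc.mp3\n' > sample.bigtree

# Count how many files ("F" lines) share each hash,
# then print every hash that appears more than once.
awk -F'\t' '$1 == "F" { n[$3]++ }
            END { for (h in n) if (n[h] > 1) print n[h], h }' sample.bigtree
# prints: 2 hashA
```

This is just ad-hoc inspection, of course; for real dedup work the dupes command below does the same grouping properly.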

More importantly, though, you can use a .bigtree file in place of an actual folder in many of the other commands: diff, dupes, find, and set-add.

The main advantage is that if you want to play with the parameters of the dupe-finding algorithm (which files it should include or exclude, how many levels deep to search, how to sort the results, and so on), you won’t have to re-run the hash step each time.

In my experience deduping large personal backup drives (~1-4 terabytes and a few million files each), that can save an entire day of work per run: hashing might take 8-10 hours over a USB connection, while the dupe-finding step at the end takes only 10-30 minutes.

Here’s the same process we used in the first example, except starting from the tree file…

bigtrees dupes example03.bigtree \
    --output dedup.sh \
    --dupes-out-fmt dedup-script
bash dedup.sh
KEEP    'example03/pdf_1'
rm dir  'example03/pdf_2'
rm dir  'example03/pdf_3'
KEEP    'example03/mozart.mp3'
rm file 'example03/mozart (copy 1).mp3'
rm file 'example03/mozart (copy 2).mp3'
rm file 'example03/mozart (copy 3).mp3'

And the final deduped files:

example03
├── beethoven.mp3
├── haydn.mp3
├── mozart.mp3
└── pdf_1
    ├── different-dimensions.pdf
    ├── ocr.pdf
    └── scans.pdf
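The keep-one, remove-the-rest pattern shown above can be sketched in plain shell. This is a hypothetical stand-in, not the actual contents of a generated dedup.sh (those come from bigtrees, and the exact format may differ); it just recreates a tiny duplicate layout and applies the same logic.

```shell
# Build three identical directories, like pdf_1/pdf_2/pdf_3 above.
mkdir -p demo/pdf_1 demo/pdf_2 demo/pdf_3
for d in demo/pdf_1 demo/pdf_2 demo/pdf_3; do
    printf 'identical contents\n' > "$d/scans.pdf"
done

echo "KEEP    'demo/pdf_1'"   # keep the first copy
rm -r demo/pdf_2              # remove the duplicate directories
rm -r demo/pdf_3

ls demo                       # only pdf_1 remains
```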