Save hashes to a .bigtree file
This is useful any time you want to scan something now and use it later. For example, you might want to:
- search the tree without plugging in the corresponding backup drive
- pre-run the long hashing step to speed up other operations
- diff the current state with a future one to see what changed
For now, let’s say we want to dedup these files:
example03
├── beethoven.mp3
├── haydn.mp3
├── mozart (copy 1).mp3
├── mozart (copy 2).mp3
├── mozart (copy 3).mp3
├── mozart.mp3
├── pdf_1
│   ├── different-dimensions.pdf
│   ├── ocr.pdf
│   └── scans.pdf
├── pdf_2
│   ├── different-dimensions.pdf
│   ├── ocr.pdf
│   └── scans.pdf
└── pdf_3
    ├── different-dimensions.pdf
    ├── ocr.pdf
    └── scans.pdf
We’ll pretend there are a lot more of them, so that it makes sense to avoid hashing them multiple times. Here’s how we can do only the hashing step.
bigtrees hash example03 \
  --output example03.bigtree

The .bigtree file is a flattened version of the tree structure BigTrees works with internally. It’s mostly a list of lines (or a long table, if you prefer) where each line has all the info we need about a particular file or folder: its type (F for file, D for directory), depth, hash, modification time, size in bytes, file count, and name.
# type depth hash modtime nbytes nfiles name
F 2 tI3o1uovy6ruP0NQVdubqK 1763783650 197597 1 scans.pdf
F 2 ECxxis6WQ/mK8R01yWrY79 1763783650 890714 1 ocr.pdf
F 2 etwnD5VGdg7T2UP2l48bFs 1763783650 4991 1 different-dimensions.pdf
D 1 yefChv7y4LiL0Lw+vjmXOm 1763783650 1097398 4 pdf_3
F 2 tI3o1uovy6ruP0NQVdubqK 1763783650 197597 1 scans.pdf
F 2 ECxxis6WQ/mK8R01yWrY79 1763783650 890714 1 ocr.pdf
F 2 etwnD5VGdg7T2UP2l48bFs 1763783650 4991 1 different-dimensions.pdf
D 1 yefChv7y4LiL0Lw+vjmXOm 1763783650 1097398 4 pdf_2
F 2 tI3o1uovy6ruP0NQVdubqK 1763783650 197597 1 scans.pdf
F 2 ECxxis6WQ/mK8R01yWrY79 1763783650 890714 1 ocr.pdf
F 2 etwnD5VGdg7T2UP2l48bFs 1763783650 4991 1 different-dimensions.pdf
D 1 yefChv7y4LiL0Lw+vjmXOm 1763783650 1097398 4 pdf_1
F 1 xRddK/EUyJ+AdIJCRZM2ib 1542057367 222794 1 mozart.mp3
F 1 xRddK/EUyJ+AdIJCRZM2ib 1763783650 222794 1 mozart (copy 3).mp3
F 1 xRddK/EUyJ+AdIJCRZM2ib 1763783650 222794 1 mozart (copy 2).mp3
F 1 xRddK/EUyJ+AdIJCRZM2ib 1763783650 222794 1 mozart (copy 1).mp3
F 1 IdF2PaNu2wsEJyuGXiKvBE 1542057367 163047 1 haydn.mp3
F 1 Y9WRQPWD5N0ofEyXqklqsd 1542057367 144546 1 beethoven.mp3
D 0 FyvV3pYJ/+gp0rMXHhLtgp 1763783650 4495059 19 example03

More importantly though, you can use it in place of an actual folder in many of the other commands: diff, dupes, find, and set-add.
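And since each line is just whitespace-separated fields, ordinary text tools can read it too. For example, here’s a quick awk one-liner (my own illustration, not a BigTrees feature) that counts how many times each file hash appears in a listing like the one above:

# count repeated file hashes: column 1 is the type, column 3 is the hash
awk '$1 == "F" { n[$3]++ } END { for (h in n) if (n[h] > 1) print n[h], h }' example03.bigtree

For this example it reports the four identical mozart files, plus three copies each of the three PDFs.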
The main advantage of reusing the tree file is that if you want to play with the parameters of the dupe-finding algorithm (which files it should include or exclude, how many levels deep to search, how to sort the results, and so on), you won’t have to re-run the hash step each time.
In my experience deduping large personal backup drives (~1-4 terabytes and a few million files each), that might save you an entire day of work per run! Hashing could take 8-10 hours over a USB connection, while the dupe-finding algorithm might only take 10-30 minutes at the end.
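Concretely, that split looks something like this. The drive path and output names here are made up, but the commands and flags are the same ones used in this post:

# the slow part, run once (possibly overnight)
bigtrees hash /mnt/backup --output backup.bigtree

# the fast part, re-run as often as you like while tuning options
bigtrees dupes backup.bigtree --output dupes-run1.txt
bigtrees dupes backup.bigtree --output dupes-run2.txt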
Here’s the same process we used in the first example, except starting from the tree file…
bigtrees dupes example03.bigtree \
  --output dedup.sh \
  --dupes-out-fmt dedup-script

bash dedup.sh

KEEP 'example03/pdf_1'
rm dir 'example03/pdf_2'
rm dir 'example03/pdf_3'
KEEP 'example03/mozart.mp3'
rm file 'example03/mozart (copy 1).mp3'
rm file 'example03/mozart (copy 2).mp3'
rm file 'example03/mozart (copy 3).mp3'
And the final deduped files:
example03
├── beethoven.mp3
├── haydn.mp3
├── mozart.mp3
└── pdf_1
    ├── different-dimensions.pdf
    ├── ocr.pdf
    └── scans.pdf
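And since we still have the original scan saved, we can double-check exactly what the cleanup changed by rescanning and diffing. This is a sketch of the idea rather than a copy-pasteable recipe: I’m assuming diff accepts two tree arguments, since tree files stand in for folders, but check the help output for the exact invocation:

# rescan the deduped folder...
bigtrees hash example03 --output example03-after.bigtree

# ...then compare against the scan we saved before deduping
bigtrees diff example03.bigtree example03-after.bigtree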