In this section, we will implement a little tool that finds out which files in a directory are duplicates of each other. It will then remove all but one copy of each duplicated file and replace the removed copies with symbolic links, which reduces the folder size.
Make sure to have a backup of your system's data. We will be playing with STL functions that remove files. A single misspelled path in such a program can lead to it greedily removing far more files than intended.
- First, we need to include the necessary headers, and then we declare that we use the std and filesystem namespaces by default.
#include <iostream>
#include <fstream>
#include <unordered_map>
#include <filesystem>
using namespace std;
using namespace filesystem;
- In order to find out which files are duplicates of each other, we will construct a hash map that maps from hashes of file content to the path of the first file from which that hash was generated. In production code, it would be a better idea to use a well-tested file hashing algorithm such as MD5 or an SHA variant. In order to keep the recipe clean and simple, we just read the whole file into a string and then use the same hash function object that unordered_map already uses for strings to calculate the hash.
static size_t hash_from_path(const path &p)
{
    ifstream is {p.c_str(),
                 ios::in | ios::binary};
    if (!is) { throw errno; }

    // Read the whole file content into one string
    string s;
    is.seekg(0, ios::end);
    s.reserve(is.tellg());
    is.seekg(0, ios::beg);
    s.assign(istreambuf_iterator<char>{is}, {});

    // Reuse the string hasher that unordered_map uses by default
    return hash<string>{}(s);
}
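Reading a whole file into one string is fine for a simple recipe, but it means that the largest file in the directory tree must fit into memory. One possible alternative, just a sketch and not part of the original recipe, is to read the file in fixed-size chunks and fold them into a hand-rolled FNV-1a hash. The helper name, the chunk size, and the assumption of a 64-bit size_t are illustrative choices:
static size_t hash_from_path_chunked(const path &p)
{
    ifstream is {p.c_str(), ios::in | ios::binary};
    if (!is) { throw errno; }

    size_t h {14695981039346656037ull}; // FNV-1a 64-bit offset basis
    char buf[4096];                     // arbitrary chunk size

    // Fold every chunk into the running hash value
    while (is.read(buf, sizeof(buf)) || is.gcount()) {
        for (streamsize i {0}; i < is.gcount(); ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;      // FNV-1a 64-bit prime
        }
    }
    return h;
}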
- Then we implement the function that constructs such a hash map and deletes duplicates. It iterates recursively through a directory and its subdirectories.
static size_t reduce_dupes(const path &dir)
{
    unordered_map<size_t, path> m;
    size_t count {0};

    for (const auto &entry :
            recursive_directory_iterator{dir}) {
- For every directory entry, we check whether it is a directory itself; directory items are skipped. For every file, we calculate its hash value and try to insert it into the hash map. If the map already contains the same hash, then a file with that hash was inserted before, which means we just found a duplicate. In case of such a clash during insertion, the second value in the pair that try_emplace returns is false.
        const path p {entry.path()};
        if (is_directory(p)) { continue; }

        const auto &[it, success] =
            m.try_emplace(hash_from_path(p), p);
- Using the return values from try_emplace, we can tell whether the insertion worked, that is, whether we saw this hash for the first time. If it did not, we just found a duplicate: we tell the user which other file it is a duplicate of, delete it, and then create a symbolic link that replaces it.
        if (!success) {
            cout << "Removed " << p.c_str()
                 << " because it is a duplicate of "
                 << it->second.c_str() << '\n';

            remove(p);
            create_symlink(absolute(it->second), p);
            ++count;
        }
- After the filesystem iteration, we return the number of files we deleted and replaced with symlinks.
    }

    return count;
}
- In the main function, we make sure that the user provided a directory on the command line, and that this directory exists.
int main(int argc, char *argv[])
{
    if (argc != 2) {
        cout << "Usage: " << argv[0] << " <path>\n";
        return 1;
    }

    path dir {argv[1]};

    if (!exists(dir)) {
        cout << "Path " << dir << " does not exist.\n";
        return 1;
    }
- The only thing we need to do now is to call reduce_dupes on this directory and print how many files it deleted.
    const size_t dupes {reduce_dupes(dir)};

    cout << "Removed " << dupes << " duplicates.\n";
}
- Compiling and running the program on an example directory that contains some duplicate files looks like the following. I used the du tool to check the folder size before and after launching our program to demonstrate that the approach works.
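The exact compile command depends on the toolchain; assuming the source file is named dupe_compress.cpp, something along these lines should work (older GCC releases additionally need -lstdc++fs for the filesystem library):
$ g++ -std=c++17 -o dupe_compress dupe_compress.cpp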
$ du -sh dupe_dir
1.1M dupe_dir
$ ./dupe_compress dupe_dir
Removed dupe_dir/dir2/bar.jpg because it is a duplicate of
dupe_dir/dir1/bar.jpg
Removed dupe_dir/dir2/base10.png because it is a duplicate of
dupe_dir/dir1/base10.png
Removed dupe_dir/dir2/baz.jpeg because it is a duplicate of
dupe_dir/dir1/baz.jpeg
Removed dupe_dir/dir2/feed_fish.jpg because it is a duplicate of
dupe_dir/dir1/feed_fish.jpg
Removed dupe_dir/dir2/foo.jpg because it is a duplicate of
dupe_dir/dir1/foo.jpg
Removed dupe_dir/dir2/fox.jpg because it is a duplicate of
dupe_dir/dir1/fox.jpg
Removed 6 duplicates.
$ du -sh dupe_dir
584K dupe_dir