We have already noted that our hash function is too simplistic. We chose it in order to keep this recipe simple and free of external dependencies.
What is the problem with our hash function? There are actually two problems:
- We read the whole file into a string. This is disastrous for files that are larger than our system memory.
- The C++ hash function trait hash<string> is most probably not designed for this purpose. It is meant for hash tables such as unordered_map, and its result is only as wide as a size_t.
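The first problem can be avoided by feeding the file to the hash function chunk by chunk instead of loading it into one string. The following is only a minimal sketch of that idea. The helper names hash_stream and hash_file, the 64 KiB buffer size, and the choice of the 64-bit FNV-1a hash are assumptions made for illustration; FNV-1a is not a cryptographic hash, so this sketch does not solve the second problem.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <istream>

// 64-bit FNV-1a parameters. FNV-1a is NOT a cryptographic hash; it is used
// here only to illustrate hashing with constant memory consumption.
constexpr std::uint64_t fnv_offset_basis {0xcbf29ce484222325ull};
constexpr std::uint64_t fnv_prime        {0x100000001b3ull};

// Hypothetical helper: hashes everything that can be read from the stream.
std::uint64_t hash_stream(std::istream &is)
{
    // Fixed-size buffer (size chosen arbitrarily): memory use no longer
    // depends on the size of the file.
    std::array<char, 64 * 1024> buffer;
    std::uint64_t h {fnv_offset_basis};

    while (is) {
        is.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        // The last read is usually shorter than the buffer; gcount() tells
        // us how many bytes were actually read.
        const auto n = static_cast<std::size_t>(is.gcount());
        for (std::size_t i {0}; i < n; ++i) {
            h ^= static_cast<unsigned char>(buffer[i]);
            h *= fnv_prime;
        }
    }
    return h;
}

// Hypothetical helper: hashes the content of a file in binary mode.
// Note that a file that cannot be opened yields the same value as an
// empty file; real code should report that error instead.
std::uint64_t hash_file(const std::filesystem::path &p)
{
    std::ifstream is {p, std::ios::binary};
    return hash_stream(is);
}
```

With such a helper, the recipe could call hash_file(path) for every file it visits, without ever holding more than one buffer of file data in memory.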
If we are looking for a better hash function, we should take one that is fast, memory-friendly, and that makes it extremely unlikely that two large but different files get the same hash. No hash function can rule out such collisions completely, but this last requirement is probably the most important one. If we decide that one file is a duplicate of another although they do not contain the same data, we lose data as soon as we delete it.
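Better hash algorithms are, for example, MD5 or one of the SHA variants (MD5 is no longer considered collision-resistant against deliberate attacks, so a SHA-2 variant is the safer choice). To get access to such functions in our program, we could use the OpenSSL cryptography API, for example.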