Looking for a GENUINE FREEWARE duplicate file finder

Luzern
Seasoned Champion
Posts: 4,374
Thanks: 707
Fixes: 5
Registered: ‎31-07-2007

I have a multitude of duplicate files, the duplicates mostly having longer names that include the likes of

... (2018_06_04 06_52_58 UTC)....

I say GENUINE FREEWARE because I have downloaded too many which purport to be, but then it turns out they are very limited unless you upgrade to a paid version. Totally dishonest! :angry:

I want something that is easy to use and gives the user a good choice of drives and locations to scan.

I'll be grateful for any recommendations.

No one has to agree with my opinion, but in the time I have left a miracle would be nice.
13 REPLIES
Moderator
Moderator
Posts: 19,707
Thanks: 3,451
Fixes: 337
Registered: ‎06-04-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder

I use two:

https://www.auslogics.com/en/software/duplicate-file-finder/

and https://www.ashisoft.com

Both have paid-for versions, but the free versions have always been good enough for my needs, with various ways of searching and with previews.

CCleaner also has a duplicate file searcher but I found that took much longer.

Forum Moderator and Customer
Courage is resistance to fear, mastery of fear, not absence of fear - Mark Twain
He who feared he would not succeed sat still

Moderator
Moderator
Posts: 28,424
Thanks: 2,458
Fixes: 317
Registered: ‎14-04-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder

I've used FreeCommander quite a bit in the past.

Customer and Forum Moderator. Windows 10 Firefox 69.0.3 (64-bit)

Life is tough enough without adding Linux into the mix.
Community Veteran
Posts: 6,970
Thanks: 2,250
Fixes: 43
Registered: ‎16-10-2014

Re: Looking for a GENUINE FREEWARE duplicate file finder

I wrote one of these before for a forum member, but I guess they're not using it any more :cry:

But there are duplicate files and duplicate file names, and a file with the same name may not actually be a duplicate. So any good duplicate finder will hash the files as well, to ensure they are true duplicates.

That said, this is a very resource- and processor-intensive procedure, so it will take some time. Imagine you have a file called A.dat in the root of C:\ and a possible duplicate of it elsewhere on the disk. Every other file on the disk needs to be checked against A.dat to see if it's a duplicate, so the more files you have the longer it will take, and if hashing is involved it will take even longer. Even when one duplicate is found there may be more, so every file on the disk must still be checked.
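
One common way to cut that work down is to group files by size first, since files of different sizes cannot be identical, and then hash only the files that share a size with at least one other file. The sketch below is an illustration only, not the tool mentioned above. The names findDuplicates, readAll and fnv1a are made up for the example, and FNV-1a is a simple non-cryptographic stand-in for a proper digest such as SHA-512.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Reads a whole file into memory. Fine for a sketch; big files would want chunked reads.
std::string readAll(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

// FNV-1a 64-bit: a stand-in hash for illustration (assumption, not a cryptographic digest).
std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Returns groups of paths under `root` that share both size and hash.
// A careful tool would confirm each group byte-for-byte, or use a stronger
// hash, because a 64-bit hash can in principle collide.
std::vector<std::vector<fs::path>> findDuplicates(const fs::path& root) {
    // Pass 1: bucket by size. No file contents are read here.
    std::map<std::uintmax_t, std::vector<fs::path>> bySize;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file()) {
            bySize[entry.file_size()].push_back(entry.path());
        }
    }

    std::vector<std::vector<fs::path>> groups;
    for (const auto& sizeBucket : bySize) {
        if (sizeBucket.second.size() < 2) continue;  // unique size, nothing to hash

        // Pass 2: hash only the files whose size matches another file.
        std::map<std::uint64_t, std::vector<fs::path>> byHash;
        for (const auto& p : sizeBucket.second) {
            byHash[fnv1a(readAll(p))].push_back(p);
        }
        for (const auto& hashBucket : byHash) {
            if (hashBucket.second.size() > 1) groups.push_back(hashBucket.second);
        }
    }
    return groups;
}
```

Each file is read at most once and no pair of files is compared directly, so the cost grows roughly with the amount of data that shares a size rather than with the number of file pairs. Note that recursive_directory_iterator can throw on folders it is not allowed to read, which a real tool would need to handle.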

 

VileReynard
Champion
Posts: 11,845
Thanks: 455
Fixes: 17
Registered: ‎01-09-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder

A good duplicate file detector will analyse photos on a similarity basis so that virtually identical photos can be presented for selection by a human. :smiley:

Simple hashing will only find exact duplicates.

digiKam (Linux) can do this - although it is really a much bigger photo management tool and does much more than this.
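
To illustrate the difference between an exact hash and a similarity measure, here is a minimal sketch of one well-known technique, the "average hash". It is not how digiKam works internally (that is not covered here), and it assumes the photo has already been shrunk to an 8x8 greyscale thumbnail by some other tool. The names Thumb, averageHash and hammingDistance are made up for the example.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <numeric>

// An 8x8 greyscale thumbnail, one byte per pixel (assumed to be produced elsewhere).
using Thumb = std::array<std::uint8_t, 64>;

// Sets bit i when pixel i is brighter than the mean brightness of the thumbnail.
std::uint64_t averageHash(const Thumb& px) {
    const unsigned sum = std::accumulate(px.begin(), px.end(), 0u);
    const unsigned mean = sum / 64;
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < px.size(); ++i) {
        if (px[i] > mean) bits |= (std::uint64_t{1} << i);
    }
    return bits;
}

// Number of differing bits: 0 means the same fingerprint, 64 means completely different.
int hammingDistance(std::uint64_t a, std::uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}
```

Two photos whose fingerprints differ by only a few bits are candidates to show to a human; the threshold is a judgement call. A cryptographic hash such as SHA-512 behaves the opposite way on purpose: changing a single byte gives a completely different digest.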

 

"In The Beginning Was The Word, And The Word Was Aardvark."

Community Veteran
Posts: 6,970
Thanks: 2,250
Fixes: 43
Registered: ‎16-10-2014

Re: Looking for a GENUINE FREEWARE duplicate file finder

That’s all very well @VileReynard but similar is not identical, that’s where the hash comes in!

For what it’s worth, no doubt nothing to most of you, this code will hash all of the files in a directory using a SHA512 hash digest.

#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <cstdio>
#include <stdlib.h>
#include <cstring>
#include <iostream>
#include <unistd.h>
#include <exception>
#include <filesystem>
#include <openssl/sha.h>
#include <openssl/ssl.h>

char * mdString = new char[(SHA512_DIGEST_LENGTH * 2) +1]{0};
unsigned char * mdDigest = new unsigned char[SHA512_DIGEST_LENGTH]{0};

// Returns the size in bytes of the file behind an open descriptor, or 0 if fstat fails.
const unsigned long getFileSize(const int& fd) {
    struct stat fileInfo;
    std::memset((void *)&fileInfo, 0, sizeof(fileInfo));
    if (fstat(fd, &fileInfo) < 0) {
        std::cerr << "Failed to stat file" << std::endl;
    }
    return fileInfo.st_size;
}

// Hashes `size` bytes at `content` with SHA-512 and writes the hex digest into mdString.
// Note: SHA512_Init/Update/Final are deprecated as of OpenSSL 3.0 but still available.
const bool shaFile(const char * content, const unsigned long& size) {
    bool success{false};
    try {
        std::memset(mdString, 0, (SHA512_DIGEST_LENGTH * 2) + 1);
        SHA512_CTX context;
        SHA512_Init(&context);
        SHA512_Update(&context, content, size);
        SHA512_Final(mdDigest, &context);
        for (unsigned int i = 0; i < SHA512_DIGEST_LENGTH; i++) {
            // Each byte becomes two hex characters plus a terminating NUL, so the limit is 3.
            std::snprintf(&mdString[i*2], 3, "%02x", (unsigned int) mdDigest[i]);
        }
        success = true;
    } catch (std::exception const& e) {
        std::cerr << e.what() << " while hashing file!" << std::endl;
    }
    return success;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::cerr << "The path of the directory to hash is missing from the command line!" << std::endl;
        exit(-1);
    }

    for (auto& f : std::filesystem::directory_iterator(argv[1])) {
        // Skip sub-directories and other entries that are not regular files.
        if (!f.is_regular_file()) continue;
        const int filePtr = open(f.path().c_str(), O_RDONLY);
        if (filePtr < 0) {
            std::cerr << "Unable to open " << f.path() << "!" << std::endl;
        } else {
            const unsigned long fileSize = getFileSize(filePtr);
            // Empty files are skipped because mmap rejects a zero length.
            if (fileSize) {
                char * fileBuffer = (char *) mmap(0, fileSize, PROT_READ, MAP_SHARED, filePtr, 0);
                if (fileBuffer != MAP_FAILED) {
                    if (shaFile(fileBuffer, fileSize)) {
                        std::cout << "File : " << f.path() << " size : " << fileSize << " Hash : " << mdString << std::endl;
                    }
                    munmap(fileBuffer, fileSize);
                }
            }
            // Close only descriptors that were actually opened.
            close(filePtr);
        }
    }
    if (mdString) delete[] mdString;
    if (mdDigest) delete[] mdDigest;
    return 0;
}

To compile it use:

g++ FileHash.cpp -std=c++17 -o FileHash -lstdc++fs -lcrypto

And to use it do:

./FileHash ./

You'll then get output like this:

File : "./FileHash.cpp" size : 2147 Hash : e2719c384b834cb345bc36275c28a923369cbac0f474e4508b1d592af38827b0524925b4f22f57bcff899b2db36c982799a83d9c915fd053fc0caae0d54ff0ae
File : "./FileHash" size : 220536 Hash : 1c921c84549b7783df29e375499c3688e8a288cf281003f40d1515d890f2964e4ad402500c42ac6df1567c20c0f1273c8e121cf0f6c9619da91d67bd9b48545d


VileReynard
Champion
Posts: 11,845
Thanks: 455
Fixes: 17
Registered: ‎01-09-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder

Very good!

"In The Beginning Was The Word, And The Word Was Aardvark."

Community Veteran
Posts: 5,394
Thanks: 589
Fixes: 25
Registered: ‎10-06-2010

Re: Looking for a GENUINE FREEWARE duplicate file finder

Or you could just use the sha512sum command.

Community Veteran
Posts: 6,970
Thanks: 2,250
Fixes: 43
Registered: ‎16-10-2014

Re: Looking for a GENUINE FREEWARE duplicate file finder

And if you do you get the exact same results as well.

[mook@dev FileHash]$ sha512sum FileHash.cpp 
e2719c384b834cb345bc36275c28a923369cbac0f474e4508b1d592af38827b0524925b4f22f57bcff899b2db36c982799a83d9c915fd053fc0caae0d54ff0ae  FileHash.cpp
[mook@dev FileHash]$ sha512sum FileHash
1c921c84549b7783df29e375499c3688e8a288cf281003f40d1515d890f2964e4ad402500c42ac6df1567c20c0f1273c8e121cf0f6c9619da91d67bd9b48545d  FileHash
VileReynard
Champion
Posts: 11,845
Thanks: 455
Fixes: 17
Registered: ‎01-09-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder

You need to filter your files by mime type - it is generally unusual to get duplicates of most file types.

Sometimes you get duplicates of content such as ebooks or image files - but not random files.

So you really need to check the contents of files, which makes it a hard problem.

"In The Beginning Was The Word, And The Word Was Aardvark."

Community Veteran
Posts: 6,970
Thanks: 2,250
Fixes: 43
Registered: ‎16-10-2014

Re: Looking for a GENUINE FREEWARE duplicate file finder

I know what you mean by MIME type, but that’s a web thing, and ending up with duplicates of the same file and type is quite easy if you copy them to various locations around your disk over time.

However, to do it the way you are suggesting you’d really need to open the file in binary mode and attempt to read a 'header'. You could do this easily if you could take as true what its extension implied, but not all files have extensions of course, and there's nothing stopping you from renaming an image file as an exe.

Doing it by 'header' would be horrendous. Say you had an image file called doggie with the extension removed. To truly determine its type you would have to open it and check the first 8 bytes to see if it might be a PNG; if not, the first 6 bytes to see if it were a GIF, or the first 2 for a BMP; and if none of these, it might be a JPEG, and so on. You get the idea, and that's with the task reduced to image files!

I for one wouldn’t want to be doing this.
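
For anyone curious what that 'header' check looks like, here is a minimal sketch covering only the four image types named above. The signatures are the standard ones (PNG's 8-byte signature, "GIF87a"/"GIF89a", "BM" for BMP, and FF D8 FF for JPEG). The function names sniffImageType and sniffFile are made up for the example, and a real tool would need a far longer table of signatures.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// Guesses an image type from the leading bytes of a buffer.
std::string sniffImageType(const unsigned char* data, std::size_t len) {
    static const unsigned char png[8] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    if (len >= 8 && std::memcmp(data, png, 8) == 0) return "png";
    if (len >= 6 && (std::memcmp(data, "GIF87a", 6) == 0 ||
                     std::memcmp(data, "GIF89a", 6) == 0)) return "gif";
    if (len >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF) return "jpeg";
    if (len >= 2 && data[0] == 'B' && data[1] == 'M') return "bmp";
    return "unknown";
}

// Reads up to the first 8 bytes of a file once and classifies them.
// A file that cannot be opened yields "unknown".
std::string sniffFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char head[8] = {0};
    in.read(reinterpret_cast<char*>(head), sizeof head);
    return sniffImageType(head, static_cast<std::size_t>(in.gcount()));
}
```

So sniffFile("doggie") would report the type whatever the file has been renamed to. The leading bytes only need to be read once and compared against each signature in turn; the pain is in maintaining the list of signatures, not in the reading.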

wisty
Pro
Posts: 525
Thanks: 80
Fixes: 7
Registered: ‎30-07-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder

 

I use a paid-for product, EF Duplicate Files Manager. It's a €11.90 one-time payment. It's regularly updated and works extremely well. It has lots of options and an easy-to-use interface.

Given the complexity of what you need to do, I think it only fair to reward the person who develops such a complex product.

Luzern
Seasoned Champion
Posts: 4,374
Thanks: 707
Fixes: 5
Registered: ‎31-07-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder


@VileReynard wrote:

A good duplicate file detector will analyse photos on a similarity basis so that virtually identical photos can be presented for selection by a human. :smiley:

Simple hashing will only find exact duplicates.

digiKam (Linux) can do this - although it is really a much bigger photo management tool and does much more than this.

 


I use Awesome Photo Finder to analyse. It picks up differences that are difficult to see.

No one has to agree with my opinion, but in the time I have left a miracle would be nice.
Luzern
Seasoned Champion
Posts: 4,374
Thanks: 707
Fixes: 5
Registered: ‎31-07-2007

Re: Looking for a GENUINE FREEWARE duplicate file finder

I got rid of a lot of the duplicates that had an extra string in the name compared with the originals by searching, for example, my docs folder very loosely with *(*UTC), then moving all the results to the recycle bin.

All the picture files had the long names, so I put My Pictures through one of the bulk renamers just to remove the offending string in the middle.
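
For anyone who would rather script the renaming step, here is a rough sketch of stripping that string from a file name. It assumes the stamp always looks like the example in the first post, that is a space followed by (YYYY_MM_DD HH_MM_SS UTC); the function name stripUtcStamp is made up for the example.

```cpp
#include <regex>
#include <string>

// Removes a " (2018_06_04 06_52_58 UTC)" style stamp from a file name.
// Names without such a stamp are returned unchanged.
std::string stripUtcStamp(const std::string& name) {
    static const std::regex stamp(
        R"re( \(\d{4}_\d{2}_\d{2} \d{2}_\d{2}_\d{2} UTC\))re");
    return std::regex_replace(name, stamp, "");
}
```

For example, stripUtcStamp("holiday (2018_06_04 06_52_58 UTC).jpg") gives "holiday.jpg". The actual rename could then be done with std::filesystem::rename, after checking that the shorter name is not already taken, because rename on an existing file would replace it.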

No one has to agree with my opinion, but in the time I have left a miracle would be nice.