[wellylug] Identifying Duplicate Files

Mon Dec 20 16:54:32 NZDT 2004

--- Jamie Dobbs <jamie.dobbs at orcon.net.nz> wrote:

> I have a directory full of sound bytes and clip art (approx. 45,000 
> files) that I have collected over many years.
> I want to search the entire directory (and sub directories) and find any 
> duplicate files, by content rather than by filename or filesize. Can 
> anyone tell me of any command line programs for Linux that might allow 
> me to do this, then give me the option of deleting (or moving) the 
> duplicate files.

All that springs to mind is wrapping a script around diff (or cmp, which does a
plain byte-by-byte comparison). I don't know of any Linux applications for this,
though I believe there may be a Windows application out there that does it.
Maybe share the filesystem using Samba and use Windows (ugghh!!)

I'd start by generating a list of files with their sizes (maybe ls -lR, since a
plain ls -l * won't recurse through all the subdirectories).

Iterate through this list, get the file size from each line, and grep the list
for other files of the same size. If there is more than one match, diff them.
Checking for matching file sizes first will be much faster than diff'ing every
file against every other, as you only need to compare files of the same size:
if the sizes differ, the contents must differ too.
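
For what it's worth, here is a rough sketch of the same idea in Python rather
than a shell script around diff. Treat it as an illustration only: the function
name find_duplicates is made up for this example, and it uses the standard
library's filecmp.cmp for the byte-by-byte comparison in place of diff. It only
reports duplicates; it does not delete or move anything.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    # Step 1: bucket every regular file by its size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: compare contents only among files that share a size.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        groups = []  # each group holds files with identical contents
        for path in sorted(paths):
            for group in groups:
                # shallow=False forces a real comparison of the file contents.
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicates.extend(g for g in groups if len(g) > 1)
    return duplicates
```

Calling find_duplicates("/path/to/clips") (the path is just a placeholder)
should give back one list per set of identical files, so you can keep the first
entry of each list and decide what to do with the rest.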

Spotcha,

  Brent