[wellylug] Identifying Duplicate Files

Grant McLean grant at mclean.net.nz
Mon Dec 20 17:01:50 NZDT 2004


On Sat, 2004-12-18 at 17:32, Jamie Dobbs wrote:
> I want to search the entire directory (and sub directories) and find any 
> duplicate files, by content rather than by filename or filesize.

You could use md5sum, which will give a hex 'hash' of the contents of
each file, eg:

  ef9630b73a7193029f35f229fcca0f48  cavalry.gif
  4a50e5b47f84496a98e2492efd02e344  cavalry.jpg
  8ee6a4bcb0bb776d6aa05b01cc8e6a7d  compass.gif

You could then identify all files with the same hash.  Here's an
incredibly ugly Perl one-liner to do that:

find . -type f -print0 | xargs -0 md5sum | perl -ne 'chomp; ($m, $f) =
split /\s+/, $_, 2; push @{$d{$m}}, $f; END { foreach (values %d) { next
unless @$_ > 1; print join "\n", @$_, "", "" } }'

My mailer will have wrapped that but it should work anyway.
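If you'd rather skip Perl altogether, sort and uniq can do the same
grouping.  This is just a sketch, and it assumes GNU uniq (the -w and
-D options are GNU extensions): -w32 compares only the first 32
characters of each line, i.e. the md5 hash, and -D prints every member
of each duplicated group.  The scratch directory and file names below
are made up for the demo.

```shell
# Make a scratch directory with one duplicate pair and one unique file.
tmp=$(mktemp -d)
printf 'same content\n' > "$tmp/a.txt"
printf 'same content\n' > "$tmp/b.txt"   # duplicate of a.txt
printf 'different\n'    > "$tmp/c.txt"   # unique

# Hash everything, sort so identical hashes are adjacent, then keep
# only lines whose first 32 chars (the hash) are repeated.
dups=$(find "$tmp" -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -D)
echo "$dups"

rm -rf "$tmp"
```

Each duplicate group comes out as adjacent lines, so a.txt and b.txt
are listed while c.txt is not.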

Cheers
Grant



