[Techtalk] interesting photo management problem-- weeding out duplicates

Thu May 1 01:25:18 UTC 2014

GNU Image Finding Tool is something I've been wanting to look properly 
into, but haven't yet. It may be useful for you here if you have 
enhanced or slightly altered versions of pictures and want to locate 
them too.
http://www.gnu.org/software/gift/

I think it lets you sort image files by relevance to a sample image. 
Like most people in this era of digital cameras, I have gazillions of 
photos on my computer. I've been very careful about keeping them 
organised, but on rare occasions my filing layout doesn't help, for 
instance when I need to find the original of a picture that a friend 
later enhanced, cropped, and scaled down. It's a major drag to sift by 
hand through tens of thousands of photos. This tool might solve that.

I don't know if it's useful for what you want, but you never know... :)

Cheers,

	- Miriam

Carla Schroder wrote:
>
> Hola, techtalkers. I have an interesting problem to enliven your
> Sunday. My photo archives are a mess, because instead of setting up a
> sensible organization from the start I instantly devolved into a
> mess of random duplicate backup directories, with as many as five
> duplicates of the same photo scattered all over the place. It's 9000+
> images, so I'm not real eager to hunt down and delete the duplicates
> manually. So I came up with this:
>
> First make a list of the duplicates:
>
> find Pictures/ -type f -exec md5sum '{}' ';' | sort |
>   uniq --all-repeated=separate -w 15 > dupes.txt
>
> This makes a nice text file with a blank line between each group of
> duplicates:
>
>
> 5374b0c445690e735e5e10ba248f5ed0  Pictures/Pictures-realhome/insurance/122005/14194820.JPG
> 5374b0c445690e735e5e10ba248f5ed0  Pictures/Pictures-realhome/jdfd-xmas-2005/P1000048.JPG
>
> 5374e5d9486b223c508f81175cdf551a  Pictures/pictures/canon-30d/IMG_0084.JPG
> 5374e5d9486b223c508f81175cdf551a  Pictures/Pictures-realhome/random-pictures/canon-30d/IMG_0084.JPG
>
> 537b0f0dd3ea8e35465c6cb86d2faa67  Pictures/pictures/105_PANA/P1050778.JPG
> 537b0f0dd3ea8e35465c6cb86d2faa67  Pictures/Pictures-realhome/random-pictures/105_PANA/P1050778.JPG
>
> I like using md5sums to find the duplicates because it finds dupes with
> different filenames.
>
> Awk counts all the filenames without counting the blank lines:
>
> awk 'NF != 0 {++count} END {print count}' dupes.txt
> 9855
>
> Then I run a couple more awk incantations to count how many unique
> images there are, and I get 4301. Whee fun, eh?
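
One of those awk incantations could look like the sketch below. It
assumes dupes.txt keeps the layout produced by
uniq --all-repeated=separate, with each group of identical files
separated by a blank line, and it counts the groups rather than the
lines. The function name count_dupe_groups is made up for illustration,
and the result is not guaranteed to match the 4301 quoted above.

```shell
# Count the groups of identical files in a dupes.txt-style listing.
# Assumption: groups are separated by a single blank line, as written by
# `uniq --all-repeated=separate`.
count_dupe_groups() {
    # RS = "" switches awk to paragraph mode, so every blank-line-separated
    # block is one record and NR ends up as the number of groups.
    awk 'BEGIN { RS = "" } END { print NR }' "$1"
}
```

Calling count_dupe_groups dupes.txt prints a single number: how many
distinct images have at least one extra copy.
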
>
> So I can generate a list of unique files with awk, and then
> use cp, rsync or mv to copy the list to a new directory. ExifTool is
> really slick for manipulating big batches of image files, but I'm still
> figuring out how to use it.
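
A rough sketch of the "copy the unique files somewhere new" idea, for
what it's worth. It checksums everything under a source directory, keeps
the first path seen for each checksum, and copies that one file into a
destination directory. The function name copy_unique and both directory
arguments are hypothetical. It assumes GNU coreutils (md5sum and
cp --parents) and file names without newlines or backslashes, and the
destination should live outside the source tree.

```shell
# Copy one representative of every distinct file under SRC into DEST.
# Relative paths are preserved (cp --parents) so that files sharing a
# basename, such as two different IMG_0084.JPG, do not overwrite each other.
# Assumptions: GNU md5sum and GNU cp; no newlines or backslashes in names.
copy_unique() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    find "$src" -type f -exec md5sum '{}' + |
        sort |
        # $1 is the checksum; the path starts at column 35
        # (32 hex digits plus a two-character separator).
        awk '!seen[$1]++ { print substr($0, 35) }' |
        while IFS= read -r path; do
            cp --parents -- "$path" "$dest"
        done
}
```

Something like copy_unique Pictures/ Pictures-unique would then leave
the originals untouched, so the result can be checked before anything is
removed.
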
>
> Another option is to delete the duplicates and leave the rest in place,
> but I haven't figured out how to do that.
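
Deleting in place could be sketched along these lines, assuming
dupes.txt holds ordinary md5sum lines (a 32-character checksum, two
separator characters, then the path). The first file listed for each
checksum is kept and the rest are printed as candidates for removal. The
function names list_extra_copies and delete_listed are invented for this
example, and file names containing newlines or backslashes are not
handled.

```shell
# Print every path in a dupes.txt-style listing except the first one for
# each checksum, i.e. the copies that could be deleted.
# Assumption: lines look like "<32-hex md5>  <path>" as written by md5sum.
list_extra_copies() {
    # NF skips the blank separator lines; seen[$1]++ is zero (false) the
    # first time a checksum appears and non-zero for every later copy.
    awk 'NF && seen[$1]++ { print substr($0, 35) }' "$1"
}

# Delete the files named one per line in the given list file.
delete_listed() {
    while IFS= read -r path; do
        rm -- "$path"
    done < "$1"
}
```

Writing the candidates to a file first, for example
list_extra_copies dupes.txt > to-delete.txt, gives a chance to read
through the list before running delete_listed to-delete.txt, since the
deletion cannot be undone.
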
>
> Thoughts? Brainstorms? I also looked at the Organize command included
> in Exiv2, but I couldn't get it to build on my system.
>
> Carla
>

-- 
If you don't have any failures then you're not trying hard enough.
  - Dr. Charles Elachi, director of NASA's Jet Propulsion Laboratory
-----
Website: http://miriam-english.org
Blogs:   http://miriam-e.dreamwidth.org
          http://miriam-e.livejournal.com