Does changing metadata prevent duplicate file detection?

(2 posts)
  • ender
    Member

    I just wanted to bounce some ideas/observations off of the community and see what others think (or what solutions they have found) with regard to these issues.

    Basically, the "problem" I'm trying to work around is being able to take advantage of the powerful tagging/rating features of acdsee without sacrificing the ability to
    1) reorganize my file structure outside of acdsee without losing all my tagging/category data
    2) identify identical files as identical regardless of how I rate or tag them in acdsee.

    I realize ACDSee 2.5 allows you to embed rating/category info into an image as metadata; however, as I understand it, changing embedded metadata changes the image file's hash (which would lead hash-based duplicate finders to identify otherwise identical images as "different"). If anyone knows differently please advise.
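
    To illustrate the concern: a whole-file hash covers every byte, so embedded tags alter it even when the picture data is untouched. A minimal Python sketch, assuming made-up byte strings as stand-ins for an image file (`pixels`, the `HEADER` marker, and the tag text are purely illustrative and are not how acdsee or any real image format stores metadata):

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Return the SHA-1 hex digest of a byte string (stand-in for a whole file)."""
    return hashlib.sha1(data).hexdigest()

# Hypothetical stand-ins: a real image file has a proper container format.
pixels = b"\x10\x20\x30\x40" * 1000                        # pretend picture data
untagged = b"HEADER" + pixels                              # file without tags
tagged = b"HEADER" + b"rating=5;category=stock" + pixels   # same pixels, tags embedded

# The picture data is byte-for-byte identical in both "files"...
same_pixels = untagged[-len(pixels):] == tagged[-len(pixels):]
# ...but the whole-file hashes differ because the tag bytes are hashed too.
hashes_match = file_digest(untagged) == file_digest(tagged)

print(same_pixels, hashes_match)
```

    A duplicate finder that compares decoded picture data rather than raw file bytes would not be affected by this, so it is worth checking which method a given tool uses.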

    The reason I'm concerned with this in the first place (and don't just suck it up and always use acdsee to organize my files) is because I and several people I work with co-maintain a large collection of stock images (100,000+) for web and graphic design projects. We rely on duplicate image finders to keep the collection from becoming bloated with duplicates (especially when image sets/collections are merged or integrated into the base repository), and while I can make it a point to always use acdsee, I'm very uncomfortable with the notion that a lot of hard work tagging and rating images can be completely snuffed out if someone simply moves files around.

    If anyone has found any workarounds for this conundrum (or has even found a third party application that works well in combination with acdsee for tagging/rating purposes) I'd love to hear from you :)

    Posted On November 20, 2008 - 05:40 AM (1 year ago)
  • ender
    Member

    I realize I'm responding to my own post, but I wanted to post this separately from my actual question.

    As far as solving the problem of having to rely on database entries that are very easily orphaned vs. being forced to embed tags into the image itself (changing the contents of the file and affecting the ability of "duplicate image finders" to function), I always wondered if there was a reason why an additional column couldn't be added to the acdsee database that stores an image's SHA1 or MD5 hash. Additional intrinsic attributes, for instance file size, could be appended if necessary to make hash collisions even less likely. This is essentially how p2p file sharing networks (like eMule and Limewire) identify files. The magnet URI scheme has demonstrated the effectiveness of this approach in tracking files in distributed networks, so why not apply the same technique to manage an image collection?
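
    A rough sketch of what such a fingerprint could look like, using only the Python standard library. The `fingerprint` function and the `<hash>:<size>` key format are assumptions made up for illustration; nothing here reflects how acdsee actually stores data:

```python
import hashlib
import os
import tempfile

def fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return '<sha1 hex>:<size in bytes>' for the file at `path`.

    The file is read in chunks so large images need not fit in memory.
    """
    sha1 = hashlib.sha1()
    size = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            sha1.update(chunk)
            size += len(chunk)
    return f"{sha1.hexdigest()}:{size}"

# Demo: the fingerprint depends on the bytes, not on the name or location.
payload = b"not a real jpeg, just demo bytes"
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "photo.jpg")
    with open(original, "wb") as fh:
        fh.write(payload)
    before = fingerprint(original)

    moved = os.path.join(tmp, "renamed.jpg")
    os.rename(original, moved)
    after = fingerprint(moved)

print(before == after)
```

    Appending the size costs almost nothing, since the bytes are being read anyway.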

    Given that rating/categorization metadata is intended to describe a specific *image* (irrespective of its specific location) it makes much more sense from a design standpoint to associate said metadata with an image's "digital fingerprint" (i.e. hash value). This is because no matter where an image is moved, it will generate the same hash value. Despite the minor computational overhead of calculating the hash, keying organizational metadata to an image's hash would make the acdsee database significantly more versatile and powerful. Some advantages that come to mind in no particular order:
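
    As a sketch of the idea, here is a hypothetical table keyed to the content hash instead of the file path, using Python's built-in `sqlite3`. The table and column names (`image_meta`, `content_hash`, `rating`, `categories`) are invented for this example and are not acdsee's real schema:

```python
import hashlib
import sqlite3

def sha1_hex(data: bytes) -> str:
    """SHA-1 hex digest of a byte string (stand-in for hashing a file)."""
    return hashlib.sha1(data).hexdigest()

db = sqlite3.connect(":memory:")
# Hypothetical schema: the hash is the primary key, no path column needed.
db.execute(
    """CREATE TABLE image_meta (
           content_hash TEXT PRIMARY KEY,
           rating       INTEGER,
           categories   TEXT
       )"""
)

# Rate and tag an image once.
image_bytes = b"demo image contents"
db.execute(
    "INSERT INTO image_meta VALUES (?, ?, ?)",
    (sha1_hex(image_bytes), 4, "stock,web"),
)

# Later the same bytes turn up under some other folder or file name.
copy_elsewhere = bytes(image_bytes)
row = db.execute(
    "SELECT rating, categories FROM image_meta WHERE content_hash = ?",
    (sha1_hex(copy_elsewhere),),
).fetchone()

print(row)
```

    The lookup succeeds without knowing where the file used to live, which is the whole point of keying on the hash.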

    1) Any image you have previously tagged/rated will be recognized and can have the appropriate rating/categorization info automatically associated with it (this avoids wasting time tagging the same images over and over, and ensures that your categorization ontology is synchronized across your entire image collection... after all, if two images are identical it stands to reason that they should share the same tags)

    2) The categorization/rating information in the database would be far more portable. Databases could easily be shared, merged, or backed up through a simple import/export interface. It's very unlikely that two different people will use an identical file structure; however, identical image files will always produce identical hashes. Leveraging this fact, teams of workers can collaborate and exchange categorization ontologies extremely easily (this allows for a much more efficient workflow in a group setting). Presently, the only way to share tagging/rating info with someone is to embed the info into the actual image and then transmit a copy of that image to said person (even if they already have a copy of the original). This becomes completely impractical if large amounts of data need to be exchanged. By using file hashes, all that is required to exchange category/rating info is the data contained within a given database. Because the data maintained in the database is text only, the data for several hundred gigabytes of image files can be condensed to several hundred megabytes... if that.

    3) Lastly, by keying database entries to an "intrinsic" property of an image (rather than something as tenuous as the file's location), it is trivial to make the database self-healing. If for the sake of efficiency, a column that keeps track of a file's specific location is necessary, upon "optimizing" the database, the database can automatically attempt to rebind orphaned entries, or simply remove the invalid "location" entry (while leaving the remainder of the data intact and ready to be reassociated to the appropriate file).
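
    Points 2 and 3 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the "database" is a plain dict keyed by SHA-1 hash, and the `merge` and `rebind` helpers, the entry fields, and the keep-local-entry merge policy are all hypothetical choices, not anything acdsee provides:

```python
import hashlib
import os
import tempfile

def sha1_of_file(path):
    """Hash a whole file (read in one go, which is fine for a small demo)."""
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()

def merge(mine, theirs):
    """Import entries for hashes we lack; their paths mean nothing here, so drop them."""
    for digest, entry in theirs.items():
        if digest not in mine:  # assumed policy: local entries win on conflict
            mine[digest] = {
                "rating": entry["rating"],
                "tags": list(entry["tags"]),
                "location": None,
            }

def rebind(db, root):
    """Rescan `root` and point every entry at wherever its bytes live now."""
    found = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            found[sha1_of_file(path)] = path
    for digest, entry in db.items():
        # Entries whose file is missing keep their tags; only the location is cleared.
        entry["location"] = found.get(digest)

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "moved"))
    path = os.path.join(root, "moved", "sunset.jpg")
    with open(path, "wb") as fh:
        fh.write(b"demo bytes for sunset")
    digest = sha1_of_file(path)

    mine = {}
    theirs = {digest: {"rating": 5, "tags": ["sky"],
                       "location": "D:/their/layout/sunset.jpg"}}
    merge(mine, theirs)
    rebind(mine, root)

    rebound_ok = mine[digest]["location"] == path
    rating = mine[digest]["rating"]

print(rebound_ok, rating)
```

    A colleague's ratings arrive without any image data being transferred, and the location is filled in from a local scan rather than trusted from the other machine.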

    Anyway, I realize this discussion is of a more technical nature and has more to do with the architecture of acdsee than with photography itself. That being said, I really think acdsee is a great application and am only posting this in hopes that someone who has some say-so in the actual program development might see this and say "hey, that's not a bad idea" (or alternatively, make a post and say, that's not a bad idea... except for this, that, and the other).

    Well, if anyone else has any programming or database know-how and wants to pitch in their two cents as far as the pros and cons of using a hash-based database structure, I look forward to hearing from you.

    Posted On November 20, 2008 - 07:49 AM (1 year ago)
