Mar 272008
 

It’s been a while since I’ve posted a script; life has been distracting lately. I also wanted to let this current script mature a lot more before sharing it, as it has the potential to be destructive. Use wisely!

It’s name is linkdups, and it’s a Python program to recursively walk through a directory tree and hard-links any files together whose contents match exactly. That means that if you have two files, each taking up 10 Kb, afterwards they will be linked to the same contents for a total savings of 10 Kb.

This type of space savings only works for media that never changes. It’s great for use inside of archival disk images, website mirrors, backup directories, etc.

It seems that most programmers, at some point or another, writes a script to do exactly this kind of thing. My brother was telling me the other day about one he’d written.

The main purpose of this script is to be extremely efficient. I tested it against a directory hierarchy containing over 40 million entries, many of which required hard-linking. My requirements were that memory consumption never grew beyond a small, startup footprint, and that it not waste cycles computing unnecessary details (like checksumming everything and then looking for matches).

Here’s how the algorithm works:

  1. First, I divide the problem between large files (easy gains), and small files. The default cutoff is 16K. This division means you get the biggest savings up front. It also cuts down on memory usage for internal tables.

  2. Next, a table is created of how big each file in the target set (large or small) is. Groups are then made for like-size files; singletons are dropped.

  3. Within the like-size groups, MD5 is used for a quick content check. Files in a size group with matching checksums are considered candidates for linking.

  4. The candidates from step 4 are byte-wise compared to ensure an exact match. If any two match, the second one is removed and the first is hard-linked to the second’s name. A tally is then added of how many bytes were saved. This is repeated for all other files in the group. (Note: byte-wise comparisons are made only against the first member of a group, on the assumption that MD5 most likely guarantees a match).

  5. At the end of the run — or after Control-C has been pressed — the total byte savings is displayed in human readable form. Use the --verbose option to watch the algorithm do its work.

The script is also designed to be abortable. You can hit Control-C any time if you get bored, and your files will be left in a good state. It does as much linking as it can, but once Control-C is pressed, it stops wherever it is, wraps up the last task, and ends safely.

It can be downloaded from my FTP server: linkdups.py.

 Posted by at 11:40 pm

  4 Responses to “Script of the week: linkdups”

  1. I believe this shell script would do the trick on GNU systems:

    find . -type f -exec md5sum {} ; | sort -k1 | awk ‘BEGIN { shell = “/bin/bash”; } {if ($1 in md5) { printf “cmp -s %s %s && ln -f %s %sn”, md5[$1], $2, md5[$1], $2 | shell; } else { md5[$1] = $2; } } BEGIN { close(shell); }’

  2. It would work, but you would be computing the MD5 checksum of every file on your filesystem. My script only calculates checksums if there is hope of a match.

  3. ahh, that’s true. i don’t have a quick shell script, but it could be done.

    find . -type f -size +1 -exec du {} ; | sort -k1n | awk ‘{if ($1 in bs) { candidate[bs[$1]] = 1; candidate[$2] = 1; } else { bs[$1] = $2; } } END { shell = “/bin/bash”; for (file in candidate) {printf “md5sum “%s”n”, file | shell; } close(shell); }’ | sort -k1 | awk ‘BEGIN { shell = “/bin/bash”; } {if ($1 in md5) { printf “cmp -s %s %s && echo ln -f %s %sn”, md5[$1], $2, md5[$1], $2 | shell; } else { md5[$1] = $2; } } BEGIN { close(shell); }’

    Of course, this compares the md5 hashes off all the N candidate files, which would be expensive with large file systems (large numbers of files).

  4. This is a good program. I think it will be much more robust if it were able to have the capability to process where it left-off at. Sending a hard interrupt seems a bit extreme. I didn’t look at your code but from what you describe, it shouldn’t be that bad to return to some reference within the table to continue on. It also might be beneficial to have a ‘suspect’ created before you begin processing, that way you can return to a suspect position, following the suspect position, perhaps those get dumped to a ‘processed’ type table. Just thinking out loud. I like what you’ve put forth. Thanks!!

    Chinh

 Leave a Reply

(required)

(required)

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>