Comparing and sorting files across backups?

Post

I'd like to take several backup dumps on Windows 7, compare them, and copy out only the unique files between them. Anyone got suggestions? Free would be great, but let me know if there's something you like that costs money too.

Devon
Simple music philosophy - Those who can, make music. Those who can't, make excuses.
Read my VST reviews at Traxmusic!

Post

As long as they aren't compressed (you'd need proper backup software like AOMEI Backupper (free!) to do incremental updates of compressed backups), you can use Create Synchronicity (free!) to back up.

It has a few options, including two-way mirroring etc., and can be scheduled as well - it's excellent.
I play guitar

Post

Chickenman wrote:As long as they aren't compressed (you'd need proper backup software like AOMEI Backupper (free!) to do incremental updates of compressed backups), you can use Create Synchronicity (free!) to back up.

It has a few options, including two-way mirroring etc., and can be scheduled as well - it's excellent.
Thanks for the suggestion, but I don't need backup software. I got that covered. :) I need to take several very large backups (200+ GB apiece), boil them all down to just the unique files between them, and get rid of the duplicates. There are folders within folders that are identical, even within the same root directories, so I need to parse these out as well. I just want one copy of every file within a set of probably 500 GB of files. I'm guessing I have only about 150 GB of unique data, and the rest is duplicated, sometimes up to 6 times! Ugh!

Devon
Simple music philosophy - Those who can, make music. Those who can't, make excuses.
Read my VST reviews at Traxmusic!


Post

Free: kdiff3. Also try BeyondCompare from ScooterSoftware.

But what if you just restore all the backups to the same target location? When done in chronological order, updated files will end up in their final state and deleted files will remain there. How far off is that from what you want?
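
For example, a minimal sketch of that approach (assuming each backup was dumped into its own dated folder under D:\Backups; both paths are hypothetical):

Code:

# Sketch: apply dated backup folders onto one merged target, oldest first,
# so that later backups overwrite earlier versions of the same file.
$target = "D:\Merged"
Get-ChildItem -Path "D:\Backups" | Where-Object { $_.PSIsContainer } |
    Sort-Object Name | ForEach-Object {
        robocopy $_.FullName $target /s /r:1 /w:1 | Out-Null
    }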

Otherwise, a PowerShell script could be written that lists out the changes for each backup iteration, based on the MD5 hash of each file.
We are the KVR collective. Resistance is futile. You will be assimilated.
My MusicCalc is served over https!!

Post

BertKoor wrote:Free: kdiff3. Also try BeyondCompare from ScooterSoftware.

But what if you just restore all the backups to the same target location? When done in chronological order, updated files will end up in their final state and deleted files will remain there. How far off is that from what you want?

Otherwise, a PowerShell script could be written that lists out the changes for each backup iteration, based on the MD5 hash of each file.
The problem is that folders got moved around, so yes, all the files are there (which is the current problem); I just need to weed out all the duplicate files. From what I can see with Everything (neat search program, BTW), I have up to 7-8 copies of the same file. This is probably 10+ years' worth of backups, and things have moved over and over again. I've dug myself a hole, and I'd like to stop digging. :)

Devon
Simple music philosophy - Those who can, make music. Those who can't, make excuses.
Read my VST reviews at Traxmusic!

Post

OK, here's a PowerShell script I hacked together in an hour:

Code:

Param (
    [String]$BackupRootFolderName = "C:\Temp",
    [String]$RestoreRootFolderName = "C:\Temp\Unique" # leave empty "" if you want to remove duplicates from the backup root folder
)

$Global:folderUnq = 0
$Global:folderDup = 0

# Compute the MD5 hash of a file, returned as a lowercase hex string
Function calcMD5 ( [string]$file ) {
    $algo = [System.Security.Cryptography.HashAlgorithm]::Create("MD5")
    $stream = New-Object System.IO.FileStream($file, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read)

    $md5StringBuilder = New-Object System.Text.StringBuilder
    $algo.ComputeHash($stream) | % { 
        [void] $md5StringBuilder.Append($_.ToString("x2")) 
    }

    $stream.Dispose()
    $algo.Dispose()
    return $md5StringBuilder.ToString()
}

# Print the per-folder counters, then reset them for the next folder
Function LogFolderStats ($unq, $dup, $name) {
    Write-Host ("{0,8}  {1,9}  {2}" -f $unq, $dup, $name)
    $Global:folderUnq = 0
    $Global:folderDup = 0
}

$md5List = @()   # MD5 of every unique file seen so far
$totalDupl = 0
$folder = ""
# Create the restore root up front if one was given and it doesn't exist yet
if ($RestoreRootFolderName -and -not (Test-Path $RestoreRootFolderName)) {
    $null = New-Item -Path $RestoreRootFolderName -ItemType Directory
}
ForEach ($file in Get-ChildItem -Path $BackupRootFolderName -File -Recurse) {
    # On entering a new folder, log the previous folder's stats
    # (or print the table header before the very first file)
    if ($file.DirectoryName -ne $folder) {
        if ($folder) {
            LogFolderStats $folderUnq $folderDup $folder
        } else {
            LogFolderStats "Unique" "Duplicate" "in folder"
            LogFolderStats "------" "---------" "--------------------"
        }
        $folder = $file.DirectoryName
    }
    $md5Hash = calcMD5 $file.FullName
    if ($md5List.Contains($md5Hash)) {
        # Content seen before: count the duplicate; delete it when running in-place
        $totalDupl++
        $Global:folderDup++
        if (-not $RestoreRootFolderName) {
            $file.Delete()
        }
    } else {
        # New content: remember its hash; copy it to the restore root if one was given
        $md5List += $md5Hash
        $Global:folderUnq++
        if ($RestoreRootFolderName) {
            $newName = $file.FullName.Replace($BackupRootFolderName, $RestoreRootFolderName)
            [IO.DirectoryInfo]$newName | % { $_.Parent.Create() }   # create any missing parent folders
            [void]$file.CopyTo($newName, $true)
        }
    }
}
# Log the last folder, then the grand totals
LogFolderStats $folderUnq $folderDup $folder
LogFolderStats "------" "---------" "--------------------"
LogFolderStats $md5List.Count $totalDupl "Grand Total"
Pause
Save this somewhere as Duplicate.ps1, but first change the two parameters in the first lines. Execute it from File Explorer with a right-click and choose "Run with PowerShell".
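
If "Run with PowerShell" doesn't appear in the menu, or script execution is disabled on your machine, you can also start it from a command prompt (the script path here is just an example):

Code:

powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\Duplicate.ps1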

The first parameter points to the folder where the backed-up files are. It could contain one subfolder per backup, or you could have restored them on top of each other so that for updated files you only retain the last version.
The second parameter points to a folder where copies of unique files will be placed. The relative directory structure will be preserved.
If you are really brave, you can make this an empty string "" instead of a location. The script will then remove all the duplicates instead of copying the unique files. You had it all backed up, so what's the risk? ;-)

You can also let it do a "dry run" with the parameter $RestoreRootFolderName = "" and the $file.Delete() line commented out (put a # symbol at the start of it).

During execution it will produce some logging:

Code:

  Unique  Duplicate  in folder
  ------  ---------  --------------------
      52          1  C:\Temp
      47          0  C:\Temp\B21090
      20          1  C:\Temp\Dump\20160703
      12          2  C:\Temp\Dump\20160706
      13          1  C:\Temp\Dump\20160707
       6          0  C:\Temp\LOG
     222          1  C:\Temp\phone\Facebook
      79          7  C:\Temp\phone\Messenger
      27          0  C:\Temp\phone\pictures
       4          0  C:\Temp\prod
       1          0  C:\Temp\PTO
      14          0  C:\Temp\PTO\CNTR
       2          0  C:\Temp\PTO\JOB
      26          0  C:\Temp\PTO\PROC
       3          0  C:\Temp\PTO\Script
      55          0  C:\Temp\Tabl179
       3          1  C:\Temp\test
       0         52  C:\Temp\Unique
       0         47  C:\Temp\Unique\B21090
       0         20  C:\Temp\Unique\Dump\20160703
       0         12  C:\Temp\Unique\Dump\20160706
       0         13  C:\Temp\Unique\Dump\20160707
       0          6  C:\Temp\Unique\LOG
       0        222  C:\Temp\Unique\phone\Facebook
       0         79  C:\Temp\Unique\phone\Messenger
       0         27  C:\Temp\Unique\phone\pictures
       0          4  C:\Temp\Unique\prod
       0          1  C:\Temp\Unique\PTO
       0         14  C:\Temp\Unique\PTO\CNTR
       0          2  C:\Temp\Unique\PTO\JOB
       0         26  C:\Temp\Unique\PTO\PROC
       0          3  C:\Temp\Unique\PTO\Script
       0         55  C:\Temp\Unique\Tabl179
       0          3  C:\Temp\Unique\test
  ------  ---------  --------------------
     586        600  Grand Total
Press Enter to continue...:
We are the KVR collective. Resistance is futile. You will be assimilated.
My MusicCalc is served over https!!

Post

Wow, awesome script man! Going to save this one away, for sure! :)

However, someone suggested I use CCleaner. Yes, it does find and show duplicates, but looking through some of the results I also discovered that there are some folders in there where duplicates NEED to remain the way they are. Ugh! So this is going to be even more complicated than I first imagined.

I think my biggest problem is that I know there are files missing from the most current dataset that ARE in the old backups, and I don't want those deleted. :( My plan is to look at the current set of data, then compare it to folders from older complete backups. If all the files are there, delete the old set and go on to the next one. If not, extract those files out. The complicated bit is figuring out where to put the missing files back into a structure that's workable as well.

This is going to be a journey, and not completed in 15 minutes. :(

As always, BertKoor, thank you. I will look at the other programs as well that you suggested.

Devon
Simple music philosophy - Those who can, make music. Those who can't, make excuses.
Read my VST reviews at Traxmusic!

Post

DevonB wrote:there are some folders in there where duplicates NEED to remain the way they are. Ugh! So this is going to be even more complicated than I first imagined.
Oh boy, you got yourself into a fine mess :)
We are the KVR collective. Resistance is futile. You will be assimilated.
My MusicCalc is served over https!!

Post

BertKoor wrote:
DevonB wrote:there are some folders in there where duplicates NEED to remain the way they are. Ugh! So this is going to be even more complicated than I first imagined.
Oh boy, you got yourself into a fine mess :)
You're telling me. Well, if one of these tools can at least show me whether directory "A" has all the files that directory "B" has, I can divide and conquer. This came from running incrementals forever: if something got moved, it just got copied again, never deleted. On the other hand, that also assured me that if something was deleted, I still had a copy. Pros and cons.
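
(For reference, a rough PowerShell sketch like the one below could answer that question by content hash. The paths are hypothetical, and Get-FileHash needs PowerShell 4.0 or later; on older versions the calcMD5 function from the script above would do the same job.)

Code:

# Sketch: list files under tree A whose content (by MD5) exists nowhere under tree B.
# An empty result means B already contains every file that A has.
$a = "D:\TreeA"
$b = "D:\TreeB"
$bHashes = Get-ChildItem -Path $b -File -Recurse | Get-FileHash -Algorithm MD5 |
    Select-Object -ExpandProperty Hash
Get-ChildItem -Path $a -File -Recurse | Get-FileHash -Algorithm MD5 |
    Where-Object { $bHashes -notcontains $_.Hash } |
    Select-Object -ExpandProperty Path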

Devon
Simple music philosophy - Those who can, make music. Those who can't, make excuses.
Read my VST reviews at Traxmusic!

Post

Another possibility: You can use FreeFileSync and compare left (original folder) to right (copied or second folder). Then you can uncheck the files you do NOT want to be synced/copied. This may seem a bit tedious, but may turn out easier than using a script.

Post

Skorpius wrote:Another possibility: You can use FreeFileSync and compare left (original folder) to right (copied or second folder). Then you can uncheck the files you do NOT want to be synced/copied. This may seem a bit tedious, but may turn out easier than using a script.
I'll take a look. I tried kdiff3 and WinMerge; neither did what I hoped it would do. I just need to know if the files in directory A's tree structure also exist in directory B's structure. Fingers crossed!

Devon
Simple music philosophy - Those who can, make music. Those who can't, make excuses.
Read my VST reviews at Traxmusic!

Post

DevonB wrote:I just need to know if the files in directory A's tree structure also exist in directory B's structure.
This shouldn't be a big problem with FreeFileSync. However, be careful to choose the right settings (the blue and green gearwheel icons) before you copy (or delete) anything.

Post

Windows 7 has a powerful tool called robocopy built in, which is like a smarter xcopy. By default it only copies new or updated files from the source to the target, but it has many options (for example, you can exclude certain file types or directories from the operation). Run it with the help switch from a command line to see them all:

robocopy /?

I use it like this all the time: If I'm transferring a programming project (full of all its source and project files, etc) from one computer to another via a USB drive, I can do this on the "source" computer:

robocopy /s /r:1 /w:1 /fft C:\projects\myproject E:\myproject

The first time I ever run that command, it copies the entire directory from C:\projects\myproject to E:\myproject. The next time I run the same command it will only copy new or updated files from the source.

Breakdown of the above command:
/s - copy subdirectories (empty ones are skipped)
/r:1 - retry a maximum of one time on a copy error before moving on to the next file
/w:1 - wait 1 second between retry attempts
/fft - assume FAT file times (2-second timestamp granularity), which compensates for the timestamp-precision differences between NTFS and FAT32 when copying from a local drive to USB storage
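
Robocopy can also help with the "does tree B contain everything in tree A" question from earlier in the thread: the /l switch makes it list what it would copy without actually changing anything, so any file present in the first tree but missing (or different) in the second shows up in the output. For example (paths hypothetical):

robocopy D:\TreeA D:\TreeB /s /l

Note that this compares by name, size, and timestamp, not by file content.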

For locating duplicates on a drive or in a folder, my favorite tool is dupeGuru. It also gives you several options for how to handle the duplicate files it finds.
