Removing partial duplicate file names with awk

I recently needed to clean up a bunch of files whose names contained both a common and a unique part, something like this:

Show_Episode1-ID_12345.mp4
Show_Episode1-ID_67890.mp4

Note that there are two copies of ‘Episode1’ with a different ID part. Obviously I would only like to keep one copy of each episode and ignore the whole -ID… part. This is how I solved it:

for i in `ls -t *mp4|awk 'BEGIN{FS="-"}{if (++dup[$1] >= 2) print}'`; do mv -v "$i" dup/; done

So what happened here?

  • The directory listing is sorted by timestamp (newest first) so it favors the most recent versions.
  • The awk FS (field separator) is set to “-” to use the common part of the file name as the first field.
  • Now awk loops over each file name. It uses the common part of the file name (“Show_Episode1”) as an index into an array of counters. Each counter starts at 0 and is incremented before the comparison, so the first file of a group sets it to 1 and any repeat raises it to 2 or more.
  • If the counter value is >= 2, awk prints the complete file name (using the ‘print’ command). Note that this only prints duplicates; the first file of each group is never printed.
  • The output of the above steps is fed into a ‘for’ loop that passes each name to the ‘mv’ command, which moves only the duplicate files to a separate ‘dup’ dir.
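
The steps above can also be split into two small shell functions, which makes it easy to preview what would be moved before touching any files. This is only a sketch: the names list_dups, move_dups and seen are made up for illustration, and it assumes that no file name contains a newline. Unlike a ‘for’ loop over command output, the ‘while read’ loop keeps names with spaces in one piece.

```shell
# Print every file name whose part before the first "-" was already seen.
# Reads one name per line on stdin; the first name of each group is skipped.
list_dups() {
    awk -F '-' '++seen[$1] >= 2'
}

# Move the older copies of each episode in the current directory into ./dup.
# Assumption: file names contain no newline characters.
move_dups() {
    mkdir -p dup || return 1
    # ls -t lists newest first, so the newest copy of each episode stays put.
    ls -t -- *.mp4 | list_dups | while IFS= read -r name; do
        mv -v -- "$name" dup/
    done
}
```

Running `ls -t -- *.mp4 | list_dups` on its own acts as a dry run: it only prints the names that move_dups would move.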