My Weekend Plans - Netflix, Coffee, and Convincing Myself This Bash Script Deserves a Nobel Prize (It Doesn't, But It's Good)

Authors

Me having no social life

Maybe it's not a script?

Let's be honest. I don't have a script file that you can use. Instead, I have a bunch of commands you can use to achieve the same result. There is a reason why it's the case. Due to the nature of the task, we will have to play around with the commands and make minor changes to sort the images. It might be possible to create the perfect script that automatically deals with most edge cases. I am just too lazy to make one. Also, how else can I brag about how I managed to pull it off without making you run every one of those commands?

Still with me? Cool. Let's jump right into understanding how it works.

What to expect from this article

I have a bunch of commands that will help you sort your images into directories based on the year you captured the photo. Then, we can merge those sorted images with other previously sorted images in a different directory. We don't need much to accomplish this goal. There are a few commands and their variations. I'll explain how each command works in its respective section.

Sorting all images may not be possible. There are a couple of requirements each image must meet, and these separate the images that can be sorted into the directories from those that cannot. These steps will help you sort most of the images, especially recent ones captured by smartphones. Out of the 20k+ photos I sorted, ~950 couldn't be sorted. That's < 5% of the total set.

You can follow the same steps on Windows using Bash on Windows. It comes with Git for Windows. I must warn you that it is significantly slower than Linux for reasons I didn't care to explore. Also, you will have to use /bin/find instead of find since Windows has a utility with the same name.

If you ever need more information on the commands in this article, use explainshell. For regex use one of the following:

Background

If you were like me, you and your friends got access to your first phone right around college time. There was an explosion in the number of photos and selfies taken and shared between friends. As you might have already guessed, managing your photo catalogue can become painful fast. Last I checked, I had 20k+ photos spread across directories, devices and services. In my case, a large portion of them were duplicates.

There is no easy way to sort these images manually. Believe me. I tried. Just going through them took me days. Manual sorting and merging is not an option. Given the volume, I looked for tools that support Linux. While tools like DigiKam can add tags to your images and make them more browsable, they didn't quite meet my expectations for sorting and merging files. Maybe it is possible. I don't have the energy to read through the docs.

That brings us to the present day.

The breakthrough

The thing with modern camera devices, especially smartphones, is that they save the images in a somewhat consistent manner. On most images, the creation date was in the file name. Even though the date formats were different, it was a good starting point.

So, this was my action plan:

  • Get all the images on my local machine. This included downloading all the images from services like Google Photos, my old NAS, other external SSDs and my phone.
  • Remove any sidecar files that come with the images.
  • Get the year from the name of the files and create directories with the year names somehow.
  • Move files to the respective directory based on the year of capture.

Bringing the images together

As mentioned in the breakthrough section, bring all the images to a single system with enough horsepower and space to hold all the image files. You may not remember where you had saved all the other image files and their backups. That is fine. We can repeat the same steps in this article to sort those files and merge them with our main image repository later.

I started with two sources: Google Photos and a backup from my machine.

The best way to copy files without the pain of dealing with speed and duplication is to use the rsync command. It also helps us exclude those pesky sidecar files when copying and merging files from different sources.

rsync -avhru --progress --ignore-existing /path/to/source/ /path/to/destination

Remember to include --exclude '*.extension' in the command when you know there are sidecar files, like in the case of Google Photos takeout.
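
Before committing to a long copy, it can help to preview what rsync would do. The sketch below wraps the command in a small function and adds --dry-run, so nothing is written. The function name preview_merge is made up for illustration, and '*.json' is only an assumption about what your sidecar files look like; swap in whatever extension you actually see.

```shell
#!/usr/bin/env bash
# preview_merge is a hypothetical helper name, not part of rsync.
# Usage: preview_merge /path/to/source/ /path/to/destination
preview_merge() {
  local src=$1 dest=$2
  # --dry-run lists what would be copied without writing anything.
  # --exclude '*.json' assumes the sidecar files end in .json.
  rsync -avh --dry-run --ignore-existing --exclude '*.json' "$src" "$dest"
}
```

Once the listing looks right, run the real command without --dry-run.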

For the sake of easy reference, let's call the destination directory ~/combined/

I prefer to copy the files to a new directory. That way, if I mess up the command, I can always restart from the previous step. It is possible only because we left the source intact.

File Names: Gotta Catch 'em all!

We will be revisiting this step over and over again as we move forward with this activity. Getting the names of the files gives a point of reference we can use in other commands.

find ~/combined/ -type f > images.txt

Running this command will give us a list of every file under ~/combined/, one per line. The best part is that each entry includes the path to the file, starting from the directory we passed to find (in this case, ~/combined/).
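
If the directory also holds videos, documents or leftover sidecar files, you may want the list to contain images only. Here is a minimal sketch, assuming your find supports -iname (GNU and BSD find do). The helper name list_images and the extension list are my assumptions, so adjust them to your collection.

```shell
#!/usr/bin/env bash
# list_images is a hypothetical helper; the extension list is an assumption.
# -iname matches case-insensitively, so IMG.JPG and img.jpg both count.
list_images() {
  find "$1" -type f \( -iname '*.jpg' -o -iname '*.jpeg' \
    -o -iname '*.png' -o -iname '*.heic' \)
}
# Example: list_images ~/combined/ > images.txt
```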

Grep, the tool that never ceases to impress me

As you can already guess, we need to parse the date from the file names. grep is the perfect tool for this task. Like the find command, we will be revisiting this step until we get most of the files sorted. But unlike find, we need to experiment with the regular expression a lot more to get this part to work.

There are different date variations in the name. But, in most cases, it is either YYYYMMDD or YYYY-MM-DD. There will be minor differences in the prefix and postfix values which come with the date. These could be an underscore _ or a hyphen -. Either way, we need to write regex to find them.

grep -oE '[_-]20[0-9]{6}|[_-]20[0-9]{2}-[0-9]{2}-[0-9]{2}' images.txt | cut -c 2-5 | sort -u > years.txt

With this regex, we are looking for a combination of the following:

  • _YYYYMMDD
  • -YYYYMMDD
  • _YYYY-MM-DD
  • -YYYY-MM-DD

The -o flag makes grep print only the matched text, and -E enables extended regular expressions. Each match starts with the separator, so cut -c 2-5 skips it and keeps the four-digit year. We then sort the values and keep the unique years. We will be creating those year directories using this data.
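
To see the pipeline at work without touching real files, you can feed it a few names by hand. The sample names below are made up, and the regex is written with a plain [_-] class, which matches the same underscore and hyphen separators.

```shell
#!/usr/bin/env bash
# Made-up sample names: one per date format, plus one without a date.
samples='IMG_20201103_101500.jpg
photo-2019-07-04.png
Screenshot_2021-01-02-10-00-00.png
scan0001.jpg'

# grep keeps only the separator plus date, cut drops the separator and
# everything after the year, sort -u leaves one line per year.
years=$(printf '%s\n' "$samples" \
  | grep -oE '[_-]20[0-9]{6}|[_-]20[0-9]{2}-[0-9]{2}-[0-9]{2}' \
  | cut -c 2-5 \
  | sort -u)
printf '%s\n' "$years"
# prints 2019, 2020 and 2021, one per line; scan0001.jpg is ignored
```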

You will have to play around with the regex to capture most of the dates from the names.

Piping the raw grep output to wc -l, and comparing the number with wc -l < images.txt, will show you the coverage you achieved. I encourage you to view the raw grepped results before you slice and dice the values.

To see the file names your regex didn't pick up, replace -o with -v in the grep command and drop the cut and sort steps.
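
One way to put a number on the coverage is to count matching lines instead of matches. This is a rough sketch, and the helper name coverage is made up. It relies on grep -c, which counts the lines that match, so a name with two dates in it is still counted once.

```shell
#!/usr/bin/env bash
# coverage is a hypothetical helper: prints "matched/total" for a list file.
# $1: list file (one path per line), $2: extended regex to try.
coverage() {
  local list=$1 pattern=$2
  local matched total
  # grep -c exits non-zero when nothing matches, hence the "|| true".
  matched=$(grep -cE -- "$pattern" "$list" || true)
  total=$(wc -l < "$list")
  echo "$((matched))/$((total))"
}
# Example: coverage images.txt '[_-]20[0-9]{6}'
```

Rerunning it after each regex tweak shows quickly whether the change picked up more files or not.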

Creating years of memory

With the year data stored in a file, we can use it to create the directories. In pure Linux user fashion, we will be doing that using bash.

tr -d '\r' < years.txt | xargs -I {} mkdir -p {}

Apart from creating the year directories, this approach will ensure that a trailing \r in the name does not end up as a part of the directory name. I faced this issue when I created the directories using bash on Windows.
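
If an experimental regex lets something odd slip into years.txt, you probably don't want a directory named after it. The sketch below is a slightly more defensive variant: it only accepts lines that are exactly a four-digit year starting with 20. The helper name make_year_dirs and the second argument (where to create the directories) are my additions for illustration.

```shell
#!/usr/bin/env bash
# make_year_dirs is a hypothetical helper.
# $1: file with one year per line, $2: directory to create the years in.
make_year_dirs() {
  # Strip carriage returns, then keep only whole lines like 2019 or 2020.
  tr -d '\r' < "$1" \
    | grep -xE '20[0-9]{2}' \
    | while IFS= read -r year; do
        mkdir -p "$2/$year"
      done
}
# Example: make_year_dirs years.txt .
```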

Move those files

Since we have all we need, we can move the files to their correct directories.

pattern='[_-]20[0-9]{6}|[_-]20[0-9]{2}-[0-9]{2}-[0-9]{2}'
while IFS= read -r filename; do
  # Keep only the first match so a name with two dates still yields one year
  year=$(printf '%s\n' "$filename" | grep -oE "$pattern" | head -n 1 | cut -c 2-5)
  if [ -n "$year" ]; then
    mv "$filename" "$year/" && echo "Moved $filename to $year/"
  else
    echo "No year found in $filename"
  fi
done < images.txt

Before you run the command blindly, there are a couple of parameters you need to ensure are in order.

  • Use the same regex you used previously to find the year.
  • Avoid mixing regexes whose matches start at different positions. If one regex looks for YYYYMMDD and the other looks for _YYYYMMDD, the cut command will return the wrong year values for one of them.
  • Ensure you are using the correct input file. As you will see from the next section, we are generating other input files. They are used for subsequent command invocations.

This step will help you move most of your files to the correct directory.
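
If you would rather look before you leap, a dry run of the move is cheap. The sketch below only prints where each file would go and moves nothing. It is not the exact loop from this section: the helper name plan_moves is made up, it matches against the file name only (so a date in a directory name is ignored), and it keeps just the first match. Both of those are my assumptions about what you would want.

```shell
#!/usr/bin/env bash
# plan_moves is a hypothetical helper. For every path in the list file ($1)
# it prints "year<TAB>path", or "????<TAB>path" when no date is found.
plan_moves() {
  local pattern='[_-]20[0-9]{6}|[_-]20[0-9]{2}-[0-9]{2}-[0-9]{2}'
  local filename year
  while IFS= read -r filename; do
    # ${filename##*/} strips the directory part, leaving the file name.
    year=$(printf '%s\n' "${filename##*/}" \
      | grep -oE -- "$pattern" | head -n 1 | cut -c 2-5)
    printf '%s\t%s\n' "${year:-????}" "$filename"
  done < "$1"
}
# Example: plan_moves images.txt | sort | less
```

Sorting the output groups the files by target year, which makes a stray 2099 or a wall of ???? easy to spot.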

What do we have left?

There are always files that weren't sorted into any year folder. So, the best way to identify what's left is to run the find command again, only this time, you will have a second step to filter out files that were sorted.

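find . -type f ! -name '*.txt' > images.txt
grep -vE '^\./20[0-9]{2}/' images.txt > remaining.txt

The ! -name '*.txt' test keeps our own list files (images.txt, years.txt and remaining.txt) out of the listing. The regex then filters out every line whose path starts with a year directory. An example would be ./2020/IMG_20201103.jpg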

Do it all again!!!!

At this point, it should be clear to you what we will be doing and why.

  • We will be identifying patterns in the remaining.txt file using grep,
  • find the years we have previously missed,
  • create those directories,
  • use the new regex to move the file to its respective directory,
  • find what's left and repeat.
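
Spotting patterns in a long remaining.txt by eye gets old fast. One trick that may help, sketched below, is to replace every digit with N and count how often each name "shape" shows up. The helper name name_shapes is made up, and the approach is a suggestion rather than something these steps depend on.

```shell
#!/usr/bin/env bash
# name_shapes is a hypothetical helper. It drops the directory part of each
# path in the list file ($1), replaces every digit with N, and counts how
# often each resulting shape occurs, most frequent first.
name_shapes() {
  sed -e 's|.*/||' -e 's/[0-9]/N/g' "$1" | sort | uniq -c | sort -rn
}
# Example: name_shapes remaining.txt | head
```

A frequent shape containing NNNN-NN-NN or NNNNNNNN after some other separator, such as a space, is a good hint for the next regex to try.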

Deal with the unsorted

After a couple of iterations, there can be a few files that were not sorted because they don't have the date value in their file name. For the time being, just store them in an unsorted directory.

mkdir -p unsorted
xargs -I{} mv {} ./unsorted/ < remaining.txt
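
One caveat: xargs gives quotes and backslashes in its input a special meaning, so a name like Mom's trip.jpg can make it stop with an error. If your leftovers have names like that, a plain read loop is a safer bet. This is a minimal sketch; the helper name move_listed is made up, and mv -n (do not overwrite) is my addition, assuming GNU or BSD mv.

```shell
#!/usr/bin/env bash
# move_listed is a hypothetical helper.
# $1: list file with one path per line, $2: destination directory.
move_listed() {
  mkdir -p "$2"
  # tr drops any carriage returns; read -r keeps backslashes as they are.
  tr -d '\r' < "$1" | while IFS= read -r file; do
    if [ -f "$file" ]; then
      # -n refuses to overwrite a file that is already in the destination.
      mv -n -- "$file" "$2/"
    fi
  done
}
# Example: move_listed remaining.txt ./unsorted
```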

Conclusion

As you find more image files from different places, you can follow these same steps again to sort those files in place. Once done, use the rsync command to merge those files with the main image repository. The flags we used for rsync ensure:

  • Files that already exist at the same path in the destination are skipped, so same-named duplicates are not copied twice
  • An interrupted run can be restarted, and it carries on with the files that are still missing
  • The directory structure stays intact

It's a win-win!

For me, these steps helped combine several sources with more than 6k photos each and several gigabytes into a single 20k+ sorted and merged repository.

Plans for later

I'm not done with it just yet. There are other things I would like to do. Those include:

  • Fixing the name of files. I prefer a consistent format. I am aiming for something like IMG-YYYYMMDD_HHMMSS.ext
  • Fixing the metadata. Most of the metadata was lost or overwritten during previous backups
  • Using Mylio (Windows) and DigiKam (Linux) to add tags like family, friends and college.

I'll be experimenting further to see how far I can go with bash.

Until then! :)