notes on using the command line

md5sum

md5sum is a tool that creates md5 checksums (or hashes, if that's your preferred term). It can also verify checksums from a list.

md5sum is one of a family of checksum tools that differ only in algorithm and name (which generally ends in "sum"). If md5sum is installed on your system, you are likely to also have:

sha256sum
sha512sum

and a few others. I mention these other tools for a couple of reasons:

  1. They work just like md5sum, so everything I cover here should work with the other tools simply by replacing md5sum with the command you want.
  2. The md5 algorithm has been broken for security purposes for a long time, and you probably shouldn't use it. If you have a choice (i.e. are not working with legacy tools or data), you should use a stronger hash like sha256 or sha512.
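As a quick illustration of the first point, here's a minimal sketch that hashes the same input with md5sum and sha256sum. It assumes both tools are installed (they ship together in GNU coreutils); the variable names are just for illustration. Feeding input through a pipe is covered further down.

```shell
# Hash the same three bytes ("dog", no trailing newline) with two tools.
# Only the command name changes; the output layout stays the same.
md5_line=$(printf 'dog' | md5sum)
sha256_line=$(printf 'dog' | sha256sum)

echo "$md5_line"
echo "$sha256_line"
```

The sha256 hash is longer (64 hexadecimal characters instead of 32), but the "hash, spaces, filename" layout is the same.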

Obligatory disclaimer on the brokenness of md5

The first thing you're likely to hear these days if you tell someone with a computing background that you're using md5 is that "md5 is broken" and you shouldn't use it. I've already said as much above.

One of the ways md5 is "broken" is that it's possible to deliberately generate so-called hash collisions, where two different files have the same md5 checksum. Since one of the principal uses of a checksum algorithm is to identify a file by its hash, an algorithm for which collisions can be manufactured on purpose isn't a particularly trustworthy one.

But if you're reading this, there's a good chance you're stuck using md5 for some reason. Maybe you're using an application that generates checksums but only supports md5. I can think of at least one media asset management system that does this, and it's not even an old piece of software. So the only way to verify the checksums generated by that system is to use md5.

Or maybe you're working with a set of data where someone previously generated md5 checksums and for consistency you're continuing to use md5 until you can move everything to a stronger hash.

I've been in these exact situations while working in digital archives contexts, so I've ended up using md5 a lot. But I've been seeing it as an intermediate step towards storing files with stronger hashes like sha256.

Generating checksums

md5sum can be run

  • interactively
  • using input from a pipe
  • using a file as input

The output will always be an md5 hash followed by one or two spaces and then a filename. Note that the filename will be '-' if the input to md5sum came from standard input.
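Because the layout is so regular, it's easy to pull the output apart in a script. Here's a minimal sketch using plain shell parameter expansion; the variable names are made up for illustration, and the simple split assumes the filename contains no spaces.

```shell
# Capture md5sum's output for input arriving on standard input.
line=$(printf 'dog' | md5sum)

# Everything before the first space is the hash.
hash=${line%% *}
# Everything after the last space is the filename ('-' for standard input).
# This breaks for filenames containing spaces; it is only a sketch.
name=${line##* }

echo "hash: $hash"
echo "name: $name"
```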

Interactive use

To run md5sum interactively, simply type

$ md5sum

This will put md5sum into interactive mode where you can type some text and have md5sum generate an md5 hash for what you typed. To tell md5sum that you're done typing, press CTRL+D. If you press Enter instead of CTRL+D, md5sum will treat that as part of the input (as a "newline" character), so you have to type CTRL+D to signal that you're done.

Consider the word "dog" followed by "Enter":

$ md5sum
dog
362842c5bb3847ec3fbdecb7a84a8692  -

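Now compare that with the word "dog" not followed by anything before pressing CTRL+D. (When the line isn't empty, most terminals need CTRL+D pressed twice: the first press sends the pending text to md5sum, the second signals the end of input.)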

$ md5sum
dog06d80eb0c50b49a509b49f2424e8c805  -

The md5sum output here is a bit hard to read because there's no line break, but you can see that the hash that immediately follows "dog" is different from the hash in the example where I pressed Enter before ending input.

Input from a pipe

You can also achieve the same result - and more - by sending input to md5sum from a pipe.

$ echo "dog" | md5sum
362842c5bb3847ec3fbdecb7a84a8692  -

Note that the md5 value above is actually the same as the one generated interactively by typing "dog" and then "Enter". This is because echo prints a newline by default.

If you use the '-n' option to echo, which suppresses the newline, the result is the same as the second interactive result above:

$ echo -n "dog" | md5sum
06d80eb0c50b49a509b49f2424e8c805  -
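One caveat: how echo treats '-n' varies between shells, so printf is the more predictable way to control the trailing newline. A small sketch of the same two cases (the variable names are only for illustration):

```shell
# printf writes exactly what the format string says: no newline unless asked.
without_newline=$(printf 'dog' | md5sum)
with_newline=$(printf 'dog\n' | md5sum)

echo "$without_newline"   # same hash as: echo -n "dog" | md5sum
echo "$with_newline"      # same hash as: echo "dog" | md5sum
```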

Sending input to md5sum isn't limited to text. You can also just send bytes. This leads to some odd-looking constructs like generating a checksum from an optical disc by "reading" the disc directly from the drive's device file into md5sum:

$ cat /dev/cdrom | md5sum
# /dev/cdrom is an assumption: the device name varies by system (for example /dev/sr0).
# I don't actually have an optical drive at the moment to show the output here.
# But it would look like any other md5sum output from a pipe.
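As a stand-in for the optical drive, here's a sketch that feeds non-text bytes to md5sum from an ordinary scratch file. The temporary file and variable names are hypothetical, and 'head -c' is assumed to be available (it is on GNU and BSD systems).

```shell
# Create a scratch file holding 1024 zero bytes (not text at all).
tmpfile=$(mktemp)
head -c 1024 /dev/zero > "$tmpfile"

# Three ways of feeding the same bytes to md5sum.
from_pipe=$(cat "$tmpfile" | md5sum)
from_redirect=$(md5sum < "$tmpfile")
from_path=$(md5sum "$tmpfile")

# Print just the hash portion; the filename column differs ('-' vs the path).
echo "${from_pipe%% *}"
echo "${from_redirect%% *}"
echo "${from_path%% *}"

rm -f "$tmpfile"
```

All three hashes should match, since md5sum only cares about the bytes it receives, not how they arrive.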

Input as a file

Most of the time if I'm generating a checksum, it's for a file. This is pretty simple: just type md5sum followed by a path to a file.

$ md5sum content/commands/md5sum.md
a0ae395984f1ae8e32cfa78a04298101  content/commands/md5sum.md

If you want to generate checksums for multiple files, you can use find. This is not the place for extensive notes on find, but here's an example of what you can do:

$ find content/ -type f -name "*.md" -exec md5sum {} \;
5cb8f8352974ecdfab6084c5da25d5a4  content/commands/cd.md
4c5dbabc8c1c4552650db118cc21767f  content/commands/ls.md
a0ae395984f1ae8e32cfa78a04298101  content/commands/md5sum.md
349090dda4b55971d016d6f812ab80d7  content/commands/pwd.md
bdd8463a2c20bd9941f258ac9550418f  content/commands/rsync.md
bcc27550df56e868222cdfe6be5121c6  content/commands/tmux.md
c160ddf4d4c00e68ce6f1bd2e81ccf2a  content/pages/contact.md
dd9a81ebfbc8805f1d5ccb8a83426b53  content/pages/home.md

The above command asked find to find all files ('-type f') in the "content" directory whose names end in ".md" ('-name "*.md"') and then run md5sum on each of them.
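A variation on the same command, sketched against a throwaway directory so it can be run anywhere: ending '-exec' with '+' instead of '\;' hands many paths to a single md5sum run, and piping through sort gives a stable order. The directory layout and file names here are invented for the example.

```shell
# Build a small throwaway tree so the example is self-contained.
demo=$(mktemp -d)
mkdir -p "$demo/content/commands" "$demo/content/pages"
printf 'ls notes\n'   > "$demo/content/commands/ls.md"
printf 'home page\n'  > "$demo/content/pages/home.md"
printf 'not markdown' > "$demo/content/pages/logo.txt"

# '+' instead of '\;' passes many paths to one md5sum invocation.
# sort -k 2 orders the lines by path, since find's order is not guaranteed.
listing=$(cd "$demo" && find content/ -type f -name "*.md" -exec md5sum {} + | sort -k 2)

echo "$listing"
rm -rf "$demo"
```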

Recursing through subdirectories

By itself, md5sum will not run recursively through subdirectories, but if you combine it with find, many ways of grouping files become possible. Working off of the example above, if I had wanted to generate md5 hashes on all files in the "content" directory, not just files with names ending in ".md", I would have simply left off the '-name' option:

$ find content/ -type f -exec md5sum {} \;
# output omitted

Note: There is another set of checksum tools whose names end in "deep" - md5deep, sha256deep, etc. - that does provide a recursive option ('-r'). Depending on your needs, that might be a better choice if you're trying to generate hashes for all files below a certain directory.
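To make the find approach concrete, here's a small sketch that hashes every file under a nested directory and checks that the list has one line per file. The tree and the list name "all-files.md5" are made up; the list is written outside "content" so that it doesn't end up hashing itself.

```shell
# Throwaway tree with files at three different depths.
demo=$(mktemp -d)
mkdir -p "$demo/content/commands/drafts"
printf 'a\n' > "$demo/content/top.txt"
printf 'b\n' > "$demo/content/commands/ls.md"
printf 'c\n' > "$demo/content/commands/drafts/wip.md"

# Write the list outside "content" so the list itself is not hashed.
manifest="$demo/all-files.md5"
(cd "$demo" && find content/ -type f -exec md5sum {} + | sort -k 2 > "$manifest")

# One line per file, however deeply nested.
file_count=$(find "$demo/content" -type f | wc -l)
line_count=$(wc -l < "$manifest")
echo "files: $file_count, manifest lines: $line_count"

rm -rf "$demo"
```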

Checksum verification

md5sum has a nice feature that allows you to verify a set of checksums using a pre-existing list. The list must follow the same formatting conventions that md5sum uses when generating checksums: each line contains an md5 hash, followed by one or two spaces, and then a path to a file.

Let's go back to the list generated with the find command above. But instead of sending the output to the screen, let's send it to a file:

$ find content/ -type f -name "*.md" -exec md5sum {} \; > verify-example.txt

Now "verify-example.txt" holds the list of checksums and file paths. To verify the checksums, run md5sum with the -c option, followed by the path to the list of checksums you are trying to verify.

$ md5sum -c verify-example.txt
content/commands/cd.md: OK
content/commands/ls.md: OK
content/commands/md5sum.md: FAILED
content/commands/pwd.md: OK
content/commands/rsync.md: OK
content/commands/tmux.md: OK
md5sum: content/commands/nope.md: No such file or directory
content/commands/nope.md: FAILED open or read
content/pages/contact.md: OK
content/pages/home.md: OK
md5sum: WARNING: 1 listed file could not be read
md5sum: WARNING: 1 computed checksum did NOT match

md5sum is not very verbose about the results of a verification run. The only two possible results are essentially "OK" and "FAILED", along with some error messages.

Failures come in two types:

  1. "FAILED" means that the checksum for the current file does not match the checksum in the list.
  2. "FAILED open or read" means that the file couldn't be opened or read, most often because it doesn't exist at the listed path.

In the output above, "md5sum.md" failed because it is being changed regularly, thus changing the checksum. (It's also the file I'm currently writing.) On the other hand, "nope.md" failed because it doesn't exist. I inserted an extra line into the checksum list to make this example more useful.
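For long lists, or when running a check from a script, two things help: the '--quiet' option (GNU coreutils) leaves out the "OK" lines, and md5sum's exit status is non-zero if anything failed. A minimal sketch, using invented file names in a temporary directory:

```shell
demo=$(mktemp -d)
printf 'first\n'  > "$demo/a.txt"
printf 'second\n' > "$demo/b.txt"

# Record the checksums, then change one file so it no longer matches.
(cd "$demo" && md5sum a.txt b.txt > list.md5)
printf 'changed\n' >> "$demo/b.txt"

# --quiet suppresses the "OK" lines; warnings on stderr are discarded here.
# The exit status is non-zero when any file fails verification.
status=0
report=$(cd "$demo" && md5sum -c --quiet list.md5 2>/dev/null) || status=$?

echo "$report"
echo "exit status: $status"
rm -rf "$demo"
```

Only the mismatched file is reported, which makes failures much easier to spot in a list of thousands of files.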

Line break compatibility warning: when receiving a list of checksums from another operating system, watch out for line breaks/end-of-line characters.

I've run into this problem numerous times: a Linux system treats the list of checksums and files as invalid because it includes Windows-style line breaks. If you get an error message saying that none of the files could be found, or that the lines in the input file were improperly formatted, but everything looks right to you, there's a very good chance you're seeing line break incompatibility. Converting the input file to the line endings of the system you're on, using a tool like dos2unix, should fix that issue.
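If dos2unix isn't installed, stripping the carriage returns with tr does the same job for this purpose. Here's a sketch that simulates a Windows-style list and cleans it up; the file names are hypothetical.

```shell
demo=$(mktemp -d)
printf 'hello\n' > "$demo/file.txt"

# Simulate a list produced on Windows: same content, but the line ends in CR LF.
hash_line=$(cd "$demo" && md5sum file.txt)
printf '%s\r\n' "$hash_line" > "$demo/windows-list.md5"

# tr -d '\r' removes every carriage return, leaving Unix line endings.
tr -d '\r' < "$demo/windows-list.md5" > "$demo/unix-list.md5"
cr_count=$(tr -cd '\r' < "$demo/unix-list.md5" | wc -c)

# The cleaned list verifies normally.
result=$(cd "$demo" && md5sum -c unix-list.md5)
echo "$result"

rm -rf "$demo"
```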