31 Using Checksums
31.1 What is an MD5 Checksum?
At its core, an MD5 checksum is like a digital fingerprint for your file: a 32-character string of hexadecimal digits computed from the file’s contents by the MD5 hashing algorithm. Even the tiniest change to a file results in a completely different checksum, which makes it well suited to verifying file integrity.
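To make the fingerprint idea concrete, here is a minimal sketch that hashes two strings differing by a single character. It assumes GNU md5sum is available (as on most Linux systems); the strings and variable names are only illustrative.

```shell
# Hash two inputs that differ by one character.
# printf is used instead of echo so that no trailing newline is hashed.
sum_a=$(printf 'hello' | md5sum | awk '{print $1}')
sum_b=$(printf 'hellp' | md5sum | awk '{print $1}')

echo "hello -> $sum_a"
echo "hellp -> $sum_b"

# A one-character change should give an unrelated-looking checksum.
if [ "$sum_a" != "$sum_b" ]; then
    echo "The checksums differ"
fi
```

The two printed values should share no obvious pattern, even though the inputs are nearly identical.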
31.2 Why Use MD5 Checksums on Sasquatch?
When you’re moving large datasets or critical code to and from Sasquatch, there’s always a slim chance that a bit or two might get flipped during transfer. This could lead to corrupted data or unexpected errors in your computations. MD5 checksums provide a quick and reliable way to confirm that the file on Sasquatch is an exact, uncorrupted copy of your original.
31.3 How to Use MD5 Checksums: A Step-by-Step Guide
Let’s walk through the process with one file just to explain it clearly. We’ll assume you’re copying a file from your local machine to Sasquatch, but the process is identical for copying files from Sasquatch to your local machine.
31.3.1 Step 1: Generate the MD5 Checksum on Your Source File
Before you transfer your file, you’ll generate its MD5 checksum on the machine where the original file resides.
- On Linux: Open your terminal and navigate to the directory containing your file. Then, use the md5sum command:
md5sum your_file.txt
Note: Use md5 on some macOS versions or certutil -hashfile [file] MD5 on Windows.
You’ll see output similar to this:
d41d8cd98f00b204e9800998ecf8427e  your_file.txt
The long string of characters is your MD5 checksum.
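Because the command name differs between platforms, a small wrapper can save some guesswork on Linux and macOS. The sketch below is one possible approach, not part of Sasquatch’s tooling: md5_of is a hypothetical helper name, and the temporary file only stands in for your real source file.

```shell
# md5_of is a hypothetical helper name, not a standard command.
# It prints only the checksum, using whichever MD5 tool is installed.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        # GNU coreutils (Linux): output is "<checksum>  <filename>"
        md5sum "$1" | awk '{print $1}'
    elif command -v md5 >/dev/null 2>&1; then
        # macOS/BSD: -q prints the checksum alone
        md5 -q "$1"
    else
        echo "md5_of: no MD5 tool found" >&2
        return 1
    fi
}

# Usage: record the checksum of a small demo file.
demo_file=$(mktemp)
printf 'hello' > "$demo_file"
source_sum=$(md5_of "$demo_file")
echo "$source_sum"
rm -f "$demo_file"
```

Storing the result in a variable (or a text file) avoids copying the checksum by hand before the comparison in Step 4.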
31.3.2 Step 2: Transfer Your File to Sasquatch
Use your preferred method to copy the file to Sasquatch. This could be scp, rsync, or any other file transfer tool you’re accustomed to.
scp your_file.txt your_username@login-1.hpc:/path/to/destination/
31.3.3 Step 3: Generate the MD5 Checksum on the Copied File (on Sasquatch)
Once the file is on Sasquatch, SSH into the cluster and navigate to the directory where you copied the file. Then, generate its MD5 checksum using the md5sum command, just as you did in Step 1 (use md5sum here even if your local machine needed md5 or certutil):
ssh your_username@login-1.hpc
cd /path/to/destination/
md5sum your_file.txt
You’ll get output similar to what you saw in Step 1:
d41d8cd98f00b204e9800998ecf8427e  your_file.txt
31.3.4 Step 4: Compare the Checksums
Now, compare the MD5 checksum you noted in Step 1 (from your original file) with the MD5 checksum you just generated on Sasquatch.
If the checksums are identical, congratulations! Your file has been copied successfully without any corruption.
If the checksums are different, the copy on Sasquatch does not match the original, most likely because the file was corrupted or truncated during transfer. Delete the bad copy on Sasquatch and transfer the file again.
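Comparing 32-character strings by eye is error-prone, so the comparison can be scripted. The following is a minimal sketch, assuming GNU md5sum; verify_md5 is a hypothetical helper name, and the temporary file stands in for the copy on Sasquatch.

```shell
# verify_md5 is a hypothetical helper: it compares a recorded checksum
# with the checksum of a local file and reports the result.
verify_md5() {
    expected=$1
    file=$2
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)"
        return 1
    fi
}

# Demo with a temporary file standing in for the transferred copy.
copy=$(mktemp)
printf 'hello' > "$copy"
good_result=$(verify_md5 5d41402abc4b2a76b9719d911017c592 "$copy")
echo "$good_result"

# Simulate corruption by changing the content, then check again.
printf 'hellp' > "$copy"
if verify_md5 5d41402abc4b2a76b9719d911017c592 "$copy"; then
    bad_status=0
else
    bad_status=1
fi
rm -f "$copy"
```

In practice, the first argument would be the checksum you recorded in Step 1, and the second would be the path of the copied file on Sasquatch.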
31.4 Automating the Process for Multiple Files
Pay attention to this section, as this is what you will normally want to do.
If you’re dealing with many files, manually comparing checksums can be tedious! So here are a couple of tips for automating this:
31.4.1 1. Generating Checksums for Multiple Files into a Single File
You can generate MD5 checksums for an entire directory and save them to a file:
# On any machine
cd /path/to/your/files/
md5sum * > checksums.md5
This will create a file named checksums.md5 that contains the checksum for each file in the directory.
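Note that md5sum * only covers files directly in the current directory; it reports an error for any subdirectory instead of descending into it. If your data is nested, one possible variant uses find, as sketched below. The demo directory layout is made up for illustration, and GNU find and md5sum are assumed.

```shell
# Demo directory standing in for /path/to/your/files/ (layout is made up).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/results"
printf 'alpha' > "$demo_dir/file1.txt"
printf 'beta'  > "$demo_dir/file2.txt"
printf 'gamma' > "$demo_dir/results/file3.txt"

# Hash every regular file below the directory, skipping the manifest
# itself, and sort by path so repeated runs give the same order.
# The subshell keeps the cd from affecting the rest of the session.
(
    cd "$demo_dir" &&
    find . -type f ! -name checksums.md5 -exec md5sum {} + |
        sort -k 2 > checksums.md5
)

cat "$demo_dir/checksums.md5"
```

Excluding checksums.md5 by name matters here: the output file is created before find starts scanning, so it would otherwise be hashed along with the data.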
31.4.2 2. Copying the Checksum File and Verifying
After copying all your files and the checksums.md5 file to Sasquatch, you can use md5sum with the -c (check) option to verify them all at once:
# On Sasquatch
cd /path/to/destination/
md5sum -c checksums.md5
md5sum will then go through each file listed in checksums.md5 and compare its stored checksum with the one it generates on Sasquatch. It will report OK for files that match and FAILED for any that don’t, and it will look something like this:
file1.txt: OK
file2.txt: FAILED
file3.txt: OK
31.5 Important Considerations
MD5 is for integrity, not security: MD5 is well suited to detecting accidental corruption, but it is not cryptographically secure. Practical collision attacks are known, so it should not be used for purposes like verifying software authenticity. For security-critical applications, other hashing algorithms like SHA-256 are preferred. However, for simply verifying file transfers, MD5 is perfectly adequate and widely available.
Time and Resources: For extremely large files (terabytes), generating checksums can take some time and consume CPU resources. Factor this into your workflow.
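If you do need the stronger algorithm mentioned above, the same workflow carries over to sha256sum, which accepts the same -c option. The sketch below assumes GNU coreutils; the file names and the checksums.sha256 manifest name are only illustrative. It also shows how to act on the exit status in a script instead of reading the OK/FAILED lines by eye.

```shell
# Demo directory with two small files (names and contents are made up).
work_dir=$(mktemp -d)
printf 'alpha' > "$work_dir/file1.txt"
printf 'beta'  > "$work_dir/file2.txt"

# Same workflow as with md5sum, using SHA-256 instead.
(cd "$work_dir" && sha256sum file1.txt file2.txt > checksums.sha256)

# --quiet suppresses the per-file OK lines, so only problems are printed.
# The exit status is 0 only if every listed file matches.
if (cd "$work_dir" && sha256sum -c --quiet checksums.sha256); then
    verify_status="all files match"
else
    verify_status="at least one file failed"
fi
echo "$verify_status"
```

The --quiet option and the exit-status check work the same way with md5sum -c, which is convenient in batch scripts that should stop when a transfer is bad.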