Some Linux CLI tools for CSV table analysis

OK, so here is my task. I must verify that a set of files on a portable disk drive (about 20,000 files) matches a list of files (and their respective MD5 checksum values) found in an accompanying manifest document (CSV format). I could write (and, in fact, have written) a simple Python module to perform this task, but I was curious just how straightforward it would be using only CLI tools found on my local Linux Mint system. The tools I am thinking about are head, wc, find, xargs, md5sum, awk, cut, sort, and comm. Ready, let's get to it...

0. A quick breakdown of the manifest file with wc and head

According to wc, the manifest document ("data-manifest.csv") is 19,574 lines long; taking a quick look at its structure is then simple with the head command:

wc -l data-manifest.csv
19574 data-manifest.csv
head data-manifest.csv
Case00_T-0_no-alter_1993-01-27_15:00:00,61a1c22f40405b3b971132074d22cea2
Case00_T-0_no-alter_1993-01-27_18:00:00,a40d2a6fcdabd1f42af64b2717ad3199
Case00_T-0_no-alter_1993-01-27_21:00:00,538b993fb7d2eb5e817d3d52fed64c30
Case00_T-0_no-alter_1993-01-28_00:00:00,2682531b6313d3d6aa43ef572e17a39a
Case00_T-0_no-alter_1993-01-28_03:00:00,5a1ed689b39d3ea6a0e4175bff04482c
Case00_T-0_no-alter_1993-01-28_06:00:00,73d29ba5ad523149fcde004fd61dae85
Case00_T-0_no-alter_1993-01-28_09:00:00,3d7625dec811cd58ba0ebf54cdf7e72b
Case00_T-0_no-alter_1993-01-28_12:00:00,2b300ba11e4829cfad87ad2305028e36
Case00_T-0_no-alter_1993-01-28_15:00:00,33fc29cd3d0a5c16399540422e442169
Case00_T-0_no-alter_1993-01-28_18:00:00,877a3f7751109394969ee71185668e08
A 2-column CSV file: the data file name, then its MD5 checksum.
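As a quick sanity check on that format, awk can confirm that every record really does have exactly two comma-separated fields (a one-liner sketch; a single output line reading "2" means the assumption holds for all 19,574 records):

awk -F, '{ print NF }' data-manifest.csv | sort -u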

1. Using find, xargs, and md5sum to generate checksums for all files on the disk

The find . -type f command does the job of identifying all of the regular files (no directories or other special file types) beneath the top-level data directory. In this case, the data files are actually spread across 20 or so sub-directories, but that's not a problem since find recurses into them too. Because there are just south of 20,000 data files, I pipe | the output from find into the xargs command, which, according to the manpage, is used to "build and execute command lines from standard input". More importantly, it easily handles large volumes of input that are iteratively fed to yet another command: in this specific case, md5sum. Ultimately, md5sum generates the MD5 checksum of each file identified by find and channeled through xargs. These commands are strung together on a single line, executed, and the output redirected and appended to a file called "data-md5.txt":

find . -type f | xargs md5sum >> data-md5.txt
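One caveat worth noting: a plain find | xargs pipeline splits its input on whitespace, so file names containing spaces or newlines would break it. The names in this data set are space-free, but a more defensive sketch of the same pipeline uses the NUL-delimited options that both tools support:

find . -type f -print0 | xargs -0 md5sum >> data-md5.txt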

This process took about 3 hours and 5 minutes on my local system (interestingly, a similar process in Python required almost 4 hours of execution time). Content in the "data-md5.txt" file has the following structure:

head data-md5.txt
40733f32cbdfa478f4a18734392fade0  ./metadata.docx
7686390566c8b6311c71fd2183e3a158  ./C04/Case04_T-2_ten_2000-02-16_18:00:00
94e5107a167bfdc9f61218497e3421c7  ./C04/Case04_T-2_nin_2000-02-16_00:00:00
f83ec459f25fc7df195d74e3ac1bb56a  ./C04/Case04_T-4_nin_2000-02-13_03:00:00
ecc4c6d132f1c8d998e8527ab2e73d01  ./C04/Case04_T-2_no-alter_2000-02-15_00:00:00
06272c7795c77db5bfe4ae57a280243d  ./C04/Case04_T-4_ten_2000-02-12_06:00:00
63e61168e951d10580dd94e6d05e5b17  ./C04/Case04_T-3_nin_2000-02-16_00:00:00
223d2346a1d79783bd37d084fa3cd292  ./C04/Case04_T-1_fif_2000-02-17_18:00:00
1907c01a341c76de3733061997eddc07  ./C04/Case04_T-3_ten_2000-02-13_06:00:00
0844dd5c5ddb26e92dff62a911f2e64c  ./C04/Case04_T-4_nin_2000-02-14_15:00:00
A 2-column, space-delimited file: the MD5 checksum, then the relative file path.

To monitor the progress of this task, I use a simple combination of watch and wc -l to periodically (every 2 seconds, watch's default interval) list the number of lines in the "data-md5.txt" file, so I can see how close it is getting to that magic value of 20,000:

watch wc -l data-md5.txt
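If I wanted a rough percentage rather than a raw line count, something like the following sketch (hard-coding the expected total of 19,574 manifest entries) would also work, since watch re-evaluates the quoted command on every refresh:

watch -n 2 'echo "scale=1; 100 * $(wc -l < data-md5.txt) / 19574" | bc'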

2. Using awk and cut to make a comparable file format

Now that the report file ("data-md5.txt") has been generated, I will use the awk command to alter the structure of the file so that it matches the CSV (comma-separated value) manifest document:

awk '{ print $2 "," $1}' data-md5.txt > data-md5-working.csv
head data-md5-working.csv
./metadata.docx,40733f32cbdfa478f4a18734392fade0
./C04/Case04_T-2_ten_2000-02-16_18:00:00,7686390566c8b6311c71fd2183e3a158
./C04/Case04_T-2_nin_2000-02-16_00:00:00,94e5107a167bfdc9f61218497e3421c7
./C04/Case04_T-4_nin_2000-02-13_03:00:00,f83ec459f25fc7df195d74e3ac1bb56a
./C04/Case04_T-2_no-alter_2000-02-15_00:00:00,ecc4c6d132f1c8d998e8527ab2e73d01
./C04/Case04_T-4_ten_2000-02-12_06:00:00,06272c7795c77db5bfe4ae57a280243d
./C04/Case04_T-3_nin_2000-02-16_00:00:00,63e61168e951d10580dd94e6d05e5b17
./C04/Case04_T-1_fif_2000-02-17_18:00:00,223d2346a1d79783bd37d084fa3cd292
./C04/Case04_T-3_ten_2000-02-13_06:00:00,1907c01a341c76de3733061997eddc07
./C04/Case04_T-4_nin_2000-02-14_15:00:00,0844dd5c5ddb26e92dff62a911f2e64c

With the report file structure transformed into a CSV format ("data-md5-working.csv"), I will use the cut command to remove the leading directory components from the file path. In this case, the -c 7- argument tells cut to "character select" from character (or column) 7 through the end of the line, dropping the leading "./C04/"-style prefix:

cut -c 7- data-md5-working.csv > data-md5-report.csv
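As an aside, cut with a fixed character offset only works when the prefix being stripped always has the same length. A depth-independent alternative (just a sketch, producing the same report file) is to let awk keep only the final path component, which also folds the column swap from the previous step into a single pass:

awk '{ n = split($2, parts, "/"); print parts[n] "," $1 }' data-md5.txt > data-md5-report.csv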

The final product is now structurally identical to the manifest document:

head data-md5-report.csv 
metadata.docx,40733f32cbdfa478f4a18734392fade0
Case04_T-2_ten_2000-02-16_18:00:00,7686390566c8b6311c71fd2183e3a158
Case04_T-2_nin_2000-02-16_00:00:00,94e5107a167bfdc9f61218497e3421c7
Case04_T-4_nin_2000-02-13_03:00:00,f83ec459f25fc7df195d74e3ac1bb56a
Case04_T-2_no-alter_2000-02-15_00:00:00,ecc4c6d132f1c8d998e8527ab2e73d01
Case04_T-4_ten_2000-02-12_06:00:00,06272c7795c77db5bfe4ae57a280243d
Case04_T-3_nin_2000-02-16_00:00:00,63e61168e951d10580dd94e6d05e5b17
Case04_T-1_fif_2000-02-17_18:00:00,223d2346a1d79783bd37d084fa3cd292
Case04_T-3_ten_2000-02-13_06:00:00,1907c01a341c76de3733061997eddc07
Case04_T-4_nin_2000-02-14_15:00:00,0844dd5c5ddb26e92dff62a911f2e64c
Again, a 2-column CSV file, now matching the manifest's layout.

3. Using sort and comm to compare the manifest to the report

A quick check of the line count of the "data-md5-report.csv" file with wc shows that it contains 3 more lines than the manifest file, so I know immediately that there are some differences:

wc -l data-md5-report.csv 
19577 data-md5-report.csv

Now let's see if I can determine exactly what is different (hoping that there are no checksum variations). For a line-by-line comparison, each file must be sorted into the same lexicographic order. For this, I can use sort:

sort data-manifest.csv > data-manifest-sorted.csv
sort data-md5-report.csv > data-md5-report-sorted.csv
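One small caveat, in case the two sorts ever run on different systems: sort order depends on the current locale, and comm can complain about unsorted input if its two inputs were sorted under different collation rules. Forcing a plain byte-wise sort with LC_ALL=C on both commands (a defensive sketch, not strictly needed here since both files are sorted on the same machine) avoids that:

LC_ALL=C sort data-manifest.csv > data-manifest-sorted.csv
LC_ALL=C sort data-md5-report.csv > data-md5-report-sorted.csv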

With both files now saved in equivalent sort order, I can use comm to compare them and see exactly what varies (I already know there is some variation, given the line-count mismatch):

comm -13 data-manifest-sorted.csv data-md5-report-sorted.csv 
metadata.docx,40733f32cbdfa478f4a18734392fade0
data-manifest.xlsx,14d2def30bd89cf441f0cb1e1aa1ac55
output_variables.txt,8dbb4b06ccf6e714bf2a0af9358f0384

The comm command provides options to display the lines unique to each file or the lines common to both. In this case, the -13 option suppresses the lines unique to "data-manifest-sorted.csv" (column 1) and the lines common to both files (column 3); in other words, I see only the lines unique to "data-md5-report-sorted.csv".
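For completeness, the opposite check is just as cheap: -23 suppresses the report-only lines and the common lines, leaving anything that appears in the manifest but not in the report (a missing or altered file). The line counts already suggest this should print nothing, but it costs only a second to confirm:

comm -23 data-manifest-sorted.csv data-md5-report-sorted.csv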

Finally, I see that there are three additional records in "data-md5-report-sorted.csv" (metadata.docx, data-manifest.xlsx, and output_variables.txt), but it seems clear that these are not actual data files, so I can safely ignore them. Mostly, I wanted to be sure that the records in the report file (which capture both the data file names and their checksums) matched those in the manifest file exactly. They did!
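One last thought on checksum mismatches specifically: because comm compares whole lines, a data file whose contents had changed would surface as a pair of unrelated lines (one unique to each file) rather than as an explicit "mismatch". A more targeted sketch, assuming the file name remains the first field in both sorted files, pairs records by file name with join and flags differing checksums with awk:

join -t, -j 1 data-manifest-sorted.csv data-md5-report-sorted.csv | awk -F, '$2 != $3 { print $1 ": checksum mismatch" }'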