Some Linux CLI tools for CSV table analysis
OK, so here is my task. I must verify that a set of files on a portable disk drive (about 20,000 files) matches a list of files (and their respective MD5 checksum values) found in an accompanying manifest document (CSV format). I could (and did) write a simple Python module to perform this task, but I was curious just how straightforward the job would be if I were to use only CLI tools found on my local Linux Mint system. The tools I am thinking about are head, wc, find, xargs, md5sum, awk, cut, and comm. Ready, let's get to it...
0. A quick breakdown of the manifest file with wc and head
A quick check with wc shows that the manifest document ("data-manifest.csv") is 19,574 lines:
wc -l data-manifest.csv
19574 data-manifest.csv
Taking a quick look at its format structure is just as simple with the head command:
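The manifest's first lines are not reproduced here, but each record pairs a relative file name with its MD5 checksum, the same name,checksum layout the report file is converted into in step 2. The command itself is simply:
head data-manifest.csv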
1. Using find, xargs, and md5sum to generate checksums for all files on the disk
The find . -type f command performs the job of identifying all of the regular files (not directories or other device objects) beneath the top-most data directory. In this case, the data files are actually spread across 20 or so sub-directories, but that's not a problem, since find will recurse into those directories too. Because there are just south of 20,000 data files, I will pipe (|) the output from find into the xargs command, which, according to the manpage, is used to "build and execute command lines from standard input". More importantly, it easily handles large volumes of input that are iteratively fed to another command: in this specific case, md5sum, which generates the MD5 checksum of each file identified by find and channeled through xargs. Strung together on a single line, the commands are executed and the output is redirected and appended to a file called "data-md5.txt":
find . -type f | xargs md5sum >> data-md5.txt
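As an aside, this pipeline assumes that none of the file paths contain spaces or other characters that xargs would split on; if that assumption were in doubt, the null-delimited variant would be the safer sketch:
find . -type f -print0 | xargs -0 md5sum >> data-md5.txt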
This process took about 3 hours and 5 minutes on my local system (interestingly, a similar process in Python required almost 4 hours of execution time). Content in the "data-md5.txt" file has the following structure:
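The file itself is not reproduced here, but each record follows md5sum's standard layout of checksum, two spaces, then file path. The line below is reconstructed from the report sample shown in step 2 rather than copied from the actual output:
40733f32cbdfa478f4a18734392fade0  ./metadata.docx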
To monitor the progress of this task, I use a simple combination of watch and wc -l to periodically (every 2 seconds, which is watch's default interval) list the number of lines in the "data-md5.txt" file so that I can see how close it is getting to that magic value of 20,000:
watch wc -l data-md5.txt
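If I wanted that 2-second interval to be explicit rather than relying on the default, the equivalent invocation would be:
watch -n 2 wc -l data-md5.txt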
2. Using awk and cut to make a comparable file format
Now that the report file ("data-md5.txt") has been generated, I will use the awk command to alter the structure of the file so that it matches the CSV (comma-separated values) manifest document:
awk '{ print $2 "," $1}' data-md5.txt > data-md5-working.csv
head data-md5-working.csv
./metadata.docx,40733f32cbdfa478f4a18734392fade0
./C04/Case04_T-2_ten_2000-02-16_18:00:00,7686390566c8b6311c71fd2183e3a158
./C04/Case04_T-2_nin_2000-02-16_00:00:00,94e5107a167bfdc9f61218497e3421c7
./C04/Case04_T-4_nin_2000-02-13_03:00:00,f83ec459f25fc7df195d74e3ac1bb56a
./C04/Case04_T-2_no-alter_2000-02-15_00:00:00,ecc4c6d132f1c8d998e8527ab2e73d01
./C04/Case04_T-4_ten_2000-02-12_06:00:00,06272c7795c77db5bfe4ae57a280243d
./C04/Case04_T-3_nin_2000-02-16_00:00:00,63e61168e951d10580dd94e6d05e5b17
./C04/Case04_T-1_fif_2000-02-17_18:00:00,223d2346a1d79783bd37d084fa3cd292
./C04/Case04_T-3_ten_2000-02-13_06:00:00,1907c01a341c76de3733061997eddc07
./C04/Case04_T-4_nin_2000-02-14_15:00:00,0844dd5c5ddb26e92dff62a911f2e64c
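One caveat worth noting: awk's default whitespace splitting means this transposition assumes no file path contains a space, since anything after an internal space would be lost from $2. If that assumption were shaky, a sed rewrite that keeps the entire remainder of the line as the path would be a safer sketch:
sed -E 's/^([0-9a-f]{32})  (.*)$/\2,\1/' data-md5.txt > data-md5-working.csv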
With the report file structure transformed into a CSV format ("data-md5-working.csv"), I will use the cut command to remove the leading directory prefix from each file path. In this case, I pass the -c 7- argument to cut to "character select" from character (or column) 7 to the end of the line:
cut -c 7- data-md5-working.csv > data-md5-report.csv
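Note that -c 7- hard-codes the assumption that every path begins with a six-character prefix such as "./C04/". A variant that strips "./" plus at most one leading directory component of any length would look something like this (a sketch, not the command I actually ran):
sed -E 's|^\./([^/]+/)?||' data-md5-working.csv > data-md5-report.csv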
The final product is now structurally identical to the manifest document.
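The transformed file is not reproduced here, but applying the cut to the sub-directory records from the sample above would yield lines like these:
Case04_T-2_ten_2000-02-16_18:00:00,7686390566c8b6311c71fd2183e3a158
Case04_T-2_nin_2000-02-16_00:00:00,94e5107a167bfdc9f61218497e3421c7
Case04_T-4_nin_2000-02-13_03:00:00,f83ec459f25fc7df195d74e3ac1bb56a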
3. Using sort and comm to compare the manifest to the report
A quick check of the line count of the "data-md5-report.csv" file from wc shows that it contains 3 more lines than the manifest file, so I know immediately that there are some differences:
wc -l data-md5-report.csv
19577 data-md5-report.csv
Now let's see if I can determine exactly what is different (hoping that there are no checksum variations). For a line-by-line comparison, it is necessary to sort each file so that both are in the same lexicographic order. For this, I can use sort:
sort data-manifest.csv > data-manifest-sorted.csv
sort data-md5-report.csv > data-md5-report-sorted.csv
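One precaution I did not need here: sort order depends on the current locale, and comm (used next) expects its inputs ordered under the same collation rules, warning if they are not. When in doubt, forcing a byte-wise sort with LC_ALL=C keeps everything consistent:
LC_ALL=C sort data-manifest.csv > data-manifest-sorted.csv
LC_ALL=C sort data-md5-report.csv > data-md5-report-sorted.csv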
With both files now saved with equivalent sort ordering, I can use comm to compare the two files to see if there are any variations (I know that there are some, based on the line-count mismatch):
comm -13 data-manifest-sorted.csv data-md5-report-sorted.csv
metadata.docx,40733f32cbdfa478f4a18734392fade0
data-manifest.xlsx,14d2def30bd89cf441f0cb1e1aa1ac55
output_variables.txt,8dbb4b06ccf6e714bf2a0af9358f0384
The comm command provides options to display the lines unique to each file, respectively, or the lines common to both. In this case, the -13 option suppresses column 1 (the lines unique to "data-manifest-sorted.csv") and column 3 (the lines common to both files); in other words, I see only the lines unique to "data-md5-report-sorted.csv".
Finally, I see that there are three additional records in "data-md5-report-sorted.csv", but it seems clear that they are not actual data files, so I can safely ignore them. Mostly, I wanted to be sure that the records in the report file (which capture both the data file names and their checksums) match those in the manifest file identically. They did!
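For completeness, the reverse check, comm -23, suppresses columns 2 and 3 and so shows only the lines unique to the sorted manifest; it would flag any manifest entries missing from the disk or carrying mismatched checksums. Given the result above, it should produce no output:
comm -23 data-manifest-sorted.csv data-md5-report-sorted.csv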