Non UTF-8 characters getting you down?

Need to identify non UTF-8 characters in a text file? Here's a fast and handy method using grep that prints every line containing one. Your terminal will typically render each invalid byte as the "�" glyph:

grep -axvn '.*' data.csv 
491:2012-08-23 18:10:00,"Lovers_Leap","KD",28,1,3,3,2,3,1,NA,"very stagnant and scummy, wouldn�t swim for money ",0
815:2013-07-05 12:00:00,"Barkwood_Point ","JM�",28,1.5,4,2,1,1,1,NA,"",0
1415:2015-05-28 16:30:00,"Barkwood_Point","SM",23,NA,4,3,1,1,1,1,"Secchi hit bottom; raining/thunderstorm�",1
2491:2018-07-08 18:00:00,"Shepaug","RW",28,3.75,4,3,1,1,1,1,"microcystis appears to be � can't read the rest of note",0
2522:2018-05-29 18:55:00,"RT133","GLB",24,2.5,4,2,1,1,1,1,"took 5/19 samples today, hot � can't read notes ",0
2598:2018-08-25 16:50:00,"RT133","GLB",26,1,2,4,1,3,1,1,"in am large � can't read notes",0

Let's unpack the -axvn '.*' options:

  1. a - Process the file as if it were text, even if grep would otherwise classify it as binary. Without this option, grep may notice the invalid bytes and print only a "Binary file data.csv matches" message instead of the lines themselves.
  2. x - Count a match only if the pattern matches the whole line.
  3. v - Invert the sense of matching, so that grep prints the lines that do not match.
  4. n - Prefix each line of output with its 1-based line number within the input file.
  5. '.*' - Match any sequence of zero or more characters. In a UTF-8 locale, "." matches only validly encoded characters, so a line containing an invalid byte sequence cannot match in full.

So, in simple terms, print out the lines of the file that contain a non UTF-8 character (assuming that your system locale default encoding is set to UTF-8 – for example: LANG=en_US.UTF-8).
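
To make "non UTF-8 character" a bit more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the stray byte is 0x92 (a curly apostrophe in Windows-1252); I haven't shown the actual bytes in data.csv, so treat that as a guess. The helper name is_valid_utf8 is made up for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumed example bytes: 0x92 is a curly apostrophe in Windows-1252,
# but on its own it is not a valid byte sequence in UTF-8.
bad_line = b"wouldn\x92t swim for money"
good_line = "wouldn\u2019t swim for money".encode("utf-8")

print(is_valid_utf8(bad_line))   # False
print(is_valid_utf8(good_line))  # True
```

A line like bad_line is exactly what '.*' fails to match in full, which is why grep -v prints it.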

Now, if you want to reproduce the grep output with Python, see below:

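# Decode as UTF-8, substituting U+FFFD (�) for any byte that is not valid UTF-8
with open("./data.csv", "r", encoding="utf-8", errors="replace") as f:
    lines = f.readlines()

# Print each line that contains the replacement character, in grep -n style
for index, line in enumerate(lines, start=1):
    if "�" in line:
        print(f"{index}:{line}", end="")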

The key in Python is to open the file with the errors="replace" argument (and, ideally, an explicit encoding="utf-8") so that the decoder substitutes the Unicode replacement character U+FFFD, displayed as �, for any byte sequence that is not valid UTF-8. Then I simply iterate over the lines and, for each one that contains �, print the line number and the line itself.
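
One caveat with searching for �: a file that legitimately contains a correctly encoded U+FFFD would be flagged as well. A possible alternative, sketched below, is to read the file as bytes and attempt a strict decode of each line, which also gives the position of the first offending byte. The function name find_invalid_utf8_lines and the sample bytes are my own inventions for illustration, not part of the grep recipe above.

```python
import os
import tempfile


def find_invalid_utf8_lines(path):
    """Yield (line_number, byte_offset, text) for each line that is not valid UTF-8."""
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first offending byte within the line.
                text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                yield number, exc.start, text


# Demo on a throwaway file; the sample bytes are made up for illustration.
sample = (
    b"ok line\n"
    + "legit \ufffd glyph\n".encode("utf-8")  # valid UTF-8, should not be reported
    + b"bad \x92 byte\n"                      # 0x92 alone is invalid UTF-8
)
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)

results = list(find_invalid_utf8_lines(tmp.name))
os.remove(tmp.name)

for number, offset, text in results:
    print(f"{number}:{text}  (first bad byte at offset {offset})")
```

Only the third line is reported here; the second line passes because its � is a properly encoded character rather than a decoding failure. To run this on a real file, call find_invalid_utf8_lines("data.csv") instead of using the temporary file.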