Non UTF-8 characters getting you down?
Need to identify non UTF-8 characters in a text file? Here's a fast and handy method using grep
that highlights non UTF-8 characters with the "�" glyph:
grep -axvn '.*' data.csv
491:2012-08-23 18:10:00,"Lovers_Leap","KD",28,1,3,3,2,3,1,NA,"very stagnant and scummy, wouldn�t swim for money ",0
815:2013-07-05 12:00:00,"Barkwood_Point ","JM�",28,1.5,4,2,1,1,1,NA,"",0
1415:2015-05-28 16:30:00,"Barkwood_Point","SM",23,NA,4,3,1,1,1,1,"Secchi hit bottom; raining/thunderstorm�",1
2491:2018-07-08 18:00:00,"Shepaug","RW",28,3.75,4,3,1,1,1,1,"microcystis appears to be � can't read the rest of note",0
2522:2018-05-29 18:55:00,"RT133","GLB",24,2.5,4,2,1,1,1,1,"took 5/19 samples today, hot � can't read notes ",0
2598:2018-08-25 16:50:00,"RT133","GLB",26,1,2,4,1,3,1,1,"in am large � can't read notes",0
Let's unpack the -axvn '.*' options:
a
- Process the file as if it were text (even if grep would otherwise treat it as binary), so that the matching lines are printed in full despite the offending bytes.
x
- Select only those matches that exactly match the whole line; with the '.*' pattern, that means lines made up entirely of valid UTF-8 characters.
v
- Invert the sense of matching to select non-matching lines, thus displaying the lines that do not meet the criteria.
n
- Prefix each line of output with the 1-based line number within its input file.
'.*'
- Match any sequence of characters. (In a UTF-8 locale, a byte that is not valid UTF-8 does not count as a character, so it is not matched.)
So, in simple terms: print the lines of the file that contain a non UTF-8 character (assuming that your system locale's default encoding is set to UTF-8, for example LANG=en_US.UTF-8).
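To try the command on a known input, a small test file with one deliberately invalid byte can be generated. The sketch below is only an illustration: the file name sample.csv, its rows, and the helper names write_sample and is_valid_utf8 are made up here, and the byte 0x92 (the right single quotation mark in Windows-1252) is just one example of a byte that is not valid UTF-8 on its own.

```python
from pathlib import Path


def write_sample(path):
    """Write a three-line file whose second line holds one invalid UTF-8 byte."""
    rows = [
        b'1,"clear water"\n',
        b'2,"wouldn\x92t swim"\n',  # 0x92 on its own is not valid UTF-8
        b'3,"caf\xc3\xa9 dock"\n',  # a valid two-byte UTF-8 sequence
    ]
    Path(path).write_bytes(b"".join(rows))


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    write_sample("sample.csv")
    lines = Path("sample.csv").read_bytes().splitlines()
    for number, raw in enumerate(lines, start=1):
        print(number, is_valid_utf8(raw))
```

In a UTF-8 locale, running grep -axvn '.*' sample.csv on the generated file should then report only line 2.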
Now, if you want to simulate the grep output with Python, see below:
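# Decode as UTF-8; invalid bytes become the replacement character "�"
with open("./data.csv", "r", encoding="utf-8", errors="replace") as f:
    lines = f.readlines()
# Number lines from 1, as grep -n does
for index, line in enumerate(lines, start=1):
    if "�" in line:
        # Each line keeps its own newline, so suppress print's
        print(f"{index}:{line}", end="")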
The key in Python is to open the file with the errors="replace" argument so that any byte sequence that is not valid UTF-8 is replaced with the Unicode replacement character (U+FFFD), which displays as �. Then I simply iterate over the lines, looking for any that contain �, and print the line number and the line itself.
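One caveat with the replacement approach is that a line which legitimately contains the � character (U+FFFD) as valid UTF-8 would also be reported. A minimal alternative sketch, assuming the file is meant to be UTF-8, reads the file in binary mode and attempts a strict decode of each line. The function name find_invalid_utf8_lines is made up for illustration.

```python
def find_invalid_utf8_lines(path):
    """Yield (line_number, byte_offset, text) for each line that is not valid UTF-8."""
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on invalid bytes
            except UnicodeDecodeError as err:
                # err.start is the offset of the first invalid byte within the line
                text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                yield number, err.start, text
```

Looping over find_invalid_utf8_lines("./data.csv") and printing each number and text gives output in the same spirit as the grep command, with the byte offset available as an extra hint about where in the line the problem sits.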