Load high, who's hitting us?...

We operate a fairly large open environmental data repository, which seems to make us a good target for scrapers. When they hit us at the same time as Google's crawler, our system load peaks above our alarm threshold. When this happens, I am always curious about where the scraping comes from, including per-address request counts to see whether the sources cluster. Here is a simple one-liner for a quick peek at those statistics:

grep -E -o '^([0-9]{1,3}\.){3}[0-9]{1,3}' access.log | sort | uniq -c | sort -n

It chains four steps: an extended grep regex (-E) that prints only the matched text (-o), here the IP address at the start of each line; a sort of the raw addresses so that identical ones sit next to each other, which uniq requires; uniq with the count option (-c); and an ascending numerical sort (-n) on the counts, each listed with its IP address.
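As a minimal sketch of the same pipeline run end to end, the snippet below builds a tiny made-up log in a temporary file and reverses the final sort so the busiest addresses come first, keeping only the top two with head. The log entries and the variable names (log, top) are assumptions for illustration; the addresses come from the reserved documentation ranges, not from real traffic.

```shell
#!/bin/sh
# Hypothetical sample log; every entry here is invented for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512
192.0.2.10 - - [01/Jan/2024:00:00:02 +0000] "GET /data HTTP/1.1" 200 2048
198.51.100.7 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 200 512
192.0.2.10 - - [01/Jan/2024:00:00:04 +0000] "GET /robots.txt HTTP/1.1" 404 0
203.0.113.5 - - [01/Jan/2024:00:00:05 +0000] "GET / HTTP/1.1" 200 512
198.51.100.7 - - [01/Jan/2024:00:00:06 +0000] "GET /data HTTP/1.1" 200 2048
EOF

# Extract the leading IPv4 address, count occurrences per address,
# sort by count in descending order (-rn), and keep the two busiest.
top=$(grep -E -o '^([0-9]{1,3}\.){3}[0-9]{1,3}' "$log" | sort | uniq -c | sort -rn | head -n 2)
printf '%s\n' "$top"

rm -f "$log"
```

Sorting in descending order and trimming with head is handy when the log is large and only the heaviest clients matter; the ascending sort in the original one-liner achieves the same thing by leaving them at the bottom of the terminal.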

The final outcome looks like this:

1    104.208.25.229
1    104.208.31.220
1    104.208.32.116
1    104.208.32.158
1    104.208.32.245
1    104.208.32.248
1    104.208.33.161
1    104.208.33.58
1    104.208.34.140
1    104.208.34.172
.
.
.
51   44.206.229.210
55   13.58.19.146
55   34.231.43.114
61   54.200.250.148
78   52.205.195.10
83   66.249.69.42
85   34.236.26.31
199  108.0.0.0
780  66.249.69.40
1915 105.0.0.0
8999 66.249.69.38

The biggest hit is Google (66.249.69.38), followed by a host in South Africa (105.0.0.0).*

*How do I know the origin of these addresses, you ask? I use a Python tool I wrote called ip-lookup. Google's address looks like this:

ip-lookup 66.249.69.38
ip: 66.249.69.38
hostname: crawl-66-249-69-38.googlebot.com
type: ipv4
continent_code: NA
continent_name: North America
country_code: US
country_name: United States
region_code: FL
region_name: Florida
city: Valrico
zip: 33587
latitude: 27.963499069213867
longitude: -82.206298828125
location: {'geoname_id': 4176318, 'capital': 'Washington D.C.', 'languages': [{'code': 'en', 'name': 'English', 'native': 'English'}], 'country_flag': 'https://assets.ipstack.com/flags/us.svg', 'country_flag_emoji': '🇺🇸', 'country_flag_emoji_unicode': 'U+1F1FA U+1F1F8', 'calling_code': '1', 'is_eu': False}

The South African address looks like this:

ip-lookup 105.0.0.0
ip: 105.0.0.0
hostname: 105.0.0.0
type: ipv4
continent_code: AF
continent_name: Africa
country_code: ZA
country_name: South Africa
region_code: GT
region_name: Gauteng
city: Johannesburg
zip: 2000
latitude: -26.199169158935547
longitude: 28.0563907623291
location: {'geoname_id': 993800, 'capital': 'Pretoria', 'languages': [{'code': 'af', 'name': 'Afrikaans', 'native': 'Afrikaans'}, {'code': 'en', 'name': 'English', 'native': 'English'}, {'code': 'nr', 'name': 'South Ndebele', 'native': 'isiNdebele'}, {'code': 'st', 'name': 'Southern Sotho', 'native': 'Sesotho'}, {'code': 'ss', 'name': 'Swati', 'native': 'SiSwati'}, {'code': 'tn', 'name': 'Tswana', 'native': 'Setswana'}, {'code': 'ts', 'name': 'Tsonga', 'native': 'Xitsonga'}, {'code': 've', 'name': 'Venda', 'native': 'Tshivenḓa'}, {'code': 'xh', 'name': 'Xhosa', 'native': 'isiXhosa'}, {'code': 'zu', 'name': 'Zulu', 'native': 'isiZulu'}], 'country_flag': 'https://assets.ipstack.com/flags/za.svg', 'country_flag_emoji': '🇿🇦', 'country_flag_emoji_unicode': 'U+1F1FF U+1F1E6', 'calling_code': '27', 'is_eu': False}
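Per-address counts can hide a cluster: the top of the listing shows many 104.208.x.x addresses with one hit each, and Google's crawler appears under several 66.249.69.x addresses. One rough way to surface that, sketched here under the assumption that grouping by the first three octets (a /24-sized prefix) is a good enough proxy for "same network", is to truncate each address with cut before counting. The sample log and the variable names (log, clusters) are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample log; every entry here is invented for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
198.51.100.11 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 200 100
198.51.100.12 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 200 100
198.51.100.13 - - [01/Jan/2024:00:00:03 +0000] "GET /c HTTP/1.1" 200 100
203.0.113.9 - - [01/Jan/2024:00:00:04 +0000] "GET /a HTTP/1.1" 200 100
203.0.113.9 - - [01/Jan/2024:00:00:05 +0000] "GET /b HTTP/1.1" 200 100
192.0.2.44 - - [01/Jan/2024:00:00:06 +0000] "GET /a HTTP/1.1" 200 100
EOF

# Extract the leading IPv4 address, keep only its first three octets,
# then count hits per prefix with the busiest prefix first.
clusters=$(grep -E -o '^([0-9]{1,3}\.){3}[0-9]{1,3}' "$log" \
  | cut -d. -f1-3 \
  | sort | uniq -c | sort -rn)
printf '%s\n' "$clusters"

rm -f "$log"
```

In this sample, three different addresses that each appear once collapse into a single prefix with a count of three, which is the pattern a per-address listing would miss. Real networks are not always allocated on /24 boundaries, so treat the grouping as a first look rather than an attribution.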