Load high, who's hitting us?...
A fairly large open environmental data repository seems to be a good target for scrapers, and we operate one. When scrapers hit us at the same time as Google, our system load begins to peak above our alarm threshold. When this happens, I am always curious about the source of the scraping, including per-address request counts that show whether there is a cluster of sources. Here is a simple one-liner to get a quick peek at those statistics:
grep -E -o "^([0-9]{1,3}[\.]){3}[0-9]{1,3}" access.log | sort | uniq -c | sort -n
It's a combination of an extended grep regex to get a list of only the IP addresses, a sort of the raw address data to provide fodder for uniq, in which I specify the count option, and then an ascending numerical sort of the count values, along with the corresponding IP addresses.
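For readers who would rather do the counting in a script, the same three steps (extract the leading address, count duplicates, sort ascending by count) can be sketched in Python. This is only an illustration under the assumption that each log line starts with an IPv4 address, as the grep pattern expects; the function name count_client_ips is hypothetical and is not part of any tool mentioned in this post.

```python
import re
from collections import Counter

# Same pattern as the grep one-liner: a dotted quad at the start of the line.
LEADING_IPV4 = re.compile(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}")


def count_client_ips(lines):
    """Count how often each leading IPv4 address appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = LEADING_IPV4.match(line)
        if match:  # lines without a leading IPv4 address are skipped
            counts[match.group(0)] += 1
    # Ascending by count; ties are ordered by the address text.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

Passing an open file works because a file object iterates over its lines, for example count_client_ips(open("access.log")), with access.log being the same file name used in the one-liner.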
The final outcome looks like this:
1 104.208.25.229
1 104.208.31.220
1 104.208.32.116
1 104.208.32.158
1 104.208.32.245
1 104.208.32.248
1 104.208.33.161
1 104.208.33.58
1 104.208.34.140
1 104.208.34.172
.
.
.
51 44.206.229.210
55 13.58.19.146
55 34.231.43.114
61 54.200.250.148
78 52.205.195.10
83 66.249.69.42
85 34.236.26.31
199 108.0.0.0
780 66.249.69.40
1915 105.0.0.0
8999 66.249.69.38
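The many single-hit 104.208.x.x addresses at the top of this listing hint that per-address counts can hide a cluster spread over one network. A minimal sketch of summing the counts per network prefix follows, using Python's standard ipaddress module. The /16 prefix length is an arbitrary assumption chosen for illustration, and count_by_network is a hypothetical helper name.

```python
import ipaddress
from collections import Counter


def count_by_network(ip_counts, prefix_len=16):
    """Sum per-address hit counts into their enclosing IPv4 networks."""
    totals = Counter()
    for ip, hits in ip_counts:
        # strict=False accepts a host address and returns its network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        totals[str(network)] += hits
    # Largest totals first.
    return totals.most_common()


# A few rows taken from the listing above, as (address, hits) pairs.
rows = [
    ("104.208.25.229", 1),
    ("104.208.31.220", 1),
    ("104.208.32.116", 1),
    ("66.249.69.42", 83),
    ("66.249.69.40", 780),
    ("66.249.69.38", 8999),
]
for network, hits in count_by_network(rows):
    print(hits, network)
```

With these six rows, the three Google crawler addresses collapse into 66.249.0.0/16 and the three 104.208.x.x addresses into 104.208.0.0/16.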
The biggest hit is Google (66.249.69.38), followed by a host in South Africa (105.0.0.0).*
*How do I know the origin of these addresses, you ask? I use a Python tool I wrote called ip-lookup. Google's address looks like this:
ip-lookup 66.249.69.38
ip: 66.249.69.38
hostname: crawl-66-249-69-38.googlebot.com
type: ipv4
continent_code: NA
continent_name: North America
country_code: US
country_name: United States
region_code: FL
region_name: Florida
city: Valrico
zip: 33587
latitude: 27.963499069213867
longitude: -82.206298828125
location: {'geoname_id': 4176318, 'capital': 'Washington D.C.', 'languages': [{'code': 'en', 'name': 'English', 'native': 'English'}], 'country_flag': 'https://assets.ipstack.com/flags/us.svg', 'country_flag_emoji': '🇺🇸', 'country_flag_emoji_unicode': 'U+1F1FA U+1F1F8', 'calling_code': '1', 'is_eu': False}
The South African address looks like this:
ip-lookup 105.0.0.0
ip: 105.0.0.0
hostname: 105.0.0.0
type: ipv4
continent_code: AF
continent_name: Africa
country_code: ZA
country_name: South Africa
region_code: GT
region_name: Gauteng
city: Johannesburg
zip: 2000
latitude: -26.199169158935547
longitude: 28.0563907623291
location: {'geoname_id': 993800, 'capital': 'Pretoria', 'languages': [{'code': 'af', 'name': 'Afrikaans', 'native': 'Afrikaans'}, {'code': 'en', 'name': 'English', 'native': 'English'}, {'code': 'nr', 'name': 'South Ndebele', 'native': 'isiNdebele'}, {'code': 'st', 'name': 'Southern Sotho', 'native': 'Sesotho'}, {'code': 'ss', 'name': 'Swati', 'native': 'SiSwati'}, {'code': 'tn', 'name': 'Tswana', 'native': 'Setswana'}, {'code': 'ts', 'name': 'Tsonga', 'native': 'Xitsonga'}, {'code': 've', 'name': 'Venda', 'native': 'Tshivenḓa'}, {'code': 'xh', 'name': 'Xhosa', 'native': 'isiXhosa'}, {'code': 'zu', 'name': 'Zulu', 'native': 'isiZulu'}], 'country_flag': 'https://assets.ipstack.com/flags/za.svg', 'country_flag_emoji': '🇿🇦', 'country_flag_emoji_unicode': 'U+1F1FF U+1F1E6', 'calling_code': '27', 'is_eu': False}
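The hostname field is the quickest tell in both results: the first address resolves to a name under googlebot.com, while the second shows only the address itself, which would be consistent with no reverse DNS record being found (how ip-lookup fills that field is not shown here). A reverse lookup alone can be sketched with Python's standard socket module. The function names below are hypothetical and this is not the ip-lookup implementation. Note also that a full Googlebot check, as I understand Google's guidance, includes a forward lookup confirming that the name resolves back to the same address, which this sketch omits.

```python
import socket


def reverse_hostname(ip, resolver=socket.gethostbyaddr):
    """Return the reverse DNS name for ip, or ip itself if none resolves."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return ip
    return hostname


def looks_like_googlebot(hostname):
    """Check whether a hostname sits under googlebot.com or google.com."""
    return hostname.endswith((".googlebot.com", ".google.com"))
```

The resolver parameter exists only so the lookup can be swapped out, for example to run the function without network access.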