Periodically I check the Apache error and access logs looking for things I can fix or improve with a little work. Today I noticed a few errors for bad URLs coming from unfamiliar IP addresses. The error logged was:
(36)File name too long: [client xx.xx.xx.xx:xxxxx] AH00036: access to ….
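If you want to pull these entries out of the log in bulk, a short script will do it. This is just a sketch assuming the stock Apache 2.4 error-log line shape; the sample line below is fabricated for illustration, and the regex may need tweaking for your LogFormat.

```python
import re

# Rough match for an Apache 2.4 "AH00036" error-log line, e.g.:
#   [...timestamp...] [core:error] [pid 1234] (36)File name too long:
#   [client 203.0.113.5:54321] AH00036: access to /some/long/path failed
AH00036 = re.compile(
    r"\[client (?P<ip>[0-9a-fA-F.:]+):\d+\] AH00036: access to (?P<path>\S+)"
)

def long_name_errors(lines):
    """Yield (client_ip, requested_path) for each AH00036 entry."""
    for line in lines:
        m = AH00036.search(line)
        if m:
            yield m.group("ip"), m.group("path")

# Fabricated sample line for demonstration only.
sample = [
    "[Mon Oct 02 12:00:00.000000 2023] [core:error] [pid 1234] "
    "(36)File name too long: [client 203.0.113.5:54321] "
    "AH00036: access to /wp-content/_uploads,_themes,/x failed",
]
hits = list(long_name_errors(sample))
```

Feeding the whole error log through this gives you a quick list of which clients are sending the oversized paths.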
The filename is indeed way too long, but it is also very peculiar. It starts with a valid WordPress folder name but appends a long string of folder names, each prefixed with an underscore and suffixed with a comma. I checked this against the actual folders, and it matches existing paths in my WordPress instance. The garbled path ends in a real file name. To make it even weirder, the garbled path is followed by a plus sign and then another garbled path, multiple times over. All together it’s about 800 characters long. WTF?…
My first thought was that this was some kind of lame buffer-overflow attempt. But it looked too regular, and it was based on real information extracted from my site’s file system (although all WordPress instances are pretty much the same). A reverse IP lookup identified the client as a Googlebot. Then I looked at the access log: a cluster of similar IPs had hit the site around that time, preceded by fetches of robots.txt. Nothing unusual at the start, but as time went on the URLs got more bizarre, until they got too long.
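A reverse lookup alone is spoofable, since anyone can point a PTR record at a googlebot.com-looking name. Google’s documented verification is to do the reverse lookup, check the hostname is under googlebot.com or google.com, then forward-resolve that hostname and confirm it returns the original IP. A sketch of that check (the function names are my own):

```python
import socket

# Google's documented crawler domains for reverse-DNS verification.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_looks_like_google(hostname: str) -> bool:
    """Check a reverse-DNS name against Google's crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_verified_googlebot(ip: str) -> bool:
    """Reverse lookup, check the domain, then forward-confirm the IP.

    Makes network calls, so this belongs in an offline script,
    not in a hot request path.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS (PTR)
    except socket.herror:
        return False
    if not hostname_looks_like_google(hostname):
        return False
    try:
        # Forward-confirm: the name must resolve back to the same IP,
        # otherwise anyone controlling their own PTR record could spoof it.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

Running `is_verified_googlebot` over the suspicious IPs from the access log tells you whether the traffic really came from Google’s crawl infrastructure.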
If I had to guess (and this is only a guess), I would say the programmer of the bot got string buffers mixed up or forgot to reset one before reusing it. Which leads me to wonder whether Google starts its junior programmers out on the bot team. Do they do regression testing, or are they so committed to streamlining DevOps that they don’t bother?
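For what it’s worth, that failure mode is easy to reproduce. Here is a deliberately buggy toy crawler (pure invention on my part, obviously not Google’s code) that builds request paths in a shared buffer and never clears it between pages, so each URL drags along the segments of every previous one until the server rejects the name as too long:

```python
class BuggyCrawler:
    """Toy crawler that reuses a path buffer without resetting it."""

    def __init__(self):
        self._buf = []  # shared scratch buffer, never cleared -- the bug

    def build_url(self, segments):
        # Correct code would call self._buf.clear() here. Without it,
        # every request accumulates the segments of all prior requests.
        self._buf.extend(segments)
        return "/" + "/".join(self._buf)

crawler = BuggyCrawler()
first = crawler.build_url(["wp-content", "uploads"])
second = crawler.build_url(["wp-includes", "css"])
# 'second' now drags along the first request's folders:
# /wp-content/uploads/wp-includes/css
```

A few hundred pages into a crawl, a buffer like this would easily produce the ~800-character monsters in my log, built entirely from real paths on the site.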