How do I find a list of bad spiders and crawlers?

deepee
Newbie
Posts: 5
Registered: ‎01-01-2008

The tutorials on Plusnet usertools are good value.
http://usertools.plus.net/tutorials/id/5 and http://usertools.plus.net/tutorials/id/48
make good sense for keeping unwanted site traffic down. Fine, I now know how to keep GoogleImage's paws off my pics, how to stop other sites using my bandwidth to serve my binaries, and quite a bit more. But can anyone please tell me where I can find a list of ill-behaved spiders, so that I can put them into my robots.txt and .htaccess files to keep them out?
Wikipedia has a list in its robots.txt file, but I imagine that someone somewhere keeps an up-to-date and reasonably reliable list.
4 REPLIES
RobDickson
Grafter
Posts: 653
Thanks: 3
Registered: ‎06-08-2007

Re: How do I find a list of bad spiders and crawlers?

There are some useful links at the end of http://en.wikipedia.org/wiki/Spambot.
Adam1V
Grafter
Posts: 223
Registered: ‎31-07-2007

Re: How do I find a list of bad spiders and crawlers?

Why would you want to keep Google away? It's one of the top search engines and can bring you traffic, and you want to close the doors to it?
I personally use a sitemap, which I upload to Google Webmaster Tools so that Google knows when I've made changes.
RobDickson
Grafter
Posts: 653
Thanks: 3
Registered: ‎06-08-2007

Re: How do I find a list of bad spiders and crawlers?

I haven't read the articles that Deepee referred to, but I assume that he wants to use .htaccess to stop Google (and anybody else) using up his bandwidth. I assume he's not using robots.txt to keep Google away completely.
Google seems to have made some improvements over Christmas - I've found that Google indexes my site within a minute of me making any changes.
deepee
Newbie
Posts: 5
Registered: ‎01-01-2008

Re: How do I find a list of bad spiders and crawlers?

The idea was indeed to let Google and others into the HTML files but to keep the GoogleImage bot, and others that would trough the images, out. Robots.txt will work with well-behaved bots, such as Google, but apparently not with the badly behaved ones, which simply ignore it.
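For anyone wanting to check a robots.txt before uploading it, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and the `example.com` URLs are illustrative assumptions, not taken from the tutorials; `Googlebot-Image` is the user-agent token Google's image crawler identifies itself with. It only shows what a well-behaved bot would conclude from the file, since a badly behaved one never asks.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: keep Google's image crawler out entirely,
# leave everything open to all other crawlers.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host.
    print(can_fetch("Googlebot-Image", "http://example.com/pics/photo.jpg"))
    print(can_fetch("Googlebot", "http://example.com/index.html"))
```

An empty `Disallow:` line means "nothing is disallowed", which is why the catch-all group still lets the ordinary Googlebot in.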
While keeping the bad bots out with .htaccess, it's also possible to prevent 'hotlinking'. http://usertools.plus.net/tutorials/id/48 gives a good introduction.
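To make the two checks concrete, here is a rough Python sketch of the decision an .htaccess rule set of this kind makes for each request. It is not Apache configuration and not the tutorial's method, only the logic spelled out. The bot names, host names and the `should_block` helper are all hypothetical placeholders; a real list of bad bots would have to come from a source you trust.

```python
from urllib.parse import urlparse

# Hypothetical placeholders: substitute user-agent fragments from a real list.
BLOCKED_AGENT_SUBSTRINGS = ("examplebadbot", "examplescraper")
# Hypothetical placeholders: the host names your own pages are served from.
OWN_HOSTS = {"example.com", "www.example.com"}
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def should_block(path, user_agent="", referer=""):
    """Return True if the request should be refused."""
    # 1. Bad bots: refuse any request whose User-Agent contains a listed name.
    agent = user_agent.lower()
    if any(name in agent for name in BLOCKED_AGENT_SUBSTRINGS):
        return True
    # 2. Hotlinking: refuse image requests referred from another site.
    #    An empty Referer is let through, since many browsers and proxies omit it.
    if path.lower().endswith(IMAGE_EXTENSIONS) and referer:
        host = urlparse(referer).hostname or ""
        if host not in OWN_HOSTS:
            return True
    return False
```

Bear in mind that both headers are supplied by the client, so a determined bot can fake them; this keeps out the lazy ones rather than all of them.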