
robots.txt confirmation...

Cathotel
Grafter
Posts: 129
Registered: ‎05-10-2007

robots.txt confirmation...

Hi,
Having researched robots.txt, I've placed the following robots.txt on my webspace:
# Block Googlebot
User-agent: googlebot
Disallow: /
I also noted that the below would block all 'friendly' bots, but as the only ones visible in my stats appear to be Google related, I wanted to see just them go, to have confidence it was working.
# Block all bots
User-agent: *
Disallow: /
My question: if my stats show, for example, a hostname entry of crawl-66-249-72-138.googlebot.com, is my 'User-agent: googlebot' entry sufficient, as the help info below states, or do I have to specify the whole string? I have four or five variations of the crawl entries.
"This is nowhere near as scary-looking as .htaccess. The User-agent value is a partial match for the requesting browser. Each agent is listed together with a list of disallowed files."
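The partial-match behaviour that help text describes can be checked offline with Python's built-in robots.txt parser. A quick sketch (example.com is just a placeholder):

```python
from urllib import robotparser

# The rules from the post above.
rules = """\
User-agent: googlebot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The User-agent value is matched against the bot's product token,
# not against crawl hostnames, so one 'googlebot' entry covers every
# crawl-*.googlebot.com host.
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))         # False: blocked
print(rp.can_fetch("Googlebot-Image/1.0", "http://example.com/pic.jpg"))  # False: partial match
print(rp.can_fetch("SomeOtherBot", "http://example.com/index.html"))      # True: not listed
```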
Any advice would be very much appreciated.
Pete
7 REPLIES
Cathotel
Grafter
Posts: 129
Registered: ‎05-10-2007

Re: robots.txt confirmation...

My confusion is/was trying to relate the hostname entries in the log (the ones that show the high kB F usage) to the User-Agent names in the separate sections of the stats summary.
I realise now that crawl-66-249-72-138.googlebot.com is a hostname and I only have Googlebot-Image/1.0 showing in my top 15 stats.
I guess I need to get the full stats to relate the two, to know which User-Agent is taking up my Daily Transfer Allowance, i.e. is it Googlebot or Googlebot-Image.
Big assumption on my part here: that the hostnames would be different, i.e. that crawl-66-249-72-138.googlebot.com belongs to Googlebot and not Googlebot-Image?
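As far as I know, the crawl-*.googlebot.com hosts are shared by all of Google's crawlers, so the hostname alone won't distinguish Googlebot from Googlebot-Image; only the User-Agent header in the full logs will. What the hostname can tell you is whether a crawler claiming to be Google really is, via the reverse-then-forward DNS check Google recommends. A rough Python sketch (the function names are my own):

```python
import socket

def looks_like_google_host(host):
    """True if a reverse-DNS name sits under one of Google's crawl domains."""
    return host.endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """The check Google recommends: reverse-resolve the IP, confirm the
    name is under googlebot.com/google.com, then forward-resolve that
    name and confirm it maps back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not looks_like_google_host(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# verify_googlebot("66.249.72.138") needs live DNS, but the name check
# alone already filters obvious impostors:
print(looks_like_google_host("crawl-66-249-72-138.googlebot.com"))  # True
print(looks_like_google_host("host81-132-0-1.btcentralplus.com"))   # False
```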
Until I understand it better, I guess I'm better off using User-agent: *
Pete
Superuser
Superuser
Posts: 9,926
Thanks: 1,265
Fixes: 71
Registered: ‎06-04-2007

Re: robots.txt confirmation...

If a reputable robot is repeatedly taking up a significant amount of your daily allowance I think that suggests your files change frequently (eg blogs or active forums), so need indexing continuously. Disreputable robots could be downloading the complete website, perhaps seeking email addresses, but they wouldn't take any notice of robots.txt. Have you come across this Google explanation of robots.txt?
Image hotlinking is a potential user of bandwidth. Do you have any images on your site that other webmasters might be tempted to hotlink to?
David
Cathotel
Grafter
Posts: 129
Registered: ‎05-10-2007

Re: robots.txt confirmation...

Hi spraxyt,
Thanks for the info.
Saw that from another angle and applied User-agent: * for now.  The website itself is quite small and, being honest, attracts little interest, as it was for my partner's small catering project.
She then found eBay, and has a lot of referenced pictures on the webspace; some are fixed, but many are added each week.  You'd probably have to be quite desperate to hotlink old collectibles/vintage stuff?
It had been the same for 3 years, then suddenly, two Sundays in a row, we exceeded the DTA, first by 22MB (250 allowed) and the next week by 65MB, and got archived.
Never had the stats turned on until it was restored, and it was surprising to find so much going on outside of our uploading etc.
From the PN webalizer summary reports I can't see any tangible link between the hostname table showing the highest kB F usage and the User-Agent listing, which only shows the top 15.  In addition to the googlebots there are 3 or 4 btcentralplus hostnames that consistently use more daily bandwidth than we do, but no clue as to their User-Agent.
I probably need to move into full stats access/analysis, or move to another paid host that backs up and doesn't gripe when the daily average of 70,000 kB F spikes twice a week to 320,000 - 350,000 and kills you.
Many thanks once again..
Pete

Midnight_Caller
Rising Star
Posts: 4,143
Thanks: 7
Fixes: 1
Registered: ‎15-04-2007

Re: robots.txt confirmation...

Try using a .htaccess file; this is mine as an example:

RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dhea-forum\.org\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?cpa\.net84\.net [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?computersphonesaccessories\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?forum\.dhea\.org\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dyslexia\.plushost\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?community\.plus\.net [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?usergroup\.plus\.net [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?skyuserradio\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?skyuser\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?tdadyslexia\.plus\.com [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dyslexia\.f9\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dhea\.org\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?cableforum\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?bigarchive\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^https://(.+\.)?nodpi\.org [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?nodpi\.org [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?jekyllandhyde\.xtreemhost\.com [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteRule .*\.(jpe?g|gif|bmp|png|pdf|doc|mid|wav|mp3)$ - [NC,F,L]

Important: the file name must be

.htaccess

not "htaccess" or "htaccess.txt". You can use Notepad to compose the file. This will stop hotlinking to your files.
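If it helps to see what that whitelist actually does: every RewriteCond must fail to match (the ! negates, [NC] makes it case-insensitive) before the rule fires, and the !^$ line exempts requests with no referer at all. A rough Python mirror of that logic, using a trimmed two-entry whitelist and a made-up hotlink_blocked helper:

```python
import re

# Two referers allowed to load images (a trimmed version of the
# whitelist in the .htaccess above); [NC] in Apache means the match
# is case-insensitive, hence re.IGNORECASE.
ALLOWED = [
    r"^http://(.+\.)?dhea-forum\.org\.uk",
    r"^http://(.+\.)?community\.plus\.net",
]

def hotlink_blocked(referer):
    """Mirror of the rewrite logic: an empty referer is allowed (the
    !^$ condition fails for it), a whitelisted referer is allowed,
    anything else is blocked."""
    if referer == "":
        return False
    return not any(re.match(p, referer, re.IGNORECASE) for p in ALLOWED)

print(hotlink_blocked(""))                                   # False: direct requests pass
print(hotlink_blocked("http://www.dhea-forum.org.uk/page"))  # False: whitelisted
print(hotlink_blocked("http://auction-site.example/item"))   # True: blocked
```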
Hope this helps.

Superuser
Superuser
Posts: 9,926
Thanks: 1,265
Fixes: 71
Registered: ‎06-04-2007

Re: robots.txt confirmation...

Quote from: Cathotel
From the PN webalizer summary reports I can't see any tangible link between the hostname table showing the highest kB F usage and the User-Agent listing, which only shows the top 15.  In addition to the googlebots there are 3 or 4 btcentralplus hostnames that consistently use more daily bandwidth than we do, but no clue as to their User-Agent.

The occurrence of significant btcentralplus hostnames using a lot of bandwidth sounds to me like a symptom of hotlinking (perhaps via eBay). As you say, though, the webalizer summary doesn't give sufficient information for a complete understanding.
Quote from: Cathotel
I probably need to move into full stats access/analysis…

Full stats would show you line by line each request including Hostname, Referer and User-Agent, the status returned (which might be 304 'not modified') and the size of the object returned. I think you do need that detail. For bots you will normally find the first request is for robots.txt, with what follows depending on the existence and content of that file.
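To give an idea of what those full stats enable, here's a minimal sketch assuming the logs arrive in Apache "combined" format (hostname, timestamp, request, status, bytes, referer, user-agent); it sums bytes per User-Agent so the heaviest consumers of the transfer allowance stand out. The sample lines are invented, not real log data:

```python
import re
from collections import defaultdict

# Apache "combined" log format, one request per line.
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def bytes_per_agent(lines):
    """Sum the response sizes for each User-Agent; '-' sizes (e.g. on
    304 'not modified' responses) transfer no body and are skipped."""
    totals = defaultdict(int)
    for line in lines:
        m = LINE.match(line)
        if m and m.group("bytes") != "-":
            totals[m.group("agent")] += int(m.group("bytes"))
    return dict(totals)

# Hypothetical sample lines:
sample = [
    'crawl-66-249-72-138.googlebot.com - - [20/Dec/2010:10:00:00 +0000] '
    '"GET /pic.jpg HTTP/1.1" 200 52000 "-" "Googlebot-Image/1.0"',
    'host1.btcentralplus.com - - [20/Dec/2010:10:01:00 +0000] '
    '"GET /pic.jpg HTTP/1.1" 304 - "-" "Mozilla/5.0"',
]
print(bytes_per_agent(sample))  # {'Googlebot-Image/1.0': 52000}
```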
David
Cathotel
Grafter
Posts: 129
Registered: ‎05-10-2007

Re: robots.txt confirmation...

Gents,
Many thanks for both inputs. I know from my IT past (gave it up 2 yrs ago after 11 yrs) that I will end up getting sucked into .htaccess and all that follows, so I'm grateful for a practical example.
David, it was a struggle to get the raw log data enabled, and Matt Taylor via Twitter kindly stepped in and pointed support in the right direction.  Fingers crossed, they should appear tomorrow; then I can start following the guide and install/analyse the logs.
It's nice to find a forum where there is genuine help from experts; it's much appreciated.
Regards and Happy Christmas!
Pete 
Plusnet Alumni (retired) orbrey
Plusnet Alumni (retired)
Posts: 10,540
Registered: ‎18-07-2007

Re: robots.txt confirmation...

Hi Cathotel,
Glad I was able to help, and hope the logs are now showing for you - please let us know if not.