robots.txt confirmation...
18-12-2011 1:35 PM
Hi,
Having researched robots.txt I've placed the following robots.txt on my webspace
# Block Googlebot
User-agent: googlebot
Disallow: /
I also noted that the following would block all 'friendly' bots, but as the only ones visible in my stats appear to be Google-related, I wanted to see just them go, to have confidence it was working.
# Block all Bots
User-agent: *
Disallow: /
My question is: if my stats show, for example, a hostname entry of crawl-66-249-72-138.googlebot.com, is my 'User-agent: googlebot' entry sufficient, as the help info below states, or do I have to specify the whole string? I have four or five variations of the crawl entries.
"This is nowhere near as scary-looking as .htaccess. The User-agent value is a partial match for the requesting browser. Each agent is listed together with a list of disallowed files."
Any advice would be very much appreciated.
Pete
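On the question above: robots.txt groups are matched against the crawler's declared User-Agent token, not against the hostname shown in the logs, and (as the quoted help note says) the match is a partial, case-insensitive one. So a single group should cover all the crawl-66-249-*.googlebot.com hostname variations. A minimal sketch, assuming only Google's crawlers are to be blocked:

```
# "Googlebot" is matched against the User-Agent token, not the
# hostname, so one group covers every crawl-*.googlebot.com host
User-agent: Googlebot
Disallow: /

# Everyone else stays allowed (empty Disallow = allow all)
User-agent: *
Disallow:
```

Note this relies on the crawler honouring robots.txt; it does nothing against bots that ignore it.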
Message 1 of 8
Re: robots.txt confirmation...
18-12-2011 4:56 PM
My confusion is/was trying to relate the hostname entries in the log (i.e. the ones that show the high kB F usage) to the User-Agent names in the separate sections of the stats summary.
I realise now that crawl-66-249-72-138.googlebot.com is a hostname and I only have Googlebot-Image/1.0 showing in my top 15 stats.
I guess I need to get the full stats to relate the two to know which User-Agent is taking up my Daily Transfer Allowance, i.e. is it Googlebot or Googlebot-Image.
A big assumption by me here is that the hostnames would differ, i.e. that crawl-66-249-72-138.googlebot.com belongs to Googlebot and not Googlebot-Image?
Till I understand that, I guess I'm better off using User-agent: *
Pete
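If the full stats do show that Googlebot-Image is the heavy user, Google documents it as a separate User-Agent token, so a more targeted sketch (assuming the image crawler is the only one you want gone, with ordinary search indexing left alone) would be:

```
# Block only Google's image crawler; the main Googlebot still crawls
User-agent: Googlebot-Image
Disallow: /
```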
Message 2 of 8
Re: robots.txt confirmation...
18-12-2011 10:49 PM
If a reputable robot is repeatedly taking up a significant amount of your daily allowance, I think that suggests your files change frequently (e.g. blogs or active forums), so they need indexing continuously. Disreputable robots could be downloading the complete website, perhaps seeking email addresses, but they wouldn't take any notice of robots.txt. Have you come across this Google explanation of robots.txt?
Image hotlinking is a potential user of bandwidth. Do you have any images on your site that other webmasters might be tempted to hotlink to?
David
Message 3 of 8
Re: robots.txt confirmation...
19-12-2011 1:08 PM
Hi spraxyt,
Thanks for the info.
I saw that from another angle and applied User-agent: * for now. The website itself is quite small and, to be honest, attracts little interest, as it was for my partner's small catering project.
She then found eBay and has a lot of referenced pictures on the webspace; some are fixed, but many are added during each week. You'd probably have to be quite desperate to hotlink old collectibles/vintage stuff?
It had been the same for three years, then suddenly, two Sundays in a row, we exceeded the DTA, first by 22MB (250 allowed) and the next week by 65MB, and got archived.
I'd never had the stats turned on until it was restored, and it was surprising to find so much going on outside of our uploading etc.
From the PN webalizer summary reports I can't see any tangible link between the hostname table showing the highest kB F usage and the User-Agent listing, which only shows the top 15. In addition to the googlebots there are 3 or 4 btcentralplus hostnames that consistently use more daily b/w than we do, but no clue as to their User-Agent.
I probably need to move into full stats access/analysis, or move to another paid host that backs up and doesn't gripe when the daily average of 70,000 kB F spikes twice a week to 320,000-350,000 and kills you.
Many thanks once again.
Pete
Message 4 of 8
Re: robots.txt confirmation...
19-12-2011 2:56 PM
Try using a .htaccess file; this is mine as an example:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dhea-forum\.org\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?cpa\.net84\.net [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?computersphonesaccessories\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?forum\.dhea\.org\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dyslexia\.plushost\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?community\.plus\.net [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?usergroup\.plus\.net [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?skyuserradio\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?skyuser\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?tdadyslexia\.plus\.com [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dyslexia\.f9\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?dhea\.org\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?cableforum\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?bigarchive\.co\.uk [NC]
RewriteCond %{HTTP_REFERER} !^https://(.+\.)?nodpi\.org [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?nodpi\.org [NC]
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?jekyllandhyde\.xtreemhost\.com [NC]
RewriteCond %{HTTP_REFERER} !^$
# No Referer matched above: refuse the request for these file types
RewriteRule .*\.(jpe?g|gif|bmp|png|pdf|doc|mid|wav|mp3)$ - [F]
Important: the file name must be
.htaccess
not "htaccess." or "htaccess.txt". Use Notepad to compose the file. With this in place it will stop hotlinking to your files.
Hope this helps.
Message 5 of 8
Re: robots.txt confirmation...
20-12-2011 2:08 AM
Quote from Cathotel: "From the PN webalizer summary reports I can't see any tangible link between the hostname table showing the highest kB F usage and the User-Agent listing, which only shows the top 15. In addition to the googlebots there are 3 or 4 btcentralplus hostnames that consistently use more daily b/w than we do, but no clue as to their User-Agent."
The occurrence of significant btcentralplus hostnames using a lot of bandwidth sounds to me like a symptom of hot linking (perhaps via eBay). As you say though the webalizer summary doesn't give sufficient information for complete understanding.
Quote from Cathotel: "I probably need to move into full stats access/analysis…"
Full stats would show you line by line each request including Hostname, Referer and User-Agent, the status returned (which might be 304 'not modified') and the size of the object returned. I think you do need that detail. For bots you will normally find the first request is for robots.txt, with what follows depending on the existence and content of that file.
David
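Once the full logs are available, tallying bytes per User-Agent is straightforward. A minimal sketch, assuming the logs are in Apache's standard "combined" format (the hostnames, sizes, and agent strings below are made-up examples):

```python
import re
from collections import defaultdict

# Apache "combined" log format: host, identd, user, [time], "request",
# status, bytes, "referer", "user-agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def bytes_per_agent(lines):
    """Sum the response sizes for each User-Agent string."""
    totals = defaultdict(int)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that don't parse
        size = m.group("bytes")
        if size != "-":  # "-" means no body was sent (e.g. a 304)
            totals[m.group("agent")] += int(size)
    return dict(totals)

sample = [
    '66.249.72.138 - - [18/Dec/2011:13:35:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 43 "-" "Googlebot-Image/1.0"',
    '66.249.72.138 - - [18/Dec/2011:13:35:02 +0000] "GET /pic.jpg HTTP/1.1" '
    '200 51200 "-" "Googlebot-Image/1.0"',
]
print(bytes_per_agent(sample))  # {'Googlebot-Image/1.0': 51243}
```

The same dictionary keyed on the hostname (or on the referer, to spot hotlinking) would show which of the btcentralplus hosts is eating the allowance.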
Message 6 of 8
Re: robots.txt confirmation...
20-12-2011 5:23 PM
Gents,
Many thanks for both inputs. I know from my IT past (I gave it up 2 years ago after 11 years) that I will end up getting sucked into .htaccess and all that follows, so I'm grateful for a practical example.
David, it was a struggle to get the raw log data enabled, and Matt Taylor via Twitter kindly stepped in and pointed support in the right direction. Fingers crossed, they should appear tomorrow; then I can start following the guide and install/analyse the logs.
It's nice to find a forum where there is genuine help from experts; it is much appreciated.
Regards and Happy Christmas!
Pete
Message 7 of 8
Re: robots.txt confirmation...
21-12-2011 10:04 AM
Hi Cathotel,
Glad I was able to help, and hope the logs are now showing for you - please let us know if not.
Message 8 of 8