A tool to prevent site being archived?
A tool to prevent site being archived?
11-10-2012 11:46 PM
Obviously there is nothing I can do to prevent this directly, as I was not accessing the site myself. Still, it has got me wondering whether, if this sort of event is at all common, one might have a script of some sort that silently watches the log. When, for example, a single user has made, say, ten hits on the site in one day, he or she gets a notice saying that users are limited to twelve hits per day; then, if there are two more hits, the user is refused access for the next 24 hours.
Might this be possible to achieve?
Best regards, T.
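Something along those lines can be scripted on a PHP host by counting hits per IP per day. A minimal sketch, assuming every page is PHP and can include a common file; the counter filename, the thresholds and the wording are invented for illustration:

<?php
// ratelimit.php - include at the top of every page (illustrative sketch only).
// Counts hits per IP per calendar day in a small JSON file, warns at one
// threshold and refuses access at a second.

$warn_at  = 10;                                       // warn after ten hits (example figure)
$block_at = 12;                                       // refuse access after twelve (example figure)
$store    = dirname(__FILE__) . '/hitcounts.json';    // hypothetical counter file
$ip       = $_SERVER['REMOTE_ADDR'];
$today    = date('Y-m-d');

$counts = is_file($store) ? json_decode(file_get_contents($store), true) : array();
if (!isset($counts['date']) || $counts['date'] !== $today) {
    $counts = array('date' => $today, 'hits' => array());   // start afresh each day
}
$hits = isset($counts['hits'][$ip]) ? $counts['hits'][$ip] + 1 : 1;
$counts['hits'][$ip] = $hits;
file_put_contents($store, json_encode($counts), LOCK_EX);

if ($hits > $block_at) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access suspended until tomorrow - daily request limit exceeded.');
} elseif ($hits > $warn_at) {
    echo '<p>Please note: visitors are limited to ' . $block_at . ' page requests per day.</p>';
}
?>

The counter resets at midnight rather than on a rolling 24 hours, and server-side tools such as mod_evasive or fail2ban do the same job more robustly where the host supports them.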
Re: A tool to prevent site being archived?
13-10-2012 3:05 PM
That sounds automated. Not much point talking nicely to a bot. A spider trap might help - for which there are various recipes on the web.
Gabe
Re: A tool to prevent site being archived?
14-10-2012 1:39 AM
You could add a robots.txt to the root of your website, and hope that the crawlers honour it.
http://www.robotstxt.org/
Re: A tool to prevent site being archived?
14-10-2012 7:37 PM
In my experience naughty hits like this are from sources that either don't read the robots file or ignore it. I usually block the IP.
I've never found a suitable way of preventing randomers, though I have various items regulated via .htaccess, etc.
These days I also tend to keep a several-times-a-day watch on the raw stats log. Suspicious activity is usually obvious - single hits preceded by further scans - and you get to know which IP to block before they come back looking for PHP registration files... you see them arrive later for exactly that one file and get the 403. Ha!
A PITA, but part of life.

funny you should mention NZ -- 202.175.135.* - DSLAK-NZ - Datacom Systems Ltd, just yesterday. Bonkers scan of everything...
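Part of that watching can be scripted. A rough sketch, assuming an Apache-style access log; the log path and the watched filenames are examples only, not anything specific to Plusnet hosting:

<?php
// scanlog.php - run by hand or from cron (paths and filenames are examples).
// Lists IPs that requested files an ordinary visitor never asks for,
// i.e. candidates for a 'deny from' rule in .htaccess.

$logfile   = '/path/to/access.log';
$watchlist = array('/register.php', '/xmlrpc.php', '/wp-login.php');

$suspects = array();
foreach (file($logfile) as $line) {
    // common/combined log format: IP - - [date] "GET /path HTTP/1.1" status size ...
    if (!preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)/', $line, $m)) {
        continue;
    }
    $ip   = $m[1];
    $path = $m[2];
    foreach ($watchlist as $target) {
        if (strpos($path, $target) === 0) {
            $suspects[$ip] = isset($suspects[$ip]) ? $suspects[$ip] + 1 : 1;
        }
    }
}

foreach ($suspects as $ip => $hits) {
    echo "$ip hit watched files $hits time(s)\n";
}
?>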
Re: A tool to prevent site being archived?
22-10-2012 10:58 AM
Sorry to be slow replying - I didn't get notified of any replies, so after a few days I stopped looking!
Quote from: Gabe
That sounds automated. Not much point talking nicely to a bot. A spider trap might help - for which there are various recipes on the web.
I looked up 'spider trap' and it seems to be just what I need to avoid. Or is it like 'bandwidth', a word which means totally different things in different worlds? (I looked it up on Google, and it was described as an element on a page which tricks the Google spider into going around in circles.)
Best, T.
Re: A tool to prevent site being archived?
22-10-2012 11:10 AM
Quote from: IanSn
In my experience naughty hits like this are from sources that either don't read the robots file or ignore it. I usually block the IP.
How would I do that?
Quote
... These days I also tend to keep a several-times-a-day watch on the raw stats log. Suspicious activity is usually obvious - single hits preceded by further scans - and you get to know which IP to block before they come back looking for PHP registration files... you see them arrive later for exactly that one file and get the 403. Ha!
A PITA, but part of life.
For my kind of site it is impossible to imagine checking its stats daily. But for rare cases I guess I could (if I knew how) block a particular IP. Could the '403' be configured to explain what is going on? Otherwise a user who wasn't running amok would simply find themselves barred, with no way of knowing how or why, or what to do about it.
I don't know what a PHP registration file is, so I don't know what anyone would be seeking - afaik I don't host any?
Quote
funny you should mention NZ -- 202.175.135.* - DSLAK-NZ - Datacom Systems Ltd, just yesterday. Bonkers scan of everything...
So should I block these people, suggest to others that they do likewise, write to them, or what?
Do ISPs other than Plusnet deal with this more effectively? I haven't heard of others dealing with this sort of grief.
Best, T.
Re: A tool to prevent site being archived?
22-10-2012 11:14 AM
Quote from: avatastic
You could add a robots.txt to the root of your website, and hope that the crawlers honour it.
http://www.robotstxt.org/
What other effects would this have - e.g. would it effectively make the site invisible to Google? The site is charitable, not commercial, but part of the point is being able to be found!
Best, T.
Re: A tool to prevent site being archived?
22-10-2012 3:52 PM
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
would tell all bots but Google to keep off the grass. But bad bots tend to ignore the notices.
You can also set a "Crawl-delay" (which Google ignores, but you can set it in Google Webmaster Tools), but that's more relevant to avoiding dips in response time on large sites than limiting daily bandwidth on small ones. Bots that respect robots.txt aren't usually the problem.
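For instance, a ten-second gap between requests would look like this (the figure is purely illustrative; some crawlers, such as Bing's and Yandex's, honour it, Google's does not):
User-agent: *
Crawl-delay: 10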
You can use a bad-bot list, but it's easy to go over the top. An alternative, or additional, method is to set a spider trap. I'll give a brutal but effective version.
First you create a hidden link in your homepage to a folder that will act as your spider trap (there are various ways of doing that). You put it top and bottom of the page body, e.g.
<body>
<div style="display:none"><a href="/bath"><!-- bath time --></a></div>
...
<div style="display:none"><a href="/bath"><!-- bath time --></a></div>
</body>
Then you tell spiders to keep out of the bath by creating a robots.txt file and putting it in whichever folder corresponds to www.yoursite.com/robots.txt (that's where bots should look for it).
User-agent: *
Disallow: /bath/
Disallow: /bath
User-agent: Googlebot
Disallow: /bath/
Disallow: /bath
You say it with and without the trailing slash, because bots are thick, and you say it again to Googlebot, because it's been known to be hard of hearing, and it pays to be paranoid.
Then you put an index.php file in your /bath, such as
<!doctype html>
<html>
<head>
<title>Naughty Spider</title></head>
<body>
<p>Please respect the robots.txt</p>
<?php
// the IP address of whatever ignored robots.txt and followed the hidden link
$addr = $_SERVER['REMOTE_ADDR'];
// append a deny rule for that IP to the site's .htaccess (locked against concurrent writes)
file_put_contents("../.htaccess", "deny from $addr\n", FILE_APPEND | LOCK_EX);
?>
</body>
</html>
If you already have a .htaccess file, make sure it ends on a new line.
Then check your .htaccess file occasionally, just to make sure you're only trapping rude spiders. The idea is to block anything that tries to scrape your whole site without asking, not the polite spider or the ordinary user.
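After a while the appended rules look something like this (addresses invented for illustration); any line you recognise as legitimate can simply be deleted again:
deny from 203.0.113.7
deny from 198.51.100.22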
Gabe
Re: A tool to prevent site being archived?
22-10-2012 4:36 PM
@Tony -
Definitely agree - use a robots file.
But having said that, the worst hits I've experienced didn't even look at the robots file.
Normally there's a Control Panel that will let you add an IP, or a range, to block. This adds the IP to that .htaccess file mentioned above.
Or you can edit the .htaccess file 'manually', though carefully. If you're not familiar with it, the Control Panel would be best.
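If you do edit it by hand, a block is just a 'deny from' line per address or range - Apache accepts a partial address to cover a whole block. For example (the second address is invented):
deny from 202.175.135.
deny from 203.0.113.45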
Sorry, by 'php registration file' I meant the file that's used when (if) people register on the site. (PHP is a server-side scripting language, often used with databases.) 'Hackers' know what this registration file is called and will attempt a single hit on that file. If it's found, and if it's an automated 'bot', it can fill in the registration form in seconds. Very annoying.
Single hits on vulnerable files just get immediately blocked around here!
In my experience writing to the abuse dept. of the providers results in goose egg. But I'd still encourage people to do it.
edit - typo. (dept. not debt. !!)
Re: A tool to prevent site being archived?
22-10-2012 4:56 PM
Any source exhibiting bot-like behaviour that doesn't check and respect robots.txt deserves to be blocked as a matter of course.