Home » 2017 » May » 21 » Internet bot

1:55 AM
Internet bot

Internet bot An Internet Bot, also known as web robot, WWW robot or simply bot, is a software application that runs automated tasks (scripts) over the Internet.[1] Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. The largest use of bots is in web spidering (web crawler), in which an automated script fetches, analyzes and files information from web servers at many times the speed of a human. More than half of all web traffic is made up of bots.[2]

Efforts by servers hosting websites to counteract bots vary. Servers may choose to outline rules on the behaviour of internet bots by implementing a robots.txt file: this file is simply text stating the rules governing a bot's behaviour on that server. Any bot interacting with (or 'spidering') any server that does not follow these rules should, in theory, be denied access to, or removed from, the affected website. If the only rule implementation by a server is a posted text file with no associated program/software/app, then adhering to those rules is entirely voluntary – in reality there is no way to enforce those rules, or even to ensure that a bot's creator or implementer acknowledges, or even reads, the robots.txt file contents. Some bots are "good" – i.e. search engine spiders – while others can be used to launch malicious and harsh attacks.

The standard was proposed by Martijn Koster,[1][2] when working for Nexor[3] in February 1994[4] on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly-behaved web crawler that caused an inadvertent denial of service attack on Koster's server.[5]

It quickly became a de facto standard that present and future web crawlers were expected to follow; most complied, including those operated by search engines such as WebCrawlerLycos and AltaVista.

When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the web owner wishes to provide no specific instructions, and crawl the entire site.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.[7]

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under https://example.com:8080/ or https://example.com/.

Some major search engines following this standard include Ask,[8] AOL,[9] Baidu,[10] Bing,[11] DuckDuckGo, [12] Google,[13] Yahoo!,[14] and Yandex.[15]

The volunteering group Archive Team explicitly ignores robots.txt for the most part, viewing it as an obsolete standard that hinders web archival efforts. According to project leader Jason Scott, "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context."[16] For some years, the Internet Archive did not crawl sites with robots.txt, but in April 2017, it announced that it would no longer honour directives in the robots.txt files. “Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes".[17] This was in response to entire domains being tagged with robots.txt when the content became obsolete.

Despite the use of the terms "allow" and "disallow", the protocol is purely advisory.[18] and relies on the compliance of the web robot. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk,[19] this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components."[20] In the context of robots.txt files, security through obscurity is not recommended as a security technique.

Views: 51 | Added by: admin | Tags: Internet bot | Rating: 5.0/1
Total comments: 0