Class: WebRobots
- Inherits: Object
- Defined in: lib/webrobots.rb, lib/webrobots/robotstxt.rb
Defined Under Namespace
Classes: Error, ParseError, RobotsTxt
Instance Attribute Summary
- (Object) user_agent (readonly)
  Returns the robot name initially given.
Instance Method Summary
- (Boolean) allowed?(url)
  Tests if the robot is allowed to access a resource at url.
- (Object) create_cache
  :nodoc:
- (Boolean) disallowed?(url)
  Equivalent to !allowed?(url).
- (Object) error(url)
  Returns an error object if there was an error in fetching or parsing robots.txt of the site url.
- (Object) error!(url)
  Raises the error if there was an error in fetching or parsing robots.txt of the site url.
- (WebRobots) initialize(user_agent, options = nil) (constructor)
  Creates a WebRobots object for a robot named user_agent, with optional options.
- (Object) option(url, token)
  Equivalent to options(url)[token.downcase].
- (Object) options(url)
  Returns extended option values for a resource at url in a hash with each field name lower-cased.
- (Object) reset(url)
  Removes robots.txt cache for the site url.
- (Object) sitemaps(url)
  Returns an array of Sitemap URLs.
Constructor Details
- (WebRobots) initialize(user_agent, options = nil)
Creates a WebRobots object for a robot named user_agent, with optional options.
- :http_get => a custom method, proc, or anything that responds to .call(uri), to be used for fetching robots.txt. It must return the response body if successful, return an empty string if the resource is not found, and return nil or raise an error on failure. Redirects should be handled within this proc.
# File 'lib/webrobots.rb', line 20
def initialize(user_agent, options = nil)
  @user_agent = user_agent
  @parser = RobotsTxt::Parser.new(user_agent)
  @parser_mutex = Mutex.new

  options ||= {}
  @http_get = options[:http_get] || method(:http_get)

  @robotstxt = create_cache()
end
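The :http_get contract above (body on success, empty string when the resource is not found, nil or an exception on failure) can be illustrated with a stub fetcher. This is only a sketch: PAGES, UNREACHABLE_HOSTS, and STUB_HTTP_GET are hypothetical names, and the in-memory table stands in for real HTTP so that the example runs without network access.

```ruby
require 'uri'

# Hypothetical in-memory "web"; a real fetcher would perform HTTP requests
# and follow redirects itself.
PAGES = {
  'http://example.com/robots.txt' => "User-agent: *\nDisallow: /private/\n"
}.freeze

# Hypothetical hosts for which fetching is simulated to fail.
UNREACHABLE_HOSTS = ['down.example.com'].freeze

# A callable that responds to .call(uri) and follows the documented contract.
STUB_HTTP_GET = lambda do |uri|
  uri = URI(uri.to_s)
  if UNREACHABLE_HOSTS.include?(uri.host)
    nil                        # failure: nil (raising an error is also allowed)
  else
    PAGES.fetch(uri.to_s, '')  # success: body; unknown resource: empty string
  end
end

# Per the description above, such a callable would be supplied as
#   WebRobots.new('MyBot/1.0', :http_get => STUB_HTTP_GET)
```

A stub like this may be convenient in tests, since the robot's decisions then depend only on the canned robots.txt bodies.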
Instance Attribute Details
- (Object) user_agent (readonly)
Returns the robot name initially given.
# File 'lib/webrobots.rb', line 37
def user_agent
  @user_agent
end
Instance Method Details
- (Boolean) allowed?(url)
Tests if the robot is allowed to access a resource at url. If a malformed URI string is given, URI::InvalidURIError is raised. If a relative URI or a non-HTTP/HTTPS URI is given, ArgumentError is raised.
# File 'lib/webrobots.rb', line 43
def allowed?(url)
  site, request_uri = split_uri(url)
  return true if request_uri == '/robots.txt'
  robots_txt = get_robots_txt(site)
  robots_txt.allow?(request_uri)
end
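The error conditions described for allowed? can be reproduced with the standard URI library. The helper below, split_http_url, is a hypothetical stand-in written for illustration; it is not the library's private split_uri, whose source is not shown here, and the shape of the returned site string is an assumption.

```ruby
require 'uri'

# Hypothetical helper: splits a URL into [site, request_uri].
# Raises URI::InvalidURIError for a malformed string and ArgumentError
# for a relative or non-HTTP/HTTPS URI.
def split_http_url(url)
  uri = url.is_a?(URI) ? url : URI.parse(url.to_s) # may raise URI::InvalidURIError
  raise ArgumentError, "non-absolute URI: #{url}" unless uri.scheme && uri.host
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  raise ArgumentError, "non-HTTP/HTTPS URI: #{url}" unless uri.is_a?(URI::HTTP)
  ["#{uri.scheme}://#{uri.host}:#{uri.port}", uri.request_uri]
end
```

With this sketch, a relative path such as '/relative/path' and a URL such as 'ftp://example.com/file' both end in ArgumentError, while a string containing a raw space in the path fails earlier, inside URI.parse.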
- (Object) create_cache
:nodoc:
# File 'lib/webrobots.rb', line 32
def create_cache
  Hash.new    # Must respond to [], []=, and delete.
end
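The comment in create_cache names the only three messages the cache must answer: [], []=, and delete. As a sketch of what an alternative cache object could look like, the hypothetical SynchronizedCache class below wraps a Hash in a Mutex. Whether overriding create_cache in a subclass is a supported extension point is an assumption, since the method is marked :nodoc:.

```ruby
# Hypothetical cache exposing exactly the interface named in create_cache.
class SynchronizedCache
  def initialize
    @store = {}
    @lock = Mutex.new
  end

  # Returns the cached value for key, or nil if absent.
  def [](key)
    @lock.synchronize { @store[key] }
  end

  # Stores value under key.
  def []=(key, value)
    @lock.synchronize { @store[key] = value }
  end

  # Removes key and returns its previous value (nil if absent).
  def delete(key)
    @lock.synchronize { @store.delete(key) }
  end
end
```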
- (Boolean) disallowed?(url)
Equivalent to !allowed?(url).
# File 'lib/webrobots.rb', line 51
def disallowed?(url)
  !allowed?(url)
end
- (Object) error(url)
Returns an error object if there is an error in fetching or parsing robots.txt of the site url.
# File 'lib/webrobots.rb', line 75
def error(url)
  robots_txt_for(url).error
end
- (Object) error!(url)
Raises the error if there was an error in fetching or parsing robots.txt of the site url.
# File 'lib/webrobots.rb', line 81
def error!(url)
  robots_txt_for(url).error!
end
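The difference between error and error! is that the first hands back a stored error object for inspection while the second raises it. The hypothetical SiteRecord class below sketches that pattern in isolation; it is an illustration only and makes no claim about how the library's RobotsTxt class is written.

```ruby
# Hypothetical stand-in for a per-site robots.txt record.
class SiteRecord
  # The error captured while fetching or parsing, or nil if there was none.
  attr_reader :error

  def initialize(error = nil)
    @error = error
  end

  # Raises the stored error, if any; returns nil otherwise.
  def error!
    raise @error if @error
  end
end
```

A caller that wants to log and continue would check error, whereas a caller that prefers to abort on a broken robots.txt would call error! and let the exception propagate.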
- (Object) option(url, token)
Equivalent to options(url)[token.downcase].
# File 'lib/webrobots.rb', line 63
def option(url, token)
  options(url)[token.downcase]
end
- (Object) options(url)
Returns extended option values for a resource at url in a hash with each field name lower-cased. See allowed?() for a list of errors that may be raised.
# File 'lib/webrobots.rb', line 58
def options(url)
  robots_txt_for(url).options
end
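Because options returns a hash keyed by lower-cased field names, and option downcases its token before the lookup, field names can be queried case-insensitively. The sketch below mimics that behavior with plain hashes; RAW_FIELDS, LOWERCASED, and lookup_option are hypothetical names, and the field values are made-up sample data whose types are not meant to reflect what the library returns.

```ruby
# Hypothetical extended fields as they might be written in a robots.txt record.
RAW_FIELDS = { 'Crawl-delay' => '10', 'Request-rate' => '1/5' }.freeze

# Lower-case every field name, mirroring the hash shape described for options.
LOWERCASED = RAW_FIELDS.each_with_object({}) do |(name, value), fields|
  fields[name.downcase] = value
end.freeze

# Case-insensitive lookup of one field, analogous to option(url, token).
def lookup_option(fields, token)
  fields[token.downcase]
end
```

Here lookup_option(LOWERCASED, 'Crawl-Delay') finds the entry stored under 'crawl-delay', and a field that was never present yields nil.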
- (Object) reset(url)
Removes robots.txt cache for the site url.
# File 'lib/webrobots.rb', line 86
def reset(url)
  site, = split_uri(url)
  @robotstxt.delete(site)
end
- (Object) sitemaps(url)
Returns an array of Sitemap URLs. See allowed?() for a list of errors that may be raised.
# File 'lib/webrobots.rb', line 69
def sitemaps(url)
  robots_txt_for(url).sitemaps
end