urllib.robotparser - Parser for robots.txt

Source code: Lib/urllib/robotparser.py

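This module provides a single class, RobotFileParser, which answers questions about whether a particular user agent may fetch a URL on the web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.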

- class urllib.robotparser.RobotFileParser(url='')
- This class provides methods to read, parse, and answer questions about the robots.txt file at url.
set_url(url)
- Sets the URL referring to a robots.txt file.
read()
- Reads the robots.txt URL and feeds it to the parser.
parse(lines)
- Parses the lines argument.
can_fetch(useragent, url)
- Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
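mtime()
- Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
modified()
- Sets the time the robots.txt file was last fetched to the current time.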
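crawl_delay(useragent)
- Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter, or it does not apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.6.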

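request_rate(useragent)
- Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter, or it does not apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.6.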

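site_maps()
- Returns the contents of the Sitemap parameter from robots.txt as a list. If there is no such parameter, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.8.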
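
The methods above can also be exercised without network access by handing robots.txt content directly to parse(). The following is a minimal sketch, assuming Python 3.8 or later (for site_maps()); the rules in ROBOTS_LINES and the example.com URLs are hypothetical and only serve as illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied as a list of lines
# instead of being downloaded with read().
ROBOTS_LINES = [
    "User-agent: *",
    "Crawl-delay: 6",
    "Request-rate: 3/20",
    "Disallow: /cgi-bin/",
    "",
    "Sitemap: https://example.com/sitemap.xml",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_LINES)

# Query the parsed rules for the wildcard user agent.
allowed_home = rp.can_fetch("*", "https://example.com/")
allowed_cgi = rp.can_fetch("*", "https://example.com/cgi-bin/search")
delay = rp.crawl_delay("*")        # integer number of seconds, or None
rate = rp.request_rate("*")        # RequestRate(requests, seconds), or None
sitemaps = rp.site_maps()          # list of sitemap URLs, or None

print(allowed_home, allowed_cgi, delay, rate, sitemaps)
```

Because crawl_delay(), request_rate(), and site_maps() may each return None, calling code should check for None before using the result.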

The following example demonstrates basic use of the RobotFileParser class:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
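
For a long-running spider, mtime() can be used to decide when the rules should be fetched again. The sketch below is one possible approach, not part of the module: the helper is_stale() and the interval MAX_AGE_SECONDS are hypothetical names, and parse() is used in place of read() so that the example runs without network access. It relies on mtime() returning 0 before anything has been parsed, and on parse() recording the fetch time, as it does in current CPython.

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 24 * 60 * 60  # hypothetical refresh interval: one day


def is_stale(rp, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if rp holds no rules yet or they are older than max_age."""
    if now is None:
        now = time.time()
    last_fetched = rp.mtime()
    # mtime() is 0 until the parser has been fed a robots.txt file.
    return last_fetched == 0 or (now - last_fetched) > max_age


rp = urllib.robotparser.RobotFileParser()
stale_before = is_stale(rp)  # nothing parsed yet

# A real crawler would call rp.set_url(...) and rp.read() here instead.
rp.parse(["User-agent: *", "Disallow: /private/"])
stale_after = is_stale(rp)

# Simulate a check two refresh intervals later.
stale_later = is_stale(rp, now=time.time() + 2 * MAX_AGE_SECONDS)

print(stale_before, stale_after, stale_later)
```

When is_stale() reports True, the spider would call read() again to download and reparse the file.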
