Author: arlu (arlu)
Board: Python
Title: [Question] Problem encountered when fetching web pages
Time: Thu Aug 26 17:22:22 2010
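Hi everyone,
I'm new to Python and wanted to write a simple program that fetches web pages.
I installed Python 3.1.1 and am using the urllib.request module. My test script is below: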
--------------------------------------------------------
import urllib.request

url = "http://google.com"
# Open the URL and read the whole response body (returned as bytes)
MyWeb = urllib.request.urlopen(url)
WebContent = MyWeb.read()
MyWeb.close()
print(WebContent)
--------------------------------------------------------
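I found that for pages that are easy to fetch, such as (http://google.com),
the content is downloaded correctly. But for some sites, such as (http://www.wretch.cc/),
running the script produces the following message: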
Traceback (most recent call last):
  File "html2spec.py", line 6, in <module>
    MyWeb=urllib.request.urlopen(url)
  File "C:\Python31\lib\urllib\request.py", line 119, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python31\lib\urllib\request.py", line 353, in open
    response = meth(req, response)
  File "C:\Python31\lib\urllib\request.py", line 465, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python31\lib\urllib\request.py", line 391, in error
    return self._call_chain(*args)
  File "C:\Python31\lib\urllib\request.py", line 325, in _call_chain
    result = func(*args)
  File "C:\Python31\lib\urllib\request.py", line 473, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
Why does this happen? Or is there a different approach I should use?
Thanks for your help.
--
※ Sent from: PTT BBS (ptt.cc)
◆ From: 59.120.3.16
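1F: push cobrasgo: Isn't it just the last word of the error message? 08/26 19:29
2F: push Starwindd: HTTP Error 401: Unauthorized <- that's the one 08/26 22:04
3F: → arlu: Thanks! But do I really have to capture packets to see what's going on? 08/27 13:23
4F: push areyo: You could use scrapy, but it seems to support only 2.x 08/27 13:48
5F: → sunrise0406: Use CURL and disguise it as a browser, then you can fetch the page content 10/07 14:08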
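
The "disguise as a browser" idea from 5F can also be tried without leaving urllib. Below is a minimal sketch, assuming the server is rejecting the default Python-urllib User-Agent (the thread does not confirm this is the actual cause of the 401). The helper names build_request, describe_http_error and fetch, and the BROWSER_USER_AGENT string, are hypothetical and only for illustration. Catching HTTPError also exposes the status code and response headers directly, so no packet capture should be needed to inspect them.

```python
import contextlib
import urllib.error
import urllib.request

# Hypothetical browser-like identifier (an assumption); substitute any realistic value.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def describe_http_error(err):
    """Summarize an HTTPError: status code, message, and any auth challenge."""
    challenge = None
    if err.headers is not None:
        challenge = err.headers.get("WWW-Authenticate")
    return "HTTP {0} {1} (WWW-Authenticate: {2})".format(err.code, err.msg, challenge)


def fetch(url, timeout=10):
    """Fetch url and return the body as bytes, or None if the server returns an HTTP error."""
    try:
        # contextlib.closing makes sure the response is closed even if read() fails.
        with contextlib.closing(
            urllib.request.urlopen(build_request(url), timeout=timeout)
        ) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # The error object carries the status and headers the server sent back.
        print(describe_http_error(err))
        return None
```

Calling fetch("http://www.wretch.cc/") would either return the page bytes or print the status line and any WWW-Authenticate challenge. If a 401 still comes back with a challenge header, the site really does want credentials, and changing the User-Agent alone will not help.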