作者mikemlbb (M)
看板R_Language
标题[问题] 抓取PTT网页,请问此程式码的错误在哪?
时间Tue Oct 18 15:23:22 2016
[问题类型]:
程式谘询(我想用R 做某件事情,但是我不知道要怎麽用R 写出来)
[软体熟悉度]:
使用者(已经有用R 做过不少作品)
[问题叙述]:
我照着书本输入以下程式码,
想尝试抓取笨版中之文章文字内容,
但程式码执行完後却出现:
Error in regexpr("www", line) :
argument "line" is missing, with no default
了解regexpr语法的用法後,
发现此程式的"line"字词与该语法之用法不同,
然而若是在此范例中,
要怎麽修改才能抓取到笨版当中的文章呢?
谢谢大家解惑
[程式范例]:
install.packages("XML")
install.packages("RCurl")
library(XML)
library(RCurl)
data <- list()
for(i in 1058:1118){
tmp <- paste(i, '.html', sep = '')
url <- paste('
https://webptt.com/cn.aspx?n=bbs/StupidClown/index',tmp,sep = '')
get_url <- getURL(url,ssl.verifypeer = FALSE)
html <- htmlParse(get_url)
url.list <- xpathSApply(html,"//div[@class='title']/a[@href]",xmlAttrs)
data <- rbind(data, paste('
https://webptt.com/cn.aspx?n=',url.list,sep = ''))
}
data <- unlist(data)
getdoc <- function(line){
start <- regexpr('www', line)[1]
end <- regexpr('html', line)[1]
if(start != -1 & end != -1){
url <- substr(line, start, end+3)
html <- htmlParse(getURL(url,ssl.verifypeer = FALSE),encoding = 'UTF-8')
doc <- xpathSApply(html, "//div[@id='main-container']",xmlValue)
name <- strsplit(url,'/')[[1]][4]
write(doc,gsub('html','txt',name))
}
}
getdoc()
sapply(data, getdoc)
setwd("C://Documents and Settings//12345//桌面//R_textmining")
write.table(getdoc,file = "getdoc.txt",row.names = F,quote = F)
[环境叙述]:
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 3
[关键字]:
regexpr、xpathSApply、PTT爬虫
--
※ 发信站: 批踢踢实业坊(ptt.cc), 来自: 118.163.143.251
※ 文章网址: https://webptt.com/cn.aspx?n=bbs/R_Language/M.1476775409.A.321.html
※ 编辑: mikemlbb (118.163.143.251), 10/18/2016 15:29:00
1F:推 clansoda: Hi,I am trying to solve your problem. 10/19 09:41
2F:→ clansoda: Would you tell me what your expected output is 10/19 09:41
3F:→ clansoda: The "data" dataframe contains 1220 URL characters 10/19 09:42
4F:→ mikemlbb: I'm trying to crawl the content of StupidClown site 10/21 02:23
5F:→ mikemlbb: XIncluding article title and content by no.1058 to11 10/21 02:25
6F:→ mikemlbb: But the code seem to be wrong.When I run "getdoc()" 10/21 02:27
7F:→ mikemlbb: The error will emerge then say "line" is not defined 10/21 02:29