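Author: celestialgod (天)
Board: R_Language
Title: Re: [Question] Organizing data
Time: Sun Mar 19 11:05:30 2017
※ Quoting allen1985 (我要低调 拯救形象):
: [Question type]:
: Performance advice (I want my R code to run faster)
: [Software familiarity]:
: User (have already done quite a few projects in R)
: [Problem description]:
: Reorganize the data without using a for loop
: [Code example]:
: The data looks like this: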
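: data <- matrix(c("S11","R1","O11",
: "S11","R2","O12",
: "O11","R3","O12",
: "S21","R1","O21",
: "S21","R2","O22",
: "O21","R3","O22",
: "S11","R1","O11",
: "S11","R2","O12",
: "O11","R3","O12"), ncol = 3, byrow = TRUE)
: I would like to reorganize the data into
: r.data <- matrix(c("S11","O11","O12", "2",
: "S21","O21","O22", "1"), ncol = 4, byrow = TRUE)
: where the fourth column holds the number of times each group occurs.
: Put simply, the original data comes in groups of three rows, and I want to
: pull out each unique group and count how many times it occurs.
: I first got it working with two clumsy for loops, but I'd like to ask whether there is a better way.
: Basically, the first for loop reorganizes the data into
: r.data <- matrix(c("S11","O11","O12",
: "S21","O21","O22"), ncol = 3, byrow = TRUE)
: that is, it works out the unique groups first.
: The second for loop then counts how many times each unique group occurs, which gives the data.frame I want.
: Thanks
: Put simply, three rows form one group.
Here are four approaches: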
dataMat <- matrix(c("S11","R1","O11",
                    "S11","R2","O12",
                    "O11","R3","O12",
                    "S21","R1","O21",
                    "S21","R2","O22",
                    "O21","R3","O22",
                    "S11","R1","O11",
                    "S11","R2","O12",
                    "O11","R3","O12"), ncol = 3, byrow = TRUE)
# aggregate
colSplit <- split(dataMat, rep(1L:ncol(dataMat), each = nrow(dataMat)))
aggregate(rep(1, nrow(dataMat)), colSplit, sum)
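# paste
# Collapse each row into one "_"-separated key, then tabulate the keys
rowCollapse <- do.call(function(...) paste(..., sep = "_"),
  split(dataMat, rep(1L:ncol(dataMat), each = nrow(dataMat))))
countRows <- table(rowCollapse)
# as.vector() keeps Freq a single column; passing the table itself
# would expand into two columns (key and count)
cbind(data.frame(do.call(rbind, strsplit(names(countRows), "_")),
  stringsAsFactors = FALSE), Freq = as.vector(countRows))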
# data.table
library(data.table)
DT <- data.table(dataMat)
DT[ , .N, by = .(V1, V2, V3)]
## Note: when there are many columns, this also works
# DT[ , .N, by = eval(paste0("V", 1:ncol(DT)))]
## or pass by a character vector with the names of the columns to count over
## e.g.:
# colsCoun <- c("V1", "V2", "V3")
# DT[ , .N, by = colsCoun]
# dplyr
library(dplyr)
DF <- as.data.frame(dataMat, stringsAsFactors = FALSE)
DF %>% group_by(V1, V2, V3) %>% summarise(count = n())
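## Note: when there are many columns, this also works
## (group_by_() is deprecated in current dplyr; across() needs dplyr >= 1.0)
# DF %>% group_by(across(all_of(paste0("V", 1:ncol(DF))))) %>%
#   summarise(count = n())
## or
# colsCoun <- c("V1", "V2", "V3")
# DF %>% group_by(across(all_of(colsCoun))) %>%
#   summarise(count = n())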
Efficiency should be roughly: data.table > dplyr > aggregate > paste
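The ranking above is an expectation, so it may be worth timing on your own data. Below is a minimal sketch that times only the two base R approaches with system.time(). The synthetic matrix bigMat, its size n, and the four-letter alphabet are assumptions made up for illustration; the data.table and dplyr versions could be wrapped in system.time() the same way.

```r
# Sketch: time the aggregate and paste approaches on synthetic data.
# bigMat, n and the value set are illustrative assumptions, not the poster's data.
set.seed(1)
n <- 20000L
bigMat <- matrix(sample(c("A", "B", "C", "D"), 3L * n, replace = TRUE), ncol = 3)

# Split the matrix into a list of its columns (same trick as above)
colSplit <- split(bigMat, rep(seq_len(ncol(bigMat)), each = nrow(bigMat)))

# aggregate: sum a column of ones within each unique row
tAggregate <- system.time(
  resAgg <- aggregate(rep(1, nrow(bigMat)), colSplit, sum)
)

# paste: collapse each row into one key and tabulate the keys
tPaste <- system.time(
  resTab <- table(do.call(paste, c(colSplit, sep = "_")))
)

print(rbind(aggregate = tAggregate[1:3], paste = tPaste[1:3]))
```

Timings depend on the machine and on how many distinct rows the data has, so treat the numbers as a rough guide only.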
--
Article series on R data-wrangling packages:
magrittr #1LhSWhpH (R_Language) https://goo.gl/72l1m9
data.table #1LhW7Tvj (R_Language) https://goo.gl/PZa6Ue
dplyr (parts 1 and 2) #1LhpJCfB,#1Lhw8b-s (R_Language) https://goo.gl/I5xX9b
tidyr #1Liqls1R (R_Language) https://goo.gl/i7yzAz
pipeR #1NXESRm5 (R_Language) https://goo.gl/zRUISx
--
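※ Origin: PTT (批踢踢实业坊, ptt.cc), From: 36.235.90.162
※ Article URL: https://webptt.com/cn.aspx?n=bbs/R_Language/M.1489892734.A.C86.html
1F: push allen1985: Thanks, I learned something new again! 03/19 11:49
You're welcome, feel free to ask more questions XDD
2F:→ allen1985: Although this only solves the second problem, it is much cleaner written this way 03/19 11:51
The unique part is already taken care of when the counts are computed~~~
3F:→ allen1985: My problem is that the original data comes in units of three rows 03/19 13:05
4F:→ allen1985: I'll think about it myself 03/19 13:05
I didn't read carefully, sorry QQ
This isn't hard to solve either... let me write it up, give me a moment
5F:→ allen1985: Thanks, and thank you on behalf of my boss too... 03/19 13:08
Done, please see below: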
# aggregate
colSplit <- split(dataMat, rep(1L:ncol(dataMat), each = nrow(dataMat)))
# idx labels every block of three consecutive rows (1,1,1,2,2,2,...)
idx <- rep(1:ceiling(nrow(dataMat)/3), each = 3L, length.out = nrow(dataMat))
aggregate(rep(1, nrow(dataMat)), c(colSplit, list(idx = idx)), sum)
# data.table
library(data.table)
DT <- data.table(dataMat)
DT[ , idx := rep(1:ceiling(nrow(DT)/3), each = 3L, length.out = nrow(DT))]
print(DT)
# V1 V2 V3 idx
# 1: S11 R1 O11 1
# 2: S11 R2 O12 1
# 3: O11 R3 O12 1
# 4: S21 R1 O21 2
# 5: S21 R2 O22 2
# 6: O21 R3 O22 2
# 7: S11 R1 O11 3
# 8: S11 R2 O12 3
# 9: O11 R3 O12 3
DT[ , .N, by = .(idx, V1, V2, V3)]
# dplyr
library(dplyr)
DF <- as.data.frame(dataMat, stringsAsFactors = FALSE)
# Use nrow(DF) here so this section does not depend on DT from the data.table section
DF %>% mutate(idx = rep(1:ceiling(nrow(DF)/3), each = 3L, length.out = nrow(DF))) %>%
  group_by(idx, V1, V2, V3) %>% summarise(count = n())
# idx V1 V2 V3 count
# <int> <chr> <chr> <chr> <int>
# 1 1 O11 R3 O12 1
# 2 1 S11 R1 O11 1
# 3 1 S11 R2 O12 1
# 4 2 O21 R3 O22 1
# 5 2 S21 R1 O21 1
# 6 2 S21 R2 O22 1
# 7 3 O11 R3 O12 1
# 8 3 S11 R1 O11 1
# 9 3 S11 R2 O12 1
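The idx column only labels the three-row blocks, so every count above is still 1. One possible way to get all the way to the r.data layout from the question is to collapse each block into a single row first and then count the identical rows. This is only a sketch in base R, and it assumes that the number of rows is a multiple of three and that every block is laid out as (S, R1, O1), (S, R2, O2), (O1, R3, O2), as in the sample data. The names first, groups and rData are made up for illustration.

```r
# Sketch (assumption): rows 1-3, 4-6, ... each form one complete group,
# laid out as (S, R1, O1), (S, R2, O2), (O1, R3, O2).
dataMat <- matrix(c("S11","R1","O11",
                    "S11","R2","O12",
                    "O11","R3","O12",
                    "S21","R1","O21",
                    "S21","R2","O22",
                    "O21","R3","O22",
                    "S11","R1","O11",
                    "S11","R2","O12",
                    "O11","R3","O12"), ncol = 3, byrow = TRUE)

# Guard the assumption that the data splits evenly into blocks of three
stopifnot(nrow(dataMat) %% 3L == 0L)

# Row index of the first row of every three-row block
first <- seq(1L, nrow(dataMat), by = 3L)

# One row per block: S and O1 from the block's first row, O2 from its second
groups <- data.frame(S  = dataMat[first, 1L],
                     O1 = dataMat[first, 3L],
                     O2 = dataMat[first + 1L, 3L],
                     stringsAsFactors = FALSE)

# Count identical blocks by summing a column of ones per unique row
rData <- aggregate(list(count = rep(1L, nrow(groups))), groups, sum)
print(rData)
```

On the sample data this gives one row for S11/O11/O12 with count 2 and one for S21/O21/O22 with count 1. If the blocks can differ in other cells as well, pasting all nine cells of a block into one key would be a safer way to define "identical".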
6F:→ allen1985: Thanks again, let me study this and add it to my program 03/19 13:17
You're welcome. I didn't understand your question at first, sorry Orz
※ Edited: celestialgod (36.235.90.162), 03/19/2017 13:20:23