I slightly modified your example to include 2 different ids. Also, I'm not sure what you mean by "significant correlation": a large value, or statistically significant? I've covered both cases here.
1. Correlation value and p value
library(dplyr)
# example dataset
dt = read.table(text="hh_ids date income consumption alcohol cleaning_materials clothing
KELDK01 2012-11-1 62.70588 40.52941 0 0.000000 0.000000
KELDK01 2012-12-1 17.64706 42.43530 0 1.058824 7.058824
KELDK01 2013-01-1 91.76471 48.23529 0 0.000000 0.000000
KELDK01 2013-02-1 91.76470 107.52940 0 0.000000 0.000000
KELDK01 2013-03-1 116.47060 114.47060 0 0.000000 0.000000
KELDK01 2013-04-1 124.41180 118.29410 0 2.705882 17.647060
KELDK02 2013-05-1 137.23530 105.00000 0 1.411765 1.882353
KELDK02 2013-06-1 131.52940 109.54120 0 4.352942 2.941176
KELDK02 2013-07-1 121.52940 113.47060 0 2.352941 25.882350
KELDK02 2013-08-1 123.32940 86.50588 0 2.588235 2.941176",
sep="", header=T, stringsAsFactors = F)
dt
# hh_ids date income consumption alcohol cleaning_materials clothing
# 1 KELDK01 2012-11-1 62.70588 40.52941 0 0.000000 0.000000
# 2 KELDK01 2012-12-1 17.64706 42.43530 0 1.058824 7.058824
# 3 KELDK01 2013-01-1 91.76471 48.23529 0 0.000000 0.000000
# 4 KELDK01 2013-02-1 91.76470 107.52940 0 0.000000 0.000000
# 5 KELDK01 2013-03-1 116.47060 114.47060 0 0.000000 0.000000
# 6 KELDK01 2013-04-1 124.41180 118.29410 0 2.705882 17.647060
# 7 KELDK02 2013-05-1 137.23530 105.00000 0 1.411765 1.882353
# 8 KELDK02 2013-06-1 131.52940 109.54120 0 4.352942 2.941176
# 9 KELDK02 2013-07-1 121.52940 113.47060 0 2.352941 25.882350
# 10 KELDK02 2013-08-1 123.32940 86.50588 0 2.588235 2.941176
# create a function that calculates correlation and p value for 2 variables of a given id
Get_cor_and_pval = function(d,n1,n2,id){
  # create 2 vectors based on names of variables and the id
  # (index with d, the data frame passed in, not the global dt)
  x = d[,n1][d$hh_ids==id]
  y = d[,n2][d$hh_ids==id]
  # calculate correlation and p value
  test = cor.test(x,y)
  c = test$estimate # keep correlation value
  p = test$p.value # keep p value
  return(data.frame(c = c, p = p, row.names = NULL))
}
# specify combinations of variables to calculate correlation
names1 = "clothing"
names2 = c("income","consumption","alcohol","cleaning_materials")
dt_combs = expand.grid(names1=names1, names2=names2, stringsAsFactors = F)
dt_combs
# names1 names2
# 1 clothing income
# 2 clothing consumption
# 3 clothing alcohol
# 4 clothing cleaning_materials
# process to get correlations and p values for each variable combination and each id
dt %>%
select(hh_ids) %>% distinct() %>% # select unique ids
group_by(hh_ids) %>% # for each id
do(data.frame(.,dt_combs)) %>% # get all combinations of interest
rowwise() %>% # for each id and combination
do(data.frame(., # keep id and combination
Get_cor_and_pval(dt,.$names1,.$names2,.$hh_ids), # get correlation and p value
stringsAsFactors=F)) %>% # factor variables as character
ungroup() # forget groupings
# # A tibble: 8 x 5
# hh_ids names1 names2 c p
# * <chr> <fctr> <chr> <dbl> <dbl>
# 1 KELDK01 clothing income 0.1713298 7.455198e-01
# 2 KELDK01 clothing consumption 0.3220463 5.336309e-01
# 3 KELDK01 clothing alcohol NA NA
# 4 KELDK01 clothing cleaning_materials 0.9999636 1.989337e-09
# 5 KELDK02 clothing income -0.6526867 3.473133e-01
# 6 KELDK02 clothing consumption 0.5376850 4.623150e-01
# 7 KELDK02 clothing alcohol NA NA
# 8 KELDK02 clothing cleaning_materials -0.1416633 8.583367e-01
The final data frame shows the correlation for every pair of interest, for each id. The alcohol variable is always 0, so its standard deviation is zero and the correlation is undefined, which produces the NA values. You can apply your own filters to keep the rows you want.
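As a minimal sketch of such filters, the snippet below re-creates the result table from the output above and applies one filter per interpretation of "significant". The object names (res, res_signif, res_large) and the cutoffs (0.05 for the p value, 0.5 for the absolute correlation) are my own assumptions; pick whatever thresholds make sense for your data.

```r
library(dplyr)

# result table copied from the output above (c = correlation, p = p value)
res = data.frame(
  hh_ids = rep(c("KELDK01","KELDK02"), each = 4),
  names1 = "clothing",
  names2 = rep(c("income","consumption","alcohol","cleaning_materials"), 2),
  c = c(0.1713298, 0.3220463, NA, 0.9999636,
        -0.6526867, 0.5376850, NA, -0.1416633),
  p = c(7.455198e-01, 5.336309e-01, NA, 1.989337e-09,
        3.473133e-01, 4.623150e-01, NA, 8.583367e-01),
  stringsAsFactors = F)

# statistically significant: p value below an assumed threshold of 0.05
res_signif = res %>% filter(!is.na(p), p < 0.05)

# large value: absolute correlation above an assumed cutoff of 0.5
res_large = res %>% filter(!is.na(c), abs(c) > 0.5)
```

The explicit !is.na() checks are only there to make it obvious that the alcohol rows are dropped; filter() would discard rows where the condition is NA anyway.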
Note that the rowwise() / do() process above works fine for 300 ids and 6 variables. For a much larger number of ids (millions) and many variables it may become slow, and there may be a more efficient way to do it.
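One option that might be worth trying on larger data is to let dplyr handle the grouping and call cor.test() once per id inside summarise(), instead of subsetting the full data frame for every row. This is only a sketch for a single pair of variables (clothing and income), I have not benchmarked it, and the name res_pair is just for illustration. It uses only three columns of the example data.

```r
library(dplyr)

# same example data, keeping only the columns needed for this pair
dt = read.table(text="hh_ids income clothing
KELDK01 62.70588 0.000000
KELDK01 17.64706 7.058824
KELDK01 91.76471 0.000000
KELDK01 91.76470 0.000000
KELDK01 116.47060 0.000000
KELDK01 124.41180 17.647060
KELDK02 137.23530 1.882353
KELDK02 131.52940 2.941176
KELDK02 121.52940 25.882350
KELDK02 123.32940 2.941176",
header=T, stringsAsFactors = F)

res_pair = dt %>%
  group_by(hh_ids) %>%                                # for each id
  summarise(c = cor(clothing, income),                # correlation value
            p = cor.test(clothing, income)$p.value)   # p value

res_pair
```

This should give the same correlation and p value as the "clothing / income" rows of the table above. Whether it is actually faster for millions of ids would need to be measured on your own data.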
2. Correlation value
If you're only interested in the correlation values and not the p values, the code is much shorter:
library(dplyr)
# example dataset
dt = read.table(text="hh_ids date income consumption alcohol cleaning_materials clothing
KELDK01 2012-11-1 62.70588 40.52941 0 0.000000 0.000000
KELDK01 2012-12-1 17.64706 42.43530 0 1.058824 7.058824
KELDK01 2013-01-1 91.76471 48.23529 0 0.000000 0.000000
KELDK01 2013-02-1 91.76470 107.52940 0 0.000000 0.000000
KELDK01 2013-03-1 116.47060 114.47060 0 0.000000 0.000000
KELDK01 2013-04-1 124.41180 118.29410 0 2.705882 17.647060
KELDK02 2013-05-1 137.23530 105.00000 0 1.411765 1.882353
KELDK02 2013-06-1 131.52940 109.54120 0 4.352942 2.941176
KELDK02 2013-07-1 121.52940 113.47060 0 2.352941 25.882350
KELDK02 2013-08-1 123.32940 86.50588 0 2.588235 2.941176",
sep="", header=T, stringsAsFactors = F)
dt %>%
group_by(hh_ids) %>% # for each id
do(data.frame(cor(.[,3:7]))[5,]) %>% # keep columns 3 to 7 (numeric columns), get the correlation matrix and keep row 5 (row for clothing versus all other variables)
ungroup()
# # A tibble: 2 x 6
# hh_ids income consumption alcohol cleaning_materials clothing
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 KELDK01 0.1713298 0.3220463 NA 0.9999636 1
# 2 KELDK02 -0.6526867 0.5376850 NA -0.1416633 1
And an alternative using the corrr package:
library(dplyr)
library(corrr)
# example dataset
dt = read.table(text="hh_ids date income consumption alcohol cleaning_materials clothing
KELDK01 2012-11-1 62.70588 40.52941 0 0.000000 0.000000
KELDK01 2012-12-1 17.64706 42.43530 0 1.058824 7.058824
KELDK01 2013-01-1 91.76471 48.23529 0 0.000000 0.000000
KELDK01 2013-02-1 91.76470 107.52940 0 0.000000 0.000000
KELDK01 2013-03-1 116.47060 114.47060 0 0.000000 0.000000
KELDK01 2013-04-1 124.41180 118.29410 0 2.705882 17.647060
KELDK02 2013-05-1 137.23530 105.00000 0 1.411765 1.882353
KELDK02 2013-06-1 131.52940 109.54120 0 4.352942 2.941176
KELDK02 2013-07-1 121.52940 113.47060 0 2.352941 25.882350
KELDK02 2013-08-1 123.32940 86.50588 0 2.588235 2.941176",
sep="", header=T, stringsAsFactors = F)
dt %>%
group_by(hh_ids) %>% # for each id
do(correlate(.[,3:7]) %>% focus(clothing)) %>% # keep columns 3 to 7, get correlations but return ones that have to do with variable "clothing"
ungroup()
# # A tibble: 8 x 3
# hh_ids rowname clothing
# <chr> <chr> <dbl>
# 1 KELDK01 income 0.1713298
# 2 KELDK01 consumption 0.3220463
# 3 KELDK01 alcohol NA
# 4 KELDK01 cleaning_materials 0.9999636
# 5 KELDK02 income -0.6526867
# 6 KELDK02 consumption 0.5376850
# 7 KELDK02 alcohol NA
# 8 KELDK02 cleaning_materials -0.1416633
Would you mind showing us what you have tried so far? – shayaa