2017-10-17 28 views
1

Easy way to create an xarray Dataset from metadata + values?

I'm working with single-cell RNA-sequencing data, which lately is 10k-100k samples (cells) x 20k features (genes) of sparse values, and it also includes a lot of metadata, e.g. the tissue of origin ("Brain" vs "Liver"). The metadata is ~10-100 columns, and I store it as a pandas.DataFrame. Right now, I build xarray.Datasets by converting the metadata to a dict and adding the entries as coordinates. It seems clunky and error-prone, since I copy the snippet between notebooks. Is there an easier way?

cell_metadata_dict = cell_metadata.to_dict(orient='list') 
coords = {k: ('cell', v) for k, v in cell_metadata_dict.items()} 
coords.update(dict(gene=counts.columns, cell=counts.index)) 

ds = xr.Dataset(
    {'counts': (['cell', 'gene'], counts), 
    }, 
    coords=coords) 

EDIT:

To show example data, here is the output of cell_metadata.head().to_csv():

cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex 
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 

and of counts.iloc[:5, :20].to_csv():

cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik 
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37 
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65 

Re: pandas.DataFrame.to_xarray() - this is incredibly slow, and it seems weird to me to encode so much numeric and categorical data as a 100-level MultiIndex. That, and every time I've tried to use a MultiIndex it has come back to bite me ("oh, and this is why I don't use MultiIndex"), so I go back to keeping the metadata and the counts data separate.

+0

Can you provide an example of your DataFrame ('df.head()') and a detailed description of your target Dataset or DataArray? Have you tried using the pandas to_xarray() method? – jhamman

+0

To add to Joe's comment, take a look at the [working with pandas](http://xarray.pydata.org/en/stable/pandas.html) section of the xarray docs to see if that helps. If you can define the appropriate 'pandas.MultiIndex' for your data, converting to xarray is *usually* pretty easy. – shoyer

Answer

0

Xarray uses the pandas index/column labels as metadata by default. You can convert in a single function call when all of your variables share the same dimensions, but if different variables have different dimensions, you need to convert them from pandas separately and put them together on the xarray side. For example:

import pandas as pd 
import io 
import xarray 

# read your data 
cell_metadata = pd.read_csv(io.StringIO(u"""\
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex 
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F""")) 
counts = pd.read_csv(io.StringIO(u"""\
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik 
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37 
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65""")) 

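# build the output: the count columns become the 'gene' dimension,
# and each metadata column becomes a coordinate along 'cell'
xarray_counts = xarray.DataArray(counts.set_index('cell'), dims=['cell', 'gene'])
xarray_counts.coords.update(cell_metadata.set_index('cell').to_xarray())
print(xarray_counts)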

This results in a nice, tidy xarray.DataArray for the counts:

<xarray.DataArray (cell: 5, gene: 20)> 
array([[308, 289, 81, 0, 4, 88, 52, 0, 0, 104, 65, 0, 1, 0, 
      9, 8, 12, 283, 12, 37], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [375, 325, 70, 0, 2, 72, 36, 13, 0, 60, 105, 0, 13, 0, 
      0, 29, 15, 264, 0, 65]]) 
Coordinates: 
    * cell       (cell) object 'A1-MAA100140-3_57_F-1-1' ... 
    * gene       (gene) object '0610005C13Rik' ... 
    Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717 
    Number of input reads   (cell) int64 502312 360285 431800 446705 918 
    EXP_ID      (cell) object '170928_A00111_0068_AH3YKKDMXX' ... 
    TAXON       (cell) object 'mus' 'mus' 'mus' 'mus' 'mus' 
    WELL_MAPPING     (cell) object 'MAA100140' 'MAA100140' ... 
    Lysis Plate Batch    (cell) float64 nan nan nan nan nan 
    dNTP.batch     (cell) float64 nan nan nan nan nan 
    oligodT.order.no    (cell) float64 nan nan nan nan nan 
    plate.type     (cell) object 'Biorad 96well' ... 
    preparation.site    (cell) object 'Stanford' 'Stanford' ... 
    date.prepared     (cell) float64 nan nan nan nan nan 
    date.sorted     (cell) int64 170720 170720 170720 170720 ... 
    tissue      (cell) object 'Liver' 'Liver' 'Liver' ... 
    subtissue      (cell) object 'Hepatocytes' 'Hepatocytes' ... 
    mouse.id      (cell) object '3_57_F' '3_57_F' '3_57_F' ... 
    FACS.selection    (cell) float64 nan nan nan nan nan 
    nozzle.size     (cell) float64 nan nan nan nan nan 
    FACS.instument    (cell) float64 nan nan nan nan nan 
    Experiment ID     (cell) float64 nan nan nan nan nan 
    Columns sorted    (cell) float64 nan nan nan nan nan 
    Double check     (cell) float64 nan nan nan nan nan 
    Plate       (cell) float64 nan nan nan nan nan 
    Location      (cell) float64 nan nan nan nan nan 
    Comments      (cell) float64 nan nan nan nan nan 
    mouse.age      (cell) int64 3 3 3 3 3 
    mouse.number     (cell) int64 57 57 57 57 57 
    mouse.sex      (cell) object 'F' 'F' 'F' 'F' 'F' 
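
The single-call conversion mentioned above, for a table whose columns all share one dimension, can be sketched on a toy frame. The frame `toy_metadata` and its values are made up for illustration and are not taken from the data in the question. The point is that a flat (single-level) index becomes one dimension, with no MultiIndex involved:

```python
import pandas as pd

# Toy metadata table (hypothetical values), indexed by cell name
toy_metadata = pd.DataFrame(
    {"tissue": ["Liver", "Liver", "Brain"], "mouse.age": [3, 3, 18]},
    index=pd.Index(["cell_a", "cell_b", "cell_c"], name="cell"),
)

# A single flat index becomes one dimension ('cell');
# every column becomes a 1-D variable along that dimension
metadata_ds = toy_metadata.to_xarray()
print(metadata_ds)
```

The resulting Dataset has one variable per metadata column, which is what gets attached to the counts as coordinates in the answer's code.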

If you want a Dataset instead, put the DataArray objects into the Dataset constructor, e.g.,

# shouldn't really need to use .data_vars here, that might be an xarray bug
>>> xarray.Dataset({'counts': xarray.DataArray(counts.set_index('cell'),
...                                            dims=['cell', 'gene'])},
...                coords=cell_metadata.set_index('cell').to_xarray().data_vars)
<xarray.Dataset>
Dimensions:      (cell: 5, gene: 20)
Coordinates: 
    * cell       (cell) object 'A1-MAA100140-3_57_F-1-1' ... 
    * gene       (gene) object '0610005C13Rik' ... 
    Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717 
    Number of input reads   (cell) int64 502312 360285 431800 446705 918 
    EXP_ID      (cell) object '170928_A00111_0068_AH3YKKDMXX' ... 
    TAXON       (cell) object 'mus' 'mus' 'mus' 'mus' 'mus' 
    WELL_MAPPING     (cell) object 'MAA100140' 'MAA100140' ... 
    Lysis Plate Batch    (cell) float64 nan nan nan nan nan 
    dNTP.batch     (cell) float64 nan nan nan nan nan 
    oligodT.order.no    (cell) float64 nan nan nan nan nan 
    plate.type     (cell) object 'Biorad 96well' ... 
    preparation.site    (cell) object 'Stanford' 'Stanford' ... 
    date.prepared     (cell) float64 nan nan nan nan nan 
    date.sorted     (cell) int64 170720 170720 170720 170720 ... 
    tissue      (cell) object 'Liver' 'Liver' 'Liver' ... 
    subtissue      (cell) object 'Hepatocytes' 'Hepatocytes' ... 
    mouse.id      (cell) object '3_57_F' '3_57_F' '3_57_F' ... 
    FACS.selection    (cell) float64 nan nan nan nan nan 
    nozzle.size     (cell) float64 nan nan nan nan nan 
    FACS.instument    (cell) float64 nan nan nan nan nan 
    Experiment ID     (cell) float64 nan nan nan nan nan 
    Columns sorted    (cell) float64 nan nan nan nan nan 
    Double check     (cell) float64 nan nan nan nan nan 
    Plate       (cell) float64 nan nan nan nan nan 
    Location      (cell) float64 nan nan nan nan nan 
    Comments      (cell) float64 nan nan nan nan nan 
    mouse.age      (cell) int64 3 3 3 3 3 
    mouse.number     (cell) int64 57 57 57 57 57 
    mouse.sex      (cell) object 'F' 'F' 'F' 'F' 'F' 
Data variables: 
    counts      (cell, gene) int64 308 289 81 0 4 88 52 0 ...
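
Since the question mentions copying the snippet between notebooks, one option is to wrap the construction in a small function. The following is only a sketch: the name `counts_to_dataset` and its parameters are hypothetical, it assumes both tables are already indexed by cell, and it assumes no metadata column is named like one of the dimensions or like `counts`. It builds the coordinates explicitly rather than going through `to_xarray()`:

```python
import pandas as pd
import xarray as xr


def counts_to_dataset(counts, cell_metadata, cell_dim="cell", gene_dim="gene"):
    """Combine a cells x genes count table with per-cell metadata (sketch)."""
    # Put the metadata rows in the same order as the rows of the count table;
    # cells missing from the metadata end up with NaN in every column
    metadata = cell_metadata.reindex(counts.index)
    counts_da = xr.DataArray(
        counts.values,
        dims=[cell_dim, gene_dim],
        coords={cell_dim: counts.index.values, gene_dim: counts.columns.values},
    )
    # One non-index coordinate per metadata column, all along the cell dimension
    coords = {name: (cell_dim, column.values) for name, column in metadata.items()}
    return xr.Dataset({"counts": counts_da}, coords=coords)


# Usage on toy data (hypothetical values); the metadata rows are deliberately
# in a different order than the count rows
toy_counts = pd.DataFrame(
    [[1, 0], [0, 3]],
    index=pd.Index(["c1", "c2"], name="cell"),
    columns=["g1", "g2"],
)
toy_meta = pd.DataFrame(
    {"tissue": ["Liver", "Brain"]},
    index=pd.Index(["c2", "c1"], name="cell"),
)
ds = counts_to_dataset(toy_counts, toy_meta)
print(ds)
```

Reindexing the metadata against the count table's index is a design choice made here so that rows are matched by cell name rather than by position.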