How to build a sparse matrix in PySpark?

I am new to Spark. I would like to build a sparse matrix, specifically a user-id by item-id matrix for a recommendation engine. I know how I would do this in Python. How is it done in PySpark? Below is the table as it looks now, followed by how I would have built the matrix with NumPy.

Session ID | Item ID | Rating
1          | 2       | 1
1          | 3       | 5

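    import numpy as np

    # df is a pandas DataFrame holding the table above
    data = df[['session_id', 'item_id', 'rating']].values

    # Distinct IDs, plus the position of each row's ID among them
    rows, row_pos = np.unique(data[:, 0], return_inverse=True)
    cols, col_pos = np.unique(data[:, 1], return_inverse=True)

    # One cell per (session, item) pair; unrated pairs stay 0
    pivot_table = np.zeros((len(rows), len(cols)), dtype=data.dtype)
    pivot_table[row_pos, col_pos] = data[:, 2]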


2016-06-30 ashish trehan

Take a look at SparseVector: https://spark.apache.org/docs/1.1.0/api/python/pyspark.mllib.linalg.SparseVector-class.html – Gopala

Like this:

>>> from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry 
>>> table = sqlContext.createDataFrame(
...  sc.parallelize([[1, 2, 1], [1, 3, 5]]) 
...) 
>>> mat = CoordinateMatrix(table.rdd.map(lambda row: MatrixEntry(*row)))
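
In case it helps to see what the coordinate format stores, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The helper `index_map` and the variable names are my own and are not part of PySpark. The remapping step mirrors `np.unique(..., return_inverse=True)` from the question. The snippet above skips that step and passes the raw IDs straight through as row and column indices.

```python
# Plain-Python sketch of the coordinate (COO) layout: only rated cells are stored.
# (session_id, item_id, rating) rows taken from the question's table.
rows = [(1, 2, 1), (1, 3, 5)]


def index_map(values):
    """Hypothetical helper: map each distinct ID to a contiguous 0-based index."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


session_index = index_map(r[0] for r in rows)
item_index = index_map(r[1] for r in rows)

# One (row, column, value) triple per rating, the same three fields
# that a MatrixEntry carries.
entries = [(session_index[s], item_index[i], float(v)) for s, i, v in rows]

# Matrix dimensions after remapping: distinct sessions x distinct items.
shape = (len(session_index), len(item_index))

print(entries)  # [(0, 0, 1.0), (0, 1, 5.0)]
print(shape)    # (1, 2)
```

Whether the remapping is worth doing depends on the IDs: if they are already small and dense, using them directly as indices, as in the answer, is simpler.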


2016-06-30 23:27:31
