Splitting data in a recommender system

Problem description:

I have a Spark DataFrame of UserID, ItemID, Rating. I am building a recommender system.

The data looks like this:

originalDF.show(5)
+----+----+------+
|user|item|rating|
+----+----+------+
| 353|   0|     1|
| 353|   1|     1|
| 353|   2|     1|
| 354|   3|     1|
| 354|   4|     1|
+----+----+------+

It has 56K unique users and 8.5K unique items.

Each UserID has a record (row) for each item it rated, along with the corresponding rating, so there are multiple records per user ID.

Now I split this into train, validation and test sets with a random 0.6/0.2/0.2 split, so 60% of the records (chosen at random) go to training, 20% to validation and the remaining 20% to test, as below:

random_split = originalDF.randomSplit(split_perc, seed=20)

return random_split[0], random_split[1], random_split[2]
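
For context, these two lines appear to be the body of the train_test_split helper called below; a minimal sketch of the full function, under that assumption (only the wrapper is new, the split itself is unchanged):

def train_test_split(df, split_perc, seed=20):
    # randomSplit assigns each row to a split independently at random
    # according to the given weights; there is no per-user stratification.
    random_split = df.randomSplit(split_perc, seed=seed)
    return random_split[0], random_split[1], random_split[2]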

This leaves me with the following dataset counts:

train, validation, test = train_test_split(split_sdf, [0.6, 0.2, 0.2])

print "Training size is {}".format(train.count())
print "Validation size is {}".format(validation.count())
print "Test size is {}".format(test.count())
print "Original Dataset Size is {}".format(split_sdf.count())
Training size is 179950
Validation size is 59828
Test size is 60223
Original Dataset Size is 300001

Now I train Spark's pyspark.ml.ALS algorithm on the training data.

from pyspark.ml.recommendation import ALS

als = ALS(rank=120, maxIter=15, regParam=0.01, implicitPrefs=True)
model = als.fit(train)

When I check the userFactors and itemFactors from the model object I get this:

itemF = model.itemFactors
itemF.toPandas().shape
# (7686, 2)

userF = model.userFactors
userF.toPandas().shape
# (47176, 2)

Which means it is only giving me factor matrices covering the unique users and items present in the training data.
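
For what it's worth, the gap can be confirmed with an anti-join between the full data and the training split (a sketch, not part of the original code; it assumes the "left_anti" join type is available in your Spark version):

# Count users and items from the full dataset that never made it into train
missing_users = (originalDF.select("user").distinct()
                 .join(train.select("user").distinct(), "user", "left_anti")
                 .count())
missing_items = (originalDF.select("item").distinct()
                 .join(train.select("item").distinct(), "item", "left_anti")
                 .count())
print "Users missing from train: {}".format(missing_users)
print "Items missing from train: {}".format(missing_items)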

Now how do I get predictions for all the items for each user?

If I do

prediction = model.transform(originalDF)

where originalDF is the whole dataset that was split into train, val and test, would that give predictions for all items for each user?
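
If the goal is a score for every (user, item) combination rather than only the pairs that happen to be in originalDF, one common pattern is to cross-join the distinct users and items and score that (a sketch, assuming a Spark version with DataFrame.crossJoin; pairs whose user or item was never seen during fit still come back as NaN, or are dropped if the ALS coldStartStrategy="drop" option is available and set):

# Score every possible (user, item) pair with the trained model
all_pairs = (originalDF.select("user").distinct()
             .crossJoin(originalDF.select("item").distinct()))
all_predictions = model.transform(all_pairs)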

My question is: since my dataset has 56K users X 8.5K items, I want the prediction matrix for the full 56K X 8.5K, not just the 47K X 7.6K covered by the training data.

What am I doing wrong here? I understand the model only covers the 47K X 7.6K training data instead of the original 56K X 8.5K ratings data. So am I splitting the data into train/val the wrong way?

I know that for a recommender system one should randomly mask some ratings of some items for each user, use the remaining ratings for training, and test on those masked values. I did the same here, since each record for a user is a rating for a different item: when we split randomly we are essentially masking some of a user's ratings and not using them for training.

Please advise.

In a typical recommender system with a user X item matrix (56K users X 8.5K items):

We basically mask (set to 0) some random item ratings for each user. Then the whole matrix is passed to the recommender algorithm, which factorizes it into a product of two factor matrices.
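
A toy illustration of that masking step (made-up numbers, plain NumPy, not part of the original question):

import numpy as np

# Toy user x item ratings matrix; 0 already means "no rating"
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]])

rng = np.random.RandomState(20)
mask = (R > 0) & (rng.rand(*R.shape) < 0.2)   # hold out ~20% of the known ratings
R_train = np.where(mask, 0, R)                # masked entries zeroed for training
R_heldout = np.where(mask, R, 0)              # kept aside for evaluation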

However, in Spark we don't use a user X item matrix. Instead of having 8.5K item columns, we basically put each item's rating as an individual row for each user.

So masking (setting some item ratings to 0) in the original user-item matrix is the same as not using some random rows for each user in the Spark DataFrame, right?

Here is one way I found (which is what I used too) to split the data into train and val:

training_RDD, validation_RDD, test_RDD = small_ratings_data.randomSplit([6, 2, 2], seed=0L)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))
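
Those (user, item) pairs are typically scored with the RDD-based MLlib model rather than the DataFrame-based one above (a sketch under that assumption; the rank/iteration values are placeholders):

from pyspark.mllib.recommendation import ALS as MllibALS

# Train on the RDD split and score the held-out (user, item) pairs;
# predictAll only returns predictions for users/items seen during training.
mllib_model = MllibALS.train(training_RDD, rank=8, iterations=10, lambda_=0.01)
predictions = mllib_model.predictAll(validation_for_predict_RDD)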

I used a similar randomSplit here too, so I am not sure what is wrong.

I can understand that since the training data does not contain all users and items, the factor matrices will also only contain that many user and item factors. So how do I overcome that? In the end I basically need a matrix of predictions for all users and items.

All the IDs:

  • users
  • products

for which you want predictions have to be present in the training set. Using a random split is not a method that can be used to ensure that (it is not equivalent to data masking).
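
One way to approximate per-user masking so that every user is guaranteed to appear in the training set is a stratified split with a window function (a sketch, not from the original answer; items can still end up missing from train, so rare items may need separate handling):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank each user's ratings in a (seeded) random order, then keep the
# first 60% of each user's rows for training; the rest is held out.
w = Window.partitionBy("user").orderBy(F.rand(seed=20))
ranked = originalDF.withColumn("pct", F.percent_rank().over(w))

train = ranked.where(F.col("pct") <= 0.6).drop("pct")
holdout = ranked.where(F.col("pct") > 0.6).drop("pct")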