How to create a TF-IDF for text classification using Spark?
I have a CSV file in the following format:
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
Each product_idX is an integer and each product_titleX is a string, for example:
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
I am using Spark with Scala so far, following the tutorials I found on the official page and the Berkeley AmpCamp 3 and 4.
So I read the file:
val file = sc.textFile("offers.csv")
Then I map it into tuples, an RDD[Array[String]]:
val tuples = file.map(line => line.split(",")).cache
And after that I transform the tuples into pairs, an RDD[(Int, String)]:
val pairs = tuples.map(line => (line(0).toInt, line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
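As a rough local illustration (plain Python rather than Spark, with made-up document and token ids), the two structures relate as sketched below. `build_inv_index` is a hypothetical helper name, not part of pyspark. Each document is listed only once per token, so the length of a token's list is its document frequency.

```python
from collections import defaultdict

def build_inv_index(corpus):
    """Invert {document_id: [token_ids]} into {token_id: [document_ids]}."""
    inv_index = defaultdict(set)
    for doc_id, token_ids in corpus.items():
        for token_id in token_ids:
            # A set keeps each document at most once per token.
            inv_index[token_id].add(doc_id)
    # Sort the document ids so the output is deterministic.
    return {token_id: sorted(doc_ids) for token_id, doc_ids in inv_index.items()}

# Toy data: all ids are invented for illustration.
corpus = {1: [10, 11, 10], 2: [11, 12]}
inv_index = build_inv_index(corpus)
```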
To get tf we need to count the number of occurrences of each token in each document. So
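from collections import Counter

def wc_per_row(row):
    # Count how many times each token occurs in one document.
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return list(cnt.items())

# Index into the pair inside the lambda; tuple parameters are not valid in Python 3.
tf = corpus.map(lambda kv: (kv[0], wc_per_row(kv[1])))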
The df is simply the length of each term's inverted index. From that we can calculate the idf.
from math import log10

df = inv_index.map(lambda kv: (kv[0], len(kv[1])))
num_documents = tf.count()
# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
# float() avoids integer division under Python 2.
idf = df.map(lambda kv: (kv[0], 1. + log10(float(num_documents) / kv[1]))).collect()
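Written out, the idf computed above is the following, where N is the number of documents and df(t) is the number of documents containing term t. The added 1 and the base-10 logarithm are just the choices made in this code; other idf variants exist.

```latex
\mathrm{idf}(t) = 1 + \log_{10}\frac{N}{\mathrm{df}(t)}
```

For example, with N = 1000 and df(t) = 10, this gives idf(t) = 1 + log10(100) = 3.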
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
    # Pair each (token, tf) with the matching (token, idf) and multiply them.
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples
            for (k2, v2) in idf_tuples if k1 == k2]

tfidf = tf.map(lambda kv: (kv[0], calc_tfidf(kv[1], idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
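One possible improvement, offered as a sketch rather than a tested Spark job: turn the collected idf list into a dict, so each lookup is constant time instead of a scan over every term. The function below is plain Python, and `calc_tfidf_dict` is a hypothetical name. In Spark, the dict could be shipped to the workers once with `sc.broadcast(...)`, or the collect could be avoided altogether by keying both RDDs on token_id and using `join`; neither variant is shown here.

```python
def calc_tfidf_dict(tf_tuples, idf_by_token):
    """Multiply each term frequency by its idf using a dict lookup."""
    return [(token_id, count * idf_by_token[token_id])
            for token_id, count in tf_tuples
            if token_id in idf_by_token]

# Toy values, invented for illustration.
idf = [(10, 2.0), (11, 1.0), (12, 3.0)]   # same shape as the collected idf list
idf_by_token = dict(idf)
doc_tf = [(10, 2), (11, 1)]               # (token_id, count) pairs for one document
doc_tfidf = calc_tfidf_dict(doc_tf, idf_by_token)
```

Tokens missing from the idf table are skipped, which matches the behaviour of the nested-loop version.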
And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
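A minimal local sketch of that preprocessing step, assuming lowercasing and whitespace tokenization (both are assumptions, since nothing here says how tokens should be produced). `tokenize` and `build_vocab` are hypothetical helper names, and the second product is invented for illustration.

```python
def tokenize(title):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return title.lower().split()

def build_vocab(token_lists):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            # Only assigns a new id when the token has not been seen yet.
            vocab.setdefault(token, len(vocab))
    return vocab

# The first title comes from the question; the second row is made up.
titles = {453478692: "Apple iPhone 4 8Go", 1: "Apple iPad"}
tokens = {doc_id: tokenize(title) for doc_id, title in titles.items()}
vocab = build_vocab(tokens.values())
# document_id -> [token_ids], the shape used for corpus above.
corpus = {doc_id: [vocab[t] for t in toks] for doc_id, toks in tokens.items()}
```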
If anyone can improve on this, I'm very interested.