feat(6-1): implement TF-IDF and cosine similarity computation
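- Add tokenization and data parsing
- Implement inverse document frequency (IDF) computation
- Compute TF-IDF weights
- Add vector norm computation
- Implement an inverted index and fast cosine similarity computation
- Process the full dataset and compute similarities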
parent 5770bc266e
commit d611a30082
6
6-1.py
@@ -138,6 +138,12 @@ print(similaritiesFullRDD.count())
 # Assume goldStandard already exists
 # goldStandard: RDD of ((Amazon ID, Google URL), 1) for true duplicates
 
+# Define goldStandard
+goldStandard = sc.parallelize([
+    (("b00005lzly", "http://www.google.com/base/feeds/snippets/13823221823254120257"), 1),
+    # Add other true duplicate records
+])
+
 # Create simsFullRDD and simsFullValuesRDD
 simsFullRDD = similaritiesFullRDD.map(lambda x: ("%s %s" % (x[0][0], x[0][1]), x[1]))
 simsFullValuesRDD = simsFullRDD.map(lambda x: x[1]).cache()
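Note on key shapes: in this hunk, simsFullRDD is keyed by a single string of the form "<Amazon ID> <Google URL>", while the newly defined goldStandard is keyed by an (Amazon ID, Google URL) tuple. If the two are later matched by key (an assumption; no join appears in this hunk), they would need the same key format. Below is a minimal plain-Python sketch of the issue, with dicts standing in for the RDDs. The helper `pair_key` is hypothetical and not part of 6-1.py, and the similarity score 0.5 is made up for illustration.

```python
# Plain-Python stand-in for the RDDs in the diff; no Spark required.
# pair_key is a hypothetical helper, not part of 6-1.py.

def pair_key(amazon_id, google_url):
    """Build the "<Amazon ID> <Google URL>" string key that simsFullRDD uses."""
    return "%s %s" % (amazon_id, google_url)

GOOGLE_URL = "http://www.google.com/base/feeds/snippets/13823221823254120257"

# goldStandard as defined in the commit: keys are (Amazon ID, Google URL) tuples.
gold_standard = {
    ("b00005lzly", GOOGLE_URL): 1,
}

# Similarities keyed like simsFullRDD (string keys). The score is illustrative.
sims_full = {
    pair_key("b00005lzly", GOOGLE_URL): 0.5,
}

# A tuple key never equals a string key, so a naive lookup finds nothing.
naive_matches = [k for k in gold_standard if k in sims_full]

# Re-keying the gold standard with the same helper makes the keys line up.
gold_by_string = {pair_key(a, g): v for (a, g), v in gold_standard.items()}
aligned_matches = [k for k in gold_by_string if k in sims_full]

print(len(naive_matches), len(aligned_matches))
```

In the Spark version, the same re-keying could be a `map` over goldStandard that mirrors the simsFullRDD line, for example `goldStandard.map(lambda x: ("%s %s" % (x[0][0], x[0][1]), x[1]))`.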