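feat(6-1): implement TF-IDF and cosine similarity computation

- Add tokenization and data parsing
- Implement inverse document frequency (IDF) computation
- Compute TF-IDF weights
- Add vector norm computation
- Implement an inverted index and fast cosine similarity computation
- Process the full dataset and compute similarities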
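
The steps listed above can be sketched in plain Python, without Spark. This is a minimal illustration, not the code from `6-1.py`: the function names, the sample documents, and the choice of `log(N / df)` for IDF are all assumptions (the committed code may use a different IDF formula, such as plain `N / df`).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def tf(tokens):
    """Term frequency: each token's count divided by the total token count."""
    total = len(tokens)
    return {tok: n / total for tok, n in Counter(tokens).items()}


def idf(corpus):
    """IDF as log(N / df). Assumption: the commit may define IDF differently."""
    n_docs = len(corpus)
    df = Counter(tok for tokens in corpus.values() for tok in set(tokens))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}


def tfidf(tokens, idf_weights):
    """TF-IDF weight per token; tokens unseen in the corpus get weight 0."""
    return {tok: freq * idf_weights.get(tok, 0.0) for tok, freq in tf(tokens).items()}


def norm(vec):
    """Euclidean norm of a sparse vector stored as a dict."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine_similarity(a, b):
    """Dot product over shared tokens divided by the product of the norms."""
    dot = sum(w * b[tok] for tok, w in a.items() if tok in b)
    denom = norm(a) * norm(b)
    return dot / denom if denom else 0.0


# Hypothetical sample documents, for illustration only.
docs = {
    "a": tokenize("Adobe Photoshop Elements"),
    "b": tokenize("adobe photoshop cs3"),
    "c": tokenize("Microsoft Office"),
}
weights = idf(docs)
vectors = {doc_id: tfidf(tokens, weights) for doc_id, tokens in docs.items()}
sim_ab = cosine_similarity(vectors["a"], vectors["b"])
sim_ac = cosine_similarity(vectors["a"], vectors["c"])
```

Documents that share tokens (`a` and `b`) get a similarity strictly between 0 and 1, while documents with no tokens in common (`a` and `c`) get 0. An inverted index speeds this up by only comparing pairs that share at least one token.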
parent 5770bc266e
commit d611a30082
6
6-1.py
@@ -138,6 +138,12 @@ print(similaritiesFullRDD.count())
# Assume goldStandard already exists
# goldStandard: RDD of ((Amazon ID, Google URL), 1) for true duplicates

# Define goldStandard
goldStandard = sc.parallelize([
    (("b00005lzly", "http://www.google.com/base/feeds/snippets/13823221823254120257"), 1),
    # Add the other true duplicate records here
])

# Create simsFullRDD and simsFullValuesRDD
simsFullRDD = similaritiesFullRDD.map(lambda x: ("%s %s" % (x[0][0], x[0][1]), x[1]))
simsFullValuesRDD = simsFullRDD.map(lambda x: x[1]).cache()
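
One thing to watch in this hunk: `goldStandard` is keyed by an `(Amazon ID, Google URL)` tuple, while `simsFullRDD` is keyed by the string `"ID URL"`. If the two are joined later, the keys would need the same shape. The sketch below shows the idea in plain Python with dicts standing in for RDDs. The helper `to_string_key`, the second URL, and both similarity scores are hypothetical values chosen for illustration, not results from the dataset.

```python
# Hypothetical stand-ins for the RDD contents; the similarity scores are made up.
gold_standard = [
    (("b00005lzly", "http://www.google.com/base/feeds/snippets/13823221823254120257"), 1),
]
similarities_full = [
    (("b00005lzly", "http://www.google.com/base/feeds/snippets/13823221823254120257"), 0.42),
    (("b00005lzly", "http://example.com/other"), 0.03),
]


def to_string_key(record):
    """Convert ((amazon_id, google_url), value) into ("amazon_id google_url", value)."""
    (amazon_id, google_url), value = record
    return ("%s %s" % (amazon_id, google_url), value)


# Apply the same key format to both collections so lookups line up.
sims_full = dict(map(to_string_key, similarities_full))
gold = dict(map(to_string_key, gold_standard))

# Similarities of the pairs marked as true duplicates, and of all other pairs.
true_dup_sims = [sims_full[key] for key in gold if key in sims_full]
non_dup_sims = [score for key, score in sims_full.items() if key not in gold]
```

In Spark, the same effect could presumably be achieved by mapping `goldStandard` through an equivalent key conversion before joining it with `simsFullRDD`.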