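The previous article introduced the log-likelihood similarity function; this post shares the code. The implementation follows Mahout's. If you want to see how Mahout implements it, take a look at the official Apache source, which is also fairly easy to follow.
Without further ado, on to the code, which is fairly simple. Note that the entropy here is unnormalized, which differs from the conventional entropy calculation.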
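To make "unnormalized" concrete, here is a short sketch of the quantities involved, written out from what the code computes. The symbols H, N, k_i and p_i are notation introduced here for illustration and do not appear in the original post.

```latex
\begin{align}
  % Unnormalized entropy of raw counts k_1, ..., k_n
  H(k_1, \dots, k_n) &= N \ln N - \sum_{i=1}^{n} k_i \ln k_i,
  \qquad N = \sum_{i=1}^{n} k_i \\
  % Relation to the Shannon entropy (in nats) of p_i = k_i / N
  H(k_1, \dots, k_n) &= N \Bigl( -\sum_{i=1}^{n} p_i \ln p_i \Bigr) \\
  % Log-likelihood ratio for the 2x2 table k_{11}, k_{12}, k_{21}, k_{22}
  \mathrm{LLR} &= 2 \bigl( H_{\mathrm{row}} + H_{\mathrm{col}} - H_{\mathrm{matrix}} \bigr)
\end{align}
```

In the code that follows, `entropy` corresponds to H and `xLogX` to the term x ln x, with 0 ln 0 taken to be 0.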
import math


def entropy(*elements):
    # Unnormalized entropy of raw counts: N*ln(N) - sum(k*ln(k)).
    total = 0
    result = 0.0
    for element in elements:
        result += xLogX(element)
        total += element
    return xLogX(total) - result


def xLogX(x) -> float:
    # x * ln(x), with 0 * ln(0) defined as 0.
    return 0.0 if x == 0 else x * math.log(x)


def checkargs(*args):
    # Counts must be non-negative.
    for x in args:
        if x < 0:
            raise ValueError("counts must be non-negative")


def logLikelihoodRatio(k11, k12, k21, k22) -> float:
    checkargs(k11, k12, k21, k22)
    # Note that we have counts here, not probabilities,
    # and that the entropy is not normalized.
    rowEntropy = entropy(k11 + k12, k21 + k22)
    columnEntropy = entropy(k11 + k21, k12 + k22)
    matrixEntropy = entropy(k11, k12, k21, k22)
    if rowEntropy + columnEntropy < matrixEntropy:
        # Round-off error
        return 0.0
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy)
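As a sanity check, the entropy-based formula can be compared with the textbook G statistic, G = 2 * sum(O * ln(O / E)), where E is the expected count under independence. The sketch below is self-contained, so it redefines the helpers under hypothetical snake_case names (`x_log_x`, `unnormalized_entropy`, `llr`, `g_statistic`), and the counts in the usage section are made up purely for illustration.

```python
import math


def x_log_x(x: float) -> float:
    """Return x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def unnormalized_entropy(*counts: float) -> float:
    """N*ln(N) - sum(k*ln(k)): Shannon entropy in nats, scaled by N."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11: float, k12: float, k21: float, k22: float) -> float:
    """Entropy-based log-likelihood ratio, same formula as the code above."""
    row = unnormalized_entropy(k11 + k12, k21 + k22)
    col = unnormalized_entropy(k11 + k21, k12 + k22)
    mat = unnormalized_entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def g_statistic(k11: float, k12: float, k21: float, k22: float) -> float:
    """Textbook G = 2 * sum(O * ln(O / E)), E = row_total * col_total / N."""
    observed = [[k11, k12], [k21, k22]]
    n = k11 + k12 + k21 + k22
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                g += o * math.log(o * n / (rows[i] * cols[j]))
    return 2.0 * g


if __name__ == "__main__":
    # Hypothetical counts: the two events co-occur far more than chance.
    print(llr(100, 5, 5, 1000), g_statistic(100, 5, 5, 1000))
    # Hypothetical counts: rows are proportional, so the events are independent.
    print(llr(10, 20, 20, 40))
```

The two functions should agree up to floating-point error, and a table whose rows are proportional should score approximately zero. The entropy form avoids computing expected counts explicitly, which is presumably why it is a convenient way to write the statistic.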