Computing the perplexity of an LDA language model in Python and plotting it

hao0089 · posted on 2015-4-23 09:23:31

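When reposting, please credit: EClab, University of Electronic Science and Technology of China, 落叶花开 http://www.iyunv.com/nlp-yekai/p/3816532.html
Perplexity is commonly used in natural language processing to measure how good a trained language model is. When LDA is used to cluster topics and words, the original author, D. Blei, used perplexity to determine the number of topics. The formula given in the paper is:
perplexity = \exp\left\{ -\frac{\sum \log p(w)}{N} \right\}
Here p(w) is the probability of each word that occurs in the test set. For an LDA model specifically, p(w) = \sum_z p(z|d) \, p(w|z), where z ranges over the trained topics and d is a document in the test set. The N in the denominator is the number of words occurring in the test set, that is, the total length of the test set, counting repeated words every time they occur.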
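To make the formula concrete, here is a minimal sketch that evaluates it on a made-up toy model. Every number, and the names `p_w_given_z`, `p_z_given_d`, `word_prob`, and `perplexity`, are illustrative assumptions and do not come from the original post.

```python
import math

# Hypothetical toy model with two topics (values invented for illustration).
p_w_given_z = [
    {"apple": 0.5, "banana": 0.5},   # p(w | z=0)
    {"apple": 0.1, "car": 0.9},      # p(w | z=1)
]

# Hypothetical topic mixture p(z|d) for a single test document.
p_z_given_d = [0.5, 0.5]

def word_prob(word, p_z_given_d, p_w_given_z):
    """p(w) = sum over z of p(z|d) * p(w|z); a word absent from a topic contributes 0."""
    return sum(pz * pw.get(word, 0.0) for pz, pw in zip(p_z_given_d, p_w_given_z))

def perplexity(tokens, p_z_given_d, p_w_given_z):
    """exp(-(sum of log p(w)) / N), where N counts every token, repeats included."""
    log_sum = sum(math.log(word_prob(w, p_z_given_d, p_w_given_z)) for w in tokens)
    return math.exp(-log_sum / len(tokens))

tokens = "apple banana apple car".split()
print(perplexity(tokens, p_z_given_d, p_w_given_z))
```

Note that the logarithm is taken per word and then summed, and that a test word with p(w) = 0 would make `math.log` raise an error, so a real model would need some form of smoothing or a rule for unseen words.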
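The Python program therefore needs to cover the following parts:
1. For each trained LDA model, convert the topic-word distribution file into a dictionary so that probabilities are easy to look up; this gives the numerator of the perplexity formula.
2. Count the length of the test set; this gives the denominator of the perplexity formula.
3. Compute the perplexity.
4. For models trained with different numbers of topics, compute the perplexity of each and draw a line chart.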
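Steps 1 and 2 can be sketched on a small in-memory sample. The file layout assumed here (a header line containing "Topic", followed by lines holding a word and its probability) is inferred from the way the code in this post skips "Topic" lines. The sample text and the names `parse_topic_words` and `count_tokens` are hypothetical. This sketch keeps one dictionary per topic, so that p(w|z) remains available separately for each z.

```python
def parse_topic_words(lines):
    """Return a list of {word: probability} dicts, one per topic block."""
    topics = []
    for line in lines:
        if "Topic" in line:
            topics.append({})          # a header line starts a new topic
            continue
        parts = line.split()
        if len(parts) == 2 and topics:
            word, prob = parts
            topics[-1][word] = float(prob)
    return topics

def count_tokens(text):
    """Length of the test set: every whitespace-separated token, repeats included."""
    return len(text.split())

# Invented sample in the assumed layout.
sample_model = """Topic 0th:
    apple 0.5
    banana 0.5
Topic 1th:
    apple 0.1
    car 0.9
"""

topics = parse_topic_words(sample_model.splitlines())
print(topics)
print(count_tokens("apple banana apple car"))
```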
The Python code is as follows:
　　

# -*- coding: UTF-8 -*-
import math
import matplotlib.pyplot as plt

def dictionary_found(wordlist):
    # Convert the words output by the trained model into a dictionary
    # with the word as key and its probability as value.
    # wordlist alternates: word, probability, word, probability, ...
    word_dictionary1 = {}
    for i in range(0, len(wordlist) - 1, 2):
        word = wordlist[i]
        # A word may appear under several topics; its probabilities are added up.
        word_dictionary1[word] = word_dictionary1.get(word, 0.0) + float(wordlist[i + 1])
    return word_dictionary1

def look_into_dic(dictionary, testset):
    # For each word of the test set, look up its probability in the dictionary.
    '''Calculates the TF-list for perplexity'''
    a = 0.0
    # Each distinct word is looked up once; words missing from the model are skipped.
    for letter in dict.fromkeys(testset.split()):
        letter_frequency = dictionary.get(letter)
        if letter_frequency is not None:
            a += float(letter_frequency)
    # Note: this is the sum of the probabilities; f_perplexity takes the
    # logarithm of this sum rather than summing per-word logarithms.
    return a

def f_testset_word_count(testset):
    # Count the words in the test set.
    '''Return the number of words in the test set, which is the denominator of the perplexity formula'''
    testset_clean = testset.split()
    return len(testset_clean) - testset.count("\n")

def f_perplexity(word_frequency, word_count):
    # Compute the perplexity.
    '''Calculates the perplexity of the LDA model for every parameter T'''
    duishu = -math.log(word_frequency)      # negative logarithm
    kuohaoli = duishu / word_count          # the expression inside the braces
    perplexity = math.exp(kuohaoli)
    return perplexity

def graph_draw(topic, perplexity):
    # Draw a line chart of perplexity against the number of topics.
    x = topic
    y = perplexity
    plt.plot(x, y, color="red", linewidth=2)
    plt.xlabel("Number of Topic")
    plt.ylabel("Perplexity")
    plt.show()


topic = []
perplexity_list = []
with open('/home/alber/lda/GibbsLDA/jd/test.txt', 'r') as f1:    # path of the test set
    testset = f1.read()
testset_word_count = f_testset_word_count(testset)       # count the total number of words in the test set
for i in range(14):
    topic.append(5 * (i + 1))                             # iteration formula used in the model file names
    trace = "/home/alber/lda/GibbsLDA/jd/stats/model-final-" + str(5 * (i + 1)) + ".txt"   # path of the model
    with open(trace, 'r') as f:
        text = f.readlines()
    word_list = []
    for line in text:
        if "Topic" not in line:                           # skip the topic header lines
            word_list.extend(line.split())
    word_dictionary = dictionary_found(word_list)
    frequency = look_into_dic(word_dictionary, testset)
    perplexity = f_perplexity(frequency, testset_word_count)
    perplexity_list.append(perplexity)
graph_draw(topic, perplexity_list)
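Below is the resulting line chart. Adjust the parameter further in the neighborhood of the turning point to look for the optimal number of topics (the location of that point does of course depend on the test set, as the figure shows). In my experiments, topic extraction was generally satisfactory as long as the number of topics was chosen near that point.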
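One way to make "the turning point" less subjective is to look for the largest change in slope of the curve. The sketch below does this with a discrete second difference. This is a heuristic chosen here for illustration, not the method used in the post, and the function name `find_turning_point` and the sample perplexity values are made up.

```python
def find_turning_point(topics, perplexities):
    """Return the topic count where the discrete second difference is largest.

    Assumes the topic counts are evenly spaced and that at least three
    (topic, perplexity) pairs are given.
    """
    if len(topics) != len(perplexities) or len(topics) < 3:
        raise ValueError("need at least three (topic, perplexity) pairs")
    best_index, best_bend = None, float("-inf")
    for k in range(1, len(perplexities) - 1):
        # Positive when the curve flattens out after a steep drop.
        bend = perplexities[k - 1] - 2 * perplexities[k] + perplexities[k + 1]
        if bend > best_bend:
            best_index, best_bend = k, bend
    return topics[best_index]

topics = [5, 10, 15, 20, 25, 30]
perplexities = [900.0, 600.0, 420.0, 400.0, 395.0, 393.0]   # invented values
print(find_turning_point(topics, perplexities))
```

The value returned is only a starting point; as the post says, the number of topics still has to be tuned around it.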

　　
　　
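I am a beginner who has only just started doing research, so if there are mistakes in the program or anywhere else, I hope readers will point them out.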
　　
