
[Experience Sharing] A Python implementation of MMSEG

Posted on 2017-4-27 12:24:57
  Original: http://yongsun.me/2013/06/simple-implementation-of-mmseg-with-python/
  I first heard of the MMSEG Chinese word segmentation algorithm (http://technology.chtsai.org/mmseg/) many years ago, and I have finally implemented it in Python as a programming exercise for my team, using the dictionary file and character frequencies from the mmseg4j project.
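For readers new to MMSEG: the algorithm picks the best of several candidate chunks (up to three words each) by comparing, in order, total length, average word length, variance of word lengths (smaller is better), and the sum of log frequencies of the one-character words. A minimal sketch of that ranking on a toy example follows. The helper name `chunk_key`, the two candidate chunks, and the frequency value are all assumptions made up for illustration; they are not taken from mmseg4j.

```python
from math import log

# Hypothetical character frequency, invented for this illustration only.
CHAR_FREQ = {"命": 300}

def chunk_key(words, freqs=CHAR_FREQ):
    """Return the ranking tuple for one candidate chunk (a list of words)."""
    lens = [len(w) for w in words]
    length = sum(lens)                                   # rule 1: longer chunk wins
    mean = length / len(words)                           # rule 2: larger average word length
    var = sum((n - mean) ** 2 for n in lens) / len(words)  # rule 3: smaller variance
    # rule 4: larger sum of log frequencies of the one-character words
    degree = sum(log(freqs.get(w, 1)) for w in words if len(w) == 1)
    return (length, mean, -var, degree)

# Two hand-written candidate chunks for the string "研究生命起源".
candidates = [
    ["研究生", "命", "起源"],
    ["研究", "生命", "起源"],
]
best = max(candidates, key=chunk_key)
print(best)  # ['研究', '生命', '起源']
```

Both candidates cover six characters with an average word length of 2.0, so the tie is broken by the third rule: the word lengths 2, 2, 2 have zero variance, while 3, 1, 2 do not.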

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Ported to Python 3 with the lost indentation restored.
from math import log
from collections import defaultdict


class Trie(object):
    class TrieNode:
        def __init__(self):
            self.val = 0     # value stored where a word ends; 0 means no word ends here
            self.trans = {}  # child nodes keyed by character

    def __init__(self):
        self.root = Trie.TrieNode()

    def __walk(self, trienode, ch):
        # Follow the edge labelled ch; return (None, 0) if there is none.
        if ch in trienode.trans:
            trienode = trienode.trans[ch]
            return trienode, trienode.val
        else:
            return None, 0

    def add(self, word, value=1):
        curr_node = self.root
        for ch in word:
            try:
                curr_node = curr_node.trans[ch]
            except KeyError:
                curr_node.trans[ch] = Trie.TrieNode()
                curr_node = curr_node.trans[ch]
        curr_node.val = value

    def match_all(self, word):
        # Collect the values of all stored words that are prefixes of word.
        ret = []
        curr_node = self.root
        for ch in word:
            curr_node, val = self.__walk(curr_node, ch)
            if not curr_node:
                break
            if val:
                ret.append(val)
        return ret


class Dict(Trie):
    def __init__(self, fname):
        super(Dict, self).__init__()
        self.load(fname)

    def load(self, fname):
        # One word per line; the word itself is stored as the trie value.
        with open(fname, 'r', encoding='utf-8') as file:
            for line in file:
                word = line.strip()
                if word:
                    self.add(word, word)


class CharFreqs(defaultdict):
    def __init__(self, fname):
        # Characters missing from the file default to a frequency of 1.
        super(CharFreqs, self).__init__(lambda: 1)
        self.load(fname)

    def load(self, fname):
        # Each line holds a character and its frequency, separated by whitespace.
        with open(fname, 'r', encoding='utf-8') as file:
            for line in file:
                ch, freq = line.strip().split()
                self[ch] = freq


class MMSeg:
    class Chunk:
        def __init__(self, words, chrs):
            self.words = words
            self.lens = [len(x) for x in words]
            self.length = sum(self.lens)
            self.mean = float(self.length) / len(words)
            self.var = sum((x - self.mean) ** 2 for x in self.lens) / len(self.words)
            self.degree = sum([log(float(chrs[x])) for x in words if len(x) == 1])

        def __str__(self):
            return ' '.join(self.words) + \
                "(%f %f %f %f)" % (self.length, self.mean, self.var, self.degree)

        def __lt__(self, other):
            # Rank by length, mean word length, smaller variance, then degree.
            return (self.length, self.mean, -self.var, self.degree) < \
                   (other.length, other.mean, -other.var, other.degree)

    def __init__(self, dic, chrs):
        self.dic = dic
        self.chrs = chrs

    def __get_chunks(self, s, depth=3):
        ret = []

        def __get_chunks_it(s, num, segs):
            if (num == 0 or not s) and segs:
                ret.append(MMSeg.Chunk(segs, self.chrs))
            else:
                m = self.dic.match_all(s)
                if not m:
                    # No dictionary word starts here: emit the single character.
                    __get_chunks_it(s[1:], num - 1, segs + [s[0]])
                for w in m:
                    __get_chunks_it(s[len(w):], num - 1, segs + [w])

        __get_chunks_it(s, depth, [])
        return ret

    def segment(self, s):
        # Emit the first word of the best chunk, then continue with the rest.
        while s:
            chunks = self.__get_chunks(s)
            best = max(chunks)
            yield best.words[0]
            s = s[len(best.words[0]):]


if __name__ == "__main__":
    dic = Dict("dict.utf8")
    chrs = CharFreqs("chars.utf8")
    mmseg = MMSeg(dic, chrs)
    while True:
        try:
            s = input("Test String: ")
        except (EOFError, KeyboardInterrupt):
            break
        print("Test Result: ", end=' ')
        for w in mmseg.segment(s):
            print(w, end=' ')
        print('\n')
# -*- indent-tabs-mode: nil -*- vim:et:ts=4
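
The listing needs the mmseg4j data files to run. To see the prefix matching and chunk enumeration without them, here is a small self-contained sketch using a nested-dict trie. The four-word dictionary, the `END` sentinel, and the function names `build_trie`, `match_all` and `get_chunks` are assumptions chosen for illustration; they simplify the classes above and are not the same code.

```python
# Toy dictionary, invented for this sketch (not the mmseg4j word list).
WORDS = ["研究", "研究生", "生命", "起源"]
END = object()  # hypothetical sentinel key meaning "a word ends at this node"

def build_trie(words):
    """Build a trie as nested dicts keyed by character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def match_all(trie, s):
    """Return every dictionary word that is a prefix of s, shortest first."""
    found, node = [], trie
    for ch in s:
        node = node.get(ch)
        if node is None:
            break
        if END in node:
            found.append(node[END])
    return found

def get_chunks(trie, s, depth=3, segs=()):
    """Enumerate candidate chunks of at most `depth` words starting at s."""
    if not s or depth == 0:
        return [list(segs)] if segs else []
    chunks = []
    # If no dictionary word starts here, fall back to the single character.
    for w in match_all(trie, s) or [s[0]]:
        chunks.extend(get_chunks(trie, s[len(w):], depth - 1, segs + (w,)))
    return chunks

trie = build_trie(WORDS)
print(match_all(trie, "研究生命起源"))  # ['研究', '研究生']
for chunk in get_chunks(trie, "研究生命起源"):
    print(chunk)
# ['研究', '生命', '起源']
# ['研究生', '命', '起源']
```

Each printed chunk is one candidate that the ranking step would then compare; the segmenter keeps only the first word of the winning chunk and repeats from the remaining text.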
