Python低内耗读取文件的二分查找单词

cundeng · 发表于 2017-5-6 10:05:04

　　问题描述：有一个有序的单词文件如下图

　　要求写一个查找功能（防止文件过大时占用内存，建议文件不要一次全部读入内存）
　　使用二分查找法，输出匹配的结果
　　例如   输入>>   dis
　　输出>>   disadvantage         n. 不利,不利条件,损害,损失
                        discussion         n. 讨论; v. 讨论,商议
　　环境工具：CentOS6.3 Python2.6
　　解决过程：1. 首先要考虑不能一次将文件全部读入内存，而且单词的内容是一行行的，那么就考虑如何在不读入全部文件到内存的情况下，直接读文件的第几行
　　python 的文件操作里面有
　　seek函数动态定义文件指针的位置（从哪里开始读）
　　tell函数报告文件指针的位置（受seek、readline、read、readlines影响）
　　在一个单词文件比较规范的情况下，用seek、tell配合readline就可以直接读取文件的特定一行到内存
　　代码片段如下
　　startIndexList = [] #单词在文件的起始位置指针列表
   startIndexList.append(0)    #第一个单词的是0
   while self.fp.readline(): #一行行的读取文件，并记录单词的起始位置指针
        startIndexList.append(self.fp.tell()) #tell()报告当前位置指针
   del startIndexList[-1] #去掉文末指针
　　2. 关于二分查找法，就是在有序序列中，不断的取中间元素做比较，进行查找，能较大提高效率，涉及到Python的字符串匹配（in关键字）和字符串比较（cmp函数）
　　代码片段如下
　　def findword(self, wordSourceList):
   if(len(wordSourceList) == 1): #源单词列表只有一个单词
        self.fp.seek(wordSourceList[0],0) #找到相应位置的单词的起始位置
        wordSource = self.fp.readline().strip('\n') #读取相应元素的单词
        if self.wordTarget in wordSource: #模糊匹配成功，加入搜索结果列表
        self.resultList.append(wordSource)
   elif(len(wordSourceList) == 2): #源单词列表有两个单词
        self.fp.seek(wordSourceList[0],0) #找到相应位置的单词的起始位置
        wordSource = self.fp.readline().strip('\n') #读取相应元素的单词
        if self.wordTarget in wordSource: #模糊匹配成功，加入搜索结果列表
        self.resultList.append(wordSource)
        self.fp.seek(wordSourceList[indexList[1]],0)
        wordSource = self.fp.readline().strip('\n')
        if self.wordTarget in wordSource:
        self.resultList.append(wordSource)
   else: #源单词列表三个及以上的单词
        self.fp.seek(wordSourceList[len(wordSourceList)/2],0) #找到中间位置的单词的起始位置
        wordSource = self.fp.readline().strip('\n') #读取中间位置的单词
        if self.wordTarget in wordSource: #模糊匹配成功，加入搜索结果列表
        self.resultList.append(wordSource)
        if cmp(self.wordTarget, wordSource) == -1: #中间位置的单词和目标单词比较，目标单词可能在小的单词序列中
        wordSourceListSmall = wordSourceList[0:len(wordSourceList)/2]
        self.findword(wordSourceListSmall)
        else: #中间位置的单词和目标单词比较，目标单词可能在大的单词序列中
        wordSourceListBig = wordSourceList[(len(wordSourceList)/2)+1:len(wordSourceList)]
        self.findword(wordSourceListBig)
　　3. 如入dis查找结果如下图，源码见findword.py

账号		自动登录	找回密码
密码			立即注册

大疆运维招人啦，

C++ :try 语句块和异常处理

C++的多态

Red Hat RHCE 8 (EX294) Cert Guide

Java/C++ 区别：看完这一篇，就够用！

别再用过时库了！这 13 个顶级 C++ 库才是

c++ size_t 和 int 的区别

[经验分享] Python低内耗读取文件的二分查找单词

浏览过的版块

扫码加入运维网微信交流群