Working with a MongoDB Database Using pymongo

渡人自渡 · Posted on 2015-11-11 09:46:43


# Connect to the database
def get_db():
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    # 'examples' is the database name; it is created if it does not exist yet
    db = client.examples
    return db

# Insert
def add_city(db):
    # insert_one() inserts a single dictionary (the old insert() was removed in PyMongo 4)
    db.cities.insert_one({'name': 'Chicago'})

# Fetch data
def get_city(db):
    # return an arbitrary document from cities
    return db.cities.find_one()

if __name__ == '__main__':
    db = get_db()
    add_city(db)
    print(get_city(db))

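The code above is about the simplest possible example of working with a MongoDB database.

How do our MongoDB-based application (APP), the pymongo module, and the MongoDB database relate to one another?

I think it can be pictured as: APP <--------------> pymongo <-----BSON--------> MongoDB

 where BSON stands for Binary JSON.

With this picture in mind, you can see why MongoDB belongs to the dictionary family.

So when working with MongoDB, start from the basic understanding that everything is a dictionary.

On to the main topic, starting with query operations.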

query = {'manufacturer': 'Porsche'}
# A dictionary describes what to look for: {'manufacturer': 'Porsche'} means manufacturer == 'Porsche'
# In SQL terms: SELECT * FROM autos WHERE manufacturer = 'Porsche'
projection = {'_id': 0, 'name': 1}  # 1 shows a field, 0 hides it
db.myautos.find(query, projection)  # find documents whose manufacturer is Porsche; hide '_id', show 'name'
db.myautos.count_documents(query)  # number of matching documents (cursor.count() was removed in PyMongo 4)

Importing data from a JSON file into the database:

In a terminal:
$ mongoimport --db dbname -c collectionname --file inputfile.json

Comparison operators:

$gt $lt $lte $gte $ne correspond, in that order, to:

greater than, less than, less than or equal, greater than or equal, not equal

query = {'population': {'$gt': 10000}}  # population greater than 10000
query = {'population': {'$gt': 10000, '$lte': 20000}}  # population greater than 10000 and at most 20000
query = {'name': {'$gt': 'X', '$lte': 'Z'}}  # names that sort after 'X' and up to 'Z'
from datetime import datetime
query = {'foundationDate': {'$gt': datetime(1840, 1, 1), '$lte': datetime(2049, 10, 1)}}
# dates after 1840-01-01 and up to 2049-10-01
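To make the range semantics concrete, here is a minimal plain-Python sketch of how a condition such as {'$gt': 10000, '$lte': 20000} picks documents. The helper matches_range and the sample cities are made up for illustration; MongoDB evaluates these operators on the server, so real code just passes the query dictionary to find().

```python
import operator
from datetime import datetime

# Hypothetical lookup table: each comparison operator and its Python equivalent.
COMPARISONS = {
    '$gt': operator.gt,
    '$lt': operator.lt,
    '$gte': operator.ge,
    '$lte': operator.le,
    '$ne': operator.ne,
}

def matches_range(value, condition):
    """Return True if value satisfies every operator in the condition dict."""
    return all(COMPARISONS[op](value, bound) for op, bound in condition.items())

# Made-up sample documents.
cities = [
    {'name': 'A', 'population': 9000},
    {'name': 'B', 'population': 15000},
    {'name': 'C', 'population': 20000},
    {'name': 'D', 'population': 25000},
]

condition = {'$gt': 10000, '$lte': 20000}
selected = [c['name'] for c in cities if matches_range(c['population'], condition)]
print(selected)  # ['B', 'C']
```

The same helper works for strings and datetime values, since Python compares those with the same operators.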
The existence operator $exists

query = {'governmentType': {'$exists': 1}}  # 1 means the field exists
query = {'governmentType': {'$exists': 0}}  # 0 means the field does not exist
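A small plain-Python sketch of what the existence test checks. The helper field_exists and the sample documents are hypothetical; the point is that a field set to None (null) still counts as existing.

```python
def field_exists(doc, field, should_exist):
    """Mimic {'field': {'$exists': should_exist}} for a top-level field."""
    return (field in doc) == bool(should_exist)

# Made-up sample documents.
docs = [
    {'name': 'A', 'governmentType': 'republic'},
    {'name': 'B'},
    {'name': 'C', 'governmentType': None},  # present, but null
]

with_field = [d['name'] for d in docs if field_exists(d, 'governmentType', 1)]
without_field = [d['name'] for d in docs if field_exists(d, 'governmentType', 0)]
print(with_field)     # ['A', 'C']
print(without_field)  # ['B']
```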
The regular expression operator $regex

query = {'motto': {'$regex': '[Ff]riendship'}}
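The pattern is not anchored, so it matches anywhere inside the string. Python's re.search behaves the same way for a simple pattern like this one, which makes it a handy way to try a pattern out locally. The sample mottos below are made up.

```python
import re

pattern = re.compile('[Ff]riendship')

# Made-up sample values for the 'motto' field.
mottos = [
    'Friendship first',
    'In friendship we trust',
    'FRIENDSHIP',   # no match: only the first letter may vary in case
    'Unity',
]

matched = [m for m in mottos if pattern.search(m)]
print(matched)  # ['Friendship first', 'In friendship we trust']
```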
　　

$in and $all

query = {'modelYears': {'$in': [1965, 1967, 1977, 1987]}}  # matching any one of the values is enough
query = {'modelYears': {'$all': [1965, 1967, 1977, 1987]}}  # all four values must be present together
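The difference is easy to see in plain Python: for an array field, $in behaves like any() and $all like all(). The helpers matches_in and matches_all and the sample cars are hypothetical, written only to illustrate the two rules.

```python
def matches_in(field_values, wanted):
    """$in on an array field: at least one wanted value is present."""
    return any(w in field_values for w in wanted)

def matches_all(field_values, wanted):
    """$all: every wanted value is present."""
    return all(w in field_values for w in wanted)

# Made-up sample documents.
cars = [
    {'name': 'a', 'modelYears': [1965, 1966]},
    {'name': 'b', 'modelYears': [1965, 1967, 1977, 1987, 1990]},
    {'name': 'c', 'modelYears': [2000]},
]

wanted = [1965, 1967, 1977, 1987]
in_hits = [c['name'] for c in cars if matches_in(c['modelYears'], wanted)]
all_hits = [c['name'] for c in cars if matches_all(c['modelYears'], wanted)]
print(in_hits)   # ['a', 'b']
print(all_hits)  # ['b']
```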

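 If the data is structured like this:

{'dimension': {'width': 25,
               'height': 30,
               'length': 89}
........
}

 the query dictionary can be:

query = {'dimension.width': 25}
cities = db.cities.find(query)
for city in cities:
    city['dimension'] = 66
    # save the change (save() was removed in PyMongo 4; replace_one() writes the whole document back)
    db.cities.replace_one({'_id': city['_id']}, city)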
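The dotted key is just a path into nested dictionaries. A minimal sketch of that lookup, using a hypothetical helper get_path and ignoring arrays for simplicity:

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'dimension.width' through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = doc
    for key in dotted.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Made-up sample document.
city = {'name': 'X', 'dimension': {'width': 25, 'height': 30, 'length': 89}}

print(get_path(city, 'dimension.width'))  # 25
print(get_path(city, 'dimension.depth'))  # None
```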

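The update operation

# update() was removed in PyMongo 4; update_one() and update_many() replace it
db.cities.update_one({'name': 'michael',
                      'country': 'china'},  # filter
                     {'$set': {'iso': 1978}})  # in the first matching document, set 'iso' to 1978 (the field is added if missing)
db.cities.update_one({'name': 'michael',
                      'country': 'china'},  # filter
                     {'$unset': {'iso': ''}})  # in the first matching document, remove the 'iso' field (the value given is ignored)
# update several documents at once
db.cities.update_many({'name': 'michael',
                       'country': 'china'},  # filter
                      {'$set': {'iso': 1978}})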
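What $set and $unset do to a single document can be sketched in plain Python. The helper apply_update is hypothetical and handles top-level fields only; the real work happens on the server.

```python
def apply_update(doc, update):
    """Return a copy of doc with '$set' and '$unset' applied (top-level fields only)."""
    result = dict(doc)
    for field, value in update.get('$set', {}).items():
        result[field] = value          # adds the field or overwrites it
    for field in update.get('$unset', {}):
        result.pop(field, None)        # the value given to $unset is ignored
    return result

# Made-up sample document.
doc = {'name': 'michael', 'country': 'china'}

after_set = apply_update(doc, {'$set': {'iso': 1978}})
after_unset = apply_update(after_set, {'$unset': {'iso': ''}})
print(after_set)    # {'name': 'michael', 'country': 'china', 'iso': 1978}
print(after_unset)  # {'name': 'michael', 'country': 'china'}
```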

The aggregate operation

Consider the following data structure:

{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
 "user_mentions" : [ ],
 "urls" : [ ],
 "hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
 "friends_count" : 145,
 "profile_sidebar_fill_color" : "E5507E",
 "location" : "Ireland :)",
 "verified" : false,
 "follow_request_sent" : null,
 "favourites_count" : 1,
 "profile_sidebar_border_color" : "CC3366",
 "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
 "geo_enabled" : false,
 "created_at" : "Sun May 03 19:51:04 +0000 2009",
 "description" : "",
 "time_zone" : null,
 "url" : null,
 "screen_name" : "Catherinemull",
 "notifications" : null,
 "profile_background_color" : "FF6699",
 "listed_count" : 77,
 "lang" : "en",
 "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
 "statuses_count" : 2475,
 "following" : null,
 "profile_text_color" : "362720",
 "protected" : false,
 "show_all_inline_media" : false,
 "profile_background_tile" : true,
 "name" : "Catherine Mullane",
 "contributors_enabled" : false,
 "profile_link_color" : "B40B43",
 "followers_count" : 169,
 "id" : 37486277,
 "profile_use_background_image" : true,
 "utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}

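The $group operation

group = {'$group': {'_id': '$user.screen_name', 'count': {'$sum': 1}}}
# A $group stage must have an '_id' key, which names what to group by; '$sum' adds values up
# The line above counts how many documents there are for each 'user.screen_name'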

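The $sort operation, as the name suggests, sorts documents by the value of some key, in ascending or descending order

# continuing from the code above
sort = {'$sort': {'count': -1}}  # sort by the value of 'count', descending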

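Put group and sort together in the aggregate function and we get the result we want

pipeline = [group, sort]
result = db.tweets.aggregate(pipeline)
# In PyMongo 2.x, result is a dictionary and result['result'] holds the list of processed documents;
# from PyMongo 3 on, aggregate() returns a cursor, so iterate over it or call list(result)
# The whole operation: count the documents for each user.screen_name and sort in descending order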
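To see what this pipeline computes without a running server, here is a plain-Python sketch of the same two steps. The sample tweets are made up, and collections.Counter stands in for the $group stage with '$sum': 1.

```python
from collections import Counter

# Made-up sample documents with the same shape as the tweets above.
tweets = [
    {'user': {'screen_name': 'alice'}},
    {'user': {'screen_name': 'bob'}},
    {'user': {'screen_name': 'alice'}},
    {'user': {'screen_name': 'carol'}},
    {'user': {'screen_name': 'alice'}},
    {'user': {'screen_name': 'bob'}},
]

# $group with {'$sum': 1}: one counter per distinct screen_name.
counts = Counter(t['user']['screen_name'] for t in tweets)
grouped = [{'_id': name, 'count': n} for name, n in counts.items()]

# $sort with {'count': -1}: largest count first.
ranked = sorted(grouped, key=lambda d: d['count'], reverse=True)
print(ranked[0])  # {'_id': 'alice', 'count': 3}
```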

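That was only the simplest example.

Let's go on to some other operations:

$match, as the name suggests; I prefer to call it a "filter".

Let me give an example: I want to find out who is the most popular person in the database. Any suggestions on how to track down this joker?

￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥ Think it over ￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥

￥￥￥￥￥￥￥￥￥￥￥￥￥ Is it the "nation's husband", Wang Sicong ￥￥￥￥￥￥￥￥￥￥￥￥￥ or Yang Mi, of "smelly feet" fame ￥￥￥￥￥￥￥￥￥

￥￥￥￥￥￥￥￥￥ Some minor Weibo starlet who trades on showing skin? ￥￥￥￥￥￥￥ Or the media maven Gu Dabaihua? ￥￥￥￥￥￥￥￥￥￥￥￥

Fine, the only thing I came up with is a ratio. Yes, a ratio (比值), not the rude word that sounds like it.

ratio = followers count / friends count

Sure, the following count would do as well, but I am being willful about it, just like Jiang Wen. So what?

Oh, you say that if I carry on like this you will stop reading!

Fine, don't read it then. I picked up a trick from watching Jiang Wen's films.

Let me whisper it to you. Jiang Wen's point is: I don't make films for you to watch, I make them for myself. I have no money to make films, so: I don't write this blog for others to read, I write it for myself. Yes, I am talking to myself, and the other me is reading the blog.

Back to $match. No, back to finding the ratio. Let's see how I find the ratio, or rather the largest ratio.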

match = {'$match': {'user.friends_count': {'$gt': 0}, 'user.followers_count': {'$gt': 0}}}
# make sure the friends count and the followers count are both positive
project = {'$project': {'ratio': {'$divide': ['$user.followers_count', '$user.friends_count']},
                        'screen_name': '$user.screen_name'}}
# create two keys, 'ratio' and 'screen_name'; 'ratio' uses '$divide' to divide one value by the other,
# and the order of the list matters (numerator first, then denominator)
# now sort
sort = {'$sort': {'ratio': -1}}  # descending
# take the first one
limit = {'$limit': 1}
# $limit caps the number of results
pipeline = [match, project, sort, limit]
result = db.tweets.aggregate(pipeline)
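Here is the same four-stage pipeline walked through in plain Python on made-up data, one step per stage. It is only a sketch of the logic (for instance, it leaves out the '_id' that $project keeps by default), but it shows why the $match stage comes first: without it, carol's zero friends count would cause a division by zero.

```python
# Made-up sample documents.
tweets = [
    {'user': {'screen_name': 'alice', 'followers_count': 100, 'friends_count': 50}},
    {'user': {'screen_name': 'bob', 'followers_count': 300, 'friends_count': 30}},
    {'user': {'screen_name': 'carol', 'followers_count': 10, 'friends_count': 0}},
]

# $match: keep only users whose two counts are both positive.
matched = [
    t for t in tweets
    if t['user']['friends_count'] > 0 and t['user']['followers_count'] > 0
]

# $project: build the new 'ratio' and 'screen_name' keys.
projected = [
    {
        'screen_name': t['user']['screen_name'],
        'ratio': t['user']['followers_count'] / t['user']['friends_count'],
    }
    for t in matched
]

# $sort (descending), then $limit 1.
ranked = sorted(projected, key=lambda d: d['ratio'], reverse=True)
top = ranked[:1]
print(top)  # [{'screen_name': 'bob', 'ratio': 10.0}]
```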

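 The $unwind operation, for example:

Suppose we have a dictionary structured like this:
{
'id':'1',
'author':'jone',
'tags':['good','fun','good']
}
Run db.article.aggregate on it:
db.article.aggregate([{'$project': {'author': 1, 'tags': 1}}, {'$unwind': '$tags'}])
The result (shown in the dictionary form that PyMongo 2.x returns) is:
{'result':[{'_id':'XXXX','author':'jone','tags':'good'},
{'_id':'XXXX','author':'jone','tags':'fun'},
{'_id':'XXXX','author':'jone','tags':'good'}],
'ok':1}
So $unwind operates on arrays. It replaces the array with each of its elements in turn, producing several items, and the number of new items is naturally the length of the original array. If the field is not an array, older MongoDB versions report an error (since MongoDB 3.2 such a value is treated as a one-element array instead).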
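A minimal plain-Python sketch of the same idea, assuming a top-level array field. The generator unwind is a hypothetical helper, not part of pymongo.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of its array field.

    Documents where the field is missing, None, or an empty list yield nothing.
    """
    for doc in docs:
        for item in doc.get(field) or []:
            unwound = dict(doc)
            unwound[field] = item
            yield unwound

articles = [{'id': '1', 'author': 'jone', 'tags': ['good', 'fun', 'good']}]

rows = list(unwind(articles, 'tags'))
for row in rows:
    print(row)
# {'id': '1', 'author': 'jone', 'tags': 'good'}
# {'id': '1', 'author': 'jone', 'tags': 'fun'}
# {'id': '1', 'author': 'jone', 'tags': 'good'}
```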

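The $group operation
Back to the Twitter data from the beginning: if I want to find which tweet text (that is, which hashtag) is retweeted the most on average, how should the aggregate be written?
First we need the tweet's hashtags. A side note here: in the Twitter data above, within the structure

"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
}

'entities.hashtags' is a list, so we can apply $unwind to it,
and 'retweet_count' gives the number of retweets, so all that remains is to average it.
unwind = {'$unwind': '$entities.hashtags'}
group = {'$group': {'_id': '$entities.hashtags.text', 'retweet_avg': {'$avg': '$retweet_count'}}}
Note: a $group stage must have an '_id' property. A '$'-prefixed value refers to a field of the incoming documents, while output names such as 'retweet_avg' are yours to choose. '$avg' computes the average. Similar operators include:
'$sum' '$first' '$last' '$max' '$min' and so on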
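The averaging step can be sketched in plain Python on made-up rows, shaped the way tweets look after $unwind has been applied to 'entities.hashtags' (one hashtag per row):

```python
from collections import defaultdict

# Made-up rows: one hashtag per row, as after $unwind.
unwound = [
    {'entities': {'hashtags': {'text': 'python'}}, 'retweet_count': 4},
    {'entities': {'hashtags': {'text': 'mongodb'}}, 'retweet_count': 10},
    {'entities': {'hashtags': {'text': 'python'}}, 'retweet_count': 2},
]

# Group by hashtag text, keeping a running [sum, count] per group.
totals = defaultdict(lambda: [0, 0])
for row in unwound:
    entry = totals[row['entities']['hashtags']['text']]
    entry[0] += row['retweet_count']
    entry[1] += 1

# The average per group is sum / count.
averages = {tag: total / n for tag, (total, n) in totals.items()}
print(averages)  # {'python': 3.0, 'mongodb': 10.0}
```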
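Then comes the sort, so the full set of operations is:
unwind = {'$unwind': '$entities.hashtags'}
group = {'$group': {'_id': '$entities.hashtags.text', 'retweet_avg': {'$avg': '$retweet_count'}}}
sort = {'$sort': {'retweet_avg': -1}}
limit = {'$limit': 1}
pipeline = [unwind, group, sort, limit]
db.tweets.aggregate(pipeline)
As Chairman Mao taught, where there is a spear there is a shield: since $unwind splits arrays apart, there must also be a tool that puts arrays together. Can you guess what it is?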
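$push, $addToSet
As the names suggest, push and addToSet both collect elements into an array, but addToSet is the more refined of the two: a set has no duplicates, so the array built by addToSet contains no repeated elements.
The array built by push may contain repeated elements.
That is the difference between the two.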
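A plain-Python sketch of the two accumulators, run on the unwound article rows from the $unwind example. The variable names are made up, and the corresponding $group stages are shown in the comments. Note that MongoDB does not promise any particular order for an $addToSet array; the first-seen order below is just what this sketch happens to produce.

```python
from collections import defaultdict

# One row per tag, as produced by $unwind on the 'tags' array.
rows = [
    {'author': 'jone', 'tags': 'good'},
    {'author': 'jone', 'tags': 'fun'},
    {'author': 'jone', 'tags': 'good'},
]

# {'$group': {'_id': '$author', 'tags': {'$push': '$tags'}}}
pushed = defaultdict(list)
# {'$group': {'_id': '$author', 'tags': {'$addToSet': '$tags'}}}
added = defaultdict(list)

for row in rows:
    pushed[row['author']].append(row['tags'])        # keeps duplicates
    if row['tags'] not in added[row['author']]:      # skips duplicates
        added[row['author']].append(row['tags'])

print(pushed['jone'])  # ['good', 'fun', 'good']
print(added['jone'])   # ['good', 'fun']
```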

Copyright notice: this is an original article by the author and may not be reproduced without the author's permission.
