MapReduce with MongoDB and Python [Repost]
From Artificial Intelligence in Motion, by Marcel Pinheiro Caraciolo (the images published on Artificial Intelligence in Motion are hosted outside the firewall, so they have been moved to cnblogs)
Hi all,
In this post, I'll present a demonstration of a map-reduce example with MongoDB and server-side JavaScript. Since I've been working with this technology recently, I thought it would be useful to present a simple example of how it works and how to integrate it with Python.
But what is MongoDB?
For those who don't know what MongoDB is or the basics of how to use it, it is important to explain a little about the No-SQL movement. Currently, there are several databases that break with the requirements of traditional relational database systems. The main characteristics shared by many No-SQL databases are as follows:
[*]SQL is not used as the query language (queries are typically expressed through APIs built around formats such as JSON or BSON).
[*]Atomic operations are not necessarily guaranteed.
[*]Distributed and horizontally scalable.
[*]Schemas do not have to be predefined (schemaless).
[*]Non-tabular data storage (e.g., key-value, object, graph).
Although it is not obvious, No-SQL is usually read as "Not Only SQL". This new approach has been making a lot of noise since 2009. You can find more information about it here and here. It is important to note that non-relational databases do not represent a complete replacement for relational databases. You need to know the pros and cons of each approach and decide which is the most appropriate for the scenario you are facing.
MongoDB is one of the most popular No-SQL databases today, and it is what this article focuses on. It is a schemaless, document-oriented, high-performance, scalable database that uses key-value concepts to store documents as JSON-structured documents. It also includes some relational database features such as indexing and dynamic queries. It is used in production today on more than 40 websites, including web services such as SourceForge, GitHub, Electronic Arts and The New York Times.
One of the features I like best in MongoDB is Map-Reduce. In the next section I will explain how it works, illustrated with a simple example using MongoDB and Python.
If you want to install MongoDB or get more information, you can download it here and read a nice tutorial here.
Map-Reduce
MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google to support parallel computation over large data sets spread across clusters of computers. MapReduce is now considered a popular model in distributed computing, inspired by the map and reduce functions commonly used in functional programming. It can be considered 'data-oriented', processing data in two primary steps: Map and Reduce. On top of that, the query is executed on several data sources simultaneously. The process of applying the request to each record supplied by the input reader is called 'Map', and the process of aggregating the intermediate results of the mapping function into a consolidated result is called 'Reduce'. The MapReduce paper, which has more details, can be read here.
Today there are several implementations of MapReduce, such as Hadoop, Disco and Skynet. The most famous is Hadoop, which is implemented in Java as an open-source project. MongoDB also has an implementation that is similar in spirit to Hadoop, with all input coming from a collection and output going to a collection. As a practical definition, Map-Reduce in MongoDB is useful for batch manipulation of data and for aggregation operations. In real-world scenarios, where you would have used GROUP BY in SQL, map/reduce is the equivalent tool in MongoDB.
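To make the two steps concrete, here is a minimal in-memory sketch in plain Python. It does not touch MongoDB at all; the helper names (map_words, reduce_counts, map_reduce) and the two sample documents are made up for illustration. It only shows the shape of the model: map emits (key, value) pairs, the pairs are grouped by key, and reduce collapses each group, much like a GROUP BY with a SUM.

```python
from collections import defaultdict


def map_words(document):
    """Map step: yield a (word, 1) pair for every word in the document."""
    for word in document["text"].lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce step: collapse all values emitted for one key into a total."""
    return sum(values)


def map_reduce(documents, mapper, reducer):
    """Run the mapper, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical sample documents, each with a 'text' field.
docs = [{"text": "the quick brown fox"}, {"text": "the lazy dog"}]
counts = map_reduce(docs, map_words, reduce_counts)
```

Here counts maps each distinct word to the number of times it appears across both documents.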
Now that we have introduced Map-Reduce, let's see how to access MongoDB from Python.
PyMongo
PyMongo is a Python distribution containing tools for working with MongoDB, and it is the recommended way to work with MongoDB from Python. It's easy to install and use. See here how to install and use it.
Map-Reduce in Action
Now let's see Map-Reduce in action. To demonstrate map-reduce, I've decided to use one of the classic problems solved with it: counting word frequencies across a series of documents. It's a simple problem that is well suited to being solved by a map-reduce query.
I've decided to use two samples for this task. The first is a list of simple sentences to illustrate how map-reduce works. The second is Obama's 2009 speech on his election as president. It will be used to show a real example, illustrated by the code.
Let's consider the diagram below to help demonstrate how map-reduce could be distributed. It shows four sentences that are split into words and grouped by the map function, and then reduced independently (aggregated) by the reduce function. This is interesting because it means our query can be distributed across separate nodes (computers), resulting in a faster word-frequency count. It's also important to note that the example below shows a balanced tree, but it could be unbalanced or even show some redundancy.
Map-Reduce Distribution
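The idea in the diagram can be imitated with a small plain-Python sketch. The four sentences and the split into two "nodes" are invented for illustration (they are not the ones in the figure), and the nodes are just list slices, not real machines. The point is that partial counts computed independently can be merged afterwards, and that a balanced or unbalanced split gives the same totals.

```python
from collections import Counter


def count_words(sentences):
    """Map and locally reduce one node's share of the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def merge_counts(partials):
    """Final reduce: combine the partial counts coming from every node."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input: four short sentences.
sentences = ["the cat sat", "the dog sat", "a cat ran", "a dog ran"]

# Balanced split: two sentences per node.
balanced = merge_counts([count_words(sentences[:2]), count_words(sentences[2:])])

# Unbalanced split: one sentence on the first node, three on the second.
unbalanced = merge_counts([count_words(sentences[:1]), count_words(sentences[1:])])

# Reference result computed on a single node.
single_node = count_words(sentences)
```

All three results hold the same counts, which is what allows the work to be spread over several nodes.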
Some notes you need to know before developing your map and reduce functions:
[*] The MapReduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function:
for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)
[*] Currently, the return value from a reduce function cannot be an array (it's typically an object or a number)
[*]If you need to perform an operation only once, use a finalize function.
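The notes above can be checked with a small plain-Python sketch. Everything here is illustrative: reduce_counts, bad_reduce, survives_rereduce and finalize_count are hypothetical helpers, not part of MongoDB or PyMongo. The sketch assumes values of the form {"count": n}, tests whether a reducer still gives the same answer when fed its own output, and shows a finalize-style step that runs once per key after reducing is finished.

```python
def reduce_counts(key, values):
    """Sum partial counts; the output has the same shape as each input value."""
    return {"count": sum(v["count"] for v in values)}


def bad_reduce(key, values):
    """Counts how many values arrived instead of summing them.

    This looks fine on a first pass but breaks when the engine re-reduces.
    """
    return {"count": len(values)}


def survives_rereduce(reducer, key, values):
    """Check that reduce(k, [reduce(k, vals)]) == reduce(k, vals)."""
    once = reducer(key, values)
    return reducer(key, [once]) == once


def finalize_count(key, reduced):
    """Finalize-style step: unwrap the object into a plain number, once per key."""
    return reduced["count"]


vals = [{"count": 1}, {"count": 1}, {"count": 1}]
good_ok = survives_rereduce(reduce_counts, "the", vals)
bad_ok = survives_rereduce(bad_reduce, "the", vals)
final_value = finalize_count("the", reduce_counts("the", vals))
```

The summing reducer passes the check, while the one based on len(values) does not, because a second pass sees a single already-reduced value.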
Let's go to the code now. For this task, I'll use the PyMongo framework, which has support for Map/Reduce. As I said earlier, the input text will be Obama's speech, which by the way has many repeated words. Take a look at the tag cloud (a cloud of words in which each word's font size is based on its frequency) of Obama's speech.
Obama's Speech in 2009
To write our map and reduce functions, MongoDB allows clients to send JavaScript map and reduce implementations that are evaluated and run on the server. Here is our map function.
wordMap.js
As you can see, the 'this' variable refers to the context from which the function is called. That is, MongoDB calls the map function on each document in the collection we are querying, with 'this' pointing to that document, so a key of the document such as 'text' can be accessed by calling this.text. The map function doesn't return a list; instead it calls an emit function, which it expects to be defined. The parameters of this function (key, value) will be grouped with the other intermediate results from other map evaluations that have the same key, as (key, [values]), and passed to the reduce function that we will define now.
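As a rough sketch of what such a map function could look like (an illustration under assumptions, not necessarily the exact wordMap.js file), the JavaScript source is kept below in a Python string, the way it would be handed to the server. It assumes each document stores its content under the 'text' key, as described above; the emitted value shape {count: 1} and the names WORD_MAP_JS and word_map are assumptions. A rough Python mirror with emit passed in explicitly makes the behaviour easy to try without a server.

```python
import re

# JavaScript source for the server (a sketch; the {count: 1} value is an assumption).
WORD_MAP_JS = r"""
function () {
    // 'this' is the document currently being mapped
    var words = this.text.match(/\w+/g);
    if (words === null) {
        return;
    }
    for (var i = 0; i < words.length; i++) {
        // emit one (key, value) pair per word
        emit(words[i], {count: 1});
    }
}
"""


def word_map(document, emit):
    """Rough Python mirror of the JavaScript above, with emit passed in."""
    for word in re.findall(r"\w+", document["text"]):
        emit(word, {"count": 1})


# Collect what the mirror emits for one hypothetical document.
emitted = []
word_map({"text": "yes we can, yes"}, lambda key, value: emitted.append((key, value)))
```

After this runs, emitted holds one pair per word occurrence, so a repeated word shows up several times with a count of 1 each; summing them is left to the reduce step.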
wordReduce.js
The reduce function must reduce a list of a chosen type to a single value of that same type; the operation must be associative and commutative, so that it doesn't matter how the mapped items are grouped.
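A minimal sketch of a reduce function with these properties follows (again an illustration, not necessarily the exact wordReduce.js file). It assumes the map step emits values of the form {count: 1}; WORD_REDUCE_JS and word_reduce are hypothetical names. The JavaScript is kept in a Python string, and a small Python mirror shows that the grouping of the values does not change the result.

```python
# JavaScript source for the server (a sketch under the assumptions above).
WORD_REDUCE_JS = """
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    // same shape as the emitted values, so the result can be reduced again
    return {count: total};
}
"""


def word_reduce(key, values):
    """Python mirror of the JavaScript above."""
    return {"count": sum(v["count"] for v in values)}


ones = [{"count": 1}, {"count": 1}, {"count": 1}]

# Reduce everything in one go.
all_at_once = word_reduce("yes", ones)

# Reduce two values first, then fold the partial result in with the third.
first_pass = word_reduce("yes", ones[:2])
second_pass = word_reduce("yes", [first_pass, ones[2]])
```

Both routes end with the same object, because the output of one pass is a valid input for the next.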
Now let's code our word count example using the PyMongo client and passing the map/reduce functions to the server.
mapReduce.py
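A hedged sketch of such a driver script is shown below; it is not necessarily the exact mapReduce.py file. The collection names texts and wordcount, and the helpers top_words and run_word_count, are hypothetical. It assumes db is a PyMongo Database object and that map_js and reduce_js are the JavaScript sources as strings. Older PyMongo releases offered Collection.map_reduce, which was removed in PyMongo 4.0, so this sketch issues the mapReduce database command through Database.command instead. Note also that recent MongoDB versions deprecate mapReduce in favour of the aggregation pipeline.

```python
def top_words(result_docs, limit=10):
    """Rank mapReduce output documents ({_id: word, value: {count: n}}) by count."""
    ranked = sorted(result_docs, key=lambda d: d["value"]["count"], reverse=True)
    return [(d["_id"], d["value"]["count"]) for d in ranked[:limit]]


def run_word_count(db, lines, map_js, reduce_js, out="wordcount"):
    """Load lines into a collection, run mapReduce, return the most frequent words.

    `db` is assumed to be a pymongo Database; 'texts' and `out` are
    hypothetical collection names chosen for this sketch.
    """
    docs = [{"text": line} for line in lines if line.strip()]
    if not docs:
        return []
    db.texts.drop()                      # start from an empty input collection
    db.texts.insert_many(docs)           # one document per non-empty line
    # Run the mapReduce command on 'texts' and write the result to `out`.
    db.command("mapReduce", "texts", map=map_js, reduce=reduce_js, out=out)
    return top_words(db[out].find())
```

To try it against a local server, one would create the database handle with something like MongoClient("localhost", 27017)["test"] (host, port and database name being assumptions) and pass in the lines of the speech together with the map and reduce sources.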
Let's see the result now:
And it works! :D
With the Map-Reduce function, the word frequency count is efficient, and it can perform even better in a distributed environment. This brief experiment shows the potential of the map-reduce model for distributed computing, especially on large data sets.
All code used in this article can be downloaded here.
My next posts will be about performance evaluation of machine learning techniques. Stay tuned!
Marcel Caraciolo
References
[*]http://nosql.mypopescu.com/post/394779847/mongodb-tutorial-mapreduce
[*]http://fredzvt.wordpress.com/2010/04/24/no-sql-mongodb-from-introduction-to-high-level-usage-in-csharp-with-norm/