
[Experience sharing] Yet another MongoDB Map Reduce tutorial [An English-language article on MongoDB MapReduce, recommended]

Posted on 2018-10-27 11:57:26
  http://blog.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/
  Background

  As the>

  •   A problem solving approach is used, so we’ll take a problem, solve it in SQL first and then discuss Map Reduce.
  •   Lots of diagrams, so you’ll hopefully better understand how Map Reduce works.
  The Problem
  So without further ado, let us get started. We’ll use GeoBytes’ free GeoWorldMap database. It is a database of countries, their states/regions and major cities. You can find this database on this page under Geobytes’ Free Services section. The zip archive contains CSV files, and instructions on importing this data to MySQL are available here.
  The task is to find the 2 closest cities in each country, except in the United States. (I excluded the USA because over 75% of the cities in the “cities” table are from the USA, and excluding it makes the results arrive much faster! Plus, it gives an additional flavor to the task.)

DSC0000.png
  The image above displays the fields and corresponding datatypes of the “cities” table. Note the fields CountryID, Latitude and Longitude.
  Assumptions
  For the sake of simplicity, we’ll represent the earth as a 2D plane. The distance between any two points P1 (x1, y1) and P2 (x2, y2) on a 2D plane is computed as sqrt((x1 - x2)^2 + (y1 - y2)^2).
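  As a minimal sketch of this assumption, the planar distance can be written as a small JavaScript helper. The function name planarDistance and the lat/lon field names are hypothetical, chosen only to match the field names used later in the Map output.

```javascript
// Planar (2D) distance between two points given as { lat, lon } objects.
// planarDistance and the lat/lon field names are illustrative assumptions.
function planarDistance(p1, p2) {
  var dLat = p1.lat - p2.lat;
  var dLon = p1.lon - p2.lon;
  return Math.sqrt(dLat * dLat + dLon * dLon);
}

// A 3-4-5 right triangle gives a distance of exactly 5.
console.log(planarDistance({ lat: 0, lon: 0 }, { lat: 3, lon: 4 })); // 5
```

  Treating latitude and longitude as plane coordinates ignores the curvature of the earth, which is acceptable here only because the tutorial is about Map Reduce rather than geodesy.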
SQL Solution
  If the distance between each pair of cities in a country were known then we could simply apply a GROUP BY statement where we divide the data by Country and find those two cities where the distance is minimum. Since data is not available in this form, let’s try to manipulate it to get the desired structure.
/* QUERY 1 - VIEW: city_dist */
create view city_dist as
select c1.CountryID,
       c1.CityId, c1.City,
       c2.CityId as CityId2, c2.City as City2,
       sqrt(pow(c1.Latitude - c2.Latitude, 2) + pow(c1.Longitude - c2.Longitude, 2)) as Dist
from cities c1 inner join cities c2
where c1.CountryID = c2.CountryID /* Country should be same */
  and c1.CityId < c2.CityId       /* Calculate distance between 2 cities only once */
  and c1.CountryID <> 254         /* Don't include US cities */;
  Now that we have the distance between each pair of cities, we can group this data by country and select the 2 cities with the smallest value of the “Dist” field that is still greater than zero. This can be accomplished as shown below:
/* QUERY 2 */
select city_dist.*
from (
  select CountryID, min(Dist) as MinDist
  from city_dist
  where Dist > 0 /* Avoid cities which share Latitude & Longitude */
  group by CountryID
) a inner join city_dist on a.CountryID = city_dist.CountryID and a.MinDist = city_dist.Dist;
  That completes our SQL solution to the given problem. (You can delete the View “city_dist” later)
  It is important to note the steps we followed. In the first step we performed all the computations (by calculating the distance between 2 cities of each country). In the next step we grouped (or divided) our results by country and selected those 2 cities where the value of distance was least. These steps can be represented graphically as shown below.
DSC0001.png

Map Reduce Solution
  We can easily import our “cities” table from MySQL to MongoDB using MongoVUE. Instructions on importing are available here. Once this is done, a sample document in MongoDB looks like this:
DSC0002.png

  Map Reduce is a 3 step approach to solving problems.
DSC0003.png

  Step 1 – Map
  Map step is used to group or divide data into sets based on a desired value (called Key). This is actually similar to Step 2 of SQL solution above. The Map step is accomplished by writing a JavaScript function, and the signature of this function is given below.
function /*void*/ MapCode() {  

  
}
  In other words, the Map function takes no arguments and returns no data! That doesn’t seem very useful, does it? So let’s explore it in greater detail.
  Although the Map function doesn’t take any arguments, it gets invoked on each document of the collection as a method. Since it is invoked as a method, it has access to the “this” reference, so with “this” you can access any data within the “current” document. Also available is the “emit” function, which takes 2 arguments: the first is the key on which you want to group the data, and the second is the data itself that you want to group.
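  To make the “invoked as a method” idea concrete, here is a rough in-memory imitation in plain JavaScript. It is only a sketch: the emit function and the emitted array below are hypothetical stand-ins that we define ourselves (MongoDB supplies its own emit), and the sample documents are made up.

```javascript
// Hypothetical stand-in for the emit() that MongoDB provides to Map functions.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// A Map function: no parameters, no return value, reads the document via `this`.
function MapCode() {
  emit(this.CountryID, { city: this.City });
}

// Made-up sample documents, for illustration only.
var docs = [
  { CountryID: 1, City: "A" },
  { CountryID: 1, City: "B" },
  { CountryID: 2, City: "C" }
];

// Calling the function with .call(doc) makes `this` refer to the current document.
docs.forEach(function (doc) {
  MapCode.call(doc);
});

console.log(JSON.stringify(emitted));
```

  Each document produces one key-value pair here, but a Map function is free to call emit zero or several times per document.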
  When we write the Map function, we need to be careful about 3 things.

  •   Firstly, how do we want to divide or group the data? In other words, what is our key? Or what should be passed as the first parameter to “emit” function?
  •   Secondly, what part of the data will we need, and what part is extraneous? This helps us in determining the second parameter passed to the “emit” function.
  •   Thirdly, in what form or structure do we need our data? This helps us refine the second parameter of “emit” function.
  Let’s find the answers to these questions.

  •   It should be quite evident that we will group our data based on “CountryID”. We used the same field in SQL too. So we’ll pass “CountryID” as the first parameter to “emit” function.
function MapCode() {  
emit(this.CountryID, ...);
  
}
  We certainly don’t care about RegionID, TimeZone, DmaID, County and Code for calculating the closest cities. We can easily ignore these. Fields that seem helpful are CityId, City, Latitude and Longitude.
function MapCode() {
  emit(this.CountryID,
    {
      "city": this.City,
      "lat":  this.Latitude,
      "lon":  this.Longitude
    });
}
  With this we have answered our second question as well, i.e. what data is extraneous and what is necessary. Now, before we get to the third question above, let’s understand a bit more about Reduce. After the Map step completes we obtain a bunch of key-value pairs. In our case, the key is the CountryID and the value is a JSON object, as shown in the image below:
DSC0004.png

  The Reduce operation aggregates the different values for each given key using a user-defined function. In other words, Reduce takes each key (or CountryID), picks up all the values (in our case JSON objects) created by the Map step, and processes them one by one using custom logic. Let’s look at the signature of the Reduce function.
function /*object*/ ReduceCode(key, arr_values) {  

  
}

  Reduce takes 2 parameters: 1) the key and 2) an array of values (the values output by the Map step). The output of Reduce is an object. It is important to note that Reduce can be called multiple times on a single key! Yes, you read that correctly. It is not that difficult to picture, actually: consider a case where your data is huge and it lies on 2 different servers. It would be>
  Here is a picture explaining the Reduce step.
DSC0005.png

  The picture above shows Reduce being called twice. This is just an example. To be frank, we don’t know how MongoDB executes Reduce. We don’t know which key is going to be reduced first and which key last. We also don’t know how many times it is going to call Reduce for a key. This optimization is better left to MongoDB itself, as it finds the most suitable parallel execution for every MapReduce command.
  What we do know is that if Reduce is executed more than once then the value returned will be passed in a subsequent reduce as part of input.
  For our given problem, we want Reduce to output all the cities of a given country (so that we can then try to find the closest two). So the expected format of final reduced value (rF) is:
{  
"data" : [
  
{ city E },
  
{ city B },
  
{ . . .  }
  
]
  
}
  But the input values in the Reduce array (param 2) should have exactly the same format as the output, as the output may be intermediate and may participate in a further Reduce. So let’s mould the Map function to produce values in the above desired format.
function MapCode() {
  emit(this.CountryID,
    { "data":
      [
        {
          "city": this.City,
          "lat":  this.Latitude,
          "lon":  this.Longitude
        }
      ]
    });
}
  Our reduce function simply assimilates all the cities.
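function ReduceCode(key, values) {
  var reduced = {"data":[]};
  // Each value is either a Map output or an earlier Reduce output; both carry a "data" array
  for (var i = 0; i < values.length; i++) {
    var inter = values[i];
    for (var j = 0; j < inter.data.length; j++) {
      reduced.data.push(inter.data[j]);
    }
  }
  return reduced;
}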
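  The requirement that a Reduce output can be fed back in as a Reduce input can be checked outside MongoDB. The sketch below uses a self-contained copy of the same concatenation logic and made-up single-city values; it compares reducing three values in one call against reducing two first and then combining that intermediate result with the third.

```javascript
// Self-contained copy of the concatenating reduce logic, for illustration only.
function ReduceCode(key, values) {
  var reduced = { data: [] };
  for (var i = 0; i < values.length; i++) {
    for (var j = 0; j < values[i].data.length; j++) {
      reduced.data.push(values[i].data[j]);
    }
  }
  return reduced;
}

// Made-up values in the same shape the Map step emits.
var a = { data: [{ city: "A" }] };
var b = { data: [{ city: "B" }] };
var c = { data: [{ city: "C" }] };

// One call over all values versus a re-reduce of an intermediate result.
var onePass = ReduceCode(1, [a, b, c]);
var twoPass = ReduceCode(1, [ReduceCode(1, [a, b]), c]);

console.log(JSON.stringify(onePass) === JSON.stringify(twoPass)); // true
```

  If the Map values and the Reduce output had different shapes, the second form would fail, which is why the Map function wraps each city in a "data" array.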
  This brings us to Finalize step. Finalize is used to do any required transformation on the final output of Reduce. The function signature of Finalize is given below:
function /*object*/ FinalizeCode(key, value) {  

  
}
  The function takes a key-value pair and outputs a value. After the Reduce is complete, MongoDB runs Finalize on each key’s final reduced value. The output of Finalize for all keys is put in a collection, and it is this collection which is the result of Map Reduce. You can give it a desired name; if left unspecified, MongoDB selects a collection name for you.
DSC0006.png

  In our case, we’ll use Finalize to find the closest 2 cities out of all the given cities in a country. Here is the Finalize function.
function Finalize(key, reduced) {
  // A single city has no partner to compare against
  if (reduced.data.length == 1) {
    return { "message" : "This Country contains only 1 City" };
  }

  var min_dist = 999999999999;
  var city1 = { "city": "" };
  var city2 = { "city": "" };

  var c1;
  var c2;
  var d;
  // Compare every unordered pair of cities exactly once
  for (var i = 0; i < reduced.data.length; i++) {
    for (var j = i + 1; j < reduced.data.length; j++) {
      c1 = reduced.data[i];
      c2 = reduced.data[j];
      d = Math.sqrt((c1.lat-c2.lat)*(c1.lat-c2.lat)+(c1.lon-c2.lon)*(c1.lon-c2.lon));
      // Ignore pairs that share the same coordinates (d == 0)
      if (d < min_dist && d > 0) {
        min_dist = d;
        city1 = c1;
        city2 = c2;
      }
    }
  }

  // The city name is stored under "city" by the Map function
  return {"city1": city1.city, "city2": city2.city, "dist": min_dist};
}
  This completes our MapReduce solution as well. We just need to filter out US cities when we invoke it, which is easy enough to do with a simple condition:
{  
CountryID: { $ne: 254 }  /* 254 is US CountryID */
  
}
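  To see how the filter, Map, Reduce and Finalize fit together, here is a rough end-to-end imitation in plain JavaScript that runs without MongoDB. Everything in it is an assumption made for illustration: the groups object and the emit function stand in for MongoDB’s internals, the finalize logic is a simplified version of the one above, and the sample cities are made up.

```javascript
// Hypothetical in-memory stand-ins for MongoDB's grouping and emit().
var groups = {}; // key -> array of emitted values
function emit(key, value) {
  (groups[key] = groups[key] || []).push(value);
}

function MapCode() {
  emit(this.CountryID,
    { data: [{ city: this.City, lat: this.Latitude, lon: this.Longitude }] });
}

function ReduceCode(key, values) {
  var reduced = { data: [] };
  values.forEach(function (v) {
    reduced.data = reduced.data.concat(v.data);
  });
  return reduced;
}

// Simplified finalize: closest pair with a distance greater than zero.
function FinalizeCode(key, reduced) {
  var best = { city1: "", city2: "", dist: Infinity };
  for (var i = 0; i < reduced.data.length; i++) {
    for (var j = i + 1; j < reduced.data.length; j++) {
      var c1 = reduced.data[i];
      var c2 = reduced.data[j];
      var d = Math.sqrt(Math.pow(c1.lat - c2.lat, 2) + Math.pow(c1.lon - c2.lon, 2));
      if (d > 0 && d < best.dist) {
        best = { city1: c1.city, city2: c2.city, dist: d };
      }
    }
  }
  return best;
}

// Made-up sample data; 254 is the US CountryID used in the text.
var cities = [
  { CountryID: 1, City: "A", Latitude: 0, Longitude: 0 },
  { CountryID: 1, City: "B", Latitude: 3, Longitude: 4 },
  { CountryID: 1, City: "C", Latitude: 0, Longitude: 1 },
  { CountryID: 254, City: "X", Latitude: 10, Longitude: 10 },
  { CountryID: 254, City: "Y", Latitude: 10, Longitude: 11 }
];

// Stand-in for the condition { CountryID: { $ne: 254 } }, applied before Map.
cities
  .filter(function (doc) { return doc.CountryID !== 254; })
  .forEach(function (doc) { MapCode.call(doc); });

// Reduce each key's values once, then finalize the reduced value.
var results = {};
Object.keys(groups).forEach(function (key) {
  results[key] = FinalizeCode(key, ReduceCode(key, groups[key]));
});

console.log(JSON.stringify(results)); // {"1":{"city1":"A","city2":"C","dist":1}}
```

  In the mongo shell, the condition would typically be supplied through the query option of db.collection.mapReduce, alongside the finalize and out options; the collection name and output name to use are not given in this post.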
  
  
  
  Points to note

  •   This is clearly not intended to be a benchmark, but the SQL solution took about 100 sec on my laptop (the view creation took only 1 sec; the rest was spent in grouping and joins. Using a temp table or indexes would speed this up).
  •   Map Reduce took 6 seconds to run.
  •   There are other SQL and MapReduce solutions to this problem. For example, you could open cursors in SQL and iterate through all the records in nested for loops. Similarly, you could do 2 back-to-back MapReduce operations without resorting to the Finalize step. I’ll try to explore these in a future post.
  If you want to learn how to execute these steps in MongoVUE, refer to this step-by-step tutorial.


