Custom serialization in MongoDB (Customizing serialization)

kidys · posted 2015-7-6 11:35:26

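　　I have been studying MongoDB lately and picked up a few things along the way. I also just discovered that cnblogs supports Live Writer,
　　which got me excited enough to come back here after many years. I used to write for MySpace and WordPress with Live Writer,
　　but the former is gone and getting past the firewall to reach the latter is a hassle.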
　　====================================================
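　　First, a recommendation for a MongoDB query analyzer:

　　MongoVUE
　　The tool is very handy. It keeps working after the trial period expires,
　　only limited to three query windows.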
　　
　　
　　I used db4o and protobuf.net before, so MongoDB felt familiar right away.
　　They are very similar, especially in how objects are persisted; only the details differ slightly.
　　
　　=============================================
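　　1. Requirement:
　　A new algorithm of mine needs to read an entire collection, and that takes tens of seconds.
　　At first I relied on attribute-based automatic serialization and deserialization, and no matter what I adjusted, the performance of InsertBatch and
　　FindAll() would not improve.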
　　
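　　2. Analysis:
　　I first assumed the painfully slow reads and saves were MongoDB's own fault. Thinking it over today, the real problem is that too much data is written to disk.
　　MongoDB persists data in BSON format, and that format carries considerable redundancy, especially with the default serialization and deserialization.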
　　

　　"_id" : ObjectId("4f4e2a02c992571e54c30465"),
"value" : "xxxxx",
"chars" : [{
"words" : [{
"index" : 0,
"length" : 2,
"wordTypes" : 0
}]
}, {
"words" : [{
"index" : 0,
"length" : 2,
"wordTypes" : 0
}, {
"index" : 1,
"length" : 2,
"wordTypes" : 0
}]
},
　　
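　　Looking at the stored data in MongoVUE, most of the space goes to property names, which carry little meaning. By my rough calculation the names take about 5 to 10 times the space of the values.
　　protobuf, by comparison, uses numbers as field names, which saves a great deal of space.
　　But MongoDB can query on fields and protobuf cannot, so MongoDB did not take the protobuf approach.
　　
　　One of my collections has 50,000 documents averaging 4,000 bytes each, which is strikingly inefficient persistence. No wonder a read takes tens of seconds. The whole data set occupies about 200 MB (50,000 × 4,000 bytes).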
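　　To make the overhead concrete, here is a rough size calculation for one entry of the "words" array shown above. It only applies the BSON layout rules (a type byte, a NUL-terminated name and a 4-byte value per int32 element; a 4-byte length prefix and a trailing NUL per document). The class and method names are made up for this sketch and no MongoDB driver is involved.

```csharp
using System;
using System.Text;

static class BsonSizeSketch
{
    // One BSON int32 element: 1 type byte + UTF-8 name + NUL terminator + 4 value bytes.
    public static int Int32ElementSize(string name)
    {
        return 1 + Encoding.UTF8.GetByteCount(name) + 1 + 4;
    }

    // A BSON document holding only int32 fields:
    // 4-byte length prefix + all elements + 1 trailing NUL.
    public static int Int32DocumentSize(params string[] names)
    {
        int size = 4 + 1;
        foreach (var name in names) size += Int32ElementSize(name);
        return size;
    }

    static void Main()
    {
        int asDocument = Int32DocumentSize("index", "length", "wordTypes");
        Console.WriteLine(asDocument);            // 43
        // Stored as an array element, the sub-document also needs a type byte and its key ("0" + NUL).
        Console.WriteLine(1 + 2 + asDocument);    // 46
        // The same word packed into a single int32 array element keyed "0".
        Console.WriteLine(Int32ElementSize("0")); // 7
    }
}
```

　　Only 12 of the 43 bytes are actual values; the rest is names and framing, which is why packing the three fields into one number pays off.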
　　
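　　I had read MongoDB's official documentation
　　http://www.mongodb.org/display/DOCS/CSharp+Language+Center
　　so I vaguely remembered the part on Customizing serialization.
　　
　　The official documentation is very terse. It only says the class should implement the IBsonSerializable interface and its four methods. There is no example, so it is not at all clear how to go about it.
　　public class MyClass : IBsonSerializable {
　　    // implement Deserialize method
　　    // implement Serialize method
　　}
　　
　　Fine, Google to the rescue.
　　Stack Overflow is a great site:
　　http://stackoverflow.com/questions/7105274/storing-composite-nested-object-graph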
　　
　　3. Solution:
　　
　　Part 1: pack the object into a number to save the space taken by names
　　

      public UInt32 IntValue
      {
         get
         {
            var v1 = ((UInt32)WordTypes) << …   // the middle of this listing is missing in the source
            … >> 24);
      }
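
　　The property above folds a word's fields into one number using shifts. A minimal, self-contained sketch of the same idea follows. The bit layout (8 bits for the word type, 12 bits each for index and length) and all names are assumptions made for illustration, not the layout used in the original code.

```csharp
using System;

static class WordPacking
{
    // Assumed layout: [ wordTypes: 8 bits | index: 12 bits | length: 12 bits ]
    public static uint Pack(uint wordTypes, uint index, uint length)
    {
        if (wordTypes > 0xFF || index > 0xFFF || length > 0xFFF)
            throw new ArgumentException("value does not fit its bit field");
        return (wordTypes << 24) | (index << 12) | length;
    }

    public static void Unpack(uint packed, out uint wordTypes, out uint index, out uint length)
    {
        wordTypes = packed >> 24;           // top 8 bits
        index = (packed >> 12) & 0xFFF;     // middle 12 bits
        length = packed & 0xFFF;            // low 12 bits
    }

    static void Main()
    {
        uint packed = Pack(3, 1, 2);
        Console.WriteLine(packed);          // 50335746

        uint t, i, l;
        Unpack(packed, out t, out i, out l);
        Console.WriteLine(t + " " + i + " " + l); // 3 1 2
    }
}
```

　　If the top bit can be set (a word type of 128 or more in this layout), converting the result to Int32 for storage needs unchecked((int)packed), and the reverse cast when reading it back.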
　　
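　　There is not much to say about the code above: it is just left and right shifts. The data type can overflow, of course; if that happens, switch to Int64 or adjust the layout. To be clear, I do not intend to query the fields of this third-level object in MongoDB. It is stored only, and searching goes through a separate keyword string instead. Since the fields are never queried, they need no names at all, so several properties can be OR-ed into one number and stored in an array. The object itself disappears.

　　Part 2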

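using System;
using System.Collections.Generic;
using System.Linq;
using MongoDB.Bson;
using MongoDB.Bson.IO;
using MongoDB.Bson.Serialization;

public partial class Sentence : IBsonSerializable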
{
      public static int idSum;
      public bool GetDocumentId(out object id, out Type idNominalType, out IIdGenerator idGenerator)
      {
         id = this.Id = idSum++;
         idNominalType = typeof(int);
         idGenerator = null;
         return true;
      }
      public void Serialize(MongoDB.Bson.IO.BsonWriter bsonWriter, Type nominalType, IBsonSerializationOptions options)
      {
         bsonWriter.WriteStartDocument();
         bsonWriter.WriteInt32("_id", this.Id);  // an ObjectId here would cost 10-odd bytes
         bsonWriter.WriteString("value", this.Value); // shortening every name to a few letters would save another dozen or so bytes
         bsonWriter.WriteString("words", this.WordStr);
         bsonWriter.WriteBoolean("isConf", this.IsConflict);
         bsonWriter.WriteStartArray("c");
         foreach (var item in Chars)
         {
            BsonSerializer.Serialize(bsonWriter, item.Words.Select(v=>v.IntValue).ToList());
         }

         bsonWriter.WriteEndArray();
         bsonWriter.WriteEndDocument();
      }
      public void SetDocumentId(object id)
      {
         throw new NotImplementedException();
      }
      public object Deserialize(MongoDB.Bson.IO.BsonReader bsonReader, Type nominalType, IBsonSerializationOptions options)
      {
         //bsonReader.ReadStartDocument();
         //this.Id = bsonReader.ReadInt32();
         //var value=bsonReader.ReadString("v");
         //var wordStr=bsonReader.ReadString("w");
         //bsonReader.ReadStartArray();
         //var list = new List();
         //while (bsonReader.ReadBsonType() != BsonType.EndOfDocument)
         //{
         // var element = BsonSerializer.Deserialize(bsonReader);
         // list.Add(element);
         //}
         //bsonReader.ReadEndArray();
         //var isConflict=bsonReader.ReadBoolean("i");
         //bsonReader.ReadEndDocument();

         if (nominalType != typeof(Sentence))
            throw new ArgumentException("Cannot deserialize: the nominal type does not match Sentence");
         var doc = BsonDocument.ReadFrom(bsonReader);
         this.Id = (Int32)doc["_id"];
         this.Value = (string)doc["value"];
         this.WordStr = (string)doc["words"];
         this.IsConflict = (bool)doc["isConf"];
         var list = (BsonArray)doc["c"];
         this.Chars = new List<CharObj>();
         for (int i = 0; i < list.Count; i++)
         {
            var ch = new CharObj { Index = i, Sen = this, Words = new List<WordObj>() };
            this.Chars.Add(ch);
            var words = (BsonArray)list[i];   // the i-th inner array holds the packed words of one char
            foreach (Int32 item in words)
            {
                  var wordObj = new WordObj((UInt32)item);
                  wordObj.Sen = this;
                  ch.Words.Add(wordObj);
            }
         }

         return this;
         //return new Sentence { Id=1,  IsConflict= true, Value="1", WordStr= "1"};
      }
}

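　　A few points to note:
　　First, Id generation. I do not quite see why the id method needs such complicated parameters, but this way I can bypass the GUID-like ObjectId and use an int, which saves some space (an ObjectId is 12 bytes against 4 for an Int32).
　　Of course, if the object as a whole is large, just use ObjectId. There is no need for an int, and an int brings plenty of problems: the maximum value has to be saved in another collection, and unlike an ObjectId it is not unique across collections. So MongoDB's choice of ObjectId over int for ids makes a lot of sense. If the object is large overall, saving those few bytes is not worth it.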
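
　　As an aside on the int id: the static idSum++ in GetDocumentId is not safe if several threads insert at once, and it restarts from zero with every process. A small sketch of a counter that is seeded from the last id already stored and hands out ids atomically is below. The class name is made up, and reading the seed from the database is left out.

```csharp
using System;
using System.Threading;

sealed class IntIdAllocator
{
    private int _last;

    // lastUsedId is the highest id already stored (for example, read from a separate collection).
    public IntIdAllocator(int lastUsedId)
    {
        _last = lastUsedId;
    }

    // Atomically advances the counter and returns the new id.
    public int Next()
    {
        return Interlocked.Increment(ref _last);
    }

    // The value to persist so that the next run can continue from it.
    public int LastUsed
    {
        get { return Interlocked.CompareExchange(ref _last, 0, 0); }
    }

    static void Main()
    {
        var ids = new IntIdAllocator(41);
        Console.WriteLine(ids.Next());   // 42
        Console.WriteLine(ids.Next());   // 43
        Console.WriteLine(ids.LastUsed); // 43
    }
}
```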
　　
　　Second, the Serialize implementation must start with bsonWriter.WriteStartDocument() and end with bsonWriter.WriteEndDocument(). Do not forget this, otherwise you get an error saying it cannot write.
　　
　　Third, how to write the second-level collections. I originally wrote it like this:

         foreach (var item in Chars)
         {
            bsonWriter.WriteStartArray("words");
            foreach (var w in item.Words)
                  bsonWriter.WriteInt32((Int32)w.IntValue);
            bsonWriter.WriteEndArray();
         }
　　
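　　But this fails. At the time I took it to mean that MongoDB does not support this kind of nested persistence. A more likely cause is that elements inside an array have no names, so the named overload WriteStartArray("words") is not valid at that position.
　　
　　What worked for me was: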

         foreach (var item in Chars)
         {
            BsonSerializer.Serialize(bsonWriter, item.Words.Select(v=>v.IntValue).ToList());
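         }
　　Note that although BsonSerializer.Serialize will accept an IEnumerable, I had to call ToList here; otherwise the data was not saved.
　　
　　Fourth, when deserializing I could not read directly with the Start/End methods, which kept throwing. I ended up reading the whole document in one go and then picking the values out by key.
　　
　　4. Comparison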
　　

　　
　　The new BSON storage is far more compact. Going by the Serialize method above, each document now holds only "_id" (an int), "value", "words", "isConf" and "c", an array of arrays of packed integers, instead of the nested "chars"/"words" sub-documents.
　　Compared with the original, the difference is striking.

　　
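　　MongoVUE shows the average document size is now only 364 bytes, down from a frightening 4,000.
　　The total size dropped from 200 MB to 17 MB.
　　
　　As for time: on my laptop a read takes about 9 seconds, against more than 40 before. A desktop with a faster disk is several times quicker and finishes within a few seconds.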
　　
　　
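　　5. Other notes
　　Why implement custom persistence at all? First, of course, the performance was very worrying. Second, there is the problem of rebinding the pointers between related objects.
　　Previously, data read from the database needed its cross-references restored by hand. Now that can be done right inside the deserialization method.
　　In other words, objects come out of a query already identical to their in-memory form.
　　The benefit is a much simpler program.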
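
　　The rebinding step can be shown without the driver. The sketch below uses stripped-down stand-ins for Sentence, CharObj and WordObj (only the fields needed here; the Packed field and the FromPacked method are hypothetical) and rebuilds the back-references while walking an array of arrays of packed integers, the same way the Deserialize method above walks the "c" array.

```csharp
using System;
using System.Collections.Generic;

class WordObj
{
    public uint Packed;      // packed fields, as stored
    public Sentence Sen;     // back-reference to the owning sentence
}

class CharObj
{
    public int Index;
    public Sentence Sen;
    public List<WordObj> Words = new List<WordObj>();
}

class Sentence
{
    public List<CharObj> Chars = new List<CharObj>();

    // Rebuilds the object graph, including parent pointers, from the stored arrays.
    public static Sentence FromPacked(int[][] packedChars)
    {
        var sen = new Sentence();
        for (int i = 0; i < packedChars.Length; i++)
        {
            var ch = new CharObj { Index = i, Sen = sen };
            foreach (int item in packedChars[i])
                ch.Words.Add(new WordObj { Packed = unchecked((uint)item), Sen = sen });
            sen.Chars.Add(ch);
        }
        return sen;
    }

    static void Main()
    {
        var sen = FromPacked(new[] { new[] { 2 }, new[] { 2, 4098 } });
        Console.WriteLine(sen.Chars.Count);                               // 2
        Console.WriteLine(sen.Chars[1].Words.Count);                      // 2
        Console.WriteLine(ReferenceEquals(sen.Chars[1].Words[0].Sen, sen)); // True
    }
}
```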
　　
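　　You work with MongoDB data objects through pointers just as you would with in-memory objects, and the data is then persisted automatically.
　　Well, I find I have fallen in love with MongoDB, even though it still has quite a few shortcomings.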
