Result Grouping and Field Collapsing in Solr

aa0660, posted on 2016-12-16 09:22:22


Introduction

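Field collapsing and result grouping are two ways of thinking about the same Solr feature.
Field collapsing collapses a set of results that share the same field value down to one or a few entries. For example, most search engines such as Google collapse results so that only one or two entries per site are shown, along with a link to click for more results from that site. Collapsing can also be used to suppress duplicate documents.

Result grouping groups documents that share a common field value, returning the top documents per group and the top groups based on the documents they contain. One example is a search at Best Buy for a common term such as "dvd", which shows the top 3 results for each category ("TVs & Video", "Movies", "Computers", etc.).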

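Quick Start

If you have not set up Solr yet, download the Solr distribution first, then follow 【solr入门.doc】 to complete the setup.

Now enable result grouping and send a query. As a first attempt, we group on the manufacturer name (the manu_exact field).
Note: currently you can only group on single-valued fields!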
...&q=solr+memory&group=true&group.field=manu_exact
http://192.168.2.89:8080/solr/collection1/select?q=*%3A*&wt=xml&indent=true&group=true&group.field=KR_UID

The grouped response looks like this:

[...]
"grouped":{
    "manu_exact":{
      "matches":6,
      "groups":[{
          "groupValue":"Apache Software Foundation",
          "doclist":{"numFound":1,"start":0,"docs":[
              {
                "id":"SOLR1000",
                "name":"Solr, the Enterprise Search Server"}]
          }},
        {
          "groupValue":"Corsair Microsystems Inc.",
          "doclist":{"numFound":2,"start":0,"docs":[
              {
                "id":"VS1GB400C3",
                "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"}]
          }},
        {
          "groupValue":"A-DATA Technology Inc.",
          "doclist":{"numFound":1,"start":0,"docs":[
              {
                "id":"VDBDB1A16",
                "name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"}]
          }},
        {
          "groupValue":"Canon Inc.",
          "doclist":{"numFound":1,"start":0,"docs":[
              {
                "id":"0579B002",
                "name":"Canon PIXMA MP500 All-In-One Photo Printer"}]
          }},
        {
          "groupValue":"ASUS Computer Inc.",
          "doclist":{"numFound":1,"start":0,"docs":[
              {
                "id":"EN7800GTX/2DHTV/256M",
                "name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
          }}]}}

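The response indicates that 6 documents matched the query. For each unique value of group.field, a doclist containing the highest-scoring document is returned. The doclist also reports the total number of matches in that group as "numFound". The groups themselves are sorted by the score of the top document in each group.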
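To make the response structure concrete, here is a minimal in-memory sketch of the behavior described above: one group per distinct field value, the best-scoring documents at the top of each group, a per-group numFound, and groups ordered by the score of their top document. It does not call Solr. The class and method names are hypothetical, the scores are made up, and the second Corsair document id is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of group.field semantics; this is not Solr code.
class GroupingSketch {

    // Hypothetical scored search hit.
    static final class Doc {
        final String id;
        final String manu;   // value of the field we group on
        final double score;  // relevance score (made up for this sketch)

        Doc(String id, String manu, double score) {
            this.id = id;
            this.manu = manu;
            this.score = score;
        }
    }

    // One group: the shared field value, the total matches, and the top documents.
    static final class Group {
        final String groupValue;
        final List<Doc> docs = new ArrayList<>();
        int numFound = 0;

        Group(String groupValue) {
            this.groupValue = groupValue;
        }
    }

    // Groups hits by manu, keeps at most `limit` best documents per group,
    // and orders the groups by the score of their best document.
    static List<Group> groupTopDocs(List<Doc> hits, int limit) {
        List<Doc> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Double.compare(b.score, a.score)); // best first

        Map<String, Group> byValue = new LinkedHashMap<>();
        for (Doc d : sorted) {
            Group g = byValue.computeIfAbsent(d.manu, Group::new);
            g.numFound++;                 // every match counts toward numFound
            if (g.docs.size() < limit) {  // only the top `limit` docs are kept
                g.docs.add(d);
            }
        }
        // Insertion order already follows each group's best score.
        return new ArrayList<>(byValue.values());
    }

    public static void main(String[] args) {
        List<Doc> hits = Arrays.asList(
            new Doc("VS1GB400C3", "Corsair Microsystems Inc.", 0.8),
            new Doc("SOLR1000", "Apache Software Foundation", 1.2),
            new Doc("PLACEHOLDER-ID", "Corsair Microsystems Inc.", 0.5));
        for (Group g : groupTopDocs(hits, 1)) {
            System.out.println(g.groupValue + " numFound=" + g.numFound
                + " top=" + g.docs.get(0).id);
        }
    }
}
```

With a limit of 1, the Corsair group still reports numFound=2 even though only one document is listed, which mirrors the "numFound":2 entry in the response above.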

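We can also find the top-scoring documents that match arbitrary queries by using the group.query command (much like facet.query). For example, we can use it to find the top 3 documents in different price ranges: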
...&q=memory&group=true&group.query=price:&group.query=price:&group.limit=3

[...]
"grouped":{
    "price:":{
      "matches":5,
      "doclist":{"numFound":1,"start":0,"docs":[
          {
            "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
            "price":74.99}]
      }},
    "price:":{
      "matches":5,
      "doclist":{"numFound":3,"start":0,"docs":[
          {
            "name":"Canon PIXMA MP500 All-In-One Photo Printer",
            "price":179.99},
          {
            "name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail",
            "price":185.0},
          {
            "name":"ASUS Extreme N7800GTX/2DHTV (256 MB)",
            "price":479.95}]
      }}
[...]

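The response above shows that the query "memory" matched 5 documents. Of these, 1 has a price below $100 and 3 have a price of $100 or more. The two counts do not add up to 5 because one document has no price at all and therefore matches neither group.query.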
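The counting behavior can be sketched in memory as follows. This is an illustration only, not Solr code: each group.query is modeled as a predicate over the price, a null price stands for a document with no price field, and the labels are hypothetical. The thresholds follow the prose above, since the range expressions themselves are missing from the request shown.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// In-memory illustration of group.query semantics; this is not Solr code.
class GroupQuerySketch {

    // Counts, for each named query, how many prices satisfy it.
    // A null price stands for a document that has no price field.
    static Map<String, Integer> countPerQuery(List<Double> prices,
                                              Map<String, Predicate<Double>> queries) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Predicate<Double>> q : queries.entrySet()) {
            int n = 0;
            for (Double p : prices) {
                if (p != null && q.getValue().test(p)) {
                    n++;
                }
            }
            counts.put(q.getKey(), n);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four prices taken from the response above, plus one document without a price.
        List<Double> prices = Arrays.asList(74.99, 179.99, 185.0, 479.95, null);

        // Labels are hypothetical; the thresholds follow the prose above.
        Map<String, Predicate<Double>> queries = new LinkedHashMap<>();
        queries.put("under 100", p -> p < 100.0);
        queries.put("100 and up", p -> p >= 100.0);

        System.out.println("matches=" + prices.size());
        System.out.println(countPerQuery(prices, queries));
    }
}
```

Each query is evaluated independently against the full result set, so a document can fall into no group, or in principle into more than one if the queries overlap.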

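We can optionally use a single group command as the "main" result by adding the parameter group.main=true. Although this result format carries less information, it may be easier for existing Solr clients to parse.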
...&q=solr+memory&group=true&group.field=manu_exact&group.main=true

"response":{"numFound":6,"start":0,"docs":[
      {
        "id":"SOLR1000",
        "name":"Solr, the Enterprise Search Server",
        "manu":"Apache Software Foundation"},
      {
        "id":"VS1GB400C3",
        "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
        "manu":"Corsair Microsystems Inc."},
      {
        "id":"VDBDB1A16",
        "name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM",
        "manu":"A-DATA Technology Inc."},
      {
        "id":"0579B002",
        "name":"Canon PIXMA MP500 All-In-One Photo Printer",
        "manu":"Canon Inc."},
      {
        "id":"EN7800GTX/2DHTV/256M",
        "name":"ASUS Extreme N7800GTX/2DHTV (256 MB)",
        "manu":"ASUS Computer Inc."}]
}
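As a rough picture of what the flat format gives up, the sketch below flattens a grouped structure into a single document list. It is an in-memory illustration under assumed names (GroupMainSketch and flatten are hypothetical), not Solr's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of the flat result shape; this is not Solr code.
class GroupMainSketch {

    // Concatenates each group's documents, in group order, into one flat list.
    static List<String> flatten(Map<String, List<String>> docIdsByGroup) {
        List<String> flat = new ArrayList<>();
        for (List<String> docIds : docIdsByGroup.values()) {
            flat.addAll(docIds);
        }
        return flat;
    }

    public static void main(String[] args) {
        // Group values and ids taken from the grouped response shown earlier.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("Apache Software Foundation", Arrays.asList("SOLR1000"));
        grouped.put("Corsair Microsystems Inc.", Arrays.asList("VS1GB400C3"));
        grouped.put("A-DATA Technology Inc.", Arrays.asList("VDBDB1A16"));
        System.out.println(flatten(grouped));
    }
}
```

The flat list keeps the document order but drops the group boundaries and the per-group numFound values, which is the information loss mentioned above.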

Request Parameters

Parameter | Value | Description

group | true/false | If true, result grouping is turned on.

group.field | field name | Group based on the unique values of a field. The field must currently be single-valued and must be either indexed or of another field type that has a value source and works in a function query, such as ExternalFileField. Note: for Solr 3.x versions the field must be a string-like field such as StrField or TextField, otherwise an HTTP status 400 is returned.

group.func | function query | Group based on the unique values of a function query. (Solr 4.1) WARNING: If this parameter is set to true on a sharded environment, all the documents that belong to the same group have to be located in the same shard, otherwise the count will be incorrect. If you are using SolrCloud, consider using "custom hashing".

group.truncate | true/false | If true, facet counts are based on the most relevant document of each group matching the query. The same applies to StatsComponent. Default is false. (Solr 4.0) WARNING: If this parameter is set to true on a sharded environment, all the documents that belong to the same group have to be located in the same shard, otherwise the count will be incorrect. If you are using SolrCloud, consider using "custom hashing".

group.cache.percent | 0-100 | If > 0, enables the grouping cache. Grouping actually executes two searches, and this option caches the second one. A value of 0 disables grouping caching. Default is 0. Tests have shown that this cache only improves search time with boolean queries, wildcard queries and fuzzy queries. For simple queries such as a term query or a match-all query, this cache has a negative impact on performance.

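Notes:
1. Any number of group commands (group.field, group.func, group.query) may be specified in a single request.
2. Since Solr 3.5, grouping also supports distributed search. Currently group.truncate and group.func are the only parameters that are not supported in distributed search.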
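Since several group commands can share one request, a request URL simply repeats the relevant parameters. The sketch below builds such a URL with only the Java standard library. The GroupUrlSketch class is hypothetical, the host and core in the base URL are placeholders, and the price range is an illustrative value chosen for this example rather than one taken from the text.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Builds a Solr select URL that carries several group commands at once.
class GroupUrlSketch {

    private final List<String> pairs = new ArrayList<>();

    // Appends one URL-encoded name=value pair; the same name may be added repeatedly.
    GroupUrlSketch add(String name, String value) {
        try {
            pairs.add(URLEncoder.encode(name, "UTF-8") + "="
                + URLEncoder.encode(value, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        return this;
    }

    // Joins all pairs onto the base URL in insertion order.
    String build(String baseUrl) {
        return baseUrl + "?" + String.join("&", pairs);
    }

    public static void main(String[] args) {
        String url = new GroupUrlSketch()
            .add("q", "memory")
            .add("group", "true")
            .add("group.field", "manu_exact")
            .add("group.query", "price:[0 TO 99.99]") // illustrative range (assumption)
            .add("group.limit", "3")
            .build("http://localhost:8080/solr/collection1/select"); // placeholder host/core
        System.out.println(url);
    }
}
```

Encoding each value matters because range queries contain characters such as brackets, colons and spaces that are not valid in a raw query string.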

Known Limitations

1. Grouping is not supported on multi-valued fields.

SolrJ Usage Example

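import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.GroupResponse;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.GroupParams;

// Fragment from inside a method. QUERY_CONTENT, QUERY_ROWS, GROUP, GROUP_FIELD,
// GROUP_LIMIT and logger are defined elsewhere in the enclosing class.
SolrServer server = this.getSolrServer();
SolrQuery param = new SolrQuery();
param.setQuery(QUERY_CONTENT);
param.setRows(QUERY_ROWS);
param.setParam(GroupParams.GROUP, GROUP);              // turn grouping on
param.setParam(GroupParams.GROUP_FIELD, GROUP_FIELD);  // field to group on
param.setParam(GroupParams.GROUP_LIMIT, GROUP_LIMIT);  // documents returned per group
QueryResponse response = null;
try {
    response = server.query(param);
} catch (SolrServerException e) {
    logger.error(e.getMessage(), e);
}
// Maps each group value to the number of matching documents in that group.
Map<String, Integer> info = new HashMap<String, Integer>();
// response is still null if the query failed, so guard before dereferencing it.
GroupResponse groupResponse = (response != null) ? response.getGroupResponse() : null;
if (groupResponse != null) {
    List<GroupCommand> groupList = groupResponse.getValues();
    for (GroupCommand groupCommand : groupList) {
        List<Group> groups = groupCommand.getValues();
        for (Group group : groups) {
            info.put(group.getGroupValue(), (int) group.getResult().getNumFound());
        }
    }
}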

Example 2

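// Fragment from inside a method. searchSource (a SolrServer), page and pageSize
// are defined elsewhere. CollectionUtils is assumed to come from Apache Commons
// Collections, since the source does not show its imports.
SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.addFilterQuery("display:1");
solrQuery.addFilterQuery("activityBeginTime:[* TO NOW]");
solrQuery.addFilterQuery("activityEndTime:");  // range expression is missing in the source
solrQuery.setParam(GroupParams.GROUP, true);   // turn grouping on
// One group per query; setParam accepts several values for the same parameter.
solrQuery.setParam(GroupParams.GROUP_QUERY, "id:1", "id:2");
solrQuery.setParam(GroupParams.GROUP_LIMIT, pageSize + "");
solrQuery.setParam(GroupParams.GROUP_OFFSET, pageSize * (page - 1) + "");
// NOTE: setParam replaces any earlier value, so this overrides the pageSize limit above.
solrQuery.setParam(GroupParams.GROUP_LIMIT, "1");
// Sort order of documents inside each group, given as a single sort specification.
solrQuery.setParam(GroupParams.GROUP_SORT, "id desc, sort asc");
solrQuery.setRows(0);

QueryResponse qr = searchSource.query(solrQuery, SolrRequest.METHOD.POST);
GroupResponse groupResponse = qr.getGroupResponse();
List<GroupCommand> list = groupResponse.getValues();

// Walk every group command, every group in it, and every document in the group.
for (GroupCommand gc : list) {
    List<Group> gs = gc.getValues();
    if (CollectionUtils.isNotEmpty(gs)) {
        for (Group g : gs) {
            SolrDocumentList sds = g.getResult();
            if (CollectionUtils.isNotEmpty(sds)) {
                for (SolrDocument doc : sds) {
                    String id = doc.getFieldValue("id").toString();
                }
            }
        }
    }
}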
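Example 2 derives group.offset from a page number and a page size. The arithmetic can be isolated as in the sketch below, which assumes 1-based page numbers as the expression pageSize * (page - 1) implies. GroupPagingSketch and pageParams are hypothetical names, and the method only computes parameter values without contacting Solr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Computes per-group paging parameters from a 1-based page number.
class GroupPagingSketch {

    // Returns the group.offset and group.limit values for the requested page.
    static Map<String, String> pageParams(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        Map<String, String> params = new LinkedHashMap<>();
        // Skip the documents that belong to the earlier pages of each group.
        params.put("group.offset", String.valueOf(pageSize * (page - 1)));
        // Return one page worth of documents per group.
        params.put("group.limit", String.valueOf(pageSize));
        return params;
    }

    public static void main(String[] args) {
        System.out.println(pageParams(3, 10));
    }
}
```

These two parameters page through the documents inside each group. Paging through the groups themselves is a separate concern and is not covered by this sketch.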

http://localhost:8080/solr/collection1/select?q=*%3A*&wt=xml&indent=true&group=true&group.field=KV_TITLE&group.query=KR_UID:001&group.limit=5


运维网's Archiver
