[ lucene扩展 ] Apache Lucene Faceted Search User's Guide [转]
Apache LuceneFaceted Search
User's Guide
Table of Contents
[*]Introduction
[*]Facet Features
[*]Indexing Categories Illustrated
[*]Accumulating Facets Illustrated
[*]Indexed Facet Information
[*]Taxonomy Index
[*]Facet Parameters
[*]Advanced Faceted Examples
[*]Optimizations
[*]Concurrent Indexing and Search
Introduction
A category is an aspect of indexed documents which can be used to classifythe documents. For example, in a collection of books at an online bookstore,categories of a book can be its price, author, publication date, binding type,and so on.
In faceted search, in addition to the standard set of search results, wealso get facet results, which are lists of subcategories for certaincategories. For example, for the price facet, we get a list of relevant priceranges; for the author facet, we get a list of relevant authors; and so on. Inmost UIs, when users click one of these subcategories, the search is narrowed,or drilled down, and a new search limited to this subcategory (e.g., to aspecific price range or author) is performed.
Note that faceted search is more than just the ordinary fielded search. Infielded search, users can add search keywords like price:10 orauthor:"Mark Twain" to the query to narrow the search, but thisrequires knowledge of which fields are available, and which values are worthtrying. This is where faceted search comes in: it provides a list of usefulsubcategories, which ensures that the user only drills down into usefulsubcategories and never into a category for which there are no results. Inessence, faceted search makes it easy to navigate through the search results.The list of subcategories provided for each facet is also useful to the user initself, even when the user never drills down. This list allows the user to seeat one glance some statistics on the search results, e.g., what price rangesand which authors are most relevant to the given query.
In recent years, faceted search has become a very common UI feature insearch engines, especially in e-commerce websites. Faceted search makes it easyfor untrained users to find the specific item they are interested in, whereasmanually adding search keywords (as in the examples above) proved toocumbersome for ordinary users, and required too much guesswork,trial-and-error, or the reading of lengthy help pages.
See http://en.wikipedia.org/wiki/Faceted_search formore information on faceted search.
Facet Features
First and main faceted search capability that comes tomind is counting, but in fact faceted search is more than facet counting. Wenow briefly discuss the available faceted search features.
Facet Counting
Which of the available subcategories of a facet should a UI display? Aquery in a book store might yield books by a hundred different authors, butnormally we'd want do display only, say, ten of those.
Most available faceted search implementations use counts to determine theimportance of each subcategory. These implementations go over all searchresults for the given query, and count how many results are in eachsubcategory. Finally, the subcategories with the most results can be displayed.So the user sees the price ranges, authors, and so on, for which there are mostresults. Often, the count is displayed next to the subcategory name, inparentheses, telling the user how many results he can expect to see if hedrills down into this subcategory.
The main API for obtaining facet counting is CountFacetRequest, as in the following code snippet:
new CountFacetRequest(new CategoryPath("author"),10));
A detailed code example using count facet requests isshown below - see Accumulating Facets.
Facet Associations
So far we've discussed categories as binary features, where a documenteither belongs to a category, or not.
While counts are useful in most situations, they are sometimes notsufficiently informative for the user, with respect to deciding whichsubcategory is more important to display.
For this, the facets package allows to associate a value with a category.The search time interpretation of the associated value is applicationdependent. For example, a possible interpretation is as a match level(e.g., confidence level). This value can then be used so that a document thatis very weakly associated with a certain category will only contribute littleto this category's aggregated weight.
Multiple FacetRequests
A single faceted accumulation is capable of servicing multiple facetrequests. Programmatic, this is quite simple - wrap all the facet requests ofinterest into the facet-search-parameters which are passed to a facetsaccumulator/collector (more on these objects below). The results would becomprised of as many facet results as there were facet requests.
However there is a delicate limitation: all facets maintained inthe same location in the index are required to be treated the same. See thesection on Indexing Parameters for an explanation onmaintaining certain facets at certain locations.
Facet Labels at Search Time
Facets results always contain the facet (internal) ID and (accumulated)value. Some of the results also contain the facet label, AKA the category name.We mention this here since computing the label is a time consuming task, andhence applications can specify with a facet request to return top 1000 facetsbut to compute the label only for the top 10 facets. In order to compute labelsfor more of the facet results it is not required to perform accumulation again.
See FacetRequest.getNumResults(), FacetRequest.getNumLabel() and FacetResultNode.getLabel(TaxonomyReader).
Indexing CategoriesIllustrated
In order to find facets at search time they must first be added to theindex at indexing time. Recall that Lucene documents are made of fields fortextual search. The addition of categories is performed by an appropriate DocumentBuilder - or CategoryDocumentBuilder in our case.
Indexing therefore usually goes like this:
[*]For eachinput document:
[*]Create afresh (empty) Lucene Document
[*]Parse inputtext and add appropriate text search fields
[*]Gather allinput categories associated with the document and create aCategoryDocumentBuilder with the list of categories
[*]Build thedocument - this actually adds the categories to the Lucene document.
[*]Add thedocument to the index
Following is a code snippet for indexing categories. Thecomplete example can be found in package org.apache.lucene.facet.example.simple.SimpleIndexer.
[*]IndexWriter writer= ...
[*]TaxonomyWritertaxo = new DirectoryTaxonomyWriter(taxoDir, OpenMode.CREATE);
[*]...
[*]Document doc = newDocument();
[*]doc.add(new Field("title",titleText, Store.YES, Index.ANALYZED));
[*]...
[*]Listcategories = new ArrayList();
[*]categories.add(newCategoryPath("author", "Mark Twain"));
[*]categories.add(newCategoryPath("year", "2010"));
10....
11.DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo);
12.categoryDocBuilder.setCategoryPaths(categories);
13.categoryDocBuilder.build(doc);
14.writer.addDocument(doc);
We now explain the steps above, following the code line numbers:
(4)
Document contains not only text search fields but alsofacet search information.
(7)
Prepare a container for document categories.
(8)
Categories that should be added to the document areaccumulated in the categories list.
(11)
A CategoryDocumentBuilder is created, setwith the appropriate list of categories, and invoked to build - that is, topopulate the document with categories. It is in this step that the taxonomyis updated to contain the newly added categories (if not already there) - seemore on this in the section about the Taxonomy Index below. This line could be mademore compact: one can create a single CategoryDocumentBuilder cBuilder and reuse it like this:
[*]DocumentBuildercBuilder = new CategoryDocumentBuilder(taxo);
[*]cBuilder.setCategoryPaths(categories).build(doc);
(14)
Add the document to the index. As a result, categoryinfo is saved also in the regular search index, for supporting facetaggregation at search time (e.g. facet counting) as well as facet drill-down.For more information on indexed facet information see below the section Indexed Facet Information.
AccumulatingFacets Illustrated
Facets accumulation reflects a set of documents over some facet requests:
[*]Document set - a subset of the index documents, usually documents matching a userquery.
[*]Facet requests - facet accumulation specification, e.g. count a certain facet dimension.
FacetRequest is a basiccomponent in faceted search - it describes the facet information need. Everyfacet request is made of at least two fields:
[*]CategoryPath - root category of the facet request. The categories that arereturned as a result of the request will all be descendants of this root
[*]Number of Results - number of sub-categories to return (at most).
There are other parameters to a facet request, such as -how many facetresults to label-, -how deep to go from the request root when servingthe facet request- and more - see the API Javadocs for FacetRequest and itssubclasses for more information on these parameters. For labels in particular,see the section Facet Labels at Search Time.
FacetRequest in anabstract class, open for extensions, and users may add their own requests. Themost often used request is CountFacetRequest - used forcounting facets.
Facets accumulation is - not surprisingly - driven by a FacetsAccumulator. The most used one is StandardFacetsAccumulator, however there are also accumulators that support sampling - to be usedin huge collections, and there's an adaptive facets accumulator which appliessampling conditionally on the statistics of the data. While facets accumulatorsare very extendible and powerful, they might be too overwhelming for beginners.For this reason, the code offers a higher level interface for facetsaccumulating: the FacetsCollector. It extends Collector, and as such canbe passed to the search() method of Lucene's IndexSearcher. In casethe application also needs to collect documents (in addition toaccumulating/collecting facets), it can wrap multiple collectors with MultiCollector. Most code samples below use FacetsCollector due to its simple interface. It is quite likely that FacetsCollector should suffice the needs of most applications, thereforewe recommend to start with it, and only when needing more flexibility turn todirectly use facets accumulators.
Following is a code snippet from the example code - the complete examplecan be found under org.apache.lucene.facet.example.simple.Searcher:
[*]IndexReaderindexReader = IndexReader.open(indexDir);
[*]Searcher searcher =new IndexSearcher(indexReader);
[*]TaxonomyReadertaxo = new DirectoryTaxonomyReader(taxoDir);
[*]...
[*]Query q = new TermQuery(newTerm(SimpleUtils.TEXT, "white"));
[*]TopScoreDocCollectortdc = TopScoreDocCollector.create(10, true);
[*]...
[*]FacetSearchParamsfacetSearchParams = new FacetSearchParams();
[*]facetSearchParams.addFacetRequest(newCountFacetRequest(
[*]new CategoryPath("author"),10));
11....
12.FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams,indexReader, taxo);
13.searcher.search(q, MultiCollector.wrap(topDocsCollector,facetsCollector));
14.List res = facetsCollector.getFacetResults();
We now explain the steps above, following the code line numbers:
(1)
Index reader and Searcher are initialized as usual.
(3)
A taxonomy reader is opened - it provides access to thefacet information which was stored by the Taxonomy Writer at indexing time.
(5)
Regular text query is created to find the documentsmatching user need, and a collector for collecting the top matching documentsis created.
(8)
Facet-search-params is a container for facet requests.
(10)
A single facet-request - namely a count facet request -is created and added to the facet search params. The request should returntop 10 Author subcategory counts.
(12)
Facets-Collector is the simplest interface for facetsaccumulation (counting in this example).
(13)
Lucene search takes both collectors - facets-collectorand top-doccollector, both wrapped by a multi-collector. This way, a singlesearch operation finds both top documents and top facets. Note however thatfacets aggregation takes place not only over the top documents, but ratherover all documents matching the query.
(14)
Once search completes, facet-results can be obtainedfrom the facetscollector.
Returned facet results are organized in a list, conveniently ordered thesame as the facet-requests in the facet-search-params. Each result howevercontains the request for which it was created.
Here is the (recursive) structure of the facet result:
[*]Facet Result
[*]FacetRequest - the request for whichthis result was obtained.
[*]ValidDescendants - how manyvalid descendants were encountered over the set of matching documents(some of which might have been filtered out because e.g. only top 10results were requested).
[*]Root ResultNode - root facet result forthe request
[*]Ordinal - unique internal ID of the facet
[*]Label - full label of the facet (possibly null)
[*]Value - facet value, e.g. count
[*]Sub-results-nodes - child result nodes (possibly null)
Note that not always there would be sub result nodes - this depends on therequested result mode:
[*]PER_NODE_IN_TREE - a tree, and so there may be sub results.
[*]GLOBAL_FLAT - here the results tree would be rather flat,with only (at most) leaves below the root result node.
Indexed FacetInformation
When indexing a document to which categories were added, information onthese categories is added to the search index, in two locations:
[*]CategoryTokens are added to thedocument for each category attached to that document. These categories canbe used at search time for drill-down.
[*]A special CategoryList Token is added to each document containing information on all thecategories that were added to this document. This can be used at searchtime for facet accumulation, e.g. facet counting.
When a category is added to the index (that is, when a document containinga category is indexed), all its parent categories are added as well. Forexample, indexing a document with the categoryactually addedthree nodes: one for "/author", one for "/author/American" and one for "/author/American/Mark Twain".
An integer number - called ordinal is attached to each category the firsttime the category is added to the taxonomy. This allows for a compactrepresentation of category list tokens in the index, for facets accumulation.
One interesting fact about the taxonomy index is worth knowing: once acategory is added to the taxonomy, it is never removed, even if all relateddocuments are removed. This differs from a regular index, where if alldocuments containing a certain term are removed, and their segments are merged,the term will also be removed. This might cause a performance issue: largetaxonomy means large ordinal numbers for categories, and hence large categoriesvalues arrays would be maintained during accumulation. It is probably not areal problem for most applications, but be aware of this. If, for example, anapplication at a certain point in time removes an index entirely in order torecreate it, or, if it removed all the documents from the index in order tore-populate it, it also makes sense in this opportunity to remove the taxonomyindex and create a new, fresh one, without the unused categories.
Facet Parameters
Facet parameters control how categories and facets are indexed andsearched. Apart from specifying facet requests within facet search parameters,under default settings it is not required to provide any parameters, as thereare ready to use working defaults for everything.
However many aspects are configurable and can be modified by providingaltered facet parameters for either search or indexing.
Facet Indexing Parameters
Facet Indexing Parameters are consulted with during indexing. Among severalparameters it defines, the following two are likely to interest manyapplications:
[*]Category listdefinitions - in the index, facetsare maintained in two forms: category-tokens (for drill-down) andcategory-list-tokens (for accumulation). This parameter allows to specify,for each category, the Lucene term used for maintaining thecategory-list-tokens for that category. The default implementation in DefaultFacetIndexingParams maintains this information for all categoriesunder the same special dedicated term. One case where it is needed tomaintain two categories in separate category lists, is when it is knownthat at search time it would be required to use different types ofaccumulation logic for each, but at the same accumulation call.
[*]Partition size - category lists can be maintained in apartitioned way. If, for example, the partition size is set to 1000, adistinct sub-term is used for maintaining each 1000 categories, e.g. term1for categories 0 to 999, term2 for categories 1000 to 1999, etc. Thedefault implementation in DefaultFacetIndexingParams maintains category lists in a single partition, hence it defines thepartition size as Integer.MAX_VALUE. The importance of this parameter is on allowing to handle verylarge taxonomies without exhausting RAM resources. This is because atfacet accumulation time, facet values arrays are maintained in the size ofthe partition. With a single partition, the size of these arrays is as thesize of the taxonomy, which might be OK for most applications. Limitedpartition sizes allow to perform the accumulation with less RAM, but withsome runtime overhead, as the matching documents are processed for each ofthe partitions.
See the API Javadocs of FacetIndexingParams for additionalconfiguration capabilities which were not discussed here.
Facet Search Parameters
Facet Search Parameters, consulted at search time (during facetsaccumulation) are rather plain, providing the following:
[*]Facetindexing parameters - which werein effect at indexing time - allowing facets accumulation to understandhow facets are maintained in the index.
[*]Container offacet requests - therequests which should be accumulated.
CategoryLists, Multiple Dimensions
Category list parameters which are accessible through the facet indexingparameters provide the information about:
[*]Lucene Termunder which category information is maintained in the index.
[*]Encoding (anddecoding) used for writing and reading the categories information in theindex.
For cases when certain categories should be maintained in differentlocation than others, use PerDimensionIndexingParams, which returns adifferent CategoryListParams object for each dimension. This is a goodopportunity to explain about dimensions. This is just a notion: the top element- or first element - in a category path is denoted as the dimension of thatcategory. Indeed, the dimension stands out as a top important part of thecategory path, such as "Location" for the category "Location/Europe/France/Paris".
Advanced Faceted Examples
We now provide examples for more advanced facet indexing and search, suchas drilling-down on facet values and multiple category lists.
Drill-Down with Regular Facets
Drill-down allows users to focus on part of the results. Assume acommercial sport equipment site where a user is searching for a tennis racquet.The user issues the query tennis racquet and as result is shown a pagewith 10 tennis racquets, by various providers, of various types and prices. Inaddition, the site UI shows to the user a break down of all available racquetsby price and make. The user now decides to focus on racquets made by Head,and will now be shown a new page, with 10 Head racquets, and new break down ofthe results into racquet types and prices. Additionally, the application canchoose to display a new breakdown, by racquet weights. This step of moving fromresults (and facet statistics) of the entire (or larger) data set into aportion of it by specifying a certain category, is what we call Drilldown.We now show the required code lines for implementing such a drill-down.
[*]Query baseQuery =queryParser.parse("tennis racquet");
[*]Query q2 = DrillDown.query(baseQuery,new CategoryPath("make", "head"), 10));
In line 1 the original user query is created and then used to obtaininformation on all tennis racquets.
In line 2, a specific category from within the facet results was selectedby the user, and is hence used for creating the drill-down query.
Please refer to SimpleSearcher.searchWithDrillDown() for a more detailed code example performing drill-down.
Multiple CategoryLists
The default is to maintain all categories information in a single list.While this will suit most applications, in some situations an application maywish to use multiple category lists, for example, when the distribution of somecategory values is different than that of other categories and calls for usinga different encoding, more efficient for the specific distribution. Another exampleis when most facets are rarely used while some facets are used very heavily, soan application may opt to maintain the latter in memory - and in order to keepmemory footprint lower it is useful to maintain only those heavily used facetsin a separate category list.
First we define indexing parameters with multiple category lists:
[*]PerDimensionIndexingParamsiParams = new PerDimensionIndexingParams();
[*]iParams.addCategoryListParams(newCategoryPath("Author"),
[*] new CategoryListParams(new Term("$RarelyUsed","Facets")));
[*]iParams.addCategoryListParams(newCategoryPath("Language"),
[*] new CategoryListParams(new Term("$HeavilyUsed","Ones")));
This will cause the Language categories to be maintained in one categorylist, and Author facets to be maintained in a another category list. Note thatany other category, if encountered, will still be maintained in the defaultcategory list.
These non-default indexing parameters should now be used both at indexingand search time. As depicted below, at indexing time this is done when creatingthe category document builder, while at search time this is done when creatingthe search parameters. Other than that the faceted search code is unmodified.
[*]DocumentBuildercategoryDocBuilder = new CategoryDocumentBuilder(taxo, iParams);
[*]...
[*]FacetSearchParamsfacetSearchParams = new FacetSearchParams(iParams);
A complete simple example can be found in package org.apache.lucene.facet.example.multiCL under the example code.
Optimizations
Faceted search through a large collection of documents with large numbersof facets altogether and/or large numbers of facets per document is challengingperformance wise, either in CPU, RAM, or both. A few ready to use optimizationsexist to tackle these challenges.
Sampling
Facet sampling allows to accumulate facets over a sample of the matchingdocuments set. In many cases, once top facets are found over the sample set,exact accumulations are computed for those facets only, this time over theentire matching document set.
Two kinds of sampling exist: complete support and wrapping support. Thecomplete support is through SamplingAccumulator and is tied to anextension of the StandardFacetsAccumulator and has the benefit of automatically applying otheroptimizations, such as Complements. The wrapping support is through SamplingWrapper and can wrap any accumulator, and as such, provides morefreedom for applications.
Complements
When accumulating facets over a very large matching documents set,possibly almost as large as the entire collection, it is possible to speed upaccumulation by looking at the complement set of documents, and then obtainingthe actual results by subtracting from the total results. It should be notedthat this is available only for count requests, and that the first invocation thatinvolves this optimization might take longer because the total counts have tobe computed.
This optimization is applied automatically by StandardFacetsAccumulator.
Partitions
Partitions are also discussed in the section about Facet Indexing parameters.
Facets are internally accumulated by first accumulating all facets andlater on extracting the results for the requested facets. During this process,accumulation arrays are maintained in the size of the taxonomy. For a verylarge taxonomy, with multiple simultaneous faceted search operations, thismight lead to excessive memory footprint. Partitioning the faceted informationallows to relax the memory usage, by maintaining the category lists in severalpartitions, and by processing one partition at a time. This is automaticallydone by StandardFacetsAccumulator. However the default partition size is Integer.MAX_VALUE, practically setting to a single partition, i.e. nopartitions at all.
Decision to override this behavior and use multiple partitions must betaken at indexing time. Once the index is created and already contains categorylists it is too late to modify this.
See FacetIndexingParams.getPartitionSize() for API to alter this default behavior.
ConcurrentIndexing and Search
Sometimes, indexing is done once, and when the index is fully prepared,searching starts. However, in most real applications indexing is incremental(new data comes in once in a while, and needs to be indexed), and indexingoften needs to happen while searching is continuing at full steam.
Luckily, Lucene supports multiprocessing - one process writing to an indexwhile another is reading from it. One of the key insights behind how Luceneallows multiprocessing is Point In Time semantics. The idea is that whenan IndexReader is opened, it gets a view of the index at the pointin time it was opened. If an IndexWriter in a different process or threadmodifies the index, the reader does not know about it until a new IndexReader is opened (or the reopen() method of an existing IndexReader is called).
In faceted search, we complicate things somewhat by adding a second index- the taxonomy index. The taxonomy API also follows point-in-time semantics,but this is not quite enough. Some attention must be paid by the user to keepthose two indexes consistently in sync:
The main index refers to category numbers defined in the taxonomy index.Therefore, it is important that we open the TaxonomyReader after opening the IndexReader. Moreover, every time an IndexReaderis reopen()ed, the TaxonomyReader needs to be refresh()'ed as well.
But there is one extra caution: whenever the application deems it has writtenenough information worthy a commit, it must first call commit() for the TaxonomyWriter and only after that call commit() for the IndexWriter. Closing the indices should also be done in this order -first close the taxonomy, and only after that close the index.
To summarize, if you're writing a faceted search application wheresearching and indexing happens concurrently, please follow these guidelines (inaddition to the usual guidelines on how to use Lucene correctly in theconcurrent case):
[*]In the indexingprocess:
[*]Before awriter commit()s the IndexWriter, it must commit() the TaxonomyWriter.Nothing should be added to the index between these two commit()s.
[*]Similarly,before a writer close()s the IndexWriter, it must close() theTaxonomyWriter.
[*]In thesearching process:
[*]Open theIndexReader first, and then the TaxonomyReader.
[*]After areopen() on the IndexReader, refresh() the TaxonomyReader. No searchshould be performed on the new IndexReader until refresh() has finished.
Note that the above discussion assumes that the underlying file-system onwhich the index and the taxonomy are stored respects ordering: if index A iswritten before index B, then any reader finding a modified index B will alsosee a modified index A.
Note: TaxonomyReader's refresh() is simpler than IndexReader's reopen(). Whilethe latter keeps both the old and new reader open, the former keeps only thenew reader. The reason is that a new IndexReader might havemodified old information (old documents deleted, for example) so a thread whichis in the middle of a search needs to continue using the old information. With TaxonomyReader, however, we are guaranteed that existing categories arenever deleted or modified - the only thing that can happen is that newcategories are added. Since search threads do not care if new categories areadded in the middle of a search, there is no reason to keep around the oldobject, and the new one suffices.
However, if the taxonomy index was recreated since the TaxonomyReader was opened or refreshed, this assumption (thatcategories are forevr) no longer holds, and refresh() will throwan InconsistentTaxonomyException, guiding the application to open a new TaxonomyReader for up-to-date taxonomy data. (Old one can be closed assoon as it is no more used.)
页:
[1]