Reposted from: http://knowlspace.wordpress.com/2011/06/15/different-ways-to-implement-autosuggest-using-solr/
There are currently five techniques that can be used to implement auto-suggest functionality in Solr:
1- The TermsComponent
2- Facet Prefixes
3- The new Suggester component
4- Edge N-Grams
5- Wildcard queries
TermsComponent
Implementing autosuggest with the TermsComponent is probably the easiest approach. The TermsComponent is a low-level Solr component that returns the terms indexed for a given field across all the documents in the index. It accepts a parameter called “terms.prefix”, which restricts the returned terms to those that start with the given prefix. Using this component for autosuggest is therefore as easy as querying it with “terms.prefix” set to the text entered by the user.
Unfortunately, this approach has significant limitations. First, the component returns the indexed terms, not the stored values, so the suggestions are whatever the analysis chain produced. An extra field with no analysis should therefore be used for it. That leads to a second problem: without analysis, prefix matching is case-sensitive. If the indexed term is “Vostro” and the user enters “vos” in lowercase, the TermsComponent won’t return “Vostro”, because it starts with an uppercase letter.
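As a rough illustration of the request described above, the following Python sketch only builds the URL. The Solr location, the "/terms" handler path and the field name "name_exact" are assumptions made for the example; the parameters terms, terms.fl, terms.prefix and terms.limit belong to the TermsComponent.

```python
from urllib.parse import urlencode

# Assumptions for illustration only: adjust to your own deployment.
SOLR_BASE = "http://localhost:8983/solr"  # hypothetical Solr location
SUGGEST_FIELD = "name_exact"              # hypothetical field with no analysis


def terms_suggest_url(user_text, limit=10):
    """Build a TermsComponent request for indexed terms that start with user_text."""
    params = {
        "terms": "true",
        "terms.fl": SUGGEST_FIELD,   # field whose indexed terms are listed
        "terms.prefix": user_text,   # sent as typed, so matching is case-sensitive
        "terms.limit": limit,
        "wt": "json",
    }
    # Assumes a "/terms" request handler is wired to the TermsComponent.
    return SOLR_BASE + "/terms?" + urlencode(params)


print(terms_suggest_url("Vos"))
```

Because the prefix is sent exactly as typed, "vos" and "Vos" produce different requests, which is the case-sensitivity limitation described above.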
Faceting to suggest
Faceting is sometimes used for autosuggest. The idea is similar to the TermsComponent approach, because faceting also has a prefix parameter, “facet.prefix”. By faceting on a field that contains the product names and setting facet.prefix to the user-entered text, the returned facet values can serve as the suggestions. Unfortunately, this approach suffers from the same problems as the TermsComponent approach.
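A facet-based suggestion request might look like the sketch below. It only assembles the URL; the Solr location and the field name "product_name" are assumptions for the example, while facet, facet.field, facet.prefix, facet.limit and facet.mincount are standard faceting parameters.

```python
from urllib.parse import urlencode

# Assumption for illustration only: hypothetical Solr location.
SOLR_BASE = "http://localhost:8983/solr"


def facet_suggest_url(user_text, field="product_name", limit=10):
    """Build a faceting request whose facet values act as suggestions."""
    params = {
        "q": "*:*",                 # match every document
        "rows": 0,                  # only the facet counts are needed
        "facet": "true",
        "facet.field": field,       # hypothetical field holding product names
        "facet.prefix": user_text,  # keep only values starting with the typed text
        "facet.limit": limit,
        "facet.mincount": 1,        # skip values that no document contains
        "wt": "json",
    }
    return SOLR_BASE + "/select?" + urlencode(params)


print(facet_suggest_url("Vos"))
```

Setting rows to 0 keeps the response small, since only the facet values are of interest here.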
Suggester
This is a new component, available as of version 3.1 of Solr. The Suggester reuses much of the SpellCheckComponent infrastructure, so it also reuses many common SpellCheck parameters, such as spellcheck=true and spellcheck.build=true. It is also configured in solrconfig.xml in a very similar way. Technically it is a spellchecker, but instead of correcting misspelled words it returns a list of suggested words.
It was developed with performance and versatility in mind. The other approaches were not designed as suggestion components in the first place; they are components that can be used to implement the autosuggest use case. The Suggester, by contrast, was written from scratch as a suggestion component. It obtains its suggestions from an external dictionary or from a field.
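Since the Suggester is driven by SpellCheck parameters, a request to it could be assembled as in this sketch. The "/suggest" handler path and the dictionary name "suggest" are assumptions that depend on how the component is set up in solrconfig.xml; the spellcheck.* parameters are the usual SpellCheckComponent ones.

```python
from urllib.parse import urlencode

# Assumption for illustration only: hypothetical Solr location.
SOLR_BASE = "http://localhost:8983/solr"


def suggester_url(user_text, count=5, rebuild=False):
    """Build a Suggester request using SpellCheckComponent parameters."""
    params = {
        "spellcheck": "true",
        "spellcheck.dictionary": "suggest",  # hypothetical name from solrconfig.xml
        "spellcheck.q": user_text,           # text to complete
        "spellcheck.count": count,           # maximum number of suggestions
        "wt": "json",
    }
    if rebuild:
        # Asks Solr to rebuild the suggestion dictionary for this request.
        params["spellcheck.build"] = "true"
    # Assumes a "/suggest" request handler is wired to the suggest component.
    return SOLR_BASE + "/suggest?" + urlencode(params)


print(suggester_url("vos"))
```

Rebuilding the dictionary is normally an occasional maintenance call, not something to send with every keystroke.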
Edge Ngrams
Edge Ngrams are the substrings of a term that start at its first letter. For example, the Edge Ngrams of the term “house” are “h”, “ho”, “hou”, “hous” and “house”.
The idea is to associate each of these Ngrams with the full word. Usually this is accomplished with a dedicated suggestion field that has its own analysis chain. Suggestions for text with multiple words are easy to support with this approach.
For this example, the user is searching for discs, and the system should recommend “Dark side of the moon” when the user begins to type “side”. To do this, the schema of the recommendation index needs an Edge Ngrams field, that is, a field whose analysis chain has at least the following components:
Whitespace tokenizer
Lowercase filter
Edge Ngrams filter
Applying this chain to the title of that disc produces:
Original text: Dark side of the moon
Whitespace tokenizer: Dark | side | of | the | moon
Lowercase filter: dark | side | of | the | moon
EdgeNgrams filter: d | da | dar | dark | s | si | sid | side | o | of | t | th | the | m | mo | moo | moon
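The chain above can be imitated in a few lines of Python to see where each term comes from. This is only a simulation of the three steps, not Solr's own tokenizer and filters, and the function names are made up for the example.

```python
def edge_ngrams(token, min_gram=1):
    """Return every prefix of token, from min_gram characters up to the full token."""
    return [token[:n] for n in range(min_gram, len(token) + 1)]


def analyze_for_index(text):
    """Simulate the index-time chain: whitespace tokenizer, lowercase, edge n-grams."""
    tokens = text.split()                 # whitespace tokenizer
    tokens = [t.lower() for t in tokens]  # lowercase filter
    grams = []
    for token in tokens:                  # edge n-grams filter
        grams.extend(edge_ngrams(token))
    return grams


print(" | ".join(analyze_for_index("Dark side of the moon")))
# prints: d | da | dar | dark | s | si | sid | side | o | of | t | th | the | m | mo | moo | moon
```

Every prefix a user might type for any word of the title ends up as an indexed term, which is what allows a plain term query to find the title.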
The best way to implement this approach for this example is to add an extra field named “edge_title” or similar, indexed with the analysis chain described above. It does not need to be stored if the title is already stored in another field. The auto-suggest should then issue queries like:
…&q=edge_title:[user-entered-text]&fl=title
The query analysis chain should be the same as the one used at index time, except for the EdgeNgrams filter, which must not be applied to the query.
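The reason for leaving the EdgeNgrams filter out of the query chain can be seen in a small simulation. The sketch below is not Solr code: it models the index as a set of terms per title and treats a title as a match when any query term equals one of its indexed terms. The helper names and the extra titles are made up for the example.

```python
def edge_ngrams(token):
    """Return every prefix of token."""
    return [token[:n] for n in range(1, len(token) + 1)]


def index_terms(title):
    """Index-time chain: whitespace split, lowercase, edge n-grams."""
    return {gram for word in title.lower().split() for gram in edge_ngrams(word)}


def query_terms(text, apply_ngrams=False):
    """Query-time chain: whitespace split and lowercase; n-grams only if requested."""
    words = text.lower().split()
    if apply_ngrams:
        return {gram for word in words for gram in edge_ngrams(word)}
    return set(words)


def suggest(titles, user_text, apply_ngrams=False):
    """Return the titles sharing at least one term with the analyzed query."""
    wanted = query_terms(user_text, apply_ngrams)
    return [title for title in titles if index_terms(title) & wanted]


# Illustrative titles only.
titles = ["Dark side of the moon", "Sticky Fingers", "Abbey Road"]

print(suggest(titles, "Sid"))                     # ['Dark side of the moon']
print(suggest(titles, "Sid", apply_ngrams=True))  # ['Dark side of the moon', 'Sticky Fingers']
```

With n-grams applied to the query, “Sid” also produces the term “s”, which matches any title containing a word that starts with “s”, so the suggestions become far too broad.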
The drawback of this approach is disk space usage: with edge n-grams, the index grows significantly. In the example above, five tokens become seventeen indexed terms.
Execute Wildcard queries
This approach sends the user-entered text to Solr as a wildcard query. It has two problems. First, wildcard queries are not as fast as regular queries. Autosuggest must be fast, and with a relatively large index this approach probably won’t achieve the necessary speed.
The other big issue is analysis. When a query contains wildcards, Solr does not analyze it. So if there is a small difference between the text entered by the user and the indexed text (case, for example), Solr won’t suggest that document, even when the user enters the text correctly. In the first example, if the user enters “Vostro” or “Dell”, Solr won’t suggest “Dell Vostro”, because that field was lowercased at index time.
One advantage of this approach over all the others is that it can match text that is not at the start of a word: when the user enters “str”, “Dell Vostro” could still be suggested.
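Both the case problem and the mid-word advantage can be illustrated with a small simulation. Python's fnmatchcase is used here only as a stand-in for a case-sensitive wildcard match; it is not how Solr evaluates wildcard queries, and the indexed value is assumed to be the lowercased product name.

```python
from fnmatch import fnmatchcase

# Assumption: "Dell Vostro" was lowercased at index time.
INDEXED_VALUE = "dell vostro"


def wildcard_matches(pattern, indexed=INDEXED_VALUE):
    """Case-sensitive wildcard match; the pattern is used as typed, with no analysis."""
    return fnmatchcase(indexed, pattern)


print(wildcard_matches("Dell*"))     # False: the uppercase "D" never matches the lowercased value
print(wildcard_matches("*Vostro*"))  # False: same problem with the uppercase "V"
print(wildcard_matches("dell*"))     # True: the case happens to agree
print(wildcard_matches("*str*"))     # True: a fragment from the middle of a word still matches
```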