[Experience Sharing] Solr Common Questions
Original page: http://khaidoan.wikidot.com/solr





Apache Solr

  http://www.datastax.com/dev/blog/dse-solr-backup-restore-and-re-index
http://rockyj.in/2012/05/08/setting_up_solr.html
http://charlesleifer.com/blog/solr-ubuntu-revisited/
http://wiki.apache.org/solr/SolrCloud/
  We do not want to have to write a separate init script for Solr. If we are already running Tomcat, and Tomcat already has an init script, we should deploy Solr using Tomcat.
  I was having a problem with using wildcards. It seemed that the wildcard did not work at the end of a word: a search for patient (no * appended) returned results, but a search for patient* (with * appended) returned nothing. To support wildcard search:


  • Try using EdgeNGrams: add an edgytext field type to schema.xml and change the field type of the field you want to search (see the sketch below).
  • Use eDisMax (ExtendedDismaxQParser). It handles both trailing and leading wildcards.
  I was also having a problem with phrase synonyms. Try using \ to escape the space: hold\ up, delay
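  A minimal sketch of such an edgytext field type, assuming EdgeNGrams at index time and a plain analyzer at query time (the type name and gram sizes are illustrative):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index each term's leading edges so "pat" matches "patient" -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>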
  What is the most popular trick for dealing with Solr?
  Use copyField to copy the original field to another field (which has a different data type and other field attributes)
  What are the requirements for a field to be sortable?
  A field needs to be indexed, must not be multi-valued, and must not have multiple tokens (either there is no text analysis or the analysis yields just one token).
  StrField fields are (by default) not analyzed; they generate just one token.
  Because of the special text analysis restrictions of fields used for sorting, text fields in your schema that need to be sortable will usually be copied to another field and analyzed differently.
  How does Lucene sort text?
  Lucene sorts text by the internal Unicode code point. For most users, this is just fine. Internationalization-sensitive users may want a locale-specific option.
  What is a phrase?
  A phrase is a group of words surrounded by double quotes such as "hello dolly"
  How to group multiple clauses into a single field?
  Lucene supports using parentheses to group multiple clauses to a single field. To search for a title that contains both the word "return" and the phrase "pink panther" use the query:



title:(+return +"pink panther")

  How to escape special characters?
  Lucene supports escaping the special characters that are part of the query syntax. The current list of special characters is + - && || ! ( ) { } [ ] ^ " ~ * ? :
To escape these characters, use the \ before the character. For example, to search for (1+1):2 use the query:



\(1\+1\)\:2

  Why is "Medical" highlighted when user search for "medication"?
  This is because of SnowballPorterFilter. If we don't want this to happen, we should remove the SnowballPorterFilter, or see if we can use protwords.txt
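  For example, a stemming filter declaration that protects the words listed in protwords.txt from being stemmed (a sketch):

<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>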
  What are the differences between string and text?
  String is not analyzed; text is analyzed. (String is not tokenized; text is tokenized.) An indexed string can be useful for faceting and highlighting.
  What is the purpose of the dataDir directive in solrconfig.xml?
  Used to specify an alternate directory to hold all index data other than the default ./data under the Solr home.
  What is fuzzy search?
  Fuzzy search is a search for words that are similar in spelling.
  Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:



roam~

  This search will find terms like foam and roams. Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:



roam~0.8

  The default that is used if the parameter is not given is 0.5.
  What is Searcher?
  "Searcher" tends to refer to an instance of the SolrIndexSearcher class. This class is responsible for executing all searches done against the index, and manages several caches. There is typically one Searcher per SolrCore at any given time, and that searcher is used to execute all queries against that SolrCore, but there may be additional Searchers open at a time during cache warming (in which and "old Searcher" is still serving live requests while a "new Searcher" is being warmed up).
  How to implement partial word?
  Lucene fundamentally searches on words. Using advanced n-gram analysis, it can do partial words too.
  Does Lucene support wild card search in a phrase?
  No. Lucene supports single- and multiple-character wildcard searches within single terms, but not within phrase queries.
  How to specify the search query?



q=solr+rocks

  How to query multiple fields?



q=myField:Java AND otherField:developerWorks
q=title:"The Right Way" AND text:go
q=title:"The Right Way" AND go (we did not specify the field, the default search field is assumed)

  How to search for documents containing Lucene in the title field and Java in the content field? (The fields are named title and content.)



q=title:Lucene AND content:Java

  When querying Solr, can the field name be omitted?
  Yes. In the following examples:



q=solr+rocks
q=title:"The Right Way" AND go

  the field names are omitted. In the second example, the field name was specified for the first clause but omitted for the second. When the field name is omitted, the defaultSearchField (defined in schema.xml) is used.
  How to search for a range?



q=age:[18 TO 35]
title:{Aida TO Carmen} // This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.

  Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
  How to do an open-ended range query?



field:[* TO *]

  What is fq, and how is it different from q?
  fq specifies an optional filtering query. The results of the query are restricted to searching only those results returned by the filter query. Filtered queries are cached by Solr. They are very useful for improving speed of complex queries. The value for fq is any valid query that could be passed to the q parameter, not including sort information.
  I was working on an application that takes a search term from the user and adds other conditions before sending it to Solr. These other conditions are finite (they do not change based on user input), so we used fq to handle them. To take advantage of fq, analyze your query mix and see which parts are used most frequently; for those, use fq.
  "fq" stands for Filter Query. This parameter can be used to specify a query that can be used to restrict the super set of documents that can be returned, without influencing score. It can be very useful for speeding up complex queries since the queries specified with fq are cached independently from the main query. Caching means the same filter is used again for a later query (i.e. there's a cache hit). See SolrCaching to learn about the caches Solr uses. See FilterQueryGuidance for an explanation of how filter queries may be used for increased efficiency.
  The fq param can be specified multiple times. Documents will only be included in the result if they are in the intersection of the document sets resulting from each fq. In the example below, only documents which have a popularity greater than 10 and have a section of 0 will match.



  fq=popularity:[10 TO *]
& fq=section:0

  Filter Queries can be complicated boolean queries, so the above example could also be written as a single fq with two mandatory clauses:



fq=+popularity:[10 TO *] +section:0

  The document sets from each filter query are cached independently. Thus, concerning the previous examples: use a single fq containing two mandatory clauses if those clauses appear together often, and use two separate fq params if they are relatively independent.
  "fq" is a filter, therefore does not influence the scores. You must always have a "q" parameter.
  How to specify which fields you want in the result?



fl=name,id,score
fl=*,score

  The set of fields to be returned can be specified as a space (or comma) separated list of field names. The string "score" can be used to indicate that the score of each document for the particular query should be returned as a field, and the string "*" can be used to indicate all stored fields the document has.
  How to specify the starting offset into the result set?



start=15

  Useful for paging through results. Since the offset is zero-based, start=15 returns results beginning with the sixteenth-ranked result.
  How to specify the format of the result?



wt=json

  How to specify the number of rows / records / documents to return?



rows=10

  How to specify that you want highlighting enabled?



hl=true

  What is q.op?
  q.op specifies the default operator for query expressions. Possible values are "AND" and "OR".
  How to specify the default search field?
  Use df parameter, which overrides the default search field defined in schema.xml
  How to boost a term? How to give a term more weight?
  Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.
  Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for



jakarta apache

  and you want the term "jakarta" to be more relevant boost it using the ^ symbol along with the boost factor next to the term. You would type:



jakarta^4 apache

  This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:



"jakarta apache"^4 "Apache Lucene"

  By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2)
  How to sort the result?



sort=price+desc
sort=inStock+asc,price+desc

  Can I do relevance and sorting together?


  • &sort=score asc,date desc,title asc
  • Boost functions, or function queries, may also be what you're looking for. See FunctionQuery and Boost function (bf) to increase score of documents whose date is closest to NOW
  (Slop, by contrast, is related to phrase searching rather than sorting; see the proximity-search questions below.)
  Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)
  You can sort by index id using sort=_docid_ asc or sort=_docid_ desc
  As of Solr1.5, sorting can also be done by any single-valued function (as in FunctionQuery)
  The common situation of sorting on a field that you also want tokenized for searching is handled by using a copyField to clone the field. Sort on one, search on the other.
  Multiple sort orderings can be separated by a comma:



sort=inStock desc, price asc

  Sorting can also be done on the result of a function:



sort=sum(x_td, y_td) desc

  What is LocalParam?
  LocalParam is a way to provide additional information about each query parameter / argument. Assume that we have the existing query parameter:



q=solr+rocks

  We can prefix this query string with {!} to provide more information to the query parser:



q={!}solr+rocks

  The above code does not provide more information to the query parser. The {!} is the syntax for LocalParams.
  To provide additional information to the query parser:



q={!q.op=AND df=title}solr+rocks

  The above changes the default operator to "AND" and the default search field to "title".
  To indicate a LocalParam, the argument is prefixed with curly braces whose contents begin with an exclamation point and include any number of key=value pairs separated by whitespace.
  Values in the key-value pairs may be quoted via single or double quotes, and backslash escaping works within quoted strings.
  There may only be one LocalParams prefix per argument.
  How to specify that the search term should be searched in multiple fields?
  Use LocalParam qf:



q={!type=dismax qf='myfield yourfield'}solr+rocks

  How to filter for 'not equal'?
  Use the - sign:



fq=-iMemberId:351

  There are other ways to express 'not equal' as well.
  How to search for words that are spelled similarly to a given word?
  This is known as fuzzy search; see "What is fuzzy search?" above. The same tilde syntax applies: roam~ matches terms like foam and roams, and roam~0.8 requires a minimum similarity of 0.8 (the default is 0.5).
  How to search for documents that contain 'apache' and 'jakarta' within 10 words of each other?
  Lucene supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde, "~", symbol at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:



"jakarta apache"~10

  How can I search for one term near another term (say, "batman" and "movie")?
  A proximity search can be done with a sloppy phrase query. The closer together the two terms appear in the document, the higher the score will be. A sloppy phrase query specifies a maximum "slop", or the number of positions tokens need to be moved to get a match.
  This example for the standard request handler will find all documents where "batman" occurs within 100 words of "movie":



q=text:"batman movie"~100

  The dismax handler can easily create sloppy phrase queries with the pf (phrase fields) and ps (phrase slop) parameters:



q=batman movie&pf=text&ps=100

  The dismax handler also allows users to explicitly specify a phrase query with double quotes, and the qs (query slop) parameter can be used to add slop to any explicit phrase queries:



q="batman movie"&qs=100

  How can I increase the score for specific documents?
  Index-time boosts: To increase the scores for certain documents that match a query, regardless of what that query may be, one can use index-time boosts.
  Index-time boosts can also be specified per-field, so only queries matching on that specific field will get the extra boost. An index-time boost on a value of a multiValued field applies to all values for that field.
  Index-time boosts are assigned with the optional "boost" attribute on the doc and field elements of the XML update messages.
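  A minimal sketch of such an update message (the field names are illustrative):

<add>
  <doc boost="2.0">
    <field name="id">doc1</field>
    <!-- per-field index-time boost -->
    <field name="title" boost="3.0">Some important title</field>
  </doc>
</add>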
  Query Elevation Component: To raise certain documents to the top of the result list based on a certain queries, one can use the QueryElevationComponent.
  How can I change the score of a document based on the *value* of a field?
  Use a FunctionQuery as part of your query.
  Solr can parse function queries in the following syntax.
  Some examples:



  # simple boosts by popularity
q=%2Bsupervillians+_val_:"popularity"
defType=dismax&qf=text&q=supervillians&bf=popularity
# boosts based on complex functions of the popularity field
q=%2Bsupervillians+_val_:"scale(popularity,0,100)"
defType=dismax&qf=text&q=supervillians&bf=sqrt(popularity)

  How are documents scored?
  Basic scoring factors:


  • tf stands for term frequency - the more times a search term appears in a document, the higher the score
  • idf stands for inverse document frequency - matches on rarer terms count more than matches on common terms
  • coord is the coordination factor - if there are multiple terms in a query, the more terms that match, the higher the score
  • lengthNorm - matches on a smaller field score higher than matches on a larger field
  • index-time boost - if a boost was specified for a document at index time, scores for searches that match that document will be boosted.
  • query clause boost - a user may explicitly boost the contribution of one part of a query over another.
  See the Lucene scoring documentation for more info.
  How can I boost the score of newer documents?


  • Do an explicit sort by date (relevancy scores are ignored)
  • Use an index-time boost that is larger for newer documents
  • Use a FunctionQuery to influence the score based on a date field. In Solr 1.3, use something of the form recip(rord(myfield),1,1000,1000). In Solr 1.4, use something of the form recip(ms(NOW,mydatefield),3.16e-11,1,1)
  See ReciprocalFloatFunction and BoostQParserPlugin.
  A full example of a query for "ipod" with the score boosted higher the newer the product is:



q={!boost b=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)}ipod

  One can simplify the implementation by decomposing the query into multiple arguments:



q={!boost b=$dateboost v=$qq}&dateboost=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)&qq=ipod

  Now the main "q" argument as well as the "dateboost" argument may be specified as defaults in a search handler in solrconfig.xml, and clients would only need to pass "qq", the user query.
  To boost another query type such as a dismax query, the value of the boost query is a full sub-query and hence can use the {!querytype} syntax. Alternately, the defType param can be used in the boost local params to set the default type to dismax. The other dismax parameters may be set as top level parameters.



q={!boost b=$dateboost v=$qq defType=dismax}&dateboost=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)
&qf=text&pf=text&qq=ipod

  How do I give a very low boost to documents that match my query?
  In general the problem is that a "low" boost is still a boost, it can only improve the score of documents that match. One way to fake a "negative boost" is to give a high boost to everything that does *not* match. For example:



bq=(*:* -field_a:54)^10000

  If "bq" supports pure negative queries then you can simplify that to bq=-field_a:54^10000
  How to tell Solr to output debug information?
  Use debugQuery parameter. If this parameter is present (regardless of its value) then additional debugging information will be included in the response, including "explain" info for each of the documents returned. This debugging info is meant for human consumption… its XML format could change in the future.
  We can also use the 'debug' parameter:



&debug=true

  How to include even more debug information?
  Use explainOther parameter.
  Why doesn't document id:juggernaut appear in the top 10 results for my query?
  Since debugQuery=on only gives you scoring "explain" info for the documents returned, the explainOther parameter can be used to specify other documents you want detailed scoring info for:



q=supervillians&debugQuery=on&explainOther=id:juggernaut

  Now you should be able to examine the scoring explain info of the top matching documents, compare it to the explain info for documents matching id:juggernaut, and determine why the rankings are not as you expect.
  How to specify the query parser?
  Use the defType parameter.
  What is LocalParam type?
  Specify the query parser. If a LocalParams value appears without a name, it is given the implicit name of "type". This allows short-form representation for the type of query parser to use when parsing a query string. Thus:



q={!dismax qf=myfield}solr+rocks

  is equivalent to:



q={!type=dismax qf=myfield}solr+rocks

  What is LocalParam v?
  A "v" within local parameters is an alternate way to specify the value of that parameter. For example,



q=solr+rocks

  is equivalent to:



q={!v='solr+rocks'}

  What is LocalParam parameter dereferencing / indirection?
  Parameter dereferencing or indirection allows one to use the value of another argument rather than specifying it directly. This can be used to simplify queries, decouple user input from query parameters, or decouple front-end GUI parameters from defaults set in solrconfig.xml. For example,



q=solr+rocks

  is equivalent to



q={!v=$qq}&qq=solr+rocks

  How to specify the query parser?
  Users can specify the type of a query in most places that accept a query string using LocalParams syntax. For example, the following query string specifies a lucene/solr query with a default operator of "AND" and a default field of "text":



q={!lucene q.op=AND df=text}myfield:foo +bar -baz

  In standard Solr search handlers, the defType param can be used to specify the default type of the main query (ie: the q param) but it only affects the main query — The default type of all other query parameters will remain "lucene".
  q={!func}popularity is thus equivalent to defType=func&q=popularity in the standard Solr search handler.
  What are the differences between default Solr query parser and Lucene query parser?
  The standard Solr Query Parser syntax is a superset of the Lucene Query Parser syntax.
  Differences in the Solr Query Parser include:


  • Range queries [a TO z], prefix queries a*, and wildcard queries a*b are constant-scoring (all matching documents get an equal score). The scoring factors tf, idf, index boost, and coord are not used. There is no limitation on the number of terms that match (as there was in past versions of Lucene). Lucene 2.1 has also switched to use ConstantScoreRangeQuery for its range queries.
  • A * may be used for either or both endpoints to specify an open-ended range query. field:[* TO 100] finds all field values less than or equal to 100. field:[100 TO *] finds all field values greater than or equal to 100. field:[* TO *] matches all documents with the field.
  • Pure negative queries (all clauses prohibited) are allowed. -inStock:false finds all field values where inStock is not false. -field:[* TO *] finds all documents without a value for field.
  • A hook into FunctionQuery syntax. Quotes will be necessary to encapsulate the function when it includes parentheses. Example: _val_:"recip(rord(myfield),1,2,3)"
  • Nested query support for any type of query parser (via QParserPlugin). Quotes will often be necessary to encapsulate the nested query if it contains reserved characters. Example: _query_:"{!dismax qf=myfield}how now brown cow"
  How to do an open-ended range query?
  A * may be used for either or both endpoints to specify an open-ended range query. field:[* TO 100] finds all field values less than or equal to 100. field:[100 TO *] finds all field values greater than or equal to 100. field:[* TO *] matches all documents with the field.
  How to select documents that has NULL value?
  Pure negative queries (all clauses prohibited) are allowed. -inStock:false finds all field values where inStock is not false. -field:[* TO *] finds all documents without a value for field.
  If a pure negative query does not return what you expect, prepend it with *:*



*:* -fieldName:value



Field1:Val1 AND (*:* NOT Field2:Val2)

  That should be equivalent to Field1:Val1 -Field2:Val2. You only need the *:* trick if all of the clauses of a boolean query are negative.
  How to specify the type of query for the main query parameter?
  In standard Solr search handlers, the defType param can be used to specify the default type of the main query (ie: the q param) but it only affects the main query — The default type of all other query parameters will remain "lucene". For example,



q={!func}popularity

  is thus equivalent to:



defType=func&q=popularity

  in the standard Solr search handler.
  How to specify the type of query for other query parameter?
  Users can specify the type of a query in most places that accept a query string using LocalParams syntax. The following query string specifies a lucene/solr query with a default operator of "AND" and a default field of "text":



q={!lucene q.op=AND df=text}myfield:foo +bar -baz

  The above changes the query parser for the main query parameter (q), but the same syntax can be applied to other query parameters to change the parser used for them.
  Does Lucene support wild card search?
  Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).
  What is the wild card character that match single character?
  To match a single character, use '?'. Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries)
  What is the wild card character that match multiple characters?
  To match multiple characters, use '*'. Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries)
  Can we have a wild card character at the beginning of a search term?
  No. You cannot use a * or ? symbol as the first character of a search. There may be some way to work around this.
  How to make other data type searchable / query-able?
  For the most part, Lucene only deals with strings, so integers, floats, dates, and doubles require special handling to be searchable.
  If you have an integer field and are not able to query it with fieldName:some_value, you need to index that field.
  When should we use standard request handler? When should we use the dismax request handler?
  The standard request handler uses SolrQuerySyntax to specify the query via the q parameter, and it must be well formed or an error will be returned. It's good for specifying exact, arbitrarily complex queries.
  The dismax request handler has a more forgiving query parser for the q parameter, useful for directly passing in a user-supplied query string. The other parameters make it easy to search across multiple fields using disjunctions and sloppy phrase queries to return highly relevant results.
  For servicing user-entered queries, start by using dismax.
  How can I search for a term in more than one fields?
  If we are using the standard request handler, use q:



q=title:superman subject:superman

  If we are using the dismax request handler, specify the query fields using the qf param:



q=superman&qf=title subject

  How can I make a search term in the title field score higher than in the subject field?
  For the standard request handler, "boost" the clause on the title field:



q=title:superman^2 subject:superman

  Using the dismax request handler, one can specify boosts on fields in parameters such as qf:



q=superman&qf=title^2 subject

  How can I make exact-case matches score higher?
  A query of "Penguin" should score documents containing "Penguin" higher than docs containing "penguin".
  The general strategy is to index the content twice, using different fields with different fieldTypes (and different analyzers associated with those fieldTypes). One analyzer will contain a lowercase filter for case-insensitive matches, and one will preserve case for exact-case matches.
  Use copyField commands in the schema to index a single input field multiple times.
  Once the content is indexed into multiple fields that are analyzed differently, query across both fields.
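  A minimal sketch of this strategy (field and type names are illustrative; text_exact would be a TextField type whose analyzer omits the lowercase filter):

<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>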
  How can I make queries of "spiderman" and "spider man" match "Spider-Man"?
  WordDelimiterFilter can be used in the analyzer for the field being queried to match words with intra-word delimiters such as dashes or case changes.
  How to work with date?
  See http://wiki.apache.org/solr/SolrQuerySyntax
  For a specific search term / phrase / query, can we put specific documents at the top of the result?
  Yes. Take a look at the elevate.xml file:
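  A sketch of what the file looks like (the query text and document ids are illustrative):

<elevate>
  <query text="foo bar">
    <doc id="1"/>
    <doc id="2"/>
    <!-- exclude="true" removes a document from the results for this query -->
    <doc id="3" exclude="true"/>
  </query>
</elevate>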
  If this file is found in the config directory, it will only be loaded once at startup. If it is found in Solr's data directory, it will be re-loaded every commit.
  What is the purpose of copyField?
  The copyField mechanism lets you create an "all" field without manually adding all the content of your document to a separate field. Copy fields are a convenient way to index the same content in multiple ways. For instance, if you wanted to provide both exact matching (respecting case) and matching ignoring case, you could use a copy field: content would then be indexed once exactly as received and once with all letters lowercased.
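  For example, to build an "all" field (the source field names are illustrative):

<copyField source="title" dest="all"/>
<copyField source="body" dest="all"/>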
  What is dynamicField?
  One of the powerful features of Lucene is that you don't have to pre-define every field when you first create your index. Even though Solr provides strong datatyping for fields, it still preserves that flexibility using "Dynamic Fields". Using dynamicField declarations, you can create field rules that Solr will use to decide what datatype should be used whenever it is given a field name that is not explicitly defined but matches a prefix or suffix used in a dynamicField.
  For example, the following dynamic field declaration tells Solr that whenever it sees a field name ending in "_i" which is not an explicitly defined field, it should dynamically create an integer field with that name:
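  A sketch (the exact type name, e.g. int or sint, depends on your schema):

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>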





  Dynamic fields are special kinds of fields that can be added to any document at any time, with the attributes defined by the dynamicField declaration. The key difference between a dynamic field and a regular field is that dynamic fields do not need to have their names declared ahead of time in schema.xml. Solr applies the glob-like pattern in the name declaration to any incoming field name not already declared and processes the field according to the semantics defined by its dynamicField declaration. For instance, a declaration such as <dynamicField name="*_i" type="sint" indexed="true" stored="true"/> means that a field named myRating_i would be treated as an sint by Solr, even though it wasn't declared as a field. This is convenient, for example, when letting users define the content to be searched.





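  The declaration matching the example below would look something like:

<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>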
  If at index-time a document contains a field that isn't matched by an explicit field definition, but does have a name matching this pattern (that is, ends with _dt such as updated_dt), then it gets processed according to that definition. This also applies to searching the index.
  A dynamic field is declared just like a regular field, in the same section of schema.xml. However, the element is named dynamicField, and its name attribute must start or end with an asterisk (the wildcard). If the name is just *, then it is the final fallback.
  The * fallback is most useful if you decide that all fields attempted to be stored in the index should succeed, even if you didn't know about a field when you designed the schema. It's also useful if you decide that, instead of being an error, such unknown fields should simply be ignored (that is, not indexed and not stored).
  Why might I want to index the same data in multiple fields?
  It often makes sense to take what is logically a single field (e.g. product name) and index it into several different Solr fields, each with different field options and/or analyzers.
  If I had a field with a list of authors, such as 'Schildt, Herbert; Wolpert, Lewis; Davies, P.', I might want to index the same data differently in three different fields (perhaps using the Solr copyField directive, as sketched after this list):


  • For searching: Tokenized, case-folded, punctuation-stripped: schildt / herbert / wolpert / lewis / davies / p
  • For sorting: Untokenized, case-folded, punctuation-stripped: schildt herbert wolpert lewis davies p
  • For faceting: Primary author only, using a solr.StringField: Schildt, Herbert
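  A sketch of the schema plumbing (field and type names are hypothetical; alphaOnlySort is the KeywordTokenizer-based sort type from the Solr example schema):

<!-- searching: tokenized, case-folded, punctuation-stripped -->
<field name="author" type="text" indexed="true" stored="true" multiValued="true"/>
<!-- sorting: one untokenized, case-folded token; assumes a single source value -->
<field name="author_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="author" dest="author_sort"/>
<!-- faceting: primary author only, typically populated explicitly at index time -->
<field name="author_facet" type="string" indexed="true" stored="false"/>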
  What is 'term vectors'?
  The storage of Lucene term vectors can be triggered using the following field options (in schema.xml):


  • termVectors=true|false
  • termPositions=true|false
  • termOffsets=true|false
  These options can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr (phrase queries, etc., do not require these settings to be present).
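  For example, in schema.xml (field name and type are illustrative):

<field name="content" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>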
  How to copy all source field to a single field?
  A common requirement is to copy or merge all input fields into a single solr field. This can be done as follows:



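  The standard directive for this (the destination field must be declared multiValued):

<copyField source="*" dest="text"/>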


  Should we do synonym expansion both at index-time and at query-time?
  No. Synonym expansion should be done either index-time or query-time, but not both, as that would be redundant.
  Why is index-time synonym expansion considered better than query-time synonym expansion?
  For a variety of reasons, it is usually better to do this at index-time:


  • A synonym containing multiple words (example: i pod) isn't recognized correctly at query-time because the query parser tokenizes on whitespace.
  • The IDF component of Lucene's scoring algorithm will be much higher for documents matching a synonym appearing rarely, as compared to its equivalents that are common. This reduces the scoring effectiveness.
  • Prefix, wildcard, and fuzzy queries aren't analyzed, and thus won't match synonyms.
  Why is index-time synonym expansion considered less flexible than query-time synonym expansion?
  Index-time synonym expansion is less flexible than query-time synonym expansion because with index-time synonym expansion, changes to the synonyms will require a complete re-index to take effect. Moreover, the index will get larger if you do index-time expansion.
  It's plausible to imagine the issues above being rectified at some point. However, until then, index-time is usually best.
  Is there any alternative for synonym expansion?
  Alternatively, you could choose not to do synonym expansion. This means that for a given synonym term, there is just one term that should replace it. This requires processing at both index-time and query-time to effectively normalize the synonymous terms. However, since there is query-time processing, it suffers from the problems mentioned above with the exception of poor scores, which isn't applicable (see why index-time synonym expansion is considered better than query-time synonym expansion). The benefit to this approach is that the index size would be smaller, because the number of indexed terms is reduced.
  You might also choose a blended approach to meet different goals. For example, if you have a huge index that you don't want to re-index often but you need to respond rapidly to new synonyms, then you can put new synonyms into both a query-time synonym file and an index-time one. When a re-index finishes, you empty the query-time synonym file. You might also be fond of the query-time benefits, but due to the multiple word term issue, you decide to handle those particular synonyms at index-time.
  What is the purpose of StopFilterFactory?
  Filter out "stop words", words that are very common, such as "a", "the", "is", "are" …
  For indexes with lots of text, common uninteresting words like "the", "a", and so on, make the index large and slow down phrase queries. To deal with this problem, it is best to remove them from fields where they show up often. Fields likely to contain more than a sentence are ideal candidates.
  The trade-off when omitting stop words from the index is that those words are no longer query-able. This is usually fine, but in some circumstances, like searching for "To be or not to be", it is obviously a problem. See "shingling".
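  A typical StopFilterFactory declaration in an analyzer chain (the file name is illustrative):

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>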
  How to determine which words appear commonly in your index?
  We should determine which words that appear frequently in our index can be considered stop words. This helps reduce the size of the index.
  In order to determine which words appear commonly in your index, access the SCHEMA BROWSER menu option in Solr's admin interface. A list of your fields will appear on the left. In case the list does not appear at once, be patient. For large indexes, there is a considerable delay before the field list appears because Solr is analyzing the data in your index. Now, choose a field that you know contains a lot of text. In the main viewing area, you'll see a variety of statistics about the field including the top-10 terms appearing most frequently.
  What is the purpose of phonetic sound-alike analysis?
  To match misspelled or sound-alike words.
  A filter is used at both index time and query time that phonetically encodes each word into a phoneme.
  How many phonetic sound-alike algorithms does Solr offer?


  • DoubleMetaphone
  • Metaphone
  • RefinedSoundex
  • Soundex.
  Among the phonetic sound-alike algorithms that Solr offers, which one seems to be the best?
  Anecdotally, DoubleMetaphone appears to be the best, even for non-English text. However, you might want to experiment in order to make your own choice.
  RefinedSoundex declares itself to be most suitable for spellcheck applications. However, Solr can't presently use phonetic analysis in its spellcheck component.
  How many tools does Solr have for aggressive inexact searching?
  Solr has three tools at its disposal for more aggressive inexact searching: phonetic sound-alike analysis, query spellchecking, and fuzzy searching.
  With regard to phonetic sound-alike analysis, how to use the analysis page?
  Using Solr's analysis admin page, it can be shown that this field type encodes Smashing Pumpkins as SMXNK|XMXNK PMPKNS. The vertical bar | here indicates that both sides are alternatives for the same position. The encoding is not supposed to be human-meaningful, but it is useful for comparing similar spellings to gauge the algorithm's effectiveness.
  What are the two options for DoubleMetaphoneFilterFactory?


  • inject: A boolean defaulting to true that causes the original words to pass through the filter. It might interfere with other filter options, querying, and potentially scoring. Therefore, it is often preferable to disable it and use a separate field dedicated to phonetic indexing.
  • maxCodeLength: The maximum phoneme code (that is, phonetic character or syllable) length. It defaults to 4. Longer codes are truncated.
  How to use the DoubleMetaphone filter factory?
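  A sketch of a dedicated phonetic field type using this filter (the type name is illustrative):

<fieldType name="phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- inject="false": index only the phonemes, in a field dedicated to phonetic matching -->
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false" maxCodeLength="4"/>
  </analyzer>
</fieldType>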





  How to use the other phonetic sound-alike filter factories?
  The DoubleMetaphone example is shown in the previous question.





  To use the other 3 phonetic sound-alike filter factories:
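  A sketch:

<filter class="solr.PhoneticFilterFactory" encoder="Metaphone" inject="false"/>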





  where the encoder attribute is one of Metaphone, RefinedSoundex, Soundex.
  Does Solr support leading wild card search?
  No.
  Usually, text indexing technology is employed to search entire words. Occasionally, however, there arises a need for a search to match an arbitrary substring of a word or across words. Lucene supports leading and trailing wildcards on queries, but only trailing wildcards are supported by Solr without internal modification.
  Moreover, this approach only scales for very small indices before it gets very slow and/or results in an error. The right way to solve this is to venture into the black art of n-grams.
  Before employing this approach, consider if what you really need is better tokenization for special code. For example, if you have a long string code that internally has different parts that users might search on separately, then you can use a PatternReplaceFilterFactory with some other analyzers to split them up.
  How does N-gram analysis works in general?
  N-gram analysis slices text into many smaller substrings ranging between a minimum and maximum configured size. For example, consider the word Tonight. An NGramFilterFactory configured with minGramSize of 2 and maxGramSize of 5 would yield all of the following indexed terms: (2-grams:) To, on, ni, ig, gh, ht, (3-grams:) Ton, oni, nig, igh, ght, (4-grams:) Toni, onig, nigh, ight, (5-grams:) Tonig, onigh, night. Note that Tonight itself does not pass through because it has more characters than the maxGramSize. N-gram analysis can be used as a filter for processing on a term-by-term basis, and it can also be used as a tokenizer with NGramTokenizerFactory, which will emit n-grams spanning across the words of the entire source text.
  What is the suggested analyzer configuration for using n-grams?

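  A sketch of such a configuration (the type name and gram sizes are illustrative); note that the NGram filter appears only in the index-time analyzer:

<fieldType name="nGramText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>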
  Notice that the n-gramming only happens at index-time.
  How should n-gram analysis be used?
  This analysis would be applied to a field created solely for the purpose of matching substrings. Another field would exist for typical searches, and a dismax handler should be configured for searches to use both fields using a smaller boost for this field.
  How do EdgeNGramTokenizerFactory and EdgeNGramFilterFactory work?
  Another variation is EdgeNGramTokenizerFactory and EdgeNGramFilterFactory, which emit n-grams that are adjacent to either the start or end of the input text. For the filter-factory, this input-text is a term, and the tokenizer is the entire input. In addition to minGramSize and maxGramSize, these analyzers take a side argument that is either front or back. If only prefix or suffix matching is needed instead of both, then an EdgeNGram analyzer is for you.
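  For example, a front-edge-only filter might be declared like this (a sketch; sizes are illustrative):

<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="5" side="front"/>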
  What are the costs of n-gram analysis?
  There is a high price to be paid for n-gramming. Recall that in the earlier example, Tonight was split into 18 substring terms, whereas typical analysis would probably leave only one. This translates to greater index sizes, and thus a longer time to index.
  One reported benchmark showed a ten-fold increase in indexing time for the artist name field and a five-fold increase in disk space. Remember that this is just one field!
  Given these costs, n-gramming, if used at all, is generally only done on a field or two of small size where there is a clear requirement for substring matches.
  The costs of n-gramming are lower if minGramSize is raised and to a lesser extent if maxGramSize is lowered. Edge n-gramming costs less too. This is because it is only based on one side. It definitely costs more to use the tokenizer-based n-grammers instead of the term-based filters used in the example before, because terms are generated that include and span whitespace. However, with such indexing, it is possible to match a substring spanning words.
  What is the purpose of n-gram analysis?
  N-gram analysis is potentially useful for:


  • leading wildcard search
  Caution: n-gram analysis comes at a huge cost.
  How to match substring spanning across words?
  One possible way is to use n-gram analysis that spans across words, i.e. the tokenizer-based NGramTokenizerFactory. This costs more than the term-based n-gram filters, because terms are generated that include and span whitespace; however, with such indexing it is possible to match a substring spanning words.
  What is the purpose of StandardFilterFactory?
  Works in conjunction with StandardTokenizer. It removes periods from acronyms and the possessive 's from the end of terms:



"I.B.M. cat's" => "IBM", "cat"

  What is the purpose of LowerCaseFilterFactory?
  Simply lowercases all text. Don't put this before WordDelimiterFilterFactory if you want to split on case transitions.
  What is the purpose of KeepWordFilterFactory?
  Omits all of the words, except those in the specified file:




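  For example (the file name is illustrative):

<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>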

  If you want to ensure a certain vocabulary of words in a special field, then you might enforce it with this.
  What is the purpose of LengthFilterFactory?
  Filters out the terms that do not have a length within an inclusive range.



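  For example, to keep only terms between 2 and 7 characters long (a sketch):

<filter class="solr.LengthFilterFactory" min="2" max="7"/>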


  What is the purpose of RemoveDuplicatesTokenFilterFactory?
  Ensures that no duplicate terms appear at the same position. This can happen, for example, when synonyms stem to a common root. It's a good idea to add this to your last analysis step, if you are doing a fair amount of other analysis.
  What is the purpose of ISOLatin1AccentFilterFactory?
  This will normalize accented characters such as é to the unaccented equivalent e. An alternative and more customizable mechanism introduced in Solr 1.4 is a CharFilterFactory, which is something that actually comes before the lead tokenizer in the analysis chain. For more information about this approach, search Solr's Wiki for MappingCharFilterFactory.
  What is the purpose of CapitalizationFilterFactory?
  This capitalizes each word according to the rules that you specify. For more information, see the Javadocs at http://lucene.apache.org/solr/api/org/apache/solr/analysis/CapitalizationFilterFactory.html.
  What is the purpose of PatternReplaceFilterFactory?
  Takes a regular expression and replaces the matches. Example:




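  A sketch matching the email-domain description below:

<!-- "user@example.com" becomes "example.com" via the $1 group reference -->
<filter class="solr.PatternReplaceFilterFactory" pattern=".*@(.*)" replacement="$1" replace="first"/>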

  This replacement happens to be a reference to a regexp group, but it might be any old string. The replace attribute is either first to only apply to the first occurrence, or all. This example is for processing an email address field to get only the domain of the address.
  How to use explainOther?
  If you want to determine why a particular document wasn't matched by the query, or the query matched many documents and you want to ensure that you see scoring diagnostics for a certain document, then you can put a query for this value, such as id:"Release:12345", and debugQuery's output will be sure to include documents matching this query in its output.



&q=patient&debugQuery=on&explainOther=id:juggernaut

  How to import data into Solr?
  See chapter 3 in the book "Solr 1.4 Enterprise Search Server", published by Packtpub.
  Why is index-time boosting considered less flexible than query-time boosting?
  Index-time boosting is rarely done compared to the more flexible query-time boosting. It is less flexible because boosting decisions must be made at index time and will then apply to all queries.
  What is similarity?
  A similarity declaration in schema.xml can be used to specify the subclass of Similarity that you want Solr to use when dealing with your index. If no Similarity class is specified, the Lucene DefaultSimilarity is used. Please see SolrPlugins for information on how to ensure that your own custom Similarity can be loaded into Solr.
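  For example (the class name is hypothetical):

<similarity class="com.example.MyCustomSimilarity"/>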
  How do I use copyField with wildcards?
  The copyField directive allows wildcards in the source, so that several fields can be copied into one destination field without having to specify them all individually. The dest field may be a full field name or a wildcard expression. A common use case is something like:




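<copyField source="*_t" dest="text"/>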

  This tells Solr to copy the contents of any field whose name ends in "_t" to the "text" field. This is particularly useful when you have a large, and possibly changing, set of fields you want to index into a single field. In this example, it's important that the "text" field be defined in schema.xml as multiValued, since you intend to copy multiple sources into a single destination.
  What is the effect of having too many copyField directives?
  Copying data to additional fields is necessary to support different indexing purposes (one field may have a different analyzer). However, copying data to additional fields increases indexing time and consumes more disk space, so make sure you remove copyField directives that you don't really need.
  How to specify the amount of time for search to complete?
  Use the timeAllowed parameter. It specifies the time allowed for a search to finish. This value only applies to the search and not to requests in general. Time is in milliseconds. Values less than or equal to zero imply no time restriction.
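  For example, to abort searching after one second:

q=*:*&timeAllowed=1000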
