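Solr DisjunctionMax: annotated notes
DisjunctionMax: disjunction (union) that keeps the maximum score
In essence this is a multi-field search in which each field carries its own weight, and a match takes its score from the highest-scoring field. That gives quite different results from simply summing the boosted scores of several fields. It is complicated to use: you need to read the debugQuery output and experiment repeatedly.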
http://wiki.apache.org/solr/DisMax
http://searchhub.org/dev/2010/05/23/whats-a-dismax/
What’s a “DisMax”? Posted by hossman
The term “dismax” gets tossed around on the Solr lists frequently, which can be fairly confusing to new users. It originated as a shorthand name for the DisMaxRequestHandler (which I named after the DisjunctionMaxQueryParser, which I named after the DisjunctionMaxQuery class that it uses heavily). In recent years, the DisMaxRequestHandler and the StandardRequestHandler were both refactored into a single SearchHandler class, and now the term “dismax” usually refers to the DisMaxQParser.
Note: “dismax” now refers to the DisMaxQParser; the DisMaxRequestHandler and the StandardRequestHandler have been refactored into SearchHandler.
Clear as mud, right?
Regardless of whether you use the DisMaxRequestHandler via the qt=dismax parameter, or use the SearchHandler with the DisMaxQParser via defType=dismax, the end result is that your q parameter gets parsed by the DisjunctionMaxQueryParser.
Note: qt=dismax selects the DisMaxRequestHandler, while defType=dismax makes the SearchHandler use the DisMaxQParser. In both cases the q parameter is parsed by the DisjunctionMaxQueryParser.
The original goals of dismax (whichever meaning you might infer) have never changed:
…supports a simplified version of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses … but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user’s input using BooleanQueries containing DisjunctionMaxQueries across fields and boosts you specify. It also allows you to provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the handler in your solrconfig.xml or overridden in the Solr query URL.
In short: You worry about what fields and boosts you want to use when you configure it, your users just give you words w/o worrying too much about syntax.
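Note: the dismax handler is mainly responsible for wrapping DisjunctionMaxQueries in Boolean queries. It also lets you add boost queries, boost functions, and filter queries by hand to influence the final results. All of these parameters can be configured in solrconfig.xml as defaults for every query, or added to the URL so that they apply to a single query or a class of queries.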
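As a rough illustration of “defaults in solrconfig.xml, overrides in the URL”, the sketch below merges a set of handler defaults with per-request parameters and builds a request URL. The host, the core name (mycore), and the choice of default values are assumptions made for illustration; only the parameter names (defType, qf, mm, q) come from the text.

```python
from urllib.parse import urlencode

# Hypothetical handler defaults, standing in for what solrconfig.xml might declare.
HANDLER_DEFAULTS = {"defType": "dismax", "qf": "features^2 name^3", "mm": "50%"}

# Assumed base URL; adjust host and core name for a real installation.
BASE_URL = "http://localhost:8983/solr/mycore/select"


def build_dismax_url(user_query, **overrides):
    """Merge defaults with per-request overrides and return a full request URL."""
    params = dict(HANDLER_DEFAULTS)
    params.update(overrides)   # parameters given per request win over the defaults
    params["q"] = user_query   # the user only supplies words
    return BASE_URL + "?" + urlencode(params)


url = build_dismax_url('+"apache solr" search server', mm="100%")
print(url)
```

Here mm is overridden for one request while qf and defType keep their configured values.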
The magic of dismax (in my opinion) comes from the query structure it produces. What it essentially boils down to is matrix multiplication: a one column matrix of each “chunk” of your user’s input, multiplied by a one row matrix of the qf fields to produce a big matrix of every field:chunk permutation. The matrix is then turned into a BooleanQuery consisting of DisjunctionMaxQueries for each row in the matrix. DisjunctionMaxQuery is used because its score is determined by the maximum score of its subclauses (instead of the sum, like a BooleanQuery), so no one word from the user input dominates the final score. The best way to explain this is with an example, so let’s consider the following input…
defType = dismax
mm = 50%
qf = features^2 name^3
q = +"apache solr" search server
First off, we consider the “markup” characters of the parser that appear in this q string:
[*] whitespace – dividing input string into chunks
[*] quotes – makes a single phrase chunk
[*] + – makes a chunk mandatory
So we have 3 “chunks” of user input:
[*] “apache solr” (must match)
[*] “search” (should match)
[*] “server” (should match)
If we “multiply” that with our qf list (features, name) we get a matrix like this…
features:"apache solr"    name:"apache solr"    (must match)
features:"search"         name:"search"         (should match)
features:"server"         name:"server"         (should match)
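The “multiplication” above can be sketched in a few lines of Python. This is a simplified illustration, not how Solr parses queries: the regular expression and the helper names (split_chunks, field_chunk_matrix) are made up for this example, and only whitespace, quotes, and a leading + are handled.

```python
import re

# Optional +/- sign, then either a quoted phrase or a run of non-space characters.
CHUNK_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def split_chunks(q):
    """Split a user query into (text, is_mandatory) chunks."""
    chunks = []
    for sign, phrase, word in CHUNK_RE.findall(q):
        chunks.append((phrase or word, sign == "+"))
    return chunks


def field_chunk_matrix(q, qf):
    """Pair every chunk (one row each) with every qf field (one column each)."""
    fields = [spec.split("^")[0] for spec in qf.split()]  # drop the ^boost suffix
    rows = []
    for text, mandatory in split_chunks(q):
        cells = [f'{field}:"{text}"' for field in fields]
        rows.append((cells, "must match" if mandatory else "should match"))
    return rows


rows = field_chunk_matrix('+"apache solr" search server', "features^2 name^3")
for cells, requirement in rows:
    print("  ".join(cells), f"({requirement})")
```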
If we then factor in the mm param to determine the “minimum number of ‘Should Match’ clauses that (ahem) must match” (50% of 2 == 1) we get the following query structure (in pseudo-code)…
q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search")^2,
      TermQuery("name","search")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","server")^2,
      TermQuery("name","server")^3))
  ));
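Note: the Boolean query is the most basic building block; other, higher-level queries are built by combining and wrapping it, and dismax is no exception. As the decomposition above shows, the dismax query parser also breaks its input down into Boolean queries, and its field boosts work like ordinary field boosts. The difference is that, for each chunk, dismax takes the maximum field score as that chunk's score, whereas an ordinary query with independently boosted fields sums the field scores.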
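The difference between “maximum” and “sum” can be seen with a toy calculation. The per-field scores below are invented numbers, not real Lucene scores, and the function names are made up for this sketch. The tie argument mirrors the tie (Tie Breaker) parameter: a DisjunctionMaxQuery scores a document as the best subclause score plus tie times the scores of the other matching subclauses.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score, plus a tie-breaker share of the remaining field scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def dismax_total(doc, tie=0.0):
    """Outer BooleanQuery: add up one dismax score per chunk."""
    return sum(dismax_score(scores, tie) for scores in doc.values())


def sum_total(doc):
    """Plain multi-field boosting: add up every field score of every chunk."""
    return sum(sum(scores) for scores in doc.values())


# Invented (features, name) scores per chunk for two hypothetical documents.
doc_a = {"search": [2.0, 3.0], "server": [0.0, 0.0]}  # one word, found in both fields
doc_b = {"search": [0.0, 3.0], "server": [2.0, 0.0]}  # both words, one field each

print(sum_total(doc_a), sum_total(doc_b))        # summing cannot tell them apart
print(dismax_total(doc_a), dismax_total(doc_b))  # dismax prefers the document matching both words
```

With summing, a single word repeated across fields scores as high as two different words; with dismax, the repeated word only counts once per chunk.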
With me so far?
Where people tend to get tripped up is in thinking about how Solr’s per-field analysis configuration (in schema.xml) impacts all of this. Our example above was pretty straightforward, but let’s consider for a moment what might happen if:
[*] The name field uses the WordDelimiterFilter at query time but features does not.
[*] The features field is configured so that “the” is a stop word, but name is not.
Now let’s look at what we get when our input parameters are structurally similar to what we had before, but just different enough for WordDelimiterFilter and StopFilter to come into play…
defType = dismax
mm = 50%
qf = features^2 name^3
q = +"apache solr" the search-server
Our resulting query is going to be something like…
q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("name","the")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search-server")^2,
      PhraseQuery("name","search server")^3))
  ));
The use of WordDelimiterFilter hasn’t changed things very much: features is treating “search-server” as a single Term, while in the name field we are searching for the phrase “search server”. Hopefully this shouldn’t surprise anyone given the use of WordDelimiterFilter for the name field (presumably that’s why it’s being used). This DisjunctionMaxQuery still “makes sense”, but other fields with odd analysis that produce fewer or more Tokens than a “typical” field for the same chunk might produce queries that aren’t as easy to understand. In particular consider what has happened in our example with the word “the”: because “the” is a stop word in the features field, no Query object is produced for that field/chunk combination. But a Query is produced for the name field, which means the total number of “Should Match” clauses in our top level query is still 2, so our minNumberShouldMatch is still 1 (50% of 2 == 1).
This type of situation tends to confuse a lot of people: since “the” is a stop word in one field, they don’t expect it to matter in the final query. But as long as at least one qf field produces a Token for it (name in our example) it will be included in the final query, and will contribute to the count of “Should Match” clauses.
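The clause counting described above can be mimicked with two toy analyzers. These are stand-ins written for this sketch, not Solr's StopFilter or WordDelimiterFilter: the features analyzer only drops the stop word “the”, and the name analyzer only splits on non-alphanumeric characters. The mm percentage is rounded down, matching the 50% of 2 == 1 arithmetic in the text.

```python
import re

FEATURES_STOPWORDS = {"the"}  # assumed stop word list for the features field


def analyze_features(chunk):
    """Toy stop filter: a stop word yields no tokens, anything else stays whole."""
    return [] if chunk.lower() in FEATURES_STOPWORDS else [chunk]


def analyze_name(chunk):
    """Toy word delimiter: split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", chunk) if token]


ANALYZERS = {"features": analyze_features, "name": analyze_name}


def clauses_for(chunk):
    """Map each field to its tokens, keeping only fields that produced any."""
    result = {}
    for field, analyze in ANALYZERS.items():
        tokens = analyze(chunk)
        if tokens:
            result[field] = tokens
    return result


optional_chunks = ["the", "search-server"]
should_clauses = [c for c in map(clauses_for, optional_chunks) if c]
min_should_match = len(should_clauses) * 50 // 100  # mm = 50%, rounded down

print(should_clauses)
print(min_should_match)
```

The chunk “the” survives as a “Should Match” clause because the name field still produces a token for it, so the count stays at 2.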
So, what’s the take away from all of this?
DisMax is a complicated creature. When using it, you need to consider all of its options carefully, and look at the debugQuery=true output while experimenting with different query strings and different analysis configurations to make really sure you understand how queries from your users will be parsed.
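Note: the queries dismax builds are very complex. When using it, consider every option carefully, and turn on debugQuery=true while trying different query strings and different analyzers.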
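One way to follow this advice is to generate a debug request for every query string you want to compare and read the parsed query in each response. The sketch below only builds the URLs; the host and core name (mycore) are assumptions, while the parameters and the two query strings are the ones used in the examples above.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host and core name for a real installation.
BASE_URL = "http://localhost:8983/solr/mycore/select"

# Parameters from the examples, plus debugQuery=true to get the parsed query back.
COMMON_PARAMS = {
    "defType": "dismax",
    "qf": "features^2 name^3",
    "mm": "50%",
    "debugQuery": "true",
}

QUERIES = [
    '+"apache solr" search server',
    '+"apache solr" the search-server',
]


def debug_urls(queries):
    """Return one debug request URL per query string."""
    return [BASE_URL + "?" + urlencode({**COMMON_PARAMS, "q": q}) for q in queries]


urls = debug_urls(QUERIES)
for u in urls:
    print(u)
```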
For qf (Query Fields), pf (Phrase Fields), mm (Minimum ‘Should’ Match), and tie (Tie Breaker), see: the Solr Wiki DisMaxQParserPlugin.
Solr: Forcing items with all query terms to the top of a Solr search (Robot Librarian)
http://robotlibrarian.billdueber.com/solr-forcing-items-with-all-query-terms-to-the-top-of-a-solr-search/
Lucid Imagination: Solr Powered ISFDB – Part #10: Tweaking Relevancy
http://searchhub.org/dev/2011/06/20/solr-powered-isfdb-part-10/
Lucid Imagination: Solr Powered ISFDB – Part #11: Using DisMax
http://searchhub.org/dev/2011/08/08/solr-powered-isfdb-part-11/
Using Solr’s Dismax Tie Parameter (Another Word For It)
http://tm.durusau.net/?p=21573
http://java.dzone.com/articles/using-solrs-dismax-tie