The collections API lets you manage collections. Under the hood, it generally uses the
CoreAdmin API to asynchronously (through the Overseer) manage
SolrCores on each server - it's essentially sugar for actions that you could handle yourself if you made individual
CoreAdmin API calls to each server you wanted an action to take place on.
Create
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4
About the params:
name: The name of the collection to be created.
numShards: The number of logical shards (sometimes called slices) to be created as part of the collection.
replicationFactor: The number of copies of each document (or, the number of physical replicas to be created for each logical shard of the collection.) A replicationFactor of 3 means that there will be 3 replicas (one of which
is normally designated to be the leader) for each logical shard. NOTE: in Solr 4.0, replicationFactor was the number of *additional* copies as opposed to the total number of copies.
maxShardsPerNode: A create operation will spread numShards*replicationFactor shard-replicas across your live Solr nodes - fairly distributed, and never with two replicas of the same shard on the same Solr node. If a Solr node is not
live at the point in time when the create operation is carried out, it will not get any parts of the new collection. To prevent too many replicas from being created on a single Solr node, use maxShardsPerNode to set a limit on how many replicas the create operation
is allowed to create on each node - the default is 1. If it cannot fit the entire collection's numShards*replicationFactor replicas on your live Solr nodes, it will not create anything at all.
createNodeSet: If not provided, the create operation will create shard-replicas spread across all of your live Solr nodes. You can provide the "createNodeSet" parameter to change the set of nodes to spread the shard-replicas
across. The format of values for this param is "<node-name1>,<node-name2>,...,<node-nameN>" - e.g. "localhost:8983_solr,localhost:8984_solr,localhost:8985_solr"
collection.configName: The name of the config (must be already stored in zookeeper) to use for this new collection. If not provided the create operation will default to the collection name as the config name.
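As an illustration only (the collection, node, and config set names here are hypothetical, and the config set must already exist in ZooKeeper), a create call that combines several of the parameters above might look like:
# 2 shards x 2 replicas, at most 2 replicas per node, restricted to two named nodes,
# using an existing config set called 'myconf'
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&createNodeSet=localhost:8983_solr,localhost:8984_solr&collection.configName=myconf'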
Delete
http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection
About the params:
name: The name of the collection to be deleted.
Reload
http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection
About the params:
name: The name of the collection to be reloaded.
Split Shard
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=<collection_name>&shard=shardId
Solr4.3
About the params:
collection: The name of the collection
shard: The shard to be split
This command cannot be used by clusters with custom hashing because such clusters do not rely on a hash range. It should only be used by clusters using the "plain" or "compositeId" router.
The SPLITSHARD command will create two new shards by splitting the given shard's index into two pieces. The split is performed by dividing the shard's range into two equal partitions and dividing up the documents in the parent shard according
to the new sub-ranges. This is a synchronous operation. The new shards will be named by appending _0 and _1 to the parent shard name e.g. if shard=shard1 is to be split, the new shards will be named as shard1_0 and shard1_1. Once the new shards are created,
they are set active and the parent shard is set to inactive so that no new requests are routed to the parent shard.
This feature allows for seamless splitting and requires no down-time. The parent shard is not removed and therefore no data is removed. It is up to the user of the command to unload the shard using the new APIs in SOLR-4693 (under construction).
This feature was released with Solr 4.3 however due to bugs found after 4.3 release, it is recommended that you wait for release 4.3.1 before using this feature.
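For example, a concrete split of a hypothetical shard1 in collection1 would look like the following; on success the documents end up in shard1_0 and shard1_1 and the parent shard1 is set inactive:
# split shard1 of collection1 into shard1_0 and shard1_1
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'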
Collection Aliases
Aliasing allows you to create a single 'virtual' collection name that can point to one or more real collections. You can update the alias on the fly.
CreateAlias
Solr4.2
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=alias&collections=collection1,collection2,…
Creates or updates a given alias. An alias used to send updates to should only map to a single collection. A read alias can map to a single collection or to multiple collections.
About the params:
name: The name of the collection alias to be created.
collections: A comma-separated list of one or more collections to alias to.
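A sketch with hypothetical collection names - a read alias spanning two collections and an update alias mapped to a single collection:
# read alias that fans queries out across two collections
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=all_logs&collections=logs_2012,logs_2013'
# update alias that maps to exactly one collection
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=current_logs&collections=logs_2013'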
DeleteAlias
http://localhost:8983/solr/admin/collections?action=DELETEALIAS&name=alias
Removes an existing alias.
Creating cores via CoreAdmin
New Solr cores may also be created and associated with a collection via
CoreAdmin.
Additional cloud related parameters for the CREATE action:
collection - the name of the collection this core belongs to. Default is the name of the core.
shard - the shard id this core represents (Optional - normally you want to be auto assigned a shard id)
numShards - the number of shards you want the collection to have - this is only respected on the first core created for the collection
collection.<param>=<value> - causes a property of <param>=<value> to be set if a new collection is being created.
Use collection.configName=<configname> to point to the config for a new collection.
Example:
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=collection1&shard=shard2'
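A further sketch (the core, collection, and config set names are hypothetical) showing the first core of a brand new collection created through CoreAdmin using the cloud parameters listed above:
# numShards is only honored on this first core; collection.configName points at a config set already in ZooKeeper
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=newcollection_shard1&collection=newcollection&numShards=2&collection.configName=myconf'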
Distributed Requests
Query all shards of a collection (the collection is implicit in the URL):
http://localhost:8983/solr/collection1/select?
Query all shards of a compatible collection, explicitly specified:
http://localhost:8983/solr/collection1/select?collection=collection1_recent
Query all shards of multiple compatible collections, explicitly specified:
http://localhost:8983/solr/collection1/select?collection=collection1_NY,collection1_NJ,collection1_CT
Query specific shard ids of the (implicit) collection. In this example, the user has partitioned the index by date, creating a new shard every month:
http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001
Explicitly specify the addresses of shards you want to query:
http://localhost:8983/solr/collection1/select?shards=localhost:8983/solr,localhost:7574/solr
Explicitly specify the addresses of shards you want to query, giving alternatives (delimited by
|) used for load balancing and fail-over:
http://localhost:8983/solr/collection1/select?shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
Required Config
All of the required config is already setup in the example configs shipped with Solr. The following is what you need to add if you are migrating old config files, or what you should not remove if you are starting with new config files.
schema.xml
You must have a _version_ field defined:
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
solrconfig.xml
You must have an
UpdateLog defined - this should be defined in the updateHandler section.
<!-- Enables a transaction log, currently used for real-time get.
"dir" - the target directory for transaction logs, defaults to the
solr data directory. -->
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
You must have a replication handler called /replication defined:
<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
You must have a realtime get handler called /get defined:
<requestHandler name="/get" class="solr.RealTimeGetHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
</lst>
</requestHandler>
You must have the admin handlers defined:
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
The
DistributedUpdateProcessor is part of the default update chain and is automatically injected into any of your custom update chains. You can still explicitly add it yourself as follows:
<updateRequestProcessorChain name="sample">
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="my.package.UpdateFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
If you do not want the DistributedUpdateProcessorFactory auto injected into your chain (say you want to use SolrCloud functionality,
but you want to distribute updates yourself) then specify the following update processor factory in your chain:
NoOpDistributingUpdateProcessorFactory
solr.xml
You must leave the admin path as the default:
<cores adminPath="/admin/cores"
Re-sizing a Cluster
You can control cluster size by passing the numShards when you start up the first
SolrCore in a collection. This parameter is used to auto assign which shard each instance should be part of. Any
SolrCores that you start after starting numShards instances are evenly added to each shard as replicas (as long as they all belong to the same collection).
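As a sketch, assuming the stock Jetty example layout and ports, the first node defines the layout and later nodes are simply assigned as replicas:
# first node: embedded ZooKeeper, 2 shards, example config uploaded as 'myconf' (run from the example directory)
java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar
# additional nodes: just point at ZooKeeper; their SolrCores join the existing shards as replicas
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar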
To add more
SolrCores to your collection, simply keep starting new
SolrCores up. You can do this at any time and the new
SolrCore will sync up its data with the current replicas in the shard before becoming active.
If you want to start your cluster on fewer machines and then expand over time beyond just adding replicas, you can choose to start by hosting multiple shards per machine (using multiple
SolrCores) and then later migrate shards onto new machines by starting up a new replica for a given shard and eventually removing the shard from the original machine.
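A sketch of that migration using CoreAdmin (the host names and core names here are hypothetical): create a replica of the shard on the new machine, wait for it to sync and become active, then unload the core on the original machine.
# on the new machine: add a replica of shard2 of collection1
curl 'http://newhost:8983/solr/admin/cores?action=CREATE&name=collection1_shard2_replica2&collection=collection1&shard=shard2'
# on the old machine, once the new replica is active: remove the original core
curl 'http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=collection1_shard2_replica1'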
Solr4.3 The new "SPLITSHARD" collection API can be used to split an existing shard into two shards containing exactly half the range of the parent shard each. More details can be found under the "Managing collections
via the Collections API" section.
Near Realtime Search
If you want to use the Near Realtime search support, you will probably want to enable auto soft commits in your solrconfig.xml file before putting it into zookeeper. Otherwise you can send explicit soft commits to the cluster as you desire.
See NearRealtimeSearch
Parameter Reference
Cluster Params
numShards
Defaults to 1
The number of shards to hash documents to. There will be one leader per shard and each leader can have N replicas.
SolrCloud Instance Params
These are set in solr.xml, but by default they are set up in solr.xml to also work with system properties. Important note: the hostPort value found here will be used (via ZooKeeper) to inform the rest of the cluster what port each Solr instance
is using. The default port is 8983. The example solr.xml uses the jetty.port system property, so if you want to use a port other than 8983, either you have to set this property when starting Solr, or you have to change solr.xml to fit your particular installation.
If you do not do this, the cluster will think all your Solr servers are using port 8983, which may not be what you want.
host
Defaults to the first local host address found
If the wrong host address is found automatically, you can override the host address with this param.
hostPort
Defaults to the jetty.port system property
The port that Solr is running on - by default this is found by looking at the jetty.port system property.
hostContext
Defaults to solr
The context path for the Solr webapp. (Note: in Solr 4.0, it was mandatory that the hostContext not contain "/" or "_" characters. Beginning with Solr 4.1, this limitation was removed, and it is recommended that you specify the beginning slash.
When running in the example jetty configs, the "hostContext" system property can be used to control both the servlet context used by jetty, and the hostContext used by SolrCloud -- eg:
-DhostContext=/solr)
SolrCloud Instance ZooKeeper Params
zkRun
Defaults to localhost:<solrPort+1000>
Causes Solr to run an embedded version of
ZooKeeper. Set to the address of
ZooKeeper on this node - this allows us to know who 'we are' in the list of addresses in the
zkHost connect string. Simply using
-DzkRun gets you the default value. Note this must be one of the exact strings from
zkHost; in particular, the default
localhost will not work for a multi-machine ensemble.
zkHost
No default
The host address for
ZooKeeper - usually this should be a comma separated list of addresses to each node in your
ZooKeeper ensemble.
zkClientTimeout
Defaults to 15000
The time a client is allowed to not talk to
ZooKeeper before having its session expired.
zkRun and zkHost are set using system properties. zkClientTimeout is set up in solr.xml, and by default can also be set using a system property.
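Putting the instance and ZooKeeper params together, a startup command for one node of a multi-machine cluster might look like the following sketch (the host names, ports, and timeout value are assumptions, and it relies on the example solr.xml wiring these system properties through):
# advertise this node as solr1.example.com:8983/solr and join an external 3-node ZooKeeper ensemble
java -Djetty.port=8983 -Dhost=solr1.example.com -DhostContext=/solr -DzkHost=zoo1:2181,zoo2:2181,zoo3:2181 -DzkClientTimeout=30000 -jar start.jar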
SolrCloud Core Params
shard
The shard id. Defaults to being automatically assigned based on numShards
Allows you to specify the id used to group
SolrCores into shards.
shard can be configured in solr.xml for each core element as an attribute.
Getting your Configuration Files into ZooKeeper
Config Startup Bootstrap Params
There are two different ways you can use system properties to upload your initial configuration files to
ZooKeeper the first time you start Solr. Remember that these are meant to be used only on first startup or when overwriting configuration files - every time you start Solr with these system
properties, any current configuration files in
ZooKeeper may be overwritten when 'conf set' names match.
1. Look at solr.xml and upload the conf for each
SolrCore found. The 'config set' name will be the collection name for that
SolrCore, and collections will use the 'config set' that has a matching name.
bootstrap_conf
No default
If you pass -Dbootstrap_conf=true on startup, each
SolrCore you have configured will have its configuration files automatically uploaded and linked to the collection that
SolrCore is part of.
2. Upload the given directory as a 'conf set' with the given name. No linking of collection to 'config set' is done. However, if only one 'conf set' exists, a collection will auto link to it.
bootstrap_confdir
No default
If you pass -Dbootstrap_confdir=<directory> on startup, that specific directory of configuration files will be uploaded to
ZooKeeper with a 'conf set' name defined by the below system property, collection.configName
collection.configName
Defaults to configuration1
Determines the name of the conf set pointed to by bootstrap_confdir
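A sketch of both bootstrap options (the directory and config set names are only examples):
# option 1: upload and link the conf directory of every SolrCore found in solr.xml
java -DzkRun -Dbootstrap_conf=true -jar start.jar
# option 2: upload one directory as a named config set, without linking it to any collection
java -DzkRun -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar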
Command Line Util
The CLI tool also lets you upload config to
ZooKeeper. It allows you to do it in the same two ways described above. It also provides a few other commands that let you link config sets to collections, make
ZooKeeper paths or clear them, as well as download configs from
ZooKeeper to the local filesystem.
usage: ZkCLI
-c,--collection <arg> for linkconfig: name of the collection
-cmd <arg> cmd to run: bootstrap, upconfig, downconfig,
linkconfig, makepath, clear
-d,--confdir <arg> for upconfig: a directory of configuration files
-h,--help bring up this help page
-n,--confname <arg> for upconfig, linkconfig: name of the config set
-r,--runzk <arg> run zk internally by passing the solr run port -
only for clusters on one machine (tests, dev)
-s,--solrhome <arg> for bootstrap, runzk: solrhome location
-z,--zkhost <arg> ZooKeeper host address
Examples
# try uploading a conf dir
java -classpath example/solr-webapp/WEB-INF/lib/* org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 127.0.0.1:9983 -confdir example/solr/collection1/conf -confname conf1
# try linking a collection to a conf set
java -classpath example/solr-webapp/WEB-INF/lib/* org.apache.solr.cloud.ZkCLI -cmd linkconfig -zkhost 127.0.0.1:9983 -collection collection1 -confname conf1
# try bootstrapping all the conf dirs in solr.xml
java -classpath example/solr-webapp/WEB-INF/lib/* org.apache.solr.cloud.ZkCLI -cmd bootstrap -zkhost 127.0.0.1:9983 -solrhome example/solr
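The usage above also lists a downconfig command; assuming it accepts the same -confdir and -confname options as upconfig, pulling a config set back to the local filesystem would look roughly like:
# try downloading the 'conf1' config set into a local directory
java -classpath example/solr-webapp/WEB-INF/lib/* org.apache.solr.cloud.ZkCLI -cmd downconfig -zkhost 127.0.0.1:9983 -confdir /tmp/conf1 -confname conf1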
Scripts
There are scripts in example/cloud-scripts that handle the classpath and class name for you if you are using Solr out of the box with Jetty. Cmds then become:
sh zkcli.sh -cmd linkconfig -zkhost 127.0.0.1:9983 -collection collection1 -confname conf1
ZooKeeper chroot
If you are already using ZooKeeper for other applications and you want to keep the ZNodes organized by application, or if you want to have multiple separated SolrCloud clusters sharing one ZooKeeper ensemble, you can use ZooKeeper's "chroot"
option. From ZooKeeper's documentation:
http://zookeeper.apache.org/doc/r3.3.6/zookeeperProgrammers.html#ch_zkSessions
An optional "chroot" suffix may also be appended to the connection string. This will run the client commands while interpreting all paths relative to this root (similar to the unix chroot command). If used the example would look like: "127.0.0.1:4545/app/a" or "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a" where the client would be rooted at "/app/a" and all paths would be relative to this root - ie getting/setting/etc... "/foo/bar" would result in operations being run on "/app/a/foo/bar" (from the server perspective).
To use this ZooKeeper feature, simply start Solr with the "chroot" suffix in the zkHost parameter. For example:
java -DzkHost=localhost:9983/foo/bar -jar start.jar
or
java -DzkHost=zoo1:9983,zoo2:9983,zoo3:9983/foo/bar -jar start.jar
NOTE: With Solr 4.0 you'll need to create the initial path in ZooKeeper before starting Solr. Since Solr 4.1, the initial path will automatically be created if you are using either
bootstrap_conf or bootstrap_confdir.
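On Solr 4.0 the chroot path can be created up front with the ZkCLI makepath command described above; a sketch, assuming the path is passed as the argument after the command and matches the zkHost examples:
# create the /foo/bar chroot node before starting Solr against localhost:9983/foo/bar
java -classpath example/solr-webapp/WEB-INF/lib/* org.apache.solr.cloud.ZkCLI -cmd makepath /foo/bar -zkhost 127.0.0.1:9983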
Known Limitations
A small number of Solr search components do not support
DistributedSearch. In some cases, a component may never get distributed support, in other cases it may just be a matter of time and effort. All of the search components that do not yet support standard distributed search have
the same limitation with SolrCloud. You can pass distrib=false to use these components on a single
SolrCore.
The Grouping feature only works if groups are in the same shard. You must use the custom sharding feature to use the Grouping feature.
If upgrading an existing Solr instance running with SolrCloud from Solr 4.0 to 4.1, be aware that the way the name_node parameter is defined has changed. This may cause a situation where the name_node uses the IP address of the machine
instead of the server name, and thus SolrCloud is not aware of the existing node. If this happens, you can manually edit the host parameter in solr.xml to refer to the server name, or set the host in your system environment variables (since by default solr.xml
is configured to inherit the host name from the environment variables). See also the section Core Admin and Configuring solr.xml for more information about the host parameter.
Glossary
Collection:
A single search index.
Shard:
A logical section of a single collection (also called a Slice). Sometimes people will talk about "Shard" in a physical sense (a manifestation of a logical shard).
Replica:
A physical manifestation of a logical Shard, implemented as a single Lucene index on a
SolrCore
Leader:
One Replica of every Shard will be designated as a Leader to coordinate indexing for that Shard
SolrCore:
Encapsulates a single physical index. One or more make up logical shards (or slices) which make up a collection.
Node:
A single instance of Solr. A single Solr instance can have multiple
SolrCores that can be part of any number of collections.
Cluster:
All of the nodes you are using to host
SolrCores.
FAQ
Q: I'm seeing lots of session timeout exceptions - what to do?
A: Try raising the
ZooKeeper session timeout by editing solr.xml - see the zkClientTimeout attribute. The minimum session timeout is 2 times your
ZooKeeper defined tickTime. The maximum is 20 times the tickTime. The default tickTime is 2 seconds. You should avoid raising this for no good reason, but it should be high enough that
you don't see a lot of false session timeouts due to load, network lag, or garbage collection pauses. The default timeout is 15 seconds, but some environments might need to go as high as 30-60 seconds.
Q: How do I use SolrCloud, but distribute updates myself?
A: Add the following
UpdateProcessorFactory somewhere in your update chain: NoOpDistributingUpdateProcessorFactory
Q: What is the difference between a Collection and a
SolrCore?
A: In classic single node Solr, a
SolrCore is basically equivalent to a Collection. It presents one logical index. In SolrCloud, the
SolrCores on multiple nodes form a Collection. This is still just one logical index, but multiple
SolrCores host different 'shards' of the full collection. So a
SolrCore encapsulates a single physical index on an instance. A Collection is a combination of all of the
SolrCores that together provide a logical index that is distributed across many nodes.