The Data Import Handler Framework
Solr includes a very popular contrib module for importing data known as the DataImportHandler (DIH in short). It's a data processing pipeline built specificallyfor Solr. Here's a summary of notable capabilities:
· Imports data from databases through JDBC (Java Database Connectivity)
° Supports importing only changed records, assuming a last-updated date
· Imports data from a URL (HTTP GET)
· Imports data from files (that is it crawls files)
· Imports e-mail from an IMAP server, including attachments
· Supports combining data from different sources
· Extracts text and metadata from rich document formats
· Applies XSLT transformations and XPath extraction on XML data
· Includes a diagnostic/development tool
The DIH is not considered a core part of Solr, even though it comes with the Solr download, and so you must add its Java JAR files to your Solr setup to use it. If this isn't done, you'll eventually see a ClassNotFoundException error. The DIH's JAR files are located in Solr's dist directory: apache-solr-dataimporthandler-3.4.0.jar and apache-solr-dataimporthandler-extras-3.4.0.jar. The easiest way to add JAR files to a Solr configuration is to copy them to the <solr_home>/lib directory; you may need to create it. Another method is to reference them from solrconfig.xml via <lib/> tags—see Solr's example configuration for examples of that. You will most likely need some additional JAR files as well. If you'll be communicating with a database, then you'll need to get a JDBC driver for it. If you will be extracting text from various document formats then you'll need to add the JARs in /contrib/extraction/lib. Finally, if you'll be indexing e-mail then you'll need to add the JARs in /contrib /dataimporthandler/lib.
The DIH needs to be registered with Solr in solrconfig.xml like so:
<requestHandler name="/dih_artists_jdbc"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">mb-dih-artists-jdbc.xml</str>
</lst>
</requestHandler>
This reference mb-dih-artists-jdbc.xml is located in <solr-home>/conf, which specifies the details of a data importing process. We'll get to that file in a bit.
Index a DB table directly into Solr
Step 1 : Edit your solrconfig.xml to add the request handler
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
Step 2 : Create a data-config.xml file as follows and save it to the conf dir
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/dbname"
user="user-name"
password="password"/>
<document>
<entity name="id"
query="select id,name,desc from mytable">
</entity>
</document>
</dataConfig>
Step 3 : Ensure that your solr schema (schema.xml) has the fields 'id', 'name', 'desc'. Change the appropriate details in the data-config.xml
Step 4: Drop your JDBC driver jar file into the <solr-home>/lib directory .
Step 5 : Run the command
http://solr-host:port/solr/dataimport?command=full-import .
Keep in mind that every time a full-import is executed the index is cleaned up. If you do not wish that to happen add clean=false. For example:
http://solr-host:port/solr/dataimport?command=full-import&clean=false
Index the fields in different names
Step: 1 Change the data-config as follows :
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/dbname"
user="user-name"
password="password"/>
<document>
<entity name="id"
query="select id,name,desc from mytable">
<field column="id" name="solr_id"/>
<field column="name" name="solr_name"/>
<field column="desc" name="solr_desc"/>
</entity>
</document>
</dataConfig>
Step 2 : This time the fields will be written to the solr fields 'solr_id', 'solr_name', solr_desc'. You must have these fields in the schema.xml.
Step 3 : Run the command http://solr-host:port/dataimpor?command=full-import
Index data from multiple tables into Solr
Step: 1 Change the data-config as follows :
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/dbname"
user="user-name"
password="password"/>
<document>
<entity name="outer"
query="select id,name,desc from mytable">
<field column="id" name="solr_id"/>
<field column="name" name="solr_name"/>
<field column="desc" name="solr_desc"/>
<entity name="inner"
query="select details from another_table where id ='${outer.id}'">
<field column="details" name="solr_details"/>
</entity>
</entity>
</document>
</dataConfig>
Step 2: The schema.xml should have the solr_details field
Step 3: Run the full-import command 配置数据源
将 dataSource标签直接添加到dataConfig下面,即成为dataConfig的子元素.
<entity name="feature" query="select description from feature where item_id='${item.id}'">
<field name="feature" column="description" />
</entity>
feature表中的外键item_id跟item中的主键连在一起从数据库中取得该row的数据。相同地,我们将item和category连表 (它们是多对多的关系)。注意,我们是怎样使用中间表和标准sql连表的:
<entity name="item_category" query="select category_id from item_category where item_id='${item.id}'">
<entity name="category" query="select description from category where id = '${item_category.category_id}'">
<field column="description" name="cat" />
</entity>
</entity>
短一点的 data-config
在上面的例子中,这里有好几个从域到solr域之间的映射。如果域的名字和solr中域的名字是一样的话,完全避免使用在实体中配置域也是可以的。 当然,如果你需要使用转换器的话,你还是需要加上域实体的。
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document>
<entity name="item" query="select * from item">
<entity name="feature" query="select description as features from feature where item_id='${item.ID}'"/>
<entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
<entity name="category" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'"/>
</entity>
</entity>
</document>
</dataConfig>
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document name="products">
<entity name="item" pk="ID" query="select * from item"
deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
<entity name="feature" pk="ITEM_ID"
query="select description as features from feature where item_id='${item.ID}'">
</entity>
<entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
<entity name="category" pk="ID"
query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
</entity>
</entity>
</entity>
</document>
</dataConfig>
注意到item实体的 属性deltaquery了吗,它包含了一个能够查出最近更新的sql语句。注意,变量{dataimporter.last_index_time } 是DataImporthandler传过来的变量,我们叫它时间戳,它指出“完全导入”或者“部分导入”的最后运行时间。你可以在data- config.xml文件中的sql的任何地方使用这个变量,它将在processing这个过程中被赋值。
上面例子中deltaQuery 只能够发现item中的更新,而不能发现其他表的。你可以像下面那样在一个sql语句中指定所有的表的更新:
deltaQuery="select id from item where id in
(select item_id as id from feature where last_modified > '${dataimporter.last_index_time}')
or id in
(select item_id as id from item_category where item_id in
(select id as item_id from category where last_modified > '${dataimporter.last_index_time}')
or last_modified > '${dataimporter.last_index_time}')
or last_modified > '${dataimporter.last_index_time}'"
写一个类似上面的庞大的deltaQuery 并不是一件很享受的工作,我们还是选择其他的方法来达到这个目的
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document>
<entity name="item" pk="ID" query="select * from item"
deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
<entity name="feature" pk="ITEM_ID"
query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
<entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
<entity name="category" pk="ID"
query="select DESCRIPTION as cat from category where ID = '${item_category.CATEGORY_ID}'"
deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"/>
</entity>
</entity>
</document>
</dataConfig>