Inside SharePoint: Creating an External Storage Solution for SharePoint (reposted)
Pav Cherny
Code download available at: ChernySharePoint2009_06.exe (2,006 KB)
Contents
Internal Binary Storage
External Binary Storage
Building an Unmanaged EBS Provider
Building a Managed EBS Provider
Registering an EBS Provider in SharePoint
Implementing Garbage Collection
Conclusion
Microsoft estimates that as much as 80 percent of the data stored in Microsoft Windows SharePoint Services (WSS) 3.0 and Microsoft Office SharePoint Server (MOSS) 2007 content databases is non-relational binary large object (BLOB) data, such as Microsoft Office Word documents, Microsoft Office Excel spreadsheets, and Microsoft Office PowerPoint presentations. Only 20 percent is relational metadata, which implies a suboptimal use of Microsoft SQL Server resources at the database backend. SharePoint does not take advantage of recent SQL Server innovations for unstructured data introduced in SQL Server 2008, such as the FILESTREAM attribute or Remote BLOB Storage API, but provides its own options to increase the storage efficiency and manageability of massive data volumes.
Specifically, SharePoint includes an external binary storage provider API, ISPExternalBinaryProvider, which Microsoft first published as a hotfix in May 2007 and incorporated later into Service Pack 1. The ISPExternalBinaryProvider API is separate from the Remote BLOB Storage API. Third-party vendors can use this API to integrate SharePoint with advanced storage solutions, such as content-addressable storage (CAS) systems. You can also use this API to maintain SharePoint BLOB data on a central file server outside of content databases if you want to build a custom solution to increase storage efficiency and scalability in a SharePoint farm. Keep in mind, however, that this API is specific to WSS 3.0 and MOSS 2007. It will change in the next SharePoint release, which means that you will have to update your provider.
In this column, I discuss how to extend the SharePoint storage architecture using the ISPExternalBinaryProvider API, including advantages and disadvantages, implementation details, performance considerations, and garbage collection. I also discuss a 64-bit compatibility issue of Microsoft Visual Studio that can cause SharePoint to fail loading managed ISPExternalBinaryProvider components despite a correct interface implementation. Where appropriate, I refer to the ISPExternalBinaryProvider documentation in the WSS 3.0 SDK. Another reference worth mentioning is Kyle Tillman's blog.
Kyle does a great job explaining how he mastered the implementation hurdles in managed code, but neither the WSS 3.0 SDK nor Kyle's blog post includes a Visual Studio sample project, so I decided to provide ISPExternalBinaryProvider samples in both unmanaged and managed code in this column's companion material. The purpose of these samples is to help you get started if you are interested in integrating external storage solutions with SharePoint. Remember, though, that these samples are untested and not ready for production use.
Internal Binary Storage
By default, SharePoint stores BLOB data in the Content column of the AllDocStreams table in the content database. The obvious advantage of this approach is straightforward transactional consistency between relational data and the associated non-relational file contents. For example, it's not complicated to insert the metadata of a Word document along with the unstructured content into a content database, nor is it complicated to associate metadata with the corresponding unstructured content in select, update, or delete operations. However, the most obvious disadvantage of the default approach is an inefficient use of storage resources. Despite an I/O subsystem optimized for high performance, the SQL Server storage engine is not exactly a file-server replacement.
A SQL Server database consists of transaction log and data files, as illustrated in Figure 1. In order to ensure reliable transactional behavior, SQL Server first writes all transaction records to the log file before it flushes the corresponding data in 8 KB pages to the data file on disk. Depending on the selected recovery model, this requires more than twice the BLOB size in storage capacity until you perform a backup and purge the transaction log. Moreover, SQL Server does not store unstructured SharePoint content directly in data pages. Instead, SQL Server uses a separate collection of text/image pages and only stores a 16-byte text pointer to the BLOB's root node in the data row. Text/image pages are organized in a balanced tree, yet there is only one collection of text/image pages for each table. For the AllDocStreams table, this means that the content of all files is spread across the same text/image page collection. A single text/image page can hold data fragments from multiple BLOBs, or it may hold intermediate nodes for BLOBs larger than 32 KB in size.
Figure 1 Default SharePoint BLOB storage in SQL Server
Let's not dive too deeply into SQL Server internals, though. The point is that when reading unstructured content, SQL Server must go through the data row to get the text pointer and then through the BLOB's root node and possibly additional intermediate nodes to locate all data fragments spread across any number of text/image pages that SQL Server must load into memory in full to get all data blocks. This is because SQL Server performs I/O operations at the page level. These complexities impair file-streaming performance in comparison to direct access through the file system. SQL Server also imposes a hard size limit of 2 GB on SharePoint because this is the maximum capacity of the image data type. The Content column of the AllDocStreams table is an image column, so you cannot store files larger than 2 GB in a SharePoint content database.
External Binary Storage
The ISPExternalBinaryProvider API offers a clever alternative to internal BLOB storage in SharePoint content databases. It is a straightforward COM interface with only two methods (StoreBinary and RetrieveBinary), which you can use to implement an External Binary Storage (EBS) provider. For architecture details, see the topic "Architecture of External BLOB Storage" in the WSS 3.0 SDK.
SharePoint loads your EBS provider when you set the ExternalBinaryStoreClassId property of the local SPFarm object (SPFarm.Local.ExternalBinaryStoreClassId) to the provider's COM class identifier (CLSID). SharePoint then calls the provider's StoreBinary method whenever you submit BLOB data, such as when you're uploading a file to a document library. The EBS provider can decide to store the BLOB in its associated external storage system and return a corresponding BLOB identifier (BLOB ID) to SharePoint, or it can set the pfAccepted parameter in the StoreBinary method to false to indicate that it did not handle the BLOB. In the latter case, SharePoint stores the BLOB in the content database as usual. On the other hand, if the EBS provider accepted the BLOB, SharePoint only inserts the BLOB ID into the Content column of the AllDocStreams table, as indicated in Figure 2. The BLOB ID can be any value that enables the EBS provider to locate the content in the external storage system, such as a filename, a file path, a globally unique identifier (GUID), or a content digest. The sample providers included in the companion material, for instance, use GUIDs as filenames for reliable identification of BLOBs on a file server.
Figure 2 Storing a SharePoint BLOB in an external storage system
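The accept-or-decline decision described above can be sketched in portable C++. This is a simplified model only, not the COM signature of StoreBinary: the StoreResult struct, the DecideStore function, the size threshold, and the makeId callback are hypothetical names introduced for illustration. The sketch assumes a provider that declines small BLOBs so that SharePoint keeps them in the content database, which is one possible policy and not something the SDK prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical result type: models the pfAccepted flag and the BLOB ID
// that a provider hands back to SharePoint. Not the real COM signature.
struct StoreResult {
    bool accepted;       // false: SharePoint stores the BLOB in the content database
    std::string blobId;  // meaningful only when accepted is true
};

// Assumed policy for illustration: decline BLOBs below minExternalSize bytes.
// makeId stands in for whatever produces a unique ID (the samples use GUIDs).
inline StoreResult DecideStore(std::size_t blobSize,
                               std::size_t minExternalSize,
                               const std::function<std::string()>& makeId) {
    if (blobSize < minExternalSize) {
        return StoreResult{false, std::string()};
    }
    assert(static_cast<bool>(makeId));  // an ID generator is required to accept
    return StoreResult{true, makeId()};
}
```

A real provider would write the data to the external store before reporting acceptance, and would report a failure rather than an ID if that write did not complete.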
SharePoint also keeps track of externally stored files by setting the highest DocFlags bit of these files to 1. DocFlags is a column of the AllDocs table. When a user requests to download an externally stored file, SharePoint checks DocFlags and passes the Content value from the AllDocStreams table to the RetrieveBinary method of the EBS provider. In response to the RetrieveBinary call, the EBS provider must retrieve the indicated BLOB from the external storage system and return the binary content to SharePoint in the form of a COM object that implements the ILockBytes interface. Note that SharePoint does not call the RetrieveBinary method for BLOBs stored directly in the content database.
Note also that the storage and retrieval processes are transparent to the user as long as the user doesn't attempt to bypass SharePoint. So, you don't need to replace built-in Web parts with custom versions that tie metadata in a list with a document stored externally; productivity applications, such as Microsoft Office, don't need to know how to store metadata in one place and then the document in another; and Search does not need to process metadata separate from documents. Moreover, and this is one of my favorite advantages of the EBS provider architecture, the user must go through SharePoint to access externally stored BLOB data. A user bypassing SharePoint and directly accessing a content database through a SQL Server connection ends up downloading BLOB IDs instead of actual file contents, as illustrated in Figure 3. You can verify this behavior if you deploy the SQL Download Web Part (which I used in the April 2009 column to demonstrate how to bypass SharePoint AD RMS protection) in a test environment. Furthermore, users don't need (and should not have) access permissions to the external BLOB store. Only SharePoint security accounts require access because SharePoint calls the EBS provider methods in the security context of the site's application pool account.
Figure 3 The EBS provider can be a roadblock to bypassing SharePoint permissions for file downloads
Keep in mind, however, that EBS providers also have drawbacks due to the complexity of maintaining integrity between metadata in the SharePoint farm's content databases and the external BLOB store. For a good discussion of pros and cons, check out the topic "Operational Limits and Trade-Off Analysis" in the WSS 3.0 SDK. Make sure you read this very important topic before implementing an EBS provider in a SharePoint environment.
Building an Unmanaged EBS Provider
Now let's tackle the challenges of building EBS providers. The ISPExternalBinaryProvider interface is well-documented in the WSS 3.0 SDK under "The BLOB Access Interface: ISPExternalBinaryProvider." However, it seems Microsoft forgot to cover the EBS provider details. After all, we are not just consuming the interface of an existing COM server. We are tasked with building that COM server ourselves and implementing the ISPExternalBinaryProvider interface. Most importantly, the WSS 3.0 SDK fails to mention the type of COM server we are supposed to build and the required threading model. A classic COM server can run out-of-process or in-process, and it can support the single-threaded apartment (STA) model, the multithreaded apartment (MTA) model, or both, or the free-threaded model. For the EBS provider to work properly, make sure you build a thread-safe in-process COM server that supports the threading model "Both" for STAs and the MTA.
You also need to think about which programming language to use. This is important because the ISPExternalBinaryProvider interface is the lowest-level API of SharePoint. Performance issues can affect the entire SharePoint farm. For this reason, I recommend using a language that enables you to build small and fast COM objects, such as Visual C++ and Active Template Library (ATL). ATL provides helpful C++ classes to simplify the development of thread-safe COM servers in unmanaged code with the correct level of threading support.
Visual Studio also includes a variety of ATL wizards. Just create an ATL project, select Dynamic-link library (DLL) for the server type, copy the ISPExternalBinaryProvider interface definition from the WSS 3.0 SDK into the interface definition language (IDL) file of your ATL project, add a new class for an ATL Simple Object, select "Both" as the threading model and no aggregation, then right-click the new class, point to Add, click Implement Interface, and select ISPExternalBinaryProvider. That's it! The Implement Interface Wizard performs all necessary plumbing, so you can focus on implementing the StoreBinary and RetrieveBinary methods.
And don't let unmanaged C++ code intimidate you. If you analyze the SampleStore.cpp file in the companion material, you can see that the StoreBinary and RetrieveBinary implementations are relatively straightforward. Essentially, the sample StoreBinary method constructs a file path based on a StorePath registry value, the Site ID passed in from SharePoint, and a GUID generated for the BLOB, and then uses the Win32 WriteFile function to save the binary data obtained from the ILockBytes instance. The sample RetrieveBinary method, on the other hand, constructs the file path based on the same StorePath registry value, the Site ID, and the BLOB ID passed in from SharePoint, and then uses the Win32 ReadFile function to retrieve the unstructured data, which the EBS provider copies into a new ILockBytes instance that it then passes back to SharePoint. Figure 4 illustrates how the EBS provider constructs the file path.
Figure 4 Constructing file paths for StoreBinary and RetrieveBinary operations in the sample EBS providers
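The path construction can be sketched as a small portable C++ helper. This is not the code from SampleStore.cpp: the function name BuildBlobPath is hypothetical, and the layout of StorePath, then Site ID, then BLOB ID is an assumption based on the description above. In the sample providers the store path comes from a registry value, whereas here it is simply passed in as a string.

```cpp
#include <cassert>
#include <string>

// Assumed layout for illustration: <StorePath>\<SiteId>\<BlobId>.
// All three parts are plain strings here; a real provider would read
// StorePath from the registry and receive the IDs as byte buffers.
inline std::string BuildBlobPath(const std::string& storePath,
                                 const std::string& siteId,
                                 const std::string& blobId) {
    assert(!storePath.empty() && !siteId.empty() && !blobId.empty());
    std::string path = storePath;
    if (path.back() != '\\') {
        path += '\\';  // tolerate a StorePath with or without a trailing backslash
    }
    path += siteId;
    path += '\\';
    path += blobId;
    return path;
}
```

Because StoreBinary and RetrieveBinary must arrive at the same path for the same BLOB, it is probably worth keeping this logic in one shared helper. A production provider would likely also reject IDs that contain path separators before using them in a file name.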
Building a Managed EBS Provider
Of course, SharePoint developers might prefer using familiar managed languages to build EBS providers, even though building managed EBS providers is not necessarily less complicated than building unmanaged providers due to the complexity of COM interoperability. Keep in mind that an application written in unmanaged code can only load one version of the common language runtime (CLR), so your code needs to work with the same version of the CLR that the rest of SharePoint is using, otherwise you might end up with unexpected behavior. Also, you still must deal with unmanaged interfaces and the corresponding marshalling of parameters and buffers. Just compare SampleStore.cpp with SampleStore.cs in the companion material. There are no gains using a managed language in terms of code structure or programming simplicity.
Moreover, be aware of 64-bit compatibility issues if you develop managed EBS providers on the x64 platform. Figure 5 shows a typical error that results from invalid COM registration settings on a development computer. If you enable the Register for COM Interop checkbox in the project properties in Visual Studio 2005 or Visual Studio 2008, you'll end up with COM registration settings for your provider in the registry under HKEY_CLASSES_ROOT\Wow6432Node\CLSID\<ProviderCLSID>. Visual Studio uses the 32-bit version of the Assembly Registration Tool (Regasm.exe) even on the x64 platform.
Figure 5 Due to invalid COM registration settings, a managed EBS provider could not be loaded
However, the 64-bit version of SharePoint cannot load a 32-bit COM server registered under the Wow6432Node, so you must manually register your managed EBS provider by using the 64-bit Regasm.exe version, located in the %WINDIR%\Microsoft.NET\Framework64\v2.0.50727 directory. For example, the command "%WINDIR%\Microsoft.NET\Framework64\v2.0.50727\Regasm.exe" ManagedProvider.dll creates the required registry settings for the managed sample provider under HKEY_CLASSES_ROOT\CLSID\<ProviderCLSID>. Another approach is to create a Setup program and mark the EBS provider for automatic COM registration.
Remember also that managed EBS providers come with significantly more overhead and performance penalties than their unmanaged ATL counterparts. You can see this if you compare the COM registration settings in the registry. As the InProcServer32 key reveals, the COM runtime loads unmanaged EBS provider DLLs directly, while managed EBS providers rely on Mscoree.dll as the in-proc server, which is the core engine of the CLR. So, for managed providers, the COM runtime loads the CLR and then the CLR loads the EBS provider assembly as registered under the Assembly key and creates a COM Callable Wrapper (CCW) proxy to handle the interaction between the unmanaged SharePoint client (Owssvr.dll) and the managed EBS provider.
Keep in mind that the unmanaged SharePoint server does not directly interact with your managed provider. It's the CCW that marshals parameters, calls the managed methods, and handles HRESULTs. This indirection is especially apparent in the different return types of managed methods in comparison to unmanaged methods. Unmanaged methods return HRESULTs to indicate success or failure, while managed methods are supposed to have the void return type. So don't return explicit HRESULTs in managed code. You must raise system or user-defined exceptions in response to error conditions. If a managed method completes without an exception, the CCW automatically returns S_OK to the unmanaged client.
On the other hand, if a managed method raises an exception, the CCW maps error codes and messages to HRESULTs and error information. The CCW implements various error-handling interfaces for this purpose, such as ISupportErrorInfo and IErrorInfo, but SharePoint does not take advantage of these interfaces. EBS providers must implement their own error reporting through the Windows event log, SharePoint diagnostic logs, trace files, or other means. SharePoint only expects the HRESULT values S_OK for success and E_FAIL for any error. You can use the Marshal.ThrowExceptionForHR method to return E_FAIL to SharePoint, as demonstrated in SampleStore.cs.
Registering an EBS Provider in SharePoint
Easily the most confusing section on ISPExternalBinaryProvider in the WSS 3.0 SDK is the topic "Installing and Configuring Your BLOB Provider." At the time of this writing, this section was filled with misleading information and errors. Even the Windows PowerShell commands were incorrect. If you assign the EBS provider to $yourProviderConfig and afterwards use $providerConfig.ProviderCLSID, don't be surprised when you receive an error stating that $providerConfig doesn't exist. Of course, you won't even reach this point because the Active and ProviderCLSID properties aren't part of the ISPExternalBinaryProvider interface. These mysterious properties belong to a dual interface that is not covered in the documentation. Just for fun, I implemented a sample version in both unmanaged and managed code, but your ISPExternalBinaryProvider implementation does not require these proprietary properties at all.
The ProviderCLSID property might be handy, but the CLSID is also available in the registry if you search for the ProgID, such as UnmanagedProvider.SampleStore or ManagedProvider.SampleStore, and you can also find the CLSIDs in the code files SampleStore.rgs and SampleStore.cs. As mentioned earlier, setting the ExternalBinaryStoreClassId property of the local SPFarm object to the CLSID registers the EBS provider. Setting the ExternalBinaryStoreClassId property of the local SPFarm object to an empty GUID ("00000000-0000-0000-0000-000000000000") removes the EBS provider registration. Don't forget to call the SPFarm object's Update method to save the changes in the configuration database and restart Internet Information Services (IIS). The following code listing illustrates how to accomplish these tasks in Windows PowerShell:
[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SharePoint')
$farm = [Microsoft.SharePoint.Administration.SPFarm]::Local
# Registering the CLSID of an EBS provider
$farm.ExternalBinaryStoreClassId = "C4A543C2-B7DB-419F-8C79-68B8842EC005"
$farm.Update()
IISRESET
# Removing the EBS provider registration
$farm.ExternalBinaryStoreClassId = "00000000-0000-0000-0000-000000000000"
$farm.Update()
IISRESET
Implementing Garbage Collection
Another section in the WSS 3.0 SDK featuring mysterious components and critical code snippets is titled "Implementing Lazy Garbage Collection." At the time of this writing, this section contained references to another mysterious Utility class with DirFromSiteId and FileFromBlobid methods as well as an incorrect assignment of Directory.GetFiles results to a FileInfo array, but let's not be too demanding on WSS 3.0 documentation quality. The DirFromSiteId and FileFromBlobid helper methods reveal their purpose through their names and the incorrect FileInfo array is easily replaced with a string array, or you can replace the Directory.GetFiles method with a call to the GetFiles method of a DirectoryInfo object. The Garbage Collector sample program in the companion material uses the DirectoryInfo approach and follows the suggested sequence of steps for garbage collection.
An important deviation of the Garbage Collector sample from the SDK explanations concerns the handling of timing conditions. This is a critical issue because timing conditions can lead to misidentification and deletion of valid files during garbage collection. Take a look at Figure 6, which illustrates the WSS 3.0 SDK-recommended approach to determine orphaned files by enumerating all BLOB files in the EBS store and then removing all those references from the BLOB list that are still in the content database as indicated through the site's ExternalBinaryIds collection. The remaining references in the BLOB list are supposed to indicate orphaned files that should be deleted.
Figure 6 Misidentification of a valid BLOB as orphaned due to a timing condition
However, the EBS provider must, of course, first finish writing BLOB data before it can return a BLOB ID to SharePoint. Depending on network bandwidth and other conditions, I/O performance can fluctuate. So, there is a chance that the EBS provider could create a new BLOB (which then appears in your BLOB list) but complete writing the BLOB data only after you have determined the ExternalBinaryIds, so the BLOB ID is not yet present in this collection. Accordingly, the reference to the new BLOB remains in the orphaned BLOB list and if you purge the orphaned BLOBs at this point, you accidentally delete a valid content item and lose data! In order to avoid this problem, the sample Garbage Collector checks the file creation time and adds only those items to the BLOB list that are more than one hour old.
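The age check can be sketched in portable C++ as a filter over the files found in the store. This is not the Garbage Collector sample itself: the StoredBlob struct and the FindOrphans function are hypothetical names, the store listing and the referenced IDs are assumed to have been collected already, and the one-hour default mirrors the threshold mentioned above.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>
#include <vector>

// Hypothetical record for a file found in the external BLOB store.
struct StoredBlob {
    std::string blobId;
    std::chrono::system_clock::time_point created;
};

// Returns the BLOB IDs that can be treated as orphaned: present in the
// store, absent from the referenced set, and older than minAge.
inline std::vector<std::string> FindOrphans(
        const std::vector<StoredBlob>& storeFiles,
        const std::set<std::string>& referencedIds,
        std::chrono::system_clock::time_point now,
        std::chrono::seconds minAge = std::chrono::hours(1)) {
    assert(minAge.count() >= 0);
    std::vector<std::string> orphans;
    for (const StoredBlob& blob : storeFiles) {
        if (now - blob.created <= minAge) {
            continue;  // too recent: its ID may not be in the content database yet
        }
        if (referencedIds.count(blob.blobId) == 0) {
            orphans.push_back(blob.blobId);
        }
    }
    return orphans;
}
```

The age filter runs before the reference lookup, so a recently created file is never reported as orphaned even when its ID is missing from the referenced set. A file skipped this way is simply picked up by a later run if it turns out to be a real orphan.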
Conclusion
By integrating an external storage solution with SharePoint, you can increase storage efficiency, system performance, and scalability of a SharePoint farm. Another advantage is that this forces users to go through SharePoint to access unstructured contents. Pulling data out of the content databases via direct SQL Server connections only yields binary BLOB identifiers instead of the actual files. However, EBS providers also have drawbacks due to the complexity of maintaining integrity between metadata in the SharePoint farm's content databases and the external BLOB store.
In order to integrate SharePoint with an external storage solution, you must build an EBS provider, which is a COM server that implements the ISPExternalBinaryProvider interface with its StoreBinary and RetrieveBinary methods. You can create unmanaged and managed EBS providers, but be aware of performance and compatibility issues if you decide to use managed code. Also keep in mind that the ISPExternalBinaryProvider interface does not include a DeleteBinary method. You must explicitly remove orphaned BLOBs through lazy garbage collection, and be careful to avoid timing conditions that can lead to the accidental deletion of valid BLOB items.
Pav Cherny is an IT expert and author specializing in Microsoft technologies for collaboration and unified communication. His publications include white papers, product manuals, and books with a focus on IT operations and system administration. Pav is President of Biblioso Corporation, a company that specializes in managed documentation and localization services.
PavCherny
Code download available at: ChernySharePoint2009_06.exe (2,006 KB)
Contents
InternalBinary Storage
ExternalBinary Storage
Buildingan Unmanaged EBS Provider
Buildinga Managed EBS Provider
Registeringan EBS Provider in SharePoint
ImplementingGarbage Collection
Conclusion
Microsoft estimates that as much as 80 percentof the data stored in Microsoft Windows SharePoint Services ( WSS ) 3.0and Microsoft Office SharePoint Server (MOSS) 2007 content databases isnon-relational binary large object ( BLOB) data, such as MicrosoftOffice Word documents, Microsoft Office Excel spreadsheets, andMicrosoft Office PowerPoint presentations. Only 20 percent is relationalmetadata, which implies a suboptimal use of Microsoft SQL Serverresources at the database backend. SharePoint does not take advantage ofrecent SQL Server innovations for unstructured data introduced in SQLServer 2008, such as the FILESTREAM attribute or Remote BLOB StorageAPI, but provides its own options to increase the storage efficiency andmanageability of massive data volumes.
Specifically, SharePoint includes an externalbinary storage provider API, ISPExternalBinaryProvider, which Microsoftfirst published as a hotfix in May 2007 and incorporated later intoService Pack 1. The ISPExternalBinaryProvider API is separate from theRemote BLOB Storage API. Third-party vendors can use this API tointegrate SharePoint with advanced storage solutions, such ascontent-addressable storage (CAS) systems. You can also use this API tomaintain SharePoint BLOB data on a central file server outside ofcontent databases if you want to build a custom solution to increasestorage efficiency and scalability in a SharePoint farm. Keep in mind,however, that this API is specific to WSS 3.0 and MOSS 2007. It willchange in the next SharePoint release, which means that you will have toupdate your provider.
In this column, I discuss how to extend theSharePoint storage architecture using the ISPExternalBinaryProvider API,including advantages and disadvantages, implementation details,performance considerations, and garbage collection. I also discuss a64-bit compatibility issue of Microsoft Visual Studio that can causeSharePoint to fail loading managed ISPExternalBinaryProvider componentsdespite a correct interface implementation. Where appropriate, I referto the ISPExternalBinaryProvider documentation in the WSS 3.0 SDK.Another reference worth mentioning is KyleTillman's blog.
Kyle does a great job explaining how hemastered the implementation hurdles in managed code, but neither the WSS3.0 SDK nor Kyle's blog post includes a Visual Studio sample project,so I decided to provide ISPExternalBinaryProvider samples in bothunmanaged and managed code in this column's companion material. Thepurpose of these samples is to help you get started if you areinterested in integrating external storage solutions with SharePoint.Remember, though, that these samples are untested and not ready forproduction use.
Internal Binary Storage
By default, SharePoint stores BLOB data in theContent column of the AllDocStreams table in the content database. Theobvious advantage of this approach is straightforward transactionalconsistency between relational data and the associated non-relationalfile contents. For example, it'snot complicated to insert the metadataof a Word document along with the unstructured content into a contentdatabase, nor is it complicated to associate metadata with thecorresponding unstructured content in select, update, or deleteoperations. However, the most obvious disadvantage of the defaultapproach is an inefficient use of storage resources. Despite an I/Osubsystem optimized for high performance, the SQL Server storage engineis not exactly a file-server replacement.
A SQL Server database consists of transactionlog and data files, as illustrated in Figure 1. Inorder to ensure reliable transactional behavior, SQL Server first writesall transaction records to the log file before it flushes thecorresponding data in 8KB pages to the data file on disk. Depending onthe selected recovery model, this requires more than twice the BLOB sizein storage capacity until you perform a backup and purge thetransaction log. Moreover, SQL Server does not store unstructuredSharePoint content directly in data pages. Instead, SQL Server uses aseparate collection of text/image pages and only stores a 16-byte textpointer to the BLOB's root node in the data row. Text/image pages areorganized in a balanced tree, yet there is only one collection oftext/image pages for each table. For the AllDocStreams table, this meansthat the content of all files is spread across the same text/image pagecollection. A single text/image page can hold data fragments frommultiple BLOBs, or it may hold intermediate nodes for BLOBs larger than32KB in size.
Figure 1 Default SharePoint BLOBstorage in SQL Server
Let's not dive too deeply into SQL Serverinternals, though. The point is that when reading unstructured content,SQL Server must go through the data row to get the text pointer and thenthrough the BLOB's root node and possibly additional intermediate nodesto locate all data fragments spread across any number of text/imagepages that SQL Server must load into memory in full to get all datablocks. This is because SQL Server performs I/O operations at the pagelevel. These complexities impair file-streaming performance incomparison to direct access through the file system. SQL Server alsoimposes a hard size limit of 2GB on SharePoint because this is themaximum capacity of the image data type. The Content column of theAllDocStreams table is an image column, so you cannot store files largerthan 2GB in a SharePoint content database.
External Binary Storage
The ISPExternalBinaryProvider API offers aclever alternative to internal BLOB storage in SharePoint contentdatabases. It is a straightforward COM interface with only two methods(StoreBinary and RetrieveBinary), which you can use to implement anExternal Binary Storage (EBS) provider. For architecture details, seethe topic "Architectureof External BLOB Storage" in the WSS 3.0 SDK.
SharePoint loads your EBS provider when youset the ExternalBinaryStoreClassId property of the local SPFarm object(SPFarm.Local.ExternalBinaryStoreClassId) to the provider's COM classidentifier (CLSID). SharePoint then calls the provider's StoreBinarymethod whenever you submit BLOB data, such as when you're uploading afile to a document library. The EBS provider can decide to store theBLOB in its associated external storage system and return acorresponding BLOB identifier ( BLOB ID) to SharePoint, or it can setthe pfAccepted parameter in the StoreBinary method to false to indicatethat it did not handle the BLOB. In the latter case, SharePoint storesthe BLOB in the content database as usual. On the other hand, if the EBSprovider accepted the BLOB, SharePoint only inserts the BLOB ID intothe Content column of the AllDocStreams table, as indicated in Figure2. The BLOB ID can be any value that enables the EBS providerto locate the content in the external storage system, such as afilename, a file path, a globally unique identifier (GUID), or a contentdigest. The sample providers included in the companion material, forinstance, use GUIDs as filenames for reliable identification of BLOBs ona file server.
Figure 2 Storing a SharePointBLOB in an external storage system
SharePoint also keeps track of externallystored files by setting the highest DocFlags bit of these files to 1.DocFlags is a column of the AllDocs table. When a user requests todownload an externally stored file, SharePoint checks DocFlags andpasses the Content value from the AllDocStreams table to theRetrieveBinary method of the EBS provider. In response to theRetrieveBinary call, the EBS provider must retrieve the indicated BLOBfrom the external storage system and return the binary content toSharePoint in form of a COM object that implements the ILockBytesinterface. Note that SharePoint does not call the RetrieveBinary methodfor BLOBs stored directly in the content database.
Note also that the storage and retrievalprocesses are transparent to the user as long as the user doesn'tattempt to bypass SharePoint. So, you don't need to replace built-in Webparts with custom versions that tie metadata in a list with a documentstored externally; productivity applications, such as Microsoft Office,don't need to know how to store metadata in one place and then thedocument in another; and Search does not need to process metadataseparate from documents. Moreover, and this is one of my favoriteadvantages of the EBS provider architecture, the user must go throughSharePoint to access externally stored BLOB data. A user bypassingSharePoint and directly accessing a content database through a SQLServer connection ends up downloading BLOB IDs instead of actual filecontents, as illustrated in Figure 3. You can verifythis behavior if you deploy the SQL Download Web Part (which I used inthe April 2009 column to demonstrate how to bypass SharePoint AD RMSprotection) in a test environment. Furthermore, users don't need—andshould not have—access permissions to the external BLOB store. OnlySharePoint security accounts require access because SharePoint calls theEBS provider methods in the security context of the site's applicationpool account.
Figure 3 The EBS provider can be a roadblock to bypassing SharePoint permissions for file downloads
Keep in mind, however, that EBS providers also have drawbacks due to the complexity of maintaining integrity between metadata in the SharePoint farm's content databases and the external BLOB store. For a good discussion of pros and cons, check out the topic "Operational Limits and Trade-Off Analysis" in the WSS 3.0 SDK. Make sure you read this very important topic before implementing an EBS provider in a SharePoint environment.
Building an Unmanaged EBS Provider
Now let's tackle the challenges of building EBS providers. The ISPExternalBinaryProvider interface is well documented in the WSS 3.0 SDK under "The BLOB Access Interface: ISPExternalBinaryProvider." However, it seems Microsoft forgot to cover the EBS provider details. After all, we are not just consuming the interface of an existing COM server. We are tasked with building that COM server ourselves and implementing the ISPExternalBinaryProvider interface. Most importantly, the WSS 3.0 SDK fails to mention the type of COM server we are supposed to build and the required threading model. A classic COM server can run out-of-process or in-process, and it can support the single-threaded apartment (STA) model, the multithreaded apartment (MTA) model, or both, or the free-threaded model. For the EBS provider to work properly, make sure you build a thread-safe in-process COM server that supports the threading model "Both" for STAs and the MTA.
You also need to think about which programming language to use. This is important because the ISPExternalBinaryProvider interface is the lowest-level API of SharePoint. Performance issues can affect the entire SharePoint farm. For this reason, I recommend using a language that enables you to build small and fast COM objects, such as Visual C++ and Active Template Library (ATL). ATL provides helpful C++ classes to simplify the development of thread-safe COM servers in unmanaged code with the correct level of threading support.
Visual Studio also includes a variety of ATL wizards. Just create an ATL project, select Dynamic-link library (DLL) for the server type, copy the ISPExternalBinaryProvider interface definition from the WSS 3.0 SDK into the interface definition language (IDL) file of your ATL project, add a new class for an ATL Simple Object, select "Both" as the threading model and no aggregation, then right-click the new class, point to Add, click Implement Interface, and select ISPExternalBinaryProvider. That's it! The Implement Interface Wizard performs all necessary plumbing, so you can focus on implementing the StoreBinary and RetrieveBinary methods.
And don't let unmanaged C++ code intimidate you. If you analyze the SampleStore.cpp file in the companion material, you can see that the StoreBinary and RetrieveBinary implementations are relatively straightforward. Essentially, the sample StoreBinary method constructs a file path based on a StorePath registry value, the Site ID passed in from SharePoint, and a GUID generated for the BLOB, and then uses the Win32 WriteFile function to save the binary data obtained from the ILockBytes instance. The sample RetrieveBinary method, on the other hand, constructs the file path based on the same StorePath registry value, the Site ID, and the BLOB ID passed in from SharePoint, and then uses the Win32 ReadFile function to retrieve the unstructured data, which the EBS provider copies into a new ILockBytes instance that it then passes back to SharePoint. Figure 4 illustrates how the EBS provider constructs the file path.
Figure 4 Constructing file paths for StoreBinary and RetrieveBinary operations in the sample EBS providers
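To make the path construction concrete, here is a minimal sketch in portable C++. It is not the code from SampleStore.cpp: the layout StorePath\SiteID\BlobID is my assumption based on the description above, and the helper names looks_like_guid_text and build_blob_path are hypothetical. Because both IDs end up in a file system path, the sketch rejects anything that does not look like GUID text before concatenating.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// True if the text contains only characters that can appear in a GUID string
// (hexadecimal digits and hyphens). This keeps path separators and ".." out.
bool looks_like_guid_text(const std::string& text)
{
    if (text.empty()) return false;
    for (unsigned char ch : text) {
        if (!std::isxdigit(ch) && ch != '-') return false;
    }
    return true;
}

// Builds <store_path>\<site_id>\<blob_id>. The layout is an assumption made
// for illustration; store_path stands in for the StorePath registry value.
std::string build_blob_path(const std::string& store_path,
                            const std::string& site_id,
                            const std::string& blob_id)
{
    if (!looks_like_guid_text(site_id) || !looks_like_guid_text(blob_id)) {
        throw std::invalid_argument("site ID and BLOB ID must be GUID-like");
    }
    return store_path + "\\" + site_id + "\\" + blob_id;
}
```

StoreBinary would call such a helper with a freshly generated GUID as the BLOB ID, while RetrieveBinary would call it with the BLOB ID that SharePoint passes in, so both operations resolve to the same file.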
Building a Managed EBS Provider
Of course, SharePoint developers might prefer using familiar managed languages to build EBS providers, even though building managed EBS providers is not necessarily less complicated than building unmanaged providers due to the complexity of COM interoperability. Keep in mind that an application written in unmanaged code can only load one version of the common language runtime (CLR), so your code needs to work with the same version of the CLR that the rest of SharePoint is using; otherwise, you might end up with unexpected behavior. Also, you still must deal with unmanaged interfaces and the corresponding marshalling of parameters and buffers. Just compare SampleStore.cpp with SampleStore.cs in the companion material. There are no gains from using a managed language in terms of code structure or programming simplicity.
Moreover, be aware of 64-bit compatibility issues if you develop managed EBS providers on the x64 platform. Figure 5 shows a typical error that results from invalid COM registration settings on a development computer. If you enable the Register for COM Interop checkbox in the project properties in Visual Studio 2005 or Visual Studio 2008, you'll end up with COM registration settings for your provider in the registry under HKEY_CLASSES_ROOT\Wow6432Node\CLSID\<ProviderCLSID>. Visual Studio uses the 32-bit version of the Assembly Registration Tool (Regasm.exe) even on the x64 platform.
Figure 5 Due to invalid COM registration settings, a managed EBS provider could not be loaded
However, the 64-bit version of SharePoint cannot load a 32-bit COM server registered under the Wow6432Node, so you must manually register your managed EBS provider by using the 64-bit Regasm.exe version, located in the %WINDIR%\Microsoft.NET\Framework64\v2.0.50727 directory. For example, the command "%WINDIR%\Microsoft.NET\Framework64\v2.0.50727\Regasm.exe" ManagedProvider.dll creates the required registry settings for the managed sample provider under HKEY_CLASSES_ROOT\CLSID\<ProviderCLSID>. Another approach is to create a Setup program and mark the EBS provider for automatic COM registration.
Remember also that managed EBS providers come with significantly more overhead and performance penalties than their unmanaged ATL counterparts. You can see this if you compare the COM registration settings in the registry. As the InProcServer32 key reveals, the COM runtime loads unmanaged EBS provider DLLs directly, while managed EBS providers rely on Mscoree.dll as the in-proc server, which is the core engine of the CLR. So, for managed providers, the COM runtime loads the CLR and then the CLR loads the EBS provider assembly as registered under the Assembly key and creates a COM Callable Wrapper (CCW) proxy to handle the interaction between the unmanaged SharePoint client (Owssvr.dll) and the managed EBS provider.
Keep in mind that the unmanaged SharePoint server does not directly interact with your managed provider. It's the CCW that marshals parameters, calls the managed methods, and handles HRESULTs. This indirection is especially apparent in the different return types of managed methods in comparison to unmanaged methods. Unmanaged methods return HRESULTs to indicate success or failure, while managed methods are supposed to have the void return type. So don't return explicit HRESULTs in managed code. You must raise system or user-defined exceptions in response to error conditions. If a managed method completes without an exception, the CCW automatically returns S_OK to the unmanaged client.
On the other hand, if a managed method raises an exception, the CCW maps error codes and messages to HRESULTs and error information. The CCW implements various error-handling interfaces for this purpose, such as ISupportErrorInfo and IErrorInfo, but SharePoint does not take advantage of these interfaces. EBS providers must implement their own error reporting through the Windows event log, SharePoint diagnostic logs, trace files, or other means. SharePoint only expects the HRESULT values S_OK for success and E_FAIL for any error. You can use the Marshal.ThrowExceptionForHR method to return E_FAIL to SharePoint, as demonstrated in SampleStore.cs.
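The same two-value contract applies to an unmanaged provider, where no CCW does the mapping for you. The following C++ sketch shows one way to collapse every failure into E_FAIL while logging the details yourself. It is my own illustration, not code from the companion material: HResult, kOk, kFail, and run_guarded are hypothetical names, the constants are stand-ins so the sketch compiles outside Windows, and std::cerr stands in for whatever log you actually use.

```cpp
#include <cstdint>
#include <exception>
#include <functional>
#include <iostream>

// Stand-ins for the Windows definitions so the sketch compiles anywhere;
// on Windows, use HRESULT, S_OK, and E_FAIL from <winerror.h> instead.
using HResult = std::int32_t;
constexpr HResult kOk = 0;                                    // value of S_OK
constexpr HResult kFail = static_cast<HResult>(0x80004005u);  // value of E_FAIL

// Runs the body of a provider method and never lets an exception escape.
// Any failure is logged by the provider and reported as the single error code.
HResult run_guarded(const std::function<void()>& operation) noexcept
{
    try {
        operation();
        return kOk;
    } catch (const std::exception& error) {
        // Placeholder for the event log, diagnostic log, or trace file.
        std::cerr << "EBS provider error: " << error.what() << '\n';
        return kFail;
    } catch (...) {
        std::cerr << "EBS provider error: unknown exception\n";
        return kFail;
    }
}
```

A StoreBinary or RetrieveBinary implementation could wrap its file I/O in a call such as run_guarded([&] { /* write or read the BLOB */ }) and return the result directly to SharePoint.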
Registering an EBS Provider in SharePoint
Easily the most confusing section on ISPExternalBinaryProvider in the WSS 3.0 SDK is the topic "Installing and Configuring Your BLOB Provider." At the time of this writing, this section was filled with misleading information and errors. Even the Windows PowerShell commands were incorrect. If you assign the EBS provider to $yourProviderConfig and afterwards use $providerConfig.ProviderCLSID, don't be surprised when you receive an error stating that $providerConfig doesn't exist. Of course, you won't even reach this point because the Active and ProviderCLSID properties aren't part of the ISPExternalBinaryProvider interface. These mysterious properties belong to a dual interface that is not covered in the documentation. Just for fun, I implemented a sample version in both unmanaged and managed code, but your ISPExternalBinaryProvider implementation does not require these proprietary properties at all.
The ProviderCLSID property might be handy, but the CLSID is also available in the registry if you search for the ProgID, such as UnmanagedProvider.SampleStore or ManagedProvider.SampleStore, and you can also find the CLSIDs in the code files SampleStore.rgs and SampleStore.cs. As mentioned earlier, setting the ExternalBinaryStoreClassId property of the local SPFarm object to the CLSID registers the EBS provider. Setting the ExternalBinaryStoreClassId property of the local SPFarm object to an empty GUID ("00000000-0000-0000-0000-000000000000") removes the EBS provider registration. Don't forget to call the SPFarm object's Update method to save the changes in the configuration database and restart Internet Information Services (IIS). The following code listing illustrates how to accomplish these tasks in Windows PowerShell:
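# Load the SharePoint object model and get the local farm
[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SharePoint')
$farm = [Microsoft.SharePoint.Administration.SPFarm]::Local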
# Registering the CLSID of an EBS provider
$farm.ExternalBinaryStoreClassId = "C4A543C2-B7DB-419F-8C79-68B8842EC005"
$farm.Update()
IISRESET
# Removing the EBS provider registration
$farm.ExternalBinaryStoreClassId = "00000000-0000-0000-0000-000000000000"
$farm.Update()
IISRESET
Implementing Garbage Collection
Another section in the WSS 3.0 SDK featuring mysterious components and critical code snippets is titled "Implementing Lazy Garbage Collection." At the time of this writing, this section contained references to another mysterious Utility class with DirFromSiteId and FileFromBlobid methods as well as an incorrect assignment of Directory.GetFiles results to a FileInfo array, but let's not be too demanding on WSS 3.0 documentation quality. The DirFromSiteId and FileFromBlobid helper methods reveal their purpose through their names and the incorrect FileInfo array is easily replaced with a string array, or you can replace the Directory.GetFiles method with a call to the GetFiles method of a DirectoryInfo object. The Garbage Collector sample program in the companion material uses the DirectoryInfo approach and follows the suggested sequence of steps for garbage collection.
An important deviation of the Garbage Collector sample from the SDK explanations concerns the handling of timing conditions. This is a critical issue because timing conditions can lead to misidentification and deletion of valid files during garbage collection. Take a look at Figure 6, which illustrates the WSS 3.0 SDK-recommended approach to determine orphaned files by enumerating all BLOB files in the EBS store and then removing all those references from the BLOB list that are still in the content database as indicated through the site's ExternalBinaryIds collection. The remaining references in the BLOB list are supposed to indicate orphaned files that should be deleted.
Figure 6 Misidentification of a valid BLOB as orphaned due to a timing condition
However, the EBS provider must, of course, first finish writing BLOB data before it can return a BLOB ID to SharePoint. Depending on network bandwidth and other conditions, I/O performance can fluctuate. So, there is a chance that the EBS provider could create a new BLOB, which then appears in your BLOB list, but finish writing the BLOB data only after you have determined the ExternalBinaryIds, so the BLOB ID is not yet present in this collection. Accordingly, the reference to the new BLOB remains in the orphaned BLOB list and if you purge the orphaned BLOBs at this point, you accidentally delete a valid content item and lose data! In order to avoid this problem, the sample Garbage Collector checks the file creation time and adds only those items to the BLOB list that are more than one hour old.
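The orphan check with its age guard can be sketched as a small, pure function. This is a C++ illustration of the idea, not the Garbage Collector sample from the companion material: BlobFile and find_orphans are hypothetical names, and the sketch assumes that the caller has already enumerated the store files (with their creation times) and has read the referenced BLOB IDs from the ExternalBinaryIds collection.

```cpp
#include <chrono>
#include <set>
#include <string>
#include <vector>

using Clock = std::chrono::system_clock;

// One file found in the external BLOB store (hypothetical record type).
struct BlobFile {
    std::string blob_id;        // assumed to be recoverable from the file name
    Clock::time_point created;  // file creation time
};

// Returns the BLOB IDs that look safe to delete: files older than min_age
// whose IDs are not referenced by the content database.
std::vector<std::string> find_orphans(
    const std::vector<BlobFile>& store_files,
    const std::set<std::string>& referenced_ids,
    Clock::time_point now,
    std::chrono::minutes min_age = std::chrono::minutes(60))
{
    std::vector<std::string> orphans;
    for (const BlobFile& file : store_files) {
        // Skip young files: the provider may still be writing them, so their
        // IDs might not have reached the content database yet.
        if (now - file.created <= min_age) continue;
        if (referenced_ids.count(file.blob_id) == 0) {
            orphans.push_back(file.blob_id);
        }
    }
    return orphans;
}
```

The one-hour default mirrors the threshold described above. A file that is both young and unreferenced is simply left alone and reconsidered on the next garbage collection run.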
Conclusion
By integrating an external storage solution with SharePoint, you can increase storage efficiency, system performance, and scalability of a SharePoint farm. Another advantage is that this forces users to go through SharePoint to access unstructured content. Pulling data out of the content databases via direct SQL Server connections only yields binary BLOB identifiers instead of the actual files. However, EBS providers also have drawbacks due to the complexity of maintaining integrity between metadata in the SharePoint farm's content databases and the external BLOB store.
In order to integrate SharePoint with an external storage solution, you must build an EBS provider, which is a COM server that implements the ISPExternalBinaryProvider interface with its StoreBinary and RetrieveBinary methods. You can create unmanaged and managed EBS providers, but be aware of performance and compatibility issues if you decide to use managed code. Also keep in mind that the ISPExternalBinaryProvider interface does not include a DeleteBinary method. You must explicitly remove orphaned BLOBs through lazy garbage collection, and be careful to avoid timing conditions that can lead to the accidental deletion of valid BLOB items.
Pav Cherny is an IT expert and author specializing in Microsoft technologies for collaboration and unified communication. His publications include white papers, product manuals, and books with a focus on IT operations and system administration. Pav is President of Biblioso Corporation, a company that specializes in managed documentation and localization services.