Using Advanced Database Crawler

Hello friends, recently I was working on one of the most popular shared source modules for Sitecore, Advanced Database Crawler (ADC), developed by Alex Shyba, and I must say it is awesome.
ADC uses the Lucene index as its base. Lucene is an open source search engine that Sitecore CMS uses for indexing and searching the contents of a web site. Sitecore implements a wrapper for the Lucene engine, and this wrapper has its own API.
The original API (Lucene.Net) and the Sitecore API (Sitecore.Search) are both accessible to developers who want to extend their indexing and search capabilities.

Note:
The Sitecore.Data.Indexing API was deprecated in Sitecore CMS 6.5 and will be completely removed in 7.0, so Advanced Database Crawler uses only the newer Sitecore.Search API.

If you have not used it yet, here are a few quick steps to set up the module and get search working.
First, download the source code and compile the project, which should give you Sitecore.SharedSource.Search.dll. Then add a reference to that DLL in your Sitecore project.

The next step is to set up the Lucene index configuration.

Advanced Database Crawler Configuration

Below is a brief description of each node within the configuration.

Index Element Description

Contains index definitions and their configuration settings.
The section under configuration where you define indexes. For example, ‘system’ is one index defined here.
Specify what type of analyzer to use by default in the indexes. An index can use a custom analyzer or refer to this default analyzer definition. Search indexes use the same analyzer both for indexed data (documents) and for the search queries.
Specify what type of analyzer to use by default in the indexes for categories in search results.
This section is used to categorize search results. It is used by the content tree search introduced in Sitecore 6. The search box is located right above the content tree in the Content Editor.
Specify which database you want to add to the index.
e.g.

It’s possible to have multiple locations for one index, and even content from different databases in the same index. Every child of the locations node has its own configuration for a particular part of the content. The name of a location node is not predefined, so you can name it any way you want. You can also define the following settings under locations:

  • Database: Specify which database you want to index, for example ‘master’.
  • Root: Specify the root node of the content tree to be included in the index. The indexing crawler will index content below this location.
  • Tags: You can attach a string tag to items from this location, making it possible to filter or categorize results during a search.
  • boost: Use boost to adjust the priority of results relative to results from other locations.
Tells the crawler to put a copy of all the fields in the item into the index. This makes fine-grained filtering possible but adds performance overhead. Default setting = true

    
        {GUID}
    

Explicitly includes the template for indexing. Provide the GUID of the template inside a tag with a unique name; the template name works well as the tag name, as long as it is unique.

    
        {GUID}
    

Explicitly excludes the template from indexing. Provide the GUID of the template inside a tag with a unique name; the template name works well as the tag name, as long as it is unique.

    {8CDC337E-A112-42FB-BBB4-4143751E123F}

If

 

is not added, then this explicitly includes the field for indexing. Provide the GUID of the field inside a tag with a unique name; the field name works well as the tag name, as long as it is unique.


    {8CDC337E-A112-42FB-BBB4-4143751E123F}

If

is not added, then this explicitly excludes the field from indexing. Provide the GUID of the field inside a tag with a unique name; the field name works well as the tag name, as long as it is unique.


     ……………

Custom field crawlers handle complex field types such as DropLink, DateTime, Date, and Number. If you create your own custom field type, you can also create a custom field crawler for it, like below.

    ……………

All the available Sitecore field types that need to be indexed are listed here. If a field type is not defined, the defaults storageType="NO", indexType="UN_TOKENIZED", vectorType="NO" and boost="1f" are applied.


<!-- Text fields need to be tokenized -->
    
    
    
    
    
    
    
<!-- Multilist based fields need to be tokenized to support search of multiple values -->
    
    
    
    
<!-- Legacy tree list field from ver. 5.3 -->
    

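To make the field type list more concrete, here is a minimal sketch of what such entries might look like. The wrapper element, its hint attribute and the two field type names are my assumptions, recalled from the sample configuration shipped with the module, so compare them with the config file in your own copy of the source. The attribute names are the ones described above.

```xml
<!-- Sketch only: wrapper element and hint value are assumptions, check your ADC sample config -->
<fieldTypes hint="raw:AddFieldTypes">
  <!-- Text fields need to be tokenized -->
  <fieldType name="single-line text" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
  <!-- Multilist based fields need to be tokenized to support search of multiple values -->
  <fieldType name="multilist" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
</fieldTypes>
```

Any field type you leave out of this list simply falls back to the defaults mentioned above.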

    ……………

If you want to add custom fields to the index, or dynamically add data to an existing field, you can use dynamic fields. Here type="Sitecore.SharedSource.Search.DynamicFields.MyField,Sitecore.SharedSource.Search" defines the class used to index the specific field. Inherit the class from BaseDynamicField and override the ResolveValue method, which should always return a string value.
Tells the crawler to subscribe to item changes and update the index automatically. Default setting = true
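As an illustration of the dynamic field setup described above, a registration might look roughly like the sketch below. The type value is the one quoted in the text; the wrapper element, the hint attribute and the field name "_myfield" are assumptions on my part and only meant as placeholders.

```xml
<!-- Sketch only: wrapper/hint and the "_myfield" name are hypothetical placeholders -->
<dynamicFields hint="raw:AddDynamicFields">
  <!-- MyField is expected to inherit BaseDynamicField and override ResolveValue -->
  <dynamicField type="Sitecore.SharedSource.Search.DynamicFields.MyField,Sitecore.SharedSource.Search"
                name="_myfield"
                storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
</dynamicFields>
```

The string returned by ResolveValue is what ends up in the index under the given field name.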

Here are a few quick pointers that will help you understand the features of Advanced Database Crawler:

  • Multiple indexes can be added inside the
    …………… 

    section.

  • Individual index configuration starts with tag
  • The id attribute is used to distinguish different indexes.
  • The parameter with desc="folder" defines the folder name for the index.
  • The following setting defines the locations for the index:

    – A location is simply a logical grouping of indexed data. You can use the location id when searching the index through the ADC API.

    • It’s possible to have multiple locations for one index, and even content from different databases in the same index. For example, you can keep news and blog articles under the same location, or use a separate location for each.
    • Every child of the locations node has its own configuration for a particular part of the content.
    • A name of location node is not predefined. You’re welcome to name it the way you want. For example:
      
             
                 ………    ………
             
      
      
  • Every location section has a Database section. It defines the database to index for the location.

    master
  • Every location section has a Root section. It defines the root node of the content tree to index for the location.

    /sitecore/content/home
  • Every location section has include and exclude template sections. Here you can list the templates whose items should be included in the index or excluded from it.

    
        {BDB6FA46-2F76-4BDE-8138-52B56C2FC47E}
    
    
    
        {BDB6FA46-2F76-4BDE-8138-52B56C2FC47E}
    
    

    Note:
    1) It does not make sense to use both of the above settings for one location. Use only one of them.
    2) Use different tag names for the include and exclude templates, otherwise only the last one will be picked up.

  • Every location section has a custom field crawlers section. Here we can write our own logic to index special or complex fields.

    
           
    
    

    Note: You can read more about Custom Field indexer in this post.

  • Every location section has a dynamic fields section. Using this section you can create custom fields in the Lucene index that are not created or crawled by default, and put custom values in them.
    Note:
    In the name attribute of a dynamic field (for example name="_content") you can provide the name of an existing field; the dynamic field's content is then appended to the existing field's values.

  • Every location section has a Tags section. Here you can tag indexed content and use the tag during the search procedure.

    taskoption
  • The last available tag under the location section is boost. Here you can boost indexed content relative to content that belongs to other locations.

    1.9
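Putting the pieces together, one complete location could be sketched as below. The values (master, /sitecore/content/home, the template GUID, taskoption, 1.9) are the ones used in this post. The node name "mycontent", the tag name "sampletemplate", the hint attributes and the crawler type string are assumptions based on the stock Sitecore.Search configuration and the module's assembly name, so verify them against the sample config in the ADC source.

```xml
<!-- Sketch of one location; "mycontent" is an arbitrary, hypothetical node name -->
<locations hint="list:AddCrawler">
  <!-- Crawler type string is an assumption, confirm it in the compiled module -->
  <mycontent type="Sitecore.SharedSource.Search.Crawlers.AdvancedDatabaseCrawler,Sitecore.SharedSource.Search">
    <Database>master</Database>
    <Root>/sitecore/content/home</Root>
    <!-- Use include OR exclude for a location, not both -->
    <include hint="list:IncludeTemplate">
      <!-- "sampletemplate" is a hypothetical tag name; it must be unique within the list -->
      <sampletemplate>{BDB6FA46-2F76-4BDE-8138-52B56C2FC47E}</sampletemplate>
    </include>
    <Tags>taskoption</Tags>
    <Boost>1.9</Boost>
  </mycontent>
</locations>
```

A second location for another part of the tree, or even another database, would simply be a sibling node with a different name.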

Troubleshooting

While setting up the index you may run into a few of the issues that I faced.
1. Remove unwanted results from search – Use <includetemplate> and <excludetemplate> to get results for items from specific templates.

2. While I was setting up the index on my production server, the index somehow was not getting built, even though I tried every bit of configuration I could think of. Finally I found the missing piece: enabling the history engine on the web database. By default the history engine is not enabled on the web database. You can enable it as follows.

<database id="web"
    
       
          
          30.00:00:00
       
    
    false
    …
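Spelled out in full, the history engine entry for the web database would look roughly like the sketch below. I have modeled the element names and the database attributes on the entry that a stock Sitecore 6.x web.config carries for the master database, so treat them as assumptions and copy from your own master database definition if they differ. The two values are the ones shown above.

```xml
<!-- Sketch: element names modeled on the stock master database entry (assumption) -->
<database id="web" singleInstance="true" type="Sitecore.Data.Database, Sitecore.Kernel">
  <!-- ...existing child elements of the web database stay as they are... -->
  <Engines.HistoryEngine.Storage>
    <obj type="Sitecore.Data.$(database).$(database)HistoryStorage, Sitecore.Kernel">
      <param connectionStringName="$(id)" />
      <EntryLifeTime>30.00:00:00</EntryLifeTime>
    </obj>
  </Engines.HistoryEngine.Storage>
  <Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
</database>
```

After adding it, rebuild the index once so the crawler starts from a clean state.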

3. Make the index work on a web farm – When you have more than one content delivery server, each server has its own separate index files. In that case, set the following property to true.
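A sketch of the change, assuming the property meant here is the standard Indexing.ServerSpecificProperties setting found in the settings section of web.config. Please confirm the name in your own config before relying on it.

```xml
<!-- Assumed setting name; check that it exists in your web.config before changing it -->
<setting name="Indexing.ServerSpecificProperties" value="true" />
```

With this enabled, each server should keep its own index bookkeeping instead of sharing one set of values across the farm.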


Still having trouble making indexing work? Let us know, or visit the detailed troubleshooting post by Alex here

Looking forward to your feedback

Friends, this being my first blog article, kindly post your comments, suggestions and questions.
Many more articles to follow.

Also, I would like to thank Kiran Patil, Parag Daraji and Brijesh Patel, who provided me with valuable feedback. Thank you guys 🙂
Keep visiting.