In a production (or even stable) ESP environment, it is difficult to change the Document Processing Pipeline and test the change without wiping out the existing collection; a full re-crawl also takes considerable time if the collection is even moderately large. In this case, the best option is to use postprocess to feed the existing documents into a new (empty) collection.
Making a duplicate collection provides several benefits:
- No re-crawling is required
- The original collection is not affected by pipeline changes
- You can test your new collection without touching the stable data
- Once you determine that your changes produce good results, you can easily migrate your front-end to the new collection while keeping the stable data in the original collection (in case you want to revert your changes)
Steps to make a duplicate collection
- Using the ESP Admin GUI, create a new collection with the pipeline you would like to use (or test, as the case may be)
- Do not specify any data sources when configuring the new collection
- Stop the Enterprise Crawler:
$FASTSEARCH/bin/nctrl stop crawler
- Run the following command, where origcollection is the original collection and newcollection is the new collection you just created:
$FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection
Notes about this command:
- The default specified above is a content feeding destination, as defined in the destinations section of $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. Specifying default sends the content to the current ESP install.
- Be sure to run the above command under either nohup or screen, as it will not exit until all content has been fed to the new collection; for large collections this may take a while.
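For example, the refeed can be launched under nohup as follows (the log file name is illustrative; the collection names are the ones from the command above):

```shell
# Run the refeed in the background so it survives a dropped terminal
# session; all output is captured in refeed.log for later inspection.
nohup "$FASTSEARCH/bin/postprocess" -R origcollection -k default:newcollection \
    > refeed.log 2>&1 &
echo "refeed started as PID $!"

# Alternatively, run it inside a named screen session you can re-attach to:
#   screen -S refeed $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection
```

You can then follow progress with tail -f refeed.log and disconnect from the terminal safely.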
- Restart the Enterprise Crawler:
$FASTSEARCH/bin/nctrl start crawler
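Putting it all together, the steps above can be sketched as a single script. The install root, collection names, and the DRY_RUN convention are assumptions for illustration; set DRY_RUN= (empty) to execute the commands for real:

```shell
#!/bin/sh
set -eu

DRY_RUN=${DRY_RUN-echo}              # default: only print the commands
FASTSEARCH=${FASTSEARCH-/opt/fast}   # assumed ESP install root
ORIG=origcollection                  # existing collection
NEW=newcollection                    # new, empty collection using the new pipeline

# 1. Stop the crawler so no new content arrives during the refeed
$DRY_RUN "$FASTSEARCH/bin/nctrl" stop crawler

# 2. Refeed every document from the original collection into the new one
$DRY_RUN "$FASTSEARCH/bin/postprocess" -R "$ORIG" -k "default:$NEW"

# 3. Start the crawler again
$DRY_RUN "$FASTSEARCH/bin/nctrl" start crawler
```

In dry-run mode the script simply previews the three commands, which is a convenient way to double-check collection names before committing to a long refeed.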