Sunday, April 6, 2014

Content Server Search Architecture and Scaling

Back End Scaling – Admin server


The Livelink Admin server can be scaled to improve performance and task distribution for high-volume requirements.

Search architecture can be broken down into three parts:

  • Data Flow – It consists of Extractor, IPool, DCS. It also contains Classification Importer and Prospector Importer
  • Search Grid – Update Distributor, Index Engine, Index, Search Engine, Search Federator. This is the scalable part of search architecture.
  • Search manager – Search Federator, Search Manager

[Figure: Search Architecture]


Scalability can be achieved by adding more Admin servers, Search Federators and partitions. A diagram showing all three added to the architecture is shown below.


[Figure: Scaled Search Architecture]



Front End Scaling – Web server

Horizontal scaling


In horizontal scaling, installations are spread across different physical servers. If one server goes down, the others can still share the load and keep the application up and running.


[Figure: Horizontal Scaling]

Vertical scaling

In vertical scaling, one physical server hosts two or more installations. This makes better use of the server’s resources and also reduces hardware cost, since two or more installations run on the hardware of one.

[Figure: Vertical Scaling]


Ideally, a combination of horizontal and vertical scaling is the best solution.

[Figure: Cluster Architecture (Horizontal and Vertical scaling)]




Search IPC

This post deals with how the various components of Livelink search interact with each other.

Grid Registry

The Grid Registry is a process that other components use to coordinate their use of RMI. On startup, each component reports its RMI port usage to the Grid Registry and obtains the RMI ports of other processes from it. The processes then use RMI directly, without further involving the Grid Registry.
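The register-then-look-up pattern described above can be sketched in a few lines of Python. This is only an illustration: the class `GridRegistry`, its methods, the component names and the port numbers are all hypothetical and do not correspond to the real OpenText implementation or to its RMI interfaces.

```python
class GridRegistry:
    """Toy stand-in for the Grid Registry: maps component names to ports."""

    def __init__(self):
        self._ports = {}

    def register(self, component, port):
        # Called once by each component at startup.
        self._ports[component] = port

    def lookup(self, component):
        # Returns the port of another component, or None if it never registered.
        return self._ports.get(component)


registry = GridRegistry()
registry.register("search_federator", 8500)  # hypothetical port numbers
registry.register("search_engine_1", 8501)

# The federator asks the registry once, then talks to the engine directly.
engine_port = registry.lookup("search_engine_1")
```

After the lookup, the registry plays no further part in the conversation, which is the point the paragraph above makes.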

Inter-Process Communication

Internal RMI (Remote Method Invocation) Communication

The Search Federator uses the RMI protocol to communicate with the Search Engines, and the Update Distributor uses the RMI protocol to communicate with the Index Engines.

Internal Socket Connections

Beginning with SE10 Update 4, sockets can be used as an alternative to RMI. This method is used by Content Server 10.

The direct socket approach uses fewer threads than the equivalent RMI configuration. With direct socket connections, the RMI Grid Registry is not required.



Wednesday, April 2, 2014

Search Process


[Figure: Livelink Search Process]



Search Federator:


The process involved is “otsearchfederator”. The Search Federator is responsible for:
  • Accepting a search query from Content Server and sending it to the search engines
  • Collecting and sorting the results received from the search engines
  • Removing duplicate entries from the search results
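The three responsibilities above can be pictured with a short Python sketch. It is an assumption-laden illustration, not the real federator: the function `federate`, the `(object_id, score)` result shape and the rule of keeping the highest-scoring duplicate are all invented for this example.

```python
def federate(results_per_engine):
    """Merge per-partition result lists into one ranked, de-duplicated list.

    Each result is an (object_id, score) tuple; both fields are assumptions.
    """
    best = {}
    for results in results_per_engine:
        for object_id, score in results:
            # Keep only the highest-scoring copy of a duplicated object.
            if object_id not in best or score > best[object_id]:
                best[object_id] = score
    # Sort by descending score, breaking ties by object id for stable output.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


partition_1 = [(101, 0.9), (102, 0.4)]
partition_2 = [(102, 0.6), (103, 0.7)]
merged = federate([partition_1, partition_2])
```

Here object 102 is returned by both partitions but appears only once in `merged`.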


Search Engine:


The Search Engine runs queries against the search index. The process involved is “otsearchengine”. Each search engine carries out the search in its own partition. Search engines are also involved in building facets and computing position information for highlighting search results.
OT does not reveal the algorithm followed by search engine to carry out its activity.


Search Index:

  • Consists of several partitions, each having an index engine and one or more search engines
  • An object’s content data is stored on disk, while its metadata is stored in RAM (for faster retrieval)
  • Physically, index is stored in OTHOME\index\enterprise\index1 (index1 for first partition) and IPool is stored in OTHOME\index\enterprise\data_flow
  • VerifyIndex tool can be used to check if the index is corrupt [enterprise data source-> maintenance -> verify contents of the index]
  • Partitions can be Read/Write, Update-Only or Read-Only. Changing a partition’s mode from Read/Write to Read-Only or Update-Only also moves the metadata from memory to disk.


Partitions: The search index can be divided horizontally into several pieces called “partitions”.
SE10 can often provide better indexing or searching performance by distributing operations across multiple partitions. These partitions can be run on separate physical or virtual computers or CPUs to improve performance.


[Figure: Partition Implementation Overview]



Every partition is a self-contained subset of the search index. Every one of them has its own index files, a Search Engine, and an Index Engine. The partitions are tied together by the Update Distributor (for indexing) and by the Search Federator (for queries).
Each partition is relatively independent of the other partitions in the system during indexing. If one partition is given an object to index, the other partitions are idle. The Update Distributor can distribute the indexing load across multiple partitions. For systems with high indexing volumes, using multiple partitions this way can help achieve higher performance, since partitions can be indexing objects in parallel.
A search query normally is serviced by all partitions. Only partitions containing matches to the query will return results. The Search Federator will blend results from multiple partitions into a consolidated set of search results.

  • Update-Only Partitions - In this mode, the partition will not accept new objects to index, but it will update existing objects or delete existing objects. If a partition is marked as Update-Only, then the Update Distributor will not send it new objects.
  • Read-Only Partitions - In this mode, the partition will respond to search queries, but will not process any indexing requests. Objects in the partition cannot be added, removed or modified.
  • Read-Write Partitions - For completeness, the normal mode of operation for a partition is “Read-Write” mode. In this mode, the partition will accept new objects, can delete objects and update objects.
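The three modes above boil down to a small table of which indexing operations a partition accepts. The sketch below encodes that table in Python; the names `Mode`, `ALLOWED` and `accepts`, and the operation strings, are hypothetical and chosen only for this illustration.

```python
from enum import Enum


class Mode(Enum):
    READ_WRITE = "Read-Write"
    UPDATE_ONLY = "Update-Only"
    READ_ONLY = "Read-Only"


# Indexing operations each mode accepts, as described in the list above.
ALLOWED = {
    Mode.READ_WRITE: {"add", "update", "delete"},
    Mode.UPDATE_ONLY: {"update", "delete"},
    Mode.READ_ONLY: set(),
}


def accepts(mode, operation):
    """Return True if a partition in `mode` would process `operation`."""
    if operation == "search":
        return True  # search queries are serviced by every partition
    return operation in ALLOWED[mode]
```

For example, `accepts(Mode.UPDATE_ONLY, "add")` is false, which mirrors the rule that the Update Distributor does not send new objects to an Update-Only partition.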

Index server:


The main process for the index server is otdb. Child processes of otdb are:
  1. Otupbld – index building process (adding, updating and deleting)
  2. Otmrg – merges many smaller indexes into one large index
  3. Otcomp – compacts the index when 30% of it has been deleted
  4. Mltcon – multiple connection manager


The Otsumlog utility can be used to generate separate log files for each child process of the index server (otdb).



Search Indexing Process




[Figure: Indexing in Livelink Search]



The Extractor:


“Indexupdate”, the Extractor process, calls the index.update request handler (function) in the llserver process. It is the llserver process that performs the actual extraction.
Indexupdate.exe is run by the Admin process every minute.
The Extractor receives metadata from the database and content from file storage, packs them into an IPool and sends it to the Document Conversion process.
After thread logging is enabled, the index.update function shows the objects being extracted.

There are two modes of extraction:

Full extraction mode- Used when building the index for the first time. The Extractor traverses the DTree table from the highest dataid to the lowest, creating IPool messages. After DTree, it uses DTreeNotify and WIndexNotify to get the most recent data indexed first.
Incremental extraction mode- Picks up new items as they are added and runs every minute. In this mode, the DTreeNotify and WIndexNotify tables are used for new updates.
Extraction can be captured by enabling the thread log and connect log on the Admin server hosting the Extractor process.
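As a rough mental model of the two modes, the sketch below shows the order in which objects would be turned into IPool messages. It is a simplification under stated assumptions: plain Python lists stand in for the DTree table and for the notify tables, and the function name `extraction_order` is hypothetical.

```python
def extraction_order(dtree_ids, notify_ids, full):
    """Return object ids in the order they would become IPool messages."""
    order = []
    if full:
        # Full mode: walk DTree from the highest dataid to the lowest.
        order.extend(sorted(dtree_ids, reverse=True))
    # Both modes then work through the pending notification entries.
    order.extend(notify_ids)
    return order


full_run = extraction_order([5, 20, 11], notify_ids=[21], full=True)
incremental_run = extraction_order([5, 20, 11], notify_ids=[22, 23], full=False)
```

In the incremental run the DTree ids are ignored entirely, since only the notify entries describe what changed.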


Fun fact:
If the following query returns “-1”, the Extractor is in Incremental Extraction mode:
SELECT IniValue FROM KIni WHERE IniSection='OTIndex';

Logging Extractor:
Extraction is captured in logs by enabling Livelink thread logs and Connect logs on Livelink Admin server hosting the Extractor process.
Enter the following in opentext.ini file:

Debug=2                                  /* for thread logs*/
wantLogs=TRUE                       /* for connect logs*/
wantTimings=TRUE
wantVerbose=TRUE
wantDebugSearch=TRUE

Restart the services.
(Note: Connect logs increase in size very rapidly. Use them only when necessary and disable them after use.)


Document Conversion Services:


DCS is a set of processes and services responsible for preparing data prior to indexing. DCS performs tasks such as managing the data flows and IPools during ingestion, extracting text and metadata from content. It is responsible for retrieving indexable content out of objects in an IPool message and passing them onto Update Distributor. It’s a scheduled process that starts every minute.
The otdoccnv process is run by the Admin server every minute. Two instances of otdoccnv run for every data source: a master and a slave.
Master- reads documents from IPool#1, sends them to the slave for conversion, reads the result back from the slave and writes it to IPool#2.
Slave- converts the content to HTML format.
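The division of labour between master and slave can be sketched as a tiny pipeline. Everything here is a stand-in: the deques play the role of IPool#1 and IPool#2, the message shape (`id`, `content`) is assumed, and `slave_convert` merely wraps escaped text in HTML rather than performing real document conversion.

```python
import html
from collections import deque


def slave_convert(raw_text):
    """Stand-in for the slave: wrap escaped text in minimal HTML."""
    return "<html><body>" + html.escape(raw_text) + "</body></html>"


def master_run(ipool_in, ipool_out, convert=slave_convert):
    """Stand-in for the master: drain IPool#1, convert, write to IPool#2."""
    while ipool_in:
        message = ipool_in.popleft()
        # Copy the message, replacing only its content with the converted form.
        converted = dict(message, content=convert(message["content"]))
        ipool_out.append(converted)


ipool_1 = deque([{"id": 10, "content": "a < b"}])
ipool_2 = deque()
master_run(ipool_1, ipool_2)
```

The master never converts anything itself; it only moves messages between the two pools and delegates the conversion.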
Logging DCS:
Add the following line to the [FilterEngine] section of the opentext.ini file:
logfile=\livelinkInstall\logs\otdoccnv.log
This will generate two log files:
  • otdoccnv.log.master
  • otdoccnv.log.slave


Update Distributor:


The process involved is “otupdatedistributor”. As the name suggests, it distributes each new index update task to the appropriate index engine. The Update Distributor monitors the IPool directory for new indexing requests, determines which index engine should handle each request, and sends the request to that engine.
To select an index engine, the Update Distributor sends a message containing the object's key to all index engines to check whether any of them already holds the object. The engine that confirms it holds the object is given that IPool for writing, updating or deleting. If the object is not in any index engine, the IPool is given to one of them using the round-robin method. During allocation, the Update Distributor also checks the mode of each partition, such as Read/Write, Read-Only or Update-Only (we'll deal with partition modes in detail in the next pages). It also considers parameters such as the memory available to the partition.
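The selection logic described above (existing owner first, otherwise round-robin) can be sketched as follows. This is a minimal illustration under several assumptions: the classes `Partition` and `UpdateDistributor` and the mode strings are hypothetical, the memory check is left out, and the case of an existing object sitting in a Read-Only partition is not handled.

```python
from itertools import cycle


class Partition:
    def __init__(self, name, mode, keys=()):
        self.name = name
        self.mode = mode          # "read_write", "update_only" or "read_only"
        self.keys = set(keys)     # keys of objects already in this partition


class UpdateDistributor:
    def __init__(self, partitions):
        self.partitions = partitions
        # Only Read/Write partitions are eligible for brand-new objects.
        self._writable = [p for p in partitions if p.mode == "read_write"]
        self._next_writable = cycle(self._writable)

    def select(self, key):
        # Step 1: ask every partition whether it already holds the object.
        for partition in self.partitions:
            if key in partition.keys:
                return partition
        # Step 2: otherwise hand the object out round-robin.
        if not self._writable:
            raise RuntimeError("no Read/Write partition available")
        partition = next(self._next_writable)
        partition.keys.add(key)
        return partition


distributor = UpdateDistributor([
    Partition("p1", "read_write", keys={"doc-1"}),
    Partition("p2", "update_only", keys={"doc-2"}),
    Partition("p3", "read_write"),
])
```

With this setup, an update to "doc-2" goes to p2 because p2 already owns it, while new objects alternate between p1 and p3 and never land on the Update-Only partition.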

Update Distributor works in the following sequence:
1. Read from LLHome\config\search.ini file
2. Contact RMI Registry Server
3. Contact the index engine
4. Start to process IPool messages

Update Distributor is responsible for rolling back transaction if indexing of any IPool fails.
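The all-or-nothing behaviour described above can be illustrated with a small sketch. It assumes, purely for illustration, that the index is a Python dictionary and that an object whose content is None cannot be indexed; the function name `index_ipool` is hypothetical.

```python
def index_ipool(index, ipool):
    """Apply every (key, content) pair in `ipool`, or none of them."""
    snapshot = dict(index)  # remember the state before the transaction
    try:
        for key, content in ipool:
            if content is None:
                raise ValueError(f"cannot index object {key!r}")
            index[key] = content
    except ValueError:
        # Roll back: restore the index exactly as it was before this IPool.
        index.clear()
        index.update(snapshot)
        return False
    return True


index = {"doc-1": "hello"}
ok = index_ipool(index, [("doc-2", "world"), ("doc-3", None)])
```

Although "doc-2" was written before "doc-3" failed, the rollback removes it again, so the index never shows a half-applied IPool.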

Logging Update Distributor:
System Object Volume -> Enterprise Data Source -> Enterprise Data Flow Manager -> Functions menu for Enterprise Update Distributor -> Properties -> Advanced Settings

Index Engine:

The Index Engine (otindexengine) adds, updates and deletes objects in the search index. It accepts requests from the Update Distributor. Each partition has at least one index engine.
The ipupdate process is run by the Admin server. This process separates content from metadata. Metadata is saved in a *._d_ temp file, while content is stored in a *.urn temp file. For example, 10._d_ is created for metadata and 10.urn for content.

Logging Index Engine:
Index engine logs can be enabled from: Administration Page -> System Object Volume -> Enterprise Data Source -> Enterprise Data Flow Manager -> Functions menu for Enterprise Update Distributor -> Properties -> Index Engine
Ipupdate logs can be viewed at this location: Enterprise\index\ipupdate.log

IPool Structure:


IPool (Interchange Pool) is used to encapsulate data for indexing. IPool in simplest form consists of two directories. Enterprise Data Flow (EDF) process is responsible for writing to and reading from IPool directory.
It can be viewed from Enterprise Data Flow Manager page.



Tuesday, April 1, 2014

OpenText Content Server Search


Recently, while working on Content Server, I came across several search-related issues. With time and a lot of hard work I was able to resolve all of them, and in the process I got a pretty good picture of how OpenText’s search engine works. Hence I thought of writing this blog for anyone facing similar issues, or for anyone who just wants to learn more about its search engine.

Search Engine 10.0 (SE10) is the search engine provided as part of OpenText Content Server.

The entire Search Engine comprises two main flows – Indexing and Searching.

Indexing starts with extraction of data from content server and storing it in Search Index. Searching flow forms a cycle, starting with a user entering search query, the query being searched in search index and the result being displayed to user.



[Figure: Livelink Search Overview]




Search Engine 10 consists of the following main components:
  1. Extractor – extracts new data and adds them to IPool
  2. IPool – contains content and metadata in object form for addition to search index
  3. Update Distributor – selects index engine for addition of IPool to search index
  4. Index Engine – carries out indexing operation
  5. Search Index – contains index on which search is carried out
  6. Search Federator – receives new search request and passes them to search engine
  7. Search Engine – searches the string provided by search federator in search index
  8. Grid Registry – Grid registry is used by SE components to coordinate their use of RMI
We'll be looking into all of these in detail in the next pages. For now, let's see some advantages of this search engine.


Advantages of SE10 are:

  • Scalability: The Search Grid part of SE can be easily scaled by adding more Admin servers to distribute tasks and improve overall search performance. SE10 can be restructured to add capacity, rebalance the distribution of objects across servers, switch partition modes and add or remove metadata fields.
  • Upgrade Migration: SE10 includes conversion of older indexes to newer versions. Hence, addition of new features and capabilities does not require re-indexing of data.
  • Transactional Capability: If a catastrophic outage happens in the midst of a transaction, the system can recover without data corruption. Additionally, logical groups of objects for indexing can be treated as a single transaction, and the entire transaction can be rolled back in the event that one object cannot be handled properly.
  • Metadata Updates: The OT search technology has the ability to make in-place updates of some or all of the metadata for an object.
  • Search-Driven Update: SE10 can perform bulk operations of modification and deletion on sets of data that match search criteria. This allows for very efficient index updates for specific types of transactions.
  • Maintenance Commitment: OT supports search solution throughout the life of their ECM application including regular updates.
  • Data Integrity: Search Engine 10 has several features to allow quality, consistency and integrity of search index and the data to be assessed.