# Architecture overview

ADP Search

The search functionality in ADP is based on Azure Cognitive Search Services and uses Apache Lucene
for full-text searching.

Processing a full-text search consists of the following stages:

Query parsing
Lexical analysis
Document retrieval
Scoring results

Architectural View

# Query Parsing

The Query parser is responsible for pasring queries and is a selectable part of the search solution. For ADP purposes, we chose Default query parser, which supports simple query syntax. Simple query processing does not support either fuzzy, wildcard or regex syntax.

Query parser deconstructs a given query into subqueries. ADP Search supports the following subquery types:

Term query - for standalone search phrases, such as "factory"
Phrase query - for two or more word terms, such as "factory proxy"

In ADP Search, all query terms are always quoted, which causes the query to appear in searched documents
almost in the same form as was typed. Almost - because the proximity between words do not need to be immediate and words in search can be separated by other "non-essential" words.

For example:

If we want to find a phrase containing "Factory proxy", what should we expect in response?

"Factory proxy", "factory-proxy", "factory and proxy", "factory, proxy" and so on.

We definitely wouldn't receive:

"fact. proxy", "factory pro." and so on.

Assuming that we put "Factory proxy" at the end of query parser work, the following "Query tree" arises.

Query Tree

# Lexical analysis

The Lexical analyzer processes text included in term and phrase queries and changes them to tokenized terms.
The most common form of lexical analysis is linguistic analysis, specific for a given language.

The linguistic rules specific for English (the only language used in ADP):

Reducing a query term to the root form of a word
Removing "the", "and", "or", "-" and other non-content/function words
Breaking composite words into component parts ("multitenant" and "multi tenant" are linguistic equivalent)
Uppercase words are changed to lowercase

ADP uses the Standard Lucene text analyzer with lexical rules described in Unicode Text Segmentation. The result of lexical analysis is "Improved Query tree", which is sent to the Search Engine to process with documents retrieving. "Improved Query tree" looks like:

Improved Query Tree

# Document retrieval

This is the place where Search Index appears on stage. Primarily, let's clear up document indexing. A Search Index structure consists of fields that can be searchable, filterable, sortable and so on. Fields are filled based on data source content.

Blob storage

ADP Search stores documents inside blob storage. Those documents are appropriately transformed ADP pages.

Index fields preserve body content and meta tags of html files.
The unit of storage is an inverted index, one for each searchable field. Within an inverted index is a sorted list of all terms from all documents. Each term maps to the list of documents in which it occurs. To produce the terms in an inverted index, the search engine performs a lexical analysis over the content of documents, similar to what happens during query processing:

text inputs are passed through an analyzer, changed to lowercase, stripped of punctuation, and so forth
(depending on the analyzer configuration)
tokens are the output of lexical analysis.
terms are added to the index.

Main Index fields defined for ADP Search:

keywords of document (meta tag)
title of document (meta tag)
description of document (meta tag)
content of document

The same Lexical analyzer is used at the time of search index building and search query execution in order to maximize the searching efficiency.

Inverted index for field Content may looks like:

Term	Documents
factory	UUID=1, UUID=3, UUID=5
proxy	UUID=1, UUID=5

Inverted index for field Description may looks like:

Term	Documents
factory	UUID=6, UUID=9
proxy	UUID=6, UUID=8

# Matching query terms against indexed terms

Matching is processed for each searchable field.

For term query, matching is simple - we retrieve all documents, for term matched.
For phrase query, the phrase is divided into terms. Only documents with matching terms are considered. At the end, the proximity of terms is checked, as described above.

Let's assume, that documents with UUID=1, UUID=5 and UUID=6 meet the conditions of closeness. Still the open question is, which document will be displayed as first. To answer this question, we have to pass through the Scoring stage.

# Scoring

Every document in a search result set has a relevance score assigned. The purpose of the relevance score is to rank higher those documents that best answer a question, defined by the search query. The score is computed based on statistical properties of terms that matched. At the core of the scoring formula is TF/IDF (term frequency-inverse document frequency). In queries containing rare and common terms, TF/IDF promotes results containing the rare term. Also important thing is weight of each field, defined in scoring profile definition. Search term found in more weighted fields are more important than the same term, found in less weighted field. If a term is "more important", then obtains better score.

ADP uses scoring profile with the following definition:

Field name	Weight
keywords	5.0
description	2.5
title	1.5
content	1.0

According to the above, answering the question, what will be the order of documents in our search result:

UUID=6
UUID=5 ("proxy" is less popular than "factory")
UUID=1

# Search performance tuning

Azure Search offers additional functionalities, which help the user to be more effective at searching.
All of them supplement core search functionality and don't replace it.
The most important are:

Filters - for limiting the scope of searched documents. Utilized at core search.
Suggestion API - to deliver the context of searched terms/phrases. Used before core search.
Autocomplete API - to finish a partially typed query input using existing terms in the search index. Used before core search.

ADP Users

We are exposing Filters and Autocomplete API to ADP users.

# Filtering

Available filters correspond to ADP content structure and include the following scopes:

Platform
Cloud
Devices
API
Edge
Operations
Releases

All of the above are selected by default, but you can mix them however you like. Deselecting all and choosing something else narrows down the scope of searching.

# Autocomplete

Autocomplete API is activated after 0.5 seconds of inactivity when typing your search query. The number of
suggested terms/phrases is limited to 5.

Autocomplete mode is set to oneTermWithContext, which means that suggested phrases can consist of two terms maximum.

Autocomplete Example

Recommendation

Using Autocomplete is not mandatory but is recommended.

# Highlighting

Comfortable viewing

Additional custom functionality that makes viewing ADP content more comfortable is Highlighting
terms/phrases in viewed ADP pages.

Highlighting can work in two modes:

automatically enabled after selecting destined page from search results.
manually enabled by typing appropriate formatted URL in browser. Is enabled in context of page, indicated in URL.

To manually enable highlighting, use the following URL:

https://clientsuccess.ability.abb/<address-of-page>?hl=<search-phrase>

# Promoting articles

We encourage you to use static search phrases, which are available from the search drop-down list. That's how we are going to expose the most recent or most interesting content.

Author: Dariusz Pawlak

← Ability Azure Inventory (AAI)