Discovery Search Engine

by Dale Torres

Have you ever wondered if something you needed was available? Is a table of contents or an index enough to figure out if what you need is in a document? What questions do you ask to determine if that module is one you could use? If you have ever asked these questions or others like them, read on.

This article describes some of the technology that was used in the Discovery Search Engine developed by IBM Santa Teresa. A subset of the Discovery Search Engine is used in the Developer Connection CD-ROM.

Today
Different types of search technologies are used for different types of work.

You can spend a lot of time browsing source (text or otherwise) to determine what information is available, how something works, or whether you can reuse some or all of what you are looking at. To find things, a simple search mechanism performs a pattern match against what you are looking for by scanning the source. There might be a few optional features, but generally browsers or editors are good only for a particular source with a particular tool.

The rest of this article describes some of the search technologies available today to help you find words, source code, test cases, or other items.

Free Text Search
Free text search engines let users pick their own search terms, as opposed to relying on a sometimes arbitrary table of contents or index. Most search engines allow for a free text search, based on a technology that scans the source file. An index file is built that contains the significant words found in the source file, along with offsets when duplicates are found. The search engine uses this file to locate items of interest. An optional feature might let the file creator ignore noise words such as "a", "to", or "the".
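The index-building step just described can be sketched in a few lines. This is a minimal illustration, not the Discovery Search Engine's actual implementation: it scans a source text, skips noise words, and records the byte offset of every occurrence of each remaining word.

```python
import re

# Illustrative noise-word list; a real engine would let the file
# creator supply this.
NOISE_WORDS = {"a", "to", "the"}

def build_index(text):
    """Map each significant word to the list of offsets where it occurs."""
    index = {}
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word in NOISE_WORDS:
            continue  # ignore noise words entirely
        index.setdefault(word, []).append(match.start())
    return index

source = "To search the source, the engine scans the source once."
index = build_index(source)
print(index["source"])  # every offset where "source" appears
```

Duplicates cost only an extra offset in the list, which is why such an index stays much smaller than the source it describes.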

When compiling such an index for multiple source files, an additional piece of information is needed: an indicator that identifies which information is found in which source file. An additional header is appended, and the extra indicators help the search engine identify which source files might contain the desired information. The index file is now a bit larger, perhaps by up to 15%, and this type of search also might take longer.
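The multi-file case amounts to adding one more level to each index entry, so a word maps to the files that contain it and then to the offsets within each file. Again, this is an illustrative sketch with made-up file names, not the engine's own format.

```python
import re

NOISE_WORDS = {"a", "to", "the"}

def build_multi_index(files):
    """files: dict of filename -> text.
    Returns word -> {filename: [offsets]} so a search can first narrow
    down which source files might hold the word."""
    index = {}
    for name, text in files.items():
        for match in re.finditer(r"\w+", text):
            word = match.group().lower()
            if word in NOISE_WORDS:
                continue
            index.setdefault(word, {}).setdefault(name, []).append(match.start())
    return index

files = {
    "parser.c": "parse the input buffer",
    "buffer.c": "allocate a buffer for input",
}
idx = build_multi_index(files)
print(sorted(idx["buffer"]))  # the files that mention "buffer"
```

The per-file indicators are what make the index grow, which matches the size penalty mentioned above.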

Keyword Search
Keyword search engines provide a different type of capability for the end user. By associating source information with selected keywords, the user can obtain information quickly and easily. This association of keywords allows for combinations not available with free text searches and is very useful when there are many source files to be examined for specific information. The use of Boolean expressions is a valuable part of this search technique: operators such as AND, OR, and NOT combine keywords into extended search arguments. Faceted arguments may also be used, as well as additional qualifiers. Sounds pretty good so far, doesn't it?
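Boolean combination of keywords boils down to set operations over the files tagged with each keyword. The sketch below uses invented file names and keywords purely to show the idea:

```python
# Each source file carries a set of keywords assigned by its creator.
keywords = {
    "sort.c":   {"sorting", "arrays", "performance"},
    "search.c": {"searching", "arrays"},
    "io.c":     {"files", "performance"},
}

def files_with(kw):
    """All files tagged with a given keyword."""
    return {f for f, kws in keywords.items() if kw in kws}

# "arrays AND performance" -> set intersection
both = files_with("arrays") & files_with("performance")
# "arrays OR files" -> set union
either = files_with("arrays") | files_with("files")
# "performance NOT arrays" -> set difference
not_arrays = files_with("performance") - files_with("arrays")

print(sorted(both), sorted(either), sorted(not_arrays))
```

A relational implementation does the same thing with join queries over keyword tables, which is where the maintenance and speed problems discussed next come from.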

There are several areas that limit keyword search. Many of these keyword search engines are implemented on a relational database, with a great number of tables that need to be maintained. They can also be terribly slow. As many of you probably already know, when you are looking for something and you are distracted or annoyed (in this case by slow response time), the tool isn't very valuable.

Artificial Intelligence
Artificial Intelligence (AI) allows for new, additional opportunities. It allows for specialized searches, such as finding elements that fall into a special category (parameter values, for example). AI provides capabilities for not only free text searches and keyword search arguments but also specific context searches within facets. Say you wanted to know which elements within a library contain length fields of less than 1,000 bytes, or, when that condition is satisfied, to list all associated components called by those modules. AI allows for wild-card searching as well. AI is beginning to look pretty good.
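The length-field example above can be sketched as a faceted query over a small catalog. The catalog, its module names, and its fields are all invented for illustration; a real system would derive them from the library itself:

```python
import fnmatch

# Hypothetical catalog of library elements with a length facet and a
# list of called components.
catalog = [
    {"name": "PARSEMOD", "length": 512,  "calls": ["TOKENIZE", "ERRLOG"]},
    {"name": "SORTMOD",  "length": 2048, "calls": ["COMPARE"]},
    {"name": "IOMOD",    "length": 800,  "calls": ["ERRLOG"]},
]

# Facet query: elements with length fields under 1,000 bytes.
small = [e for e in catalog if e["length"] < 1000]

# When the condition is satisfied, list the components they call.
called = sorted({c for e in small for c in e["calls"]})
print([e["name"] for e in small], called)

# Wild-card search over element names.
mods = [e["name"] for e in catalog if fnmatch.fnmatch(e["name"], "*MOD")]
```

The hard part is not the query itself but producing and maintaining the catalog, rules, and semantic nets behind it, which is exactly the cost discussed next.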

Of course, there is no free lunch with AI either. Someone has to generate the semantic nets and the rules, and someone has to decide a great deal besides. It can be a bit of a maintenance nightmare. Such an AI index may also be as large as 75% of the original library, a real space hog, and it is questionable when it comes to response time.

Structured Search
Structured searches let you navigate through the subdirectories and other similar structures that make up libraries. They present information that might not be available with the other search technologies, such as when a file was last updated, ownership, and security issues.

When combined with a keyword type of search, the results can be of immense benefit to a maintenance programmer who wants to make an enhancement to an existing application program.

Summary
What would be the optimum search tool? I propose combining the best of all those described: a keyword search capability combined with a structured approach that includes the ability to search for intelligent answers, such as those sought with AI. It would also be lightning fast, build an index smaller than 10% of the source, and provide an automated method of accumulating the keywords, the structure, the rules, and the semantic nets, without the need for human intervention. It would have provisions for self-maintenance as well as the capability of being tailored to suit individual needs.

The prototype of such a tool is used in The Developer Connection Browser, providing the ability to search through the volumes of technical documentation on the CD-ROM.