Chapter 6. Searching
Material in this chapter is taken from the following:
a. US Patent Application 20020046205, filed September 25, 2001, Harry George Direen, Jr., “Method of operating a hierarchical data document system having a duplicate tree structure”
b. Direen, Harry, Ph.D., “Duplicate Tree Structures in DPP Virtual Associative Memories.” NeoCore Technical Paper, Release 1.0, 10/17/2000.
c. Direen, Harry, Ph.D., “Virtual Document Ordering and Set Operations Using Couplet Hierarchical Vectors,” NeoCore Technical Paper, Release 1.1, September 6, 2001.
In Chapter 3 we explained the Tag Dictionary, the Data Dictionary, the Map Store and their relationships. Now we need to add six additional hard disk files. The six additional files provide for efficient searching of the database. The six new files are:
Figure 6-1 shows all nine files used to store material in the database and search for it. The arrows in Figure 6-1 represent pointers in one file that point to addresses in another file.
The three core indices are the associative memory structures we called “Catalogs” in the previous chapter. The difference between the three indexes is that the icon keys for the Tag Index are generated from tag sets such as “Telephone_book>Entry>Name>First>”; the keys for the Data Index are
generated from data items such as “John”; the keys for the Tag & Data Index are generated from a complete tag/data couplet, such as “Telephone_book>Entry>Name>First>John”. The Association values in the three indices are pointers. Where they point depends on how many lines in the Map Store pertain to a specific entry in the dictionaries. For example, if the database is storing only one line containing the data item “unique”, then the Association in the Data Index will be a pointer direct to the single line in the Map Store that was created when we put “unique” into the Data Dictionary. But suppose we are searching for information about “Smith”. The data item “Smith” is in the Data Dictionary only once; the dictionaries hold no duplicate entries. But since it is likely we stored documents containing “Smith” on many occasions, there will be quite a few lines in the Map Store which point to the “Smith” entry in the Data Dictionary. In this case, the Association in the Data Index points to a “duplicate index”—in this case the Data Duplicate Index. We will explain the structure and operation of duplicate indices in the next section. For now, it is enough to realize that there will be several pointers in the Data Duplicate Index which point to the several “Smith” lines in the Map Store.
(There is a dotted arrow at the bottom of Figure 6-1, from the Data Index to the Data Dictionary. This represents a user option when creating the database. Because fully-qualified searches, with both tag and data, are more efficient than data-only searches, and because we typically know the complete tag structure whenever we want to do a data search, the user can opt to take away his capability of doing data-only searches. By doing this, one file, the Data Duplicate Index, can be eliminated. When doing stores, inserts, and updates, we still need the Data Index in order to quickly determine whether a data item needs to be added to the Data Dictionary. And because we need the Data Index, we need to be able to look inside the Data Dictionary to see if there has been a collision in updating the Data Index; therefore the Association in the Data Index points to the item in the Data Dictionary, whenever the user opts to disallow data-only searches.)
If there are 123 instances of the data item “Smith” in the documents stored in our database, then there will be 123 rows in the Map Store which have a pointer to “Smith” in the data dictionary.
The function of a duplicate index in the NeoCore XMS is to find very quickly all 123 addresses in the Map Store. Further, we want the 123 addresses into the Map Store to be stored adjacent to each other in the duplicate index. Due to inserts and deletes, physical lines get out of logical sequence in the Map Store. But in the duplicate indices, we force the physical order to match logical order. By putting the 123 addresses adjacent to each other and in logical sequence, we can more easily set up efficient searches whenever we are trying to find a specific row in the Map Store.
The size of the duplicate indices is fairly large. The Tag Duplicate Index, typically the largest, has a default size of 10 meg. The combined default size of all the files shown in Figure 6-1 is 68 meg. All the sizes are configurable by the database administrator.
At the present time, the architecture of the NeoCore duplicate indices is considered proprietary. NeoCore advises that the patent reference given above, at the start of this chapter, is correct as far as the role of a duplicate index is concerned; however, they have “moved forward” as far as the structure of a duplicate index is concerned. (Presumably, another patent covering the new structure will appear before very long.) Accordingly, in this current description, we are presenting only a “functional” description of a duplicate index, making no claim that the functional description mirrors the actual NeoCore architecture.
To understand the function of a duplicate index, consider it as consisting of two blocks for each duplicate. One block is a three-entry “management array”; the other is an indefinitely-sized “address array” holding the addresses to the corresponding lines in the Map Store. For example, suppose there are 123 instances of “Smith” in our database. Further suppose the “Smith” management array is at lines 49 through 52 of the Data Duplicate Index. Further, the first line in the address array is line 101. Then lines 101 to 223 contain the 123 pointers to the 123 “Smith” lines in the Map Store. There would usually be additional “reserved” but empty lines in the “Smith” address array, so that it is easy to add more addresses when more instances of “Smith” are added to the database; assume line 228 is the last line reserved for additional “Smith” entries. Given these assumptions, the Association in the Data Index for the “Smith” entry would be a pointer to line 49 in the Data Duplicate Index, since line 49 is the first line in the “Smith” management array. That first line in the “Smith” management array would be a pointer to line 101 of the Data Duplicate Index, the first entry in the “Smith” address array. The second line in the management array would be a pointer to line 223, the last “used” entry in the address array. The third line would be a pointer to line 228, the last line currently reserved for “Smith” entries. Should the 128 lines in the address array for “Smith” fill up, the “Smith” address array will be moved to any free location in the Data Duplicate Index where there is room for the existing 128 lines plus a number of additional reserved lines (perhaps 32 or 64); then the contents of the management array will be changed accordingly.
The end result is that, for each duplicate string in the database, its Association in the index is a pointer to an entry point in the duplicate index. Inside the duplicate index, for each duplicate string, is an array of pointers, each of which points to a line in the Map Store which points to the string in the dictionary or dictionaries. See Figure 6-2. Within the duplicate indices, the pointers in each address array are both contiguous and logically ordered.
Let me repeat: this explanation of the duplicate indices accurately describes only their function. The actual structure of the indices will remain proprietary until the publication of the appropriate patent document by the USPTO. (This assumes NeoCore is planning to request a new patent to replace or supplement the one referenced at the beginning of this chapter.) If the structure were exactly as I described it, inserts of duplicate items would sometimes take a very long time—whenever the insert required that most of the items in a large address array had to be moved to make room for the inserted item.
Suppose this is one of the flattened lines from a document:
PhoneBook>Listing>Address>Street>1234 Main Street
In the Data Index, “1234 Main Street” will be automatically indexed by the database.
In the Tag & Data index, “PhoneBook>Listing>Address>Street>1234 Main Street” will be automatically indexed.
Furthermore, when a document is stored in the database, it receives a time-stamp and a document ID. These two items are also indexed automatically.
At the user’s option, each document can also be assigned a user filename or ID; in addition, the user can opt to specify an external schema name to be associated with that document. If the user opts to provide a user filename and/or an external schema, those two items are also automatically indexed.
Appropriate sets of “metadata” tags are provided with the document ID, time-stamp, timestamp of the most current revision, user filename, and external schema name. Then these four data items and their associated tagsets are stored and indexed exactly like every other tagset and data item in the database.
Each document, as stored within the NeoCore database, consists then of the original document with its root element plus a sibling to the original root in the form of a “<metadata>” element. To avoid having sibling roots (illegal syntax), all documents stored in the NeoCore database are enclosed within a new root element, namely “<ND>”.
At present NeoCore is using a more-or-less standard search algorithm when searching for non-indexed substrings, and therefore claims only standard efficiencies. As long as the search is for a single substring or for multiple substrings of the same length, the searches are reasonably efficient. But substring searches for multiple substrings of different lengths are notoriously inefficient. The implication is that we should not expect our testing to display superb speed when using this sort of query.
One way to solve the substring search problem is to index virtually everything automatically—not just the individual words but also whatever phrases might be searched for. For text-heavy applications the indexing could become quite challenging.
Instead, NeoCore plans another approach. NeoCore has several related patents or patent applications that it claims will provide a significant advance over traditional algorithms for finding substrings. In particular, it claims the patented algorithms will result in significantly improved efficiencies when searching for several substrings of varying lengths. The substrings are transformed into icons; the text containing the substrings is transformed into a set of icons; then the search is conducted as a mathematical operation instead of a traditional text search.
The key patent has only recently been approved, and it is not yet implemented in the NeoCore XMS. According to NeoCore, the improved searching functionality will be included in a future build. For this current project, since the improved functionality is not part of the existing database, we are skipping any analysis and evaluation of the new algorithms. However, we will nonetheless conduct significant testing of sub-string searches as part of this project, so that when the improved algorithms are implemented, we will have baseline data for determining any efficiencies that might result. If the new search algorithm works as NeoCore predicts, that will be significant for the computer industry.
Since text searches are notoriously inefficient, especially when searching for several substrings of varying length, users are encouraged by NeoCore to use some techniques during document preparation that will force the database to index substrings the user considers important. These techniques are explained in the next section.
All data items and all tagsets are automatically indexed by the NeoCore database. However, sometimes, while preparing an XML document, the user can anticipate the need to search for sub-strings within a data item. In such cases, the user can make sure important sub-strings are also indexed.
For example, suppose we structure an address element in the following way:
<street>1234 South Main Street</street>
While this is exactly how it should be done in many contexts, it does not lend itself to the most efficient search when we want a list of all addresses on South Main Street—“South Main Street” is not automatically indexed. Furthermore, it does not lend itself to efficient storage: “South Main Street” would be contained as a substring in numerous addresses which are all contained as separate items in the Data Dictionary. We might instead structure our document in this fashion:
<street_name>South Main Street</street_name>
This reduces storage and makes some searches more efficient because “South Main Street” would be indexed. Of course, that still won’t index “S. Main St.” or just “Main St.” or even “Main”. We could get carried away with ourselves if we structured the document so that we indexed every possible variation; at some point it makes more sense to simply require some standardization and discipline from potential users.
Another common technique the user can use to force indexing is to use a list of keywords. But don’t do it like this:
<keywords>Bush, Arafat, intifada</keywords>
This would index “Bush, Arafat, intifada” but it would index neither “Bush” nor “Arafat” nor “intifada”. Instead, a list of keywords should be handled something like this:
An alternative way to make sure key sub-strings are indexed is to simply include appropriate tags in the body of a text field:
<paragraph>Reliable sources report that <kywd>Bush</kywd> has had seventeen secret meetings with <kywd>Arafat</kywd> since the start of the <kywd>intifada</kywd>.</paragraph>
The point is simple but important. Indexed searches are inherently more efficient than non-indexed searches. With a little common sense and creativity, documents can be prepared in such a way that appropriate indexing is accomplished. Unfortunately, it does take extra effort, and for some situations the amount of extra effort is simply too much.
Study again Figure 6-2, and see the relation of icons, indices, duplicate indices, the Map Store, and the dictionaries. Note that when we do a search, the search string, “Smith”, is encountered twice: (1) it is the input to the Icon Generator, and (2) it is stored in a single location in the Data Dictionary. But once the Verifier process accepts that we have found the correct line in the Data Index, the string, “Smith”, is not used again in the search. Most of the work in indexed searching is done in the duplicate indices and the Map Store. And there are no strings here—the search is done with pointers rather than with the strings themselves.
There are strings neither in the core indices nor the duplicate indices nor the Map Store. These are all fixed-width structures containing only addresses and the control/formatting codes in the Map Store. The NeoCore search engine can work extremely efficiently, simply because it uses such simple, fixed-width structures. But their simplicity hides a richness. We have to realize how much information is available within the simple structures of a duplicate index and the Map Store. For our examples, we will use the simplified XML document and its “flattened” version on page 3-2, and the corresponding Map Store on page 3-5.
· Line Number. Each line in the Map Store is of fixed width and is at a specific address; it is easy to compute its line number.
Depth. Look at line
3 of the flattened document on page 3-2:
We will say “Telephone_book” is at depth 1, “Entry” is at depth 2, “Name” is at depth 3, “First” is at depth 4, and “John” is at depth five. Since the data item is at depth five, we will also say that line 3 is of depth five. (If for some reason no data item was provided—just the tag set—then we would say the depth of the line was still depth five, just as though a data item were provided.)
P-Level. The P-Level
is stored in the Map Store as one of the control/formatting codes. To determine
the P-Level of any line, look at the flattened line. Start from the left end of
that flattened line, and find the first tag that opens on that line (or the
data item itself if no tag opens on that line). The depth of that item is
considered the P-Level of the line.
(In this explanation of P-level, we used the concept of a tag “opening” on a line. We can explain the concept of opening by using examples. “Telephone_book” opens on line one. There are two different instances of “Entry”; one opens on line one, and the other opens on line nine—this is not obvious from the flattened document; you have to look at the original XML to realize this. “Name” opens on line two and again on line 10. “First” also opens on line two and again on line 10.)
considers each line of the flattened document to have a parent line. For each
flattened line in the Map Store, the line number of its parent line is stored
in its control/formatting codes.
Within the XML community, “parent” normally is used for elements, not for lines. Each data item and each element except the root element of a document has a parent. In our example “First” is the parent of “John”, “Name” is the parent of “First”, “Entry” is the parent of “Name”, and “Telephone_book” is the parent of “Entry”.
The parent of a line, as used by NeoCore, is related to the parent of an element. Consider Line L in the flattened document. To find the “Parent” line of L, start with the P-Level of Line L. Find the element at the depth indicated by the P-Level. Identify the parent of that element. Finally, find the line at which that parent element opened. That line is the “parent” of the original Line L. For completeness, for this “Parent” value only, we will say that the root is its own parent.
For example, here are lines nine through 16 of the example flattened document:
Telephone_book>Entry>Address>Street>5678 Byway Street
Consider line 15, the next-to-last line, containing zip code 55556. Its P-level is 4, because the leftmost tag that opens on that line is “Zip” and the depth of “Zip” is four. The parent of “Zip” is “Address”; that tag opens on line 12 (the same line containing “5678 Byway Street” as the data item), and so the parent of line 15 is line 12. Similarly, the parent of line 12 is line 9, the line on which “Entry”, the parent of “Address”, opens. (Notice that a different “Entry” tag opened on line 1, but its scope stops with line 8, because of where the </Entry> tag was in the original document.) Finally, the parent of line 9 is line 1, because “Telephone_book” is the parent of “Entry”, and “Telephone_book” opens on line 1.
Hierarchy Vector. The Couplet Hierarchy Vector (CHV) of any line
is a numerical representation of a given line, providing its full ancestry,
i.e., its parent, its parent’s parent, etc. The CHV has a size equal to the
depth of the line; for example, using our line 3 from above, CHV3
equals [1 1 2 3 3]. There are five line numbers in this CHV3,
because the depth of line #3 is five. The first tag in flattened line #3 is
“Telephone_book”; it opens on line #1 of the flattened document, and so the
first item in CHV3 is 1. The second tag in line #3 is “Entry”; it
also opens on line 1, and so the second item in CHV3 is also one.
The “Name” tag opens on line two, and so the third item in CHV3 is
2. Finally, both “First” and “John” open on line 3, and so the final items in
CHV3 are both 3.
The CHV provides a virtual ordering of the document. For example, suppose much later we were to insert a “Title” tag into the document, to be associated with the “John Doe” entry, with the data of “Mr.”, and the added line in the Map Store happened to be line 5000. CHV5000 would be [1 1 2 5000 5000]. CHV5000 would then numerically follow CHV3 (namely, [1 1 2 3 3]) but would precede CHV4 (namely, [1 1 4 4 4]). Even if we were to much later add other children of the “Name” tag, such as Middle_name, Suffix, or Nickname, the CHVs would provide a virtual ordering, with all the children of “Name” together—no matter how the lines are physically ordered in the document.
The CHVs are not stored anywhere. They can be created as needed very quickly from the values for depth, parent, and P-Level.
Vector. In typical phonebook searches, we provide a
name and/or address and want to return the phone number. Using the structure of
our example, we are looking for a particular phonebook entry which contains the
appropriate data items, that is, we want data from the “John Doe” phonebook
entry, which opens on line 1 of the flattened document, or from the “Jennie
Buck” phonebook entry, which opens on line 9. All lines from the “John Doe”
entry, will have one thing in common: their CHV will start out [1 1 ...]. In
effect, the CHVs of the “John Doe” entry “contain” a two-element CHV, [1 1]. We
will say that these lines converge at depth 2, where they all share the same
converged vector, [1 1]. Given a name and address that converge at [1 1], the
search will be for a “Phone_number” tag that also converges at [1 1].
Similarly, if we are searching for data from the “Jennie Buck” phonebook entry, we will be looking for lines in the Map Store which converge on [1 9].
Using the variables just described, very rapid calculations can be performed to conduct searches. While the details of the NeoCore search engine are proprietary, the public materials describe very rapid set operations and binary searches. For example, suppose we are searching for a particular telephone number and the tagset (TS) is “Telephone_book>Entry>Phone_number”. Suppose we have already calculated that the phone number we want converges on [1 2500]. (We queried for a “Wenker” who lives in “CO”. The CHV for one of the “Wenker” entries was [1, 2500, 2501, 2502, 2502], and the CHV for one of the CO entries was [1, 2500, 2503, 2505, 2505]. Those two CHVs converge on [1, 2500].) Further suppose there are 4096 lines in the Map Store which contain pointers to TS in the Tag Dictionary. Then there will also be 4096 adjacent and ordered cells in the Tag Duplicate Index (TDI) which point to those 4096 Map Store lines. We will go to the cell in the middle of all the TS cells in the TDI, and follow its pointer to the Map Store line. We will quickly compute whether that cell converges on [1 2500] or converges on a higher or lower Converged Vector. Suppose it is higher; then we follow the well-known binary search technique, looking at the TS cell in the TDI that is midway between the first TS cell and the one we just examined. And we continue as long as needed; with our supposed 4096 TS lines in the Map Store, we will find the line that converges on [1 2500] in a maximum of 13 attempts. Once we have found it, we follow the pointer to the phone number in the Data Dictionary and return it.
The example we just provided is, given our purposes here, necessarily quite simple. But it does illustrate how quickly searches can be done using the NeoCore search architecture. In our example, we performed one binary search with a maximum of 13 comparisons. Other examples might require state machines and a variety of set operations and vector calculations. But in principle, our searches will be quite rapid compared to traditional string searches, because we have transformed our string search into mathematical comparisons. That is, the maximum of 13 comparisons in our example are not 13 comparisons of variable-width strings; rather, they are 13 numerical comparisons. And as each search gets more and more complex, the numerical techniques allow us simply to perform more complex mathematical comparisons rather than multiply the number of string operations. As the searches become more and more complex, the numerical techniques perform in a more and more superior fashion.
And don’t forget that we did not give the actual structure of the duplicate indices, because the patent documents covering that are not published yet. The previous documents describe a duplicate index structured in such a way that the search actually takes significantly less that 13 comparisons at a maximum.