<< Chapter_4_Icons

Index of HTML Docs

Chapter_6_Search_Architecture >>

 

 

Chapter 5.  Indexing Techniques

 

Associative Memory

Classic Hash Tables

The 1985 Brandin Catalog

Managing the Catalog

Walk-through of Catalog Management Operations during a Write

Advantages of the Brandin Store/Search Method

An Improved Catalog

Efficiency of the NeoCore Catalog

 

 

Material in this chapter is taken from the following sources:

a.      US Patent 4,527,239, July 2, 1985, “Digital data storage methods and apparatus”

b.      US Patent 6,324,636, November 27, 2001, “Memory management system and method”

c.      Chris Brandin and Harry Direen, “DPP Memory Management,” NeoCore Technical Paper, Release 2.0, February 14, 2001.

d.      Chris Brandin, “A Definition of Digital Pattern Processing,” NeoCore Technical Paper, Release 1.0, February 7, 2001.

e.      Morris, John, http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/hash_tables.html

 

 

Associative Memory

 

The standard method for getting something from hard disk, RAM, or other storage is to provide an address or offset. You read what’s at that address and return the content to the requesting process.

 

Sometimes the address is not readily available. Instead we have a key, in the form of a string, about which we need some information. The key needs to be associated in some way with the desired information on the disk or in RAM. When a system is set up to access information in this way, we are said to be dealing with associative storage or associative memory.

 

Typical databases, of course, are associative in nature. For example, we query for data pertaining to “denver,” and expect the data management system to return the desired information. In an RDBMS, doing such a search is relatively straightforward: you search the specific column which contains city names. Of course the search can get more complicated, joining several columns in different tables. But however complicated it gets, you still search specific columns and not the entire database. If you expect a complicated search to be regularly repeated, you provide indexing to make it more efficient. And even then, the indexing is fairly straightforward, because you always have the column structure to limit what needs to be indexed.

 

But the NeoCore XML database does not use tables and columns for data organization. Further, it is utterly impractical, even with a medium-sized database, to search every storage location looking for the string “denver”. Instead, there must be some sort of structure that allows you to find the storage location for “denver” efficiently.

 

Unfortunately, most associative storage methods have been slow, expensive, or both[1] unless the amount of data is small. It is important to understand the problems faced by traditional associative storage methods if we are to understand how NeoCore solves them. So before we examine NeoCore’s indexing techniques, we want to look at classic hash tables and the problems they encounter.

 


 

 

Classic Hash Tables

 

A typical hash table can be interpreted as a one-column table. Every cell in the table is pre-loaded with an Association; we will assume here that the Association is a data storage location. When you attempt to store “denver,” the system first looks at the “denver” row in the hash table, reads the predetermined data storage location, and then writes the data to disk at that location. Similarly, to query for “denver,” the system first looks at the “denver” row in the hash table and finds its specified storage location in the database. It checks that storage location and returns a “found” or “not found” as appropriate.

 

The problem is determining which row of the hash table is intended for “denver”. You can’t just store the string “denver” in the hash table (along with all other possible search keys) and then go searching for it. That would be like searching the database itself. And to have a hash table large enough to contain all possible keys boggles the imagination. The hash table would be bigger than the database. The hash table would help us find data in the database, but what would help us find strings in the hash table?

 

Instead, the software must provide a method for converting the key, “denver,” into an entry address for a reasonably sized hash table. The access needs to be rapid, so we don’t merely start at row 1, then row 2, and so on. Rather, we convert “denver” directly into a specific entry address (or offset, or row) in the hash table.

 

For typical database applications, the potential number of keys is quite large; normally, only a small portion will ever actually be used. So, while there are, let’s say, ten trillion possible item numbers in our inventory system, we might actually have only 500 inventory items in the database. And our hash table has, let’s say, four trillion rows of possible data storage locations. Whichever of the ten trillion possible item numbers we provide to the storage algorithm, it has to convert that item number to the entry address of one of the four trillion rows in the hash table. Since there are far more possible keys than rows in the hash table, quite a few possible keys will convert to the same row. This in turn means that “collisions” can occur—when the table indicates that the data associated with two or more different keys is to be put into the same storage location.

 

Because of the possibility of collisions, the software must be able to detect the problem and then provide an alternate entry address for the hash table. Detecting a collision on an insert involves accessing the designated storage location in the data store and observing that there is already data at that location. So we need to get an alternate storage location from the hash table. Let’s suppose we get it from the next row in the hash table. The problem is that now there are two “used” rows in that particular area of the hash table, and as more collisions occur, the number of occupied rows clustered in that area tends to increase; eventually it takes a major expenditure of computer time just to detect and resolve the collisions. Each collision can be resolved only after a time-consuming disk read. And as the database starts getting full, there can be hundreds or even billions of collisions before the location for a single insert can be determined. Over the years we have found somewhat more efficient ways of resolving such collisions, but as the database accumulates more and more data, collisions necessarily become more and more frequent. The primary way of alleviating the problem (and it’s expensive) is to have far more storage locations available than we will actually use.
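To make the collision problem concrete, here is a minimal Python sketch (ours, not NeoCore code) of a classic hash table in which each cell holds only a data storage location and collisions are resolved by linear probing, one common scheme among several. Notice that every probe of an occupied slot costs a read of the (slow) data store, because only the data store can tell us whether the record there actually belongs to our key:

# A toy classic hash table: each slot holds only a data-store location.
# Collisions are resolved by linear probing, and every probe of an
# occupied slot requires a read of the (slow) data store to check
# whether the record there matches the key.

TABLE_SIZE = 16                      # tiny stand-in for "four trillion rows"
hash_table = [None] * TABLE_SIZE     # each cell: a data-store location or None
data_store = []                      # the "disk": a list of (key, record) pairs

def slot_for(key: str) -> int:
    return hash(key) % TABLE_SIZE    # convert the key to an entry address

def insert(key: str, record: str) -> None:
    slot = slot_for(key)
    probes = 0
    while hash_table[slot] is not None:              # collision: slot already used
        if data_store[hash_table[slot]][0] == key:   # disk read just to detect it
            return                                   # duplicate key; nothing to do
        slot = (slot + 1) % TABLE_SIZE               # linear probing
        probes += 1
        if probes == TABLE_SIZE:
            raise RuntimeError("hash table full")
    data_store.append((key, record))                 # write the data to the store
    hash_table[slot] = len(data_store) - 1           # remember where it went

def lookup(key: str):
    slot = slot_for(key)
    for _ in range(TABLE_SIZE):
        if hash_table[slot] is None:
            return None                              # "not found"
        stored_key, record = data_store[hash_table[slot]]   # disk read per probe
        if stored_key == key:
            return record                            # "found"
        slot = (slot + 1) % TABLE_SIZE
    return None

insert("denver", "data about denver")
print(lookup("denver"))

As the table fills, the probe chains grow, and each extra probe is another disk read; that is the cost curve the Brandin Catalog is designed to flatten.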

 


 

 

The 1985 Brandin Catalog

 

In 1985, Christopher Brandin, now CTO of NeoCore, patented a variation on classic hash tables. In 2001 he was issued a second patent, building on the first, that offered further improvements. In this section we will examine the core concept as contained in the original 1985 patent; later in this chapter we will examine the more recent variation.

 

While Brandin’s technique is related to classic hash tables, it is so significantly different that it is far more than just a variation. Following his original 1985 terminology, we will call his structure a “Catalog” rather than a hash table.

 

The Brandin Catalog differs from a classic hash table in both structure and use.

 

The difference in usage is this. The traditional hash table has to be full of predetermined storage locations before you can insert the first bit of data into the database; you look in the hash table to determine where you are going to store your data. The 1985 Brandin Catalog, if nothing has ever been inserted into the database, starts out empty. (The space for the Catalog is allocated, but all the cells are empty.) Instead of first looking up the predetermined storage location in the hash table, Brandin’s method simply uses the first available storage location. The “Store” process sends that storage location back to the “Catalog Controller” process. The Catalog Controller then enters the storage location into an appropriate place in the Catalog. If there is only one item stored in the database, then there is only one row in the Catalog which contains any information.

 

Brandin’s Catalog also differs from hash tables in structure. Instead of a one-column hash table with nothing but database storage locations, the original 1985 Brandin Catalog has three columns. One is the Association, which typically contains the data store address, just like for the hash table. The second is the Confirmer, which is explained below. The third is the “Forward Pointer”: it contains the next row to look at in the Catalog, in case there is a collision (or in case the application allows duplicates).

 

Figure 5-1 shows the structure of a simplified 15-element Brandin Catalog after a single data entry has been made into the database. (In the NeoCore implementation the size of the Catalog (called a “Core Index”) is configurable. Ideally, system resources permitting, the Catalog would be in RAM for fast access.[2])

 

Catalog Entry Address   Association     Confirmer   Forward Pointer
1111                    NULL            NULL        NULL
1110                    NULL            NULL        NULL
1101                    NULL            NULL        NULL
1100                    NULL            NULL        NULL
1011                    NULL            NULL        NULL
1010                    NULL            NULL        NULL
1001                    NULL            NULL        NULL
1000                    NULL            NULL        NULL
0111                    NULL            NULL        NULL
0110                    NULL            NULL        NULL
0101                    DB Address #1   1001        NULL
0100                    NULL            NULL        NULL
0011                    NULL            NULL        NULL
0010                    NULL            NULL        NULL
0001                    NULL            NULL        NULL

Figure 5-1 Simplified Brandin Catalog—one entry

 

To use the Catalog you first generate an icon, using the data to be stored as the input string to the icon generator, as explained in the previous chapter. We will assume in our simplified example here that our first data item was the string “atlanta” and the icon computed to 01011001. The first part of the icon is used as the Catalog Entry Address, in our example 0101; the second part of the icon is used as the Confirmer, in our example 1001. (More on the Confirmer later.) The Association is the location in disk storage where the insert was just made—in this case the first location in the data store, simply because this was the first insert made. The Forward Pointer is NULL because, with no second entry yet, there cannot have been any collisions or duplicate entries.
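The mechanics of splitting an icon are simple. The following Python sketch (ours, not NeoCore's) shows the split for the simplified 8-bit example; the same function covers the 64-bit case if we assume, as the chapter later implies, a 32-bit Entry Address and a 32-bit Confirmer:

# Splitting an icon into a Catalog Entry Address and a Confirmer.
# The 8-bit case mirrors the simplified example in the text; the
# 64-bit case assumes a 32/32 split, which is our reading of the
# "64-bit icons and 32-bit Catalog Entry Addresses" used later.

def split_icon(icon: int, icon_bits: int, address_bits: int):
    """Return (catalog_entry_address, confirmer) for the given icon."""
    confirmer_bits = icon_bits - address_bits
    entry_address = icon >> confirmer_bits           # high-order bits
    confirmer = icon & ((1 << confirmer_bits) - 1)   # low-order bits
    return entry_address, confirmer

# "atlanta" in the simplified example: icon 01011001 -> 0101 and 1001
addr, conf = split_icon(0b01011001, icon_bits=8, address_bits=4)
print(f"{addr:04b} {conf:04b}")    # prints: 0101 1001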

 

Figure 5-2 shows the Catalog after a second entry has been made. Let us assume that the data item is “boston” and that the icon computed to 00111010. Note that DB Address #2 is not a fixed, predetermined location; it depends on how much space was used when the first data entry was made at DB Address #1. As soon as the Store process inserted the first data item into DB Address #1, it knew that the New Record Address for the second data item would immediately follow the space used by the first item. There are no empty spots in the data store; only a couple of characters between data items serve as delimiters.

 

Catalog Entry Address   Association     Confirmer   Forward Pointer
1111                    NULL            NULL        NULL
1110                    NULL            NULL        NULL
1101                    NULL            NULL        NULL
1100                    NULL            NULL        NULL
1011                    NULL            NULL        NULL
1010                    NULL            NULL        NULL
1001                    NULL            NULL        NULL
1000                    NULL            NULL        NULL
0111                    NULL            NULL        NULL
0110                    NULL            NULL        NULL
0101                    DB Address #1   1001        NULL
0100                    NULL            NULL        NULL
0011                    DB Address #2   1010        NULL
0010                    NULL            NULL        NULL
0001                    NULL            NULL        NULL

Figure 5-2 Simplified Brandin Catalog—two entries

 

Catalog Entry Address   Association     Confirmer   Forward Pointer
1111                    NULL            NULL        NULL
1110                    NULL            NULL        NULL
1101                    NULL            NULL        NULL
1100                    DB Address #3   1010        NULL
1011                    NULL            NULL        NULL
1010                    NULL            NULL        NULL
1001                    NULL            NULL        NULL
1000                    NULL            NULL        NULL
0111                    NULL            NULL        NULL
0110                    NULL            NULL        NULL
0101                    DB Address #1   1001        NULL
0100                    NULL            NULL        NULL
0011                    DB Address #2   1010        1100
0010                    NULL            NULL        NULL
0001                    NULL            NULL        NULL

Figure 5-3 Simplified Brandin Catalog—three entries

 


 

Figure 5-3 shows a third entry. Assume this time, when we input “columbus”, the icon generator created the same icon as it did for “boston”, the second entry, namely 00111010. This is a full collision. The first half of the icon is the Catalog Entry Address, 0011. The Catalog Controller looks at the corresponding row in the Catalog and sees that it has already been taken. So it simply selects any available Catalog Entry Address (in this example, 1100) and enters that value as the “Forward Pointer” in Row 0011. Then in Row 1100, it inserts the appropriate Association and Confirmer. Notice that the collision was resolved without a relatively time-consuming read of the data store.

 

Figure 5-4 shows a fourth entry. We will assume that the data item is “denver” and its icon is 11000010. The Catalog Controller sees that row 1100 already contains information. This amounts to a kind of “mini-collision”—the Catalog Entry Address is already taken, even though the icon is not the same. Because of the mini-collision, the Catalog Controller chooses an available Catalog Address (1110) and enters that value in the “Forward Pointer” in Row 1100. It inserts the appropriate Association and Confirmer in Row 1110. Note that the Confirmer is different for rows 1100 and 1110, because we had only a mini-collision, not a full collision.

 

Catalog Entry Address   Association     Confirmer   Forward Pointer
1111                    NULL            NULL        NULL
1110                    DB Address #4   0010        NULL
1101                    NULL            NULL        NULL
1100                    DB Address #3   1010        1110
1011                    NULL            NULL        NULL
1010                    NULL            NULL        NULL
1001                    NULL            NULL        NULL
1000                    NULL            NULL        NULL
0111                    NULL            NULL        NULL
0110                    NULL            NULL        NULL
0101                    DB Address #1   1001        NULL
0100                    NULL            NULL        NULL
0011                    DB Address #2   1010        1100
0010                    NULL            NULL        NULL
0001                    NULL            NULL        NULL

Figure 5-4 Simplified Brandin Catalog—four entries
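As a working summary of the structure just illustrated in Figures 5-1 through 5-4, here is a small Python sketch (our own simplification, with duplicates allowed). Each Catalog row carries an Association, a Confirmer, and a Forward Pointer, and a collision simply links the new entry onto the end of the chain. The patent leaves the choice of a free row to a separate process; here we just scan for one:

# Sketch of a 1985-style Brandin Catalog row and insert.  A collision
# (the natural row is occupied) is resolved by picking any free row and
# linking it to the end of the chain via the Forward Pointer; no read
# of the data store is needed.

NULL = None

class CatalogRow:
    def __init__(self):
        self.association = NULL     # data-store location
        self.confirmer = NULL
        self.forward = NULL         # next row of this chain, if any

class Catalog1985:
    def __init__(self, size: int):
        self.rows = [CatalogRow() for _ in range(size)]

    def _any_free_row(self) -> int:
        # stand-in for the separate free-row process the patent assumes
        for index, row in enumerate(self.rows):
            if row.association is NULL:
                return index
        raise RuntimeError("Catalog full")

    def insert(self, entry_address: int, confirmer: int, db_address) -> None:
        row_index = entry_address
        if self.rows[row_index].association is not NULL:
            # the natural row is taken: walk to the end of the chain,
            # then link in a freshly chosen free row
            while self.rows[row_index].forward is not NULL:
                row_index = self.rows[row_index].forward
            free_index = self._any_free_row()
            self.rows[row_index].forward = free_index
            row_index = free_index
        self.rows[row_index].association = db_address
        self.rows[row_index].confirmer = confirmer

Run against the example entries (“atlanta” at 0101, “boston” at 0011, the full collision for “columbus”, the mini-collision for “denver”), this produces chains equivalent to those in the figures, although the free rows it happens to pick will differ from the ones the figures use.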

 

For purposes of our example, to see how the Catalog works, we arbitrarily postulated that, in the first four data entries, we had a full collision and a mini-collision. It should be emphasized that, in practice, either kind of collision would be rare in a near-empty database—about once in every 18 × 10^18 inserts and about once in every 4 billion inserts, respectively, if we are using 64-bit icons and 32-bit Catalog Entry Addresses.

 

When a query is made for “columbus,” the Catalog Controller will first look at 0011, verify it contains the correct Confirmer, and then look at the data in DB Address #2. But the data it finds is “boston”; there has been a collision. So it will get the Forward Pointer from row 0011 (namely, 1100), and look at row 1100. Because row 1100 has the correct Confirmer, the system will look at DB Address #3, this time finding the correct data item. Assuming duplicates are not allowed, as in the NeoCore core indices, the search is complete. (If duplicates are allowed, the Controller will see that there is a Forward Pointer in Row 1100, and so it will look at row 1110. But this time the Confirmer in that row is not the correct one, so it will not look at DB Address #4. It will also notice that there is no Forward Pointer in row 1110, and so it will terminate the search at that point.)

 

When you access the Brandin Catalog, you are not accessing merely one entry; you are accessing a potential chain of entries. For example, in Figure 5-4, when you access row 0011, you will be led, chain-fashion, to 1100, and then to 1110.

 

Several different situations can move you along the chain:

(1) You could allow duplicates and so have several different data elements all keyed, for example, to “columbus.”

(2) In some cases (approximately once in every 4 billion Catalog queries, if you are using 64-bit icons and 32-bit Catalog Entry Addresses and if you don’t allow duplicates) two different keys will yield the same Catalog Entry Address.

(3) The row identified by a given Catalog Entry Address might already be occupied by a link in some other chain.

(4) You might have a full-blown collision (approximately once in every 18 × 10^18 queries, if you are using 64-bit icons and don’t yet have much data in the database), where two different keys yield the same icon.

(5) Finally, you can have a situation where, because of (1) through (3), you have already moved along some chain and are now at a pre-existing link which happens to have the same Confirmer (again, about one time in 4 billion, and only in those rare events where you land on a link that is already in use).

 

When any of the first three of these occurs, the Catalog Controller handles it efficiently without a disk read, merely by examining the Confirmer. Cases (4) and (5), however, pose special problems for reads. They are the Brandin equivalent of full-blown collisions: they cannot be detected until there is a read of the data store. Using the Brandin method, such cases are not very frequent, and for some applications the user might be able to accept such occasional mis-reads. But in most applications they must be dealt with. Brandin provides a “Verifier” process for this purpose: each time data is read from the data store, the Verifier compares the key to the data itself. This is one of the limitations of the Brandin method: you have to use CPU cycles to conduct the verification. In this respect, however, the Brandin method is still far superior to traditional hash methods: they read and verify data on EVERY collision, not just the few cases that fall in categories (4) and (5) above. In a nearly full database, traditional hash methods have to make such erroneous disk accesses billions of times more frequently than Brandin does. And disk access is, by far, the slowest part of the entire process.
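A query against this structure follows the chain, uses the Confirmer to skip links that cannot match, and relies on a Verifier check of the record itself to catch cases (4) and (5). Here is a sketch, continuing the Catalog1985 example above and assuming the data store holds (key, record) pairs:

# Query sketch for the Catalog1985 class above.  The Confirmer filters
# out almost every wrong link without touching the data store; the
# Verifier (comparing the stored key to the query key) catches the rare
# full collisions, cases (4) and (5).

def lookup(catalog: Catalog1985, data_store: list,
           entry_address: int, confirmer: int, key: str):
    row_index = entry_address
    while row_index is not NULL:
        row = catalog.rows[row_index]
        if row.association is not NULL and row.confirmer == confirmer:
            stored_key, record = data_store[row.association]   # data-store read
            if stored_key == key:                               # Verifier check
                return record                                   # "found"
        row_index = row.forward      # follow the chain; NULL ends the search
    return None                      # "not found"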

 


 

 

Managing the Catalog

 

A classic hash table has only one field for each element; Brandin’s Catalog has three. The classic hash table, once created, is static; Brandin’s Catalog is quite dynamic. The question is whether the advantages that Brandin’s method generates would be outweighed by the increased size of his Catalog and the overhead of managing it.

 

Brandin’s patent documentation provides ways to manage the Catalog for writes, reads, and deletes.

Writes. If the using application requires items to be entered into the Catalog no more than once (as in the NeoCore XMS), then any write will be preceded by a read—to determine whether the write is permitted. The system will already have generated the icon and traversed any existing Catalog chain, so it will already know where in the Catalog the new entry will be made. As the Store process is writing the item to disk, the Catalog Controller process will insert the new entry into the Catalog and, if needed, set the appropriate Forward Pointer to point to the new entry. During the write itself, managing the Catalog in effect takes no time—it happens while the relatively time-consuming write to disk is being done.

Reads. Reads are much the same, except that at each link in the chain it is necessary to read the Confirmer at that row and compare it to the Confirmer register. If they agree, Store reads the data record in primary storage and has the Verifier check it. If it passes the Verifier test, the data item is returned in response to the original query.

Deletes. Deletes work much the same as reads, except that instead of returning the data item, we mark it for deletion. Then we have to mark the associated Catalog entry for deletion in such a way that the chain is not broken when we delete one of its links.

 

In summary, Brandin provides methods for managing the Catalog very efficiently. Hash techniques can detect collisions only by accessing the primary data on the hard drive; for Brandin most collisions (and all of them on writes) will be detected by examining the RAM-based Catalog. CONCLUSION: EVEN COUNTING THE OVERHEAD OF MANAGING A DYNAMIC CATALOG, BRANDIN’S 1985 METHOD IS FAR FASTER THAN TRADITIONAL HASHING TECHNIQUES.

 


 

 

Walk-through of Catalog Management Operations during a Write

 

During this step-by-step walk-through, we will assume, for simplicity, that duplicates are allowed. When used in the XMS, duplicates are not allowed in the Catalog.

 

At the end of the previous write, the New Record Address Register (NRA) was already updated with the address for the next store. Also, the Free Chain Pointer Register (FCP) contains the address of an empty element in the Catalog; that address is established by a separate process that is not explained in the patent materials. We will assume all other relevant registers are empty, or at least that their contents are no longer relevant.

 

As the new data and the write instruction are given to the Store process, the Icon Generator computes the icon; the first part ends up in a Catalog Entry Address Register (CEA) and the second part in the Confirmer Register (CNF).

 

The Catalog Controller reads the relevant row in the Catalog. If that row is empty, it writes the New Record Address and the Confirmer into that element. Those would be the only Catalog management operations needed for this write.

 

But if there already is information in that row of the Catalog, then the Catalog Controller will have to follow the Forward Pointers until it finds an empty spot. Three registers are needed. (1) The Free Chain Pointer Register (FCP) contains the address of an empty element in the Catalog that is provided by a separate subroutine. (2) The Previous Chain Pointer Register (PCP) will receive the value of the current Forward Pointer (FP); initially, it receives the value of the Catalog Entry Address (CEA). (3) The Queried Entry Pointer Register (QEP) initially receives the value of the FP at the Catalog Entry Address. As we move along the chain, it will contain the value of the FP that we are about to consider.

 

We will walk through the write process assuming the simplified Catalog and its contents as shown in Figure 5-4. We will also assume we are going to write data into DB Location #5, and that the icon will be the same icon that was used when the second and the third writes took place, namely, 00111010. The previous transaction has completed, and all registers have been reset to zero except the Free Chain Pointer Register (FCP), which contains the address of an empty element in the Catalog, and the New Record Address Register (NRA), which contains the address of DB Location #5.

 

1.      The system receives the write instruction and the data to be written. (In our example, we assume we allow duplicates, so that we don’t first have to do a query to determine whether this is a duplicate.)

2.      The data is sent to the icon generator; it puts “0011” into the Catalog Entry Address Register (CEA) and “1010” into the Confirmer Register (CNF).

3.      The Catalog Controller now reads the Confirmer value that is in the Catalog entry specified in the Catalog Entry Address Register, i.e., it reads “1010,” the Confirmer value in row 0011.

4.      When the Catalog Controller sees that the Confirmer value it read in step 3 is not zero, it knows that Catalog entry 0011 is already in use and that it must use a different place for the information about the current write. It enters the procedure shown in Figure 5-5 at “BEGIN”.

Figure 5-5. Chain-traversal procedure for an occupied Catalog entry (flowchart)

5.      The current Confirmer, 1010, is already in the Confirmer Register (CNF), and “DB Location #5” is already in the New Record Address Register (NRA). The contents of the Catalog Entry Address Register (CEA), i.e., “0011,” are copied into the Previous Chain Pointer Register (PCP). The Catalog Controller reads the Forward Pointer (Chain Link Address) in the Catalog at row 0011, the address in PCP, and puts it in the Queried Entry Pointer Register (QEP). QEP now holds 1100.

6.      QEP is tested for 0. Since its value is now 1100, the “NO” path is taken. PCP receives the value of QEP, namely 1100. The FP[PCP] is read from the Catalog, namely, 1110, and assigned to QEP.

7.      QEP is tested for 0 again. Since its value is now 1110, the “NO” path is taken again. PCP receives the value of QEP, namely 1110. The FP[PCP] is read from the Catalog, namely, NULL, and assigned to QEP.

8.      QEP is tested for 0 again. This time it is 0, and so the YES path is taken.

9.      The contents of the FCP (we will assume it holds 1000) are assigned to QEP. FP[PCP], i.e., the Forward Pointer at 1110, receives the value of QEP, namely 1000. At Catalog entry 1000, the Confirmer value is written, namely 1010, and the NRA value is written, namely DB Location #5.

 

The important thing to notice here is that the loop in the upper right-hand corner of Figure 5-5 is extremely fast: each pass requires only an assignment of values to two registers and a test for zero. No matter how many links the chain contains, it will be traversed very rapidly. This technique is what allows Brandin to resolve typical collisions so quickly.
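The loop itself can be written in a few lines. The Python sketch below follows the register names used in the walk-through (CEA, CNF, NRA, PCP, QEP, FCP); the fp, confirmer, and association lists stand in for the three Catalog columns, and, as in the walk-through, 0 plays the role of NULL:

# The chain-traversal loop of Figure 5-5, written against the register
# names used in the walk-through.  fp, confirmer, and association stand
# in for the Catalog columns, and 0 plays the role of NULL.

def catalog_write(fp, confirmer, association, CEA, CNF, NRA, FCP):
    if confirmer[CEA] == 0:                 # the natural row is empty
        confirmer[CEA] = CNF
        association[CEA] = NRA
        return CEA
    PCP = CEA                               # Previous Chain Pointer
    QEP = fp[PCP]                           # Queried Entry Pointer
    while QEP != 0:                         # the fast loop: two moves and a zero test
        PCP = QEP
        QEP = fp[PCP]
    QEP = FCP                               # take the free row supplied in FCP
    fp[PCP] = QEP                           # link it onto the end of the chain
    confirmer[QEP] = CNF
    association[QEP] = NRA
    return QEP

# Re-running the walk-through against the Figure 5-4 contents:
fp, confirmer, association = [0] * 16, [0] * 16, [0] * 16
confirmer[0b0101], association[0b0101] = 0b1001, "DB Address #1"
confirmer[0b0011], association[0b0011] = 0b1010, "DB Address #2"
confirmer[0b1100], association[0b1100] = 0b1010, "DB Address #3"
confirmer[0b1110], association[0b1110] = 0b0010, "DB Address #4"
fp[0b0011], fp[0b1100] = 0b1100, 0b1110

row = catalog_write(fp, confirmer, association,
                    CEA=0b0011, CNF=0b1010, NRA="DB Location #5", FCP=0b1000)
print(f"{row:04b} {fp[0b1110]:04b}")        # 1000 1000: the new entry went to row 1000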

 


 

 

Advantages of the Brandin Store/Search Method

 

A classic hash table technique cannot get around the collision problem. The fundamental difficulty is that collisions can be detected only by reading a primary data entry on the hard drive—inherently a very slow process. Collision resolution schemes can alleviate the problem somewhat, but they do not solve it. For Brandin, almost all collisions are detected by examining the RAM-based Catalog. Brandin gets the speed associated with hash tables while avoiding most of the delays caused by collisions.

 

If you were to use a classic hash table, your best approach—whatever your collision resolution scheme—would be to have huge data storage and a correspondingly huge hash table. We are much less likely to have collisions if we are using only five percent of our storage space and of our hash table. But that is extremely expensive.

 

Brandin’s approach effectively allows arbitrary 64-bit integer keys. By separating the icon into the Catalog Entry Address and the Confirmer, Brandin has found a way to have 64-bit keys (i.e., 18 × 10^18 unique keys) and still have a Catalog with far fewer elements.

 

(And, of course, Brandin’s basic idea could be expanded if a 64-bit key is insufficient for extremely large scale needs. It would not take much to scale Brandin’s method to any size you might need. We would merely have to increase the length of the icon and the size of the Catalog as the number of possible entries in the data store increased.)

 

The net result of Brandin’s method is to have the advantage of hash tables (i.e., extremely fast speeds) without impossibly large hash tables and without using excessive time to detect and manage collisions.

 


 

 

An Improved Catalog

 

Here are the strings and their associated icons that we used in the simplified Catalog example earlier in this chapter:

                        atlanta      01011001
                        boston       00111010
                        columbus     00111010
                        denver       11000010

When we created this example, we assumed these icon values so that we could show what happened when we had a full collision when we tried to insert “columbus” and a mini-collision when we tried to insert “denver”. “atlanta” went into its natural row, 0101, and “boston” into its natural row, 0011. But then, because the natural row for “columbus” (0011) was already taken, the system happened to choose 1100 as the “columbus” row. That choice then caused a problem for “denver”: its natural row (1100) was unavailable as well, and so the system chose to insert it at row 1110. For as long as “columbus” and “denver” remain in the database, every time we query for either of those two data items, we will have to check the Catalog at their natural rows and then jump to their actual rows.

 

Figure 5-6 shows this situation; it is identical to Figure 5-4, except the irrelevant empty Catalog locations have been removed.

 

Catalog Entry Address   Association     Confirmer   Forward Pointer
1110                    DB Address #4   0010        NULL
1100                    DB Address #3   1010        1110
0101                    DB Address #1   1001        NULL
0011                    DB Address #2   1010        1100

Figure 5-6. Portion of four-entry simplified Brandin Catalog

 

What we would like to do is swap the “columbus” entry and the “denver” entry; then “denver” would be at its natural address, and we would find the “denver” entry during future queries without jumping to a new address. To make the swap, the system needs to recognize, at the time we are dealing with the “denver” insert, that the “columbus” information in row 1100 is not at its natural row. Then we could move the “columbus” Association and Confirmer to row 1110, freeing up row 1100 for “denver”. We would also have to change the Forward Pointer in row 0011 so that it reads 1110 rather than 1100. What we really want is the arrangement shown in Figure 5-7:

 

Catalog Entry Address   Association     Confirmer   Forward Pointer
1110                    DB Address #3   1010        NULL
1100                    DB Address #4   0010        NULL
0101                    DB Address #1   1001        NULL
0011                    DB Address #2   1010        1110

Figure 5-7. Preferred Catalog Setup

 

In order to accomplish this, the system has to recognize two things quickly—without a read of the data store: (1) the “columbus” entry at 1100 is not at its natural address, and (2) that natural address is 0011. Unfortunately, there is no way to determine these things quickly enough with the Catalog structure we have been using.

 

The patent for a revised Catalog technique was issued in 2001. The revision retains the central concept of the 1985 patent: the use of the Confirmer and the Forward Pointer to allow huge icons, efficient collision resolution, and a reasonably sized Catalog. The revision adds two flag bits to each Catalog row: a “Primary Flag” and an “Allocated Flag.” There are also some differences in usage, as will now be explained.

 

When the 1985 Catalog is initially created, all its cells are empty. But when the improved (2001) Catalog is initially created, it is created as a kind of doubly linked list: all the Forward Pointers point to the next row, and the Association is used to point to the previous row. This empty, doubly linked list is called the “Free List.” As rows come into use, they are removed from the Free List. Figure 5-8 shows the initial setup for the improved Catalog.

 

Catalog Entry Address   Primary Flag   Allocated Flag   Association     Confirmer   Forward Pointer
1111                    0              0                1110            NULL        0001
1110                    0              0                1101            NULL        1111
1101                    0              0                1100            NULL        1110
1100                    0              0                1011            NULL        1101
1011                    0              0                1010            NULL        1100
1010                    0              0                1001            NULL        1011
1001                    0              0                1000            NULL        1010
1000                    0              0                0111            NULL        1001
0111                    0              0                0110            NULL        1000
0110                    0              0                0101            NULL        0111
0101                    0              0                0100            NULL        0110
0100                    0              0                0011            NULL        0101
0011                    0              0                0010            NULL        0100
0010                    0              0                0001            NULL        0011
0001                    0              0                1111            NULL        0010

Figure 5-8. Improved Brandin Catalog—Initial Configuration
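Setting up the Free List is mechanical. The Python sketch below (our own Row class, not NeoCore code) builds the circular, doubly linked initial state of Figure 5-8, with the Forward Pointer of each free row pointing to the next row and the Association pointing to the previous one:

# Initializing the improved (2001-style) Catalog as a Free List: while a
# row is free, its Forward Pointer holds the next free row and its
# Association holds the previous one, wrapping around circularly as in
# Figure 5-8 (rows 0001 through 1111).

class Row:
    def __init__(self, prev_row, next_row):
        self.primary = 0
        self.allocated = 0
        self.association = prev_row    # previous free row while on the Free List
        self.confirmer = None
        self.forward = next_row        # next free row while on the Free List

def make_free_list(first: int, last: int) -> dict:
    rows = {}
    for addr in range(first, last + 1):
        prev_row = last if addr == first else addr - 1
        next_row = first if addr == last else addr + 1
        rows[addr] = Row(prev_row, next_row)
    return rows

catalog = make_free_list(0b0001, 0b1111)
top = catalog[0b1111]
print(f"{top.association:04b} {top.forward:04b}")   # 1110 0001, as in Figure 5-8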

 

Now let’s see what happens when we put “atlanta” and “boston” in the Catalog. Figure 5-9 shows the Catalog after “atlanta” and “boston” are inserted.

1.            We first look at the Allocated Flags in the designated rows 0101 and 0011; since these Flags are not set, we know that the Catalog information will be stored in their natural rows. And so we take rows 0101 and 0011 out of the Free List by changing the Forward Pointer and Association values in the rows before and after 0101 and 0011.

2.            We make the Forward Pointers at both these locations point back to themselves.

3.            Then we set both the Primary Flag and the Allocated Flag in rows 0101 and 0011.

4.            Finally, we enter the Confirmer as before and we use the appropriate data store location for the Association. The same procedure is followed for any insert where the Catalog Entry Address is part of the Free List.

 

Catalog Entry Address   Primary Flag   Allocated Flag   Association     Confirmer   Forward Pointer
1111                    0              0                1110            NULL        0001
1110                    0              0                1101            NULL        1111
1101                    0              0                1100            NULL        1110
1100                    0              0                1011            NULL        1101
1011                    0              0                1010            NULL        1100
1010                    0              0                1001            NULL        1011
1001                    0              0                1000            NULL        1010
1000                    0              0                0111            NULL        1001
0111                    0              0                0110            NULL        1000
0110                    0              0                0100            NULL        0111
0101                    1              1                DB Address #1   1001        0101
0100                    0              0                0010            NULL        0110
0011                    1              1                DB Address #2   1010        0011
0010                    0              0                0001            NULL        0100
0001                    0              0                1111            NULL        0010

Figure 5-9. Improved Catalog—Two Entries

 


 

Now let’s see what happens when we attempt to insert “columbus”. The problem, of course, is that the icon generator returns 0011 as the Catalog Entry Address, and that location is already taken by “boston”. Figure 5-10 shows the state of the Catalog after inserting “columbus”.

1.            The Catalog Controller first observes that the Allocated Flag is set in row 0011. This means that a different row will be needed. As in our earlier example, we will assume that 1100 is chosen. This time, we will first have to take 1100 out of the Free List by changing the links found at the adjacent rows.

2.            Next we look at the Primary Flag in row 0011. In this case, it is set, and so we copy the Forward Pointer at 0011 to the Forward Pointer at 1100, and we set the Forward Pointer at 0011 to point to 1100. Now there is a singly-linked, circular mini-chain of Catalog entries which share the same Catalog Entry Address. The entry at 0011 points to 1100; the entry at 1100 points to 0011.

3.            At 1100, we will set the Allocated Flag but not the Primary Flag. The Allocated Flag establishes that the location is being used; the Primary Flag indicates that the item at that location is being stored in its natural row, i.e., the row indicated by the first part of the icon. Since “columbus” at 1100 will not be in its “natural” position, we want its Primary Flag to be “off.”

4.            We enter the Confirmer and the Association for “columbus” at address 1100.

 

Catalog Entry Address   Primary Flag   Allocated Flag   Association     Confirmer   Forward Pointer
1111                    0              0                1110            NULL        0001
1110                    0              0                1101            NULL        1111
1101                    0              0                1011            NULL        1110
1100                    0              1                DB Address #3   1010        0011
1011                    0              0                1010            NULL        1101
1010                    0              0                1001            NULL        1011
1001                    0              0                1000            NULL        1010
1000                    0              0                0111            NULL        1001
0111                    0              0                0110            NULL        1000
0110                    0              0                0100            NULL        0111
0101                    1              1                DB Address #1   1001        0101
0100                    0              0                0010            NULL        0110
0011                    1              1                DB Address #2   1010        1100
0010                    0              0                0001            NULL        0100
0001                    0              0                1111            NULL        0010

Figure 5-10. Improved Catalog—Three Entries

 


 

Now we will enter into the Catalog our values for “denver”. Figure 5-11 shows the Catalog when we are done.

1.            The Catalog Entry Address for “denver” is 1100. We look at that row and see that the Allocated Flag is set. We need another location, and, as before, select 1110. We take 1110 out of the Free List by redoing the pointers in the links before and after.

2.            Part A. We see that the primary flag at 1100 is not set. So we are going to put the “denver” entry at 1100 and move the existing “columbus” entry to 1110. The first thing we do is find the Forward Pointer which points to 1100 so that we can change it to 1110. Starting with the Forward Pointer at 1100, we follow the singly-linked circular chain as far as it goes until we find the Forward Pointer which points to 1100; in this case that is only one link, the one at 0011. We change the Forward Pointer at 0011 to 1110.
Part B. We copy the entire “columbus” entry—all five columns—from 1100 to 1110.
Part C. We set the Forward Pointer at 1100 to point to itself.

3.            We set both the Allocated Flag and Primary Flag at 1100.

4.            We put the Confirmer and the Association for “denver” into 1100.

 

Catalog Entry Address   Primary Flag   Allocated Flag   Association     Confirmer   Forward Pointer
1111                    0              0                1101            NULL        0001
1110                    0              1                DB Address #3   1010        0011
1101                    0              0                1011            NULL        1111
1100                    1              1                DB Address #4   0010        1100
1011                    0              0                1010            NULL        1101
1010                    0              0                1001            NULL        1011
1001                    0              0                1000            NULL        1010
1000                    0              0                0111            NULL        1001
0111                    0              0                0110            NULL        1000
0110                    0              0                0100            NULL        0111
0101                    1              1                DB Address #1   1001        0101
0100                    0              0                0010            NULL        0110
0011                    1              1                DB Address #2   1010        1110
0010                    0              0                0001            NULL        0100
0001                    0              0                1111            NULL        0010

Figure 5-11. Improved Catalog—Four Entries

 

When we insert a new data item, we compute its icon and use the first part of the icon for the Catalog Entry Address. By looking at the Primary and Allocated Flags in that row, we can tell immediately which model we will use in updating the Catalog. These are the variations and the model used:

 

Allocated Flag   Primary Flag   Model to Use
0                0              atlanta/boston
0                1              No such configuration
1                0              denver
1                1              columbus
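Putting the three models together, the insert logic dispatches on the two flags of the natural row. The Python sketch below continues the Row/make_free_list example above; choosing a free row is simplified to "the first row still on the Free List", so a run of the example inserts produces a layout equivalent to, though not identical with, Figures 5-9 through 5-11:

# Insert for the improved Catalog, dispatching on the Allocated and
# Primary Flags of the natural row (the three models in the table
# above).  Continues the Row / make_free_list sketch; the choice of a
# free row is simplified, so the rows chosen will differ from the
# figures even though the resulting structure is equivalent.

def unlink_from_free_list(rows, addr):
    prev_row, next_row = rows[addr].association, rows[addr].forward
    rows[prev_row].forward = next_row
    rows[next_row].association = prev_row

def any_free_row(rows) -> int:
    for addr, row in rows.items():
        if not row.allocated:
            return addr
    raise RuntimeError("Catalog full")

def insert(rows, entry_address, confirmer, db_address):
    home = rows[entry_address]
    if not home.allocated:                        # "atlanta"/"boston" model
        unlink_from_free_list(rows, entry_address)
        target = entry_address
        home.forward = entry_address              # a one-link chain pointing at itself
    elif home.primary:                            # "columbus" model
        target = any_free_row(rows)
        unlink_from_free_list(rows, target)
        rows[target].forward = home.forward       # splice into the circular mini-chain
        home.forward = target
    else:                                         # "denver" model: natural row held
        target = entry_address                    # by an entry from another chain
        new_home = any_free_row(rows)
        unlink_from_free_list(rows, new_home)
        link = home.forward                       # find the link pointing at this row...
        while rows[link].forward != entry_address:
            link = rows[link].forward
        rows[link].forward = new_home             # ...and repoint it (Part A)
        rows[new_home].primary = 0                # move the displaced entry (Part B)
        rows[new_home].allocated = 1
        rows[new_home].association = home.association
        rows[new_home].confirmer = home.confirmer
        rows[new_home].forward = home.forward
        home.forward = entry_address              # Part C: the row points at itself again
    rows[target].primary = 1 if target == entry_address else 0
    rows[target].allocated = 1
    rows[target].association = db_address
    rows[target].confirmer = confirmer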

 

On a query, the Primary Flag and the Allocated Flag are not used. Processing with the other three columns is identical to processing with the 1985 Brandin Catalog, as explained earlier.

 


 

 

Efficiency of the NeoCore Catalog

 

A single row in the NeoCore Catalog is wider than a single row in a traditional hash table. Nonetheless, contemporary computers with 128-bit memory controllers and 128-bit data busses can handle the full width of the NeoCore Catalog, assuming we are using NeoCore’s standard 64-bit icons (or less).  Therefore a single read will take about the same amount of time whether we are using the traditional hash table or the NeoCore Catalog. So, assuming the table/Catalog is in memory, the efficiencies depend only on the number of times we access the table/Catalog and the number of times we have to access other structures (e.g., the Map Store and the dictionaries) which are larger and therefore typically require an expensive disk read.

 

When a database is nearly empty, a classic hash table works perfectly well: collisions are rare, and so the collision resolution scheme, practically speaking, is irrelevant. For a nearly empty database, accesses will succeed in one hash table lookup, with only rare exceptions. The NeoCore Catalog will also require only a single lookup on reads, so the times for the hash table and the Catalog will be nearly the same. NeoCore will take longer on writes, because it will have to update the row information and the link information. But with a near-empty database, everything is so efficient that the user is probably not even concerned about the difference. The real problem arises when the database contains significant amounts of data.

 

At the other extreme, when a database is nearly full, the classic hash table is a disaster, no matter what collision resolution scheme is used. For a new store command, if there are four trillion entries in the hash table, the number of hash table lookups (and data store lookups, since collision resolution requires a data store lookup) will average two trillion. Under such conditions, traditional hash tables are virtually useless.

 

The efficiencies of the NeoCore Catalog are realized when the database starts getting full. Classic hash tables start having significant problems at about 50% of capacity, gradually generating more and more collisions as more data is input. At about 75%, the curve showing the number of collisions per query is getting quite steep; at 90% the number of collisions is out of control.

 

But the curve for the NeoCore Catalog is nearly flat, even all the way to 100% full!

 

Suppose we start with a Catalog of 4 trillion entries. When the Catalog is, say, half full, the doubly linked Free List has shortened to 2 trillion links, and there are many small singly linked, circular mini-chains. If the Catalog is completely full, the Free List has disappeared, and the whole Catalog consists only of small, circular mini-chains. We never have to look through a substantial portion of the Catalog to find the row we are looking for; we only have to look through the relevant mini-chain.

 

The size of the circular mini-chains determines how efficient the Catalog is. At one extreme, it is logically possible that there is one mini-chain of some four trillion links. At the other extreme, it is logically possible that there are four trillion mini-chains with one link each. While the extremes are logically possible, the statistics which govern the four trillion transactions make such extremes practically impossible. Remember, our icon generator is designed so that every possible Catalog Entry Address is equally probable.

 

If we work through the statistical arithmetic, we will find that in a full Catalog without duplicates, 37% of the rows will be in one-link mini-chains, 37% in two-link chains, 18% in three-link chains, 6% in four-link chains, 1.5% in five-link chains, 0.3% in six-link chains, and 0.05% in seven-link chains. All larger mini-chains combined will contain only about 0.009% of the rows. Chains of more than a dozen or so links will be vanishingly rare, even in a full, four-trillion-row Catalog. And because of the large number of individual transactions (we assumed four trillion), there will not be much statistical deviation from these numbers.

 

Another way of presenting the statistics is to say that over 63% of the rows will have the Primary Flag set, meaning we will find these by reading only one row in the Catalog. Another 26% will need two reads; 8%, three reads; under 2%, four reads. Only about 1% total will need more than four Catalog reads.
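These figures can be checked with a few lines of arithmetic. If the icon generator spreads entry addresses uniformly, the number of keys sharing a natural address in a full Catalog is very nearly Poisson-distributed with mean 1, and the chain-length and read-count percentages quoted above, including the average of 1.5 Catalog accesses, fall out directly. A short Python check:

# Chain-length and read-count statistics for a full Catalog, assuming
# entry addresses are uniformly distributed (Poisson with mean 1).

from math import exp, factorial

def fraction_of_rows_in_chain(k: int) -> float:
    # fraction of Catalog rows that belong to a k-link mini-chain
    return exp(-1) / factorial(k - 1)

def fraction_needing_reads(r: int, limit: int = 60) -> float:
    # fraction of items found after exactly r Catalog reads
    return sum(fraction_of_rows_in_chain(k) / k for k in range(r, limit))

for k in range(1, 8):
    print(f"{k}-link chains hold {fraction_of_rows_in_chain(k):7.2%} of the rows")

average = sum(r * fraction_needing_reads(r) for r in range(1, 60))
print(f"found with a single read: {fraction_needing_reads(1):.1%}")   # about 63%
print(f"average Catalog reads per item: {average:.2f}")               # about 1.50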

 

The bottom line: even with a full Catalog with trillions of items, assuming we do not allow duplicates, the average number of Catalog accesses to find a given item is 1.5.[3] (Compare this to billions and even trillions for a full hash-table.) Furthermore, because of the use of Confirmers, almost none of the multiple Catalog reads will require an extra read of the data store. With a hash table, every read of the table also requires a read of the data store.

 


 

 

 

<< Chapter_4_Icons

Index of HTML Docs

Chapter_6_Search_Architecture >>

 

 



[1] Martin, James, Computer Data-Base Organization, (Prentice-Hall, Inc.: Englewood Cliffs, NJ, 2nd Edition, 1977), Chapter 36, “Associative Memory.”

[2] Even if system resources do not permit keeping the full Catalog in RAM all the time, the operating system will attempt to keep as much in RAM as it can for as long as it can. Hence, we will assume in the rest of this chapter that the Catalog is in RAM. There will be some overhead involved if the operating system has to move portions of the Catalog from RAM to disk and back. This can be extensive: see our note on RAM sufficiency at the start of Chapter 8.

[3] When I first saw this claim in the NeoCore technical marketing materials, I was quite sure there was some sort of “creative” use of mathematics or at least misleading terminology. Either that or they were simply wrong. It takes a fairly good understanding of the NeoCore methodology to be convinced that they are simply correct. That’s why this chapter is fairly long: without a good grasp of how the Catalog works, it is difficult to believe the efficiencies that NeoCore has achieved.