Universal Directed Acyclic Word Graph or UDAWG
Introduction
A universal directed acyclic word graph is a data structure based on a traditional DAWG that stores a lexicon of character strings or “words” in a true bi-directional fashion. The objective of a UDAWG is to allow for rapid pattern matching. That is, to fill a static “word space” with all of the words in a specified lexicon that fulfill the word space constraints. A traditional DAWG data structure is traversed as though it were a trie and the universal variety is only different in that the order of traversal of a word space is defined by the root node that is chosen to start the traversal. The number of root nodes that a UDAWG has is a direct function of the number of characters in the lexicon’s longest word. Each root node represents a unique way to traverse a word space in an ordered fashion.
# of Root Nodes = 2 x ( (Max Number of Chars) - 1 )
Hence two character words will have only two unique traversal paths, whereas fifteen character words will have twenty eight unique paths. Unique paths are defined by two numbers, a reference position, and a distance from that reference position. The reference position can be either the start or end of a word space, and the distance is contained in the set [1, n - 1], where n is the number of character spaces in the word space. Traversal is carried out from the starting position towards the reference position, then from the starting position to the opposite reference position. There are no delimiters for the reference points stored inside of the UDAWG because this information is implicit. It is contained in the choice of the root node. This is a defining characteristic that sets the UDAWG apart from the GADDAG data structure.
The UDAWG Structure
The UDAWG data structure has been designed for performance optimization when a process has access to multiple computation streams; hence, it is a parallel processing construct. The GADDAG, due to the rarity of multiple core systems when it was proposed, is a data structure designed for sequential optimization. With only one root node, the word space being traversed is dynamic, and every word in the lexicon must be considered; making inclusion of the start of word delimiter a necessity. This also demonstrates another shortcoming of the GADDAG. It is not a true bi-directional data structure. Bi-Directionality can only be achieved by using the inverse data structure DAGGAD. The fundamental flaw lies in the reality that a dynamic word space will likely never need only one of the structures for optimal traversal.
The only reference point considered by a GADDAG is the start of a word. The only way to traverse a word space heading to the end of the word first is to start from the first character, or to use the DAGGAD.
Clearly, the path of highest constraint is unattainable and the inclusion of a non-implicit reference point increases the complexity of the graph structure, hence increasing its size. This concept is best demonstrated using an example. I have chosen the word “copy” due to its length and unique characters. ‘&’ will represent the GADDAG start of word delimiter:
GADDAG: & = start reference point
- 1. C&OPY
- 2. OC&PY
- 3. POC&Y
- 4. YPOC&
UDAWG: R = reference point ( S = start = 0, E = End = 1 ), D = displacement [1, n-1],
RN = root node = 2 x (D-1) + R
- 1. COPY RN = 0 4. YPOC RN = 1
- 2. OCPY RN = 2 5. PYOC RN = 3
- 3. POCY RN = 4 6. OPYC RN = 5
The COPY example is provided to demonstrate 4 governing principles. First, the UDAWG contains all directed traversals of the word COPY, and the GADDAG does not. Second, the structure of the GADDAG is more complex and all paths are available when entering from the single root node. Third, though the UDAWG contains more information, nodes of the 6 traversals can be shared without inclusion of the ‘&’ delimiter. Fourth, only a single path is visible once a root node has been chosen; this drastically reduces the size of the lexicon considered.
How to Create an Optimal UDAWG
I have broken down the creation of an optimal universal directed acyclic word graph into 11 not so easy pieces. I will provide the C source code that I used to create one to anyone who is interested.
This procedure is directly related to the creation of a traditional DAWG, which is laid out in detail right HERE. The only difference is that this universal structure has multiple root nodes so that when using a queue to traverse the structure breadth first, it is essential to push all of the “first level” nodes on to the queue before popping any nodes out of the queue. All other procedures are identical. The layout of the single unsigned integer used to store each node will require more bits to store the child reference because there will be more nodes but there are enough extra bits to go around. The UDAWG for the TWL06 Scrabble word list is 6,858 KB in size. It contains 1,755,423 nodes. It will take a modest computer system a very long time to compile an optimal UDAWG. The reason for designing this data structure was to make a real world compromise for the perfect Scrabble machine. Here is the documentation, code, and the program.
Contact Information
Contact: JohnPaul Adamovsky – ( logarithm69@hotmail.com ) - Please feel free to ask me questions.
Phone: (416) 231-7577
This is the mark of my trade...
All The Very Best,
JohnPaul Adamovsky