Directed Acyclic Word Graph or DAWG


Introduction | | The DAWG Structure | | How to Create an Optimal DAWG | | C Implamentation | | Dense Boggle Board Solution - Full Disclosure | | Contact Information |


Introduction

    A directed acyclic word graph is simply a data structure that stores a lexicon of character strings or words in an extremely small amount of memory. Any process that makes use of the DAWG data structure traverses it as though it were a trie. A TRIE is a data tree where every prefix is shared to reduce the amount of space required to store a lexicon and reduce the time required to “retrieve” a word. The DAWG data structure results from an attempt to reduce the size of a trie while maintaining it’s near minimal word search time. The way that this is done is to attempt to replace all redundant postfix information. It is critical to understand that all words ending with an “s” or “es” or “tion” or “istic” do NOT necessarily use the same nodes in the DAWG data structure. I will provide an example to clarify this truth succinctly:

Consider the trie of 11 words stored in 23 nodes:

  • 1. ablate
  • 2. ablated
  • 3. ablates
  • 4. ablating
  • 5. ablation
  • 6. ablations
  • 7. abject
  • 8. abjection
  • 9. abjections
  • 10. abjectly
  • 11. abjectness


  •     It can be clearly observed that because there is no “e” following the “t” in “abject,” the “t” contained in "ablate" can not be the same node in a DAWG.  Hence the “tion” shared by each word does not share the same postfix structure, and must not be the same path in the DAWG.  It is true however that the “ons” structure will be shared by both words in a DAWG.  That is only true for the optimal DAWG because the “ons” part of “abjections” is a direct child.  A terminal node in the DAWG is defined as the final element of a child list as well as containing a NULL child reference.  In a minimal node DAWG there will surely be only 26 of these terminal nodes but they will not end every word.  It is also a fact that there will be many more ternimal nodes in a optimal DAWG due to the "direct child law."  In fact, the optimal DAWG for the TWL06 tournament Scrabble word list contains a startling 5956 terminal nodes.  A graphical representation is presented above.  It is not computer generated.  That there is what's know as the real deal!

        Due to the objective of space reduction being the essence of a DAWG, it is important to note that the minimal number of nodes will not produce the smallest DAWG.  Space size of a data structure is given by the following equation:

        STRUCTURE SIZE = (NUMBER OF NODES) x (SIZE OF EACH NODE)

        The size of each node in computer science is not arbitrary.  Due to the 32 bit architecture of most modern computer machines, 32 bits is the de facto standard minimal unit size for any structure type variable.  Standard C treats this as divine truth.  Therefore if the size of each node can be reduced by half and the number of nodes does not double as a result, the total structure size has fallen.


    The DAWG Structure

        A node in a DAWG can be viewed as containing a character, and references to its children.  Now because a single character does not represent an entire word or key, computer science would not refer to (a single character and pointers) as a “node.”  Rather each character would be called an "edge."  In order to simplify this treatment of the DAWG, this convention has been disposed of.  A node in a DAWG is now defined as a character and references to all of its children and a flag to indicate the end of a word.

        We are attempting to minimize the size of a single “node” so that it will be less than or equal to 32 bits while at the same time keeping the number of nodes low enough to justify the reduction of information in each node.

        In short, the 26 letters in the English alphabet require at least 5 bits to store a character.  It also requires a single bit to store information about the end of word flag.  A list of children should make sense because this would reduce the number of references for children to just two.  One reference is for the First Child of the list below a node, and a second reference is for Next node in the current list.

        Experimentation has shown that there is one further reduction that will ultimately reduce the size of the final DAWG, but increase the number of nodes.  The DAWG will be stored into an array, and the Next node reference will be replaced by a single bit flag indicating the end of a list, and the Next node in a list is assumed to be the next element in the array unless the (end of list) bit flag is flying.  This will increase the total number of nodes required because the only nodes that can be tested for redundancy are those that are direct children of another node, in other words, the first node in a Child list.  Clearly every node in the next list and every node below a redundant node will also be removed, so there will not be as much of a discrepancy in the node reduction as one would expect, given this condition.

        This development is far from trivial as it requires a form of tree traversal known as “breadth” first traversal.  This type of traversal is required to enter all of the temporary trie nodes into a set of arrays that will hold pointers to each node where they will be tested for redundancy.  Also, this type of traversal is required once more when assigning index values to all of the nodes that are not redundant.  The complexity stems from the requirement of a secondary data structure know as a queue so that traversal is carried out level by level, as opposed to the standard recursive “depth” first traversal.

        With seven bits spoken for, that leaves 25 bits remaining for the reference to the Child list.  25 bits translates into a maximum number of array elements equal to 33,554,432.  This is great because there will be more information to store in each node down the line for advanced users and experiment shows that the 178,691 words in the TWL06 Scrabble word list fit into exactly 123,669 nodes using the constraints set out above.  That translates into 494,676 Bytes.  That is less than half of a Mega Byte to store the English Language up to and including 15 letter words.  Compare that to the 801,680 Bytes requires to store the least node DAWG with 100,210 nodes.  And that is why we choose the optimal DAWG 10 times out of 10.


    How to Create an Optimal DAWG

        I have broken down the creation of an optimal directed acyclic word graph into 11 not so easy pieces.  I will provide the C source code that I used to create one to anyone who is interested.


    C Implamentation

        Due to popular demand, I have decided to clean up the DAWG creation code and publish it on this page. Download the untampered with files below. The second program published on this page is an alphabetical-order anagrammer. For applications that require speed and arbitrary-order, please refer to the ADTDAWG web page.

    1) Lexicon.txt - TWL06 Scrabble word list in proper text file format
    2) GoogleDawgCreatorSubmit.c - The DAWG-creation C code.
    3) GoogleDawgAnagrammerSubmit.c - The DAWG-testing anagram C code.


    GoogleDawgCreatorSubmit.c
    // This program demonstrates the sleek construction of a lexicon data structure known as the "Directed Acyclic Word Graph".
    // Weighing in at nearly 800 lines, this algorithm was not forged through trivial and comprehensive documentation.
    
    // To set up this traditional DAWG for creation, set up the following 3 conditions.
    // 1) "Lexicon.txt" is a word list, where all characters are capital, and the number of words is written on the very first line. Linux format, so no DOS CR character.
    // 2) Set the constant MAX below to the length of the longest word in the lexicon.  One letter words are invalid in this program.
    // 3) Understand the procedures for extracting information from the unsigned integer nodes.
    
    // Keep in mind that traversing a traditional DAWG is only optimal when it takes place in alphabetical order.
    // The unsigned integer encoding that we are using for this "Dawg" allows for up to 33554432 nodes.  
    // That is very many more nodes than an current English lexicon will require.
    
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #define MAX 15
    #define MIN 2
    #define NUMBER_OF_REDUCED_LETTERS 14
    #define INPUT_LIMIT 30
    #define BIG_IT_UP -32
    #define LETTER_BIT_SHIFT 25
    #define LETTER_BIT_MASK 1040187392
    #define  CHILD_INDEX_BIT_MASK 33554431
    #define END_OF_WORD_BIT_MASK 2147483648
    #define END_OF_LIST_BIT_MASK 1073741824
    #define BINARY_STRING_LENGTH 38
    
    // we will store the DAWG into a ".txt" file to verify the output, and a ".dat" file for use.
    #define RAW_LEXICON "Lexicon.txt"
    #define DAWG_DATA "Dawg_For_Lexicon.dat"
    #define DAWG_TEXT_DATA "Dawg_Text_For_Lexicon.txt"
    
    // When reading strings from a file, the new-line character is appended, and this macro will remove it before processing.
    #define CUT_OFF_NEW_LINE(string) (string[strlen(string) - 1] = '\0')
    
    // C requires a boolean variable type so use C's typedef concept to create one.
    typedef enum { FALSE = 0, TRUE = 1 } Bool;
    typedef Bool* BoolPtr;
    
    // This array streamlines the conversion of digit strings into integers.  Unsigned integers do not exceed the low billions, so ten numbers will suffice.
    unsigned int PowersOfTen[10] = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000};
    unsigned int PowersOfTwo[32] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 
    32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648};
    
    // Returns the unsigned integer rerpresented by "TheNumberNotYet" string, and if not a number greater then zero, returns 0.
    unsigned int StringToUnsignedInt(char *TheNumberNotYet){
      unsigned int Result = 0;
      unsigned int X;
      unsigned int Digits = strlen(TheNumberNotYet);
      for ( X = 0; X < Digits; X++ ) {
        if ( !(TheNumberNotYet[X] >= '0' && TheNumberNotYet[X] <= '9') ) return 0;
        Result += (TheNumberNotYet[X] - '0')*PowersOfTen[Digits - X - 1];
      }
      return Result;
    }
    
    // This Function converts any lower case letters in the string "RawWord", into capitals, so that the whole string is made of capital letters.
    void MakeMeAllCapital(char *RawWord){
      unsigned int X;
      for ( X = 0; X < strlen(RawWord); X++ ){
        if (RawWord[X] >= 'a' && RawWord[X] <= 'z' ) RawWord[X] = RawWord[X] + BIG_IT_UP;
      }
    }
    
    ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    // "Tnode" functionality.  These will be the nodes contained in a "Trie" designed to convert into a "Dawg".
    
    struct tnode {
      struct tnode* Next;
      struct tnode* Child;
      struct tnode* ParentalUnit;
      // When populating the array, you must know the indices of child nodes.  
      // Hence we Store "ArrayIndex" in every node so that we can mine the information from the reduced "Trie".
      int ArrayIndex;
      char DirectChild;
      char Letter;
      char MaxChildDepth;
      char Level;
      char NumberOfChildren;
      char Dangling;
      char EndOfWordFlag;
    };
    
    typedef struct tnode Tnode;
    typedef Tnode* TnodePtr;
    
    // Extracting "Tnode" member data will consume a lot of time when finding redundant nodes, so define these functions as macros.
    // The only negative effect is that the programmer must be diligent to use these constructs on the right data.  Otherwise, this decision makes the program faster.
    #define TNODE_ARRAY_INDEX(thistnode) (thistnode->ArrayIndex)
    #define TNODE_DIRECT_CHILD(thistnode) (thistnode->DirectChild)
    #define TNODE_NEXT(thistnode) (thistnode->Next)
    #define TNODE_CHILD(thistnode) (thistnode->Child)
    #define TNODE_PARENTAL_UNIT(thistnode) (thistnode->ParentalUnit)
    #define TNODE_LETTER(thistnode) (thistnode->Letter)
    #define TNODE_MAX_CHILD_DEPTH(thistnode) (thistnode->MaxChildDepth)
    #define TNODE_NUMBER_OF_CHILDREN(thistnode) (thistnode->NumberOfChildren)
    #define TNODE_END_OF_WORD_FLAG(thistnode) (thistnode->EndOfWordFlag)
    #define TNODE_LEVEL(thistnode) (thistnode->Level)
    #define TNODE_DANGLING(thistnode) (thistnode->Dangling)
    
    TnodePtr TnodeInit(char Chap, TnodePtr OverOne, char WordEnding, char Leveler, int StarterDepth, TnodePtr Parent, char IsaChild){
      TnodePtr Result = (Tnode*)malloc(sizeof(Tnode));
      Result->Letter = Chap;
      Result->ArrayIndex = 0;
      Result->NumberOfChildren = 0;
      Result->MaxChildDepth = StarterDepth;
      Result->Next = OverOne;
      Result->Child = NULL;
      Result->ParentalUnit = Parent;
      Result->Dangling = FALSE;
      Result->EndOfWordFlag = WordEnding;
      Result->Level = Leveler;
      Result->DirectChild = IsaChild;
      return Result;
    }
    
    // Define all of the member-setting functions as macros for faster performance in "Trie" creation.
    #define TNODE_SET_ARRAY_INDEX(thistnode, thewhat) (thistnode->ArrayIndex = thewhat)
    #define TNODE_SET_CHILD(thistnode, nexis) (thistnode->Child = nexis)
    #define TNODE_SET_NEXT(thistnode, nexi) (thistnode->Next = nexi)
    #define TNODE_SET_PARENTAL_UNIT(thistnode, beforeme) (thistnode->ParentalUnit = beforeme)
    #define TNODE_ADD_ONE_CHILD(thistnode) (thistnode->NumberOfChildren += 1)
    #define TNODE_SET_MAX_CHILD_DEPTH(thistnode, newdepth) (thistnode->MaxChildDepth = newdepth)
    #define TNODE_SET_DIRECT_CHILD(thistnode, status) (thistnode->DirectChild = status)
    #define TNODE_FLY_END_OF_WORD_FLAG(thistnode) (thistnode->EndOfWordFlag = TRUE)
    #define TNODE_DANGLE_ME(thistnode) (thistnode->Dangling = TRUE)
    
    // This function Dangles a node, but also recursively dangles every node under it as well, that way nodes that are not direct children do hit the chopping block.
    // The function returns the total number of nodes dangled as a result.
    int TnodeDangle(TnodePtr ThisTnode){
      int Result = 0;
      if ( TNODE_DANGLING(ThisTnode) == TRUE ) return 0;
      if ( TNODE_NEXT(ThisTnode) != NULL ) Result += TnodeDangle(TNODE_NEXT(ThisTnode));
      if ( TNODE_CHILD(ThisTnode) != NULL ) Result += TnodeDangle(TNODE_CHILD(ThisTnode));
      TNODE_DANGLE_ME(ThisTnode);
      Result += 1;
      return Result;
    }
    
    // This function returns the pointer to the Tnode in a parallel list of nodes with the letter "ThisLetter", and returns NULL if the Tnode does not exist.
    // In the "NULL" case, an insertion will be required.
    TnodePtr TnodeFindParaNode(TnodePtr ThisTnode, char ThisLetter){
      if ( ThisTnode == NULL ) return NULL;
      TnodePtr Result = ThisTnode;
      if ( TNODE_LETTER(Result) == ThisLetter ) return Result;
      while ( TNODE_LETTER(Result) < ThisLetter ) {
        Result = TNODE_NEXT(Result);
        if ( Result == NULL ) return NULL;
      }
      if ( TNODE_LETTER(Result) == ThisLetter ) return Result;
      else return NULL;
    }
    
    // This function inserts a new Tnode before a larger letter node or at the end of a Para-List If the list does not exist then it is put at the beginnung.  
    // The new node has "ThisLetter" in it.  "AboveTnode" is the node 1 level above where the new node will be placed.
    // This function should never be passed a "NULL" pointer.  "ThisLetter" should never exist in the "Child" Para-List.
    void TnodeInsertParaNode(TnodePtr AboveTnode, char ThisLetter, char WordEnder, int StartDepth){
      TNODE_ADD_ONE_CHILD(AboveTnode);
      TnodePtr Holder = NULL;
      TnodePtr Currently = NULL;
      // Case 1:  Para-List does not exist yet, so start it.
      if ( TNODE_CHILD(AboveTnode) == NULL ) TNODE_SET_CHILD(AboveTnode, TnodeInit(ThisLetter, NULL, WordEnder, TNODE_LEVEL(AboveTnode) + 1, StartDepth, AboveTnode, TRUE));
      // Case 2: ThisLetter should be the first in the Para-List that already exists.
      else if ( TNODE_LETTER(TNODE_CHILD(AboveTnode)) > ThisLetter ) {
        Holder = TNODE_CHILD(AboveTnode);
        // The "Holder" node will no longer be a direct child, so set it as such.
        TNODE_SET_DIRECT_CHILD(Holder, FALSE);
        TNODE_SET_CHILD(AboveTnode, TnodeInit(ThisLetter, Holder, WordEnder, TNODE_LEVEL(AboveTnode) + 1, StartDepth, AboveTnode, TRUE ));
        // "Holder"'s "ParentalUnit" is now the new node, so make the necessary change.
        TNODE_SET_PARENTAL_UNIT(Holder, TNODE_CHILD(AboveTnode));
      }
      // Case 3: The ParaList exists and ThisLetter is not first in the list.  This is the default case condition:  "if ( TNODE_LETTER(TNODE_CHILD(AboveTnode)) < ThisLetter )".
      else {
        Currently = TNODE_CHILD(AboveTnode);
        while ( TNODE_NEXT(Currently) != NULL ) {
          if ( TNODE_LETTER(TNODE_NEXT(Currently)) > ThisLetter ) break;
          Currently = TNODE_NEXT(Currently);
        }
        Holder = TNODE_NEXT(Currently);
        TNODE_SET_NEXT(Currently, TnodeInit(ThisLetter, Holder, WordEnder, TNODE_LEVEL(AboveTnode) + 1, StartDepth, Currently, FALSE));
        if ( Holder != NULL ) TNODE_SET_PARENTAL_UNIT(Holder, TNODE_NEXT(Currently));
      }
    }
    
    // This function returns "TRUE" if "FirstNode" and "SecondNode" are found to be the parent nodes of equal tree branches.  This includes identical nodes in the current list.
    // The "MaxChildDepth" of the two nodes can not be assumed equal due to the recursive nature of this function, so we must check for equivalence.
    Bool TnodeAreWeTheSame(TnodePtr FirstNode, TnodePtr SecondNode){
      if ( FirstNode == SecondNode ) return TRUE;
      if ( FirstNode == NULL || SecondNode == NULL ) return FALSE;
      if ( TNODE_LETTER(FirstNode) != TNODE_LETTER(SecondNode) ) return FALSE;
      if ( TNODE_MAX_CHILD_DEPTH(FirstNode) != TNODE_MAX_CHILD_DEPTH(SecondNode) ) return FALSE;
      if ( TNODE_NUMBER_OF_CHILDREN(FirstNode) != TNODE_NUMBER_OF_CHILDREN(SecondNode) ) return FALSE;
      if ( TNODE_END_OF_WORD_FLAG(FirstNode) != TNODE_END_OF_WORD_FLAG(SecondNode) ) return FALSE;
      if ( TnodeAreWeTheSame(TNODE_CHILD(FirstNode), TNODE_CHILD(SecondNode)) == FALSE ) return FALSE;
      if ( TnodeAreWeTheSame(TNODE_NEXT(FirstNode), TNODE_NEXT(SecondNode)) == FALSE ) return FALSE;
      else return TRUE;
    }
    
    ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    // "Trie" functionality.
    
    struct trie {
      int NumberOfTotalWords;
      int NumberOfTotalNodes;
      TnodePtr First;
    };
    
    typedef struct trie Trie;
    typedef Trie* TriePtr;
    
    // Set up the "First" node in the Trie.
    TriePtr TrieInit (void){
      TriePtr Result;
      Result = (TriePtr)malloc(sizeof(Trie));
      Result->NumberOfTotalWords = 0;
      Result->NumberOfTotalNodes = 0;
      // The "First" node in the "Trie" will be a dummy node.
      Result->First = TnodeInit('0', NULL, FALSE, 0, 0, NULL, FALSE);
      return Result;
    }
    
    TnodePtr TrieRootTnode(TriePtr ThisTrie){
      if ( ThisTrie->NumberOfTotalWords > 0 ) return TNODE_CHILD(ThisTrie->First);
      else return NULL;
    } 
    
    // This function simply makes "TrieAddWord" look a lot smaller.  It returns the number of new nodes that it just inserted.
    int TnodeAddWord(TnodePtr ParentNode, const char *Word){
      int Result = 0;
      int X, Y;
      int WordLength = strlen(Word);
      TnodePtr HangPoint = NULL;
      TnodePtr Current = ParentNode;
      // Try to follow the path of "Word" until it doesn't exist, and then hang a new chain of nodes to include it in the trie.
      for ( X = 0; X < WordLength; X++ ) {
        HangPoint = TnodeFindParaNode(TNODE_CHILD(Current), Word[X]);
        if ( HangPoint == NULL ) {
          TnodeInsertParaNode(Current, Word[X], ((X == (WordLength - 1))? TRUE: FALSE), WordLength - X - 1);
          Result += 1;
          Current = TnodeFindParaNode(TNODE_CHILD(Current), Word[X]);
          for ( Y = X + 1; Y < WordLength; Y++ ) {
            TnodeInsertParaNode(Current, Word[Y], ((Y == (WordLength - 1))? TRUE: FALSE), WordLength - Y - 1);
            Result += 1;
            Current = TNODE_CHILD(Current);
          }
          break;
        }
        else if ( TNODE_MAX_CHILD_DEPTH(HangPoint) < (WordLength - X - 1) ) TNODE_SET_MAX_CHILD_DEPTH(HangPoint, (WordLength - X - 1));
        Current = HangPoint;
        // The path for the word that we are trying to insert already exists, so just make sure that the end flag is flying on the last node.
        // This should never happen if words are added in alphabetical order, but this is not a requirement of the program.
        if ( X == WordLength - 1 ) TNODE_FLY_END_OF_WORD_FLAG(Current);
      }
      return Result;
    }
    
    // This function adds to "ThisTrie"'s counter variables and adds "NewWord", using "TnodeAddWord", where the real addition algorithm exists.
    void TrieAddWord(TriePtr ThisTrie, char *NewWord){
      ThisTrie->NumberOfTotalWords += 1;
      ThisTrie->NumberOfTotalNodes += TnodeAddWord( ThisTrie->First, NewWord );
    }
    
    // This is a standard depth-first preorder tree traversal, whereby the objective is to count nodes of various "MaxChildDepth"s.
    // The counting results are placed into the "Tabulator" array.  This will be used to streamline the "Trie"-to-"Dawg" conversion process.
    void TnodeGraphTabulateRecurse(TnodePtr ThisTnode, int Level, int* Tabulator){
      if ( Level == 0 ) TnodeGraphTabulateRecurse(TNODE_CHILD(ThisTnode), Level + 1, Tabulator);
      else if ( TNODE_DANGLING(ThisTnode) == FALSE ) {
        Tabulator[TNODE_MAX_CHILD_DEPTH(ThisTnode)] += 1;
        // Go Down if possible.
        if ( TNODE_CHILD(ThisTnode) != NULL ) TnodeGraphTabulateRecurse(TNODE_CHILD(ThisTnode), Level + 1, Tabulator );
        // Go Right through the Para-List if possible.
        if ( TNODE_NEXT(ThisTnode) != NULL ) TnodeGraphTabulateRecurse(TNODE_NEXT(ThisTnode), Level, Tabulator );
      }
    }
    
    // Count the "Tnode"'s in "ThisTrie" for each "MaxChildDepth".  "Count" needs to be an integer array of size "MAX".
    void TrieGraphTabulate(TriePtr ThisTrie, int* Count){
      int *Numbers = (int*)malloc(MAX*sizeof(int));
      int X;
      for ( X = 0; X < MAX; X++ ) Numbers[X] = 0;
      if ( ThisTrie->NumberOfTotalWords > 0) {
        TnodeGraphTabulateRecurse(ThisTrie->First, 0, Numbers);
        for ( X = 0; X < MAX; X++ ) {
          Count[X] = Numbers[X];
        }
      }
      free(Numbers);
    }
    
    // This function can never be called after a trie has been mowed because this will free pointers twice resulting in a core dump!
    void FreeTnodeRecurse(TnodePtr ThisTnode){
      if ( TNODE_CHILD(ThisTnode) != NULL ) FreeTnodeRecurse(TNODE_CHILD(ThisTnode)) ;
      if ( TNODE_NEXT(ThisTnode) != NULL ) FreeTnodeRecurse(TNODE_NEXT(ThisTnode));
      free(ThisTnode);
    }
    
    // An important function, it returns the first node in the list "MaxChildDepthWist", that is identical to "ThisTnode".
    // If the function returns a value equal to "ThisTnode", then it is the first of its kind in the "Trie"
    TnodePtr TnodeMexicanEquivalent( TnodePtr ThisTnode, TnodePtr ** MaxChildDepthWist ) {
      int Tall = TNODE_MAX_CHILD_DEPTH(ThisTnode);
      int X = 0;
      while ( TnodeAreWeTheSame(ThisTnode, MaxChildDepthWist[Tall][X]) == FALSE ) {
        X += 1;
      }
      return MaxChildDepthWist[Tall][X];
    }
    
    // Recursively replaces all redundant nodes in a trie with their first equivalent.
    void TnodeLawnMowerRecurse(TnodePtr ThisTnode, TnodePtr ** HeightWits){
      if ( TNODE_LEVEL(ThisTnode) == 0 ) TnodeLawnMowerRecurse(TNODE_CHILD(ThisTnode), HeightWits);
      else {
        // When replacing a "Tnode", we must do so knowing that "ThisTnode" is how we got to it.
        if ( TNODE_NEXT(ThisTnode) == NULL && TNODE_CHILD(ThisTnode) == NULL ) return;
        if ( TNODE_CHILD(ThisTnode) != NULL ) {
          // we have found a node that has been tagged for mowing, so let us destroy it.
          // Replace it with its first equivelant in the "HeightWits" list, which won't be tagged.
          if ( TNODE_DANGLING(TNODE_CHILD(ThisTnode)) == TRUE ) {
            TNODE_SET_CHILD(ThisTnode, TnodeMexicanEquivalent(TNODE_CHILD(ThisTnode), HeightWits));
          }
          else TnodeLawnMowerRecurse( ThisTnode->Child, HeightWits );
        }
        // Traverse the rest of the "Trie", but a "Tnode" that is not a direct child will never be directly replaced.
        // This will allow the resulting "Dawg" to fit into a contiguous array of node lists.
        if ( TNODE_NEXT(ThisTnode) != NULL ) TnodeLawnMowerRecurse(TNODE_NEXT(ThisTnode), HeightWits);
      }
    }
    
    // Replaces each redundant list in "ThisTrie" with its first equivalent.
    void TrieLawnMower(TriePtr ThisTrie, TnodePtr ** HeightWise){
      TnodeLawnMowerRecurse(ThisTrie->First, HeightWise );
    }
    
    ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    // A queue is required for breadth first traversal, and the rest is self evident.
    
    struct breadthqueuenode {
      TnodePtr Element;
      struct breadthqueuenode *Next;
    };
    
    typedef struct breadthqueuenode BreadthQueueNode;
    typedef BreadthQueueNode* BreadthQueueNodePtr;
    
    // In the spirit of keeping the code fast, use macros for basic functionality.
    #define BREADTH_QUEUE_NODE_NEXT(thisbreadthqueuenode) (thisbreadthqueuenode->Next)
    #define BREADTH_QUEUE_NODE_ELEMENT(thisbreadthqueuenode) (thisbreadthqueuenode->Element);
    #define BREADTH_QUEUE_NODE_SET_NEXT(thisbreadthqueuenode, nexit) (thisbreadthqueuenode->Next = nexit)
    #define BREADTH_QUEUE_NODE_SET_ELEMENT(thisbreadthqueuenode, element) (thisbreadthqueuenode->Element = element)
    
    BreadthQueueNodePtr BreadthQueueNodeInit(TnodePtr NewElement){
      BreadthQueueNodePtr Result = (BreadthQueueNode*)malloc(sizeof(BreadthQueueNode));
      Result->Element = NewElement;
      Result->Next = NULL;
      return Result;
    }
    
    struct breadthqueue {
      BreadthQueueNodePtr Front;
      BreadthQueueNodePtr Back;
      int Size;
    };
    
    typedef struct breadthqueue BreadthQueue;
    typedef BreadthQueue* BreadthQueuePtr;
    
    BreadthQueuePtr BreadthQueueInit( void ) {
      BreadthQueuePtr Result = (BreadthQueue*)malloc(sizeof(BreadthQueue));
      Result->Front = NULL;
      Result->Back = NULL;
      Result->Size = 0;
    }
    
    void BreadthQueuePush(BreadthQueuePtr ThisBreadthQueue, TnodePtr NewElemental){
      BreadthQueueNodePtr Noob = BreadthQueueNodeInit(NewElemental);
      if ( (ThisBreadthQueue->Back) != NULL ) BREADTH_QUEUE_NODE_SET_NEXT(ThisBreadthQueue->Back, Noob);
      else ThisBreadthQueue->Front = Noob;
      ThisBreadthQueue->Back = Noob;
      ThisBreadthQueue->Size += 1;
    }
    
    TnodePtr BreadthQueuePop(BreadthQueuePtr ThisBreadthQueue){
      TnodePtr Result;
      if ( ThisBreadthQueue->Size == 0 ) return NULL;
      if ( ThisBreadthQueue->Size == 1 ) {
        ThisBreadthQueue->Back = NULL;
        ThisBreadthQueue->Size = 0;
        Result = BREADTH_QUEUE_NODE_ELEMENT(ThisBreadthQueue->Front);
        free(ThisBreadthQueue->Front);
        ThisBreadthQueue->Front = NULL;
        return Result;
      }
      Result = BREADTH_QUEUE_NODE_ELEMENT(ThisBreadthQueue->Front);
      BreadthQueueNodePtr Holder = ThisBreadthQueue->Front;
      ThisBreadthQueue->Front = BREADTH_QUEUE_NODE_NEXT(ThisBreadthQueue->Front);
      free(Holder);
      ThisBreadthQueue->Size -= 1;
      return Result;
    }
    
    void BreadthQueuePopulateReductionArray(BreadthQueuePtr ThisBreadthQueue, TnodePtr Root, TnodePtr **Holder){
      printf( "Inside external BreadthQueuePopulateReductionArray function.\n" );
      int Taker[MAX];
      int X = 0;
      int PopCount = 0;
      int CurrentMaxChildDepth = 0;
      TnodePtr Current = Root;
      for ( X = 0; X < MAX; X++ ) Taker[X] = 0;
      // Push the first row onto the queue.
      while ( Current != NULL ) {
        BreadthQueuePush(ThisBreadthQueue, Current);
        Current = TNODE_NEXT(Current);
      }
      // Initiate the pop-followed-by-push-all-children loop.
      while ( (ThisBreadthQueue->Size) != 0 ) {
        Current = BreadthQueuePop(ThisBreadthQueue);
        PopCount += 1;
        CurrentMaxChildDepth = TNODE_MAX_CHILD_DEPTH(Current);
        Holder[CurrentMaxChildDepth][Taker[CurrentMaxChildDepth]] = Current;
        Taker[CurrentMaxChildDepth] += 1;
        Current = TNODE_CHILD(Current);
        while ( Current != NULL ) {
          BreadthQueuePush(ThisBreadthQueue, Current);
          Current = TNODE_NEXT(Current);
        }
      }
      printf( "Completed Populating The Reduction Array.\n" );
    }
    
    // This function will label all of the remaining nodes in the Trie-Turned-Dawg so that they will fit contiguously into an unsigned integer array.
    // The return value is the final index number handed out to the "Tnode"'s, and hence one less than the size of the array that they need to fit into.
    int BreadthQueueUseToIndex(BreadthQueuePtr ThisBreadthQueue, TnodePtr Root){
      // The first index to be given out will be "1" because "0" denotes "NULL". 
      int IndexNow = 0;
      TnodePtr Current = Root;
      // Push the first Para-List onto the queue.
      while ( Current != NULL ) {
        BreadthQueuePush(ThisBreadthQueue, Current);
        Current = TNODE_NEXT(Current);
      }
      // Pop each element off of the queue and only push its children if has not been dangled yet.
      // Assign an index to the node, if one has not been given to it yet. Nodes will be placed on the queue many times.
      while ( (ThisBreadthQueue->Size) != 0 ) {
        Current = BreadthQueuePop(ThisBreadthQueue);
        // If the "Current" node already has an index, don't give it a new one.
        if ( TNODE_ARRAY_INDEX(Current) == 0 ) {
          IndexNow += 1;
          TNODE_SET_ARRAY_INDEX(Current, IndexNow);
          Current = TNODE_CHILD(Current);
          while ( Current != NULL ) {
            BreadthQueuePush(ThisBreadthQueue, Current);
            Current = TNODE_NEXT(Current);
          }
        }
      }
      printf( "Completed Assigning Indices To Living Nodes.\n" );
      return IndexNow;
    }
    
    ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    // This section contains code related to the format of unsigned integers that represent "Dawg" nodes.
    
    // The "BinaryNode" string must be at least 32 + 5 + 1 bytes in length.  Space for the bits, the seperation pipes, and the end of string char.
    void ConvertUnsignedIntNodeToBinaryString(unsigned int TheNode, char* BinaryNode){
      int X;
      int Bit;
      BinaryNode[0] = '|';
      // Bit 31, the leftmost bit, represents the "EndOfWordFlag".
      BinaryNode[1] = (TheNode & PowersOfTwo[31])?'1':'0';
      BinaryNode[2] = '|';
      // Bit 30 represents the "EndOfListFlag".
      BinaryNode[3] = (TheNode & PowersOfTwo[30])?'1':'0';
      BinaryNode[4] = '|';
      // Bit 29 to bit 25 are the "Letter" bits.
      BinaryNode[5] = (TheNode & PowersOfTwo[29])?'1':'0';
      BinaryNode[6] = (TheNode & PowersOfTwo[28])?'1':'0';
      BinaryNode[7] = (TheNode & PowersOfTwo[27])?'1':'0';
      BinaryNode[8] = (TheNode & PowersOfTwo[26])?'1':'0';
      BinaryNode[9] = (TheNode & PowersOfTwo[25])?'1':'0';
      BinaryNode[10] = '|';
      // Bit 24 to bit 0, are space of the first child index.
      Bit = 24;
      for ( X = 11; X <= 35; X++ ) {
        BinaryNode[X] = (TheNode & PowersOfTwo[Bit])? '1': '0';
        Bit -= 1;
      }
      BinaryNode[36] = '|';
      BinaryNode[37] = '\0';
    }
    
    ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    
    // This function is the core of the "Dawg" creation procedure.  Pay close attention to the order of the steps involved.
    
    unsigned int *DawgInit(char **Dictionary, int *SegmentLenghts, int MaxStringLength, int MinStringLength){
      unsigned int *Result;
      int X = 0;
      int Y;
      int *ChildCount;
      char *ChildStrings;
      TriePtr TemporaryTrie;
      int *NodeNumberCounter;
      int *NodeNumberCounterInit;
      int TotalNodeSum = 0;
      TnodePtr ** HolderOfAllTnodePointers;
      BreadthQueuePtr Populator;
      int NumberDangled;
      int TotalDangled = 0;
      int W;
      int DangleCount[MaxStringLength];
      int NumberOfLivingNodes;
      int TotalDangledCheck = 0;
      BreadthQueuePtr OrderMatters;
      int IndexCount;
      int IndexFollow;
      int IndexFollower = 0;
      unsigned int UnsignedEncodingValue;
      FILE *Text;
      FILE *Data;
      char TheNodeInBinary[BINARY_STRING_LENGTH];
      
      printf("\nStep 0 - Get ready to watch the Dawg creation procedure.\n");
      
      while ( SegmentLenghts[X] == 0 ) X += 1;
    
      printf("\nStep 1 - Create the intermediate Trie and begin filling it with the provided lexicon.\n");
      // Create a Temp Trie structure and then feed in the given lexicon.
      TemporaryTrie = TrieInit();
      for ( X = MinStringLength; X <= MaxStringLength; X++ ) {
        for ( Y = 0; Y < SegmentLenghts[X]; Y++ ) {
          TrieAddWord(TemporaryTrie, &(Dictionary[X][(X + 1)*Y]));
        }
      }
    
      printf("\nStep 2 - Finished adding words to the temporary Trie, so set up the space to sort the Tnodes by MaxChildDepth.\n");
      printf("\nStep 2 - Finished adding words to the temporary Trie, so set up the space to sort the Tnodes by MaxChildDepth.\n");
      // Create the associated pointer array with all the nodes inside it.
      NodeNumberCounter = (int*)malloc(MaxStringLength*sizeof(int) );
      for ( X = 0; X < MaxStringLength; X++ ) NodeNumberCounter[X] = 0;
      NodeNumberCounterInit = (int*)malloc(MaxStringLength*sizeof(int));
    
      printf("\nStep 3 - Count the total number of nodes in the raw Trie by MaxChildDepth.\n");
      TrieGraphTabulate(TemporaryTrie, NodeNumberCounter);
      
      printf("\nStep 4 - Initial counting of Trie nodes complete, so display the results.\n\n");
      
      for ( X = 0; X < MaxStringLength; X++ ) {
        NodeNumberCounterInit[X] = NodeNumberCounter[X];
        TotalNodeSum += NodeNumberCounter[X];
      }
      for ( X = 0; X < MaxStringLength; X++ ) {
        printf("Initial Count For MaxChildDepth |%2d| is |%6d| nodes\n", X, NodeNumberCounterInit[X]);
      }
      printf("\nThe Total Number Of Nodes In The Trie = |%d| \n", TotalNodeSum);
      // We will have exactly enough space for all of the Tnode pointers.
    
      printf("\nStep 5 - Allocate a 2 dimensional array of Tnode pointers to minimize the trie into a Dawg.\n");
      HolderOfAllTnodePointers = (TnodePtr **)malloc(MaxStringLength*sizeof(TnodePtr*));
      for ( X = 0; X < MAX; X++ ) HolderOfAllTnodePointers[X] = (TnodePtr *)malloc(NodeNumberCounterInit[X]*sizeof(TnodePtr));
      
      // When populating the "HolderOfAllTnodePointers", we are going to do so in a breadth first manner to ensure that the next "Tnode" in a list located at the next array index. 
      printf("\nStep 6 - Populate the 2 dimensional array of Trie node pointers.\n");
      Populator = BreadthQueueInit();
      BreadthQueuePopulateReductionArray(Populator, TrieRootTnode(TemporaryTrie), HolderOfAllTnodePointers );
    
      printf("\nStep 7 - Population complete.\n");
      // Flag all of the reduntant nodes.
      // Flagging requires the node comparison function that will take a very long time for a big dictionary.
      // This is especially true when comparing the nodes with small "MaxChildDepth"'s because there are so many of them. 
      // It is faster to start with nodes of the largest "MaxChildDepth" to recursively reduce as many lower nodes as possible. 
    
      printf("\nStep 8 - Begin to tag redundant nodes as dangled - Use recursion because only direct children are considered for elimination to keep the remaining lists in tact.\n");
      // Start at the largest "MaxChildDepth" and work down from there for recursive reduction to take place early on to reduce the work load for the shallow nodes.
      for ( Y = (MaxStringLength - 1); Y >= 0 ; Y--) {
        NumberDangled = 0;
        // Move through the holder array from the beginning, looking for any nodes that have not been dangled, these will be the surviving nodes.
        for ( W = 0; W < (NodeNumberCounterInit[Y] - 1); W++ ) {
          // The Node is already Dangling.  Note that this node need not be the first in a child list, it could have been dangled recursively.
          // In order to eliminate the need for the "Next" index, the nodes at the root of elimination must be the first in a list, in other words, a "DirectChild".
          // The node that we replace the "DirectChild" node with can be located at any position.
          if ( TNODE_DANGLING(HolderOfAllTnodePointers[Y][W]) == TRUE ) continue;
          // Traverse the rest of the array looking for equivalent nodes that are both not dangled and are tagged as direct children.
          // When we have found an identical list structure further on in the array, dangle it, and all the nodes coming after, and below it.
          for ( X = W + 1; X < NodeNumberCounterInit[Y]; X++ ) {
            if ( TNODE_DANGLING(HolderOfAllTnodePointers[Y][X]) == FALSE && TNODE_DIRECT_CHILD(HolderOfAllTnodePointers[Y][X]) == TRUE ) {
              if ( TnodeAreWeTheSame(HolderOfAllTnodePointers[Y][W], HolderOfAllTnodePointers[Y][X]) == TRUE ) {
                NumberDangled += TnodeDangle(HolderOfAllTnodePointers[Y][X]);
              }
            }
          }
        }
        printf("Dangled |%5d| Nodes In all, through recursion, for MaxChildDepth of |%2d|\n", NumberDangled, Y);
        DangleCount[Y] = NumberDangled;
        TotalDangled += NumberDangled;
      }
      printf("\nTotal Number Of Dangled Nodes |%d|\n", TotalDangled);
      NumberOfLivingNodes = TotalNodeSum - TotalDangled;
      printf("\nTotal Number Of Living Nodes |%d|\n", NumberOfLivingNodes);
    
      printf("\nStep 9 - Count the number of living nodes in the trie before replacement so that we check our numbers.\n");
      // By running the graph tabulation function on a different array, and before we replace the nodes, we can determine if our numbers are correctish.
      TrieGraphTabulate(TemporaryTrie, NodeNumberCounter);
      for ( X = 0; X < MaxStringLength; X++ ) {
        printf( "Count for living nodes of MaxChildDepth |%2d| is |%5d|. It used to be |%6d| and so the number dangled is |%6d| \n", X, NodeNumberCounter[X],
        NodeNumberCounterInit[X], NodeNumberCounterInit[X] - NodeNumberCounter[X] );
      }
      for ( X = 0; X < MAX; X++ ) {
        TotalDangledCheck += (NodeNumberCounterInit[X] - NodeNumberCounter[X]);
      }
      if ( TotalDangled == TotalDangledCheck ) printf("The total number of nodes dangled adds up.\n");
      else printf("Something went terribly wrong, so fix it.\n");
    
      printf("\nStep 10 - Dangling is complete, so replace all dangled nodes with their first mexican equivelant in the Trie to make a compressed Dawg.\n");
      // Node replacement has to take place before indices are set up so nothing points to redundant nodes. - This step is absolutely critical.  Mow The Lawn so to speak!  Then assign indicies.
      TrieLawnMower( TemporaryTrie, HolderOfAllTnodePointers );
    
      printf("\nStep 11.1 - Mowing of the lawn is now complete, so start to assign array indices to all living nodes.\n");
      printf("Step 11.2 - The use of a breadth first queue during this step ensures that lists of contiguous nodes in the array will eliminate the need for a Next pointer.\n\n");
      OrderMatters = BreadthQueueInit();
      // Try to find out if the nodes we are setting are unique before we set them.
      IndexCount = BreadthQueueUseToIndex( OrderMatters, HolderOfAllTnodePointers[MAX - 1][0] );
      printf("Finished indexing.\n");
      printf("NumberOfLivingNodes from after the dangling process|%d|\n", NumberOfLivingNodes);
      printf("IndexCount from the index-handing-out breadth first traversal |%d|\n", IndexCount);
      if ( NumberOfLivingNodes == IndexCount ) printf("The numbers add up properly once again.\n");
      else {
        printf("The Numbers got Scrooged, so you still have some problems to iron out.\n");
        return NULL;
      }
    
      // Allocate the space needed to store the "Dawg" inside of an array.
      Result = (unsigned int*)calloc((NumberOfLivingNodes + 1), sizeof(unsigned int));
      
      // Roll through the pointer arrays and use the correct "BIT_SHIFT" values to encode the proper unsigned ints.
      // Set the "NULL" entry at the beginning of the array.
      Result[0] = 0;
    
      printf("\nStep 12 - Populate the unsigned integer array with a bitwise encoding.\n");
      // Traverse the entire 2D pointer array and search for undangled nodes.  When an undangled node is found, encode it, and place it at position "ArrayIndex".
      for ( X = (MaxStringLength - 1); X >= 0; X-- ) {
        for (W = 0; W < NodeNumberCounterInit[X]; W++ ){
          if ( TNODE_DANGLING(HolderOfAllTnodePointers[X][W]) == FALSE ) {
            UnsignedEncodingValue = TNODE_LETTER(HolderOfAllTnodePointers[X][W]) - 'A';
            UnsignedEncodingValue <<= LETTER_BIT_SHIFT;
            UnsignedEncodingValue += (TNODE_END_OF_WORD_FLAG(HolderOfAllTnodePointers[X][W]) == FALSE)? 0: END_OF_WORD_BIT_MASK;
            UnsignedEncodingValue += (TNODE_NEXT(HolderOfAllTnodePointers[X][W]) == NULL)? END_OF_LIST_BIT_MASK: 0;
            UnsignedEncodingValue += (TNODE_CHILD(HolderOfAllTnodePointers[X][W]) == NULL)? 0: TNODE_ARRAY_INDEX(TNODE_CHILD(HolderOfAllTnodePointers[X][W]));
            IndexFollow = TNODE_ARRAY_INDEX(HolderOfAllTnodePointers[X][W]);
            if ( IndexFollow > IndexFollower ) IndexFollower = IndexFollow;
            Result[IndexFollow] = UnsignedEncodingValue;
          }
        }
      }
      printf( "IndexFollower, which is the largest index assigned in the array = |%d|\n", IndexFollower );
      printf( "NumberOfLivingNodes|%d|, assert that these two are equal because they must be.\n", NumberOfLivingNodes );
      if ( IndexFollower == NumberOfLivingNodes ) printf("The numbers add up again, excellent!\n");
      else {
        printf("Don't jump!  You are very close to getting this program working.\n");
        return NULL;
      }
      
      // Do Garbage cleanup and free the whole Trie, which is no longer needed.  Free all nodes from the holding array.
      for ( X = 0; X < MaxStringLength; X++ ) for ( W = 0; W < NodeNumberCounterInit[X]; W++ ) free(HolderOfAllTnodePointers[X][W]);
      free(TemporaryTrie);
      free(NodeNumberCounter);
      free(NodeNumberCounterInit);
      for ( X = 0; X < MaxStringLength; X++ ) if (HolderOfAllTnodePointers[X]!= NULL) free(HolderOfAllTnodePointers[X]);
      free(HolderOfAllTnodePointers);
      
      printf("\nStep 13 - Creation of traditional Dawg is complete, so store it into a text file for verification and a 32-bit binary file for use.\n");
      // We now need to include the "NULL" node at position "0" in the living node collection.
      NumberOfLivingNodes += 1;
      
      Text = fopen(DAWG_TEXT_DATA,"w");
      
      fprintf(Text, "Total number of Dawg nodes = |%d|, including \"0 = NULL\".\n\n", NumberOfLivingNodes);
      
      for ( X = 1; X < NumberOfLivingNodes; X++ ) {
        ConvertUnsignedIntNodeToBinaryString(Result[X], TheNodeInBinary);
        fprintf(Text, "Node|%6d|-Letter|%c|-EOW|%3s|-EOL|%3s|-Child|%6d| - Binary%s\n", X, ((Result[X]&LETTER_BIT_MASK)>>LETTER_BIT_SHIFT) + 'A',
        (Result[X]&END_OF_WORD_BIT_MASK)? "yes": "no", (Result[X]&END_OF_LIST_BIT_MASK)? "yes": "no", Result[X]&CHILD_INDEX_BIT_MASK, TheNodeInBinary);
      }
      //printf("Beep\n");
      fclose(Text);
      printf("Out of 32 bit traditional text output to file clean.\n");
      
      Data = fopen(DAWG_DATA,"wb");
      // It is critical, especially in a binary file, that the first integer written to the file be the number of nodes stored in the file.
      // Simply write the entire unsigned integer array "Result" into the data file.
      fwrite(&NumberOfLivingNodes, sizeof(int), 1, Data);
      fwrite(Result, sizeof(unsigned int), NumberOfLivingNodes, Data);
      fclose(Data);
      printf("Out of 32 bit traditional data output to file clean.\n");
      return Result;
    }
    
    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    // The "Main()" function just runs the show.
    int main() {
      int X = 0;
      // All of the words of similar length will be stored sequentially in the same array so that there will be (MAX + 1)  arrays in total.  The Smallest length of a string is assumed to be 2.
      char *AllWordsInEnglish[MAX + 1];
      FILE *Input;
      char ThisLine[INPUT_LIMIT];
      unsigned int FirstLineIsSize;
      int LineLength;
      int KeepTracker[MAX + 1];
      int CurrentTracker[MAX + 1];
      int DictionarySizeIndex[MAX + 1];
      int CurrentLength = 0;
      unsigned int *TheLexiconDawg;
      unsigned int Testing;
      char BinaryStringTester[BINARY_STRING_LENGTH];
      
      printf("The main function lives\n");
      
      for ( X = 0; X <= MAX; X++ ) AllWordsInEnglish[X] = NULL;
      Input = fopen(RAW_LEXICON,"r");
      fgets(ThisLine, INPUT_LIMIT, Input);
      
      // Correction is needed to get rid of the new line character that "fgets()" appends to the string.
      CUT_OFF_NEW_LINE(ThisLine);
      LineLength = strlen(ThisLine);
      FirstLineIsSize = StringToUnsignedInt(ThisLine);
      printf("FirstLineIsSize is |%u|\n", FirstLineIsSize);
      
      // Read the memory straight into ram using dynamically allocated space to count the words of each length.
      for ( X = 0; X <= MAX; X++ ) KeepTracker[X] = 0;
      char **LexiconInRam = (char**)malloc(FirstLineIsSize*sizeof(char*)); 
      
      // The first line gives us the number of words so read in all of them into ram temporarily.
      for ( X = 0; X < FirstLineIsSize; X++ ) {
        fgets(ThisLine, INPUT_LIMIT, Input);
        CUT_OFF_NEW_LINE(ThisLine);
        MakeMeAllCapital(ThisLine);
        LineLength = strlen( ThisLine );
        if ( LineLength <= MAX ) KeepTracker[LineLength] += 1;
        LexiconInRam[X] = (char*)malloc((LineLength + 1)*sizeof(char));
        strcpy(LexiconInRam[X], ThisLine);
      }
      printf("Lexicon.txt read is now complete\n\n");
      for ( X = 0; X <= MAX; X++ ) printf("There are |%5d| words of length |%2d|\n", KeepTracker[X], X);
      // Allocate enough space to hold all of the words in strings so that we can add them to the trie by length.
      for ( X = MIN; X <= MAX; X++ ) AllWordsInEnglish[X] = (char*)malloc((X + 1)*KeepTracker[X]*sizeof(char));
      printf("\nInitial malloc() complete\n");
      
      for ( X = 0; X <= MAX; X++ ) CurrentTracker[X] = 0;
      // Copy all of the strings into "AllWordsInEnglish".
      for ( X = 0; X < FirstLineIsSize; X++ ) {
        CurrentLength = strlen(LexiconInRam[X]);
        // As convoluted as this command may appear, it simply copies a string from its temporary ram location to the array of length equivelant strings for adding to the intermediate "Trie".
        if ( CurrentLength <= MAX ) strcpy(&((AllWordsInEnglish[CurrentLength])[(CurrentTracker[CurrentLength]*(CurrentLength + 1))]), LexiconInRam[X]);
        CurrentTracker[CurrentLength] += 1;
      }
      printf("The words have now been sorted by length\n");
      
      // Make sure that the counting has resulted in all of the strings being placed correctly.
      for ( X = 0; X < (MAX + 1); X++ ) {
        if ( KeepTracker[X] != CurrentTracker[X] ) {
          printf("The number of words is not adding up properly, so fix it.\n");
          return 0;
        }
      }
      
      // Free the initial ram read space
      for ( X = 0; X < FirstLineIsSize; X++ ) free(LexiconInRam[X]);
      free(LexiconInRam);
      
      for ( X = 0; X <= MAX; X++ ) DictionarySizeIndex[X] = KeepTracker[X];
      printf( "The Dawg init function will now be run, so be patient, it will take some time to complete.\n" );
      TheLexiconDawg = DawgInit(AllWordsInEnglish, DictionarySizeIndex, MAX, MIN);
      
      // Free up the array that holds the uncompressed English language.
      for ( X = 0; X <= MAX; X++ ) if ( AllWordsInEnglish[X] != NULL ) free(AllWordsInEnglish[X]);
      
      return 0;
    }
    


    GoogleDawgAnagrammerSubmit.c
    // This program demonstrates the use of a traditional DAWG as applied to in-order anagramming, where the data structure really shines.
    
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    
    #define MAX 15
    #define MAX_INPUT 120
    #define BIG_IT_UP -32
    #define DAWG_DATA "Dawg_For_Lexicon.dat"
    #define WORD_LIST_OUTPUT "Dawg_Word_List.txt"
    
    // These values define the format of the "Dawg" node encoding.
    #define LETTER_BIT_SHIFT 25
    #define LETTER_BIT_MASK 1040187392
    #define CHILD_INDEX_BIT_MASK 33554431
    #define END_OF_WORD_BIT_MASK 2147483648
    #define END_OF_LIST_BIT_MASK 1073741824
    
    // Define the boolean type as an enumeration.
    typedef enum {FALSE = 0, TRUE = 1} Bool;
    typedef Bool* BoolPtr;
    
    // When reading strings from a file, the new-line character is appended, and this macro will remove it before processing.
    #define CUT_OFF_NEW_LINE(string) (string[strlen(string) - 1] = '\0')
    
    // For speed, define these two simple functions as macros.  They modify the "LettersToWorkWith" string in the recursive anagrammer.
    #define REMOVE_CHAR_FROM_STRING(thestring, theposition, shiftsize) ( memmove(thestring + theposition, thestring + theposition + 1, shiftsize) )
    #define INSERT_CHAR_IN_STRING(thestring, theposition, thechar, shiftsize) 
    ( (memmove(thestring + theposition + 1, thestring + theposition, shiftsize)), (thestring[theposition] = thechar) )
    
    // This function converts any lower case letters in the string "RawWord", into capitals, so that the whole string is made of capital letters.
    void MakeMeAllCapital(char *RawWord){
      unsigned int X;
      for ( X = 0; X < strlen(RawWord); X++ ){
        if ( RawWord[X] >= 'a' && RawWord[X] <= 'z' ) RawWord[X] = RawWord[X] + BIG_IT_UP;
      }
    }
    
    // This function removes all non-letter chars from "ThisString".
    void RemoveIllegalChars(char *ThisString){
      unsigned int X;
      for ( X = 0; X < strlen(ThisString); X++ ) {
        if ( !(ThisString[X] >= 'A' && ThisString[X] <= 'Z') && !(ThisString[X] >= 'a' && ThisString[X] <= 'z') ) {
          memmove(ThisString + X, ThisString + X + 1, strlen(ThisString) - X);
          X -= 1;
        }
      }
    }
    
    // This is a simple Bubble Sort.  There is no need for a an optimal algorithm here, because the user input of this program will be very short.
    void Alphabetize(char* Word){
      int X;
      int Y;
      char WorkingChar;
      int WordSize = strlen(Word);
      /* This nested "for" loop structure ensures that the highest letter filters to the back of the string each time that we  increment X in the outer loop. */
      for( X = 1; X < WordSize; X++ ) {
        for( Y = 0; Y <= (WordSize - X - 1); Y++ ) {
          if (Word[Y] > Word[Y + 1]){
            WorkingChar = Word[Y + 1];
            Word[Y + 1] = Word[Y];
            Word[Y] = WorkingChar;
          }
        }
      }
    }
    
    // Define the "Dawg" functionality as macros for speed.
    #define DAWG_LETTER(thearray, theindex) (((thearray[theindex]&LETTER_BIT_MASK)>>LETTER_BIT_SHIFT) + 'A')
    #define DAWG_END_OF_WORD(thearray, theindex) ((thearray[theindex]&END_OF_WORD_BIT_MASK)? TRUE: FALSE)
    #define DAWG_NEXT(thearray, theindex) ((thearray[theindex]&END_OF_LIST_BIT_MASK)? 0: (theindex + 1))
    #define DAWG_CHILD(thearray, theindex) (thearray[theindex]&CHILD_INDEX_BIT_MASK)
    
    // A recursive depth first traversal of "TheDawg" lexicon to produce a readable wordlist in "TheStream".
    void DawgTraverseLexiconRecurse(unsigned int *TheDawg, int CurrentIndex, int FillThisPosition, char *WorkingString, int *TheCount, FILE *TheStream){
      int PassOffIndex;
      WorkingString[FillThisPosition] = DAWG_LETTER(TheDawg, CurrentIndex);
      if ( DAWG_END_OF_WORD(TheDawg, CurrentIndex) ) {
        *TheCount += 1;
        WorkingString[FillThisPosition + 1] = '\0';
        // Include the Windows Carriage Return char.
        fprintf(TheStream, "|%6d|-|%-15s|\r\n", *TheCount, WorkingString);
      }
      if ( PassOffIndex = DAWG_CHILD(TheDawg, CurrentIndex) ) DawgTraverseLexiconRecurse(TheDawg, PassOffIndex, FillThisPosition + 1, WorkingString, TheCount, TheStream);
      if ( PassOffIndex = DAWG_NEXT(TheDawg, CurrentIndex) ) DawgTraverseLexiconRecurse(TheDawg, PassOffIndex, FillThisPosition, WorkingString, TheCount, TheStream);
    }
    
    // Move through the entire "ThisDawg" lexicon, and print the words into "ThisStream".
    void DawgTraverseLexicon(unsigned int *ThisDawg, FILE *ThisStream){
      char *BufferWord = (char*)malloc((MAX + 1)*sizeof(char));
      int *WordCounter = (int*)malloc(sizeof(int));
      *WordCounter = 0;
      // Include the Windows Carriage Return char.
      fprintf(ThisStream, "This is the lexicon contained in the file |%s|.\r\n\r\n", DAWG_DATA);
      DawgTraverseLexiconRecurse(ThisDawg, 1, 0, BufferWord, WordCounter, ThisStream);
      free(BufferWord);
      free(WordCounter);
    }
    
    // This function is the core component of this program.  It requires that "UnusedChars" be in alphabetical order because the tradition Dawg is a list based structure.
    void DawgAnagrammerRecurse(unsigned int *DawgOfWar, int CurrentIndex, char *ToyWithMe, int FillThisPosition, char *UnusedChars, int SizeOfBank, int *ForTheCounter){
      int X;
      char PreviousChar = '\0';
      char CurrentChar;
      int TempIndex = DAWG_CHILD(DawgOfWar, CurrentIndex);
      
      ToyWithMe[FillThisPosition] = DAWG_LETTER(DawgOfWar, CurrentIndex);
      //ToyWithMe[FillThisPosition + 1] = '\0';
      //printf("UnusedChars|%s|\n", UnusedChars);
      if ( DAWG_END_OF_WORD(DawgOfWar, CurrentIndex) ) {
        *ForTheCounter += 1;
        ToyWithMe[FillThisPosition + 1] = '\0';
        printf("|%4d| - |%-15s|\n", *ForTheCounter, ToyWithMe);
      }
      if ( (SizeOfBank > 0) && (TempIndex != 0) ) {
        for ( X = 0; X < SizeOfBank; X++ ) {
          CurrentChar = UnusedChars[X];
          //printf("Looking For |%c|\n", CurrentChar);
          if ( CurrentChar == PreviousChar ) continue;
          do {
            //printf("Compare -|%c| with ^|%c|-|%d|\n", CurrentChar, DAWG_LETTER(DawgOfWar, TempIndex), TempIndex);
            if ( CurrentChar == DAWG_LETTER(DawgOfWar, TempIndex) ) {
              //printf("Bingo\n");
              REMOVE_CHAR_FROM_STRING(UnusedChars, X, SizeOfBank - X);
              DawgAnagrammerRecurse(DawgOfWar, TempIndex, ToyWithMe, FillThisPosition + 1, UnusedChars, SizeOfBank - 1, ForTheCounter);
              //printf("Back|%d|\n", FillThisPosition);
              INSERT_CHAR_IN_STRING(UnusedChars, X, CurrentChar, SizeOfBank - X);
              TempIndex = DAWG_NEXT(DawgOfWar, TempIndex);
              break;
            }
            else if ( CurrentChar < DAWG_LETTER(DawgOfWar, TempIndex) ) break;
          } while ( TempIndex = DAWG_NEXT(DawgOfWar, TempIndex) );
          if ( TempIndex == 0 ) break;
          PreviousChar = CurrentChar;
        }
      }
    }
    
    // This function uses "MasterDawg" to determine the all of the words that can be made from the letters in "CharBank".
    // The return value is the total number of words found.
    int DawgAnagrammer(unsigned int *MasterDawg, char * CharBank){
      int X;
      int Result;
      int BankSize = strlen(CharBank);
      int *ForTheCount = (int*)malloc(sizeof(int));
      char *TheWordSoFar = (char*)malloc((MAX + 1)*sizeof(char));
      char *LettersToWorkWith = (char*)malloc((MAX_INPUT)*sizeof(char));
      char PreviousChar = '\0';
      char CurrentChar;
      int NumberOfLetters;
      strcpy(LettersToWorkWith, CharBank);
      NumberOfLetters = strlen(LettersToWorkWith);
      
      *ForTheCount = 0;
      for ( X = 0; X < BankSize; X++ ) {
        CurrentChar = CharBank[X];
        // Move to the next letter if we have already processed the "CurrentChar".
        if ( CurrentChar == PreviousChar ) continue;
        // We can assume that every letter in the lexicon exists on the first row, so don't bother looking for them.  Just remove the one we are using and plug away.
        REMOVE_CHAR_FROM_STRING(LettersToWorkWith, X, NumberOfLetters - X);
        DawgAnagrammerRecurse(MasterDawg, CurrentChar - '@', TheWordSoFar, 0, LettersToWorkWith, NumberOfLetters - 1, ForTheCount);
        INSERT_CHAR_IN_STRING(LettersToWorkWith, X, CurrentChar, NumberOfLetters - X);
        PreviousChar = CurrentChar;
      }
      Result = *ForTheCount;
      free(ForTheCount);
      free(TheWordSoFar);
      free(LettersToWorkWith);
      return Result;
    }
    
    int main() {
      int NumberOfNodes;
      unsigned int *TheDawgArray;
      FILE *Lexicon;
      FILE *WordList;
      char *DecisionInput;
      Bool FetchData = TRUE;
      char FirstChar;
      int InputSize;
      
      DecisionInput = (char*)malloc(MAX_INPUT*sizeof(char));
      Lexicon = fopen(DAWG_DATA, "rb");
      fread(&NumberOfNodes, sizeof(int), 1, Lexicon);
      printf("The lexicon DAWG contains |%d| nodes.\n", NumberOfNodes);
      TheDawgArray = (unsigned int*)malloc(NumberOfNodes*sizeof(unsigned int));
      fread(TheDawgArray, sizeof(unsigned int), NumberOfNodes, Lexicon);
      fclose(Lexicon);
      printf("\nLexicon data file TWL06 has been read into memory.\n\n");
      
      // This program relies on a compressed lexicon data file, so allow the user to see a readable word list in an output file.
      while ( FetchData == TRUE ) {
        printf("\nWould you like to print the Dawg word list into a file?(Y/N):  ");
        fgets(DecisionInput, MAX_INPUT, stdin);
        FirstChar = DecisionInput[0];
        if ( FirstChar == 'Y' || FirstChar == 'y' ) {
          WordList = fopen(WORD_LIST_OUTPUT, "w");
          DawgTraverseLexicon(TheDawgArray, WordList);
          fclose(WordList);
          FetchData = FALSE;
        }
        else if ( FirstChar == 'N' || FirstChar == 'n' ) FetchData = FALSE;
      }
      FetchData = TRUE;
      
      // Now the user can enter strings of letters for anagramming, to see the words that can be made from them.
      while ( FetchData == TRUE ) {
        printf("\nEnter the string of letters that you want to anagram(2 or more letters):  ");
        fgets(DecisionInput, MAX_INPUT, stdin);
        CUT_OFF_NEW_LINE(DecisionInput);
        RemoveIllegalChars(DecisionInput);
        MakeMeAllCapital(DecisionInput);
        Alphabetize(DecisionInput);
        InputSize = strlen(DecisionInput);
        if ( InputSize >= 2 ) {
          printf("\nThis is the set of letters that you just input |%s|.\n\n", DecisionInput);
          printf("\n|%d| Words were found in the lexicon Dawg.\n", DawgAnagrammer(TheDawgArray, DecisionInput));
        }
        else FetchData = FALSE;
      }
      printf("\nThank you for playing the GoogleDawgAnagrammer.  GAME OVER.\n\n");
      return 0;
    }
    



    Contact Information



    Contact: JohnPaul Adamovsky – ( logarithm69@hotmail.com )

    28 Year Old - Man - Engineer

    Phone: (416) 231-7577

    Hit Counter by Digits



    Go Back to the Top |