SMLG (Statistical Machine Learning Group) Discussion Forum

by **efrain.gonzalez0** » Fri Jan 05, 2018 7:45 pm

Here I will post the most recent developments in the code.

by **efrain.gonzalez0** » Fri Jan 05, 2018 7:56 pm

Good evening,

Recently I have been able to use the SMILE library to import data sets and create structures based on an order provided by the user. The program requires two inputs from the user an order and the location of a file that contains the data set. I have tested this code several times using a data set with 8 variables that was provided to me by Lauren thus far it seems to be working well. The code also uses the BOOST library's Bernoulli Distribution to help decide whether a variable should be the parent of another variable. In this manor the structures created are random.

Cheers,

Efrain Gonzalez

by **efrain.gonzalez0** » Fri Feb 09, 2018 3:39 pm

Here I will be looking into the code and explaining how some of the code works. All of this is done with the help of a simple data set that I have attached below.
The order that we will be looking at in this example is <A,B,C>.

PART 1:
Lets begin with the step in the code labeled as follows:

Code: Select all: //Total Families Ui,alpha for a particular variable in the order

The great thing about this part of the code is that so long as the amount of variables in your ordering remains the same the results will be the same. Therefore, I would say that it is independent of the order itself and only dependent on the number of variables in your order and the maximum amount of parents allowed for each variable. The number of entries will always be equivalent to the amount of variables in your order. In our example we only have 3 variables so the vector labeled families will only have three entries. The value of the families vector will be <1,2,4> assuming that we allowed for the maximum amount of parents. The maximum amount of parents in this case is 2 because C which is the last variable in the order can have A and B as parents. If we had set the maximum amount of parents to 1 then the families vector would be <1,2,3>. The code goes one variable at a time so lets work out the code by hand for our example:

Code: Select all: First variable is A: A cannot have any parents so the amount of parents is the same as 0 choose 0 which is 1. Second variable is B: B can be without parents and it can have A as a parent so the total amount of parent sets for B is 2 which is the summation of 1 choose 0 and 1 choose 1 Third variable is C: C can be without parents, it can have A as a parent, it can have B as a parent, and it can have A and B as parents. Therefore the total amount of parent sets for C is 4 which is the summation of 2 choose 0, 2 choose 1, and 2 choose 2.

Since the maximum amount of parents affects the total number of parent sets we sometimes see that the total amount of parent sets for any given variable is not the maximum amount of parent sets.

PART 2:
Now lets look at the part of the code that begins with the comment

Code: Select all: //How many parent combinations for each step? As well as there counts

This part of the code finds the parent combinations (qi) for each parent set (Ui,alpha) and the counts (Nijk) for a particular parent combination. We look at the parent combinations first. The number of parent combinations for each parent set is based entirely on the amount of parents and the amount of categories for each parent. In our example the three variables are binary. The information for these parent combinations is stored in the vector of vectors names ParentCombos. The amount of vectors in this vector of vectors is equivalent to the amount of variables in your order. In this manor the vector in the first row represents the parent combinations for the first variable in the order and the vector in the last row represents the parent combinations for the last variable in the order. The length of each vector in this vector of vectors is dependent on the number of parent sets associated with that variable in the families vector. So for our example with maximum number of parents the ParentCombos will look like this:

Code: Select all: 1 1 2 1 2 2 4

With only 1 parent as the maximum then the ParentCombos vector of vectors will look like this:

Code: Select all: 1 1 2 1 2 2

So what is happening? Recall that each of the variables is binary and that the amount of parent sets for each variable can be found in the vector named families. Again we work through the code by hand for our example:

Code: Select all: First Variable A: A cannot have any parents and so it cannot have any parent combinations and so the value in [u]ParentCombos[/u] must be 1. Second variable is B: B has 2 different parent sets in one B has no parents and in the other B's parent is A. The first case is like the above case for variable A and so [u]ParentCombos[/u] first entry must be 1. In the second case since A has two categories the second entry for [u]ParentCombos[/u] must be 2. Third variable is C: C has 4 different parent sets in one it has no parents, in the other A is its parent, in another B is its parent, and in the last one A and B are its parents. Therefore the total amount of entries into [u]ParentCombos[/u] is 4 for this row. The first entry is again 1 for the same reason as before. The next two entries are 2 because A and B are binary. The last entry is 4 because since A and B are binary the total combinations of A and B is 4.

The second part of this step is calculating the counts for the parent combinations. The full set of counts is stored in the vector of vectors named fullNijkvector. The process for iterating through all of the combinations is complex and requires its own post. Here instead I will focus on helping you understand what each vector in fullNijkvector represents. For our example with maximum parents set to 2 the following is fullNijkvector:

Code: Select all: 12 4 7 9 8 1 4 3 12 4 3 1 9 3 3 1 4 8 2 1 1 0 2 7 2 1

The first row represents the counts when A = 1 and when A = 0, respectively. The second row represents the counts when B = 1 and B= 0, respectively. The third row represents the counts when (A,B) = (1,0), (A,B) =(0,0), (A,B) =(1,1), and (A,B) = (0,1), respectively. The fourth row represents the counts when C = 1 and C = 0 respectively. The fifth row represents the counts when (A,C) = (1,0), (A,C) = (0,0), (A,C) = (1,1), and (A,C) = (0,1), respectively. The sixth row represents the counts when (B,C) = (1,0), (B,C) = (0,0), (B,C) = (1,1), and (B,C) = (0,1), respectively. The seventh row represents the counts when (A,B,C) = (1,1,0), (A,B,C) = (1,0,0), (A,B,C) = (0,1,0), (A,B,C) = (0,0,0), (A,B,C) = (1,1,1), (A,B,C) = (1,0,1), (A,B,C) = (0,1,1), and (A,B,C) = (0,0,1), respectively.

by **efrain.gonzalez0** » Fri Feb 16, 2018 12:36 pm

Good afternoon,

I have posted two sprinkler data sets here. One is the original file that Dr. Yoo provided in the PGM class and the second is a transformation that I created in R in order to make the file friendly for my code.

Respectfully,

Efrain Gonzalez

by **cwyoo** » Fri Feb 16, 2018 7:42 pm

efrain.gonzalez0 wrote:Good afternoon,

I have posted two sprinkler data sets here. One is the original file that Dr. Yoo provided in the PGM class and the second is a transformation that I created in R in order to make the file friendly for my code.

Respectfully,

Efrain Gonzalez

The generating Bayesian network structure can be found at: http://smlg.fiu.edu/phpbb/viewtopic.php?f=38&t=110&p=663&hilit=sprinkler#p663

Also attached is the Sprinkler Bayesian network in GeNIe http://smlg.fiu.edu/phpbb/viewforum.php?f=32 data format.

by **efrain.gonzalez0** » Mon Feb 26, 2018 11:51 pm

Recently I realized that my own method for iterating through all combinations of parent sets could be improved and made simpler. I have not implemented this new method but it is possible that such a method may be made:
Let 1 represent that a parent is included in the set and let 0 represent the opposite. Then for the fourth variable a possible parent set is 111, which means that all variables that come before the fourth variable in the order are the parents of the fourth variable. We can get all parent sets for the fourth variable by subtracting one at each step and then adjusting when a nine occurs in the outcome. The following shows this method in the form of an example:

Code: Select all: 111 (subtract 1) 110 (subtract 1) 109 (adjust) --> 101 (subtract 1) 100 (subtract 1) 099 (adjust) --> 011 (subtract 1) 010 (subtract 1) 009 (adjust) --> 001 (subtract 1) 000 (done) this could represent the empty set

Of course each one of these steps requires a transformation from a vector or array to an integer and vice versa. The former (vector<int> to int) can probably be done by using the position of a 1 and multiplying by the appropriate amount of 10s (ex <1,1,0,1> is 10^3 + 10^2 + 10^0). The latter (int to vector<int>) may require some use of strings in order to extract one value at a time and then will require a conversion from string to integer (ex "1101" is <"1","1","0","1"> which is <1,1,0,1>). However, it may also be possible to avoid using the strings by coming up with a reverse system that would use the amount of 10s to place 1s in appropriate locations in the vector.

The reason that I bring this up is that the logic in this type of reverse odometer is easier to follow than the one I have implemented in the different scoring codes. For efficiency purposes I believe that this method may even be slightly more efficient because it seems on a first glance that it would require less for loops and less if and else statements. Of course this new implementation would require changes to the rest of the code because the pattern which it follows does not match that of the original code.

by **efrain.gonzalez0** » Tue Mar 06, 2018 2:35 pm

Good afternoon,

I have attached below the file for sprinkler data with 1000 observations.

-Efrain Gonzalez

by **efrain.gonzalez0** » Fri Mar 09, 2018 2:50 pm

Good afternoon,

To begin, your data probably looks something like the following:

Code: Select all: Cheese Ham Eggs Sandwich 1 0 1 0 1 1 0 1 0 1 1 0

In the above the columns are variables and the rows are observations this is similar to the way that Banjo requires its inputs.

First remove the header line (the line with the names) so that your new data looks like this:

Code: Select all
1 0 1 0 1 1 0 1 0 1 1 0

Save this as a new file. The header line when read by the current code will cause an error and so it must be removed. Do not fear though the code will use the positioning of each value in order to differentiate between variables. For example the first, second, and third column will be named "0","1", and "2", respectively. Using the fact that each column is named in this way we will later be able to match these numbers to there actual variable names. Make sure that you have kept both the original data file and the one without the header line so that we may know the actual variables that were used.
Copy or Download the MCMC code off of GitLab and save it onto your server. If you copied the code make sure that you save the file as a .cpp. Then you must use the following commands to compile the code on the server:
Code: Select all
g++ -x c++ -std=c++11 -o MCMCE (filepath)/(MCMCfilename).cpp

You must replace (filepath) with the correct path to the MCMC file and replace (MCMCfilename) with the correct name for the file containing the code.
If you want to use prior information the following are some additional steps that you must take (I use the above example data set in this list):
1. Create a new text file. You must remember the name and file path for this text file.
2. Let us state that based on prior information Ham and Eggs should come before Cheese and Sandwich in our order. Let us also state that we are fairly certain of this case and so we assign a probability of .9 to the fact that we are correct and .1 to the fact that we are wrong. Our input into the text file should look as follows:
  Code: Select all
  1,2(tab).9(tab)0,4(tab).1
  
  Where (tab) represents an actual tab in your computer because we are creating a tab delimited text file. As you can see 0 represents Cheese and 4 represents Sandwich. In this case we have only four variables and so it is easy for us to enumerate all of the variables but in the case that we want a quick way to say that the ones on the left are higher in order than all the other variables and so we can use the following notation:
  Code: Select all
  1,2(tab).9(tab)Rest(tab).1 We are following this outline: (variables separated by commas) tab (probability for previous variables) tab (another set of variables separated by commas) tab (another probability for previous variables)
  
  In this way the Rest will help us consider all variables that are not considered in the first statement (which here is 1,2). At the time that this is being posted this is not implemented yet. Note here that commas are used to separate the variables and tabs are used to separate the information. If a variable is not included in either set then there is no prior information for this variable.
3. Save this file.
Now you may run the code by using the following command on the terminal from your home directory ./MCMCE.
You will first be asked for a data file. Then you will be asked for how many variables are in your file followed by the file name for the file containing the prior information. These file names include the path of the file. For example in my case a file name for the data set may be /home/efraingonzalez0/newsprinkler100.txt. Next you are asked for the maximum categorical value of each of the variables in the data. So for gene data the maximum value should be 2 for each gene. If you have 2 binary clinical variables and 2 gene variables then your input may look like this 1 1 2 2. Next you are asked for the minimum categorical value of each of the variables in the data so your input will most likely be all zeros. Then you are asked for the file that you want to export the order and score information to (depending on when you are reading this the feature mentioned may have been removed for efficiency purposes). Now you will set the amount of unique orders you want the code to run through, the maximum amount of time in hours that you want the code to run through, the maximum amount of parents you want to consider, the epsilon difference value that you want to set, and whether or not you want to start with a particular order.

I have attached a screenshot of what the code looks like when you run it.

by **efrain.gonzalez0** » Fri Mar 09, 2018 5:43 pm

Good evening,

These are some results from running the code.

by **efrain.gonzalez0** » Wed Mar 14, 2018 3:34 pm

Good afternoon,

Here I have attached the files that contain the results of running the MCMC order scoring code. I used the Sprinkler data set with 100 observations. As can be seen in the screenshots I used the same starting order in each instance. The starting order was <4,5,7,2,0,1,6,3>. This translates to <MyLawnWet,Fertilizer,MyLawnGreener,NBLawnWet,Sprinkler,Rain,MyLawnGreen,NBLawnGreen>. The sprinkler model suggests that we should expect a final order similar to <0,1,2,4,5,6,3,7> which translates to <Sprinkler,Rain,NBLawnWet,MyLawnWet,Fertilizer,MyLawnGreen,NBLawnGreen,MyLawnGreener>. You will notice that the phrase "Maximum repeats achieved" appears in the screen shots. The phrase signifies that one of the orders was repeated five times. In this case it means five times because I hard coded it to this value. The high probability case uses the priors .9 and .1 whereas the low probability case uses the priors .6 and .4. Since the sprinkler data set has a only eight variables, I used seven as the maximum amount of parents to be considered.
Times:

Random: 8s
Correct with high probability: 87s
Correct with low probability: 120s
Wrong with high probability: 68s
Wrong with low probability: 77s

Cheers,

Efrain Gonzalez

SMLG (Statistical Machine Learning Group) Discussion Forum

Order Scoring and Structure Creation

Order Scoring and Structure Creation

Structures based on Orders

Re: Understanding The Order Scoring Code

Sprinkler Data sets

Re: Sprinkler Data sets

New Reverse Odometer

Sprinkler with 1000 samples

How to use new MCMC order search code?

Results from running C++ code

Results with prior information

Who is online