SMLG (Statistical Machine Learning Group) Discussion Forum

by **efrain.gonzalez0** » Mon Jun 11, 2018 12:25 pm

Good afternoon,

Here I cover how to use my version of the order-search program and the requirements for running it.

Respectfully,

Efrain Gonzalez

by **efrain.gonzalez0** » Mon Jun 11, 2018 3:33 pm

To begin, there are many versions of the code on the SMLG GitLab and all of them can be found under Cpp_Programming/Order Score. So how do you decide which one to use? Below you will find a list of the programs currently (at the time of this post) on the GitLab and a brief explanation as to when you should use them:

MCMC.cpp - This is an old version of the MCMC program used for searching through orders. THIS CODE SHOULD NEVER BE USED AS THERE ARE TOO MANY PROBLEMS WITH IT.
MCMCFASTCASH_percentscore.cpp - This code chooses the order to continue with by checking the order produced by the PrePrior method and by creating an order based on the distribution of the orders calculated so far. This code can handle more than 10 variables. Dr. Yoo has also termed this code the PercentScore Algorithm.
MCMC_Cache_ALL.cpp - This code obtains all orders and their respective scores. There is no smart sampling of orders in this code; it simply cycles through all possible orders using std::next_permutation. It should only be used when you want to score all orders for a dataset that has fewer than 10 variables. Its string implementation restricts it to fewer than 10 variables.
MCMC_Cache_ALL_V2.cpp - This code obtains all orders and their respective scores. There is no smart sampling of orders in this code; it simply cycles through all possible orders using std::next_permutation. It should only be used when you want to score all orders for a dataset. I do not currently know the maximum number of variables this code can handle. It uses a vector of integers instead of strings in order to allow for more variables.
MCMC_NormRatio.cpp - In this code I implemented a normalized version of the original probability that was used to determine whether or not we should continue in the direction of a particular order. Because it is normalized, the probability is always between 0 and 1. Based on our initial impressions, this code performed slightly better than the Paper_MCMC_V2.cpp code. I do not currently know the maximum number of variables this code can handle.
MCMC_PrescreeningSwap.cpp - This code checks every random swap against the priors that the user has set. If no priors were set, then every swap is equally likely. Based on our initial impressions, this code performed significantly better than the Paper_MCMC_V2.cpp code. I do not currently know the maximum number of variables this code can handle. We will probably continue using this code in the future. Dr. Yoo has also termed this code the PrePrior Algorithm.
OrderToStructure.cpp - This code is intended to receive an order as input and output a random structure in xdsl format, meaning that it can be viewed in GeNIe. I am not sure that you will ever need to use this code.
Paper_MCMC.cpp - This code is one of the original MCMC codes and was constantly updated to reflect the much-needed changes that MCMC.cpp required. It should be used if you have fewer than 10 variables, because it uses strings.
Paper_MCMC_V2.cpp - This code is the version of Paper_MCMC.cpp that can handle more than 10 variables. It should be used if you are not interested in the normalized version or in the code that implements pre-screening of swapped variables. Dr. Yoo has also termed this code the Prior Algorithm.
ScoreAnOrder.cpp - This is the first code that I created for scoring an order. No MCMC is implemented here. Only one order is scored. THIS CODE SHOULD NOT BE USED AS IT HAS NOT BEEN UPDATED.
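
To illustrate what "cycling through all possible orders using std::next_permutation" can look like with a vector of integers, here is a minimal sketch. It is not taken from MCMC_Cache_ALL_V2.cpp; the function name for_each_order and the visit callback are hypothetical, and the callback only stands in for whatever scoring the real program does.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Enumerate every order (permutation) of the variable indices 0..n-1.
// `visit` is a hypothetical callback invoked once per order; in a real run
// it would be the place where the order gets scored.
template <typename Visitor>
std::size_t for_each_order(int total_variables, Visitor visit) {
    std::vector<int> order(total_variables);
    std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., n-1 (sorted start)

    std::size_t count = 0;
    do {
        visit(order);
        ++count;
    } while (std::next_permutation(order.begin(), order.end()));
    return count;  // n! orders were visited
}
```

Because the number of orders grows as n!, a full enumeration like this is only practical for small numbers of variables.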

Of the above mentioned codes the most important for future use are Paper_MCMC_V2.cpp, MCMC_PrescreeningSwap.cpp, MCMC_Cache_ALL_V2.cpp, MCMC_NormRatio.cpp, and MCMCFASTCASH_percentscore.cpp.

by **efrain.gonzalez0** » Mon Jun 11, 2018 5:32 pm

Below you will find a list of the programs currently (at the time of this post) on the GitLab and the configuration/settings file that corresponds with that program:

MCMC.cpp - This program has no configuration/settings file; the user enters everything on the terminal for each run.
MCMCFASTCASH_percentscore.cpp - The appropriate configuration/settings file is newwithpercent.config.
MCMC_Cache_ALL.cpp - The appropriate configuration/settings file is new.config.
MCMC_Cache_ALL_V2.cpp - The appropriate configuration/settings file is new.config.
MCMC_NormRatio.cpp - The appropriate configuration/settings file is new.config.
MCMC_PrescreeningSwap.cpp - The appropriate configuration/settings file is new.config.
OrderToStructure.cpp - This program has no configuration/settings file; the user enters everything on the terminal for each run.
Paper_MCMC.cpp - The appropriate configuration/settings file is new.config.
Paper_MCMC_V2.cpp - The appropriate configuration/settings file is newwithpercent.config.
ScoreAnOrder.cpp - This program has no configuration/settings file; the user enters everything on the terminal for each run.

by **efrain.gonzalez0** » Mon Jun 11, 2018 7:15 pm

Below I provide some instructions that clarify how to use the configuration/settings files and what each variable means:

DataFile - This is the location of the file that you want to analyze.
TotalVariables - This is an integer that represents the total number of variables in the above file.
PriorFile - This is the location of the prior file.
MaximumCategory - For each variable, place the maximum category that the variable can take on. Each number must be separated by a space. The order in which the categories are placed must be the same as the order in which the variables are presented within the data file. For example, suppose the first variable can take on the values 0 and 1, the second variable can take on the values 3 and 4, and the third variable can take on the values 0, 1, and 2. Then this line would look like the following:
MaximumCategory = 1 4 2
MinimumCategory - For each variable, place the minimum category that the variable can take on. Each number must be separated by a space. The order in which the categories are placed must be the same as the order in which the variables are presented within the data file. For example, suppose the first variable can take on the values 0 and 1, the second variable can take on the values 3 and 4, and the third variable can take on the values 0, 1, and 2. Then this line would look like the following:
MinimumCategory = 0 3 0
MaximumTime - The longest amount of time that you would want the program to take for the analysis. The program may end up taking a bit longer because it does not interrupt any process when the time has been exceeded and instead waits until the process has completed. The time is in hours. This may be a decimal.
MaximumParents - An integer that represents the maximum number of parents that you will allow for the analysis. The number of parents has a major effect on the amount of time that the program requires to score a single order.
EpsilonDifference/PercentEpsilonDifference - There are two possibilities here. The first is "EpsilonDifference", which is the difference between two log scores. The second is "PercentEpsilonDifference", which is the difference between two log scores expressed as a percentage. Both have the same job, which is to control when the program performs a cut-the-deck operation. The smaller the value, the less likely the program is to perform a cut-the-deck operation. Choosing "EpsilonDifference" to be .001 is the same as choosing "PercentEpsilonDifference" to be .1.
StartingOption - This can take one of two values, Y or n. It tells the program whether you would like it to start its search at a particular order (Y) or at a random order (n).
StartingOrder - This is the order that you would like the program to start at. Make sure that this is always equal in size to the TotalVariables value that you specified.
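
As a rough illustration of how these settings fit together, the sketch below reads "Key = value" lines and checks that MaximumCategory and MinimumCategory each have TotalVariables entries. The "Key = value" layout is an assumption based only on the MaximumCategory example above, and the function names (read_settings, to_ints, categories_consistent) are hypothetical; the actual programs may read their configuration differently.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Trim leading and trailing spaces, tabs, and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Read "Key = value" lines into a map (ASSUMED layout; lines without '=' are skipped).
std::map<std::string, std::string> read_settings(std::istream& in) {
    std::map<std::string, std::string> settings;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;
        settings[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return settings;
}

// Split a space-separated list such as "1 4 2" into integers.
std::vector<int> to_ints(const std::string& text) {
    std::istringstream iss(text);
    std::vector<int> values;
    int v;
    while (iss >> v) values.push_back(v);
    return values;
}

// Check that both category lists have TotalVariables entries and min <= max.
bool categories_consistent(const std::map<std::string, std::string>& s) {
    const std::size_t n = std::stoul(s.at("TotalVariables"));
    const std::vector<int> hi = to_ints(s.at("MaximumCategory"));
    const std::vector<int> lo = to_ints(s.at("MinimumCategory"));
    if (hi.size() != n || lo.size() != n) return false;
    for (std::size_t i = 0; i < n; ++i)
        if (lo[i] > hi[i]) return false;
    return true;
}
```

A check like this catches the most common mistake, which is a category list whose length does not match TotalVariables, before a long run is started.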

by **efrain.gonzalez0** » Mon Jul 02, 2018 6:35 pm

To begin, a prior file is a way for the user to provide the program with some of the background knowledge that the user has obtained by reading the relevant literature on the data/disease being analyzed. By using priors in your order search you will be directing the order search towards an order that takes into account the information that is known by the scientist. The following is an example of the process that one would use to create the prior file:

Let us suppose that you have 5 variables A, B, C, D, and E and they appear in this order within the data file. Then in the prior file A, B, C, D, and E will be represented by 0, 1, 2, 3, and 4, respectively. Now let us suppose that the scientist strongly believes that A comes before C and B, and C comes before D in the order. Then the following is what the prior file would look like:

0(tab)2,1(tab).9
2(tab)3(tab).9

Here one would use the actual tab key on the keyboard to replace "(tab)". In the above example we see that tabs are used to separate groups of variables whereas commas are used to separate variables within each group. Also, notice that the same variable (in this case 2/C) may be used in several lines of the prior file so long as each time it speaks towards a new relationship (here the relationship between 2 and 3, whereas the first line covered the relationships between 0, 1, and 2). In the above example .9 is the probability of the variables in the first group coming before the variables in the second group in the order. The probability is determined by the scientist/user and is completely based on how strong the beliefs of the user/scientist are.

In general the prior file should look something like this:

Group1(tab)Group2(tab)Probability
Group3(tab)Group4(tab)Probability
.
.
.
GroupN-1(tab)GroupN(tab)Probability
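
For readers who want to see the line format spelled out in code, here is a minimal sketch of parsing one prior-file line (tabs between groups, commas within a group, probability last). The PriorRule struct and the function names are hypothetical and are not taken from the programs on the GitLab.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One line of a prior file: the variables in `first` are believed to come
// before the variables in `second` with the given probability.
struct PriorRule {
    std::vector<int> first;
    std::vector<int> second;
    double probability;
};

// Split `text` on a single-character delimiter.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::istringstream iss(text);
    std::string part;
    while (std::getline(iss, part, delim)) parts.push_back(part);
    return parts;
}

// Parse a comma-separated group such as "2,1" into {2, 1}.
std::vector<int> parse_group(const std::string& text) {
    std::vector<int> group;
    for (const std::string& item : split(text, ',')) group.push_back(std::stoi(item));
    return group;
}

// Parse "Group1<TAB>Group2<TAB>Probability"; throws if the line does not
// have exactly three tab-separated fields.
PriorRule parse_prior_line(const std::string& line) {
    const std::vector<std::string> fields = split(line, '\t');
    if (fields.size() != 3) throw std::invalid_argument("expected 3 tab-separated fields");
    return PriorRule{parse_group(fields[0]), parse_group(fields[1]), std::stod(fields[2])};
}
```

With the example above, parsing the first line gives first = {0}, second = {2, 1}, and a probability of .9. A line typed with spaces instead of real tabs is rejected, which is the mistake the "(tab)" note warns about.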

F.A.Q.:

Can I use a probability of 1? Can I use a probability of 0? - Yes, but I would recommend sticking to values strictly between 0 and 1, which express a belief rather than a certainty. If you choose 1, then the program will always go with the order that has the variables in the first group (Group1) before the variables in the second group (Group2). If you choose 0, then the program will always go with the order that has the variables in the second group (Group2) before the variables in the first group (Group1).
Should I leave any spaces at the end of a line? - No. Type the probability and, if you are going to create a new line, immediately press enter; otherwise just save the file. Extra spaces at the end of a line could cause an error.
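
The two F.A.Q. points above can be checked mechanically before a run. The sketch below is only an illustration with hypothetical function names; treating a carriage return (as left by Windows line endings) as trailing whitespace is my own assumption, not something the programs are known to require.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

// Return true if the line ends in a space, tab, or carriage return.
bool has_trailing_whitespace(const std::string& line) {
    if (line.empty()) return false;
    const char last = line.back();
    return last == ' ' || last == '\t' || last == '\r';
}

// Return true if the text after the last tab is a number in [0, 1]
// with nothing else following it.
bool probability_in_range(const std::string& line) {
    const std::size_t tab = line.rfind('\t');
    if (tab == std::string::npos) return false;
    try {
        const std::string field = line.substr(tab + 1);
        std::size_t used = 0;
        const double p = std::stod(field, &used);
        return used == field.size() && p >= 0.0 && p <= 1.0;
    } catch (const std::exception&) {
        return false;  // not a number at all
    }
}
```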

by **efrain.gonzalez0** » Mon Jul 02, 2018 7:31 pm

Good evening,

Before compiling, make sure you have Boost installed on the server that you are using. You can install Boost by using the following command:
sudo apt-get install libboost-all-dev

If you look at the top of some of the programs (Paper_MCMC_V2.cpp, MCMC_PrescreeningSwap.cpp, MCMC_Cache_ALL_V2.cpp, MCMC_NormRatio.cpp, and MCMCFASTCASH_percentscore.cpp) available on the GitLab you will notice that there is a comment box (comments in C++ are labeled with "//"). In the comment box there is a section titled "Compile" which shows an example of how one can compile the file. Below I will give a more general example:

g++ -x c++ -std=c++11 -o NameOfProgram Path/To/The/CppFile

The "NameOfProgram" is whatever the user chooses it to be. The "Path/To/The/CppFile" is the path to the cpp file that contains the code you are interested in compiling.
After you have executed the above command on the terminal you can use ./NameOfProgram to run the program.

If you are using Eclipse to run my program, you will need to make the following changes to Eclipse:

Go to the Window tab in the top left-hand corner and choose Preferences from within this menu. A pop-up should appear. On the left-hand side of the pop-up, click on C/C++. This will open up a menu; from within this menu, click on Build and then on Settings. Now click on the tab labeled Discovery and then click on the CDT GCC Built-in Compiler Settings part of the menu. Below you will see a text box which contains ${COMMAND} ${FLAGS} -E -P -v -dD "${INPUTS}". Add a space to the end of the text box and type -std=c++11. The text box should now look like this: ${COMMAND} ${FLAGS} -E -P -v -dD "${INPUTS}" -std=c++11. Now press the Apply and Close button in the lower right-hand corner.
Go to the Project Explorer and right-click on the project. In the pop-up menu, click Properties and then click on C/C++ Build. Now click on Miscellaneous from within the GCC C++ Compiler menu (this will be the first Miscellaneous that appears on the screen). You will see a text box on the right-hand side that reads as follows: -c -fmessage-length=0. Add a space to the end of the text box and type -std=c++11. The text box should now look like this: -c -fmessage-length=0 -std=c++11. Now press the Apply and Close button in the lower right-hand corner.

by **cwyoo** » Thu Aug 03, 2023 9:00 pm

efrain.gonzalez0 wrote: Below I provide some instructions that clarify how to use the configuration/settings files and what each variable means: [...]

Please note that the DataFile should be tab-delimited, with no header row of variable names. Columns should be the variables being modeled, and rows should be cases.
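
To make the expected layout concrete, here is a minimal sketch of reading such a file: one case per row, one variable per column, tabs between columns, and no header row. The function name read_cases is hypothetical, and treating every value as an integer category code is an assumption based on the MaximumCategory/MinimumCategory examples; this is not the reader used by the programs themselves.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read a tab-delimited table of integer category codes: one row per case,
// one column per variable, and no header row.
std::vector<std::vector<int>> read_cases(std::istream& in, std::size_t total_variables) {
    std::vector<std::vector<int>> cases;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // tolerate a blank final line
        std::vector<int> row;
        std::istringstream fields(line);
        std::string field;
        while (std::getline(fields, field, '\t')) row.push_back(std::stoi(field));
        if (row.size() != total_variables)
            throw std::runtime_error("row does not have TotalVariables columns");
        cases.push_back(row);
    }
    return cases;
}
```

If the file still has a header row of variable names, std::stoi throws on the first line, which makes the formatting problem visible immediately.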
