README file for 'mvnhmm' program -- implementation of the update rules,
simulation of data, and calculation of statistics for variations of
(M)ulti(V)ariate (N)onhomogeneous (H)idden (M)arkov (M)odels.

May 15, 2002
Updated October 04, 2002
Updated November 01, 2002
Revised and updated April 03, 2003
Revised and updated August 28, 2003
Updated October 03, 2003
Updated October 24, 2003
Updated November 19, 2003
Updated November 20, 2003
Updated December 16, 2003
Updated December 19, 2003
Updated December 23, 2003
Updated December 27, 2003
Updated February 27, 2004
Updated April 06, 2004
Second revision, May 3, 2004
Updated June 21, 2004
Updated July 22, 2004
Updated August 4, 2004
Updated October 12, 2004
Updated April 18, 2005
Updated June 23, 2005
Updated October 26, 2005
Updated November 9, 2005
Updated November 21, 2005
Updated December 12, 2005
Updated February 16, 2006
Updated February 25, 2006
Updated March 16, 2006
Updated April 25, 2006
Updated August 10, 2006
Updated August 27, 2006
Updated August 28, 2006
Updated September 7, 2006
Updated September 12, 2006
Updated September 13, 2006
Updated October 2, 2006

By Sergey Kirshner,
Alberta Ingenuity Centre for Machine Learning (AICML),
University of Alberta, Edmonton, Canada
(formerly of Donald Bren School of Information and Computer Science,
University of California, Irvine, USA).

This file contains the information necessary to run the code that it
accompanies.

Compilation:

To compile the code using g++, run 'make' (default makefile). If the code
is to be run on Sun Solaris, run 'make -f makefile.sun' for faster-running
code. The Sun makefile was tested on a Solaris 8 system.

To compile the code for the Linux platform, run 'make' (default for
Linux). The Linux makefile was developed in part on, and tested on, the
RedHat 9 Linux platform. It was also tested on SuSE 10.0 Linux.

The default make requires the g++ compiler with the standard C and C++
libraries.
g++ version 3.0.4 was used during development.

Running the program:

To run the program, either add the directory with the compiled code to the
path, or run ./mvnhmm from that directory. mvnhmm takes two inputs: the
parameter file name (required) and the seed for the random number
generator (optional, 0 if omitted). If run without any inputs, the program
prints the proper usage information.

Parameter file specification:

Without a proper parameter file, the program will not run. Certain
parameters are required for the program to run; others are needed for
running the program together with outside scripts. For easier readability,
the parameter file allows comments and ignores extra whitespace and tabs.
Also, the order in which the parameters are entered does not matter in
most cases. Any text from the character '#' to the end of the line is
ignored, so use comments as frequently as desired. The parameters are read
in the following manner: first the parameter name is specified, and then
its value.

Currently, the following parameters are defined:

num_states -- number of hidden states in the model. (2 by default)
  (used by all actions except 'analysis' or if 'transition' is defined)
  Type: positive integer.

model_type -- type of the model. ('hmm' by default)
  Type: currently, these types are supported: 'hmm', 'nhmm', 'mixture',
  'nmixture', and 'stateless'.
    'hmm' uses a hidden Markov model;
    'nhmm' uses a non-homogeneous hidden Markov model;
    'mixture' runs a mixture model ignoring the order of observations;
    'nmixture' runs a mixture model with mixing probabilities dependent
      on the values of the input variables;
    'stateless' runs a model without hidden states where the probability
      of each observation is determined from the previous observation and
      the value of the corresponding input variable.
      (NOTE: 'stateless' is currently unavailable)
  'model_type' is not needed if 'transition' is defined.

action -- defines what the program should calculate.
  ('learn' by default)
  Type: currently, these options are implemented: 'learn', 'viterbi',
  'll', 'll-train', 'simulation', 'analysis', 'filling', 'prediction',
  'init', and 'KL'.
    'learn' calculates the parameters of the model using the EM algorithm.
    'viterbi' finds the sequences of most likely states given the data and
      the model.
    'll' and 'll-train' calculate the log-likelihood of the test and train
      set, respectively, given the model.
    'simulation' generates data according to the specified model with or
      without input variable values (Xs).
    'analysis' calculates the statistics of the provided data.
    'filling' provides the evaluation of the model on data using the
      hole-filling method.
    'prediction' computes the log-likelihood of a specified number (given
      by parameter 'lookahead') of consecutive observations (given the
      parameters of the model) for each time point.
    'init' generates multiple initializations of parameters for the model
      and stores them in the output file.
    'KL' computes the entropy of the passed distribution and the
      KL-divergence between the first and the second passed distribution.

transition -- specification of the transition distribution.
  Type: the distribution specification string (same as emission -- see
  below).

emission -- specification of the emission distributions (output states).
  Type: the code and the parameter specifications (below).

  Types of emission distributions -- conditionally-independent
  (super-distribution), univariate Bernoulli, double-chain (conditional
  univariate Bernoulli), Chow-Liu tree, conditional Chow-Liu tree,
  sigmoidal belief network, exponential, gamma, Gaussian, AR (conditional
  Gaussian), conditionally independent, finite mixture, tree-dependent
  mixture.

  Bernoulli
    code: bernoulli
    Parameters -- one parameter: the number of possible outcomes (2 for
    binary data).

  Bernoulli with Dirichlet prior
    code: bernoulli-prior
    Parameters -- two parameters: the number of possible outcomes (2 for
    binary data) and the pseudo-counts for the prior (non-negative), same
    for all outcomes.
  Conditional Bernoulli
    code: chain-bernoulli
    Parameters -- one parameter: the number of possible outcomes.

  Conditional Bernoulli (averaged)
    code: chain-bernoulli-global
    The difference between this distribution and the regular conditional
    Bernoulli is that the first entry in the sequence distribution is
    updated as an average of probabilities over all entries in the
    sequence.
    Parameters -- one parameter: the number of possible outcomes.

  Chow-Liu
    code: chow-liu
    Parameters -- two parameters: the number of discrete variables (nodes
    in the tree), the number of possible outcomes for each node (2 for
    binary data).

  Chow-Liu with MDL penalty on MI terms
    code: chow-liu-mdl
    Parameters -- three parameters: the number of discrete variables
    (nodes in the tree), the number of possible outcomes for each node
    (2 for binary data), the penalty factor for each parameter.

  Chow-Liu with Dirichlet prior on the bivariate probabilities and MDL
  penalty on MI terms
    code: chow-liu-prior
    Parameters -- four parameters: the number of discrete variables (nodes
    in the tree), the number of possible outcomes for each node (2 for
    binary data), the pseudo-counts for the Dirichlet prior (non-negative),
    same for all outcomes, and the penalty factor for each parameter.

  Conditional Chow-Liu
    code: conditional-chow-liu
    Parameters -- two parameters: the number of discrete variables
    (dimension of the distribution variable), the number of possible
    outcomes for each node (2 for binary data).

  Conditional Chow-Liu with MDL penalty on MI terms
    code: conditional-chow-liu-mdl
    Parameters -- three parameters: the number of discrete variables
    (nodes in the tree), the number of possible outcomes for each node
    (2 for binary data), the penalty factor for each parameter.

  Full binary bivariate MaxEnt model
    code: maxent-full
    Parameters -- one parameter: the number of binary variables.
  Bayesian network of conditional binary MaxEnt models with univariate and
  bivariate features (sigmoidal belief network)
    code: BN-maxent
    Parameters -- two parameters: the number of binary variables and the
    MDL penalty factor.

  Conditional Bayesian network of conditional binary MaxEnt models with
  univariate and bivariate features
    code: BN-cond-maxent
    Parameters -- two parameters: the number of binary variables and the
    MDL penalty factor.

  Delta-exponential mixture distribution
    code: delta-exponential
    Parameters -- one parameter: number of mixture components (1+number
    of exponential components).

  Delta-gamma mixture distribution
    code: delta-gamma
    Parameters -- one parameter: number of mixture components (1+number
    of gamma components).

  Dirac delta distribution
    code: delta
    Attributes -- one attribute: location (real-valued).

  Exponential (geometric) distribution
    code: exponential
    Parameters -- one parameter: real-valued.

  Gamma distribution
    code: gamma
    Parameters -- two parameters: real-valued.

  Log-normal distribution
    code: log-normal
    Parameters -- two parameters: real-valued.

  Gaussian
    code: gaussian
    Parameters -- one parameter: dimension.

  Gaussian Auto-Regressive on the outputs
    code: chain-gaussian
    Parameters -- one parameter: dimension.

  Tree-structured Gaussian
    code: tree-gaussian
    Parameters -- one parameter: dimension.

  Conditionally independent
    code: independent
    Parameters -- number of components followed by the distribution for
    all components (assumed the same). For example,
      independent 5 bernoulli 2
    indicates a conditionally-independent distribution with each
    component a 2-state Bernoulli.

  Mixture
    code: mixture
    Parameters -- number of components of the mixture followed by the
    distribution for all components (assumed the same).

  Tree-dependent mixture
    code: cl-mixture
    Parameters -- three parameters (number of variables, number of states
    for each variable, MDL penalty factor) followed by the distributions
    for each state (distributions are assumed the same for all variables).

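  The way composite codes such as 'independent', 'mixture', and
  'cl-mixture' wrap another distribution can be illustrated with a small
  Python sketch. It is not part of mvnhmm and does not claim to mirror its
  parser; the function name parse_emission and the two lookup tables are
  hypothetical, and the parameter counts are simply copied from the listing
  above for a handful of codes.

```python
# Hypothetical helper, not part of mvnhmm: reads a nested emission
# specification such as "independent 5 bernoulli 2" into a small tree.

# Number of numeric parameters following each simple code (subset of the
# codes listed in this README).
SIMPLE_CODES = {
    "bernoulli": 1,
    "bernoulli-prior": 2,
    "chain-bernoulli": 1,
    "chow-liu": 2,
    "chow-liu-mdl": 3,
    "gaussian": 1,
}

# Composite codes: numeric parameters first, then a component distribution.
COMPOSITE_CODES = {"independent": 1, "mixture": 1, "cl-mixture": 3}


def parse_emission(tokens):
    """Consume one specification from the front of tokens.

    Returns (tree, remaining_tokens).
    """
    code, rest = tokens[0], tokens[1:]
    if code in COMPOSITE_CODES:
        count = COMPOSITE_CODES[code]
        params = [float(t) for t in rest[:count]]
        component, rest = parse_emission(rest[count:])
        return {"code": code, "params": params, "component": component}, rest
    count = SIMPLE_CODES[code]  # KeyError for codes this sketch omits
    params = [float(t) for t in rest[:count]]
    return {"code": code, "params": params}, rest[count:]


spec, leftover = parse_emission("independent 5 bernoulli 2".split())
print(spec)
```

  Here the outer 'independent' node holds the number of components and the
  inner 'bernoulli' node holds the number of outcomes shared by all
  components.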
input_dimensionality -- number of input components.
  Type: non-negative integer.
  Types of input distributions -- logistic or trans-logistic.
  Input distributions are assigned according to the type of the model.

xval_type -- the type of cross-validation.
  Type: string -- 'none' (default), 'leave_n_out'.
    'none' -- no cross-validation.
    'leave_n_out' -- leave-n-out cross-validation.

examples_out -- number of examples to leave out for leave-n-out
  cross-validation.
  Type: positive integer (default 1).

data -- file name of the data set. Required for all actions except
  'simulation'.
  Type: string

num_data_sequences -- number of data sequences in the data file.
  (1 by default)
  With action 'simulation', indicates the number of base simulated
  sequences if no input file is provided.
  With action 'analysis', indicates the total number of sequences in the
  data file.
  Type: positive integer

data_sequence_length -- length of each sequence in the data file.
  (1 by default)
  With action 'simulation', indicates the length of the simulated
  sequences if no input file is provided.
  Type: positive integer

data_sequence_length_distinct -- same as data_sequence_length, but with a
  different value for each data sequence.
  Type: num_data_sequences positive integers (one per sequence)

num_discrete_data_components -- the number of finite-valued components for
  each data point in the data set.
  Type: non-negative integer

num_real_data_components -- the number of real-valued components for each
  data point in the data set.
  Type: non-negative integer

model_filename -- file name of the file with the parameters for the
  model(s). (required for 'viterbi', 'll', 'simulation'; optional for
  'learn'.)
  Type: string

output -- file name of the file where the output is sent.
  (stdout by default)
  Type: string

num_restarts -- number of different starting points for EM.
  (1 by default, used only with action 'learn')
  Type: positive integer

input_filename -- file name for the set of values of inputs. Input data
  is required for the input distribution to be used.
  Type: string

num_input_sequences -- number of input data sequences. (1 by default)
  Type: positive integer

input_sequence_length -- length of each input sequence. (1 by default)
  Type: positive integer

input_sequence_length_distinct -- same as input_sequence_length, but with
  one value for each input sequence.
  Type: num_input_sequences positive integers (one per sequence)

num_discrete_input_components -- the number of finite-valued components
  for each data point in the input set.
  Type: non-negative integer

num_real_input_components -- the number of real-valued components for
  each data point in the input set.
  Type: non-negative integer

extra_data_filename -- file name for the set of extra values.
  Type: string

num_extra_sequences -- number of extra data sequences. (1 by default)
  Type: positive integer

extra_sequence_length -- length of each extra sequence. (1 by default)
  Type: positive integer

num_discrete_extra_components -- the number of finite-valued components
  for each data point in the extra data set.
  Type: non-negative integer

num_real_extra_components -- the number of real-valued components for
  each data point in the extra set.
  Type: non-negative integer

state_filename -- file name where hidden states are recorded if the data
  is being simulated. (Hidden states are not recorded if this option is
  not invoked.)
  Type: string

num_models -- number of models in the model file. (1 by default)
  If cross-validation is selected, num_models is adjusted automatically.
  Type: positive integer

em_precision -- the sensitivity threshold for the EM algorithm.
  For the models with hidden states, EM is terminated once the change in
  per-dimension log-likelihood is less than the threshold. For the
  stateless model, the learning algorithm switches to the optimization of
  the parameters for another component once the change in per-bit
  log-likelihood for a particular component is less than the threshold.
  Type: positive real (5E-05 by default)

num_simulations -- number of simulations per sequence per model. Must be
  used with actions 'simulation' and 'analysis'. With action 'analysis',
  indicates the number of simulations used to obtain the data set. The
  total number of simulated sequences is
  num_simulations*num_data_sequences.
  Type: positive integer (1 by default)

analysis -- type of analysis to perform on the data. The following
  options are supported:
    'mean' (default, proportion of rainy observations per station),
    'correlation' (pairwise correlation of rainfall),
    'persistence' (probability that no rain/rain persists for each
      station, summary of the dry/wet spell distribution),
    'dry' (dry spell length counts per station),
    'wet' (wet spell length counts per station),
    'information' (mutual information for pairs of data components),
    'logodds' (log-odds ratio for each pair of components).

filling -- the type of evaluation calculated using hole-filling analysis.
  This type of analysis emulates filling in missing entries. Types of
  analysis supported:
    'log_p' (cumulative log-probability of the left-out data);
    'prediction' (number of correct predictions based on
      P(left out|everything else));
    'missing-probabilities' (marginal probabilities for missing values);
    'hidden-states' (probabilities of hidden states as computed by the
      forward-backward procedure).

lookahead -- number of time points to be predicted. Must be used with
  action 'prediction'.
  Type: positive integer (1 by default)

em_verbose -- option to display verbose messages during EM runs.
  No inputs.

dim-index-display -- option to output/read dimension indices into/from
  the parameter file. May be needed for backward compatibility.

bare-display -- when outputting models, writes only the parameters,
  without descriptions.

initialization -- type of parameter initialization used with
  non-homogeneous models.
  Types of initialization supported:
    'random' (default) -- initialization at random from the allowed range;
    'em' -- solution of the EM on the corresponding homogeneous model.

robust_first_state -- option to use all entries in the data for first
  state probabilities with HMMs.
  No inputs.

Other file formats:

data file -- contains only the data (comments with '#' are ok). Missing
  values can be specified as 'nan'.

input variables file -- contains only the values of input variables. The
  number of entries must be the same as in the data file.

model file -- lists probabilities for the first state (first), entries of
  the transition matrix (second), and parameters for the emissions
  (third). Comments with '#' are ok.

simulation file -- each row contains the data simulated for one
  observation (day). From top to bottom, data is ordered first by
  simulation, then by sequence, then by day.

hidden state file -- each row contains the value of the hidden state.
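
As an illustration of the conventions described above ('#' comments,
ignored extra whitespace, 'nan' for missing values), the following Python
sketch tokenizes a small parameter file and reads a small data file. It
is not part of mvnhmm. The helper names, the file name rain.dat, and the
particular combination of parameter values are hypothetical, and the data
layout (one observation per row, whitespace-separated, sequences stored
one after another) is an assumption rather than something this README
states.

```python
# Hypothetical helpers, not part of mvnhmm.

def strip_comment(line):
    """Drop everything from the first '#' to the end of the line."""
    return line.split("#", 1)[0]


def tokenize(text):
    """Return whitespace-separated tokens with comments removed."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(strip_comment(line).split())
    return tokens


def read_data(text, sequence_length):
    """Parse numeric rows and group them into consecutive sequences.

    ASSUMPTION: one observation per row; 'nan' marks a missing value.
    """
    rows = []
    for line in text.splitlines():
        fields = strip_comment(line).split()
        if fields:
            rows.append([float(field) for field in fields])
    return [rows[i:i + sequence_length]
            for i in range(0, len(rows), sequence_length)]


# Example parameter file: each parameter name is followed by its value.
EXAMPLE_PARAMS = """\
# learn a 3-state HMM on binary data
num_states                    3     # hidden states
model_type                    hmm
action                        learn
emission                      independent 3 bernoulli 2
data                          rain.dat   # hypothetical file name
num_data_sequences            2
data_sequence_length          2
num_discrete_data_components  3
"""

# Example data file: two sequences of two observations each.
EXAMPLE_DATA = """\
# station1 station2 station3
1 0 nan
0 0 1
1 1 1
0 nan 0
"""

param_tokens = tokenize(EXAMPLE_PARAMS)
sequences = read_data(EXAMPLE_DATA, sequence_length=2)
print(param_tokens)
print(sequences)
```

With a parameter file like this saved as, say, params.txt (also a
hypothetical name), the program would be started as './mvnhmm params.txt',
optionally followed by a seed.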