mvnhmm README

README file for 'mvnhmm' program -- implementation of the update rules,

simulation of data, and calculation of statistics for variations of

(M)ulti(V)ariate (N)onhomogeneous (H)idden (M)arkov (M)odels.


May 15, 2002

Updated October 04, 2002

Updated November 01, 2002

Revised and updated April 03, 2003

Revised and updated August 28, 2003

Updated October 03, 2003

Updated October 24, 2003

Updated November 19, 2003

Updated November 20, 2003

Updated December 16, 2003

Updated December 19, 2003

Updated December 23, 2003

Updated December 27, 2003

Updated February 27, 2004

Updated April 06, 2004

Second revision, May 3, 2004

Updated June 21, 2004

Updated July 22, 2004

Updated August 4, 2004

Updated October 12, 2004

Updated April 18, 2005

Updated June 23, 2005

Updated October 26, 2005

Updated November 9, 2005

Updated November 21, 2005

Updated December 12, 2005

Updated February 16, 2006

Updated February 25, 2006

Updated March 16, 2006

Updated April 25, 2006

Updated August 10, 2006

Updated August 27, 2006

Updated August 28, 2006

Updated September 7, 2006

Updated September 12, 2006

Updated September 13, 2006

Updated October 2, 2006


By Sergey Kirshner,

Alberta Ingenuity Centre for Machine Learning (AICML), University of Alberta,

Edmonton, Canada (formerly of Donald Bren School of Information and Computer

Science, University of California, Irvine, USA).


This file contains information necessary to run the code which this file

accompanies.



Compilation:

To compile the code using g++, run 'make' (default makefile).  If the code is

to be run on Sun Solaris, run 'make -f makefile.sun' for faster-running code.

The Sun makefile was tested on Solaris 8.  On Linux, run 'make' (the default

makefile targets Linux).  The Linux makefile was developed in part and tested

on RedHat 9, and was also tested on SuSE Linux 10.0.  The default makefile

requires the g++ compiler with the standard C and C++ libraries.  g++ version

3.0.4 was used during development.



Running the program:

To run the program, either add the directory with the compiled code to your

PATH, or run ./mvnhmm from that directory.  mvnhmm takes two inputs: the

parameter file name (required) and the seed for the random number generator

(optional, 0 if omitted).  If run without any inputs, the program prints the

proper usage information.
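
The invocation described above can be sketched from a driver script.  This is

only an illustration: the helper names and the file name 'params.txt' are

hypothetical, and the only facts taken from this README are the executable

name and the order of the two inputs.

```python
import subprocess


def build_mvnhmm_command(parameter_file, seed=None, executable="./mvnhmm"):
    """Return the argument list: executable, parameter file, optional seed."""
    command = [executable, parameter_file]
    if seed is not None:
        # The seed is the optional second input; the program uses 0 if omitted.
        command.append(str(seed))
    return command


def run_mvnhmm(parameter_file, seed=None):
    """Run the compiled program (assumes ./mvnhmm exists in the current directory)."""
    return subprocess.run(build_mvnhmm_command(parameter_file, seed), check=True)


# Hypothetical parameter file name, shown only to illustrate the argument order.
print(build_mvnhmm_command("params.txt", seed=42))
```

Building the argument list separately from running it makes it easy to check

the command before launching a long EM run.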



Parameter file specification:

Without a proper parameter file, the program will not run.  Certain

parameters are required for the program to run; others are needed only for

running the program together with outside scripts.  For easier readability,

the parameter file allows comments and ignores extra spaces and tabs.

Also, in most cases the order in which the parameters are entered does not

matter.


Any text from the character '#' to the end of the line is ignored, so

comments can be used as frequently as desired.  Each parameter is read as

follows: first the parameter name is given, then its value (or values).

Currently, the following parameters are defined:
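
As an illustration of the comment and whitespace rules, the sketch below

strips '#' comments and splits the rest into tokens.  It is not the parser

used by mvnhmm; the function name and the sample file contents (including

the data file name 'rain.txt') are hypothetical, while the parameter names

are the ones documented in this file.

```python
def tokenize_parameter_text(text):
    """Drop '#' comments and split the remaining text on spaces and tabs."""
    tokens = []
    for line in text.splitlines():
        # Everything from '#' to the end of the line is a comment.
        line = line.split("#", 1)[0]
        # str.split() with no arguments ignores repeated spaces and tabs.
        tokens.extend(line.split())
    return tokens


# Hypothetical parameter file contents.
example = """
# hypothetical parameter file
num_states 3        # three hidden states
model_type hmm
emission independent 5 bernoulli 2   # five binary components
data\train.txt
em_verbose   # option without a value
"""

print(tokenize_parameter_text(example))
```

Each parameter name is followed by as many tokens as that parameter takes

(none for options such as em_verbose).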



num_states -- number of hidden states in the model. (2 by default)

(used by all actions except 'analysis'; not needed if 'transition' is defined)

Type: positive integer.



model_type -- type of the model. ('hmm' by default).

Type: currently, these types are supported: 'hmm', 'nhmm', 'mixture', 

'nmixture', and 'stateless'.  'hmm' uses hidden Markov model; 'nhmm' uses 

non-homogeneous hidden Markov model; 'mixture' runs mixture model ignoring the 

order of observations; 'nmixture' runs a mixture model with mixing

probabilities dependent on the values of the input variables; 'stateless' runs

model without hidden states where the probability of each observation is 

determined from the previous observation and the value of the corresponding

input variable.  (NOTE: 'stateless' is currently unavailable)

'model_type' is not needed if 'transition' is defined.


action -- defines what the program should calculate. ('learn' by default)

Type: currently, these options are implemented: 'learn', 'viterbi', 'll',

'll-train', 'simulation', 'analysis', 'filling', 'prediction', 'init', and

'KL'.  'learn' calculates the parameters of the model using the EM algorithm.

'viterbi' finds the sequences of most likely states given the data and the

model.  'll' and 'll-train' calculate the log-likelihood of the test and

training sets, respectively, given the

model.  'simulation' generates data according to the specified model with or

without input variable values (Xs). 'analysis' calculates the statistics of the

provided data.  'filling' provides the evaluation of the model on data using

hole-filling method.  'prediction' computes the log-likelihood of specified (by

parameter 'lookahead') consecutive observations (given the parameters of the

model) for each time point.  'init' generates multiple initializations of

parameters for the model and stores them in the output file.  'KL' computes the

entropy of the first passed distribution and the KL-divergence between the

first and the second passed distributions.
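
For reference, the two quantities reported by action 'KL' can be sketched for

the simplest case of two explicitly listed discrete distributions.  This is an

illustration under stated assumptions (natural logarithm, probabilities given

as lists over the same outcomes); how mvnhmm evaluates these quantities for

its own model classes is not described here.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL-divergence KL(p || q) in nats; infinite if q misses support of p."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


first = [0.5, 0.5]
second = [0.25, 0.75]
print(entropy(first), kl_divergence(first, second))
```

Note that the KL-divergence is not symmetric, so the order of the two

distributions matters.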


transition -- specification of the transition distribution.

Type: the distribution specification string (same as emission -- see below).


emission -- specification of the emission distributions (output states).

Type: The code and the parameter specifications (below).


Types of emission distributions -- conditionally-independent

(super-distribution), univariate Bernoulli, double-chain (conditional 

univariate Bernoulli), Chow-Liu tree, conditional Chow-Liu tree,

sigmoidal belief network, exponential, gamma, Gaussian, AR (conditional

Gaussian), conditionally independent, finite mixture, tree-dependent

mixture.


Bernoulli code: bernoulli

Parameters -- one parameter: the number of possible outcomes (2 for binary

data).

Bernoulli with Dirichlet prior: bernoulli-prior

Parameters -- two parameters: the number of possible outcomes (2 for binary

data) and the pseudo-counts for the prior (non-negative) same for all

outcomes.

Conditional Bernoulli code: chain-bernoulli

Parameters -- one parameter: the number of possible outcomes.

Conditional Bernoulli (averaged) code: chain-bernoulli-global

The difference between this distribution and regular conditional Bernoulli

is that the first entry in the sequence distribution is updated as an

average of probabilities over all entries in the sequence.

Parameters -- one parameter: the number of possible outcomes.

Chow-Liu code: chow-liu

Parameters -- two parameters: the number of discrete variables (nodes in the

tree), the number of possible outcomes for each node (2 for binary data).

Chow-Liu with MDL penalty on MI terms code: chow-liu-mdl

Parameters -- three parameters: the number of discrete variables (nodes in the

tree), the number of possible outcomes for each node (2 for binary data),

the penalty factor for each parameter. 

Chow-Liu with Dirichlet prior on the bivariate probabilities and MDL penalty

on MI terms code: chow-liu-prior

Parameters -- four parameters: the number of discrete variables (nodes in the

tree), the number of possible outcomes for each node (2 for binary data),

pseudo-counts for Dirichlet prior (non-negative), same for all outcomes;

the penalty factor for each parameter.

Conditional Chow-Liu code: conditional-chow-liu

Parameters -- two parameters: the number of discrete variables (dimension of

the distribution variable), the number of possible outcomes for each node (2

for binary data).

Conditional Chow-Liu with MDL penalty on MI terms

code: conditional-chow-liu-mdl

Parameters -- three parameters: the number of discrete variables (nodes in the

tree), the number of possible outcomes for each node (2 for binary data),

the penalty factor for each parameter.

Full binary bivariate MaxEnt model code: maxent-full

Parameters -- one parameter: the number of binary variables.

Bayesian network of conditional binary MaxEnt models with univariate and

bivariate features (sigmoidal belief network) code: BN-maxent

Parameters -- two parameters: the number of binary variables and the MDL

penalty factor.

Conditional Bayesian network of conditional binary MaxEnt models with

univariate and bivariate features code: BN-cond-maxent

Parameters -- two parameters: the number of binary variables and the MDL

penalty factor.

Delta-exponential mixture distribution code: delta-exponential

Parameters -- one parameter: number of mixture components (1+number of

exponential components).

Delta-gamma mixture distribution code: delta-gamma

Parameters -- one parameter: number of mixture components (1+number of gamma

components).

Dirac delta distribution code: delta

Attributes -- one attribute: location (real-valued).

Exponential (geometric) distribution code: exponential

Parameters -- one parameter: real-valued.

Gamma distribution code: gamma

Parameters -- two parameters: real-valued.

Log-normal distribution code: log-normal

Parameters -- two parameters: real-valued.

Gaussian code: gaussian

Parameters -- one parameter: dimension.

Gaussian Auto-Regressive on the outputs: chain-gaussian

Parameters -- one parameter: dimension.

Tree-structured Gaussian code: tree-gaussian

Parameters -- one parameter: dimension.

Conditionally independent: independent

Parameters -- number of components followed by the distribution for all

components (assumed the same).  For example, independent 5 bernoulli 2 

indicates a conditionally-independent distribution with each component a 

2-state Bernoulli.

Mixture: mixture

Parameters -- number of components of the mixture followed by the distribution

for all components (assumed the same).

Tree-dependent mixture code: cl-mixture

Parameters -- three parameters -- number of variables, number of states for

each variable, MDL penalty factor -- followed by the distributions for each 

state (distributions are assumed the same for all variables).


input_dimensionality -- number of input components.

Type: non-negative integer.


Types of input distributions -- logistic or trans-logistic

Input distributions are assigned according to the type of the model.



xval_type -- the type of cross-validation.

Type: string -- 'none' (default), 'leave_n_out'.

'none' -- no cross-validation.

'leave_n_out' -- leave-n-out cross-validation


examples_out -- number of examples to leave out for leave-n-out cross-

validation

Type: positive integer (default 1).
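
The idea of leave-n-out cross-validation can be sketched as follows.  The

sketch assumes that consecutive blocks of sequences are held out in turn;

this README does not state how mvnhmm forms the held-out sets, so the

grouping and the function name are assumptions made for illustration.

```python
def leave_n_out_folds(num_sequences, n_out=1):
    """Return (training indices, held-out indices) pairs.

    Assumption: consecutive blocks of n_out sequences are held out in turn.
    """
    folds = []
    for start in range(0, num_sequences, n_out):
        held_out = list(range(start, min(start + n_out, num_sequences)))
        training = [i for i in range(num_sequences) if i not in held_out]
        folds.append((training, held_out))
    return folds


for training, held_out in leave_n_out_folds(5, n_out=2):
    print("train on", training, "evaluate on", held_out)
```

One model is learned per fold, which is consistent with num_models being

adjusted automatically when cross-validation is selected.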


data -- file name of the data set.  Required for all actions except

'simulation'.

Type: string



num_data_sequences -- number of data sequences in the data file. (1 by

default)  With action 'simulation', indicates the number of base simulated

sequences if no input file provided.  With action 'analysis', indicates the

total number of sequences in the data file.

Type: positive integer



data_sequence_length -- length of each sequence in the data file. (1 by

default)  With action 'simulation', indicates the length of the simulated

sequences if no input file provided.

Type: positive integer



data_sequence_length_distinct -- same as data_sequence_length, but with

different value for each data sequence.

Type: num_data_sequences of positive integers


num_discrete_data_components -- the number of finite-valued components for

each data point in the data set.

Type: non-negative integer



num_real_data_components -- the number of real-valued components for each

data point in the data set.

Type: non-negative integer


model_filename -- file name of the file with the parameters for the

model(s). (required for 'viterbi', 'll', 'simulation'; optional for 'learn'.)

Type: string



output -- file name of the file where the output is sent. (stdout 

by default)

Type: string



num_restarts -- number of different starting points for EM. (1 by default,

used only with action 'learn')

Type: positive integer



input_filename -- file name for the set of values of inputs.  Input data

is required for input distribution to be used.

Type: string



num_input_sequences -- number of input data sequences. (1 by default)

Type: positive integer



input_sequence_length -- length of each input sequence. (1 by default)

Type: positive integer


input_sequence_length_distinct -- same as input_sequence_length, but with a

different value for each input sequence.

Type: num_input_sequences of positive integers


num_discrete_input_components -- the number of finite-valued components for

each data point in the input set.

Type: non-negative integer



num_real_input_components -- the number of real-valued components for each

data point in the input set.

Type: non-negative integer



extra_data_filename -- file name for the set of extra values.

Type: string



num_extra_sequences -- number of extra data sequences. (1 by default)

Type: positive integer



extra_sequence_length -- length of each extra sequence. (1 by default)

Type: positive integer



num_discrete_extra_components -- the number of finite-valued components for

each data point in the extra data set.

Type: non-negative integer



num_real_extra_components -- the number of real-valued components for each

data point in the extra set.

Type: non-negative integer



state_filename -- file name where the hidden states are recorded when the

data is being simulated.  (Hidden states are not recorded if this option is

not specified.)

Type: string



num_models -- number of models in model file. (1 by default)  If cross-

validation is selected, num_models would be adjusted automatically.

Type: positive integer



em_precision -- the convergence threshold for the EM algorithm.  For

the models with hidden states, EM is terminated once the change in

per-dimension log-likelihood is less than the threshold.  For the stateless

model, the learning algorithm switches to the optimization of the parameters

for another component once the change in per-bit log-likelihood for a

particular component is less than the threshold.

Type: positive real (5E-05 by default)
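
A stopping rule of this kind can be sketched as below.  The normalization is

an assumption: the sketch divides the change in total log-likelihood by the

total number of scalar data entries, which may differ from what mvnhmm means

by "per-dimension".  The function and argument names are hypothetical.

```python
def has_converged(previous_ll, current_ll, num_entries, threshold=5e-05):
    """Return True once the normalized log-likelihood change is below threshold.

    Assumption: num_entries is the total number of scalar data entries used
    to normalize the log-likelihood.
    """
    change = abs(current_ll - previous_ll) / num_entries
    return change < threshold


# Total log-likelihood before and after an EM iteration on 1000 entries.
print(has_converged(-1000.0, -999.0, 1000))    # change of 1e-03 per entry
print(has_converged(-1000.0, -999.99, 1000))   # change of about 1e-05 per entry
```

A smaller em_precision gives a more accurate solution at the cost of more EM

iterations.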



num_simulations -- number of simulations per sequence per model.  Must be

used with actions 'simulation' and 'analysis'.  With action 'analysis', 

indicates the number of simulations used to obtain the data set.  The total

number of simulated sequences would be num_simulations*num_data_sequences.

Type: positive integer (1 by default)



analysis -- type of analysis to perform on the data.  The following

options are supported: 'mean' (default, proportion of rainy observations

per station), 'correlation' (pairwise correlation of rainfall),

'persistence' (probability that no rain/rain persists for each station,

summary of the dry/wet spell distribution), 'dry' (dry spell length

counts per station), 'wet' (wet spell length counts per station),

'information' (mutual information for pairs of data components), 'logodds'

(log-odds ratio for each pair of components).
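
To make some of these statistics concrete, the sketch below computes 'mean',

'persistence', and the 'dry'/'wet' spell length counts for a single station

and a single sequence.  It assumes binary data with 0 for a dry day and 1 for

a rainy day and no missing values; the function names are hypothetical, and

the exact conventions used by mvnhmm (for example, at sequence boundaries)

may differ.

```python
from collections import Counter


def spell_length_counts(sequence, value):
    """Count runs of consecutive entries equal to value, keyed by run length."""
    counts = Counter()
    run = 0
    for x in sequence:
        if x == value:
            run += 1
        elif run:
            counts[run] += 1
            run = 0
    if run:
        counts[run] += 1  # a spell that lasts until the end of the sequence
    return counts


def persistence(sequence, value):
    """Estimate P(next day == value | current day == value)."""
    stays = pairs = 0
    for today, tomorrow in zip(sequence, sequence[1:]):
        if today == value:
            pairs += 1
            stays += tomorrow == value
    return stays / pairs if pairs else float("nan")


rain = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]   # assumed coding: 0 = dry, 1 = rain
print("mean:", sum(rain) / len(rain))
print("wet persistence:", persistence(rain, 1))
print("dry persistence:", persistence(rain, 0))
print("wet spells:", dict(spell_length_counts(rain, 1)))
print("dry spells:", dict(spell_length_counts(rain, 0)))
```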



filling -- the type of evaluation calculated using hole-filling analysis.  

This type of analysis emulates filling missing entries.  Types of analysis 

supported: 'log_p' (cumulative log-probability of the left-out data); 

'prediction' (number of correct predictions based on 

P(left out|everything else)); 'missing-probabilities' (marginal probabilities

for missing values); 'hidden-states' (probabilities of hidden states as

computed by forward-backward procedure).



lookahead -- number of timepoints to be predicted.  Must be used with action

'prediction'.

Type: positive integer (1 by default)



em_verbose -- option to display verbose messages during EM runs.

No inputs.



dim-index-display -- option to output/read dimension indices into/from

parameter file.  May be needed for backward compatibility.



bare-display -- when outputting models, writes only the parameters, without

descriptions.



initialization -- type of parameter initialization used with non-homogeneous

models.  Types of initialization supported: 'random' (default) --

initialization at random from the allowed range; 'em' -- solution of the EM

on the corresponding homogeneous model.



robust_first_state -- option to use all entries in the data for first state

probabilities with HMMs.

No inputs.



Other file formats:

data file -- contains only the data (comments with '#' are ok).

Missing values can be specified as 'nan'.
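
A data file with comments and missing values could be read by an outside

script along the following lines.  The layout assumed here (one observation

per row, whitespace-separated components) is an assumption for illustration,

as is the function name; only the '#' comments and the 'nan' marker are taken

from the description above.

```python
import math


def read_data_rows(text):
    """Parse rows of numbers, skipping '#' comments and blank lines."""
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # float() accepts 'nan', so missing values become float NaN.
        rows.append([float(token) for token in line.split()])
    return rows


sample = """# hypothetical data: 3 components per observation
1 0 nan
0 1 1   # trailing comments are ignored
"""
rows = read_data_rows(sample)
missing = sum(math.isnan(value) for row in rows for value in row)
print(len(rows), "observations,", missing, "missing value(s)")
```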


input variables file -- contains only the values of input variables.  

The number of entries must be the same as in the data file.


model file -- lists probabilities for the first state (first), entries

of the transition matrix (second), and parameters for the emissions

(third).  Comments with '#' are ok.
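
The roles of the first two groups of model parameters can be illustrated by

sampling a hidden state path from a homogeneous HMM.  This is a sketch, not

the simulation code of mvnhmm: the function name is hypothetical, Python's

own random number generator is used, and emissions are omitted.

```python
import random


def sample_state_path(first_state_probs, transition, length, seed=0):
    """Sample hidden states: the first from first_state_probs, the rest from
    the row of the transition matrix indexed by the previous state."""
    rng = random.Random(seed)
    states = list(range(len(first_state_probs)))
    path = [rng.choices(states, weights=first_state_probs)[0]]
    for _ in range(length - 1):
        path.append(rng.choices(states, weights=transition[path[-1]])[0])
    return path


# Two-state example with made-up probabilities.
first_state_probs = [0.6, 0.4]
transition = [[0.9, 0.1],
              [0.3, 0.7]]
print(sample_state_path(first_state_probs, transition, length=10, seed=1))
```

In a full simulation, each sampled state would then select the emission

distribution used to generate that day's observation.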


simulation file -- each row will contain the data simulated for one

observation (day).  Top to bottom, data is simulated first for each

simulation, then for each sequence, then for each day.


hidden state file -- each row will contain the value for the hidden state.