mvnhmm README
README file for 'mvnhmm' program -- implementation of the update rules,
simulation of data, and calculation of statistics for variations of
(M)ulti(V)ariate (N)onhomogeneous (H)idden (M)arkov (M)odels.
May 15, 2002
Updated October 04, 2002
Updated November 01, 2002
Revised and updated April 03, 2003
Revised and updated August 28, 2003
Updated October 03, 2003
Updated October 24, 2003
Updated November 19, 2003
Updated November 20, 2003
Updated December 16, 2003
Updated December 19, 2003
Updated December 23, 2003
Updated December 27, 2003
Updated February 27, 2004
Updated April 06, 2004
Second revision, May 3, 2004
Updated June 21, 2004
Updated July 22, 2004
Updated August 4, 2004
Updated October 12, 2004
Updated April 18, 2005
Updated June 23, 2005
Updated October 26, 2005
Updated November 9, 2005
Updated November 21, 2005
Updated December 12, 2005
Updated February 16, 2006
Updated February 25, 2006
Updated March 16, 2006
Updated April 25, 2006
Updated August 10, 2006
Updated August 27, 2006
Updated August 28, 2006
Updated September 7, 2006
Updated September 12, 2006
Updated September 13, 2006
Updated October 2, 2006
By Sergey Kirshner,
Alberta Ingenuity Centre for Machine Learning (AICML), University of Alberta,
Edmonton, Canada (formerly of Donald Bren School of Information and Computer
Science, University of California, Irvine, USA).
This file contains information necessary to run the code which this file
accompanies.
Compilation:
To compile the code with g++, run 'make' (the default makefile, intended for
Linux). On Sun Solaris, run 'make -f makefile.sun' instead for faster-running
code; the Sun makefile was tested on Solaris 8. The default makefile was
developed in part on RedHat 9 Linux and was tested on RedHat 9 and on SuSE
Linux 10.0. It requires the g++ compiler with the standard C and C++
libraries; g++ version 3.0.4 was used during development.
Running the program:
To run the program, either add the directory with the compiled code to your
path, or run ./mvnhmm from that directory. mvnhmm takes two inputs: the
parameter file name (required) and the seed for the random number generator
(optional, 0 if omitted). If run without any inputs, the program prints the
proper usage information.
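As an illustration of the calling convention above, here is a minimal Python
sketch that assembles the command line and runs the program. The helper names
(build_command, run_mvnhmm) and the file name params.txt are hypothetical;
only the executable name, the required parameter file, and the optional seed
come from this README.

```python
import subprocess

def build_command(param_file, seed=None, executable="./mvnhmm"):
    """Return the argument list: parameter file first, then the optional seed."""
    argv = [executable, param_file]
    if seed is not None:
        argv.append(str(seed))
    return argv

def run_mvnhmm(param_file, seed=None, executable="./mvnhmm"):
    """Run the program once and return whatever it wrote to standard output."""
    result = subprocess.run(build_command(param_file, seed, executable),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Only the command is built here; nothing is executed.
print(build_command("params.txt", seed=42))
```

Passing a fixed seed should make repeated runs comparable, assuming the random
number generator is the only source of variation between runs.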
Parameter file specification:
Without a proper parameter file, the program will not run. Certain
parameters are required for the program to run; others are needed for running
the program together with outside scripts. For easier readability, the
parameter file allows comments and ignores extra whitespace and tabs.
Also, the order in which the parameters are entered does not matter in
most cases.
Any text from the character '#' to the end of the line is ignored, so use
comments as frequently as desired. Each parameter is read in the following
manner: first the parameter name is specified, and then its value(s).
Currently, the following parameters are defined:
num_states -- number of hidden states in the model. (2 by default)
(used by all actions except 'analysis' or if 'transition' is defined)
Type: positive integer.
model_type -- type of the model. ('hmm' by default).
Type: currently, these types are supported: 'hmm', 'nhmm', 'mixture',
'nmixture', and 'stateless'. 'hmm' uses hidden Markov model; 'nhmm' uses
non-homogeneous hidden Markov model; 'mixture' runs mixture model ignoring the
order of observations; 'nmixture' runs a mixture model with mixing
probabilities dependent on the values of the input variables; 'stateless' runs
a model without hidden states where the probability of each observation is
determined from the previous observation and the value of the corresponding
input variable. (NOTE: 'stateless' is currently unavailable)
'model_type' is not needed if 'transition' is defined.
action -- defines what the program should calculate. ('learn' by default)
Type: currently, these options are implemented: 'learn', 'viterbi', 'll',
'll-train', 'simulation', 'analysis', 'filling', 'prediction', 'init', and
'KL'. 'learn' calculates the parameters of the model using the EM algorithm.
'viterbi' finds the sequences of most likely states given the data and the
model. 'll' and 'll-train' calculate the log-likelihood of the test and train
set, respectively, given the model. 'simulation' generates data according to
the specified model with or without input variable values (Xs). 'analysis'
calculates the statistics of the provided data. 'filling' evaluates the model
on data using the hole-filling method. 'prediction' computes, for each time
point, the log-likelihood of a number of consecutive observations (the number
is set by parameter 'lookahead') given the parameters of the model. 'init'
generates multiple initializations of parameters for the model and stores them
in the output file. 'KL' computes the entropy of the passed distribution and
the KL-divergence between the first and the second passed distributions.
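To make the 'viterbi' action concrete, the following Python sketch shows the
standard Viterbi recursion for a homogeneous HMM. It is a textbook version
written for illustration, not the code used by mvnhmm; the argument names
(first, trans, emit_logp) and all numbers are assumptions.

```python
import math

def viterbi(first, trans, emit_logp):
    """Return the most likely state path for one observation sequence.

    first[i]        -- probability of starting in state i
    trans[i][j]     -- probability of moving from state i to state j
    emit_logp[t][j] -- log-probability of observation t under state j
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(first)
    score = [log(first[j]) + emit_logp[0][j] for j in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for t in range(1, len(emit_logp)):
        prev, score, pointers = score, [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: prev[i] + log(trans[i][j]))
            pointers.append(best)
            score.append(prev[best] + log(trans[best][j]) + emit_logp[t][j])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

obs = [0, 0, 1, 1]
emit = [[0.9, 0.1], [0.1, 0.9]]  # emit[state][symbol], illustrative numbers
emit_logp = [[math.log(emit[s][o]) for s in range(2)] for o in obs]
print(viterbi([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], emit_logp))  # [0, 0, 1, 1]
```

Working in log space avoids numerical underflow on long sequences, which is
why the sketch adds log-probabilities instead of multiplying probabilities.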
transition -- specification of the transition distribution.
Type: the distribution specification string (same as emission -- see below).
emission -- specification of the emission distributions (output states).
Type: The code and the parameter specifications (below).
Types of emission distributions -- conditionally-independent
(super-distribution), univariate Bernoulli, double-chain (conditional
univariate Bernoulli), Chow-Liu tree, conditional Chow-Liu tree,
sigmoidal belief network, exponential, gamma, Gaussian, AR (conditional
Gaussian), conditionally independent, finite mixture, tree-dependent
mixture.
Bernoulli code: bernoulli
Parameters -- one parameter: the number of possible outcomes (2 for binary
data).
Bernoulli with Dirichlet prior: bernoulli-prior
Parameters -- two parameters: the number of possible outcomes (2 for binary
data) and the pseudo-counts for the prior (non-negative, same for all
outcomes).
Conditional Bernoulli code: chain-bernoulli
Parameters -- one parameter: the number of possible outcomes.
Conditional Bernoulli (averaged) code: chain-bernoulli-global
The difference between this distribution and regular conditional Bernoulli
is that the first entry in the sequence distribution is updated as an
average of probabilities over all entries in the sequence.
Parameters -- one parameter: the number of possible outcomes.
Chow-Liu code: chow-liu
Parameters -- two parameters: the number of discrete variables (nodes in the
tree), the number of possible outcomes for each node (2 for binary data).
Chow-Liu with MDL penalty on MI terms code: chow-liu-mdl
Parameters -- three parameters: the number of discrete variables (nodes in the
tree), the number of possible outcomes for each node (2 for binary data),
the penalty factor for each parameter.
Chow-Liu with Dirichlet prior on the bivariate probabilities and MDL penalty
on MI terms code: chow-liu-prior
Parameters -- four parameters: the number of discrete variables (nodes in the
tree), the number of possible outcomes for each node (2 for binary data),
pseudo-counts for Dirichlet prior (non-negative), same for all outcomes;
the penalty factor for each parameter.
Conditional Chow-Liu code: conditional-chow-liu
Parameters -- two parameters: the number of discrete variables (dimension of
the distribution variable), the number of possible outcomes for each node (2
for binary data).
Conditional Chow-Liu with MDL penalty on MI terms
code: conditional-chow-liu-mdl
Parameters -- three parameters: the number of discrete variables (nodes in the
tree), the number of possible outcomes for each node (2 for binary data),
the penalty factor for each parameter.
Full binary bivariate MaxEnt model code: maxent-full
Parameters -- one parameter: the number of binary variables.
Bayesian network of conditional binary MaxEnt models with univariate and
bivariate features (sigmoidal belief network) code: BN-maxent
Parameters -- two parameters: the number of binary variables and the MDL
penalty factor.
Conditional Bayesian network of conditional binary MaxEnt models with
univariate and bivariate features code: BN-cond-maxent
Parameters -- two parameters: the number of binary variables and the MDL
penalty factor.
Delta-exponential mixture distribution code: delta-exponential
Parameters -- one parameter: number of mixture components (1+number of
exponential components).
Delta-gamma mixture distribution code: delta-gamma
Parameters -- one parameter: number of mixture components (1+number of gamma
components).
Dirac delta distribution code: delta
Attributes -- one attribute: location (real-valued).
Exponential (geometric) distribution code: exponential
Parameters -- one parameter: real-valued.
Gamma distribution code: gamma
Parameters -- two parameters: real-valued.
Log-normal distribution code: log-normal
Parameters -- two parameters: real-valued.
Gaussian code: gaussian
Parameters -- one parameter: dimension.
Gaussian Auto-Regressive on the outputs: chain-gaussian
Parameters -- one parameter: dimension.
Tree-structured Gaussian code: tree-gaussian
Parameters -- one parameter: dimension.
Conditionally independent: independent
Parameters -- number of components followed by the distribution for all
components (assumed the same). For example, independent 5 bernoulli 2
indicates a conditionally-independent distribution with each component a
2-state Bernoulli.
Mixture: mixture
Parameters -- number of components of the mixture followed by the distribution
for all components (assumed the same).
Tree-dependent mixture code: cl-mixture
Parameters -- three parameters -- number of variables, number of states for
each variable, MDL penalty factor -- followed by the distributions for each
state (distributions are assumed the same for all variables).
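The way a distribution code is followed by its parameters, and the way
'independent' and 'mixture' wrap another specification, can be sketched with
a few string helpers. The Python function names below are hypothetical and
are not part of mvnhmm; the sketch assumes the space-separated layout shown
in the 'independent 5 bernoulli 2' example above.

```python
def bernoulli(num_outcomes=2):
    """Specification of a univariate Bernoulli with the given number of outcomes."""
    return f"bernoulli {num_outcomes}"

def independent(num_components, component_spec):
    """Conditionally independent distribution over identical components."""
    return f"independent {num_components} {component_spec}"

def mixture(num_components, component_spec):
    """Finite mixture whose components all share one specification."""
    return f"mixture {num_components} {component_spec}"

print(independent(5, bernoulli(2)))   # independent 5 bernoulli 2
print(mixture(3, bernoulli(2)))       # mixture 3 bernoulli 2
```

The resulting string is what would follow the 'emission' (or 'transition')
parameter name in the parameter file.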
input_dimensionality -- number of input components.
Type: non-negative integer.
Types of input distributions -- logistic or trans-logistic
Input distributions are assigned according to the type of the model.
xval_type -- the type of cross-validation.
Type: string -- 'none' (default), 'leave_n_out'.
'none' -- no cross-validation.
'leave_n_out' -- leave-n-out cross-validation
examples_out -- number of examples to leave out for leave-n-out cross-
validation
Type: positive integer (default 1).
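As a rough picture of leave-n-out cross-validation, the Python sketch below
splits sequence indices into training and held-out sets. How mvnhmm actually
groups the sequences is not stated in this README; holding out consecutive
blocks of 'examples_out' sequences is an assumption, and the function name is
hypothetical.

```python
def leave_n_out_splits(num_sequences, examples_out=1):
    """Yield (train, test) index lists, holding out consecutive blocks."""
    indices = list(range(num_sequences))
    for start in range(0, num_sequences, examples_out):
        test = indices[start:start + examples_out]
        train = indices[:start] + indices[start + examples_out:]
        yield train, test

for train, test in leave_n_out_splits(6, examples_out=2):
    print("train", train, "test", test)
```

Under this assumption, 6 sequences with examples_out set to 2 give 3 splits,
so one model would be learned per split.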
data -- file name of the data set. Required for all actions except
'simulation'.
Type: string
num_data_sequences -- number of data sequences in the data file. (1 by
default) With action 'simulation', indicates the number of base simulated
sequences if no input file provided. With action 'analysis', indicates the
total number of sequences in the data file.
Type: positive integer
data_sequence_length -- length of each sequence in the data file. (1 by
default) With action 'simulation', indicates the length of the simulated
sequences if no input file provided.
Type: positive integer
data_sequence_length_distinct -- same as data_sequence_length, but with
different value for each data sequence.
Type: num_data_sequences of positive integers
num_discrete_data_components -- the number of finite-valued components for
each data point in the data set.
Type: non-negative integer
num_real_data_components -- the number of real-valued components for each
data point in the data set.
Type: non-negative integer
model_filename -- file name of the file with the parameters for the
model(s). (Required for 'viterbi', 'll', 'simulation'; optional for 'learn'.)
Type: string
output -- file name of the file where the output is sent. (stdout
by default)
Type: string
num_restarts -- number of different starting points for EM. (1 by default,
used only with action 'learn')
Type: positive integer
input_filename -- file name for the set of values of inputs. Input data
is required for input distribution to be used.
Type: string
num_input_sequences -- number of input data sequences. (1 by default)
Type: positive integer
input_sequence_length -- length of each input sequence. (1 by default)
Type: positive integer
input_sequence_length_distinct -- same as input_sequence_length, but with a
different value for each input sequence.
Type: num_input_sequences of positive integers
num_discrete_input_components -- the number of finite-valued components for
each data point in the input set.
Type: non-negative integer
num_real_input_components -- the number of real-valued components for each
data point in the input set.
Type: non-negative integer
extra_data_filename -- file name for the set of extra values.
Type: string
num_extra_sequences -- number of extra data sequences. (1 by default)
Type: positive integer
extra_sequence_length -- length of each extra sequence. (1 by default)
Type: positive integer
num_discrete_extra_components -- the number of finite-valued components for
each data point in the extra data set.
Type: non-negative integer
num_real_extra_components -- the number of real-valued components for each
data point in the extra set.
Type: non-negative integer
state_filename -- file name where hidden states are recorded when the
data is being simulated. (Hidden states are not recorded if this option is
not specified.)
Type: string
num_models -- number of models in model file. (1 by default) If cross-
validation is selected, num_models would be adjusted automatically.
Type: positive integer
em_precision -- specifying the sensitivity threshold for EM algorithm. For
the models with hidden states, EM is terminated once the change in per-
dimension log-likelihood is less than the threshold. For the stateless
model, the learning algorithm switches to the optimization of the parameters
for another component once the change in per-bit log-likelihood for a
particular component is less than the threshold.
Type: positive real (5E-05 by default)
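The stopping rule for models with hidden states can be sketched as follows.
This Python fragment is an illustration only: it assumes that "per-dimension"
means the total log-likelihood divided by the number of scalar data entries,
and that the absolute change is compared with the threshold. The function and
argument names are hypothetical.

```python
def em_converged(prev_ll, curr_ll, num_entries, em_precision=5e-5):
    """True once the change in log-likelihood per data entry is below threshold."""
    change = abs(curr_ll - prev_ll) / num_entries
    return change < em_precision

# 1000 data entries: a gain of 1.0 is still too large, a gain of 0.01 is not.
print(em_converged(-1000.0, -999.0, 1000))    # False
print(em_converged(-1000.0, -999.99, 1000))   # True
```

Normalizing by the amount of data keeps one threshold meaningful across data
sets of different sizes.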
num_simulations -- number of simulations per sequence per model. Must be
used with actions 'simulation' and 'analysis'. With action 'analysis',
indicates the number of simulations used to obtain the data set. The total
number of simulated sequences would be num_simulations*num_data_sequences.
Type: positive integer (1 by default)
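A short worked example of the count above, in Python. The numbers are
illustrative and the function name is hypothetical.

```python
def total_simulated_sequences(num_simulations, num_data_sequences):
    """Total sequences produced: simulations per sequence times base sequences."""
    return num_simulations * num_data_sequences

# 10 simulations for each of 24 base sequences.
print(total_simulated_sequences(10, 24))   # 240
```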
analysis -- type of analysis to perform on the data. The following
options are supported: 'mean' (default, proportion of rainy observations
per station), 'correlation' (pairwise correlation of rainfall),
'persistence' (probability that no rain/rain persists for each station,
summary of the dry/wet spell distribution), 'dry' (dry spell length
counts per station), 'wet' (wet spell length counts per station),
'information' (mutual information for pairs of data components), 'logodds'
(log-odds ratio for each pair of components).
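To show what the 'dry' and 'wet' statistics measure, here is a small Python
sketch that counts spell lengths for one station. It assumes binary data
coded as 1 for rain and 0 for no rain, a single sequence, and no missing
values; how mvnhmm treats sequence boundaries and missing entries is not
covered here. The function name is hypothetical.

```python
from collections import Counter
from itertools import groupby

def spell_length_counts(sequence, value):
    """Count maximal runs of `value` by their length."""
    counts = Counter()
    for key, run in groupby(sequence):
        if key == value:
            counts[sum(1 for _ in run)] += 1
    return counts

station = [0, 0, 1, 1, 1, 0, 1]          # assumed coding: 1 = rain, 0 = no rain
print(spell_length_counts(station, 1))   # wet spells: one of length 3, one of length 1
print(spell_length_counts(station, 0))   # dry spells: one of length 2, one of length 1
```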
filling -- the type of evaluation calculated using hole-filling analysis.
This type of analysis emulates filling missing entries. Types of analysis
supported: 'log_p' (cumulative log-probability of the left-out data);
'prediction' (number of correct predictions based on
P(left out|everything else)); 'missing-probabilities' (marginal probabilities
for missing values); 'hidden-states' (probabilities of hidden states as
computed by forward-backward procedure).
lookahead -- number of timepoints to be predicted. Must be used with action
'prediction'.
Type: positive integer (1 by default)
em_verbose -- option to display verbose messages during EM runs.
No inputs.
dim-index-display -- option to output/read dimension indices into/from
parameter file. May be needed for backward compatibility.
bare-display -- when outputting models, writes only the parameters, without
descriptions.
initialization -- type of parameter initialization used with non-homogeneous
models. Types of initialization supported: 'random' (default) --
initialization at random from the allowed range; 'em' -- solution of the EM
on the corresponding homogeneous model.
robust_first_state -- option to use all entries in the data for first state
probabilities with HMMs.
No inputs.
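Putting several of the parameters above together, the Python sketch below
holds an example parameter file as a string and splits it into tokens the way
the syntax rules describe (text after '#' ignored, extra spaces and tabs
ignored). The parameter names come from this README; the values, the file
names, and the tokenizer itself are illustrative assumptions, and this
particular combination has not been run through mvnhmm.

```python
EXAMPLE_PARAMETERS = """\
# Learn a 3-state HMM on binary data (illustrative values only)
action        learn
model_type    hmm
num_states    3
emission      independent 5 bernoulli 2   # 5 binary components per observation
data          rain.dat                    # hypothetical file name
num_data_sequences            24
data_sequence_length          90
num_discrete_data_components  5
num_restarts  10
output        learned_model.txt           # hypothetical file name
em_verbose                                # flag, takes no inputs
"""

def tokenize(text):
    """Drop '#' comments and split what remains on any whitespace."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())
    return tokens

tokens = tokenize(EXAMPLE_PARAMETERS)
print(tokens[:6])   # ['action', 'learn', 'model_type', 'hmm', 'num_states', '3']
```

Because parameters take different numbers of values (and some, such as
em_verbose, take none), a reader of the token stream has to know each
parameter name to decide how many tokens to consume after it.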
Other file formats:
data file -- contains only the data (comments with '#' are ok).
Missing values can be specified as 'nan'.
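A data file of this kind could be read in Python as sketched below. The
layout of one observation per row is an assumption carried over from the
simulation file description, and the function name is hypothetical.

```python
import math

def read_data_rows(text):
    """Parse rows of numbers, skipping '#' comments; 'nan' marks a missing value."""
    rows = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields:
            rows.append([float(f) for f in fields])  # float() accepts 'nan'
    return rows

rows = read_data_rows("1 0 nan  # day 1\n0 0 1\n")
print(len(rows), math.isnan(rows[0][2]))   # 2 True
```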
input variables file -- contains only the values of input variables.
The number of entries must be the same as in the data file.
model file -- lists probabilities for the first state (first), entries
of the transition matrix (second), and parameters for the emissions
(third). Comments with '#' are ok.
simulation file -- each row will contain the data simulated for one
observation (day). Top to bottom, data is simulated first for each
simulation, then for each sequence, then for each day.
hidden state file -- each row will contain the value for the hidden state.