A hidden Markov model (HMM) is a bivariate Markov chain which encodes information about the evolution of a time series. HMMs can faithfully represent workloads for discrete time processes and therefore be used as portable benchmarks to explain and predict the complex behaviour of these processes. This project introduces the main concepts of HMMs for discrete time series including a summary of HMM mathematical properties. A section of this report explains the motives behind cluster analysis and the most efficient selection of the clustering algorithm when creating workload models. In the case of this project, an explanation is provided into the benefits of the K-means clustering algorithm for data points in discrete time.
The main aims of this project are to: apply HMMs to two different scenarios to correctly analyse discrete time series; provide meaning to the underlying hidden states of the HMMs in each case; and recreate representative traces for each application. Firstly, the HMM is applied to Flash Memory data in the form of operation type traces to achieve a workload model. Secondly, the HMM is used to decode a data trace formed of hospital patient arrivals creating a Hospital Arrivals model. Both of these models will be validated using averages from the raw and HMM-generated traces and also by comparison of autocorrelation functions.
Another aim of the project is to create a novel adaptation of the Baum-Welch algorithm using Flash Memory data. It is known that discrete HMMs can effectively learn long sequences of observations such as workload access patterns in computer storage systems. However, there is now increasing demand for systems which handle higher density, additional loads as seen in storage workload modelling , where workloads can be characterized on-line. Thus, we derive a sliding version of the Baum-Welch algorithm, which constantly updates its observation set, discarding old data points in the time series as it inputs new ones. We refer to a HMM with this sliding Baum-Welch algorithm as a SlidHMM due to the fact that it slides across the time series, updating its parameters "on-the-fly". The benefit of this novel approach is to obtain a parsimonious model which updates its encoded information whenever more real time workload data becomes available. The SlidHMM is also efficient in keeping track of non-homogeneous processes because it updates the observation set at different stages of the analysis, therefore analysing only the current portion of the time series.
An analysis of an efficient process to identify the optimal number of hidden states for a HMM is also discussed, but left mostly as future work. Also reserved as extensions are: the choice of a different clustering algorithm for each model; and a new approximation for the backward variables in the Baum-Welch algorithm to seek an improvement on the SlidHMM.
Information from pubs.doc.ic.ac.uk/msciHMM.