## Jun 05: Generating Markov Models from Data 1

Category: Methodology

Posted by: Aaron

The tipping points technique under development by yours truly (see here then here) is intended to be something akin to a statistical technique in the following sense: it uses data from a system (real or modeled), determines which model of a different kind (Markov) would also generate the observed data, and then conveys information about the imputed model that is not directly ascertainable from the data. So, as with descriptions of statistical modeling techniques, I need to describe how to generate the Markov model from the original data, along with the properties and caveats of doing it in different ways.

To use the general class of techniques I am developing that use state transition diagrams (for which I still need a snappy name), one must obviously have data across time. It helps if there are no missing states/transitions in the data. In most of the applications I am considering we are running the analysis on simulation data, so there are no missing states and we have a complete record of the transitions. If one is using survey or observational data and there are holes, then there are some techniques from network theory we can apply to fill those gaps; otherwise we have to run the analysis assuming that what we see is all there is. A missing transition will register as a lack of state change and hence a self-loop in a fixed-time Markov model. But we can also build the state transitions as being event-driven. Doing so will naturally alter the results of the analysis, and so the choice needs to be informed by sound methodological reasons (which I will provide). [Note: one advantage of the event-driven technique is that it can differentiate between a reflexive transition and a lack of transition, and the time-driven technique cannot.]
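To make the time-driven versus event-driven distinction concrete, here is a minimal Python sketch. It assumes the data is a list of already-labeled states, one per time step, plus a hypothetical list `event_flags` saying whether an event fired between consecutive steps. Both function names and the flag format are illustrative assumptions, not part of the tool described here.

```python
from collections import Counter

def time_driven_counts(states):
    """Count a transition for every pair of consecutive time steps.

    A step with no state change is counted as a self-loop.
    """
    return Counter(zip(states, states[1:]))

def event_driven_counts(states, event_flags):
    """Count a transition only for steps where an event fired.

    event_flags[i] is True if an event occurred between step i and i+1
    (an assumed input format). A fired event that leaves the state
    unchanged is a reflexive transition; steps without an event are skipped.
    """
    return Counter(
        (a, b)
        for a, b, fired in zip(states, states[1:], event_flags)
        if fired
    )

# Toy data (made up for illustration).
states = ["A", "A", "B", "B", "B", "A"]
event_flags = [False, True, True, False, True]

time_counts = time_driven_counts(states)
event_counts = event_driven_counts(states, event_flags)
```

In this toy data the time-driven count records two `B -> B` self-loops, while the event-driven count records only one, the step where an event actually fired and left the state unchanged. The `A -> A` step with no event disappears from the event-driven count altogether.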

Insofar as the generated Markov diagram is an appropriate representation of the original system, the information gleaned from it applies to the original system. But appropriateness is a matter of degree. Depending on certain properties of the original data and the way the Markov model is generated, we can even measure how close the model is expected to be, a sort of confidence measure on the match. This will vary for different properties of the Markov model, and so the generating software can be calibrated to maximize the fit for certain aspects at the cost of others. Many of these choices have to do with resolution and state binning, and thus share many of the same tradeoffs and problems with histogram creation. There are other difficulties unique to this technique, but I’m not the first person to create Markov diagrams of systems, so the first step will be to see how far others have gone [references appreciated: aaronbramson@gmail.com].
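The histogram-like resolution tradeoff can be illustrated with a small Python sketch. It assumes a single numeric series and plain equal-width bins, which is only a stand-in for whatever binning the tool ends up using. The helper names and the `revisit_fraction` summary are hypothetical and are not the confidence measure mentioned above.

```python
def bin_series(values, n_bins):
    """Map each value to an equal-width bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def revisit_fraction(states):
    """Fraction of time steps that land in a state already seen earlier."""
    seen, revisits = set(), 0
    for s in states:
        if s in seen:
            revisits += 1
        seen.add(s)
    return revisits / len(states)

# Made-up series for illustration only.
series = [0.03, 0.41, 0.47, 0.92, 0.38, 0.88, 0.07, 0.55]

for n_bins in (2, 4, 16):
    states = bin_series(series, n_bins)
    print(n_bins, len(set(states)), revisit_fraction(states))
```

Coarse bins give few states that are revisited often. Fine bins give many states that are rarely revisited, which is the same resolution problem one faces when choosing histogram bin widths.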

The input of the tipping points etc. methodology tool is the adjacency matrix of the state transition diagram, so that must be the output of this software tool. The input of this tool will be a table with a row for each variable/parameter/value and a column for each time step. This is a rather standard and intuitive format for longitudinal and time-series data, so there is no reason for me to deviate. This assumes that the data itself is just values for aspects of a system rather than relational data or something even more sophisticated, but this is just a temporary implementation limitation and not a limitation of the applicability of the general approach. Depending on the data, there may be an obvious way to cut it up into states, but often there is not. In many cases the system will never return to the exact same set of values, so some abstraction is required to create a Markov diagram wherein some states are revisited. [Note: state revisitation is not necessary for generating a Markov diagram or applying the tipping point etc. techniques, but it is typically necessary to get useful results.] The software will be equipped with some set of automatic abstracting methods that will read the span of values and bin them as appropriate, customizable by user input. The point is that the binning is the hardest part and can’t be done arbitrarily. This is the part where real statistical-style justification comes in. The user specifies some higher-level properties desired of the state transition diagram and the algorithms bin the data to foster those features *as best as possible*. The work for me is proving that those binning techniques really are the best possible ones.
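As a rough picture of the pipeline from input table to adjacency matrix, here is a minimal Python sketch under loud assumptions: equal-width binning per variable (a naive placeholder, not one of the justified binning methods still to be worked out), fixed-time transitions, and a row-normalized weighted adjacency matrix. The function names `bin_table` and `transition_matrix` are hypothetical.

```python
def bin_table(table, n_bins):
    """Turn a table (one row per variable, one column per time step)
    into a list of states, one per time step.

    Each state is a tuple of equal-width bin indices, one per variable.
    """
    binned_rows = []
    for row in table:
        lo, hi = min(row), max(row)
        width = (hi - lo) / n_bins or 1.0  # constant rows all fall in bin 0
        binned_rows.append(
            [min(int((v - lo) / width), n_bins - 1) for v in row]
        )
    return list(zip(*binned_rows))  # transpose: one tuple per time step

def transition_matrix(states):
    """Build a row-normalized adjacency matrix from a state sequence.

    Returns (labels, matrix), where matrix[i][j] is the observed fraction
    of departures from labels[i] that went to labels[j]. A state with no
    observed departures gets an all-zero row.
    """
    labels = sorted(set(states))
    index = {s: i for i, s in enumerate(labels)}
    n = len(labels)
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(states, states[1:]):
        counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return labels, matrix

# Made-up table: two variables observed over six time steps.
table = [
    [0.0, 0.1, 0.9, 1.0, 0.05, 0.95],
    [10, 12, 30, 28, 11, 29],
]
states = bin_table(table, n_bins=2)
labels, matrix = transition_matrix(states)
```

With two bins per variable the six time steps collapse into two states that are each revisited, so the resulting matrix has nontrivial transition weights. With enough bins every time step would become its own state and the diagram would be a simple chain.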