paul-almond.com
Home Guest Book Links Email
Occam's Razor Part 5: How Mapping Can Work

By Paul Almond, 14 January 2006

Previous Articles in this Series

Reading of the following articles is suggested before reading this article:

Occam's Razor Part 1: What Is Occam's Razor? http://www.paul-almond.com/OccamsRazorPart01.htm. [1]

Occam's Razor Part 2: Principles of Language http://www.paul-almond.com/OccamsRazorPart02.htm. [2]

Occam's Razor Part 3: Assumptions About Reality http://www.paul-almond.com/OccamsRazorPart03.htm. [3]

Occam's Razor Part 4: An Overview of How Occam's Razor Works http://www.paul-almond.com/OccamsRazorPart04.htm. [4]

The following article, although not part of this series, is closely related to it and is a prequel to it:

What is a Low Level Language? http://www.paul-almond.com/WhatIsALowLevelLanguage.htm. [5]

Introduction

The previous four articles in this series have set out assumptions needed for Occam's razor to work and started a justification of it.

The last article Occam's Razor Part 4: An Overview of How Occam's Razor Works [4] justified Occam's razor in terms of mapping. Models made by humans, or any models to which humans have access, are partial models. The actual model describing nature is one of an infinite number of complete consistent candidate models. Partial models that are considered "simple" by the standards of Occam's razor are those that "map onto" many complete consistent models, but no explanation was given of what it means to say that a partial model "maps onto" a complete model.

This article will show how a partial model can be said to "map onto" a complete model and why partial models favoured by Occam's razor will tend to map onto large numbers of complete models, making them more likely to be "correct".

What We Want From Mapping

What "mapping" should mean depends on what it should accomplish. It would be easy to define many types of mapping in which partial models favoured by Occam's razor tend to map onto large numbers of complete consistent models, but this would not be of much use in itself. Occam's razor is supposed to help us find models that are better. Partial models which map onto many complete models are more likely to map onto the actual model but it raises the question of why we would want a partial model that is likely to "map onto" the actual model.

The purpose of models is prediction. If a partial model "maps onto" a complete consistent model then, for the "mapping" to be useful, it must imply that the partial model and the complete consistent model make the same predictions. If a partial model favoured by Occam's razor "maps onto" a large number of complete consistent models then this should make it likely that one of these will be the actual model and that the partial model will be making the same predictions as the actual model - making the partial model's predictions useful. We can therefore state:

A partial model maps onto a complete consistent model if it makes the same predictions.

This should give a basic idea of how Occam's razor can work and how we expect "mapping" to occur, but we still need to decide what it means to say that two models make the same predictions.

In the first article in this series [1], it was stated that observations of reality can be treated as a sequence of bits and a model can be considered to be an algorithm that produces such a sequence of bits. This suggests one simple way of deciding whether or not two models make the same predictions:

Two models make the same predictions if they generate exactly the same sequence of bits.

This is simplistic and will need modifying later, but we will use it for now.

We now need to show that partial models favoured by Occam's razor will actually "map onto" more complete consistent models when we view "mapping" in this way.

Why Models Favoured by Occam's Razor Are Better

We will not be doing anything as naïve as assuming that a partial model which is described by a shorter algorithm is more likely to be literally reproduced - in the sense of having its "code" duplicated - within a complete consistent model: an explanation has already been given of the problems with this.

The argument will be expressed in terms of information content. While we do not expect the actual code of a partial model to occur within a complete consistent model we do expect its information to occur in it. The idea of information is more general: lots of different sections of code could contain the same information.

Let us suppose that the partial model being considered is consistent, meaning that it exactly reproduces all the sequence of bits representing past observations. Let us also assume that it is encoded as efficiently as possible, so that no model can be made, outputting the same sequence of bits as the partial model for both past and future observations, which is described by an algorithm with a shorter length than that of the partial model.

Let us call the information in the partial model X. This is not the same as saying that the actual sequence of code in the partial model is called X as we are concerned with information and not just code. Let P be the number of bits in the algorithm describing the partial model being considered. This means that at least P bits of storage capacity are needed to encode the information X.

X is the minimum information needed by a model to produce the same sequence of bits representing both past and future observations as the partial model being considered. Any complete consistent model which agrees exactly with the partial model being considered for predictions of future observations must contain, at the very least, the information X. It can also contain lots of other information, but X is the minimum information needed for agreement with the partial model being considered.

The set of complete consistent candidate models is infinite. This was dealt with in the last article [4] by considering the set of all complete consistent candidate models described by algorithms with a finite length which is then made to tend to infinity. We will continue to use this approach. Let us suppose that we are limiting the set of complete consistent candidate models to those described by algorithms with lengths of L bits or less.

For a complete consistent candidate model to behave in the same way as the partial model being considered it must contain the information X. It needs at least P bits of information storage capacity to store the information X. It can hold up to L bits of information so there are L-P bits of information storage capacity not needed to store X. It is important to note that we are not doing anything as crude here as saying that P bits of the model are needed to store X and the remaining bits can be used to store something else: this is about information quantities and storage capacities rather than specific bits. As an example, X could be stored in some delocalised way involving every bit of the complete consistent candidate model while still only consuming P bits of capacity.

All complete consistent candidate models containing X have that in common. The only thing that differentiates them is the other information that exists in the remaining capacity, which has a size of L-P bits, of the model not used to store X. This is important with regard to how many different such complete consistent models there are, because it is only by containing different information that two different models can exist. There is no scope for complete consistent models to be different from each other by containing different information X: X is always just X. Any difference must come from models containing different information in the unused capacity of L-P bits. The size of this capacity determines how many different ways this information can be expressed and therefore how many different complete consistent candidate models can be made that contain X. That is to say, the greater the amount of unused information after X is stored in the complete model then the greater the amount of information which is available to allow different complete consistent models to be expressed.

There are only 2L-P ways of expressing the information stored in L-P bits of storage capacity.

Therefore, there are 2L-P different complete models that can be made with X stored in them. This value is greatest when P is lowest.

This means that we should expect more complete models to exist that contain all the information in a partial model, provided that the partial model is expressed as efficiently as possible, when the algorithm describing the model is short.

This in turn means that we should expect more complete consistent candidate models to exist that match the behaviour of a partial model exactly - both in terms of past behaviour and future predictions - when the algorithm describing the partial model is short. The shorter the algorithm describing a partial model is, the more likely it is that the actual model is one of this group of complete consistent candidate models.

We have the issue of L being finite for the purposes of the argument that was just presented, but unlimited in reality. This can be resolved, as before, by letting L tend to infinity.

A Limitation of This Argument

We have not quite established Occam's razor yet.

Just because a model contains all the information in the partial model, this does not mean that it must agree with that partial model. The information X in the complete model is only part of the information in the complete model. There is a lot more other information as well and this information could prevent the X information from being expressed: in fact it probably will prevent the X information being expressed. It should be noted that we are not saying that the complete model has to contain the "X" code and the "other" code: we are talking in more general terms about information and quantities of information. To put this another way, although all the complete consistent models in what we hope is the large group of models that are equivalent to our partial model contain X, we could find some subset of them that contain some other arbitrary information Y from a different partial model. Why should they these models not agree with this other partial model instead of the one with which we want them to agree?

A Possible Answer: Having a Large "Pool"

One way of looking at this is to say that even if complete consistent models that contain the information X are not required to agree with the partial model being considered, they do need to contain X to even have a chance of it, so having the largest possible "pool" of complete consistent candidate models containing X would make it likely that as many as possible of these models would agree with the partial model being considered, even if many of them do not. Putting this another way: if only a certain percentage of the complete consistent candidate models containing X actually agree with the partial model being considered then the actual number of models represented by this percentage if greater when the number of models containing X is greater. We have already seen that this is best achieved by choosing a partial model that can be described by a short algorithm. Is the problem solved?

There is a problem with this reasoning. Even if it suggests that we should expect to find more complete models that agree with our partial model when its description is short, the fact that it is very easy for the other information, apart from X, that is in the complete model to interfere with the expression of the X information suggests that very few of these models will actually be in agreement with the partial model containing X. If this is the case we may just be proving a version of Occam's razor which makes the model with the shortest description the most likely to be true - but still very likely not to be true, as is any other model. This does not fit our experience of reality in which we expect theories to have a good chance of being "correct". We do not merely have, for example, a theory of gravity which is very unlikely to be "correct" but is still the best chance simply because all the competing theories are even more implausible. That is the sort of situation we would be accepting if we used this sort of reasoning.

Another solution is needed. Maybe we should reduce our expectations with regard to what we expect from a partial model?

Reducing Our Expectations

A consequence of this limitation in the argument is that we will probably have to lower our sights and accept that we cannot expect our partial model to provide agreement with the actual model in every detail for the past and all of the future.

One approach may be to accept that the model does not need to be correct in every detail, but this is not as easy as it may sound. All we have for the observations of reality is a sequence of bits generated by the actual model and a partial model will actually match it or not match it. It is hard to see how we could define the concept of agreeing with "less detail" that would be required for this.

Another approach could be still to require absolute detail, but to lower our sights with regard to expecting the partial model to agree with the actual model forever. We may just need to satisfy ourselves with a partial model that has a good chance of working "for now" and that the predictions which it makes for the "near" or "foreseeable" future will match those of the actual model. This is the solution that will now be discussed.

Agreeing "For Now"

The most obvious argument for lowering our expectations and only seeking a partial model which will agree with the actual model "for now" is that, presumably, even many complete consistent candidate models that fail to agree exactly when we demand agreement for all of the future, as well as consistency in the past, will still provide this "weak" form of agreement.

An obvious objection could be made to this by saying that, while the argument given so far has supported partial models with shorter algorithmic descriptions as tending to have more complete consistent candidate models that are consistent and in full agreement for all of the future, it does not necessarily follow that this argument extends to complete consistent models that have agreement for some of the future. I think that most people would intuitively expect this actually to be the case, but we do need more to support the idea.

We will use a concept that I will call defection. Let us imagine that we have a complete consistent candidate model that agrees with the partial model for some time into the future, until it stops agreeing. What happens after agreement ceases is of no interest to us. I will give the term defection to the point at which full agreement between the partial model and the complete consistent candidate model ends; that is to say when the sequences of bits produced by the partial model and the complete consistent candidate model become different in any respect. Previously we wanted partial models which had agreement with many complete consistent candidate models which never defected. Now that we have lowered our sights we want partial models which agree with many complete consistent candidate models which do not defect immediately in the future and, preferably, delay defection as long as possible.

We can get an idea of how many models are likely to agree for some time into the future from the idea of specificity of defection. For a complete consistent candidate model to defect very soon means that it must defect very close to the current time. There is no particular reason to expect a complete candidate model to contain X even to avoid defecting up to the present, and for such models there would be no particular specificity involved in defecting much earlier than the present, but we have already adopted a methodology of only considering consistent models - those which have avoided defecting until the present time. If our observed data goes back a long time then, for such a model to defect now - just after the present moment has passed - requires a huge amount of specificity: such models have to defect at "just the right time". We can extend this argument further into the future: for a complete consistent candidate model to defect at some time in the near future demands some specificity - the amount of specificity dependent on how close to the present the defection actually happens.

Any defection must be caused by information in the complete consistent model other than X. The amount of information specifically required by the specificity of the defection we will call Y. If the defection has to occur at a very specific time then Y is a large amount of information, since to specify something with great accuracy requires more information: it is all that specificity means really. If the defection timing does not have a very specific requirement then Y does not have to be specific and Y can be a small amount of information.

It should be easy to see that, when we are dealing with complete consistent candidate models, after we have allowed for the amount of information storage capacity in a complete consistent candidate model that is used to store X and Y, the remaining capacity needed to allow different complete consistent candidate models to be made is greater when Y is smaller.

Defecting soon is a type of defection requiring a lot of specificity - which requires a lot of information to express it and leaves little scope to make many different models. On the other hand, defecting late requires less specificity - requiring little information to express it and leaving scope to make many different models.

This means that we should expect more complete consistent candidate models that contain the X information to defect later rather than sooner - and the length of the algorithm describing the partial model has an effect on the number of such models. This in turn means that it is more likely that the actual model is one of these models that defects later rather than sooner, making the partial model likely to be correct in this less ambitious sense.

I have not said what is meant by "soon" or "in the near future". There is actually no firm idea of this: in practice there will be different probabilities of agreement for different lengths of time into the future.

Limitations of this View

This is a good start, but the argument is limited in the following ways:

1. It does not allow for multiple theories, but only a single "monolithic" theory.

Each theory, or model, that we could make is expected to do the job of prediction in its entirety. If we have two theories and they give the same predictions then one is redundant. If they give different predictions we have no way of putting these predictions together to form a single model. We would have to decide which of the two theories we want. This is in contrast to how human science and informal understanding of reality works: we can have many theories operating in different domains and at different levels of detail and we can use multiple theories without any conflict.

As an example, we can use the big bang model of cosmology to describe the evolution of the cosmos, the ideal gas laws to tell us what will happen in a piston of compressed gas and principles of sociology to tell us what groups of people might do. We do not have a situation in which applying the big bang model generates the entire sequence of observations that can be made in future and applying the ideal gas laws does the same thing. That is what is required by the methodology presented here, though: it is a methodology intended to find a single "monolithic" model to deal with all of reality. The methodology does not demand certainty - different monolithic models have different probabilities of being true - but only a single model is being sought and can be meaningfully used.

One answer to this may be that there should be no problem with a single monolithic model if we can get it, but our experience of reality and modelling suggests this is unlikely to work in practice.

2. There is no concept of hierarchy.

The models familiar to humans operate over a range of levels. There is a concept of hierarchy in which some models are low level and others are high level; for example, particle physics is at a lower level than plate tectonics.

The methodology proposed here does not recognise any concept of hierarchy: models have no low level or high level nature. This follows unavoidably from the inability of the methodology to accommodate different models. With models only being usable in isolation it is meaningless to talk about different models working at different levels relative to each other.

3. Models are required to be totally accurate.

Models have to predict in full detail every aspect of the universe - or at least every sensory input that would be obtained by an observer. In practice, humans do not have enough information to make such accurate predictions: they have higher level models that make predictions in more vague terms. As an example, a model may predict the height to which a ball will bounce without needing to predict where every atom in the ball goes. The requirement of total accuracy in the methodology proposed here is a consequence of how the models work - generating sequences of bits symbolising observations - and the lack of any difference between high and low level models.

This requirement of total accuracy means that the methodology provides no way of dealing with a model that is even slightly inaccurate, or even a way of describing what "slightly inaccurate" means. Any real modelling system will need to operate with limitations in knowledge and processing resources that lead to inaccuracy: this methodology cannot do this.

It may occur to some readers to use some simple measure such as the Hamming distance to measure the difference between two sequences of bits that describe observations and compare two models, but there is no reason why this should be of any use: there is no obvious reason why two sequences of bits that are "close together" in Hamming distance terms should be regarded as similar to each other for the purposes of this methodology.

4. Models cannot be easily translated into human terms.

The concept of a "model" in this methodology is a very abstract one,
It would be very difficult to convert a theory expressed like this into a description which meaning anything to humans - which suggests there is more to the modelling methodology used by humans.

These limitations mean that, while a version of Occam's razor expressed in this way may be philosophically interesting, it is unlikely to be of much practical use in evaluating models generated by humans or in making algorithms for use in automated model generation in artificial intelligence. We should certainly want a view of Occam's razor that allows us to formally apply it in real situations, so further investigation is required.

Conclusion

It has been shown that if we regard a partial model as "mapping onto" a complete consistent candidate model when the two models output identical sequences of bits representing observations, then partial models with shorter algorithmic descriptions (when expressed as efficiently as possible) can be expected to "map onto" more complete consistent candidate models. This means that shorter partial models should be in exact agreement with more complete consistent candidate models and should have a higher chance of being in exact agreement with the actual model and being "correct".

A problem with this is that the proportion of all possible complete consistent candidate models which are in exact agreement, in the future, with any given partial model is likely to be very low. A solution to this is to reduce our expectations and only seek a partial model which is likely to be in agreement with many complete consistent candidate models "for now", meaning it is likely to agree with the actual model "for now". It has been shown that this is most likely achieved by seeking partial models with short descriptions. This appears to be a justification for Occam's razor.

This view does have some limitations:

  1. It does not allow for multiple theories, but only a single "monolithic" theory.
  2. There is no concept of hierarchy.
  3. Models are required to be totally accurate.
  4. Models cannot be easily translated into human terms.

These limitations make the justification of Occam's razor given here of little practical use. A better way of viewing Occam's razor is necessary and this will be discussed in later articles. Some concepts mentioned in Russell Standish's article [6] will help.

References

[1] Web Reference: Almond, P. (2005). Occam's Razor Part 1: What Is Occam's Razor? Retrieved 22 August 2005 from http://www.paul-almond.com/OccamsRazorPart01.htm.

[2] Web Reference: Almond, P. (2005). Occam's Razor Part 2: Principles of Language. Retrieved 9 October 2005 from http://www.paul-almond.com/OccamsRazorPart02.htm.

[3] Web Reference: Almond, P. (2005). Occam's Razor Part 3: Assumptions About Reality. Retrieved 13 November 2005 from http://www.paul-almond.com/OccamsRazorPart03.htm.

[4] Web Reference: Almond, P. (2005). Occam's Razor Part 4: An Overview of How Occam's Razor Works. Retrieved 24 December 2005 from http://www.paul-almond.com/OccamsRazorPart04.htm.

[5] Web Reference: Almond, P. (2005). What is a Low Level Language? Retrieved 17 July 2005 from http://www.paul-almond.com/WhatIsALowLevelLanguage.htm.

[6] Web Reference: Standish, R. K. (2002). Why Occam's Razor. Retrieved 24 December 2005 from http://parallel.hpc.unsw.edu.au/rks/docs/occam/occam.html.

Home Guest Book Links Email

© Copyright Paul Almond 2003-2006. All Rights Reserved. Email: info@paul-almond.com
This page last modified: Saturday January 14, 2006 19:21