paul-almond.com
Home Guest Book Links Email
Occam's Razor Part 4: An Overview of How Occam's Razor Works

By Paul Almond, 24 December 2005

Previous Articles in this Series

Reading of the following articles is suggested before reading this article:

Occam's Razor Part 1: What Is Occam's Razor? http://www.paul-almond.com/OccamsRazorPart01.htm [1]

Occam's Razor Part 2: Principles of Language http://www.paul-almond.com/OccamsRazorPart02.htm [2]

Occam's Razor Part 3: Assumptions About Reality http://www.paul-almond.com/OccamsRazorPart03.htm [3]

The following article, although not part of this series, is closely related to it and is a prequel to it:

What is a Low Level Language? http://www.paul-almond.com/WhatIsALowLevelLanguage.htm. [4]

Introduction

The previous three articles in this series have stated the assumptions and ideas needed for Occam's razor to be asserted, but they have not given much of an idea of how it can actually work. In fact, the situation may seem worse than this: the last article Occam's Razor Part 3: Assumptions About Reality may have even given the impression that Occam's razor does not work.

The last article [3] suggested that a single algorithmic description of reality is correct and that no particular algorithmic description should be given preferred status. This means that any algorithmic description of reality should be as likely as any other to be the correct one. This may seem to contradict Occam's razor which suggests that certain descriptions of reality - the "simpler" ones - are preferred. How do we resolve this and how will Occam's razor actually work?

This article will deal with this. For the first time in this series of articles we will actually give an idea of how Occam's razor works.

One of my main interests in doing all this with Occam's razor is in the area of artificial intelligence: a statement of Occam's razor which allows it to be practically used by an algorithm to actually synthesize and test theories would be useful. To get there, however, the current ground is best covered first. The basic idea mentioned in this article as the justification of Occam's razor is not new and is often described as an ensemble view. A discussion of this sort of justification of Occam's razor is provided by Russell Standish [5]. When enough familiar ground has been covered things will become more interesting.

What Has Been Stated Previously

I will not repeat everything that has been discussed so far, but a mention of the points that are most important to this article will be useful.

The principle of algorithmic description assumed that there is a particular consistent candidate model - an algorithmic description of reality - that can be considered "correct" in that it would model reality with complete accuracy. Such a model is referred to as the actual model.

The principle of statistical impartiality with respect to algorithms assumed that any given consistent candidate model is as likely as any other to be the actual model in the absence of any knowledge about reality contradicting this. This means that, as far as we know, any particular algorithm which is supposed to model reality and is consistent with previous observations of reality is as good as any other consistent model.

The Problem

This may appear to invalidate Occam's razor. Occam's razor states that there are preferred models: the "simplest" ones.

If Occam's razor is actually telling us anything about the probabilities of models being "correct" then it gives certain model descriptions a preferred status and relates the chance of a model description being the "correct" one to some property of that description - its "simplicity". We would probably consider simplicity to have some relation to the length of the model's description, but we have not even got as far as needing to worry about this because the above assumptions seem to be suggesting that any algorithmic description is as likely as any other to be the actual model. There seem to be no criteria for preferring one model to another, so Occam's razor does not appear to be in effect.

How can we resolve this problem?

The Solution: Partial Descriptions

Consistent candidate models and models proposed by humans, or by any other real-world thing that generates models - such as a computer program or the brain of an alien - are two different things. A consistent candidate model, if it is to be correct, is expected to predict every detail that can ever be observed in the future. The consistent candidate models that we have been considering are presumed to be complete. To produce a complete model would require complete knowledge about reality - something that humans do not have. We are never going to see a complete consistent candidate model for the universe: they are just abstractions.

Models produced by humans are incomplete. They are made from limited observations of reality and using limited processing capability to analyse the results of those observations to produce models. They are partial models. A partial model can be considered, somehow, to be only a part of a consistent candidate model.

This allows us to understand how Occam's razor can work. A partial model will not merely be part of a single consistent candidate model. It will actually be a part of a number of consistent candidate models. If a partial model happens to be part of a consistent candidate model that happens to be the actual model then the partial model can be considered to be "correct": any such model will actually be part of the unknown complete model that would describe reality perfectly.

Short of actually making complete models this is the only way in which we can assess the success of the partial models that we are limited to producing. If a partial model is part of a large number of consistent candidate models then the chance of one of these consistent candidate models being the actual model is high. If, however, a partial model is a part of only a small number of consistent candidate models then the chance that one of these is the actual model is low.

This is how Occam's razor works. Occam's razor favours partial models which are expected to be part of a large number of complete consistent candidate models, which means in turn that one of these models is more likely to be the actual model. Occam's razor therefore favours partial models that are more likely to be part of the actual model.

This resolves the apparent paradox caused by any complete consistent candidate model being as likely as any other to be correct: there are no preferred complete models, but there are preferred partial consistent models because each partial model corresponds to more than one complete model. A discussion of this idea of mapping of some entities onto multiple complete descriptions of the universe is given in Standish's paper [5].

What does it mean to say that one model is part of another model?

I have just stated that models produced by humans are partial models and that partial models are part of other models. This concept is going to be very important in our later discussions of Occam's razor. What does it mean, however, to say that one model is "part of" another? Models are described by algorithms so we could take the term "part of" as literally meaning that the smaller, partial model's code occurs within the much larger, complete model as if it had been "copied and pasted" into it; that is to say that a model is part of another if the entire sequence of symbols comprising the model occurs as part of the longer sequence of symbols in the complete candidate model. While this may be the simplest way of looking at it, it causes problems. We probably need a more sophisticated idea of what it means for one model to be "part of" another.

Using the term "part of" may be taken as implying specifically the sort of relationship between models that I just described. For this reason, it may be appropriate if we use slightly different terminology and, instead of talking about models being "part of" each other, talked instead of models "mapping" onto each other. We can be vague at this stage about what "mapping" actually means, but it basically means whatever "part of" means. Using such terminology the basic idea of Occam's razor would be that models which are preferred by it are consistent partial models that map onto many complete consistent candidate models. Models that are not preferred by Occam's razor would be consistent partial models that do not map onto many complete consistent candidate models. Partial models that are inconsistent would, of course, not even need to be assessed by Occam's razor: they cannot be valid.

The mapping to which I referred here is some logical connection between the two models - the partial one and the complete one.

The Sort of Model that Is Likely to be Favoured by Occam's Razor

While we may not yet be able to say much about what "part of" or "mapping" means, we do have experience about the kind of partial model likely to be favoured. Our intuitive experience should tell us that partial models with long descriptions seem "complex" with regard to Occam's razor while partial models with short descriptions seem "simple" with regard to it. This means that longer models seem to be disfavoured by Occam's razor while shorter models seem to be preferred.

This does not mean that this is what the words '"complex"' and "simple" mean in any general way: it is possible that the words "complex" and "simple" are being misused in reference to Occam's razor and they mean different things entirely when used in the contexts in which they are more frequently used, such as discussing the complexity of a pattern or an image. Such issues need not concern us here, however: we are only concerned with algorithmically expressed models and it is likely that the length of the algorithmic description of a model has importance in determining the degree to which the model is favoured by Occam's razor.

Putting this into the terminology that I have just used, it is likely that Occam's razor relies on partial models with short algorithmic descriptions "mapping onto" large numbers of complete consistent candidate models and partial models which have longer algorithmic descriptions "mapping onto" fewer complete consistent candidate models.

The Idea of One Model Literally Being Part of Another

The most obvious way for the mapping of partial models onto complete consistent candidate models to occur is for the exact sequence of symbols in the algorithm describing the partial model to occur in the algorithm describing a number of complete consistent candidate models - what I previously described in terms of "copying and pasting". All models are algorithms and this would mean that the algorithm or program describing a partial model generated by humans would actually occur somewhere in a complete candidate model for it to be said to be "part of" it or to "map onto" it. Could this be the case?

This may seem to make some statistical sense. Shorter algorithms contain fewer symbols and algorithms containing exact matches with those symbols should occur more frequently. For example, let us suppose that we have a very a large collection of complete consistent candidate models. A certain number of those candidate models would contain the sequence 010. That sequence is only three symbols. It is not demanding too much for a model to contain it; however, if we imagine a partial model which has the sequence 10110110111101010111 then it would seem reasonable that this sequence would occur less commonly in the set of complete consistent candidate models. That is to say, shorter partial models would have sequences of symbols which occur more commonly simply because fewer constraints are being put on a complete consistent candidate model for it to actually contain the sequence in question. Does this mean that this is all there is to Occam's razor?

It is not that simple. There are problems with this.

The first problem is that we are not just dealing with a random set of complete consistent candidate models. We are dealing with the set of all complete consistent candidate models. This means that we are dealing with models of unlimited length. Some of the complete consistent candidate models will be quite short but there would be a huge number (actually an infinite amount, although I do not like that term too much - more on that later) of complete consistent candidate models. The problem with this is that, given a long enough complete consistent candidate model, there is enough opportunity for any sequence of symbols we imagine to occur in it, not just once, but many times.

As an analogy, if we have a long enough random sequence of letters then there is no reason why the text of an entire Shakespeare play should not occur in it many times. This would suggest that, given a large enough set of complete consistent candidate models, which would mean having very long complete consistent candidate models, there would be no obvious way of distinguishing between partial models on the grounds of how many complete consistent candidate models contain a particular partial model: almost all of the very long complete consistent candidate models would contain many instances of the symbols in all the partial models that we can produce.

Against this, it could be argued that shorter partial models - that is to say partial models with shorter descriptions - would occur more frequently in any given complete consistent candidate model. This is actually true no matter how large the complete consistent candidate model is, or rather how long the algorithm describing it is. The frequency with which any particular sequence of symbols occurs in even a larger model can be expected to be higher when the length of that sequence is shorter. We would expect the sequence 010, for example, to occur much more often than a sequence of symbols which is much longer. Could this be the answer? Maybe almost all complete consistent candidate models contain any partial model we care to think about, but short partial models tend to be repeated more. Unfortunately, there is no reason why this should help us: why should it matter how often a particular partial model occurs in a complete consistent candidate model? If one partial model occurs once in a particular complete consistent candidate model and another partial model occurs a thousand times why should this matter? We are interested mainly in prediction. We want to know that the partial model which we select as our working hypothesis is likely to be giving the same predictions as the actual model. That is the whole rationale behind the idea of the shorter partial models mapping onto large numbers of complete consistent candidate models: it is supposed to increase the chance of the actual model agreeing with our partial model. When we start to talk about the frequency of occurrence of a partial model within a single complete consistent candidate model then it is hard to say really what is going on or what sort of partial model would be preferred.

Another issue which may have occurred to some readers is that of infinity. I said that the partial model would actually occur many times within a given complete consistent candidate model. It actually seems worse than this: there is no limit on the maximum length of complete consistent candidate models and that means that there is scope for partial models being repeated infinitely within them. How do it would we resolve this issue? At this stage I do not intend to. There is a simple solution to the issue of infinity which I will present later. For now we can treat the set of complete consistent candidate models as if it were a vast set of incredibly long models rather than actually being an infinite set of every possible models. We can pretend that we are dealing with a very large number of models - in effect, that we are dealing with all complete consistent candidate models up to a certain very large length - a length potentially long enough to allow huge repetition of arbitrary sequences of symbols, but not necessarily infinite.

The second problem with such a "copy and paste" view of the relationship between partial and complete models is that the behaviour of an algorithm is dependent on all of the code within that algorithm and, ignoring a small number of contrived special cases, knowing just some of the code in the algorithm does not tell you what the behaviour of the algorithm will be, because the behaviour depends on the code that you know and the rest of the algorithm that you do not know.

Would it help to get a partial model with a longer algorithmic description, so that we would know a greater part of any complete consistent model of which it was a part? A problem here is that we are dealing with the set of all possible complete consistent models. There is no limit on the lengths of the model descriptions in this set and any partial model that we could obtain could only be a tiny fraction of the size of most complete consistent models, so it is hard to see how it could help much.

It would not help - no matter how long a partial model's algorithmic description is. Algorithms are generally unstable and the influence that a part of an algorithm can have on the whole algorithm is nothing to do with the length of that piece: an algorithm with a billion commands could have its behaviour drastically altered by alteration of a single command. This means that, unless we know all of an algorithm, then no matter how much we do know, we cannot know how that algorithm will behave.

This view of what it means for one algorithm to be "part of" another is simplistic, but we can consider it to have some truth in that it does give an idea of the sort of justification that Occam's razor has: it merely appears simplistic on close inspection.

Model Preference as Probability

The way in which Occam's razor is being viewed here potentially allows us to assign quantitative probabilities of correctness to partial models.

The probability that a partial model is correct is determined by dividing the number of complete consistent candidate models onto which that partial model maps by the total number of complete consistent candidate models.

That is to say:

M(Partial Model) / N

where:

P(Partial model is correct) is the probability that the partial model being considered is "correct"

N is the total number of complete consistent candidate models.

M(Partial Model) is the total number of complete consistent candidate models that are "mapped onto" by the partial model being considered.

Some readers may have objections that such a probability is meaningless because the set of all complete consistent candidate models is infinite and the number of these mapped onto by any given partial model will also be infinite. I will deal with this later in this article.

The way in which we are currently dealing with Occam's razor may be a bit over-simplified, so it is possible that we may, in the end, need to use different expressions to calculate probability, but this does suggest that, if we know enough about how Occam's razor works, then we may be able to determine formal methods for assigning probabilities to all kinds of things - including some things that are thought by many people to be beyond what we can know about. We would not need to have direct experience of something to put a probability on it: we would need a formal description of what we were considering and some way of computing the proportion of complete consistent candidate models onto which that description "maps".

No Single Preferred Partial Model

Many people view Occam's razor as preferring a single model - the "simplest" - and rejecting all other models. The sort of statistical view that we are using here tells us that no single partial model has any absolutely preferred status and no single partial model, provided that it is consistent, is ruled out by Occam's razor.

In the view that we are taking here, each consistent partial model has a probability of being correct and that probability depends on the number of complete consistent candidate models onto which it "maps". Models with shorter algorithmic descriptions probably map onto more complete consistent candidate models.

This sort of view fits better with how we intuitively approach the world, where we have no single model in which we have confidence, but rather a continuum of possibilities with varying degrees of likelihood.

Dealing with Infinity

I have stated that the probability of a partial model being "correct" depends on the proportion of all the complete consistent candidate models onto which it maps. If a partial model maps onto 25% of the total number of complete consistent candidate models then this would suggest that it has a probability of being "correct" of 0.25, or 25%, but this may seem to require that there is a total number of complete consistent candidate models. If there are a billion complete consistent candidate models it is possible to imagine how this may work: maybe we could get a computer to test every possible complete consistent model. Even if there were a billion billion complete consistent candidate models then we can imagine how it would work in principle, although we may not be able to check every complete consistent candidate model and we would need to use some mathematical shortcut which could somehow deal with many complete consistent candidate models at once to give us an approximation.

The problem is that we are not merely dealing with some particular set of complete consistent candidate models, but with the set of all complete consistent candidate models. There are no limits on the lengths of the description of these models and there are an infinite number of such complete models. We will not be worrying about issues such as whether or not infinity really exists, or whether we should regard it as a number: we already have enough to deal with and, whatever our views of "infinity" are, we are certainly dealing with an issue of having an unlimited set of complete consistent candidate models and we can casually talk about it being "infinitely large". If there are an infinite number of complete consistent candidate models then how can it make sense to talk about how many of these are "mapped onto" by a particular partial model? The answer may seem simple - there are an infinite number of complete consistent candidate models and any partial consistent model also maps onto an infinite number of complete consistent candidate models, even if it does not map onto every complete consistent candidate models - so it may seem that it is meaningless to talk of any differences between partial consistent models in terms of the proportions of complete consistent models onto which they map. Such an objection would, however, be flawed: the issue of infinity is not as much of a problem as it may initially appear to be.

Calculus deals with similar kinds of issues by using the concept of tending to limits. We can use a similar sort of approach when considering the set of all possible consistent candidate models.

Let us start by placing an artificial limitation on the number of models in the set of complete consistent models: we will limit the number of models by placing an upper limit on the length of the algorithmic description of each model. We may require, for example, that no model has an algorithmic description longer than 100 bits (or whatever symbols we are using to encode our models). Placing an upper limit on the length of each model's description means that there are now a finite number of models in the set; for example, if models could only be 100 bits in length then there could be no more than 2100 possible models, as there are 2100 ways of making a sequence of bits which is 100 bits long or less.

When the description length is limited like this it becomes meaningful to talk about the number of models in the set onto which a given partial model "maps" - although we will still need to know what "mapping" means. In this limited situation we can compute the fraction of complete consistent models onto which the partial model maps. This value will give us an indication of the probability that the partial model is "correct".

Suppose we now increased the maximum description length of complete consistent candidate models, which will increase the number of models in the set. We could now once more obtain the fraction of these models that are "mapped onto" by the partial model, to get the probability of the partial model being "correct" again. The probability that we have now should be more accurate as we used a larger set of complete models.

We could continue in this way, increasing the maximum length of the complete consistent candidate models, and therefore the number of models in the set, each time to get a more accurate probability value.

Ideally we should continue until there is no limitation on the maximum length of description for each complete consistent candidate models and we are checking an infinite number of them to see how many of them are "mapped onto" by a particular model, but this is clearly not going to happen. As we increase the maximum description length for models in the set of complete consistent candidate models, however, we should see the proportion of these "mapped onto" by a particular partial model becoming more accurate and converging on a particular value.

We can therefore say that the probability of a partial model being "correct" is this value onto which the proportion of complete consistent candidate models "mapped onto" by the partial model converges as the maximum description length permitted for complete consistent candidate models tends to infinity.

That is to say that:

Limit L tends to infinity of ML(Partial Model) / NL

where:

P(Partial model is correct) is the probability that the partial model being considered is "correct"

L is the maximum permitted length of algorithmic description for any complete consistent model.

NL is the total number of complete consistent candidate models with description lengths of L or less.

ML(Partial Model) is the total number of complete consistent candidate models with description lengths of L or less that are "mapped onto" by the partial model being considered.

Conclusion

This article has given an overview of how Occam's razor can work to give the probability that a model is correct. The view is a statistical one in which Occam's razor is considered to be about the probability of correctness of a model, in terms of future predictions.

The last article in this series [3] may have suggested that Occam's razor cannot work. It took the position that, in the absence of any knowledge otherwise, we cannot give any preference to any particular consistent candidate model: we must presume them all equally likely.

The solution to this is provided by the idea that models generated by humans, or indeed by any real thing that we can imagine that generates models (such as the brains of animals, artificially intelligent computers or aliens), can never be complete models. They must always be partial models. Each consistent partial model "maps onto" a number of complete consistent candidate models and the greater the number of complete consistent candidate models onto which a partial model maps, then the greater the chance of the actual model being one of those onto which the partial "maps" and, therefore, of the partial model being correct. This approach suggests the possibility of calculating the probabilities of various partial models being correct.

This article has not specified what it means to say that one model "maps onto" another. That will come in future articles. For now it must suffice to say that a complete consistent candidate model is "mapped onto" by a partial model if that complete consistent candidate model has characteristics in common with the partial model - in terms of predictions that it makes - that would mean that, should the complete consistent candidate model happen to be the actual model, the partial model would be viewed as correct.

One simple way of considering the idea of "mapping onto" for models is to view it in terms of "copying and pasting". In this view, a partial model would "map onto" a complete consistent candidate model if all the symbols of the "code" in the algorithm describing it occurred somewhere within the complete consisted candidate model - as if it had been copied and pasted into it. It has been shown that this view is not completely adequate - due to problems with the huge size of complete consistent models that would need to be considered and the inherently chaotic nature of algorithms. This idea does, however, give an insight into the kind of way in which Occam's razor works.

Although the idea of "mapping onto" needs to be considered further, it seems likely that models with shorter algorithmic descriptions will tend to "map onto" more complete consistent candidate models than models with long algorithmic descriptions.

An objection could be made to this article by saying that the probability of a partial model being correct depends on the proportion of complete consistent candidate models onto which it "maps" and that, as the set of complete consistent candidate models is unlimited (it could be said that it is infinite), it makes no sense to talk about any proportion of this set. This objection is unfounded and can be resolved with a concept similar to one in calculus: we can limit the set of complete consistent candidate models to only those models which have description lengths not exceeding a certain upper limit and say that the probability value is the proportion of complete consistent candidate models in the set that are "mapped onto" by the partial model as the upper limit on description lengths tends to infinity.

The way that Occam's razor has been described here is probably a simplification, but it gives a basic idea of what is going on. Future articles will explore the subject in more detail.

References

[1] Almond, P. (2005). Occam's Razor Part 1: What Is Occam's Razor? Retrieved 22 August 2005 from http://www.paul-almond.com/OccamsRazorPart01.htm.

[2] Almond, P. (2005). Occam's Razor Part 2: Principles of Language. Retrieved 9 October 2005 from http://www.paul-almond.com/OccamsRazorPart02.htm.

[3] Web Reference: Almond, P. (2005). Occam's Razor Part 3: Assumptions About Reality. Retrieved 13 November 2005 from http://www.paul-almond.com/OccamsRazorPart03.htm.

[4] Almond, P. (2005). What is a Low Level Language? Retrieved 17 July 2005 from http://www.paul-almond.com/WhatIsALowLevelLanguage.htm.

[5] Standish, R. K. (2002). Why Occam's Razor. Retrieved 24 December 2005 from http://parallel.hpc.unsw.edu.au/rks/docs/occam/occam.html.

Home Guest Book Links Email

© Copyright Paul Almond 2003-2006. All Rights Reserved. Email: info@paul-almond.com
This page last modified: Saturday January 14, 2006 19:21