paul-almond.com
Home Guest Book Links Email
What is a Low Level Language
By Paul Almond, 17 July 2005

Introduction

'low level - adjective - computing - of or relating to languages or programming operations that are relatively close to machine code in form.'
- New Oxford Dictionary of English, 2001

'Change your language and you change your thoughts.'
- Karl Albrecht

The idea of 'low level' languages is one with which most computer programmers can deal intuitively, yet the term 'low level' is harder to define than is commonly thought.

To most of us it seems obvious what a low level language is: it is a 'basic' or 'simple' language, usually providing only a small number of instructions, none of which are very 'powerful'. A 'low level' language is 'close to the hardware'. The language itself 'does not do much' and the workload on the programmer who writes software in that language is considerable. This may seem adequate, but it is surprising how difficult such a concept is to express formally.

What constitutes a low level language is of some philosophical importance as there are matters in philosophy for which we may wish to consider conceptual computer programs removed from any of the specific attributes of human technology. When such matters are being considered it is often natural to presume that some low level language system is being used, such as a simple universal Turing machine, yet to justify doing this we need to be sure that the concept of 'low level' language is meaningful.

This article is intended to address this problem. It will discuss the idea of 'low level' language, consider the need for a workable concept of 'low level language' in philosophy and propose a way of defining it.

This article will precede a series that I intend to write about Occam's razor, a concept in scientific philosophy that will be briefly discussed later in this article. Although the article is not specifically concerned, for the most part, with Occam's razor, the issue of what constitutes a low level language is of importance when considering the subject and I decided that a reasonable argument needed to be made that this matter could be resolved before I could proceed further, in later articles, with more important issues relating to Occam's razor. I will not be discussing the ideas of these later articles in much detail here: this article could be considered a 'prequel' to the later series.

The Intuitive Idea of Low Level Language

A good way of getting an idea of what we mean when we talk about 'low level' languages is to consider the history of computing. Early computers could not be very powerful without today's huge density of switches. Machines were crude with large switching units that did not work quickly, or even reliably in many cases, and the amount of processing that such machines could perform was very limited. Each command had to do only a small amount of processing because to execute each instruction in a program some crude arrangement of machinery had to be engaged which could not actually do very much.

This dictated the languages that early computers could provide to programmers: they had to be low level. Programmers had a limited number of instructions to use and each of these instructions did not do very much processing. To make a useful program a very large number of commands was typically needed. People in these pioneering days may not have been as aware of this limitation as we would be now if it were imposed on us: the availability of high level languages has made us used to having rather more.

Why does the issue matter?

There are areas of philosophy in which consideration of algorithms in an abstract way is useful. Two examples are Occam's razor and complexity:

Occam's Razor

Occam's razor [1, 2] was proposed by William of Occam (or Ockham), an English philosopher of the thirteenth and fourteenth centuries. It is the idea of preferring the simplest explanation. Though there is some controversy about exactly what Occam meant, the modern equivalent of this principle, generally called 'Occam's razor' is the idea of trying to obtain the 'simplest' possible theory in science. It is generally thought that the 'simplest' theory is the one most likely to be correct, but for this to make sense we need to know what 'simplest' means.

Various justifications [3] can be made for saying that the simplest theory is the one containing the least information; that is to say, the one with the shortest description, but this could rely on an arbitrary idea of what 'simplest' means. A theory could appear short when described in one way and long when described in other ways.

As an extreme example, Newton's law of gravity could be considered a concise way of explaining the motion of planets - there really is not all that much of it - but suppose I proposed my special 'bug-eyed space dragon initiated celestial movement' theory? Such a theory is clearly absurd: we can see this intuitively. We can also give what appears to be a better reason for the absurdity of the idea by considering the amount of information that is likely to be needed to describe a space dragon. Space dragons are probably more complicated than Newton's simple equations - the details of their biochemistry would probably fill a few books - and their description would need to contain much more information. Admittedly, some readers could point out that a simpler explanation might be advanced for space dragons themselves, possibly allowing them to be used in a theory without incurring all of the 'penalty' of their long description, but this would be getting into deeper issues concerning Occam's razor and away from our main subject.

The main point here is that in considering Occam's razor we will be concerned with descriptions of objects and the issue of how long these descriptions are is likely to be of some importance.

This causes problems. My space dragons idea may seem unwieldy when compared with Newton's equations, but I could argue that space dragons are really simple, as only the simple term 'space dragons' is needed to adequately describe my theory, provided that we are using some variation of English in which the term 'space dragon' has an accepted meaning.

This would clearly be wrong and this point was made by Occam: thinking that we have somehow achieved simplicity by giving something a label and then triumphantly pointing out that we only need that label to describe the concept is a fallacy. If this were the case any concept could be made to seem simple merely by inventing a label for it. Occam knew that labels and explanations are not the same thing.

Labels are just words though, and even if we try to provide an explanation, rather than merely a label, that explanation will consist of words. How do we know that some of the words within what looks like an explanation are not simply labels used to cover a lack of explanation? As an example, I may accept that I do have to give some description of space dragons, and yet simply invent some words to hide the complexities of their biochemistry. It would not be too original for me to do this, actually: it is exactly what many contemporary books do when promoting all kinds of implausible belief systems.

How would we solve this problem? It is one of misuse of language and we should be able to solve it by limiting the sort of language that can be used. We could demand that the only language used for explanations is one in which labels for sophisticated concepts do not exist. Such a language would be more basic than normal English - it would certainly lack words like 'personality' or 'democracy' - and it would be the human equivalent of a low level computer language. Describing things would take more words, but we could presumably be confident that any explanations made in it were real, and not artificially shortened by exploiting some feature of the language.

We could take this further. Human language is vague and if we want to make proper theories then we should be able to formally state them. A formal description of something is basically an algorithm or computer program and our theories could be represented as computer programs to avoid the ambiguities of human language. We can see, for example, how Newton's theory of gravity could be represented as an algorithm: instead of an English description there would be a program which actually performed calculations according to Newton's theory and gave the predictions of the theory as output.

This does not, in itself, resolve the issue that I have mentioned. Whereas a clearly absurd English description of a theory could be shortened by exploiting features of English, making it appear concise and reasonable, the same could be done with algorithms: the algorithm that represents an absurd theory could be shortened by exploiting features of the language in which it was written. As an example, a ridiculously complicated theory of gravity could be expressed by a short algorithm if the programming language happened to have a command specifically to do the complicated mathematics for that theory. In fact, any long algorithm, no matter how contrived, could be replaced by a much shorter algorithm if we allowed the language to include 'super commands' specifically to do that algorithm's work. If we are going to start comparing the lengths of algorithms then we should demand that they are all encoded in a low level language.

This could be questioned: there are criteria other than simple algorithm length for assessing the plausibility of a model. If, however, any reasonable method of making this assessment is used then it is hard to escape the importance of algorithm length and/or avoidance of contrived language at some point in the reasoning. To formally assess the relative plausibility of two scientific theories we need to know what a 'low level' language is.

Complexity

We have an intuitive idea of what complexity is. It can be harder to formally describe it, though numerous definitions exist. In most of the accepted definitions complexity is viewed as a property of an object's description and this means that the same issue arises that we were just considering for Occam's razor: we should ideally have a low level language for expressing these descriptions.

This need arises because, in definitions of complexity, description length is still likely to be an issue. It may be thought that complex objects simply have longer descriptions. This is not actually true: if it were then the most complex objects would be totally random ones, which does not fit our intuitive expectations, but nevertheless, the issue of description length is still likely to be given some regard in a reasonable definition of complexity.

As with arguments about Occam's razor, when we try to formalise things we inevitably end up with descriptions being expressed as computer algorithms. Just as with Occam's razor, if we want to use these descriptions to assess complexity we need them to be expressed in a low level language.

What is the problem?

These two examples of philosophical ideas - Occam's razor and complexity - both suggested a need for a low level computer programming language for the expression of algorithms. How low level does this need to be? Problems start when the high level nature of a computing language makes it particularly suited for efficient expression of some algorithms that would not be so easily expressed in other languages. The language should therefore be as low level as possible.

It is probably not necessary for us actually to have this low level language. We seem to have a good intuitive grasp of what constitutes a low level language. If we select a well known low level languag such as that provided by a simple universal Turing machine - a basic sort of computer that will be discussed shortly - we may not be sure that it is the lowest level language we could have - which would be the ideal - but we would probably be justified in thinking that it was close enough. In fact, we may not even need to specify a language in many cases: many philosophical arguments relating to computing can be made on the assumption that some low level language is being used which is not actually defined in the argument.

Even if we do not need to guarantee selection of the lowest possible level of language, or even explicitly state the language used in some argument, we still need to be confident that the term 'low level' has meaning. If what constituted a low level language were arbitrary and subject to human preference, then various properties of algorithms would also be arbitrary: it may be difficult to show that any algorithm constitutes a more plausible physical model or describes a more complex object than another algorithm in any objective way.

The whole situation - if it were unresolved - would be similar to the problem of 'proving your own sanity' described by Hofstadter [4] in Godel, Escher, Bach: an Eternal Golden Braid. If you are insane, then your insane thought processes could be insanely producing logic that seems to prove your sanity. Likewise, if one wrongly accepts a particular language as a 'low level language' then the very use of this language in descriptions and reasoning could make the language appear to be low level and lead to apparently correct 'proofs' of the low level nature of the language. How serious is the problem of this vicious circle? What are the ways in which we could get trapped in it? Is there an escape and if not would the concept of 'low level' language have any real meaning at all?

Using Languages to Describe Languages

In this article we will be considering languages and how they can be described. I will often refer to the idea of using one language to describe another.

Let us suppose that we have two languages, A and B. If we were to use Language A to express a description of Language B what would this mean?

It is best to assume that we are dealing with formal languages, such as computer programming languages. A description of Language B expressed in Language A would essentially be a computer program or algorithm, written in Language A, which could act as an interpreter of languages encoded in Language B.

Some readers may comment that it is better to consider compilation rather than interpretation, but it will be simpler to focus mainly on interpretation in an article like this: any language that can be expressed in a compiler can also be expressed in an interpreter, in principle, so the distinction need not concern us too much, irrespective of how languages may be expressed in reality.

Turing Machines and Turing Equivalency

In this article the idea of Turing machines will be used. A Turing machine is a conceptual computer proposed by Alan Turing and it consists of an infinitely long instruction tape containing symbols. A read/write head moves over the tape and can read or write symbols. It is directed by the symbols on the tape and a lookup table. (See Figure 1)

Figure 1: A Turing Machine

Figure 1: A Turing Machine

A universal Turing machine can simulate any other Turing machine and this makes it a general purpose computer. The idea of Turing machines is important in computing. Any computer that can simulate a universal Turing machine is logically equivalent to it and, if we ignore limitations on memory, could simulate any other Turing machine or any other computer. Such a computer is said to have Turing equivalency and is itself usually regarded as a universal Turing machine. One common view is that a computer must have Turing equivalency to be a real computer, though the semantics of this need not concern us. (See Figure 2)

Figure 2: A Universal Turing Machine

Figure 2: A Universal Turing Machine

Turing equivalency can be provided by a computer programming language and this article will concern itself with languages that provide Turing equivalency. For a programming language to be Turing equivalent it has to provide a system that is logically the same as a universal Turing machine, which means that it must be possible to use that language to write a simulation of any Turing machine. If a programming language is Turing equivalent then it is a general purpose language and it is possible to use that language to write any program that can be formally described or simulate any computer.

Programs in this article will be considered to be sequences of binary digits (bits). For a language to be Turing equivalent it is not necessary for it to be easily understandable to humans. A language may accept a sequence of bits which direct it to produce some output in a way that may appear random to a casual observer and that lacks any simple description or any of the organisation associated with programming languages developed by humans, yet if, for every Turing machine, a sequence of bits could be found to simulate it, then the language would be Turing equivalent.

At some points in this article I will refer to simple or basic universal Turing machines. I use this terminology to refer to those Turing machines that very closely match the idea of the minimal computer proposed by Turing, with an instruction tape, read/write head and lookup table. Any general purpose computer or programming language is, of course, equivalent to such a machine, but a simple Turing machine is one which has only the logic described in Turing's idea.

But why should this be awkward? - Obvious approaches that do not work

Some readers may have accepted that a firm, objectively justifiable concept of 'low level language' is necessary, while not accepting that there is any problem in defining it. Surely, they could argue, is it not true that a low level language is 'simpler' and 'cruder' than a high level language? What is the problem here? To show why there is one I will discuss some of the more obvious methods that we may choose to define a low level language - none of which work. This is intended to make my point that we need to be more subtle about how we define the concept.

Here are the flawed definition attempts:

Attempted Definition 1: Information needed to describe the language

Could we not say that less information is needed to describe a low level language? We can see how much information is needed by looking at the software needed to implement a language, which is essentially a description of it.

A high level language such as Visual Basic or Pascal needs a large computer program (an interpreter or compiler) to make the language work. For a lower level language a smaller computer program would suffice; quite small computer programs can be used to 'assemble' (that is to say, compile) instructions encoded in assembly language - a language so close to the computer's own machine code, with a one to one correspondence between assembly language instructions and the computer's own machine code instructions, that most programmers would probably say that assembly language actually is the computer's own machine code expressed in a more readable way.

The sort of language offered by the lookup table of a simple universal Turing machine would generally be considered to be a low level language and this seems to fit in with this idea: a program to simulate a simple universal Turing machine, allowing the user to provide instruction tapes which are then executed by the simulated Turing machine would be considerably smaller than a program which could provide the Pascal language.

Why it fails

The software needed to implement a language (for example, to provide the Pascal language) is itself written in another language. How do we know that the amount of information needed to describe a language is not simply an artefact of the language in which the description is being expressed, rather then being any real property of the language that we are considering? (See Figure 3)

Figure 3: The amount of information needed to describe a language depends on the language used to do the describing.

Figure 3: The amount of information needed to describe a language depends on the language used to do the describing.

As an example, we may find that when we use some programming language X we need a large program to provide a language like Pascal, but only a small program to produce a simulation of a simple universal Turing machine. This could lead us to conclude that Pascal is a higher level language than the language provided by a basic universal Turing machine, but surely all we have really shown is that language X requires less code to be used for a simulation of a simple universal Turing machine than to provide Pascal.

We could use a different language Y that requires less code to write a high level language interpreter than it does for a simple universal Turing machine. From the point of view of someone using such a language it may appear that Pascal is more 'basic' or 'low level' than the language of a simple universal Turing machine.

Such a language as Y would be contrived, but it would certainly be possible to make a language which allows any chosen language to be expressed with a minimum of information. If the language happens to be a computer language being used to make a Pascal interpreter, for example, then we could go to an extreme of having a single 'super command' in that language which duplicates the entire functioning of a Pascal interpreter written in a more conventional language.

This causes problems. If the degree to which a language is low level depends on the language used to describe it then it may seem that it is all entirely subjective. This does not seem to fit in with our intuitive expectations.

It may seem that we can solve this issue by insisting that the language used to describe our languages should be a low level language. This means that if we were comparing a high level language like Pascal with a low level language like the language of a simple universal Turing machine we would demand that the same low level language be used to describe both languages and take the place of languages X and Y. Such a low level language would preclude the contrived situation that we created with Language Y. As an example, we may use some simple, low level machine code to construct a Pascal interpreter and a simulation of a simple universal Turing machine and then look at how large each of these programs is.

The problem here is that we seem to need a low level language to allow us to describe other languages to assess the extent to which they are low level. How are we to know that this language is itself a low level language? The argument has become circular.

Attempted Definition 2: Close to the hardware

Low level languages are often viewed as being 'close to the hardware' (See Figure 4). Could we use this as a definition?

Figure 4: Low level languages are often viewed as being 'close to the hardware'.

Figure 4: Low level languages are often viewed as being 'close to the hardware'.

Why it fails

How do we decide what hardware to use? We could imagine hardware that makes it particularly easy to implement what we would normally regard as a high level language on it, yet it would seem strange to accept that a high level language can be caused to become a low level language merely by the creation of such hardware.

Some readers could say that in 'close to the hardware' we mean 'conventional hardware' or 'hardware that is viewed as being reasonable'. If we were to allow a definition to be this vague then we would hardly need to refer to the hardware: we could simply refer directly to low level languages and say that they are languages that can reasonably be viewed as low level. It does not really mean anything.

Attempted Definition 3: Ease of physical implementation

Could we not say that a machine which provides a low level language needs less matter to build it than one that provides a high level language?

This is another way of saying 'closer to the hardware', yet now we are trying to say what 'sensible' hardware is. We are also not really making a distinction between the machine and the language software: we are viewing any software that provides the language as being part of the physical makeup of the machine, which it is.

As an example, let us compare some sort of machine code, as provided by the CPU (central processing unit) microchip of a computer with a high level language such as Pascal. To make a computer that provides machine code we would need to make something like a CPU microchip and some small amount of memory (whatever we decide the machine has to have available for running machine code programs). To make a computer that provides Pascal we would have to do all this and provide enough memory to store the Pascal interpretation software - and this is in addition to the 'free' memory which will be available for storing programs that the interpreter is going to interpret. The extra memory needed to contain this extra functionality will need more matter to make it.

It seems hard to avoid: a computer which runs machine code programs will need less matter to produce it than one which can compile or interpret Pascal programs or compile or execute programs written in some similar sort of high level language. This would be true even if we removed a clean distinction between processing and memory. I do not think that this point is controversial and that many readers would think that a computer providing a simple language would need more matter to build it. Could we not say that we have therefore established some objective idea of what 'low level' means?

Why it fails

This fails because it is simply assuming that our laws of physics are equivalent to a low level language. It is the laws of physics that perform the role of telling the matter in the computer what it should do and controlling its state transitions.

What if we lived in a universe which was very different to the one we know - a 'Pascalverse' in which the laws of physics actually made it easier to construct a computer running a Pascal interpreter rather than a computer providing low level machine code? In such a universe the laws of physics could have the concept of Pascal interpretation apparently built into them so that a few atoms could actually behave as a Pascal interpreter. Such laws of physics may seem contrived to us, but why? One reason may be that they are not particularly 'low level' laws of physics, but if we want to use this explanation then we are falling into the circular argument trap of the last case: we are introducing something that is assumed to be low or high level into our definition of what constitutes a low level or high level language. We would need some way of showing that the laws of physics that we have are reasonably low level ones and that various contrived laws of physics would be high level, yet that leaves us back where we started - with some system that needs to be determined to be low level or high level.

An attempt could be made to counter this by saying that imaginary laws of physics are irrelevant. Such an argument would state that all that matters are the laws of physics that govern the reality that we live in and that to assess how low level a language is all we need to do is find out how much matter would be needed to build a device that can provide that language in a world with our physics - regardless of how high level or low level they are.

I intensely dislike this sort of proposal. Making the definition of what constitutes a low level language dependent on laws of physics is removing it from the realm of mathematics and philosophy. It is not a property of the language any more, but a property shared by the language and physical reality. To me, this seems extremely close to making the definition subjective. I think that the languages that we know to be 'low level' languages are like that in an absolute sense and I do not accept that if the laws of physics around us and our computers were different then Pascal could be a lower level language than the machine code language for a CPU microchip.

Attempted Definition 4: More valid programs

In a high level language like Pascal a program has to comply exactly with syntax rules: most of the programs that can be constructed would not follow these rules and would be invalid. In a low level language this is less likely to be the case. The language has less complicated syntax to be followed and more, if not all, programs are likely to be valid.

An example of this is machine code as executed directly by CPUs. The processor simply follows the von Neumann fetch execute cycle and obtains each instruction in turn, executing it without any regard to how it 'fits in' with instructions around it. Each instruction is simply a binary number. Provided that there is an instruction in the CPU's instruction set for each binary number then any sequence of numbers could run as a valid machine code program. Whether or not it produces useful results is irrelevant here. Another example could be a basic universal Turing machine. The read/write head processes each instruction as it is retrieved from the tape and any tape could be valid.

Could we use this as a way of determining how high or low level a language is? Maybe we should say that, the more low level a language is, the greater percentage of its programs that are valid.

There is one issue that may appear to be a problem: in fact, it is quite easily dealt with. If we have two languages - a low level one and a high level one - then there will be an infinite number of programs which we could try to get processed by either language. If only a certain percentage of these are valid for each language then that still leaves an infinite number of valid programs for each language. How can it make sense to say that one language allows more valid programs than another?

The solution to this problem is to limit the maximum length of a program; for example, we may choose to consider only programs which do not require more than one million bits to express them. If we applied such a restriction then there would be finite numbers of valid programs for each language and we could make the comparison.

It could be said that this is contrived: how should we decide what a suitable maximum length of program is? The maximum length should ideally be a very large number, but we can avoid the issue totally by letting it tend to infinity. This is the sort of trick used in deriving the rules of calculus and it is not the same as simply allowing the maximum program length to be infinity. To make this work we could use some sort of process as follows:

To start we could say that the extent to which Language A is low level relative to Language B is the number of valid programs possible in Language A divided by the number of valid programs in Language B when all programs are limited to a maximum size of N bits. N should be as large as possible, so ultimately we can let N tend to infinity and, as it does, the extent to which Language A is low level relative to Language B will converge on a finite value, allowing us to make an assessment of which of the two languages is the most low level and how significant the difference between the two languages is. This may not be a particularly good way of quantifying what 'low level' means, but it is just the idea of using limits that matters.

Why it fails

This fails because the concept of valid and invalid programs is contrived and is really only caused by the encoding of a program. A high level language typically requires programs to be stored in a way that is uneconomical. Only certain words are admissible as commands, yet a huge number of other words could be entered that would be meaningless. If the program consisted of valid words, syntax rules would still invalidate many programs. This, however, would still be an issue of economy of storage. It would be possible to encode programs so that they were shorter and all valid. Economy of storage for a language is more about encoding the information to make it easier for human brains to deal with and has nothing really to do with how high or low level a language is.

We could take any low level language and encode programs for it in a less efficient way that is easier for humans to understand, has many more invalid programs, and yet is still basically the same language. An example of this is assembly language - a form of machine code which is intended to be easier for humans to understand. In assembly language 'mnemonics' (short sequences of letters) are used to identify commands. If a program is simply entered as machine code then it may be possible that every program that can be entered can be executed in some way, but this does not mean that every possible sequence of characters that could be submitted as an assembly language program for that type of machine code would also make sense: most would contain sequences of characters that do not correspond to any valid command and would not be valid programs.

We can also do this the other way round and show that programs in a high level language can be encoded so as to eliminate invalid programs. As an example of how we could eliminate invalid programs, each command could be replaced by a binary number so that it would not be possible to put the letters together to make an invalid command anymore. Some readers could say that we could still put invalid numbers, that do not correspond to program commands, into the program, but in fact there are various encoding schemes which can deal with this: Huffman coding is one. There would still be issues of syntax to resolve, but more changes to the encoding method would deal with these too. As a rather extreme example of how such encoding would work, and an argument that it is possible in principle, we could use some encoding method to assign every possible valid program a number. We could then use these numbers to represent the actual programs. The encoding method would not permit any invalid programs as there are no gaps between the numbers.

The point of all this is that any language can be encoded to make the number of invalid programs, relative to the number of valid ones, large or small, or not to allow expression of any invalid programs at all. The relative numbers of valid and invalid programs are simply a feature of the language's encoding, usually determined by the needs of human convenience and not the extent to which the language is high or low level.

We could encode Pascal programs in a more sophisticated way that did not allow invalid programs, but that would not make this version of Pascal a low level language. It would have lots of properties that we would clearly associate with high level languages. The language interpreter would still seem to have a lot more sophistication than a basic universal Turing machine. A lot more would be going on in it for a program to produce the appropriate behaviour and a lot of information would seem to be needed to describe the language - although we have seen that when we try to construct definitions for terms like 'low level' or 'high level' based on this our arguments become circular.

All of this means that this attempt at a definition has not helped us. We can use other methods of encoding that make the definition irrelevant and leave us where we started - with languages that intuitively appear to be 'low level' or 'high level' but with circular arguments being produced when we try to define these ideas in obvious ways.

Attempted Definition 5: Less code needed for programs in high level languages

It is well known that a simple program that displays 'Hello world' on a computer screen may require only a few commands in a high level language, but is likely to require many more commands in a low level language. This seems to be true for many programs that we can imagine. In fact, together with fitting together more naturally with human thought processes, it is probably the main reason for using high level languages: most programs that we write in high level languages would require more commands if expressed in low level languages.

Could we use this as a definition of 'low level'? If we had two languages then maybe we could take a large number of programs and, for each language, consider the average number of instructions needed to express each of these programs in that language. Presumably, we could reasonably conclude that if, for one of the languages, the average number of instructions needed to express each of the programs was lower than for the other language, then that language would be the lower level of the two.

One issue needs to be resolved first. Some readers may say that a high level program may actually need to be longer, in the sense that more bits are needed to express or store it. We have previously been considering efficiency of encoding and we should assume that all programs, for both languages, are encoded as efficiently as possible, so that they require a minimum of storage space.

Why it fails

If we allow any programs to be considered then there is an infinite number of such programs available in each language. There is no limit on the size of such programs and the average length of each program will be infinite. This leaves no difference between a low level language and a high level language in this respect: for both languages there will be an infinite number of possible programs and the average program length will be infinite. A good case could be made that I should avoid mention of 'infinity'. This would just be semantics and would not change anything important. If we avoided mention of 'infinity' then we could say that any programs are allowed, that there is no limit on the size of programs and that it is not coherent to talk of the 'average' program length for programs of unrestricted length.

This kind of issue occurred with the last definition attempt and we had a resolution there: we put a limit on the maximum length of any program and allowed this limit to tend to infinity while we compared two languages. We can try to use a similar sort of approach here, using the following sort of method:

We can divide the average program length of all valid programs that can be expressed in Language A by the average program length of all valid programs that can be expressed in Language B, when the maximum length of any program is limited to N bits. We can then let N tend to infinity and if the resulting value is less than one then it would suggest that programs in Language A tend to be shorter than programs in Language B and that Language A is a lower level language than Language B.

Unfortunately, this does not work. To see why, we will consider an example. Let us say that that we limit the maximum program length to 1,000 bits and use an encoding system which only allows valid programs. For each of the languages, any sequence of up to 1,000 bits would be a valid program and the total number of valid programs would be 21,000. This does not provide any scope for the average program length to be different for each language: for any sequence of bits that is a valid program in Language A there is an identical sequence of bits that is a valid program in Language B. How could this be the case if the two languages are different? The answer is that identical programs do not have to have the same effect in both languages A and B. it may seem that I have contrived this situation by limiting the maximum program length, yet, as we have seen, if we do not limit the maximum program length the concept of average program length becomes incoherent.

There can be only one conclusion from this: to say that, on average, programs are longer in one language than in another language is either incoherent or wrong, depending on how we are trying to measure average program length.

Some readers will dispute this and will now be thinking something like this:

'Do you have any idea how many instructions it can take to do the simplest thing in a low level language? I have been programming in high and low level languages for years and I know how much easier high level languages make things. If what you say were true then there would be no difference between high and low level languages. There would be no point in using them!'

In fact, there may be all kinds of other reasons for using high level languages, but we should deal with the main problem of this sort of statement: it is based on the fallacy that the sorts of programs that programmers typically produce or encounter are representative of all programs. Many more programs exist than the ones that interest humans and the sorts of programs that people write are special cases. It is not surprising that high level languages allow particularly efficient expression of the 'useful' programs that interest us: they have been designed for this purpose. High level languages do allow some programs to be expressed with less data if we allow for some efficient encoding to be used and one simple way of achieving this is by the use of high level commands to perform specific sequences of operations that occur frequently in 'useful' programs.

To see how a high level language may misleadingly appear to allow shorter programs let us consider an example. We will presume the use of efficient encoding, so that every sequence of bits in either language must correspond to a program, and we will imagine that we are dealing with very short programs to make things simpler.

We have two languages: Language H is what we would think of as a high level language and Language L is what we would think of as a low level language.

We have some 'useful' program written in the low level language L. In the low level language this is the sequence of bits that expresses it:

010111011010110011011101.

The high level language H is designed for particularly efficient expression of 'useful' programs and in Language H the same program can be expressed as a much shorter sequence of bits:

010

This is fine, but now let us consider 010 and every other sequence of three bits or less that corresponds to a program in Language L. For every such sequence corresponding to a program in Language L there is an identical sequence in Language H which may or may not be the same program.

e.g. there is a sequence 110 in L, which is a valid program and there is an identical sequence 110 in H, which is a valid program that may or may not be the same program as 110 in L.

This is a consequence of the efficient encoding causing all sequences of bits to be valid programs. It means that there must be exactly the same number of programs of three bits or less in languages L or H and it should mean that every sequence of three bits or less in L has a corresponding sequence which may or may not be identical, in H.

For example:

If we have 010 in L then this could be the sequence 010 in H, but it need not be. It could be another sequence such as 101. Clearly, if 010 in L corresponds to 101 in H then 101 in L cannot correspond to 101 in H: the 101 'slot' in L is already taken, but this is no problem provided that some other slot is available.

Provided that each sequence of three bits or less in L can map onto another sequence of three bits or less in H then every sequence of three bits or less in L can be expressed as a sequence of three bits or less in H.

There is a problem, though. We know that the sequence of the 'useful' program 010111011010110011011101 in L corresponds to the shorter program 010 in H. This means that one of the available spaces for programs of three bits or less in H is mapped onto by a program that is not of three bits or less in L. This means that it is not possible for every sequence of three bits or less in L to map onto a sequence of three bits or less in H, because we are now one available space short. At least one program of three bits or less in L must therefore have a program of more than three bits in H.

This shows what is really happening with a high level language: for every program in a low level language that becomes shorter when expressed in a high level language, when efficient encoding is used, there must be another program in the low level language that becomes longer when expressed in the high level language. This is not a problem, because high level languages are designed so that the useful programs that interest humans are the ones that become shorter when expressed in them and the programs that become longer when expressed in a high level language tend to be ones that do not interest us much.

All of this does mean, however, that program size, in itself, is not suitable for defining high and low level languages.

If program length in itself will not suffice, could what I have just said allow a reasonable definition? Could we say that a high level language is one that is cleverly designed so that some programs (the 'useful' ones) written in some low level language become shorter when written in it while other programs (less 'useful' ones) become longer when written in it, and could we say that the extent to which a language is a high level language depends on the degree to which this occurs?

Unfortunately, this will not work. Firstly, for any pair of languages, A and B, there will be some programs that are longer when expressed in A than in B and others which are shorter when expressed in A than in B. In fact, when I just described what happens with higher languages, what I said was not, in a sense, correct, or at least not complete, because the same thing happens with all languages: in all languages there are some programs that are particularly short when encoded in that language and others that are particularly long. It is only the usefulness of some of the programs that makes this important for a high level language, and perceived usefulness of programs to humans cannot be a feature of any good definition. Secondly, such a definition needs a low level language to act as a comparison. How do we know that such a language is a low level language? This definition becomes circular again.

Attempted Definition 6: Instruction set size

Could we say that high level languages tend to have many different commands while low level languages tend to have only a small number of different commands?

Why it fails

The suggestion that high level languages have more commands is debatable: some high level languages have few commands yet provide sophisticated syntax that allows the user to extend the languages by defining new commands in programs.

There is another problem with this definition: the concept of what constitutes an instruction is vague, subject to human opinion and hard to formalise. We split a program into 'commands' because this is how we think, yet any program can be represented merely as a sequence of bits which are processed by the language to give a result.

As an example, we could look at a machine code program and say, 'These three bits are a command to load the A register. Some more bits are expected to give the number which has to be loaded into the register,' yet it is merely convenient for us to think that. Someone else could look at the first bit of that 'command' and say, 'This is the command to commence a register loading operation. Two more bits will follow to identify the register that will be involved, before the number to be loaded is given.' Such a person would 'see' a language with a smaller instruction set. Someone else could view things the other way round and see a larger instruction set. Such a person would think of 'Load the A register with 20' and 'Load the A register with 100' as being separate commands and would 'see' a language with an enormous number of commands.

We may even imagine a high level language which is so complex that it is hard for a human to recognise specific commands. Such a language would accept a sequence of bits and produce an output in a way too complex for humans to easily understand or simplify with concepts such as 'commands'.

This definition will not work: the idea of 'commands' is just too vague.

Attempted Definition 7: Using the language to refer to itself

Some of the previous definitions were shown to be circular. When this happened it was for basically the same reason: definitions of 'low level language' based on the description of a low level language need a low level language in which the description should be expressed, yet by referring to another language we are merely compounding our problem by having to find out if other languages are low level or high level.

It seems that, unless we can find a good reason for thinking otherwise, a definition of 'low level' should meet this requirement:

It must not refer to any other language. This means, in particular, it must not refer to any description of the language that we are considering that is expressed in any other language. This includes any description of the language we are considering that is stated as being expressed in a 'low level language'.

This seems to give us a problem. If we cannot express a description of our language without using another language how can we actually consider it?

Maybe the solution lies in self-reference? This is a common trick in computer programming and perhaps it can help us now. What if we express a description of the language in the language itself and then examine this description to assess how low level the language is?

Let us suppose we have some language X and want to know how low level it is. An obvious procedure would be as follows:

  1. Use Language X to write a program that works as an interpreter for Language X. Make sure that the program is as small as possible.
  2. Count the number of bits in the program that was produced in Step 1. The smaller the number of bits then the lower the level of the language is.

(See Figure 5)

Figure 5: Using the Language to Refer to Itself

Figure 5: Using the Language to Refer to Itself

We can also assume that efficient encoding is used for the program: we do not need to worry too much about how we do this.

That is to say, a low level language X only needs a small program to work as an interpreter for that language when it is written in the language X itself.

Why it fails

It may seem that this gives an 'unfair' advantage to high level languages, as it will presumably be easier to write a language interpreter efficiently in a high level language than in a low level language. Against this it could be said that a high level language may make it particularly easy to write high level language interpretation software with a minimum of code. The idea would be as follows:

  1. Language X is sophisticated and is particularly good for writing language interpreters with a minimum of code.
  2. When we use Language X to write an interpreter for Language X a minimum of code is needed because we can take advantage of what was stated in 1.
  3. Language X therefore appears to be a low level language, by the criteria being suggested here, yet we know that it is really very sophisticated. It is the sophistication of X that is allowing us to use a minimum of code for the X interpreter and making X appear low level.
  4. Therefore the definition suggested here is flawed.

Against this, it could be argued that providing such sophistication in the language will mean that it is particularly sophisticated in comparison with many other high level languages. Such a defence would be as follows:

  1. Language X is very sophisticated and is particularly good for writing language interpreters with a minimum of code.
  2. When we use Language X to write an interpreter for Language X we may think that a minimum of code is needed because we can take advantage of what was stated in 1. In fact we will need a lot of code because we do not merely have to produce a high level language: we have to produce the extremely sophisticated system described in 1.
  3. Therefore the argument that was previously presented is flawed and the definition is reasonable.

It could be pointed out that there is no obvious reason why the language itself should be chosen for expression of its own description. How is this better than selecting some other language at random? We can explore this issue by looking at the extreme example of a language with a 'super command' that allows efficient expression of the language itself:

Let us suppose that we have a Language X. X has a 'super command' which I will call F(). F() accepts a sequence of bits which are a program in Language X and interprets them according to Language X.

This means that X contains a command F() which is contrived to make X particularly suitable for writing X interpreters.

The F() command means that, to be an X interpreter, a program really only needs the F() command and very little (if anything) else. This means that an X interpreter written in X will be very small, regardless of what features the language has as well as the F() command.

I am not saying that the existence of a 'super command' in a language X automatically has an effect on any single language interpreter that is made using Language X. What does happen is that it skews the statistics of sets of languages described using Language X - that is to say, language interpreters written in Language X. It makes it possible to write particularly short programs that act as interpreters for Language X and if we generate the full set of language interpreters which are written using Language X and have a program length not greater than N bits, then this set will have a very large proportion of interpreters that contain F() and provide it as a command.

It would be possible to add the F() command to any language and if the definition being considered here were valid that language would immediately become a very low level language. This means that there is at least one situation in which the objection that was made earlier to this definition becomes apparent and the objection is valid. It should be noted that this does not mean that a 'super command' like F() is needed to cause the definition to fail: the use of F() simply demonstrated one type of failure and suggests that the same sort of thing could happen, to a lesser degree, in all kinds of other ways.

The definition does not work. It may not be a total loss though: the definition does establish the idea of making a definition from the language by using the language itself without reference to other languages.

Attempted Definition 8: Use a sequence of references

It was shown that, with the previous attempt at a definition, if we use a language X to describe itself, it is possible for the description of X to be 'contaminated' by specific features of X that could make it artificially appear to be a low level language. Maybe the solution is to try and 'get some distance' between the language used for the description of the language and the language itself, while still basing the description on the original language. Let us suppose that we had some language X and we wanted to determine how low level X was. We could try to do this as follows:

  1. Make the smallest possible program in Language X that can serve as a simulation of a universal Turing machine; that is to say, the smallest possible program that can provide us with a general purpose programming language.Use the language provided by the universal Turing machine simulator which has just been developed to write another program. This program must be the smallest possible universal Turing machine simulator which can be written in that language. As a universal Turing machine simulator, this new program will also provide a programming language.
  2. Go to Step 2, unless we have already been through Step 3 a large number of times, in which case proceed to Step 4.
  3. We have now been through a large number of iterations of Steps 2 and 3 and a final universal Turing machine simulator has been produced. The language provided by this universal Turing machine will be known as Language Z.
  4. The smallest program which can be used to implement Language X will be written in Language Z. The smaller such a program is, then the lower the level of X is.

(See Figure 6)

Figure 6: Using a Sequence of References

Figure 6: Using a Sequence of References

Note: when I say 'universal Turing machine simulator' here I do not mean that the program has to simulate the common idea of a universal Turing machine as a tape being processed by a machine with a read/write head. Any system that is equivalent to such a universal Turing machine is a also a universal Turing machine and all that is required is that some program is produced which interprets its input in a way that provides equivalency with a universal Turing machine.

What is being attempted here is the use of our original language X, the one that we are trying to assess, to make the smallest possible Turing equivalent system. The language provided by this will be used to make another Turing equivalent system which provides another language and so on… After this has been done a large number of times we are left with some final Turing equivalent system and the language provided by it. We can use this language to express Language X and we can assess this description of X to see how low level X is. The idea is that the language that is finally used is only very indirectly related to X and so is unlikely to be 'tainted' by any features of X that make it particular suitable for describing itself. The number of iterations that are performed before we have the final language can be made as large as we like - we can regard it as tending to infinity - so that all traces of any special features of X are removed from it.

Why it fails

We only need to present one scenario for which this fails to show that it cannot be trusted. We will use the previous scenario of a language containing a 'super command' F() which totally expresses the language.

As previously, this means that X contains a command F() which is contrived to make X particularly suitable for writing X interpreters.

F() is a very efficient way of making a program that can provide the Language X, but the fact that this can be done with one command means that F() is the most efficient way of providing any language.

Let us suppose now that we use the language X to make the smallest program which can provide us with a programming language. This will be done by means of the F() command and the programming language provided will be… X. If we use this 'new' language (which is actually still X) to make the smallest possible program which can provide a programming language that that program will also use F() and will provide Language X. No matter how many times this continues, the language being provided will always be X. The final language that we use to describe language X by implementing it is… X. X is very efficient at describing X because of the F() command and the resulting final description of X will be very short. X will therefore be determined to be a low level language, purely because it contains the F() command and irrespective of any other characteristics that it has.

The attempt to obtain a language which has some 'distance' from Language X has clearly failed as we are still working with X! Even though this failure occurs with a contrived situation, the fact that it can occur at all shows that this definition is not to be trusted. As with the previous case, this may not be a total loss: the idea of trying to 'get some distance' between some starting language and some final language used to express the description of the language being considered has been established, though it would need to be done in a better way.

Attempted Definition 9: Average across all languages

If we are trying to determine how low level Language X is, instead of relying on some arbitrarily chosen language to express a description of Language X, could we not use all languages to express a description of Language X? We could encode the description of Language X in every possible language and then use the average length of all these descriptions as an indication of how low level Language X is. (See Figure 7)

Figure 7: Averaging Across All Languages

Figure 7: Averaging Across All Languages

Of course we cannot actually average a result over an infinite number of languages, but infinity has not been a problem in previous definition attempts: we could do the same sort of thing that we did previously and have the number of languages in which we express the definition tend to infinity. The result - the length of the description of Language X averaged across all languages in which the description of X is expressed - will be known as the low level indicator. A low value indicates a low level language.

There is one minor complication here. It may seem that the low level indicator for some language X will converge on a finite value as the number of languages used to express the description of X tends to infinity, which would be ideal. There is, however, reason for thinking that this may not be the case. For any language X it will be possible to find languages in which the description of X requires a very large number of bits to express it and if we increase the number of languages that we use we are increasing our chances of using such languages.

Rather than worry too much about this we can simply use the method as a comparison to tell us which of two languages is the lower level one. If we have two languages X1 and X2 then we can simply determine the low level indicator for X1 divided by the low level indicator for X2 as the number of languages over which the description sizes are averaged tends to infinity. If this value is greater than one then X1 is the lower level language of the two. If it is less than one than X2 is the lower level language. Other, and maybe better, ways of generating the result may be available, particularly with regard to how such a comparison is made, but the general idea is more important than details.

Why it fails

It is vague to resolve the issue of the infinite number of languages by saying that we will let the number of languages tend to infinity. We have to say exactly what is tending to infinity. We are limiting our consideration to a finite number of languages, which are chosen using some criteria that eliminates the rest of the infinite number of languages, and then we expect to somehow broaden that criteria to let the number of languages chosen tend to infinity. For this to work we have to have some way of encoding descriptions of languages so that we can say which languages are members of our finite set of languages and which are not included.

That may seem complicated, so an example of what would be involved in actually using such a definition may be useful. We may decide to limit the languages to a finite set based on description length. We would imagine the description of every possible language before us - effectively meaning having an interpreter for every possible language - and we would select only those languages which had descriptions with lengths of N bits or less. This would give us a finite number of languages. We want to determine how much information is needed to describe language X - the language that interests us. We would use each of the languages in our finite set to write an interpreter that provides Language X and we would take the average length of every such program. We would then say that N tends to infinity, while the number of bits needed to describe language X, averaged across all languages with descriptions of N bits or less, is the low level indicator for Language X, while the low level indicator is also obtained for the language with which we are comparing X.

Unfortunately, there is a problem: treating the set of all possible languages in this way relies on the use of some encoding system to express the descriptions of all these languages. Without such an encoding system, how could it be meaningful to limit membership of our finite set to languages with a description length of N bits or less? The encoding system will itself be a language and the choice of language could introduce prejudice.

One way of attempting to avoid this problem is to say that we should use a low level language as the encoding system. This does not help us: we are trying to do this to say what a low level language is in the first place and it makes the definition circular again.

This, then, is why this definition fails: if we want to average over all possible languages we need a way of encoding the descriptions of those languages. Such a description would ideally be a low level language and that leaves us back where we started.

Attempted Definition 10: Average across all languages derived from the language being assessed

One of the previous ideas that we had was using the language that we are assessing for the expression of its own description that we will use to assess it. This was shown to be unworkable because of prejudice that could be introduced by the language itself and we have just considered another idea for avoiding prejudice by averaging across all possible languages. This definition also failed for a different reason.

What if we were to combine these two ideas to try to make a workable definition? If we are trying to assess how low level Language X is then what if we use Language X itself to express the descriptions of all other possible languages and then determine the average length of description needed to express Language X in each of these languages?

This is how such a process may work:

Let us suppose that we want to know if Language X is a low level language. Using Language X we express the description of every other language that has a description length of N bits or less, where N is some finite number. In each of these languages we express the description of Language X and determine the length of the description in bits. We take the average of these values as the result. (See Figure 8)

Figure 8: Across All Languages Derived from the Language Being Assessed

Figure 8: Averaging Across All Languages Derived from the Language Being Assessed

We let N tend to infinity and the result - averaged across languages - will be known as the low level indicator and will tell us how low level the language is.

There is one minor complication here. It may seem that the low level indicator will tend to a real value as N tends to infinity, but as with the previous definition it may not do. We could solve this by using some sort of comparison of two or more languages as N tends to infinity. We will not worry about the details: we are about to show the definition to be unusable anyway.

Why it fails

As with the previous definition the problem is one of prejudice in the language used to express all the other languages in the collection. In this case we have attempted to resolve this issue by using Language X itself, but all this means is that Language X itself can introduce prejudice into its own evaluation: it is possible that some feature or features of Language X could have an influence on the kinds of languages that can be economically described using it and that form most of the membership of the finite set of languages with descriptions that have lengths of N bits or less.

As an example of how this could occur, let us imagine that X contains some 'super command' F(). The existence of F() makes it very economical to describe languages that belong to a certain group of which Language X is a member. What this means is that it is possible to use Language X to write very small descriptions of languages that are like Language X by using the F() command. When we use Language X to express the descriptions of all the other languages that we are going to use in our sample this will mean that there is a weighting in favour of languages like Language X: for any language that we could make without using the F() command we could make many more, with descriptions of the same length, by using the F() command and languages like X would dominate any set of finite length language descriptions made using Language X in this way.

Some readers may suggest that this matter will resolve itself when we let the description length N tend to infinity. It will not: all that tending to infinity involves is the use of progressively larger values and the reasoning must work for finite values for it to be valid.

Where does this leave us?

I may seem to have spent a lot of the article on these failed definition attempts, but I wanted to establish clearly the idea that the definition of 'low level' is not as trivial as it may initially appear and that when we try to construct a robust definition we can have problems if we are not careful.

At this point some people may find it attractive to declare that whether or not a language is low level is subjective. This point of view would have extreme consequences that may be contrary to our intuition. For example, a human language such as English intuitively appears to us to be much higher level than the language provided by a basic universal Turing machine or by the rules of Boolean algebra that are implemented by logic gates, but a proponent of this view would declare that to be of little significance.

Arguing with a proponent of this 'subjectivity' view could be frustrating. We could argue with such a person by saying that you would need the rules of Boolean algebra to describe an electronic machine that could in turn describe the English language. Alternatively we could say that human language needs a complex system like the human brain to describe it, that the description of a human brain would rest on lower level systems such as physics and, of course, that physics itself really depends on ideas about basic logic - there are rules about how ideas in physics can be organised. All of this would amount to the same thing: we would be saying to out opponent, 'Look! A human language such as English is clearly a high level thing. You need all these low level things to describe any system that can actually do a human language. You need logic rules for computers, physics, or biochemistry. Human speech clearly has to be built on top of simpler, more fundamental systems.' Our opponent's response could be that you can choose, if you wish, to describe human language using what we feel are simpler systems, such as rules of Boolean logic, but that we could just as easily choose to use the English language to describe Boolean logic or the laws of physics! He/she could suggest that the systems that you prefer to use to describe all the other systems automatically appear to be low level systems for no more reason than your preference. (See Figure 9)

Figure 9: What is doing the describing?

Figure 9: What is doing the describing?

Another view could be that our way of seeing things is probably more or less correct - that what appears high level to us is probably quite sophisticated and that what appears low level to us is probably quite basic - but that there is the possibility that we could be being tricked by our laws of physics. According to such a view, if the laws of physics in the universe were actually very high level then our brains may wrongly find them to be very basic - purely because they were built with those laws. Another view, similar to this, could be that not only could we be tricked by our laws of physics into thinking that they are basic, but that we do not even have any good reason for thinking that they are probably basic - that we do not, in fact, have any idea if we are correct about what we think is 'low level'.

If this sort of view happened to be correct then it would have what I regard as very serious consequences. It would not merely imply that we could be wrong when we consider the laws of physics or anything else on which our own brains rely: it would imply that our judgement about how low level any language is could be unreliable. How could things get this bad? We would get into a mess like this in this sort of way (and I want to make it clear that I am discussing a hypothetical situation here - the possibility of which I will be arguing against):

Our laws of physics contain some feature F, similar to the function F() that we talked about in definition attempts that were discussed previously. F has an interesting property: if a system contains the F feature then it is particularly easy to construct a Turing equivalent system which also has the F feature within that system. By 'easy' I mean that the description would be particularly short.

F also makes it particularly easy to describe some system X. We do not need to know anything about X, other than that X is extremely complicated.

From the basic laws of physics other more abstract systems follow. As an example, one of these is chemistry. Chemistry has rules, though these are not fundamental, and so also provides a system. The rules of chemistry happen to contain the F feature, meaning that descriptions of the laws of physics and X using the rules of chemistry can be quite short. This may seem very unlikely. Why should chemistry accommodate us like this? It is not however, unlikely, when chemistry exists in a universe with laws of physics that contain F. The existence of F within the laws of physics makes it likely that various Turing equivalent systems exist that can be described within the laws of physics and that have the F feature themselves and it is likely that chemistry is one of these systems.

The same goes for biology. Biological systems are put together using chemistry and the fact that chemistry features F means that Turing equivalent systems described in terms of biology that feature F can have short descriptions. It follows that biology is likely to feature F.

This extends as far as human brains. Because biology features F, very little information is needed to describe a brain. The description can take advantage of the F feature in biology to produce a brain that also features F and evolution will probably favour these sorts of brains as fewer mutations will be needed to assemble a description of them. This extends to any computers made by humans which happen to contain F in the way they perform logic.

The inclusion of F in human brains has two important results:

  1. A description of any Turing equivalent system containing F, either stored in a human brain or expressed in a way that a human brain can understand can be very short. This means that biology, chemistry and the basic laws of physics appear very simple to a human brain.
  2. A description of X, either stored in a human brain or expressed in a way that a human brain can understand can be very short. This means that X (whatever it is) appears very simple to a human brain.

None of this, though, really reflects how simple any of these are: all of it was caused by the F feature within the laws of physics. This will not be apparent to any human brains considering it because of the F feature.

Now, I do not really believe any of that: it was a hypothetical situation intended to show the potential problems that we have. Someone who does believe it, however, may assert that if we cannot be sure what is low level and what is not we have to take these results at face value. This would be implying that it is meaningless to talk about any concept of 'low level' meaning anything beyond what we can observe and that what constitutes 'low level' is subjective and dependent on our physics. One consequence of this would be that some of the statements made in this article, when I referred to the description length of a language or the complexity of some entity X in an absolute sense, are meaningless.

If we wanted to maintain some idea of 'low level' having an objective meaning I could imagine this sort of issue being perceived as useful by an advocate of Roger Penrose's 'non-computability' case [5, 6] for which numerous refutation papers exist, one of them being by me [7]. Such a person may say that we have seen that it is futile to try to show formally that one language is more or less low level than another, and yet that we seem to be quite sure that some languages are lower level than others. He/she may then suggest that our knowledge comes, not from the use of any formal system, but from some non-computable method, therefore showing that the human brain cannot possibly be a computer. This gives more importance to the issue: if we cannot show that concepts of low and high level can have some formal meaning then advocates of Penrose may be provided with a useful argument (which some readers may think is a good thing, of course) and the possibility would even be left open that someone could prove that a formal system cannot determine how low or high level a language is. The issue really needs resolving, if only for this reason.

We would face greater problems than this, however, if we accepted that we could never formally show that a language was low level, or that the concept of low level language was purely subjective and/or dependent on physics. When we use Occam's razor to select viable scientific theories we need to know that we are describing theories in a reasonable way, rather than a contrived way that makes some theories appear plausible purely as an artefact of the description system being used. This is an issue in the hypothetical situation that I just described: X was made to appear as if it had a small description purely because physics incorporated F and if X were a scientific theory then X could be made to appear plausible. Some readers may say that this is entirely proper: after all, X can only be made to appear plausible in this way if the laws of physics contain some F that makes X appear plausible, and it could be argued that this automatically means that it is plausible that the laws of physics may actually be similar to X. This would be a very strange state of affairs - one in which either X's short description makes it plausible or, X's description is shortened by some unknown features of reality that still make it plausible in a very real way! This would be because of the occurrence of F in the laws of physics meaning that any possible candidate for the laws of physics must come from those systems of physics that contain F. If this sort of reasoning were correct then it would mean that any language that appears low level would automatically be suitable for using to describe scientific theories for the purposes of evaluating their plausibility, regardless of the reason for such an appearance. It would also strongly suggest that the whole issue is one of physics rather than mathematics.

I can sympathise with some of the views that I have outlined here. I do not believe any of them, however, are correct. I think that our intuition generally serves us well in this area and that there is some sense in which our observations about the level of a language can be objectively true in a way that allows us to demonstrate formally reasons for some confidence. To maintain such a case I need to offer a workable definition of the term 'low level', so here is my attempt at one:

The Proposed Solution

This method is probably just one of a number of ways of achieving the same result. The idea will use some of the features of previous definition attempts. If we want to know how low level a language X is then the idea is to average the length of description needed for language X over all languages in which that description can be expressed - or, more accurately, a very large set of languages, but to make sure this time that the languages used to express the description of language X have sufficient distance from any single language that may introduce prejudice. Here is how the method would be applied:

Let us assume that we have some language X and that we want to know how low level X is.

  1. Using Language X to express programs, construct every program that can serve as a universal Turing machine (that is to say, every program that can act as a language interpreter) and that has a length of N bits or less. This set of programs/languages will be known as the current set.
  2. Use every language in the current set to construct every possible program, with a length of N bits or less, where N is some constant, that can serve as a universal Turing machine (that is to say, every program that can be used as a language interpreter). The set of programs/languages generated in this way will become the current set and the existing programs/languages in the current set will be discarded.
  3. If we have been through this step I times, where I is some constant, then go to Step 4 - otherwise go to Step 2.
  4. We now have a set of programs/languages, the current set. This set of languages will be used for the descriptions of Language X that will actually be used to determine how low level Language X is. In each language in the current set write the shortest program that can serve as a language interpreter for Language X. The average length of each such program is the low level indicator for Language X.

(See Figure 10)

Figure 10: The Proposed Solution

Figure 10: The Proposed Solution
Note: The diagram shows a current set which, after the first iteration only contains three languages, each of which only gives rise to three languages in the second iteration, which is extremely unlikely. The number of languages being added at each stage would be vast and they would not fit on a diagram like this.

It will probably be quite obvious that I intend the iteration in Steps 2 and 3 to put some distance between Language X and the set of languages that we ultimately use to describe it for the assessment and that, although similar sorts of attempts to do this failed in previous definitions, I expect it to work this time.

What values should we use for the N, the number of bits in the maximum program length, and I, the number of iterations of Step 2? We want the current set to be as large as possible, so N will be made to tend to infinity. To put as much distance as possible between X, the language being assessed, and the final version of the current set that emerges after all the iterations of Step 2 we want as many iterations of Step 2 as possible: this means that I will also be made to infinity.

The low level indicator for X, therefore, will be the value returned by the above algorithm as N and I tend to infinity.

We do have one problem - and it occurred with some of the previous definition attempts: do we have good reason for thinking that the low level indicator will converge on a finite value as N and I tend to infinity? There may be good reasons for thinking that it will not. A number of solutions are possible for this. For now we will use the approach that was employed in previous definition attempts: we will compare two languages to determine which is the low level language of the pair.

This approach involves comparing our language X to another language Y. Using Language X we derive the current set for given values of N and I. We use each language in this current set to express a description of language X in the most efficient way possible (that is to say, using the least amount of bits). We average the description lengths over every language in the current set to obtain the low level indicator for X. We need another value for comparison, so we use every language in the same current set to express a description of another language Y. We average over all languages in the current set to determine the average number of bits in the description of Y and this value is the low level indicator for Y.

We then divide the low level indicator for X by the low level indicator for Y. This value, as N and I tend to infinity, is our final result and tells us which of the languages X and Y is the lower level one. If the final result is less than one then X is the lower level language of the two and if it is more than one then Y is the lower level one. If the value is exactly one then X and Y are low level to an equal degree.

Would the proposed method work?

Given the failure of all the previous definition attempts, and in particular the failure of the previous attempts to 'get distance' between some language being used at the start and other languages derived from it, it would be reasonable to wonder if this definition is any different. Why should we think that the iteration in the definition causes the languages in the final version of the current set not to be 'contaminated' by some features of Language X, which is the language from which they were all initially derived?

To see why the definition could avoid the previous problems let us assume the sort of extreme case that has been assumed recently. Let us assume that the language being assessed X contains some function F() that acts as a 'super command'. F() allows X to be used to express other languages that contain F() by using a small amount of code - actually just a call to F() in the interpreter program.

As I mentioned previously, the existence of a 'super command' in a language X does not automatically have an effect on any single language interpreter that is made using X. What does happen is that it skews the statistics of any set of languages made using that language: if we are limiting the length of each language interpreter to N bits then there will still be interpreters that do not contain or provide the F() function - the F() function does nothing to prevent the existence of these - but for every interpreter that does not use the F() function there will be many extra interpreters that are made with the F() program and that have it available for any programs made using the languages that they provide.

If the F() function works as I said then we would actually expect the number of interpreters with F() to vastly outnumber those without it. To make the numbers easier to work with, however, let us just imagine that when the number of languages with F() - which means that they were made by using F() and the language that they provide contains F() - is about 90% of the members of the set of languages with lengths of N bits or less - the remaining 10% of the set's members being interpreters that were made without using F() and that are very likely to provide it in their languages.

When we do the first iteration in the process described above this is what we will have as the current set: every language interpreter with a length of N bits or less that can be written in the language X - 90% of which will have F() and 10% of which will not.

We are going to do I iterations, however, and in each iteration a new current set will be generated. Let us now consider the second iteration. In this iteration every language interpreter in the current set will be used as the language to write every language interpreter that can be written with a program length of N bits or less. We have two main kinds of language interpreters in the current set:

  1. Language interpreters made with F() and which provide F() in their languages.
  2. Language interpreters made without F()

and the first type constitutes 90% of the current set's membership. To keep things simple, let us presume that the existence of F() in a language skews the statistics of a set of languages in the same way as it did with languages derived from X; that is to say that when a language containing F() is used to write every language interpreter with a program length of N bits or less, then 90% of these languages are made by using F() and provide F() in their languages.

This means:

  1. When the 90% of the languages in the current set that have F() are used to generate the next current set, 90% of the language interpreters written in these languages will be made with F().
  2. When the 10% of the languages in the current set that do not have F() are produced only a negligible proportion will contain F(): there is no reason for such a language to contain F() apart from by a statistical fluke.

This in turn means that the proportion of languages in the next current set that contain F() will be 90% of 90% - which is 81%. This result is significant because it means that the number of languages containing F() decreases when we use the languages in the current set to generate the next current set. This is why we had I tend to infinity as well as N: each time we perform iteration and generate a new current set the influence of the original language X decreases and as the number of iterations tends to infinity the influence of X and its F() function on the current set will become insignificant.

Possible Objections - and Answers

A number of objections could be made to this and here are some of the more obvious ones, with answers.

Objection 1: You assumed that the proportion of languages containing F() which have interpreter programs with lengths of N bits or less and are written in any language containing F(), is 90%, if this value happens to be 90% for languages written in X. This assumption is flawed. Why should other languages behave in the same way as X?

Answer: The figure of 90% was only used as an example: as I said, in a case where something like F() really existed it would probably be much greater and the assumption of the value always being 90% for languages derived from X, and languages derived from those languages (and so) was only used to simply things. The value of 90% is a kind of estimated average over all languages that appear in the current set over all iterations of the process. For some individual languages this value may be much greater than it is for X and for others it may be much smaller. The argument does make it clear, however, that there is an overall statistical tendency for the proportion of languages containing X in the current set as a whole to decrease as successive current sets are generated unless there is some process going which acts to halt this decrease. Such a process could only be the existence of some individual languages in the current set which give rise to a higher proportion of languages containing F() than exists in the current set - that is to say that as a current set is generated with some languages that do not contain F() some of the languages in this current set that contain F() 'take up the slack' on behalf of languages without F() by causing more of the languages derived from them to contain F(). This is very specific behaviour for an algorithm and it is not enough for it to happen once, in one iteration: for it to save F() from extinction it would have to keep happening. This does not merely mean that some of the languages derived in later iterations must have the same proportion of languages with F() in the languages derived from them: to compensate for languages being generated in each iteration which do not have F() the proportion of those with F() would actually have to increase - and keep increasing with each iteration - which would seem to be a rather difficult situation to create, to say the least. Further investigation, admittedly, may be needed with regard to this matter.

Objection 2: You showed this for F(), but not for other features.

Answer: F() is one of the most extreme languages features that can be imagined. We did not really need to describe F() too much: our discussion of F() mainly focused on its effects on the statistics of the current set and how these would be resolved and the kind of effects we considered F() having on the current set would be those that are typical of problematic features of languages. A language feature which had these sorts of effects on the current set, but to a lesser degree, would still be dealt with by this sort of argument.

Objection 3: Your argument is intended to show that any feature of a language would be gradually weakened in the current set as successive iterations are performed. What about not containing F()? Is that not a feature? Why would this sort of iterative process magically know to weaken things that you do not want to be present - such as the feature of F() in languages - but also know not to remove features that you do want - such as the feature of not having F() that you would like your languages to have? Surely, if we declared not having F() to be a feature of a language we could use the same sort of argument to show how it is weakened over time - clear evidence that your argument is contrived.

Answer: It is not correct to consider not having F() to be a feature of a language in the same way that having F() is a feature. For a language to have F() very specific conditions have to be met, whereas not having F() is really the default in the absence of these conditions. As an example of what I mean here, we could create some organised information with specific features and gradually expose it to random changes so that whatever features make it organised are gradually lost as randomness consumes it. It would be easy to show, in this sort of situation, that the features of the original data are gradually lost, yet some argument could be made that the 'randomness' which we may intuitively expect to arise is also a feature and that we should expect a sequence of random alterations to destroy it just as effectively as any other feature would be destroyed. This would be a fallacy. The randomness which tends to result when random alterations are made to things is clearly not a specific feature which is destroyed by random alterations: it is much less specific than such features and is instead the natural result of such alterations. Likewise, not containing F() is not a feature that we expect the iteration in the definition to remove: it is simply the end product of entropy taking its course through all these iterations.

Objection 4: The process used in your definition seems to be trying to generate a 'current set' that is not influenced by the original language X. Surely this is the downfall of your definition! The statistical properties of the current set, after all these iterations, must come from somewhere, yet if it is not from X, where do they come from? Wherever they come from - X or somewhere else - the choice of where that is will actually prejudice the results as effectively as any of the choices of language that prejudiced the previous definition attempts that you gave as examples. Trying to hide this by sleight of hand will not achieve anything. From where do the statistical properties of the current set, after all these iterations, come?

Answer: Nowhere.

Do we have to start from with the language being assessed?

In the definition, as proposed here, the actual language X that we are assessing is used as the starting point for generating the set of languages used to make the evaluation - with many iterations being used to ensure that no features of X survive to be in this set.

If no features of X are going to be in the final set of languages it raises the question of why we would need to actually use X as the starting language when assessing X. Could we not use any language Y that provides Turing equivalency as the starting point and get, basically, the same results?

If we were to do this then the process used in the definition would be almost the same and would be as follows:

Let us assume that we have some language X and that we want to know how low level X is.

  1. Using some randomly selected language Y, such that Language Y provides Turing equivalency to express programs, construct every program that can serve as a universal Turing machine (that is to say, every program that can act as a language interpreter) and that has a length of N bits or less. This set of programs/languages will be known as the current set.
  2. Use every language in the current set to construct every possible program, with a length of N bits or less, where N is some constant, that can serve as a universal Turing machine (that is to say, every program that can be used as a language interpreter). The set of programs/languages generated in this way will become the current set and the existing programs/languages in the current set will be discarded.
  3. If we have been through this step I times, where I is some constant, then go to Step 4 - otherwise go to Step 2.
  4. We now have a set of programs/languages, the current set. This set of languages will be used for the descriptions of Language X that will actually be used to determine how low level Language X is. In each language in the current set write the shortest program that serve as a language interpreter for Language X. The average length of each such program is the low level indicator for Language X.

(See Figure 11)

As previously, we obtain the low level indicators for Language X and some other language with N and I tending to infinity, allowing a comparison to be made between the two languages.

Figure 11: Using Another Language as the Starting Point

Figure 11: Using Another Language as the Starting Point

If the definition works then there should not be any problem with this. If the definition does not work it will be because the process suggested here fails to prevent the original language influencing the final version of the current set and if this were the case then whether we used Language X or some other language Y as the starting point would be irrelevant: no language choice would help. Some readers may wonder why I chose Language X when first describing the process if this is the case: the reason is that I had to pick some language and Language X was as good a choice as any.

One apparent complication is that we would need some way of making the choice. If a human were to make the choice this would be easy. If an algorithm were to make the choice we need some way of encoding languages so that a random number generator could be used to select one and the issue of what language we should use for such encoding arises: maybe we would use Language X? None of this really matters, because there is no way in which we have to make sure that our choice meets some criteria of 'randomness': the process of iteration is supposed to make our original choice irrelevant by the time the final version of the current set is generated and Language X or any other Turing equivalent language, chosen by any means, would suffice.

Improvements to the Method

The process used in the definition could be made less unwieldy to implement. Each time that the definition is used we have to go through the whole process of generating a series of sets of languages, the final one of which is used to evaluate the language being assessed. All of this is only necessary because, in the absence of a reliable definition of 'low level' no single language can be trusted. Now that we have a definition, however, we can use it to find the lowest level language that we could possibly make - or one that is low level enough - and then simply use that language in the assessment of other languages.

This is how such a process could work:

Let us suppose that we want some language that can be trusted to be 'low level', so that we can use it to describe other languages for assessment purposes.

  1. Using some randomly selected language Y, such that Language Y provides Turing equivalency to express programs, construct every program that can serve as a universal Turing machine (that is to say, every program that can act as a language interpreter) and that has a length of N bits or less, where N is some constant. This set of programs/languages will be known as the current set.
  2. Use every language in the current set to construct every possible program, with a length of N bits or less, where N is some constant, that can serve as a universal Turing machine (that is to say, every program that can be used as a language interpreter). The set of programs/languages generated in this way will become the current set and the existing programs/languages in the current set will be discarded.
  3. If we have been through this step I times, where I is some constant, then go to Step 4 - otherwise go to Step 2.
  4. We now have a set of programs/languages: the current set. It will be from this set that our low level language is selected and this set will be used to assess how low level its own members are. For each language in the current set, determine the low level indicator. As before, the low level indicator for any language is the average length of program needed to write an interpreter for that language in each of the languages of the current set. The only difference is that now the low level indicator is evaluated for each member of the current set itself, rather than for some language, such as X, outside the current set.
  5. Each language in the current set now has a low level indicator. Take the language in the current set which has the lowest low level indicator. This language will be referred to as Language L.

As with the previous version of this algorithm, N and I tend to infinity.

Now that we have a low level language there is no need to repeat the process of iteration with the current set anymore. We can simply use our low level language L to assess any other language.

Let us suppose that we have a language X and we wish to determine how low level Language X is. We can use language L to describe Language X - which is to say that we use Language L to write an interpreter for Language X. We then take the number of bits in that description as an indication of how low level Language X is. If we wish to make a comparison with another language A then we can simply use Language L to express a description of Language A and use the length of this description as an indication of how low level Language A is. Comparison with the equivalent value that was obtained for Language X will then tell us which of the two languages X and A is the lower level one.

It should be noted that this method does not give values that exactly match with the 'low level indicator' type of evaluation that was performed previously on the languages of the current set. This means that if we use Language L to assess itself - by taking the number of bits needed to write an interpreter for Language L in Language L itself as an indicator of how low level the language is - we will not obtain a result which is equal to the low level indicator that was obtained for Language L when it was assessed using all the other languages in the current set. This does not matter: the same 'scale' is not being used to obtain the values on both occasions and we will find that languages that have particularly low level indicator values will also not need very long programs to describe them in Language L.

The definition given in this article uses a particular way of making the comparison between two languages to determine which is the lower level of the two. There are many other ways of making this comparison and quantifying the degree to which a language is low level and the purpose of this article has been to promote only the general idea behind the definition presented here.

Conclusion

It has been shown that defining concepts such as 'low level' or 'high level', or even the idea of a 'basic' word in a human language is not trivial. The more obvious ways of doing this do not stand up to a deep examination. The main problem is that, for any particular language to be formally described in these terms, a description of that language would need to be available and a language needs to be used for the expression of that description. The choice of language used to express a description of another language can affect properties of the description which make it appear to be the description of a low or high level language.

It could be argued that ideas such as 'low level' or 'high level' are based partly on physics, rather than 'pure' logic, or even that they are entirely arbitrary. In this article I have argued against such positions.

A way of defining concepts such as 'low level' or 'high level' has been demonstrated, based on the idea of taking some initial language and using it to describe a large number of other languages, which are in turn used to describe a large number of languages, and so on - the idea being that at the end of such a process we will have a large set of languages that have nothing to do with the initial language and can be effectively regarded as a random sample of languages.

As well as the method used in the definition presented here there could be many other ways of actually defining the concepts of 'low level' or 'high level' - though it seems likely to me that any workable definitions would need to use the same general idea of trying to 'put distance' between whatever formal system is used initially and the formal system or systems finally used to describe the language for the purposes of assessment.

Further investigation may be needed before the definition, or other definitions which rely on the same general idea, can be verified as workable.

If the process described here determined that a language was low level then that could be regarded as correct information, but it does not follow that if a language is determined to be high level using this method then it must be high level by conventional human standards. Languages are considered low level if they do not contain a lot of information in their descriptions (when we average over the current set), but a language may be complex enough to require a lot of information, on average, to describe it, while lacking the useful features that we usually associate with high level languages. This issue need not concern us too much here: the article is only intended to show that it can be meaningful to have low level languages that we regard as really low level.

The definition proposed here, could never be used as it stands to actually evaluate languages. For a start, the requirement that N and I tend to infinity means that it could only ever be approximated in the same kind of way that approximations are made in numerical integration. It may even be that the approximation is not very closely based on the actual definition that is considered best but instead on some method that is easier to implement but is known to give the same sort of results.

Even a very loose approximation of this or other definitions may not be necessary. The article is intended more to demonstrate that the concept of 'low level language' can have an objective meaning. If we are satisfied that the concept of 'low level language' is a usable one then we may still simply use our ideas about what are 'obviously' low level languages and maybe have arguments to support these choices in very broad ways.

References

[1] (?). Occam's Razor. Wikipedia, the free encyclopedia. Retrieved on 30 April 2005 from http://en.wikipedia.org/wiki/Occam's_Razor.

[2] Cline, A. (?). Occam's Razor. Atheism.About.com. Retrieved on 30 April 2005 from http://atheism.about.com/od/criticalthinking/a/occamrazor.htm.

[3] Russel, R. K. (2002). Why Occam's Razor. Retrieved on 30 April 2005 from http://parallel.hpc.unsw.edu.au/rks/docs/occam/occam.html.

[4] Hofstadter, D. R. (New Edition, 2000). Godel, Escher, Bach: an Eternal Golden Braid. London: Penguin Books. p192, p696.
(Originally published:1980. New York: Vintage)

[5] Penrose, R. (1989). The Emperor's New Mind: Concerning Computers, Minds and the Laws of Physics. Oxford: Oxford University Press.

[6] Penrose, R. (New Edition, 1995). Shadows of the Mind: A Search for the Missing Science of Consciousness. London: Vintage. Chapter 2, pp64-76.
(Originally published:1994. Oxford: Oxford University Press)

[7] Almond, P. (2004). A Refutation of Penrose's Godel-Turing Proof that Computational Artificial Intelligence is Impossible. Retrieved on 5 June 2005 from http://www.paul-almond.com/RefutationofPenroseGodelTuring.htm.

Home Guest Book Links Email

© Copyright Paul Almond 2003-2006. All Rights Reserved. Email: info@paul-almond.com
This page last modified: Saturday January 14, 2006 19:21