Following on from the section on state machines, we will now look at an example of a language that isn't regular, and see what sort of problems you might have designing a state machine.

As it turns out, most programming languages are not regular languages, as we will see at the end of this section.

We will look at a simple example of a language that isn't regular. The language contains only the characters X and Y, and consists of all the strings that consist of `n`

Xs followed by `n`

Ys, where `n`

is any integer. Examples of valid strings are:

XY or XXYY or XXXYYY etc

Invalid strings include:

XXY or XXYYY

because the number of Xs and Ys are not equal. There are lots of other invalid strings, of course.

Here is our state machine. For clarity we have excluded all the error paths and just show the correct paths through the machine:

Here is how it works for the string `XY`

:

- Starting in state 1
`X`

moves us to state 2`Y`

moves us to state 9, which is the valid end state

Here is how it works for the string `XXYY`

:

- Starting in state 1
`X`

moves us to state 2`X`

moves us to state 3`Y`

moves us to state 8`Y`

moves us to state 9, which is the valid end state

It should be clear how this works for other strings. Each `X`

takes us one step further to the right, along the top row. We then need exactly the same number of `Y`

characters to get us back to the valid end state, 9.

Here is the problem. The state machine shown for only works for 4 or less incoming `X`

characters.

Of course, we could easily add extra states to allow us to handle 5 Xs. Or 10. Or 100.

But however many states you add, someone could come along with a string that was too long and the state machine would fail. To be capable of handling a string of *any length* you would need a state machine with an infinite number of states.

But we are dealing with *finite* state machines, which by definition cannot have an infinite number of states (the clue's in the name). A finite state machine cannot handle this particular language.

Just because this languages isn't regular doesn't mean a computer can't parse it, of course. It is a very simple language - it would clearly be possible to write a program to check a string to see whether it was valid or not. You just need to count the number of Xs and then count the number of Ys, and check if they are the same.

But you can't use a finite state machine to do that, so our simple language isn't *regular*.

We can't apply some other methods of parsing that only apply to regular languages.

Are programming languages like Python or Java regular? Well it turns out most of them are not.

A simple way to see that is to look at a basic maths expression:

x = a * (b + c * (d + e * (f + g)))

This expression uses brackets. Brackets can be nested, as shown. One of the things that *has* to be true of any valid expression is that each opening bracket must be matched by a closing bracket.

Can a state machine do this? well we already know that it can't. Even if we ignore all the other parts of the expression, the very least we need to do is check a string like this:

((()))

for any number of brackets. That is exactly the same problem as we looked at before for strings like `XXXYYY`

, so we know a finite state machine can't parse a mathematical expression, so any language that allows nested brackets *can't* be regular.

If fact you get a similar problem with other types of nesting. Nested if statements, nested for loops, nested function definitions, nested classes - any language that supports those things (and most langauges do) cannot be a regular language.

Copyright (c) Axlesoft Ltd 2021