Image from Unsplash modified by the author

*Here we obtain the probability of optimally solving the Rubik’s cube*

The Rubik’s cube is a prototype of a planning problem with a *colossal state space* and only one solution. It is the very definition of a needle in a haystack. With no guidance (even if you can turn the faces 100 times per second) you might spend the whole age of the universe with no success. Everything about it seems to involve huge numbers.

The quantity we are going to compute here is an exception to that. With it you will get a simple perspective on a difficult problem (and on any similar planning problem too). We need two ingredients: a random process and an optimal solver. The latter is a device (real or ideal) that can solve the cube (or a similar problem) in the minimum number of moves, given any initial state. We will fully answer the following question:

*If a solved cube undergoes **N** random turns, what is the probability **p(d|N)** that an optimal solver needs precisely **d** moves to solve it back?*

In a normal situation, if someone asks you to solve the cube, you will get just that: a scrambled cube with no references or labels. Here we have one piece of information about the scrambled state: it was obtained after **N** random moves from the solved state. That information is useful!

#### Why are we interested in **p(d|N)**?

Computationally you can try to decipher the cube in different ways. The ambition of a Rubik’s project might range from *solving some states suboptimally* to *solving every possible state optimally* (the latter, for example, took the famous 35 CPU-years). A cube solver generally involves two things: a search algorithm and a heuristic function. By choosing these two we parametrize how difficult, efficient, or computationally demanding our approach is.

In the field of the heuristic function, i.e. the search guidance, there is always room for new ideas. Historically, the cube’s heuristic was a combination of Manhattan-like distance estimations for the positions of the scrambled cube’s facelets relative to their solved positions. Only recently has a neural net been used as the heuristic.

*Neural Net = Rubik’s cube new heuristic*

The work of the neural net is simple (a classifier): you feed it a cube’s state **x** and the depth **d** of that state is predicted. The depth **d** of a state is defined as the minimum number of moves required to solve the cube from that state. Notice the following: if we have a device that knows the depth of any state, we actually have an optimal solver, because at every step we can select the move that gives us a state with a lower depth, until we reach depth = 0 (the solved state).
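This greedy use of a depth oracle can be sketched in a few lines. The helper names (`greedy_solve`, `depth`, `apply_move`) and the toy number-line "puzzle" below are illustrative stand-ins, not the cube itself:

```python
# A perfect depth oracle turns greedy search into an optimal solver:
# at each step, pick any legal move that lowers the depth.
def greedy_solve(state, depth, moves, apply_move):
    path = []
    while depth(state) > 0:
        # with an exact oracle, some legal move always lowers the depth by one
        move = min(moves, key=lambda m: depth(apply_move(state, m)))
        state = apply_move(state, move)
        path.append(move)
    return path

# Toy stand-in for the cube: states are integers, moves are +1/-1,
# and the "depth" of a state is its distance to 0 (the solved state).
path = greedy_solve(5, depth=abs, moves=[+1, -1], apply_move=lambda s, m: s + m)
print(len(path))  # 5: the minimal number of moves
```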

The problem here is how to train that net, or, specifically, how to get accurate training data. There is no easy way to know the ground-truth depth of a scrambled state **x**, unless you already have an optimal solver. And we don’t have an optimal solver, or we don’t want to use a computationally expensive one. We want to build an approximate and efficient optimal solver from scratch and with minimal human input, and we also want accurate training data for it:

training_data = (**x**, **d**),

As we said, an accurate **d** is difficult to get, but it is easy to associate a number **N** with some particular scrambled states: the ones produced by acting on the solved state with **N** random moves. Then **p(d|N)** estimates **d**, given **N**.

**p(d|N)** will be used to improve the accuracy of that training data. The authors of the aforementioned paper(s) built the first Rubik’s cube depth-classifier neural net. Their training data was of the form:

training_data = (**x**, **N**).

They took **d** as **N**. That choice was compensated by dynamically improving the accuracy of the labels using a Bellman-like loop during training. The probability **p(d|N)** computed here offers a better starting point for the accuracy of that training data (abundantly obtained by just randomly twisting the solved state **N** times).

#### A Random Walk view

Computing **p(d|N)** is equivalent to asking how far a random walker would be after **N** steps. Instead of walking on a square lattice, he would be walking on a massive Rubik’s graph of 10 to the power of 19 nodes (cube states) and a similar number of links (legal moves). If we choose a layout where the nodes are organized by their depths, with the solved state in the center and the states of depth **d** located at radius **d** from that center, the graph will look very symmetrical. The radial (depth) direction offers a very simple perspective.

#### Convention

Here we adopt the so-called quarter-turn metric for the 3x3x3 cube, where a move involves a 90-degree face turn, either clockwise or anticlockwise. In this metric there are twelve possible moves. If we had chosen a different metric, like the half-turn metric (which also counts 180-degree face turns as a single move), the expression for **p(d|N)** would differ.

#### The data

To get **p(d|N)** we will need some kind of domain knowledge, but we don’t want to deal with graphs, pattern databases, or group theory. We will use something more “primary”:

*The list containing the population of cube states at depth **d**.*

The list (provided by the authors of the 2012 God’s number paper) doesn’t specify which states are at a certain depth, just the total number of them, and there is no reference to any **N**. It will be the source data for our computation.

```python
# Depth-population list:
# number of cube states at depth d in the quarter-turn metric
D_d = {
    # depth: number of states
    0: 1,
    1: 12,
    2: 114,
    3: 1068,
    4: 10011,
    5: 93840,
    6: 878880,
    7: 8221632,
    8: 76843595,
    9: 717789576,
    10: 6701836858,
    11: 62549615248,
    12: 583570100997,
    13: 5442351625028,
    14: 50729620202582,
    15: 472495678811004,
    16: 4393570406220123,
    17: 40648181519827392,
    18: 368071526203620348,
    19: 3000000000000000000,   # approximate
    20: 14000000000000000000,  # approximate
    21: 19000000000000000000,  # approximate
    22: 7000000000000000000,   # approximate
    23: 24000000000000000,     # approximate
    24: 150000,                # approximate
    25: 36,
    26: 3,
    27: 0
}
```

Log-scaled plot of the number of states vs depth

#### Some observations on this list:

First, there are zero states at a depth greater than 26 (God’s number is 26 in the quarter-turn metric). Second, the list reports an approximate number of states for **d** between *19* and *24*; we will need to be careful with this later. Third, if we make a log-scaled graph, we see linear growth for most depths (except those close to the end). That means the population **D(d)** of states grows exponentially. Fitting the linear part of the log graph with a straight line, we learn that between *d = 3* and *d = 18* the state population grows as **D(d) ∝ 9.34^d**.

At depths *3 < **d** < 18,* on average, *9.34* of the *12* moves will take you farther from the solved state, and *2.66* will take you closer.
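That slope can be reproduced with a quick least-squares fit of log10 **D(d)** over the exact range *d = 3, …, 18* (a sketch; no external libraries needed):

```python
import math

# Exact depth populations for d = 3..18 (from the list above)
D_d = {3: 1068, 4: 10011, 5: 93840, 6: 878880, 7: 8221632, 8: 76843595,
       9: 717789576, 10: 6701836858, 11: 62549615248, 12: 583570100997,
       13: 5442351625028, 14: 50729620202582, 15: 472495678811004,
       16: 4393570406220123, 17: 40648181519827392, 18: 368071526203620348}

# Least-squares fit: log10 D(d) = intercept + slope * d
xs = sorted(D_d)
ys = [math.log10(D_d[d]) for d in xs]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))

b = 10 ** slope  # effective branching factor: D(d) grows like b**d
print(round(b, 2))  # close to 9.34
```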

#### A Markov process on depth classes

To find **p(d|N)** we imagine the depth classes as sites of a Markov process. Let me explain:

Randomly moving the cube faces is described as a Markov process (one dimensional random walk) between depth classes. Image by the author.

A depth class **d** is the set of all the cube states at depth **d** (minimal number of moves to the solved state). If we randomly choose a state in a depth class **d** and turn a random face with a random move, that will give us either a state in the class **d + 1**, with probability **p_d**, or a state in the class **d − 1**, with probability **q_d**. In the quarter-turn metric there are no self-class moves.

That defines a Markov process where a particular site is a whole depth class. In our case, only contiguous **d** classes are one-jump connected. To be more precise, this is a discrete-time birth-death Markov chain. Because the number of sites is finite and the chain is irreducible, a unique stationary distribution exists.

We assume equally distributed probabilities for the selection of the random move at each time step. That induces transition probabilities **p_d, q_d** (to be computed) between the depth classes. The number of random moves **N** is the discrete time of the Markov process. This is also a one-dimensional random walker: at every site (depth class number **d**), the probability of going forward is **p_d** and the probability of going backwards is **q_d**. This one-dimensional chain is, roughly speaking, the “radial” direction in the Rubik’s graph (organized in the depth-radial layout).

#### The transition matrix

Any Markov process is encoded in a transition matrix **M**. The **(i, j)** entry of **M** is the probability of jumping from site **i** to site **j**. In our case only the entries next to the diagonal are different from zero: **M[d, d+1] = p_d** and **M[d, d−1] = q_d**.

*j*Here ** p_0 = 1: **from the depth class

**0**(containing just the solved state) we can only jump to the depth class

**(there is no class**

*1***). Also,**

*-1***: from the depth class**

*q_*26*= 1***26**we can only jump to depth class

**25**(there is no class

**27**). For the same reason, there are no

**or**

*p_*26

*q_*0.#### The stationary distribution

We mapped the action of randomly moving the cube to a one-dimensional depth-class random walker jumping back and forth with probabilities **q_d** and **p_d**. What happens after a long walk? Or: how many times does the walker visit a particular site after a long walk? In real life: how often is a depth class visited when the cube undergoes random turns?

In the long run, and no matter what the starting point was, the time the walker spends in the depth class **d** is proportional to the population **D(d)** of that depth class. This is the main point here:

*the (normalized) depth-population list **D(d)** should be interpreted as the vector representing the stationary distribution of our depth class Markov process.*

Mathematically, **D(d)** is a left eigenvector of **M** with eigenvalue one: **D M = D**. This matrix equation will give us **26** linear equations, from which we will get the **p_i**’s and **q_i**’s.

Taking into account that **p_0 = q_26 = 1**, we can rewrite these as

Detailed balance equations. Image by the author.

These are known as **detailed balance equations**: the flux, defined as the stationary site population times the jumping probability, is the same in both directions: **D_i p_i = D_{i+1} q_{i+1}**. The solutions are **q_i = (D_{i−1} − D_{i−2} + D_{i−3} − … ± D_0) / D_i**, and **p_i** is obtained using **p_i + q_i = 1**.

#### Some conditions on the population of a depth class

There is something interesting about these solutions. Because **q_i** is a probability, we should have **0 ≤ q_i ≤ 1**, and that translates into the following condition for the distribution **D_k**: **0 ≤ D_{i−1} − D_{i−2} + D_{i−3} − … ± D_0 ≤ D_i**. This is a tower of inequalities that the depth-population **D_k** should fulfill. Explicitly, the first few read **D_0 ≤ D_1**, **D_1 − D_0 ≤ D_2**, **D_2 − D_1 + D_0 ≤ D_3**, and so on.

In particular, the last two inequalities are **0 ≤ D_26 − D_25 + … − D_1 + D_0** and **D_26 − D_25 + … − D_1 + D_0 ≤ D_27**. Because **D_27 = 0**, we get that the lower and upper bounds are equal, so **D_26 − D_25 + D_24 − … − D_1 + D_0 = 0**. Or:

*The sum of the population of the even sites should be equal to the sum of the population of the odd sites!*

We can see this as a detailed balance between even and odd sites: every move always takes you to a contiguous depth class, so any jump goes from the odd superclass (the union of all odd depth classes) to the even superclass (the union of all even depth classes), or vice versa. Since these jumps occur with probability 1 in both directions, the two populations should be equal by detailed balance.
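We can check the even-odd balance directly on the list given earlier (Python integers are exact, so the big sums are no problem):

```python
# Depth-population list, including the approximate entries for d = 19..24
D_d = {0: 1, 1: 12, 2: 114, 3: 1068, 4: 10011, 5: 93840, 6: 878880,
       7: 8221632, 8: 76843595, 9: 717789576, 10: 6701836858,
       11: 62549615248, 12: 583570100997, 13: 5442351625028,
       14: 50729620202582, 15: 472495678811004, 16: 4393570406220123,
       17: 40648181519827392, 18: 368071526203620348,
       19: 3000000000000000000, 20: 14000000000000000000,
       21: 19000000000000000000, 22: 7000000000000000000,
       23: 24000000000000000, 24: 150000, 25: 36, 26: 3}

even = sum(n for d, n in D_d.items() if d % 2 == 0)
odd = sum(n for d, n in D_d.items() if d % 2 == 1)
print(even - odd)  # far from zero: the approximate entries break the balance
```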

For the same reason the Markov process will attain a period-two “stationary distribution” that switches between even and odd sites after every move (discrete time **N**).

#### A problem with the data

The depth-population **D_d** reported in the source we are planning to use is approximate for **d = 19, 20, 21, 22, 23, 24**, so there is no guarantee it will satisfy all these conditions (inequalities). Don’t be surprised if we get some probabilities **q_i** out of the range [0, 1] (as is the case!). In particular, if we check the last condition (the even-odd population equality), it is off by a big number! (update: see note at the end)

#### A way out

The even-odd balance is broken (a consequence of the approximations used to report the data). To make things work (get probabilities in the range [0, 1]), we decided to absorb that big imbalance into the population of depth class 21 (the odd class with the greatest population, the one that will notice the adjustment the least). With this correction, all the obtained probabilities are in range (which means the inequalities are also satisfied).
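The detailed-balance recursion itself is straightforward to run. A sketch restricted to the exact part of the list (*d ≤ 18*), where no correction is needed; extending it to *d = 26* requires the corrected populations:

```python
# Exact depth populations for d = 0..18 (quarter-turn metric)
D = [1, 12, 114, 1068, 10011, 93840, 878880, 8221632, 76843595,
     717789576, 6701836858, 62549615248, 583570100997, 5442351625028,
     50729620202582, 472495678811004, 4393570406220123,
     40648181519827392, 368071526203620348]

# Detailed balance: D[i] * p[i] = D[i+1] * q[i+1], with p_0 = 1, p_i = 1 - q_i
p = [1.0]
q = [None]  # q_0 does not exist
for i in range(1, len(D)):
    q.append(D[i - 1] * p[i - 1] / D[i])
    p.append(1.0 - q[i])

print(round(q[1], 7), round(q[2], 7))  # 0.0833333 (= 1/12), 0.0964912
```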

The jumping probabilities are:

```python
# i from 0 to 25
p_i = [1.0, 0.916667, 0.903509, 0.903558, 0.903606, 0.903602, 0.90352,
       0.903415, 0.903342, 0.903292, 0.903254, 0.903221, 0.903189,
       0.903153, 0.903108, 0.903038, 0.902885, 0.902409, 0.900342,
       0.889537, 0.818371, 0.367158, 0.00342857, 6.24863e-12, 0.00022,
       0.0833333]

# i from 1 to 26
q_i = [0.0833333, 0.0964912, 0.0964419, 0.096394, 0.0963981, 0.0964796,
       0.096585, 0.096658, 0.0967081, 0.0967456, 0.0967786, 0.0968113,
       0.0968467, 0.0968917, 0.0969625, 0.0971149, 0.0975908, 0.0996581,
       0.110463, 0.181629, 0.632842, 0.996571, 1.0, 0.99978, 0.916667, 1.0]
```

Notice that almost all of the first **p_i** (up to **i = 21**) are close to **1**. These are the probabilities of going away from the solved state. The probabilities of going closer to the solved state (**q_i**) are almost **1** for **i** greater than **21**. This puts in perspective why it is difficult to solve the cube: the random walker (or the cube’s random mover) will be “trapped forever” in a neighborhood of the depth class **21**.

#### p(d|N)

Plugging the **p_i, q_i** numerical values into the transition matrix **M**, the Markov process is completely described. In particular, if we start at site **0** with probability one, after **N** steps the random walker will be at site **d** with probability **p(d|N) = (M^N)_{0,d}**. This is the probability we were looking for.

Numerically, we learn that **p(d|N)** is non-zero only if **N** and **d** have the same parity (something that is well known by some Rubik’s cube scholars). Here are some plots for different numbers of random moves **N** from the solved state:

*N*Some probabilities p(d|N). Image by the author.

We observe that after **N = 31** or **32**, **p(d|N)** is pretty close to the stationary distribution **D(d)** (except that it switches back and forth between even and odd sites). This is another answer to the question of how many moves are enough to say we really scrambled the cube.

Let’s notice we have solved an inverse problem: we got the transition probabilities from the stationary distribution. This is not possible for a general Markov process. In particular, it is not possible to find **p(d|N)** with the method described here for the half-turn metric. The half-turn metric is different because we can stay in the same depth class after a move (by the definition of its moves). These self-class jumps introduce extra probabilities **r_i** on the diagonal of the transition matrix, and we would have more variables than equations.

#### Final Comments

Even if the Rubik’s cube is a 35 CPU-years problem from a computational perspective, there are many aspects of it that can be described analytically or with modest numerical effort. The probability we computed here is an example of that. Everything we said can be easily generalized to more complex Rubik’s-cube siblings. A cool generalization is to go to higher dimensions: **S = 4D, 5D, 6D, …** dimensional Rubik’s cubes. The state space in those cases grows exponentially with **S**, so there are cubes for which the Markov chain is as long as we want. In other words, we have similar puzzles with God’s numbers as big as we want (roughly speaking, God’s number is the logarithm of the number of states, and the number of states increases with the dimension **S**). In those cases we can obtain some interesting limits that shed some light on the **3D** case.

*In the limit of large God’s number **G**, the probability **p(d|N)** should approach the binomial distribution.*

It is not very difficult to see why that might be the case. If **G** is large, the exponential growth of **D_d** will be very stable for most **d**’s. This is a daring, yet not excessively wild, guess: **D_k ∝ b^k** for **k** away from **0** and **G**. As we said, in the **S = 3D** case **b = 9.34**. For higher **S**, **b** should increase (having more faces will increase the branching factor). That translates into the following values for the probabilities:

*b*** q_i **approaches a constant value

**when**

*1/(b+1)***is away from the origin (**

*i***).**

*i>>1***will also be constant in this range. You can see that the values of**

*p_i***and**

*p_i***computed here for the**

*q_i***case are almost constant if**

*3D***is not too close to the endpoints**

*i***and**

*i = 0***Numerically**

*i = G = 26.***is approximately equal to**

*q_i***with**

*1/(b+1)*

*b =**9.34*for

*i=3, …, 15**in our*

**case**

*3D**.*For

**we will then have a one dimensional random walker with constant probabilities of going back (**

*0 << i << G***) and forth (**

*q***). In this case the position the walker will be described by a binomial-like distribution.**

The probability that the random walker will take **k** steps to the right with constant probability **p** and **N − k** steps to the left with constant probability **q**, after **N** trials, is **Binomial(k, N, p)**. The distance traveled after **N** steps will then be **d = k − (N − k) = 2k − N**, from which we obtain **p(d|N) = Binomial((N + d)/2, N, p)**. From here we can get analytical estimates for the most likely **d**: the binomial mode **k ≈ N p** gives **d ≈ N(p − q) = N(b − 1)/(b + 1)**.

This is interesting: as the dimension **S** of the cube increases, the branching factor **b** grows, and the most probable depth **d** actually approaches **N**.

**Update note**. After this story was published I contacted Tomas Rokicki and Morley Davidson (two of the authors of the 2012 proof that God’s number is 20 in the half-turn metric) about their reported **D(d)** and the negative probabilities I got using it. They shared with me more accurate data, with lower and upper bounds for the populations of the depths d = 19, …, 24. Their data is fully compatible with the inequalities obtained here, and with the fact that the population of the even depth classes should equal the population of the odd depth classes (in the quarter-turn metric). The probabilities computed here receive a negligible correction when using this new data.

Rubik and Markov was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.