Here’s a step-by-step tutorial on how to sample your dataset efficiently using Python
I was putting the Christmas tree up with my wife. We went to the basement, took the tree, brought it upstairs, and started building it from bottom to top. It's always a magical moment 🎄
Then came the moment to put the balls on the tree. And immediately I thought: there are at least three ways to put the balls on a tree.
Uniformly: put the balls uniformly on the tree, evenly spaced over its surface.
Randomly: put the balls randomly on the tree, closing your eyes and placing each ball wherever you feel like (I started doing this and my wife went MAD).
Latin Hypercube: split the tree into N sections and extract randomly within each one of these sections.
I tried to show this to my wife. She smiled and said "Whatever", so I went to my computer in the hope that your reaction would be something more satisfying 😤
Jokes aside, when dealing with Machine Learning problems there are two different scenarios:
1. You don't have any control over the dataset. You have a client, or a company, that hands you a dataset. That's what you have to work with until a (necessary, eventual) re-training is scheduled.
For example, in the city of New York, you want to predict the price of a house based on some given features. The client just gives you the dataset and wants you to build your model so that, when a new customer arrives, your software can predict the price based on the features of the house of interest.
2. You can build your Design of Experiment. This is when you have a forward model or a real-world experiment that you can always set up to run.
For example, in a laboratory, you want to predict a physical signal given an experimental setup. You can always go to the lab and generate new data.
The considerations that you make in the two cases are completely different.
In the first case, you can expect a dataset that is unbalanced in its features, maybe with missing input values and a skewed distribution of the target values. It's the joy and damnation of a data scientist's job to deal with these things, though: you do data augmentation, filter the data, fill in the missing values, run some ANOVA testing if you can, and so forth.
In the second case, you have complete control over what's going on in your dataset, especially from the input perspective. This means that if you have a NaN value you can repeat the experiment, if you have several NaN values you can investigate that weird area of your dataset, and if you have a suspiciously large value for some given feature you can just repeat the experiment to make sure it's not a hallucination of your setup.
Since we have this much control, we want to make sure we cover the input parameter space efficiently. For example, suppose you have 3 parameters and you know the boundaries:
x_i^L ≤ x_i ≤ x_i^U
where i goes from 1 to 3 (or from 0 to 2 if you like Python so much 😁). Here, x_i is the i-th variable: it will always be larger than x_i^L (the lower boundary) and smaller than x_i^U (the upper boundary).
We have our 3-dimensional cube.
Now, remember that we have complete control of our dataset. How do we sample? In other words, how do we determine the x's? What are the points we want to select so that we can run the forward model (experiment or simulation) and get the target values?
As you can expect there are multiple methods to do so. Each method has its advantages and disadvantages. In this study, we will discuss them, show the theory behind them, and display the code for everyone to use and understand more about the beautiful world of sampling. 🙂
Let’s start with the uniform sampling:
1. Uniform Sampling
The uniform sampling method is arguably the simplest and most famous one.
It is just about splitting each parameter (or dimension) into steps. Let's assume that we have 3 steps per dimension, for 2 dimensions, and that each dimension goes from 0 to 1 (we will extend this in a minute). This would be the sampling:
(0, 0), (0, 0.5), (0, 1),
(0.5, 0), (0.5, 0.5), (0.5, 1),
(1, 0), (1, 0.5), (1, 1)
This means that we fix one variable at a time and increase the other step by step. Fairly simple. Let's code it:
1.1 Uniform Sampling Code
How do we do this? Let’s avoid this kind of structure:
```
for a in dimension 1:
    for b in dimension 2:
        ...
            for <last letter of the alphabet> in dimension <number of letters in the alphabet>:
                X.append([a, b, ..., <last letter of the alphabet>])
```
We don't want this: it is not very efficient, and you need to define a variable per dimension, which is annoying. Let's use the magic of numpy instead.
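Here is a minimal sketch of the idea (the function name `uniform_sampling` and the exact structure of `parameter_dict`, a name-to-(min, max) mapping, are my conventions, not a fixed API):

```python
import numpy as np

def uniform_sampling(parameter_dict, n_steps):
    """Uniform (grid) sampling over the box defined by parameter_dict."""
    # One evenly spaced axis per parameter, from its min to its max
    points = [np.linspace(low, high, n_steps)
              for low, high in parameter_dict.values()]
    # meshgrid does the job of the nested for loops, but vectorized
    grids = np.meshgrid(*points)
    # Flatten each grid and stack into a (n_steps**n_dims, n_dims) array
    return np.column_stack([grid.ravel() for grid in grids])
```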
The call np.meshgrid(*points) does what you would do with the for loops, but in an optimized way. The parameter dict is the dictionary that tells you the min and the max for each parameter.
Using this bit of code, you can generate both a 0/1 cube and a cube with three different ranges (for example, -5 to 1 in the first dimension, 0 to 10 in the second, and -1 to 1 in the third):
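For example, something like this, reusing the sketch above with the boundary values from the text:

```python
# 0/1 cube: every dimension goes from 0 to 1
unit_cube = uniform_sampling({'x_1': (0, 1),
                              'x_2': (0, 1),
                              'x_3': (0, 1)}, n_steps=10)

# Custom boundaries: a different range per dimension
custom_cube = uniform_sampling({'x_1': (-5, 1),
                                'x_2': (0, 10),
                                'x_3': (-1, 1)}, n_steps=10)

print(unit_cube.shape)  # (1000, 3): 10 steps ** 3 dimensions
```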
We have three dimensions; let's plot the first two:
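A minimal matplotlib sketch (the variable names follow the code above):

```python
import matplotlib.pyplot as plt

# Scatter plot of the first two dimensions of the custom cube
plt.scatter(custom_cube[:, 0], custom_cube[:, 1], s=10)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Uniform (grid) sampling')
plt.show()
```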
1.2 Pros and Cons
Pros: This method is well known for two reasons. First, it is super easy to implement: it's really just a for loop over the variables. Second, you are covering the parameter space uniformly, which is ideal if you want to make sure not to miss important parts of the parameter space.
Cons: The huge problem with this method is the exponential dependency on the number of dimensions. If we fix the number of dimensions at 6, a design with steps = 10 already means 10^6 = one million points. And the problem, again, is the exponentiality of the thing: if you double the number of steps (20 instead of 10), you are now talking about a 64-million-point problem (20^6).
2. Random Sampling
An alternative to uniform sampling is random sampling. How does it work? It's very simple: within the cube of interest, you just select points at random inside the boundaries.
2.1 Random Sampling Code
The random sampling method is extremely simple to code, for both 0–1 cubes and custom-boundary cubes. This is how:
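A minimal sketch of both cases (the seed and the variable names are my choices; the boundaries are the same ones used above):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility
n_samples = 1000

# 0-1 cube: every entry drawn uniformly from [0, 1)
unit_samples = rng.random((n_samples, 3))

# Custom boundaries: rescale each column to its own (min, max) range
parameter_dict = {'x_1': (-5, 1), 'x_2': (0, 10), 'x_3': (-1, 1)}
lows = np.array([low for low, high in parameter_dict.values()])
highs = np.array([high for low, high in parameter_dict.values()])
custom_samples = lows + (highs - lows) * rng.random((n_samples, len(parameter_dict)))
```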
Let’s plot this:
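Again, a scatter of the first two dimensions:

```python
import matplotlib.pyplot as plt

plt.scatter(custom_samples[:, 0], custom_samples[:, 1], s=10)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Random sampling')
plt.show()
```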
2.2 Pros and Cons
Pros: As with uniform sampling, random sampling is extremely simple to understand and to code (as you saw). Another pro is that this method can capture the complexity of the output space better than uniform sampling, especially in large dimensions.
Cons: The problem is the inherent randomness of the sampling. This can potentially create clusters and poorly explored areas.
Just to go a little more in depth, the (extremely good) paper by Pedergnana et al. compares these two methods and several others, especially in high dimensionality.
3. Latin Hypercube Sampling
Latin Hypercube Sampling (LHS) is usually defined as "uniformly randomized sampling". I think this is a very beautiful definition. Let me explain the idea.
The key idea behind LHS is to divide the parameter space into equally probable intervals along each dimension and ensure that, within each interval, only one sample is taken. This results in a stratified and well-distributed sample that covers the entire parameter space.
The cool thing about the Latin Hypercube is that you can use optimization methods, for example to maximize the minimum distance between points while placing each point in a randomized location within its interval.
3.1 Latin Hypercube code
This method deserves a custom installation, namely the Surrogate Modeling Toolbox (smt):
pip install smt
After that, it's super easy to do:
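A minimal sketch with smt's LHS class, reusing the boundary values from the earlier examples; the 'maximin' criterion maximizes the minimum distance between points:

```python
import numpy as np
from smt.sampling_methods import LHS

# One (min, max) row per dimension
xlimits = np.array([[-5.0, 1.0],
                    [0.0, 10.0],
                    [-1.0, 1.0]])

# 'maximin' places each point randomly within its interval while
# maximizing the minimum distance between points
sampling = LHS(xlimits=xlimits, criterion='maximin', random_state=42)
lhs_points = sampling(100)  # draw 100 samples, shape (100, 3)
```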
3.2 Pros and Cons
Pros: The Latin Hypercube is visually similar to random sampling, but in multiple dimensions it preserves a form of regularity in the random sampling without the limits of uniform sampling. This method, in its variations, is the preferred one for large dimensions with few samples (which is the trickiest case to consider).
Cons: It is much more complex both to implement and to describe, so it requires domain knowledge and a little bit of hands-on practice.
4. Conclusions
In this blog post, we saw three design-of-experiment (sampling) techniques for machine learning cases where control of the input parameters is possible. In particular, we saw:
Uniform (Grid) Sampling: builds an N-dimensional grid, where N is the number of dimensions. Straightforward to use, but not detailed enough, especially in large dimensions.
Random Sampling: defines the N-dimensional cube and extracts random values inside it. Straightforward to use and better-performing than uniform sampling in large dimensions, but still not optimal, as it can create clusters and overly dense areas.
Latin Hypercube Sampling: regularizes random sampling by imposing that exactly one point is sampled in each section of the N-dimensional hypercube. Optimal for large dimensionality with few samples, but it requires domain knowledge and an optimization procedure.
We saw the three cases with coding examples for both unitary cubes (0 to 1 for each variable) and custom limits for each variable.
No method is perfect, and adopting one or the other depends on your end goal. I hope that this article gives you a little bit of a framework to work on when deciding which design of experiment is best to adopt 🙂
5. Final words
If you liked the article and you want to know more about machine learning, or you just want to ask me something, you can:
A. Follow me on Linkedin, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to message me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any “maximum number of stories for the month” and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.