Visualizing the True Extent of the Curse of Dimensionality

Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features

Florin Andrei
Towards Data Science


Think of a dataset, made of some number of observations, each observation having N features. If you convert all features to a numeric representation, you could say that each observation is a point in an N-dimensional space.

When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large — this could happen, for example, if you’re creating a lot of features via one-hot encoding, etc. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow bigger than what you would expect.

The phenomenon is real. As the number of dimensions N grows, and all else stays the same, the N-volume containing your observations really does increase in a sense (or at least the number of degrees of freedom becomes larger), and the Euclidean distances between observations also increase. The group of points actually does become more sparse. This is the geometric basis for the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be influenced as a consequence of these changes.

Many things can go wrong when the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. GridSearch) becomes less efficient: you need more trials to cover the same interval along each axis. A subtler effect hits any model based on distance or vicinity: when the number of dimensions grows very large, all the other points appear to be far away from any given point, and nearly equidistant from it. Since these models rely on distance to do their job, the leveling out of distance differences makes their job harder. Clustering, for example, does not work as well if all points appear to be nearly equidistant.
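To make the brute-force search point concrete, here is a minimal sketch (the choice of 10 points per axis is an arbitrary assumption, not a value from the article): the number of trials needed to cover the space at a fixed per-axis resolution grows exponentially with the number of dimensions.

# Minimal sketch: trials needed by a brute-force grid search at a fixed
# per-axis resolution. The value of 10 points per axis is an arbitrary assumption.
points_per_axis = 10
for n_dims in (1, 2, 3, 5, 10, 20):
    # every added dimension multiplies the size of the grid by points_per_axis
    print(n_dims, points_per_axis ** n_dims)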

For all these reasons, and more, techniques such as PCA, LDA, etc. have been created — in an effort to move away from the peculiar geometry of spaces with very many dimensions, and to distill the dataset down to a number of dimensions more compatible with the actual information contained in it.

It is hard to perceive intuitively the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely challenging to visualize, so let’s do some simple 2D visualizations to help our intuition. There is a geometric basis for the reason why dimensionality can become a problem, and this is what we will visualize here. If you have not seen this before, the results might be surprising — the geometry of high-dimensional spaces is far more complex than the typical intuition is likely to suggest.

Volume

Consider a square of size 1, centered at the origin. In the square, you inscribe a circle.

a circle inscribed in a square

That is the setup in 2 dimensions. Now think in the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we will refer to these objects as “sphere” and “cube”, no matter how many dimensions they have.

The volume of the cube is fixed: it is always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?

Let’s answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points, distributed uniformly but randomly within the cube. For each point we calculate its distance to the origin — if that distance is less than 0.5 (the radius of the sphere), then the point is inside the sphere.

random points

If we divide the number of points inside the sphere by the total number of points, that will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation gets better when the total number of points is large.

In other words, the ratio inside_points / total_points will approximate the volume of the sphere.
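As a quick sanity check (an addition of mine, not part of the article's original walkthrough), the volume of an N-ball has a closed form, V_N(r) = π^(N/2) · r^N / Γ(N/2 + 1), so the Monte Carlo estimates can be compared against exact values:

# Exact volume of an N-ball of radius r (here r = 0.5, the inscribed sphere).
# This closed-form check is an addition, not part of the article's code.
import numpy as np
from scipy.special import gamma

def n_ball_volume(n_dims, radius=0.5):
    return np.pi ** (n_dims / 2) * radius ** n_dims / gamma(n_dims / 2 + 1)

for n_dims in (1, 2, 3, 10, 20):
    print(n_dims, n_ball_volume(n_dims))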

The code is rather straightforward. Since we need many points, explicit loops must be avoided. We could use NumPy, but it runs on the CPU and most of its operations here would be single-threaded, so it would be slow. Potential alternatives: CuPy (GPU), JAX (CPU/GPU), PyTorch (CPU/GPU), etc. We will use PyTorch, but the NumPy code would look almost identical.

If you follow the nested torch statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The ratio array will end up containing the volume of the sphere in different numbers of dimensions.

The tunable parameters are set for a GPU with 24 GB of memory — adjust them if your hardware is different.

import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )

    # clean up memory
    torch.cuda.empty_cache()

Let’s visualize the results:

the volume of the sphere

N=1 is trivial: both the “sphere” and the “cube” are just linear segments, and they are equal, so the “volume” of the “sphere” is 1. For N=2 and N=3, you should remember the results from high school geometry (π/4 ≈ 0.785 and π/6 ≈ 0.524, respectively), and notice that this simulation comes very close to the exact values.

But as N increases, something very dramatic happens: the volume of the sphere drops like a stone. It is already close to 0 when N=10, and it falls below the resolution of this simulation shortly after N=20 (with 10^8 points, the smallest non-zero ratio the simulation can report is 10^-8). Beyond that point, the code reports the volume of the sphere as 0.0 (it is just an approximation).

>>> print(list(ratio))
[1.0, 0.78548005, 0.52364381, 0.30841056, 0.16450286, 0.08075666, 0.03688062, 0.015852, 0.00645304, 0.00249584, 0.00092725, 0.00032921, 0.00011305, 3.766e-05, 1.14e-05, 3.29e-06, 9.9e-07, 2.8e-07, 8e-08, 3e-08, 2e-08, 0.0]

One interpretation: for large values of N, almost the whole volume of the cube is located in the corners. The central region, containing the sphere, becomes insignificant in comparison. In high-dimensional spaces, the corners become very important. Most of the volume “migrates” into the corners. This happens extremely fast as N rises.

Another way to look at it: the sphere is the “vicinity” of the origin. As N increases, the points simply disappear from that vicinity. The origin is not really all that special, so what happens to its vicinity must happen to all vicinities everywhere. The very notion of “vicinity” undergoes a big change. Neighborhoods empty out as N rises.

I ran this simulation for the first time years ago, and I vividly remember how shocking the result was: the volume of the sphere just dive-bombs to 0 as N rises. After checking the code and not finding any errors in it, my reaction was: save the work, lock the screen, go outside, take a walk, and think about what you’ve just seen. :)

Linear distance

Let’s calculate the diagonal of the cube as a function of N. This is trivial: the diagonal of a unit cube in N dimensions is simply sqrt(N), so the code is a one-liner.
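For reference, a minimal sketch of that computation (the variable d_max_diag is reused by the distance simulation further below; its value of 2000 here is an assumption based on the range of the plots, not taken verbatim from the article's code):

# Diagonal of the unit N-cube for N = 1 .. d_max_diag.
# The value of d_max_diag (2000) is an assumption based on the plots below.
import numpy as np

d_max_diag = 2000
diag = np.sqrt(np.arange(1, d_max_diag + 1))

And these are the results: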

length of diagonal

Again, for N=2 and N=3 you should recognize the results from geometry (sqrt(2) ≈ 1.41 for the diagonal of the square, sqrt(3) ≈ 1.73 for the ordinary 3-cube). But as N increases, the diagonal keeps increasing, and its growth is unbounded.

This may sound very counterintuitive, but even as the size of the edge of the cube remains constant (and equal to 1), its diagonal can grow arbitrarily large as you increase N. Already for N=1000, the diagonal length is around 32. If the edge of the cube is 1 meter, then at N = 10^6 dimensions (since sqrt(10^6) = 1000) the diagonal of the cube is 1 kilometer long.

Even as the distance along the edge remains the same, the diagonal just keeps growing without bound as the number of dimensions increases.

Every time a new dimension is added to the space, more edges, faces, etc. pop into existence, the configuration of the corner regions becomes more complex (there are more degrees of freedom), and the diagonal measures a bit longer.

Distances between observations

How about distances between observations or points? Let’s say we generate a fixed number of random points, and then we calculate the average distance between any two points, and the average distance to the nearest neighbor of any point. All points are always contained in the cube. What happens to the average distances as N increases? Let’s run another simulation.

Please note the conservative approach to memory management. This is important, since the data structures used here are quite large.

n_points_d = 10**3
# how many pairs of points there are
dist_count = n_points_d * (n_points_d - 1) / 2
# we use the full pair-wise matrix of distances,
# so each distance will be counted twice
dist_count = 2 * dist_count
# d_max_diag is defined in the diagonal calculation above
d_max = d_max_diag

avg_dist = np.zeros(d_max)
avg_dist_nn = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # generate random points
    point_coordinates = torch.rand((n_points_d, d), device=device)
    # compute differences of point coordinates on all axes
    coord_diffs = point_coordinates.unsqueeze(1) - point_coordinates
    del point_coordinates
    # square the coordinate differences
    diffs_squared = torch.pow(coord_diffs, 2)
    del coord_diffs
    # compute distances between any 2 points
    distances_full = torch.sqrt(torch.sum(diffs_squared, dim=2))
    del diffs_squared
    # compute average distance between points
    avg_dist[d - 1] = torch.sum(distances_full).item() / dist_count
    # compute distances to nearest neighbors;
    # first overwrite the zero self-distances so they are never the minimum
    distances_full[distances_full == 0.0] = np.sqrt(d) + 1
    distances_nn, _ = torch.min(distances_full, dim=0)
    del distances_full
    # compute average distance to nearest neighbors
    avg_dist_nn[d - 1] = torch.mean(distances_nn).item()
    del distances_nn

    # clean up memory
    torch.cuda.empty_cache()

We use far fewer points here (only 1000) because the size of the main data structures grows with the square of the number of points (and the coordinate-differences tensor also grows with the number of dimensions), so we would run out of memory very quickly if we generated too many points. Even so, the approximate results should be close enough to the actual values.

This is the average distance between any two points, out of 1000, as a function of N:

average distance between points

For small values of N, the average distance is somewhere around 0.5 to 1. But as N increases, the average distance starts growing, and it is already close to 18 when N=2000. The increase is substantial.
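As a rough cross-check (this analytic argument is an addition, not part of the article): for two points drawn independently and uniformly in the unit cube, the expected squared difference per coordinate is 1/6, so for large N the typical distance between two points grows like sqrt(N/6), which agrees well with the simulated curve.

# Analytic cross-check (an addition, not the article's code): E[(U1 - U2)^2] = 1/6
# for U1, U2 uniform on [0, 1], so the typical inter-point distance in the unit
# N-cube is approximately sqrt(N / 6) for large N.
import numpy as np

for n_dims in (100, 500, 1000, 2000):
    print(n_dims, np.sqrt(n_dims / 6))  # ~18.3 at N=2000, close to the simulation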

This is the average distance to the nearest neighbor, as a function of N:

average distance to the nearest neighbor

The increase here is even more dramatic: the values are quite tiny when N is below 10, since the points are crowded together in a low-dimensional space. For large values of N, the average distance to the nearest neighbor actually gets close to the average distance between any two points. In other words, the emptiness of the immediate vicinity dominates the measurements.

distance ratios

This is why we said at the beginning that points become nearly equidistant in high-dimensional spaces: the average distance and the nearest-neighbor distance become nearly the same. The simulation gives you both evidence of this and an intuition for the strength of the phenomenon.
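To reproduce the ratio plot above, a minimal sketch along these lines should work (it assumes the avg_dist and avg_dist_nn arrays computed in the loop above, plus matplotlib; the axis labels are mine, not the article's exact figure):

# Minimal sketch of the distance-ratio plot, assuming avg_dist, avg_dist_nn and
# d_max from the simulation above. Axis labels are assumptions.
import numpy as np
import matplotlib.pyplot as plt

dims = np.arange(1, d_max + 1)
plt.plot(dims, avg_dist_nn / avg_dist)
plt.xlabel("number of dimensions N")
plt.ylabel("nearest-neighbor distance / average distance")
plt.show()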

As the number of dimensions N increases, all points recede away from each other, even though their coordinates are confined to the same narrow range (-0.5, +0.5). The group of points effectively gets more and more sparse by simply creating more dimensions. The increase spans a couple of orders of magnitude when dimensions increase from a few units to a few thousand. It’s a very large increase.

Conclusions

The curse of dimensionality is a real phenomenon, with a basis in the geometry of N-dimensional spaces.

All else being equal, as you increase the number of dimensions (or features), the points (or observations) move away from each other quickly. There’s effectively more room in there, even though linear intervals along axes are the same. All vicinities empty out, and each point ends up sitting alone in a high-dimensional space.

This should provide some intuition for the fact that some techniques (clustering, various models, etc.) behave differently when the number of features changes substantially. Few observations, combined with a large number of features, may lead to sub-optimal results — and while the curse of dimensionality is not the only reason for that, it can be one of the reasons.

Generally speaking, the number of observations should “keep up” with the number of features — the exact rules here depend on many factors.

If nothing else, this article should provide an intuitive overview of the properties of high-dimensional spaces, which are difficult to visualize. Some of the trends are shockingly strong, and now you should have some idea of just how strong they are.

All code used in this article is here:

All images used in this article are created by the author (see the code link above).
