LLE simplifies complex data in three steps. Here’s how it works, using a dataset of handwritten numbers or digits as an example.
Find nearby data points: For each data point, a locally linear embedding identifies its closest neighbors. These neighbors are small groups of points that are used to represent local relationships. Nearby points can be found in two main ways:
- k-nearest neighbors: This method finds the closest points for each data point based on their distances from it, using a “k” parameter. For example, if k=5, the algorithm looks for the five closest points to represent the local neighborhood. In the handwritten digit dataset, this step would involve identifying the five most similar digit images for each sample based on pixel intensity patterns using metrics like Euclidean distance.
- ε-ball: Instead of a fixed number of neighbors, this method includes all points within a certain distance (ε) from a data point. For instance, all digit images within a specific similarity range would be included in the neighborhood. A small ε might focus on digits with very similar strokes, while a larger ε might include more diverse handwriting styles.
Calculate how neighbors fit together: Each data point is written as a combination of its neighbors. The algorithm adjusts the reconstruction weights to find the best fit. In the handwritten digit dataset, this step involves figuring out how much each neighboring digit image contributes to approximating the current image. For example, the image of the digit “3” might be represented by a weighted combination of similar “3” images in the dataset. The weight matrix ensures these relationships are captured mathematically, and the weights are optimized iteratively to minimize the reconstruction error while maintaining sparse matrix representations to reduce computational overhead.
Create a simpler version of the data: Finally, locally linear embedding creates a simpler version of the dataset. It uses weights to make sure the relationships between data points stay the same, even in fewer dimensions. For the handwritten digit dataset, this would result in a low-dimensional representation where similar digits (like different styles of “3”) are grouped close together. This embedding could then be visualized in 2D or 3D, making it easier to identify patterns, clusters, or outliers in the dataset. The global geometric framework ensures that while local relationships are preserved, the structure in the embedding space is meaningful.