Imagine you have an ordinary data table in your hands. It may contain thousands of rows and columns, and usually the entries are all numbers, which are hard to make sense of at a glance. Your task is to extract as much information from it as possible.
What can you do with this data? How can you find the useful information in it?
The usual approach is to compute some summary statistics. But for a real understanding, you need more meaningful features, such as how the rows cluster and how the columns relate to one another.
To do this, you can use more advanced methods such as principal component analysis (PCA). But the numerical output of these methods is still hard to interpret. So here we pair these methods with visual encodings.
For example, PCA gives you a compressed result: the number of rows stays the same, but the columns are reduced to 2 or 3. Since columns represent dimensions, the dataset is transformed from high-dimensional to low-dimensional, and you can easily visualize it as a scatterplot in a 2-D (or 3-D) coordinate system.
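This reduction step can be sketched in a few lines of NumPy; the minimal PCA below (via SVD) is illustrative, and in practice you would more likely reach for something like `sklearn.decomposition.PCA`:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components (minimal PCA via SVD)."""
    Xc = X - X.mean(axis=0)            # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # low-dimensional coordinates, one per row

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))         # 100 rows, 10 numeric columns
Y = pca_project(X, 2)
print(Y.shape)                         # (100, 2): same rows, only 2 columns
```

Each of the 100 rows keeps its identity; only the column count shrinks, so every row becomes one 2-D point.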
In this scatterplot, every point represents a row, and the points may gather into distinct clusters. We naturally read these clusters as the clustering structure of the dataset, and if you assign a different color to each cluster, you can grasp it immediately.
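Coloring by cluster might look like this; the tiny k-means below is a hypothetical stand-in for whatever clustering method you prefer (e.g. `sklearn.cluster.KMeans`):

```python
import numpy as np

def kmeans_labels(Y, k=3, iters=20, seed=0):
    """Bare-bones k-means over 2-D points; a sketch, not a production clusterer."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # nearest center per point
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

# three synthetic blobs standing in for the PCA scatterplot
rng = np.random.default_rng(1)
Y = np.concatenate([rng.normal(c, 0.3, size=(30, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
labels = kmeans_labels(Y, k=3)
colors = np.array(["tab:red", "tab:green", "tab:blue"])[labels]
# e.g. plt.scatter(Y[:, 0], Y[:, 1], c=colors) would render the colored view
```

One color per cluster is all it takes for the grouping to become visible at a glance.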
But since PCA is just one of many algorithms, how can you be sure it gives you a faithful result? Can you trust it? The short answer is no. We need a way to test it.
Here we compute two kinds of neighbor errors by testing every point in the scatterplot.
What is a neighbor? Points that fall within a circular area around a selected point are called its neighbors.
We use the classic Euclidean distance to find all the neighbors of each point, both in the low-dimensional scatterplot and in the original high-dimensional data.
Next, we compare the two neighbor sets. If a point has neighbors in the low-dimensional scatterplot that are not its neighbors in the high-dimensional data, we call them false neighbors. And if neighbors from the high-dimensional data do not appear next to the point in the low-dimensional scatterplot, we call them missing neighbors.
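This comparison can be sketched as set differences between the two neighbor sets; the fixed radii and the three-point toy data below are illustrative assumptions, not part of the original method:

```python
import numpy as np

def neighbor_sets(P, radius):
    """For each point, the set of other points within `radius` (Euclidean)."""
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    return [set(np.flatnonzero(row <= radius)) - {i} for i, row in enumerate(d)]

def neighbor_errors(X_high, Y_low, r_high, r_low):
    """Per point: false neighbors (neighbors only in the 2-D plot)
    and missing neighbors (neighbors only in the original data)."""
    high = neighbor_sets(X_high, r_high)
    low = neighbor_sets(Y_low, r_low)
    false_n = [low[i] - high[i] for i in range(len(high))]
    missing_n = [high[i] - low[i] for i in range(len(high))]
    return false_n, missing_n

X = np.array([[0., 0., 0.], [0.1, 0., 0.], [0., 0., 5.]])   # high-dim data
Y = np.array([[0., 0.], [0.1, 0.], [0.05, 0.02]])           # a (bad) 2-D projection
fn, mn = neighbor_errors(X, Y, r_high=1.0, r_low=1.0)
# point 2 sits next to point 0 only in 2-D, so it shows up as a false neighbor
print(fn[0])
```

Any neighborhood definition works here; a k-nearest-neighbors variant would only change `neighbor_sets`.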
Now we have two important error measures that tell us which parts of the scatterplot we can trust and which parts are misleading.
The next question is: how do we attach these error signals to our scatterplot?
There are several ways. We can use luminance to encode the errors on the original scatterplot, or we can generate a separate scatterplot that shows the errors in color.
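The luminance variant can be as simple as normalizing per-point error counts into a gray level; the counts below are made-up illustration values:

```python
import numpy as np

# Per-point error counts, e.g. len(false_neighbors) + len(missing_neighbors)
errors = np.array([0, 1, 4, 2, 0])

# Map to a luminance in [0, 1]: 1 = fully trustworthy, 0 = worst error.
# The resulting values can drive a gray level or brightness channel.
lum = 1.0 - errors / errors.max()
print(lum)   # highest-error point gets 0, error-free points get 1
```

A separate error-only scatterplot would use the same values, just fed into a color scale instead of brightness.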
With the methods above, we obtain several visual views of the original dataset.
So now you have three views. With the PCA scatterplot, you can easily read off the clustering structure, and with the error views, you can correct that first impression wherever the projection is unreliable. In fact, this works as a general process: we can compute other types of views to refine our understanding further.
As you can see, these visual methods enable you to gain a good understanding of the table in a short time. That's why we call it: understand complex tables in one minute!