Working with non-numerical data can be tough, even for experienced data scientists.

To be useful, data has to be transformed into a vector space first.

One popular approach would be to treat a non-numerical feature as categorical.


But no two emails are exactly the same, so every sample would end up in its own category, and this approach would be of no use.

Computing distance (similarity) between all data samples would give us a distance (or similarity) matrix.

This is numerical data we could use.
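As a minimal sketch of that idea (the string_distance helper and the toy email strings below are made up for illustration; any distance function appropriate for your data would work):

```python
import difflib
import numpy as np

def string_distance(a, b):
    # Toy text distance: 1 minus difflib's similarity ratio (illustrative only).
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def pairwise_distances(samples, dist):
    # Build a symmetric n-by-n distance matrix from any pairwise distance function.
    n = len(samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(samples[i], samples[j])
    return D

emails = ["Your parcel has shipped.", "Your order is delayed.", "Lunch at noon?"]
D = pairwise_distances(emails, string_distance)  # an n x n matrix of plain numbers
```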


Could we reduce the number of dimensions to a reasonable amount?

The answer is yes!

That's what we have embeddings for.



An embedding is a low-dimensional representation of high-dimensional data.

Typically, an embedding won't capture all information contained in the original data.


A good embedding, however, will capture enough to solve the problem at hand.

There exist many embeddings, each tailored to a particular data structure.

For example, you might have heard of word2vec for text data, or Fourier descriptors for shape image data.


As long as we can compute a distance matrix, the nature of the data is completely irrelevant.

It will work the same, be it emails, lists, trees, or web pages.

We will also go through the pros and cons of this method, as well as some alternatives.

We won't even attempt to cover all the embeddings out there.

Each of them has its own approach, advantages, and disadvantages.

You should also keep a few technical details in mind.

For example, if the distance matrix D is not symmetric, we can use (D + Dᵀ)/2 instead.
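A tiny sketch of that fix, with made-up numbers for the distance matrix:

```python
import numpy as np

# Toy distance matrix that is not symmetric (made-up numbers).
D = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 3.0],
              [4.0, 3.5, 0.0]])

D_sym = (D + D.T) / 2  # symmetrize before handing it to an embedding
```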

Principal component analysis (PCA)

Specifically, let the features be a sample matrix X ∈ ℝ^(n×p), having n features and p dimensions.

For simplicity, let's assume that the data sample mean is zero.

Numerically, we can find V_q (the matrix of the first q principal directions) by applying the SVD to X, although there are other equally valid ways to do it.
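A minimal NumPy sketch of that computation; the random X and the choice q = 2 are placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # stand-in for the sample matrix, n = 200, p = 50
X = X - X.mean(axis=0)              # center the data (zero sample mean)

q = 2                               # target number of dimensions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_q = Vt[:q].T                      # first q principal directions
embedding = X @ V_q                 # the PCA embedding, shape (n, q)
```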

PCA can be applied directly to numerical features.

Or, if our features are non-numerical, we can apply it to a distance or similarity matrix.

If you use Python, PCA is implemented in scikit-learn.
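A quick usage sketch; the random X stands in for numerical features, but the (symmetrized) distance matrix from earlier would work just as well:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # numerical features, or a distance matrix D

pca = PCA(n_components=2)
embedding = pca.fit_transform(X)    # shape (200, 2)
new_points = pca.transform(X[:5])   # out-of-sample transformation is supported
```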

The advantage of this method is that it is fast to compute and quite robust to noise in the data.

Kernel PCA

Kernel PCA is a non-linear version of PCA.

Specifically, there exist a few different ways to compute PCA.

One of them is to compute the eigendecomposition of the double-centered version of the Gram matrix XXᵀ ∈ ℝ^(n×n).
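To make this concrete, here is a small NumPy sketch of double-centering such a matrix and taking its leading eigenvectors; the helper names and the usual sqrt-of-eigenvalue scaling are my assumptions, not something spelled out here:

```python
import numpy as np

def double_center(K):
    # Subtract row and column means (and add back the grand mean) of an n x n matrix.
    n = K.shape[0]
    J = np.full((n, n), 1.0 / n)
    return K - J @ K - K @ J + J @ K @ J

def top_components(K, q):
    # The leading q eigenpairs of the centered matrix give the embedding coordinates.
    Kc = double_center(K)
    eigvals, eigvecs = np.linalg.eigh(Kc)           # returned in ascending order
    idx = np.argsort(eigvals)[::-1][:q]
    return eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0, None))

# Example with the linear Gram matrix X @ X.T (this recovers plain PCA via the kernel view).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
embedding = top_components(X @ X.T, q=2)            # shape (100, 2)
```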

Let x_i, i = 1, .., n be the feature samples.

Kernel PCA requires us to specify a distance.

For non-numerical features, we may need to get creative.

One thing to remember is that this algorithm assumes our distance to be a metric.

If you use Python, Kernel PCA is implemented in scikit-learn.
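A usage sketch with a precomputed kernel; the toy distance matrix, the radial transform, and the gamma value are all assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # toy distance matrix

gamma = 1.0 / np.median(D) ** 2          # hypothetical kernel width; tune for your data
K = np.exp(-gamma * D ** 2)              # radial kernel applied to the distances

kpca = KernelPCA(n_components=2, kernel="precomputed")
embedding = kpca.fit_transform(K)        # shape (100, 2)
```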

The advantage of the Kernel PCA method is that it can capture non-linear data structures.

Multidimensional scaling (MDS)

Multidimensional scaling (MDS) tries to preserve distances between samples globally.

The idea is quite intuitive and works well with distance matrices.

Out-of-sample transformation is not readily available for MDS; in principle, however, it is possible.

The disadvantage is that its implementation in scikit-learn is quite slow and does not support out-of-sample transformation.
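Still, a minimal usage sketch looks like this (the toy distance matrix is made up):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # toy distance matrix

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(D)   # note: no .transform() for new, unseen samples
```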

Use Case: Shipment Tracking

Once the data was collected, the merchant called a data scientist (that's us!).

The dataset contains information on 200 tracked shipments.

The plot below shows how this data looks.

This data looks like trouble: two different flavors of trouble, actually.

The first problem is that the data we're dealing with is high-dimensional.

This is where distance matrices and embeddings will come in handy.

We just need to find a way to compare two shipment routes. Fréchet distance seems to be a reasonable choice.

With a distance, we can compute a distance matrix.

Note: This step might take a while.

Writing a distance function efficiently is key.

For example, in Python, you could use numba to accelerate this computation manyfold.
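Here is one possible sketch, assuming each route is a NumPy array of (x, y) points; the discrete Fréchet distance below follows the standard dynamic-programming formulation and is compiled with numba:

```python
import numpy as np
from numba import njit  # assumes numba is installed

@njit(cache=True)
def discrete_frechet(p, q):
    # Discrete Fréchet distance between two polylines p (n x 2) and q (m x 2).
    n, m = p.shape[0], q.shape[0]
    ca = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            d = np.sqrt((p[i, 0] - q[j, 0]) ** 2 + (p[i, 1] - q[j, 1]) ** 2)
            if i == 0 and j == 0:
                ca[i, j] = d
            elif i == 0:
                ca[i, j] = max(ca[i, j - 1], d)
            elif j == 0:
                ca[i, j] = max(ca[i - 1, j], d)
            else:
                ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d)
    return ca[n - 1, m - 1]

# This function can be plugged into the pairwise distance-matrix loop from earlier.
route_a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 1.5]])
route_b = np.array([[0.0, 0.2], [1.1, 0.9], [2.0, 2.0]])
print(discrete_frechet(route_a, route_b))
```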

We will use the embeddings we discussed earlier: PCA, Kernel PCA, and MDS.

The labeled data marks four trade posts connected by six trade routes.

Two of the six trade routes are bidirectional, which makes eight shipment groups in total (6+2).

This is a good start.

Embeddings in a model pipeline

Now, we are ready to train an embedding.

For Kernel PCA, we should not forget to apply a radial kernel to the distance matrix beforehand.
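Putting it together, fitting all three embeddings on the distance matrix might look like the sketch below; the placeholder D, the kernel width gamma, and the choice of three output dimensions are assumptions (the number of dimensions is discussed next):

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import MDS

# Placeholder for the Fréchet distance matrix computed in the previous step.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

q = 3                                            # assumed number of output dimensions
gamma = 1.0 / np.median(D) ** 2                  # hypothetical kernel width

pca_emb = PCA(n_components=q).fit_transform(D)

K = np.exp(-gamma * D ** 2)                      # radial kernel applied beforehand
kpca_emb = KernelPCA(n_components=q, kernel="precomputed").fit_transform(K)

mds_emb = MDS(n_components=q, dissimilarity="precomputed", random_state=0).fit_transform(D)
```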

How do you pick the number of output dimensions?

The analysis showed that even 3D works okay.

How about one linear and one non-linear model: say, Logistic Regression and Gradient Boosting?

For comparison, let's also use these two models with a full distance matrix as the input.
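A sketch of that comparison, reusing D and the embeddings from the sketch above; the random labels y are placeholders standing in for the eight shipment groups:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder labels; in practice y comes from the labeled shipment data.
rng = np.random.default_rng(0)
y = rng.integers(0, 8, size=D.shape[0])

inputs = {"PCA": pca_emb, "Kernel PCA": kpca_emb, "MDS": mds_emb, "full distance matrix": D}
models = {"Logistic Regression": LogisticRegression(max_iter=1000),
          "Gradient Boosting": GradientBoostingClassifier()}

for feat_name, features in inputs.items():
    for model_name, model in models.items():
        score = cross_val_score(model, features, y, cv=5).mean()
        print(f"{model_name} on {feat_name}: {score:.2f}")
```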

Time for the verdict:

Are embeddings something that a data scientist should use?

Let's take a look at both sides of the story.
