Most of my applied data science work is in text-heavy domains where the objects are small, there isn’t a clear “vocabulary”, and most of the tasks focus on similarity. My go-to tool is almost always cosine similarity, although other techniques such as Levenshtein distance or character n-grams also feature heavily. These tools work well when you have a small corpus and there might be an imbalance between the source corpus and the destination corpus, for example:
Trying to find the most similar product that matches the offer – 10% off Sketchers
Again, these methods are tried and true, but they also largely come from the machine learning era of AI, as the godfather of ML, Andrew Ng, would say. With deep learning becoming incredibly practical, and techniques such as word embeddings becoming increasingly popular, new ways of looking at text are starting to take hold.
I use embeddings almost exclusively these days, with libraries like spaCy making it impossibly easy to create powerful deep learning models for text. Still, I sometimes find myself having to perform a basic similarity task, and over the weekend, after being stumped on what I thought was a basic task, I decided to have a go at using embeddings on single words.
Take two words, bag and gab. Now, as an applied data scientist, I view all methods and techniques as tools, not incantations from MTG. If I viewed cosine similarity as some magical amulet, I would just shove these two words in, take the output and declare profit. The problem is, you have to decide what the output means, and optimize your tool for that goal. If you run these two words through cosine similarity, you get the value 0.6. This is totally fine if your goal is to see how far apart these words are, but if your goal is to ensure your system is capable of seeing bag and gab as completely different words, but bag and bags as similar, then cosine similarity may not be the right tool.
For my purposes, I needed something:
- That didn’t just look at the character counts, but understood their context
- That was capable of being tuned so I could adjust its thresholds depending on the data set
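To make the first requirement concrete, here is a minimal sketch of cosine similarity over plain character counts. This is an assumption about one possible vectorization, not necessarily the one behind the 0.6 figure above, and the function name `char_count_cosine` is made up for illustration. The point is that count vectors ignore order, so anagrams like bag and gab get identical vectors.

```python
from collections import Counter
from math import sqrt


def char_count_cosine(a: str, b: str) -> float:
    """Cosine similarity between the character-count vectors of two strings."""
    ca, cb = Counter(a), Counter(b)
    # Dot product over the characters the two strings share.
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


# Anagrams have identical count vectors, so they score as (essentially) identical.
print(char_count_cosine("bag", "gab"))
# Adding one character changes the vector only slightly.
print(char_count_cosine("bag", "bags"))
```

With this representation bag and gab look like the same word, which is exactly the behaviour I wanted to avoid.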
Essentially, I needed something very similar to Word2vec, but for single words, so with an afternoon free, I thought, hey, let’s see what a Char2vec might look like.
The goal of my experiment was to explore how a character-to-vector model would perform compared to something like tf-idf based character n-grams. I used the same approach as word embeddings: I pulled apart the two words, slid a window over their characters to build up the embedding matrix, then computed cosine similarity on the resulting vectors.
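The sliding-window idea can be sketched roughly as follows. This is one possible reading of the approach, not the code from my notebook: it assumes the "embedding" for a word is simply the count vector of its character n-grams, with no learned weights, and the names `char_ngrams` and `ngram_cosine` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_min=2, n_max=3):
    """Slide windows of width n_min..n_max over the word and collect the n-grams."""
    return [
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    ]


def ngram_cosine(a, b, n_min=2, n_max=3):
    """Cosine similarity between the n-gram count vectors of two words."""
    va = Counter(char_ngrams(a, n_min, n_max))
    vb = Counter(char_ngrams(b, n_min, n_max))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = sqrt(sum(v * v for v in va.values()))
    norm_b = sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # word too short to produce any n-gram
    return dot / (norm_a * norm_b)


print(char_ngrams("bag"))           # ['ba', 'ag', 'bag']
print(ngram_cosine("bag", "gab"))   # 0.0, the two words share no 2- or 3-grams
print(ngram_cosine("bag", "bags"))  # about 0.77, three of five n-grams are shared
```

Because the windows preserve character order, anagrams no longer collapse onto the same vector, and the window range (`n_min`, `n_max`) becomes a knob that can be tuned per data set.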
It was definitely interesting. A couple of observations:
- For the pair bag, gab, when the sliding window is set to (2,3) we get a solid 0 with both models! However, for the pair bag, bags, the embedding worked better than tf-idf
- Overall, for cases where the words are from the same distribution, embedding yields a higher similarity score
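For comparison, the tf-idf baseline can be sketched in plain Python as well. Several assumptions are baked in here: the corpus is just the words being compared, the idf uses the smoothed form ln((1 + n) / (1 + df)) + 1, and the helper names are made up for illustration. It is not meant to reproduce the exact scores from my notebook.

```python
from collections import Counter
from math import log, sqrt


def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of width n_min..n_max, in sliding-window order."""
    return [
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    ]


def tfidf_vectors(words, n_min=2, n_max=3):
    """Treat each word as a document and weight its n-gram counts by smoothed idf."""
    docs = [Counter(char_ngrams(w, n_min, n_max)) for w in words]
    n_docs = len(docs)
    # Document frequency: in how many words does each n-gram appear?
    df = Counter(gram for doc in docs for gram in doc)
    idf = {gram: log((1 + n_docs) / (1 + d)) + 1.0 for gram, d in df.items()}
    return [{gram: tf * idf[gram] for gram, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[g] * v[g] for g in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


bag, gab = tfidf_vectors(["bag", "gab"])
print(cosine(bag, gab))    # 0.0, no shared n-grams

bag, bags = tfidf_vectors(["bag", "bags"])
print(cosine(bag, bags))   # lower than the unweighted n-gram count score
```

Under these assumptions the idf term gives extra weight to the n-grams that only bags contains, which pulls the bag, bags score below what unweighted n-gram counts give for the same pair.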
Honestly, I was hoping Char2vec would redefine my career and put me on the front page of AI Today. The reality was that it definitely does well, and in some cases better than traditional methods like tf-idf based character n-grams, but not as well as I’d hoped.
Firstly, I’m almost sure that if we were to stem the words going in, tf-idf based character n-grams would perform better.
Secondly, I didn’t build up weights for my Char2vec model, so there isn’t an estimator or optimization capability, which severely hampers the performance of the model. And as it was the day after Christmas and I was struggling with the effects of a food coma, I didn’t have the intellectual horsepower to try and implement sequences to improve the performance of Char2vec.
It wasn’t a complete waste of time though, as it definitely was the right solution to the problem I was trying to solve, so I’m happy about that, and I managed to capture the crude implementation in a notebook, so there is that too.