There are so many ways to screw up
And not a single reason to put it more mildly. AI offers some awesome possibilities – and a plethora of things you can do wrong. Remember the series about mistakes of AI adoption? We covered data strategy, the mismatch of business and technology, the human factor, trust in AI… And that’s still not all. I’m not trying to scare you away, though. It’s the spooky Halloween season, so we’ll be telling ghost stories – only with AI fails – so you can be more cautious and mindful in the future. You know, learn the lesson before it hurts you.
Why AI projects fail – common problems
Big is never big enough
Big data is a buzzword, but it’s also rather enigmatic. How big is “big”? How much data do you need? Yes, data is a problem. Not just because there’s not enough of it – though sometimes there isn’t, naturally – but also due to issues with labeling, training data, etc. Because an AI system can only be as good as the data it’s fed, you can’t get any tangible results if there’s no data behind it. So what’s the problem with data? Well, where do we start…
First, there may not be enough of it. If the business you’re running is small and has a limited set of data, you have to carefully discuss your expectations and the current state of your data set with an experienced AI advisor or data scientist. How much data is enough? See, that’s a tricky question because it depends. It depends on the use case, the type of data, and the result you expect. However, you’ll often hear “the more, the better”. It seems that in data science projects, more is more, period.
Do as I do, robot
We tend to expect that AI systems perform intellectual tasks as well as we do – or better. That’s a reasonable thing to expect since we all know that “AI is outperforming humans at more and more tasks”. It is. It even beat a Go champion. However, our minds are much more flexible than AI systems. Think about recommendations: you meet an interesting person at a startup event. Let’s give him a name: it’s John. John enjoys talking to you and appreciates your knowledge of business and technology – he asks for a recommendation of a book that will help him gain more knowledge about these things too. You quickly run through all the titles in your head. There’s book A, B, C, D, E… OK, John, I’ve got it. You should read (insert title here). How did you know what you should recommend to John?
Your brain scanned the information you’ve gathered so far – what John knows, what he was interested in when talking to you, what his style is – to assess which book will be best for him, even though you have no idea about his actual taste in books. You had a feeling he’d like it, and you might be right.
Now, let’s look at an AI system that “meets” John. John enters the website of an online bookstore, and he’s instantly welcomed with a list of bestselling books. Nothing interesting, so he keeps clicking “next”. The AI has no context about John – it’s in a “cold start” situation, where it can’t generate personalized recommendations because it has no information about him. But then John clicks the search bar and looks for “startup”. Oh, there’s the list. He browses and clicks through some titles. At this point, the AI figures out that startups are what John likes, and recommends content on this subject. It doesn’t know John very well, but it uses data about what other users who browsed (or bought) the book “Startup” also liked. But what will happen if nobody else looked for startup books? John will not get relevant recommendations because the system didn’t have any data to learn from.
You and the AI may end up recommending different books for John. You both can be right, you both can be wrong, or one of you will be the winner. However, your brain never said “insufficient data” – it just improvised. Artificial intelligence cannot do that. And we, as AI’s “employers”, cannot expect it to perfectly reflect the operations and intricacies of the human brain.
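If you’d like to see John’s bookstore visit in code, here is a minimal sketch of the idea, not how any real bookstore does it. It assumes a toy “people who viewed this also viewed that” recommender built from browsing sessions. The function names (`build_co_views`, `recommend`), the book titles, and the bestseller fallback are all made up for illustration.

```python
from collections import Counter, defaultdict


def build_co_views(sessions):
    """Count how often two books show up in the same browsing session."""
    co_views = defaultdict(Counter)
    for books in sessions:
        unique = set(books)
        for book in unique:
            for other in unique:
                if other != book:
                    co_views[book][other] += 1
    return co_views


def recommend(history, co_views, bestsellers, k=3):
    """Return up to k titles; fall back to bestsellers when there is no signal."""
    scores = Counter()
    for book in history:
        # .get() avoids creating empty entries for books nobody else viewed
        scores.update(co_views.get(book, {}))
    for book in history:
        scores.pop(book, None)  # don't recommend what the user already saw
    if not scores:
        # Cold start (or a book no one else browsed): nothing to learn from
        return bestsellers[:k]
    # Highest count first, ties broken alphabetically for stable output
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:k]]


# Hypothetical browsing sessions of other users
sessions = [
    ["Startup", "Lean Basics", "Pitching 101"],
    ["Startup", "Lean Basics"],
    ["Cooking at Home", "Baking Bread"],
]
bestsellers = ["Bestseller 1", "Bestseller 2", "Bestseller 3"]
co_views = build_co_views(sessions)

print(recommend([], co_views, bestsellers))                 # John just arrived
print(recommend(["Startup"], co_views, bestsellers))        # John viewed "Startup"
print(recommend(["Obscure Title"], co_views, bestsellers))  # nobody else viewed it
```

The first and third calls both land on the generic bestseller list: with no history, or with a history no other user shares, the system has nothing to go on. Only the second call produces something personal, and only because other users left data behind.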
I thought labeling was passé
Putting labels on people – sure. Putting labels on data – never. Data doesn’t just have to exist, it has to be labeled – so it has a meaning, too. If data isn’t properly organized, humans have to devote their time to the tedious task of labeling it. Data labeling is troublesome, yet somehow many companies just don’t think about it at all. In an article published on the AWS blog, Jennifer Prendki writes:
For many machine learning models that are trained in a supervised way (supervised learning), data labeling is crucial – the models simply require the data to be labeled, otherwise they won’t make sense of it. And because data labeling is such a huge issue, data scientists often choose to use data that has already been labeled – let’s take the example of images. There is a whole variety of quality images available, yet many machine vision projects rely on ImageNet, one of the largest labeled image datasets, which contains about 14 million images. Additionally, more and more data is created every day. About 50 terabytes of data is uploaded to Facebook every single day. And Facebook isn’t the only data-generating source. With all this data, we have actually reached a point where there aren’t enough people on the planet to label it all.
There’s so much data, it can’t be right
And it might not be right. You may have this feeling that you have all the data you need, you’re just killing it! There might be a lot of data – but is it the right data? If you run an e-commerce business, you likely have a lot of information about your customers – their names, addresses, billing information, perhaps credit card information. You know what they buy and when they buy it. You know what they browse. You also know when they contacted you and via what channel. Now, what data is necessary? You will look at different information when addressing different problems. So when you’re implementing a recommender system, you may not need all the demographic data, but the purchase history is a must. However, when you want to predict churn, different factors will come into play.
So you may have all the data in the world (no, actually, that’s impossible), but is it the data you need? It’s tempting to collect all the data you can, but it’s just not necessary. The key is to get it right, not to collect it all. Data is not a collectible item.
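To make “the right data, not all the data” a bit more tangible, here is a small sketch that picks only the fields a given problem needs from one customer record. Everything in it is hypothetical: the field names, the sample customer, and especially the list of churn fields, which is only a guess at what “different factors” could look like.

```python
# Hypothetical mapping: which customer fields each use case actually needs.
# The recommender list follows the text (purchase history is a must,
# demographics are not); the churn list is purely an assumption.
FIELDS_BY_USE_CASE = {
    "recommendations": ["customer_id", "purchase_history", "browsing_history"],
    "churn": ["customer_id", "last_purchase_date", "support_contacts"],
}


def select_fields(record, use_case):
    """Keep only the fields relevant to the given use case."""
    wanted = FIELDS_BY_USE_CASE[use_case]
    return {name: record[name] for name in wanted if name in record}


# A made-up customer record with far more data than any one problem needs
customer = {
    "customer_id": 42,
    "name": "John",
    "address": "1 Example Street",
    "purchase_history": ["Startup"],
    "browsing_history": ["Startup", "Lean Basics"],
    "last_purchase_date": "2019-10-01",
    "support_contacts": 2,
}

print(select_fields(customer, "recommendations"))
print(select_fields(customer, "churn"))
```

Neither selection touches John’s name or address. The data is there, but for these two problems it is just along for the ride.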