Follow the data…
Setting the scene for data labelling
By Daniel McMahon, CEO
There is a common misconception that the explosive growth across the 3 V’s of data (volume, variety and velocity) delivers instant value. I often hear ‘it’s a data play’ or ‘it’s all about the data’ as bywords for smart decision making or investment insight, normally accompanied by a knowing nod of approval.
I am as excited as anyone about the proliferation of data, and particularly about the use of modern machine learning techniques to exploit its potential. But I am often confused when I read about great advances such as AlphaGo, yet can’t get some of the simpler things in life to work, such as getting Amazon’s Alexa to play my daughter’s favourite nursery song.
Inevitably my search for answers brings me back to the data, where a few home truths quickly become apparent.
Where to get started?
With this in mind, the big data divide I’ve learned about is the difference between structured and unstructured data. For structured data, think financial transactions, logs or receipts: things that come with time/date stamps, locations or any other existing feature that lets you get stuck into the data in a way that helps with your problem. Unstructured data is everything else, which is a lot! It comprises up to 90% of all data and includes medical scans, satellite photos, tweets, CCTV, customer feedback and news articles… the list goes on. Crucially, this lack of structure, or of levers for exploitation, means the data needs preprocessing before it can be utilised, which is no small task.
So why is this important?
The data behind this explosive growth, while amazing in its potential, is being generated in a form that we are unable to take advantage of immediately: it is trapped in an unstructured state. That means there’s a barrier to getting started with this incredible resource.
To unlock its value it needs to be structured through a process called labelling. This is where the features that are of interest to you are identified and tagged as per your requirements. This can range from identifying your favourite brands in TV shows, to sentiment analysis of customer feedback, to drawing boxes around cars in traffic or maybe counting plastic on beaches from drone footage.
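To make that a little more concrete, here is a minimal sketch of what labels for the "boxes around cars in traffic" example might look like once a person has tagged an image. The `BoxLabel` structure, its field names and the file name are my own illustrative assumptions rather than any particular tool's format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxLabel:
    """One human-drawn box around an object in an image (hypothetical format)."""
    image: str      # file the label belongs to (made-up name)
    category: str   # what the labeller says is inside the box
    x: int          # left edge of the box, in pixels
    y: int          # top edge of the box, in pixels
    width: int      # box width, in pixels
    height: int     # box height, in pixels


def count_by_category(labels):
    """Count how many boxes were drawn for each category."""
    return Counter(label.category for label in labels)


# A single traffic image, now structured by three labels.
labels = [
    BoxLabel("junction_001.jpg", "car", 34, 120, 88, 52),
    BoxLabel("junction_001.jpg", "car", 210, 131, 92, 55),
    BoxLabel("junction_001.jpg", "bus", 320, 90, 160, 110),
]

counts = count_by_category(labels)
```

Once the image carries labels like these, a question such as "how many cars are in this picture?" becomes a simple lookup (`counts["car"]`) rather than something a person has to answer by eye.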
Structure for the sake of it?
While organising and structuring data for its own sake brings the benefits of a tidy house or a searchable catalogue, it’s what this enables with machine learning that gets really exciting.
High-quality labelled data is a key component of AI development; without it we would be a long way from where we are today. Many machine learning methods (the tools that make AI possible) are reliant on labelled data. They use it as training data, to teach machines what to think, or as validation data, to verify the machines’ decisions.
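The split between training and validation data can be sketched in a few lines. This is only an illustration under my own assumptions: the function names, the 20% validation share and the toy customer feedback are all made up, and `predict` stands in for whatever model you have trained.

```python
import random


def split_labelled(examples, validation_fraction=0.2, seed=0):
    """Shuffle labelled examples and hold some back for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = int(len(shuffled) * validation_fraction)
    training = shuffled[n_validation:]
    validation = shuffled[:n_validation]
    return training, validation


def accuracy(predict, validation):
    """Share of held-back examples where the machine agrees with the human label."""
    if not validation:
        return 0.0
    correct = sum(1 for text, label in validation if predict(text) == label)
    return correct / len(validation)


# Toy labelled customer feedback: (text, human-assigned sentiment).
examples = [(f"feedback {i}", "positive" if i % 2 == 0 else "negative")
            for i in range(10)]

training, validation = split_labelled(examples)
```

The training portion is what the machine learns from. The validation portion is kept out of sight during training, so that `accuracy` measures how the machine copes with examples it has not already seen.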
In practice, if you want a computer vision tool to be able to recognise an aeroplane in camera footage, you need to show it thousands of examples of planes in all sorts of orientations, settings and conditions, so it can learn what a plane looks like and doesn’t confuse a plane with me getting dropped off to college with an ironing board stuck out of the car window.
How to do it?
This is the inescapable part, where there is no substitute for hard work and graft. We need humans — people to provide the labels. Their efforts can be multiplied with automation and machine learning, but the core component of a high-quality label remains a person. We still need to teach the machines what to do and not do!
The volume and quality requirements of training and validation data in current applications are simply staggering and have given rise to a largely hidden industry that makes AI tick. It is the hard yards that enable amazing leaps with AI and the detail is well described by Tim here.