A moveable feast: How to build a labelled dataset
The steps needed to build a useful dataset, from hundreds of samples to hundreds of thousands
By Tim Lingard, Data Scientist
Data, we’re told, is the future. Gleaming rows of neatly catalogued files just waiting to be called into action, to serve the glorious purpose of evaluating lip-sync battles, rewriting Harry Potter, or finding the perfect San Diegan burrito. This shining reservoir of ground truth is, sadly, as far from reality as humanity is from tap-dancing on Jupiter; we’re often instead stuck staring at a sample or summary of a dataset thinking “how on earth can I trust the rest of this?”
There are many wonderful posts about the importance of getting labelling right: from Peter Gao’s “It’s the Data, Stupid!”, to the origins of Tyler Ganter’s trust issues, to the excellent, no-holds-barred data strategy of Kevin Schiroo. This post instead aims to walk the reader through the steps needed to build a useful dataset, from hundreds of samples to hundreds of thousands, including the steps needed to ensure trust at each stage. To keep things simple we’ll focus on reproducing the Food-101 image classification dataset (101 distinct categories, 101,000 images, handily available through TensorFlow datasets). However, the basic processes are transferable to hierarchical labelling and more complex annotation.
The appetiser: 100s of images
Let’s say a new project has turned up on our doorstep: we need to be able to label uploaded content in a meal-sharing app to allow filtering, recommendations and other stomach-rumble-inducing user experiences. Hungry to get stuck in, let’s grab a small sample of data and have a flick through, to get a taste for things to come. The simplest tool here is probably your file explorer of choice, but since we need to be able to dig in and do some labelling, setting up a tool such as Label Studio, Prodigy, superintendent or LightTag (for NLP) will allow you to flick through local (or remote) files, labelling and annotating your data with ease.
Here’s an example of a configured Label Studio interface for our dataset; it’s easy to filter classes using the text entry box, and the interface is incredibly configurable — though we run out of keyboard shortcuts for our 101 classes! If you happen to have $390 to hand (or $490 for a company licence), Prodigy is another slick option to deal with a lot of the headaches associated with visualisation, managing which data has already been labelled and optimising the labelling interface.
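For a concrete picture of what that configuration looks like, here’s a sketch of a Label Studio labelling config for an image classification task like ours, using Label Studio’s XML tag format. The class list is truncated to three of the 101 classes, and the tag names (`image`, `food`) are our own choices rather than anything mandated:

```xml
<View>
  <Image name="image" value="$image"/>
  <!-- The Filter tag provides the text entry box for narrowing down classes -->
  <Filter toName="food" minlength="0" placeholder="Filter classes"/>
  <Choices name="food" toName="image" choice="single" showInline="true">
    <Choice value="apple_pie"/>
    <Choice value="baby_back_ribs"/>
    <Choice value="baklava"/>
    <!-- ...the remaining 98 Food-101 classes... -->
  </Choices>
</View>
```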
We’ll want to build a specification document using simple language and plenty of examples, to avoid misunderstandings and ensure consistency across the dataset. This document should highlight some commonly confused categories and what to do in the event of an uncertain label. Think of it like an examiner’s answer booklet that we’ll be able to use to judge our own work in the future. It’s not uncommon for this booklet to evolve; make a note of what image IDs prompt each decision, so we can go back and alter labels if needed!
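One lightweight way to keep that evolving answer booklet honest is a machine-readable decision log. The sketch below is entirely hypothetical (the function names, image IDs and rule text are invented for illustration), but it captures the idea: every spec decision records the image IDs that prompted it, so affected labels can be replayed and corrected later.

```python
# Hypothetical decision log for an evolving labelling specification.
# Each entry records the rule, the image IDs that prompted it, and
# (optionally) the corrected label, so relabels can be replayed later.
decision_log = []

def record_decision(rule, image_ids, new_label=None):
    """Append a spec decision; new_label lets us replay relabels later."""
    entry = {"rule": rule, "image_ids": list(image_ids), "new_label": new_label}
    decision_log.append(entry)
    return entry

def relabels(log):
    """Yield (image_id, new_label) pairs for decisions that changed labels."""
    for entry in log:
        if entry["new_label"] is not None:
            for image_id in entry["image_ids"]:
                yield image_id, entry["new_label"]

record_decision(
    "Plated burrito halves still count as 'burrito', not 'tacos'.",
    image_ids=["img_000123", "img_004567"],
    new_label="burrito",
)
print(dict(relabels(decision_log)))
# {'img_000123': 'burrito', 'img_004567': 'burrito'}
```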
The f*ck that threshold: 1000s of images
There comes a point in the growth of any dataset when the efforts of one lonesome labeller just don’t cut it. This tipping point depends on the labelling task at hand and the importance of the dataset, and is generally characterised by a sudden sinking feeling when you realise how much work is left to do. In my experience, it’s often accompanied by an expletive and a desire for a nap/long walk/strong drink, but side effects may vary.
Data scientists want to be building, experimenting and analysing, not (semi-) mindlessly clicking! Not to mention, their time is expensive and so should be budgeted carefully. To cap it all, tired eyes make mistakes and, believe me, after labelling a few thousand of these images you’ll be very tired (and very hungry).
But not to fear! There are simple strategies we can employ to make the most of our precious time, one of which is the process of ✨ Active Learning ✨. The use of active learning involves simultaneously training a model while labelling, and letting the model dictate which items in the remaining dataset will help it learn most efficiently. Voyage’s blog contains a great take on active learning and why it is so valuable.
There are many different active learning strategies out there: from simply taking the images with the highest model uncertainty, to training a Bayesian classifier and selecting batches where the model provides multiple confident yet conflicting predictions (BALD & BatchBALD).
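The simplest of those strategies can be sketched in a few lines. This is a minimal, library-free illustration, not a production implementation: `predict_proba` stands in for whatever model you have trained so far, and `most_uncertain` and `fake_proba` are hypothetical names invented here.

```python
# Minimal sketch of uncertainty sampling: pick the unlabelled images whose
# predicted class distribution has the highest entropy, i.e. where the
# model trained so far is least certain.
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(pool_ids, predict_proba, k=2):
    """Return the k pool items the current model is least sure about."""
    scored = [(entropy(predict_proba(i)), i) for i in pool_ids]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

def fake_proba(i):
    """Toy stand-in model: confident on even IDs, uncertain on odd ones."""
    return [0.98, 0.01, 0.01] if i % 2 == 0 else [0.4, 0.35, 0.25]

print(most_uncertain(range(6), fake_proba, k=2))  # [5, 3]
```

In a real loop you would label the returned items, retrain, and query again; batch-aware strategies like BatchBALD exist precisely because repeatedly taking the top-k most uncertain items can select near-duplicates.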
One simple way to get started with active learning, if you’re planning on rolling your own, is the modAL python library. Luckily for us the clever people over at Label Studio and Prodigy (among others) let us integrate an ML backend and will use simple active learning strategies to cut down on wasted labelling effort! This is great for rather homogeneous datasets (e.g. repeated-MNIST, as used in the BALD post linked above), but for one as diverse as ours there’ll still be a significant amount of legwork needed.
It’s at this point that bringing other team members onto the task will provide great dividends, as they can both check your work for errors and provide a springboard to help solidify and expand the specification document mentioned above.
Getting to the meat of things: 10,000+ images
We’re now reaching the stage where there’s practically no way for a single person to complete this work. Many incredibly dedicated people have single-handedly created mind-numbingly vast labelled datasets (often graduate students, the pack animals of research), but to make the most of our shiny data scientist’s time, we need a better way to scale.
While active learning is a fantastic way to maximise our effort, all but the most conservative acquisition strategies can introduce some form of unknown bias into what ought to be a highly trusted dataset. What we really need, the teletransportation paradox aside, is to copy ourselves hundreds of times and unleash this army of “perfect” workers onto our dataset.
One successful approach, which led to the ubiquitous ImageNet dataset, is to use the hordes of crowd workers available through workplaces like Amazon Mechanical Turk or Clickworker (or the incandescently wonderful Zooniverse, if you’re an academic able and willing to engage with their volunteer citizen scientists). These workers (besides the Zooniverse volunteers) are paid a handful of cents every time they complete a task, the idea being that the massively parallel nature of the job allows hundreds or thousands of workers to contribute, with each still earning a meaningful amount.
Unfortunately, the glory days of ImageNet are long gone. Using a site like Mechanical Turk in 2021 will often lead to immense frustration, as a proportion of the workforce is notorious for ignoring even the best-phrased instructions. That being said, there are many absolutely fantastic workers buried in the pool, and identifying and prioritising them is a topic for another blog post!
So how do we leverage the scale of the crowd, without making ourselves vulnerable to the errors and noise that could pollute our perfect platter of pigeonholed pictures? This is the promise of companies like Cloud Factory, Scale AI, The Hive and (my personal, slightly biased favourite) 1715 Labs. These companies gather your requirements through an API, workflow builder and possibly a very friendly Zoom call, then go off and leverage heavily monitored internal teams of workers and/or massive, cleverly filtered distributed crowds to all but eliminate the risk associated with outsourcing.
Trust is a beautiful thing
Outsourcing the data labelling process can feel like abandoning your pet at doggy day-care while you go on holiday: you’re off to do enjoyable things but who knows what state Fenton will be in when you return.
This is where that specification document from before really comes to the rescue: if you can comprehensively convey what you need, with examples and supporting text, and translate that knowledge into a form the labelling provider can use (some are more flexible than others here!), then they can leverage all your learning for both their own QA and to pass on, in a distilled form, to the workforce. Trust the kennel to understand your dog, once you’ve provided a list of allergies!
That being said, since we’re adopting a zero-tolerance attitude towards errors, we’ll still need a clever way to do QA. Our old friend from the active learning days comes in handy again here: we can train our model on the labelled dataset and search for odd predictions, where the model is confidently wrong. Alternatively, if the model is appropriate, we can use the dense representations of the data it has learned to search for labels in suspicious regions of embedding space, a great trick made easy(ish) by the likes of Aquarium Learning and Zegami.
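The “confidently wrong” check reduces to a very small amount of code once you have the model’s predictions in hand. The sketch below is illustrative only: `suspicious_labels`, the sample tuples and the image IDs are all hypothetical, and the 0.9 threshold is a starting point you would tune for your own model’s calibration.

```python
# Hedged sketch of the "confidently wrong" QA trick: after retraining on
# the labelled dataset, surface samples where the model's top prediction
# disagrees with the human label AND the model is highly confident.
# These are prime candidates for a labelling error (or a spec gap).

def suspicious_labels(samples, threshold=0.9):
    """samples: iterable of (image_id, human_label, predicted_label, confidence)."""
    return [
        image_id
        for image_id, human, predicted, conf in samples
        if human != predicted and conf >= threshold
    ]

audit = [
    ("img_01", "ramen", "ramen", 0.97),      # model agrees: fine
    ("img_02", "pho", "ramen", 0.95),        # confident disagreement: review!
    ("img_03", "gyoza", "dumplings", 0.55),  # unsure disagreement: lower priority
]
print(suspicious_labels(audit))  # ['img_02']
```

Everything this flags goes back to a human for review against the spec document; some flags will be genuine label errors, others will reveal that the spec itself needs a new rule.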
Now we’re cooking: 100,000+ images
Many datasets don’t need to be this big. Many need to be bigger. Lots don’t even need labelling! Chances are, if you’ve gotten this far and are dealing with data at this scale, you’ve got a strategy (or strategic partner) in mind. If not, hopefully the information above will help guide your thinking and provide clarity on the questions you ought to ask (how do I ensure consistency across the dataset? how do I make the best use of my / my team’s time?). Please get in touch with 1715 Labs if you want to pick our brains about labelling, or if you have an interesting data labelling story, let me know over on Twitter!
Best of luck!