The golden rules of crowdsourced labelling

Seven tips for getting the most out of your workers

By Tom Nottingham, Data Science Intern

All machine learning (ML) models must be trained. Fundamentally, this requires structured data with relevant labels. These labels are often collected at no cost ‘naturally’ by people simply engaging in normal usage online. They can also be generated by the model itself, and then corrected or augmented by humans. Some models use active learning techniques to increase efficiency by targeting only the training data that would improve the model the most.
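To make that last idea concrete, here's a minimal sketch of uncertainty sampling, one common active learning strategy. The function name and the binary-classification setup are illustrative assumptions, not a reference to any particular library:

```python
def most_uncertain(probabilities, k=2):
    """Pick the k unlabelled examples whose predicted class probabilities
    are closest to 50/50, i.e. the ones a human label would help most.
    This is uncertainty sampling, one simple active learning strategy."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Model confidence that each of five audio clips contains speech:
probs = [0.98, 0.51, 0.03, 0.47, 0.90]
to_label = most_uncertain(probs)
# clips 1 and 3 are the most ambiguous, so they go to human labellers
```

The rest of the clips, which the model is already confident about, never need a (paid) human label at all.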

When push comes to shove, however, there's likely to be a time when labelled data needs to be created by humans. Whether that means classifying audio clips, highlighting pieces of text or drawing annotations on images, a human given the right instructions will be able to provide these labels.

At smaller scales, it's possible to train a single person to perform your task. This can work well for high-complexity, low-volume tasks, such as a difficult annotation problem, where you can invest time into training your annotator. But as Tim explains, whilst this might solve your immediate problem, it doesn't scale to larger datasets. For these larger, more common problems, with volumes of more than a few hundred classifications, it's more efficient to use a distributed crowd, such as the Amazon Mechanical Turk platform (MTurk).

Human classifiers are key to improving ML models, but using crowds of workers comes with its challenges. As a data science intern with 1715 Labs over the last few months, I’ve come to appreciate some of these, and have learnt and developed ways to mitigate them. In this post, I've compiled seven of my most important learnings.

1. Keep it simple

Breaking down a complex task into smaller, more manageable pieces not only makes it easier for a crowd worker to quickly get started on a project, but also gives you flexibility in how you compile your final result. It can also provide more insight into your labels and how they reached their final status.

Take the example of a computer vision (CV) task in which you need to provide annotations of an image with labels similar to the Cityscapes dataset. Asking one worker to use 30 different tools to annotate a single image would take a huge amount of time, and they would likely lose interest and start cutting corners to get the job done; I certainly would! Asking them to annotate with only one label at a time (road, car, pavement, etc.) not only gives workers the opportunity to excel at a particular annotation type, but also makes the work more engaging, as they get to see more interesting images faster rather than being stuck on the same one.
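As a rough sketch of that split, the snippet below expands one complex multi-label job into many simple single-label tasks. The dict field names are hypothetical, not a real platform's task schema:

```python
from itertools import product

def single_label_tasks(image_ids, labels):
    """Expand a multi-label annotation job into one simple task per
    (image, label) pair, so each worker only ever annotates one class
    at a time and can move quickly between images."""
    return [{"image": img, "label": lab}
            for img, lab in product(image_ids, labels)]

tasks = single_label_tasks(["img_001", "img_002"],
                           ["road", "car", "pavement"])
# 6 small single-tool tasks instead of 2 large multi-tool ones
```

Each small task can then be assigned, paid and quality-checked independently, which also makes the aggregation step later much easier to reason about.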

2. Give proper tutorials

It’s important that your workers understand exactly what you need them to do. Lay out exactly what the task is in clear language, showing examples of successes and failures along the way. If a worker is stuck or comes across a difficult edge case, the tutorial needs to quickly and efficiently tell them exactly what they are looking for.

3. Use the collective intelligence of multiple workers

Five sets of eyes are better than one, and even a highly competent worker won't get things right 100% of the time. Cross-checking results is one way to assure quality, but often a better final classification can be reached by aggregating multiple classifications from different workers. Using Bayesian statistics and confusion matrices to analyse and weight classifications can also provide a confidence value associated with each label, which can be as useful as the label itself.
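A full Bayesian treatment is beyond a blog post, but here's a minimal sketch of accuracy-weighted voting, a simpler cousin of the confusion-matrix approach mentioned above. The worker names and accuracy figures are made up for illustration:

```python
from collections import defaultdict

def aggregate_labels(classifications, worker_accuracy):
    """Combine several workers' labels for one item into a single label
    plus a confidence score, weighting each vote by the worker's
    estimated accuracy (e.g. measured on known-answer tasks)."""
    weights = defaultdict(float)
    for worker, label in classifications:
        # Unknown workers get a neutral weight of 0.5
        weights[label] += worker_accuracy.get(worker, 0.5)
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total

# Three workers label the same image; "alice" is the most reliable.
votes = [("alice", "car"), ("bob", "car"), ("carol", "road")]
accuracy = {"alice": 0.95, "bob": 0.80, "carol": 0.60}
label, confidence = aggregate_labels(votes, accuracy)
# label == "car", confidence == 1.75 / 2.35, roughly 0.74
```

That confidence value lets downstream consumers decide, for example, to send low-confidence items back out for more classifications rather than accepting them blindly.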

4. Test the interface yourself

There's nothing worse than putting in all the effort to get your labelling interface to the crowd, only for a silly mistake to mean a task isn't visible or a tutorial isn't accessible. Not only is it a waste of money, but it will leave workers who want to do a good job unsure of what they should be doing!

5. Invest in good workers

A very difficult annotation task that involved hundreds of points — it was well worth the time and effort to train workers to be good at this task.

Sometimes, especially on harder tasks, it is worth the time and effort to create custom training and feedback for your workers. Not only will it increase your throughput and label quality, but it will make your workers feel more valued. All too often, MTurk workers submit their work and never hear anything more about it. Engaging with your crowd sets you apart from other requesters and makes it more likely that workers will want to do your tasks.

6. Cut out bad faith actors

Most of the people on platforms like MTurk work incredibly hard to do the best job they can. There are a few, however, who either try to find the easiest way to complete a task and get paid, regardless of quality, or who write scripts to automate the tasks to get more money. Whilst these bad actors are few and far between, it is important to filter out and block these workers from doing your tasks. Generally, this can be done by using assessment tasks for the first few times they work on a particular project, and then reassessing at set intervals.
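One simple way to implement such assessments is to mix known-answer (‘gold’) tasks into your batches and score each worker against them. The sketch below is illustrative only; the thresholds and data shapes are assumptions, not MTurk API calls:

```python
def flag_workers(gold_answers, submissions, min_accuracy=0.8, min_tasks=5):
    """Score each worker against gold-standard (known-answer) tasks and
    return the set of workers to block from future batches. Workers with
    too few gold attempts are left alone rather than judged on noise."""
    stats = {}  # worker -> (correct, attempted)
    for worker, task_id, answer in submissions:
        if task_id not in gold_answers:
            continue  # ordinary task, no known answer to score against
        correct, attempted = stats.get(worker, (0, 0))
        stats[worker] = (correct + (answer == gold_answers[task_id]),
                         attempted + 1)
    return {worker for worker, (correct, attempted) in stats.items()
            if attempted >= min_tasks and correct / attempted < min_accuracy}

gold = {"t1": "cat", "t2": "dog", "t3": "cat"}
subs = [("dave", "t1", "cat"), ("dave", "t2", "cat"), ("dave", "t3", "dog"),
        ("erin", "t1", "cat"), ("erin", "t2", "dog"), ("erin", "t3", "cat")]
blocked = flag_workers(gold, subs, min_accuracy=0.8, min_tasks=3)
# dave scored 1/3 and is flagged; erin scored 3/3 and keeps working
```

Re-running this at set intervals, as suggested above, catches workers whose quality drops later, not just at sign-up.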

7. Pay your workers fairly

Graph plotting classifications against time taken

One payment method is to work out the median time taken by workers on a particular task, and then to pay a flat rate to all workers based on that time and the local living wage.

This shouldn't have to be said, but MTurk doesn't force you to pay a minimum wage. As with the last point, treating workers fairly and paying a living wage is not only ethically important, but it also means workers are far more likely to want to work for you again and do a good job.

One way to pay fairly is to take the median time that workers took for a task, and pay all the workers a flat rate based on that time. This means you don't spend a fortune on workers who simply left their tab open, and it is also usually a decent predictor of how long a task should have taken. Paying in this way, however, could wrongly incentivise a small number of workers to finish the task as quickly as they can in order to earn a higher rate per hour. This is where the previous point comes in: continuously assessing workers means you can identify these bad apples without punishing the rest of your workforce.
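A minimal sketch of that calculation follows; the completion times and the 10.90 hourly wage are made-up example figures, not a recommendation:

```python
from statistics import median

def flat_rate_pay(times_seconds, hourly_wage):
    """Pay every worker the same amount per task, based on the median
    completion time and a target hourly wage (e.g. the local living
    wage). Using the median means one open tab barely moves the rate."""
    median_hours = median(times_seconds) / 3600
    return round(median_hours * hourly_wage, 2)

# Completion times in seconds; the 5400 s outlier is a forgotten tab.
times = [40, 55, 60, 72, 5400]
pay_per_task = flat_rate_pay(times, hourly_wage=10.90)
# median is 60 s, so each task pays 0.18
```

Had we used the mean instead, the single 5400-second outlier would have pushed the per-task rate up by an order of magnitude, which is exactly the failure mode the median avoids.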

All in all, there is a lot to think about when setting up and optimising a pipeline to create labelled training data. 1715 Labs do all this and more for you, so get in touch today!

Tom Nottingham, Data Science Intern
Published September 13th, 2021
