Green peas, penguins, and commercial data labelling

Lessons learned from a career of academic data wrangling at the Zooniverse

Guest post by Chris Lintott, Professor of Astrophysics at the University of Oxford and founder of the Zooniverse.

It started with trying to understand galaxies.

When my team and I launched Galaxy Zoo in 2007, we were hopeful that a few thousand people might participate in our astronomical crowdsourcing project over the course of the next few years. A day later, as 70,000 classifications an hour flooded into the database, it was clear we had underestimated the capacity of the crowd. And once we got a look at their results, we realised we were underestimating their collective ability too.

Volunteers on Galaxy Zoo were able to outperform the admittedly rudimentary machine learning available at the time (though we did write a paper in 2010 pointing out the usefulness of crowdsourced data in training neural networks). More importantly, they dealt with the unusual and the unexpected with ease: that first project generated exciting new discoveries, such as Hanny’s Voorwerp and the Galaxy Zoo Peas, which are still being followed up today.

I didn’t expect, as that first project launched, to be putting my energies into helping a commercial company develop these same techniques. But then I didn’t anticipate travelling to the Antarctic to help Oxford’s premier penguinologist, Tom Hart, with his camera network, nor that we’d ask volunteers to examine the history of the British Navy to provide vital historical context to climate scientists, nor that we would end up crowdsourcing mapping following natural disasters in the Caribbean and Nepal.

Hanny’s Voorwerp and IC 2497 taken by Wide Field Camera 3 of the Hubble Space Telescope, NASA (link)

Yet we did, driven by the sense that each new project taught us more about how to work effectively with a crowd, work which eventually led to the development of the Panoptes platform and project builder that the Zooniverse and 1715 Labs make use of.

This cross-disciplinary expertise shows up in surprising places. An approach to clustering first developed to help volunteers identify events seen by the VERITAS array of telescopes, which record the collisions of high-energy cosmic rays with the Earth’s atmosphere, is now being used to map the three-dimensional structure of cells. A technique developed to find distant galaxies is now deployed to monitor the ecology of a national park. This seeding of expertise across projects is particularly important when we confront recent developments in deep learning, advances that make the kind of labelling we do more necessary and more important than ever.

That sounds counterintuitive. More powerful machine learning should mean less work for the crowd, but we’ve found the opposite for three reasons.

Firstly, despite the arrival of multi-purpose machine learning tools such as networks pre-trained on ImageNet, the performance of a network on a particular problem still depends on the size of the labelled training set, something that crowdsourcing directly addresses.

Secondly, for complex problems where accuracy is vital, an intuitive response to the unexpected turns out to be important. It is this kind of intuition that makes it easy for you to distinguish spotting a cardboard giraffe cut-out in a bookshop window from spotting an actual giraffe popping into Waterstones for the latest Hilary Mantel.

Thirdly, for an enormous range of labelling problems, we’ve found that when confronted with real-world data the combination of human and machine classification outperforms either alone; diverting the tasks where the intuition of the crowd is most necessary away from machines allows ML to perform better on the bulk of the dataset. This ‘hybrid’ (cyborg?) machine learning is, we reckon, the future for many complex labelling problems. Visitors to Galaxy Zoo today, for example, see images selected precisely because new crowd labels for them will do the most to improve our Bayesian Convolutional Neural Network, which handles the bulk of the classifications.
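To make the routing idea concrete, here is a minimal sketch in Python. It is not the Galaxy Zoo pipeline: it swaps in a plain entropy threshold on a classifier's output probabilities to decide which items the machine labels and which go to volunteers. The function names, the item ids, the example probabilities and the 0.5-bit threshold are all hypothetical and chosen purely for illustration.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy, in bits, of a list of class probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def route(predictions, threshold=0.5):
    """Split items between the machine and the crowd.

    predictions: dict mapping an item id to its class probabilities.
    threshold: hypothetical entropy cut-off in bits; items at or below
        it are treated as confident enough for the machine to label.
    Returns (machine_labels, crowd_queue).
    """
    machine_labels = {}
    crowd_queue = []
    for item_id, probs in predictions.items():
        if predictive_entropy(probs) <= threshold:
            # Confident prediction: accept the most probable class.
            machine_labels[item_id] = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Uncertain prediction: ask volunteers instead.
            crowd_queue.append(item_id)
    # Show volunteers the most uncertain items first.
    crowd_queue.sort(key=lambda i: predictive_entropy(predictions[i]), reverse=True)
    return machine_labels, crowd_queue


# Made-up probabilities for three images and three classes.
predictions = {
    "galaxy_a": [0.97, 0.02, 0.01],
    "galaxy_b": [0.40, 0.35, 0.25],
    "galaxy_c": [0.55, 0.44, 0.01],
}
machine_labels, crowd_queue = route(predictions)
```

In this toy run the confidently classified image stays with the machine, while the two ambiguous ones are queued for volunteers, most uncertain first. A production system would likely use a more careful uncertainty estimate than raw entropy, and would feed the new crowd labels back into training.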

With a mature set of tools and our ‘project-first’ approach to problem solving, we probably shouldn’t have been surprised that the Zooniverse team started to be approached by commercial clients, with problems in domains ranging from textual analysis to Earth observation. It’s to satisfy this demand that 1715 Labs was set up, but I was determined that the new company remain connected to our research. The company draws on expertise from a variety of Zooniverse projects, techniques and team members, and — thanks to a unique spin-out arrangement — will contribute funding back to the Zooniverse itself.

By solving your problems, we’re also helping to learn a little more about the Universe.

Chris Lintott, Co-founder
Published March 23rd, 2021
