Confronting Bias in Data Is (Still) Difficult – and Necessary
Year after year, datasets get bigger, cloud servers run faster, and analytics tools become more sophisticated. Despite this constant progress, however, practitioners continue to run into the issue of bias—whether it's lurking in the dark recesses of their data files, popping up in their models' outputs, or framing their project's root assumptions.
A definitive solution to bias will require a lot more than local changes to a data team's workflows; it's not realistic to expect tactical fixes to solve a deep-rooted systemic problem. There's hope, however, in the growing recognition (in tech and beyond) that this is, indeed, a problem to think about, discuss, and tackle collectively.
This week, we're highlighting several articles that cover bias and data (and bias in data) in creative, actionable, and thought-provoking ways.
- The different types of bias you might encounter. For anyone who's exploring this topic for the first time, Shahrokh Barati's primer is an essential read on the differences between statistical bias and ethical bias: "two different categories of bias with distinct root causes and mitigations" that can each jeopardize data projects (and harm end users) if left unaddressed.
- A powerful strategy to add to your anti-bias toolkit. After ML models go into production, they continue to evolve as teams fine-tune them to optimize their performance. Every tweak is a potential opening for bias to sneak in – which is why Jazmia Henry advocates for adopting model versioning, an approach that "allows for model rollbacks that can save your company money long term, but more importantly, help reduce bias if and when it arises." (For a rough sense of what this can look like in code, see the sketch after this list.)
- Who shapes the politics of language models' outputs? The rapid integration of chatbots into our day-to-day lives raises the question of their objectivity. Yennie Jun attempted to measure the political leanings of GPT-3's outputs; the fascinating results she reports raise a whole set of questions about the responsibility and transparency of the people who train and design these powerful models.
- How biased data can become a life-and-death issue. When we think of fields where data science and ML can make a major impact, healthcare is a common example, with many real-world applications already in use (or getting close). As Stefany Goradia shows, though, the datasets that health data scientists rely on can be rife with many forms of bias, which is why it's crucial that they know how to identify them correctly.
- A deeper understanding of how bias works within AI systems. To round out our selection, we recommend reading Boris Ruf's lucid explanation of the inner workings of models—statistical formulas and all!—and how their design makes them susceptible to producing biased outputs.
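Model versioning can be implemented with dedicated tools (MLflow, DVC, and the like) or with a simple homegrown registry. The snippet below is a minimal, hypothetical sketch of the idea, not the approach from Henry's article: every retrained model is stored under an incrementing version together with the metrics it was evaluated on, so a team can roll back if a newer version turns out to be more biased. The `ModelRegistry` class, directory layout, and the `group_disparity` metric are all illustrative assumptions.

```python
# Minimal, hypothetical model-versioning sketch: each saved model gets an
# incrementing version plus its evaluation metrics, enabling rollback when a
# new version regresses (for example, on a fairness metric).
import json
import pickle
from pathlib import Path


class ModelRegistry:
    def __init__(self, root="model_registry"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def _versions(self):
        # Existing version numbers, e.g. [1, 2, 3], read from directory names v1, v2, ...
        return sorted(int(p.name.lstrip("v")) for p in self.root.glob("v*"))

    def save(self, model, metrics):
        """Persist the model and its evaluation metrics under the next version."""
        existing = self._versions()
        version = existing[-1] + 1 if existing else 1
        vdir = self.root / f"v{version}"
        vdir.mkdir()
        with open(vdir / "model.pkl", "wb") as f:
            pickle.dump(model, f)
        (vdir / "metrics.json").write_text(json.dumps(metrics))
        return version

    def load(self, version=None):
        """Load a specific version, or the latest one if none is given."""
        version = version or self._versions()[-1]
        with open(self.root / f"v{version}" / "model.pkl", "rb") as f:
            return pickle.load(f)


# Example: roll back to v1 if v2 shows a worse disparity between groups.
registry = ModelRegistry()
registry.save({"weights": [0.1, 0.2]}, {"accuracy": 0.91, "group_disparity": 0.02})
registry.save({"weights": [0.3, 0.1]}, {"accuracy": 0.93, "group_disparity": 0.11})
model = registry.load(version=1)  # serve the earlier, less biased model
```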
For any of you who'd like to branch out into other topics over the next few days—from A/B testing to natural language processing—we're delighted to share some of our recent favorites. Enjoy!
- Ready to dig deeper into the conversation around ChatGPT? Our February Edition featured some of the most thoughtful writing you'll find on the omnipresent chatbot.
- We were delighted to share a new back-to-basics article by Shreya Rao, focusing on essential ML concepts like gradient descent and linear regression.
- It's always interesting to hear writers-who-happen-to-be-data-pros discuss their craft; Parul Pandey's new interview, featuring Lewis Tunstall (who wrote a book on NLP and transformers last year), is no exception.
- Take your A/B testing game to the next level by giving it a Bayesian flavor – Matteo Courthoud's accessible tutorial shows how to go about it. (The sketch after this list offers a small taste of the approach.)
- Chayma Zatout's latest deep dive is a patient introduction to neural networks, guiding us through the solution to a classification problem in Python.
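To give a flavor of what a Bayesian A/B test can look like, here is a generic Beta-Binomial sketch (illustrative only, not code from Courthoud's tutorial): a Beta(1, 1) prior on each variant's conversion rate is updated with made-up observed data, and posterior samples estimate the probability that variant B beats variant A.

```python
# Minimal Beta-Binomial sketch of a Bayesian A/B test (illustrative assumptions:
# uniform Beta(1, 1) priors and hypothetical conversion counts).
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: (conversions, visitors) for each variant.
conversions_a, visitors_a = 120, 2400
conversions_b, visitors_b = 145, 2300

# Posterior for each conversion rate: Beta(1 + successes, 1 + failures).
posterior_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, size=100_000)
posterior_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, size=100_000)

prob_b_better = (posterior_b > posterior_a).mean()
expected_lift = (posterior_b - posterior_a).mean()

print(f"P(B > A) is roughly {prob_b_better:.3f}")
print(f"Expected lift in conversion rate is roughly {expected_lift:.4f}")
```

Instead of a single p-value, this framing yields a direct probability statement about which variant is better, which is part of what makes the Bayesian flavor appealing.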
We hope you consider becoming a Medium member this week – it's the most direct and effective way to support the work we publish.
Until the next Variable,
TDS Editors