The Ethics of Using Hacked Data: Lessons from the Patreon Breach
Can data scientists ever justify using hacked data?
This question became especially relevant after a major security breach at Patreon in 2015. What followed wasn’t just a technical failure, but a serious ethical dilemma—one that still matters today.
In this post, I explore what happened, why the leaked data was tempting for researchers, and why—despite the potential benefits—using hacked data is fundamentally unethical.
What Happened to Patreon?
Patreon is a popular crowdfunding platform that allows creators to receive financial support directly from their audiences. In October 2015, the platform suffered a significant security breach when hackers accessed a debug version of the site that had mistakenly been left publicly accessible.
As a result, around 15GB of data was stolen and leaked online. This included:
- User account information
- Donation and transaction records
- Private messages
- Portions of Patreon’s source code
While credit card numbers were reportedly not compromised, a large amount of personally identifiable and sensitive information became publicly available.
Why This Created an Ethical Dilemma
For many researchers, especially in academia, access to large, real-world datasets is limited. Before the breach, Patreon only provided small sample datasets to researchers, making it difficult to draw meaningful conclusions.
Suddenly, after the hack, the entire dataset was accessible.
From a purely technical perspective, this looked like a goldmine. But ethically, it raised a serious question:
Does public availability justify using data that was obtained illegally?
The Data Science Lifecycle and Ethical Responsibility
A standard data science project usually follows five stages:
- Ethical data collection
- Data preprocessing
- Modelling
- Evaluation
- Deployment
The Patreon breach failed at the very first stage.
Sensitive user data—such as emails, passwords, addresses, and donation records—was exposed. This highlights how critical secure data storage practices are, especially for platforms handling financial and personal information.
A Note on Password Security
Passwords should never be stored in plain text. Instead, secure systems rely on:
- Hashing (a one-way transformation that cannot be reversed)
- Salting (adding unique randomness to each password before hashing)
- Slow, adaptive algorithms such as bcrypt, designed to make brute-force attacks expensive
These techniques ensure that even if a database is breached, passwords cannot be easily recovered.
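To make the idea concrete, here is a minimal sketch of salted password hashing and verification. bcrypt itself requires a third-party package, so this example uses PBKDF2 from Python's standard library instead; the principle (salt plus a deliberately slow one-way hash) is the same.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # deliberately slow to resist brute-force attacks

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return a (salt, digest) pair; only these are stored, never the password."""
    salt = os.urandom(16)  # unique random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))  # False
```

Because the salt is random per user, identical passwords produce different digests, and because the hash is one-way and slow, a leaked database does not readily reveal the original passwords.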
Can Privacy Be Restored After a Leak?
Some researchers argue that privacy can be preserved through techniques like Privacy-Preserving Data Publishing (PPDP), including:
- Suppressing sensitive variables
- Grouping values
- Adding noise to the data
However, simply removing names or emails is not enough. Quasi-identifiers—such as postcodes, donation patterns, or timestamps—can still be combined to re-identify individuals.
Once deeply personal data is leaked, full privacy recovery becomes extremely difficult.
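The three PPDP techniques above can be illustrated with a small sketch. The records and field names here are entirely hypothetical, invented for illustration; the point is how suppression, grouping, and noise each reduce re-identification risk.

```python
import random

# Hypothetical records in the style of a leaked pledge dataset
records = [
    {"name": "Alice", "postcode": "SW1A 1AA", "pledge": 25.0},
    {"name": "Bob",   "postcode": "SW1A 2BB", "pledge": 5.0},
    {"name": "Carol", "postcode": "EC2V 7HH", "pledge": 120.0},
]

def anonymise(record: dict, rng: random.Random) -> dict:
    return {
        # Suppression: the direct identifier ("name") is dropped entirely.
        # Grouping (generalisation): keep only the postcode district,
        # so many individuals share each released value.
        "postcode_district": record["postcode"].split()[0],
        # Noise addition: perturb the pledge so exact amounts
        # cannot be matched against other data sources.
        "pledge_approx": round(record["pledge"] + rng.gauss(0, 2.0), 2),
    }

rng = random.Random(42)  # fixed seed so the sketch is reproducible
released = [anonymise(r, rng) for r in records]
for row in released:
    print(row)
```

Note how the released rows still leak quasi-identifiers: the combination of district and approximate pledge could, with outside knowledge, narrow a row down to one person. That is exactly why these techniques mitigate rather than restore privacy.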
Arguments For Using Hacked Data
Supporters of using leaked datasets often argue that:
- Once data is leaked, it effectively enters the public domain
- Researchers gain access to data that would otherwise be impossible to obtain
- Analysing such data can support public interest, accountability, or “data activism”
From this perspective, the data is seen as an opportunity for societal benefit.
Arguments Against Using Hacked Data
However, the arguments against are far more compelling:
- Researchers often inspect raw data manually before anonymisation, exposing them directly to private information
- Individuals whose data was leaked never gave consent
- There is a real risk of misuse, including criminal activity
- Large leaked datasets are chaotic, increasing the chance of accidental privacy violations
A useful analogy is privacy at home:
Even if someone isn’t doing anything wrong, they still expect not to be watched. Using hacked data violates that expectation.