Bluesky, the decentralized social media platform, is facing renewed scrutiny over the implications of its open API. A recent report by 404 Media revealed that Daniel van Strien, a machine learning librarian at Hugging Face, used Bluesky’s Firehose API to scrape 1 million public posts for machine learning research. The dataset was briefly available in a public repository before van Strien removed it following widespread controversy.
This incident highlights the double-edged nature of decentralized platforms with open APIs. While openness fosters innovation and research, it also raises significant privacy concerns. Everything posted publicly on Bluesky is, by the platform’s design, accessible for anyone to use, including for training AI models.
Critics argue that such practices blur the line between data that is publicly accessible and data that is ethical to use for AI training. Even when data is technically public, the question of consent looms large, especially when users may not be fully aware of how their posts could be repurposed.
Bluesky’s Firehose API gives developers real-time access to all public content on the platform, a feature intended to encourage third-party integrations and research. However, as the Hugging Face case demonstrates, this openness also enables mass data scraping, often without users’ explicit consent.
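To make concrete how little stands between a public event stream and a text dataset, here is a minimal Python sketch. It is not the code used in the Hugging Face case, and it does not connect to any network. It assumes the stream's events have already been decoded into JSON strings shaped roughly like those of Bluesky's Jetstream service; the field names (`kind`, `commit`, `operation`, `collection`, `record`, `text`, `did`) and the helper `extract_post_texts` are assumptions made for illustration.

```python
import json
from typing import Iterable, Iterator, Tuple

# AT Protocol collection name for posts.
POST_COLLECTION = "app.bsky.feed.post"


def extract_post_texts(raw_events: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (author DID, post text) for each newly created post.

    Assumes each item is a JSON string with a Jetstream-like shape;
    anything that does not match is skipped silently.
    """
    for raw in raw_events:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed event, skip it
        if not isinstance(event, dict) or event.get("kind") != "commit":
            continue  # identity/account events carry no post content
        commit = event.get("commit")
        if not isinstance(commit, dict):
            continue
        if commit.get("operation") != "create":
            continue  # ignore updates and deletions
        if commit.get("collection") != POST_COLLECTION:
            continue  # ignore likes, follows, reposts, and so on
        record = commit.get("record")
        if not isinstance(record, dict):
            continue
        text = record.get("text")
        if isinstance(text, str):
            yield event.get("did", ""), text


if __name__ == "__main__":
    # Hand-written sample events standing in for a live stream.
    sample = [
        json.dumps({
            "did": "did:plc:example",
            "kind": "commit",
            "commit": {
                "operation": "create",
                "collection": POST_COLLECTION,
                "record": {"text": "hello world"},
            },
        }),
        json.dumps({
            "did": "did:plc:example",
            "kind": "commit",
            "commit": {
                "operation": "create",
                "collection": "app.bsky.feed.like",
                "record": {},
            },
        }),
    ]
    for did, text in extract_post_texts(sample):
        print(did, text)
```

Nothing in this filter involves authentication or a consent check. Under the assumptions above, the only thing separating a public stream from a stored dataset is a few lines of parsing, which is the core of the concern critics have raised.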
As debates around data ethics and AI training continue, incidents like this underscore the importance of transparency and accountability. For platforms like Bluesky, striking a balance between openness and user trust will be critical in the evolving landscape of decentralized social media.