Jordan Meyer and Mathew Dryhurst founded Spawning AI to develop tools that help artists control how their works are used online. Their latest project, Source.Plus, aims to curate “non-infringing” media for AI model training.
Source.Plus’ first initiative features a dataset of nearly 40 million images that are either in the public domain or released under Creative Commons’ CC0 license, which lets creators waive nearly all legal interest in their works. Despite being smaller than other generative AI training datasets, Meyer claims the dataset is already “high-quality” enough to train state-of-the-art image-generating models.
“With Source.Plus, we’re creating a universal ‘opt-in’ platform,” Meyer said. “We aim to make it easy for rights holders to offer their media for generative AI training on their terms and seamless for developers to integrate that media into their training workflows.”
Rights Management
The ethical debate around training generative AI models, especially art-generating models like Stable Diffusion and OpenAI’s DALL-E 3, remains unresolved and has significant implications for artists.
Generative AI models “learn” to create outputs (e.g., photorealistic art) by training on vast quantities of data. Some developers argue fair use allows them to scrape data from public sources, regardless of copyright status. Others have tried compensating or crediting content owners for their contributions to training sets.
Meyer, Spawning’s CEO, believes no one has settled on the best approach yet.
“AI training often defaults to using the easiest available data, which hasn’t always been the most fair or responsibly sourced,” he told TechCrunch in an interview. “Artists and rights holders have had little control over how their data is used for AI training, and developers have lacked high-quality alternatives that respect data rights.”
Source.Plus, in limited beta, builds on Spawning’s existing tools for art provenance and usage rights management.
In 2022, Spawning launched HaveIBeenTrained, a site that lets creators opt out of training datasets used by vendors partnered with Spawning, like Hugging Face and Stability AI. After raising $3 million from investors including True Ventures and Seed Club Ventures, Spawning introduced ai.txt, which lets websites “set permissions” for AI, and Kudurru, a tool that defends against data-scraping bots.
Source.Plus is Spawning’s first effort to build and curate a media library in-house. The initial PD/CC0 image dataset can be used for commercial or research purposes, Meyer says.
“Source.Plus isn’t just a repository for training data; it’s an enrichment platform supporting the training pipeline,” he continued. “Our goal is to offer a high-quality, non-infringing CC0 dataset capable of supporting a powerful base AI model within the year.”
Organizations like Getty Images, Adobe, Shutterstock, and AI startup Bria claim to use only fairly sourced data for model training. (Getty even calls its generative AI products “commercially safe.”) But Meyer says Spawning aims to set a “higher bar” for fair data sourcing.
Source.Plus filters images against artists’ opt-outs and other stated preferences, and displays provenance information for each work. It excludes images not licensed under CC0, including those that merely require attribution. Spawning also monitors for copyright challenges on sources like Wikimedia Commons, where the person who uploads a work and labels its copyright status often isn’t its creator.
“We meticulously validated the reported licenses of the images we collected, excluding any questionable licenses — a step many ‘fair’ datasets don’t take,” Meyer said.
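Spawning hasn’t published its curation tooling, but the screening Meyer describes amounts to two checks per image: an accepted license and no registered opt-out. Here is a minimal sketch of that logic; the record format and opt-out registry are illustrative assumptions, not Spawning’s actual pipeline.

```python
# Hypothetical sketch: Spawning hasn't published its curation code, so the
# record format and opt-out registry below are illustrative assumptions.

ACCEPTED_LICENSES = {"CC0", "Public Domain"}  # attribution-required licenses are excluded

def filter_candidates(records: list[dict], opted_out_urls: set[str]) -> list[dict]:
    """Keep records whose license is accepted and whose source hasn't opted out."""
    return [
        record for record in records
        if record.get("license") in ACCEPTED_LICENSES
        and record.get("source_url") not in opted_out_urls
    ]

# Example with two mock records: the CC BY image is dropped because it requires attribution.
records = [
    {"source_url": "https://example.org/a.jpg", "license": "CC0"},
    {"source_url": "https://example.org/b.jpg", "license": "CC BY 4.0"},
]
print(filter_candidates(records, opted_out_urls=set()))
```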
Historically, problematic images, including violent and pornographic ones, have plagued training datasets.
The LAION dataset maintainers had to pull one library offline after reports uncovered medical records and child sexual abuse depictions. Recently, a Human Rights Watch study found one of LAION’s repositories included Brazilian children’s faces without their consent. Adobe Stock, used to train Adobe’s Firefly Image model, contained AI-generated images from rivals like Midjourney.
Spawning’s solution includes classifier models detecting nudity, gore, personal information, and other undesirable content. Recognizing no classifier is perfect, Spawning plans to let users adjust classifiers’ detection thresholds, Meyer says.
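Spawning hasn’t detailed how its classifiers are wired together, but the adjustable-threshold idea can be illustrated with a short sketch; the category names and score cutoffs below are assumptions made for the example.

```python
# Illustrative only: Spawning hasn't detailed its classifiers, so the category
# names and default thresholds below are assumptions.
from dataclasses import dataclass, field

@dataclass
class SafetyFilter:
    # Per-category score cutoffs in [0, 1]; users could tune these to taste.
    thresholds: dict[str, float] = field(default_factory=lambda: {
        "nudity": 0.5,
        "gore": 0.5,
        "personal_info": 0.3,
    })

    def is_allowed(self, scores: dict[str, float]) -> bool:
        """Reject an image if any classifier score exceeds its category threshold."""
        return all(
            scores.get(category, 0.0) <= cutoff
            for category, cutoff in self.thresholds.items()
        )

# A stricter curator lowers the nudity cutoff before filtering a candidate image.
safety = SafetyFilter()
safety.thresholds["nudity"] = 0.2
print(safety.is_allowed({"nudity": 0.35, "gore": 0.1}))  # False: 0.35 exceeds the 0.2 cutoff
```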
“We employ moderators to verify data ownership,” Meyer added. “We also have remediation features where users can flag offending or possibly infringing works, and the data consumption trail can be audited.”
Compensation
Programs compensating creators for generative AI training data contributions have had mixed results. Some rely on opaque metrics, while others pay unreasonably low amounts.
For example, Shutterstock’s contributor fund for artwork used to train generative AI models or licensed to third-party developers isn’t transparent about earnings, nor does it allow artists to set their own terms. One estimate pegs earnings at $15 for 2,000 images, a modest amount.
Once Source.Plus exits beta and expands beyond PD/CC0 datasets, it will differ from other platforms by allowing artists to set their own prices per download. Spawning will charge a flat fee of “a tenth of a penny,” Meyer says.
Customers can also pay $10 per month, plus the typical per-image download fee, for Source.Plus Curation, a subscription plan offering private image collection management, up to 10,000 monthly downloads, and early access to new features like “premium” collections and data enrichment.
“We provide guidance and recommendations based on industry standards and internal metrics, but contributors ultimately determine their own terms,” Meyer said. “This pricing model intentionally gives artists the lion’s share of revenue and allows them to set their own terms for participation. We believe this approach leads to higher payouts and greater transparency.”
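Meyer doesn’t spell out how the flat fee interacts with an artist’s asking price, but as a rough illustration, assume the “tenth of a penny” is charged per download and deducted from a hypothetical one-cent price set by the artist:

```python
# Back-of-the-envelope sketch. The $0.001 fee comes from Meyer's "tenth of a
# penny" figure; that it applies per download and is deducted from the artist's
# price, and the one-cent price itself, are assumptions for illustration.

SPAWNING_FLAT_FEE = 0.001  # USD per download (assumed)

def split_revenue(price_per_download: float, downloads: int) -> tuple[float, float]:
    """Return (artist revenue, platform revenue) for a batch of downloads."""
    artist_total = (price_per_download - SPAWNING_FLAT_FEE) * downloads
    platform_total = SPAWNING_FLAT_FEE * downloads
    return artist_total, platform_total

# An artist prices an image at one cent and it is downloaded 10,000 times.
artist, platform = split_revenue(0.01, 10_000)
print(f"artist: ${artist:.2f}, platform: ${platform:.2f}")  # artist: $90.00, platform: $10.00
```

Under those assumptions, the artist keeps roughly 90% of each sale, which is the “lion’s share” arrangement Meyer describes.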
If Source.Plus gains the traction Spawning hopes for, the company plans to expand it to other media types, including audio and video. Spawning is in talks with firms to make their data available on Source.Plus, and it might build its own generative AI models using Source.Plus datasets.
“We hope rights holders wanting to participate in the generative AI economy can receive fair compensation,” Meyer said. “We also hope artists and developers conflicted about engaging with AI can do so respectfully.”
Spawning has a niche to carve out. Source.Plus seems like a promising attempt to involve artists in generative AI development and let them profit from their work.
As my colleague Amanda Silberling recently wrote, the rise of apps like Cara, which saw a surge after Meta announced it might train generative AI on Instagram content, shows the creative community is at a breaking point. They’re seeking alternatives to platforms they perceive as exploitative, and Source.Plus might be a viable one.
But even if Spawning always acts in artists’ best interests (a big if, considering it’s VC-backed), it’s unclear whether Source.Plus can scale as successfully as Meyer envisions. Social media has shown that moderating millions of pieces of user-generated content is a challenging problem.