Training Data Disclosures for Generative AI: New Rules and Strategies for 2026
Jun 4, 2026
As of January 1, 2026, the landscape for building generative artificial intelligence has changed. If you develop or distribute a generative AI system in California, you can no longer keep the makeup of your training data entirely secret. The state’s Assembly Bill No. 2013 (AB 2013) is now in force, mandating that developers publish documentation summarizing the data used to train their models. This isn't just a suggestion; it is a binding legal requirement with significant implications for how AI companies operate, compete, and defend their intellectual property.
For many in the industry, this marks the end of the "black box" era where proprietary datasets were guarded as closely as source code. For others, it raises serious concerns about trade secrets and competitive advantage. Whether you are a startup launching your first large language model or an enterprise integrating third-party AI services, understanding these new disclosure obligations is critical. The goal here is not just to explain what the law says, but to show you how to navigate it without exposing your company to unnecessary risk.
What Exactly Does AB 2013 Require?
At its core, Assembly Bill No. 2013 is a California law requiring developers of generative AI systems to disclose information about the training data used in their models. The legislation defines generative AI broadly as any system capable of producing synthetic content (text, images, video, or audio) based on its training data. The rules apply to any system released or substantially modified on or after January 1, 2022, that is made available to Californians, whether for free or for a fee.
The burden falls squarely on the developer, defined as anyone who designs, codes, produces, or significantly modifies such a system. You must publish this documentation on your website. It’s not enough to have it buried in a terms-of-service agreement or hidden behind a login wall. The information needs to be publicly accessible at the point of initial release and whenever you make significant updates to the model.
Crucially, the law does not ask for your raw datasets. It asks for a high-level summary. However, that summary must cover twelve specific categories of information:
- Sources and owners: Where did the data come from? Who owns it?
- Volume: How many data points are in the dataset?
- Type: What kind of data is it (e.g., text, code, images)?
- Intellectual Property Status: Does it include copyrighted, trademarked, or patented material?
- Personal Information: Does it contain personal or aggregate consumer data?
- Timeframe: When was the data collected? Is collection ongoing?
- Usage Dates: When was this dataset first used in development?
- Purpose Alignment: How do these datasets relate to the system's intended purpose?
- Size Estimates: Approximate size of the data, expressed in ranges if exact numbers aren't possible.
- Licensing: Was the material licensed or freely available?
- Synthetic Data: Did you use synthetic data generated by other AI systems?
- General Characteristics: Any other relevant traits of the training data.
This level of detail forces a shift from vague claims like "trained on public internet data" to specific, verifiable statements about composition and sourcing.
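One way to keep those twelve categories from slipping through the cracks is to treat the disclosure as a structured record rather than free-form prose. The following is a minimal sketch in Python, not a statutory template: the class name `TrainingDataDisclosure`, the field names, and the `missing_categories` helper are all hypothetical and simply mirror the list above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TrainingDataDisclosure:
    """High-level summary for one dataset.

    Field names are illustrative assumptions that follow the twelve
    categories listed in this article, not the wording of the statute.
    """

    sources_and_owners: Optional[str] = None
    volume: Optional[str] = None
    data_types: Optional[str] = None
    ip_status: Optional[str] = None
    personal_information: Optional[str] = None
    collection_timeframe: Optional[str] = None
    first_used: Optional[str] = None
    purpose_alignment: Optional[str] = None
    size_estimate: Optional[str] = None
    licensing: Optional[str] = None
    synthetic_data: Optional[str] = None
    general_characteristics: Optional[str] = None

    def missing_categories(self) -> list[str]:
        """Return the names of categories that are still empty."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]


draft = TrainingDataDisclosure(
    sources_and_owners="Licensed news archives and public web pages",
    volume="Between 10 million and 100 million documents",
)
print(draft.missing_categories())
```

Running a check like this before publication gives a quick list of categories that still need a narrative description. Whether a given description is legally sufficient remains a question for counsel, not for the script.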
The Tension Between Transparency and Trade Secrets
Here lies the biggest headache for AI developers: the conflict between transparency and competitive secrecy. Your training data strategy is often your moat. The specific mix of datasets, the cleaning processes, and the filtering techniques are what make your model better than your competitor’s. Revealing too much could hand your rivals a roadmap to replicate your success.
Legal experts warn that inadvertent disclosure of trade secrets is a real risk. If you describe your data sources in too much detail, you might reveal proprietary partnerships or unique acquisition methods. On the flip side, being too vague could mean non-compliance with AB 2013. The law aims to improve public understanding, not to protect corporate secrets, so the line between a sufficient disclosure and an over-revealing one is thin.
Major players in the industry are already grappling with this. Companies behind leading large language models have started publishing high-level disclosures that categorize their data into broad buckets: publicly available information, data from third-party partners, user-generated content (with opt-out mechanisms), and synthetic data. These disclosures satisfy the letter of the law while keeping the granular details, such as specific URLs, API endpoints, or internal curation algorithms, under wraps.
The key strategy here is aggregation and generalization. Instead of listing every single data provider, group them by type. Instead of giving exact counts, use ranges. The goal is to provide enough information for a consumer to understand the nature of the data without revealing the recipe.
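As a rough illustration of aggregation and generalization, the sketch below groups individual providers by type and reports order-of-magnitude ranges instead of exact counts. The function names (`count_to_range`, `summarize_sources`), the category labels, and the sample figures are assumptions made up for this example; the choice of power-of-ten buckets is also just one possible convention.

```python
from __future__ import annotations

from collections import Counter


def count_to_range(count: int) -> str:
    """Map an exact count to an order-of-magnitude range label."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count < 1_000:
        return "fewer than 1,000"
    lower = 1_000
    # Grow the lower bound by powers of ten until the count fits
    # inside the half-open interval [lower, lower * 10).
    while lower * 10 <= count:
        lower *= 10
    return f"{lower:,} to {lower * 10:,}"


def summarize_sources(sources: list[tuple[str, str, int]]) -> dict[str, str]:
    """Collapse (provider, category, count) rows into per-category ranges.

    Provider names are dropped on purpose: only the category and a
    coarse range survive into the public summary.
    """
    totals: Counter[str] = Counter()
    for _provider, category, count in sources:
        totals[category] += count
    return {category: count_to_range(n) for category, n in totals.items()}


# Hypothetical internal inventory; none of these figures are real.
inventory = [
    ("Outlet A", "licensed news", 2_000_000),
    ("Outlet B", "licensed news", 1_400_000),
    ("Forum X", "public web", 500),
]
print(summarize_sources(inventory))
# {'licensed news': '1,000,000 to 10,000,000', 'public web': 'fewer than 1,000'}
```

The internal inventory keeps the exact provider names and counts, while the published summary exposes only the category and the range.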
Strategies for Compliance Without Compromise
So, how do you comply with AB 2013 without handing over your crown jewels? Here are practical steps to consider:
- Audit Your Data Lineage: Before you can disclose anything, you need to know what you have. Many teams struggle with incomplete records of what data went into which model version. Implement robust data governance tools that track provenance from ingestion to training.
- Categorize, Don’t List: Use the twelve categories mandated by the law as your framework. For each category, provide a narrative description rather than a raw list. For example, instead of naming every news outlet, say "data from major international news agencies obtained through licensing agreements."
- Highlight Synthetic Data Usage: If you use synthetic data, be explicit about it. This is becoming a common practice to augment training sets. Explaining how and why you use synthetic data can actually build trust, showing that you are actively managing bias and quality.
- Review for IP Flags: Clearly state whether your data includes copyrighted material. If it does, explain the legal basis for its use (e.g., fair use, licensing). This preempts potential copyright challenges and shows due diligence.
- Update Regularly: Compliance is not a one-time task. Every time you release a new model or make substantial modifications, you must update the disclosure. Build this into your release pipeline.
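The last step, building the disclosure into your release pipeline, can be sketched as a simple pre-release gate. This is an assumption-laden illustration: `ModelRelease`, `PublishedDisclosure`, and `release_blockers` are hypothetical names, and the `substantially_modified` flag stands in for a legal judgment that a script cannot make on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ModelRelease:
    model_version: str
    released_on: date
    # Assumption: someone (ideally counsel) has decided whether this
    # release counts as a substantial modification.
    substantially_modified: bool


@dataclass(frozen=True)
class PublishedDisclosure:
    model_version: str
    published_on: date


def release_blockers(
    release: ModelRelease, disclosure: Optional[PublishedDisclosure]
) -> list[str]:
    """Return reasons the release should be held; empty means proceed."""
    if disclosure is None:
        return ["no disclosure published"]
    problems = []
    if (
        release.substantially_modified
        and disclosure.model_version != release.model_version
    ):
        problems.append("disclosure describes an older model version")
    if disclosure.published_on > release.released_on:
        problems.append("disclosure published after release date")
    return problems


release = ModelRelease("v2", date(2026, 3, 1), substantially_modified=True)
stale = PublishedDisclosure("v1", date(2026, 1, 1))
print(release_blockers(release, stale))
# ['disclosure describes an older model version']
```

A check like this could run as one step in a deployment pipeline and fail the job whenever the returned list is non-empty.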
Remember, the law exempts certain systems designed exclusively for security, system integrity, aircraft operation, or national security. If your AI falls into these narrow categories, you may be exempt, but don’t assume this applies to general-purpose business tools.
Legal Challenges and Constitutional Questions
The implementation of AB 2013 hasn’t been without controversy. xAI, the developer of Grok, filed a federal lawsuit seeking to block the statute. Its argument rests on two constitutional grounds: that the law compels the disclosure of trade secrets in violation of the Fifth Amendment, and that it compels speech in violation of the First Amendment.
This challenge is significant. If successful, it could set a precedent that limits the scope of mandatory transparency laws across the United States. Even if the law stands, the litigation highlights the tension between regulatory oversight and corporate rights. Developers should monitor these cases closely, as they may influence how strictly regulators enforce the disclosure requirements.
Moreover, there is a concern about "disclosure fatigue." Researchers like Guha et al. (2023) argue that for disclosures to be effective, they must be understandable, actionable, and verifiable. If every AI company publishes a dense, technical document that consumers ignore, the law fails its primary goal of empowering users. Standardizing formats could help, but it risks making disclosures boring and overlooked. Finding a balance between legal sufficiency and user engagement is a challenge for all developers.
Broader Implications for AI Governance
California’s move doesn’t exist in a vacuum. It aligns with broader global trends toward AI accountability. The European Union’s AI Act also imposes transparency obligations, particularly for high-risk systems. While the specifics differ, the direction is clear: regulators worldwide want to know what fuels AI systems.
For businesses, this means that transparency is becoming a feature, not just a compliance checkbox. Consumers and enterprise clients are increasingly skeptical of AI systems they don’t understand. Providing clear, honest disclosures about training data can differentiate your brand. It signals confidence and responsibility.
Furthermore, as AI models become more integrated into critical sectors like healthcare, finance, and education, the need for traceability grows. Knowing that a medical diagnosis AI was trained on diverse, up-to-date, and properly licensed datasets matters. AB 2013 is a step toward making that knowledge accessible.
| Aspect | Requirement | Strategic Consideration |
|---|---|---|
| Scope | Generative AI systems released or substantially modified on or after Jan 1, 2022 | Retroactive application requires immediate audit of legacy models. |
| Disclosure Format | Publicly accessible website documentation | Must be easy to find; avoid burying in legal jargon. |
| Data Details | High-level summary of 12 categories | Use ranges and generalizations to protect trade secrets. |
| IP & Privacy | Status of copyrights, patents, and personal info | Clarify licensing and opt-out mechanisms to mitigate liability. |
| Exemptions | Security, system integrity, aircraft operation, national security | Exemptions are narrow; most commercial AI does not qualify for them. |
Looking Ahead: Beyond California
While AB 2013 is currently a state law, its impact will likely ripple nationwide. Other states may follow suit, adopting similar frameworks. Federal legislation is also under discussion, though it tends to move more slowly. By complying with California’s stringent requirements now, you position yourself ahead of the curve for future regulations.
Additionally, the pressure from civil society and academic researchers is mounting. Groups advocating for ethical AI are pushing for greater transparency to address biases and misinformation. Proactive disclosure can help deflect criticism and demonstrate good faith.
The bottom line is that the era of opaque AI development is ending. Training data is no longer just a technical input; it is a subject of public interest and legal scrutiny. Embracing transparency, while carefully guarding your competitive edge, is the only sustainable path forward.
Who exactly is considered a "developer" under AB 2013?
Under AB 2013, a developer is defined as any entity or individual that designs, codes, produces, or substantially modifies a generative AI system. This includes both the original creators of a model and those who fine-tune or significantly alter existing models for new purposes. If you are merely using an API to access a third-party model without modifying its core training or architecture, you may not be classified as a developer, but this distinction can be nuanced.
Do I need to share my actual training datasets?
No, AB 2013 does not require you to publish your raw training datasets or proprietary code. The law mandates a high-level summary describing the characteristics, sources, and composition of the data. You should focus on providing metadata and descriptive information rather than the data itself. This allows you to maintain control over your intellectual property while meeting transparency obligations.
What happens if I fail to comply with AB 2013?
The statute itself does not appear to set out a dedicated penalty schedule, so exposure is likely to come through existing channels, such as enforcement actions by the California Attorney General or private lawsuits. Remedies could include fines and injunctions requiring you to cease distribution of the non-compliant AI system. Additionally, failure to disclose can damage your reputation and erode trust with users and partners. Given the retroactive nature of the law, existing models released on or after January 1, 2022, are also subject to these requirements.
How does AB 2013 interact with copyright law?
AB 2013 requires you to disclose whether your training data includes copyrighted material. This disclosure does not grant you immunity from copyright infringement claims, but it does promote transparency. If you use copyrighted data, you should clearly state the legal basis for its use, such as fair use or licensing agreements. This helps manage expectations and provides context for potential legal disputes.
Is there a deadline for updating disclosures?
There is no calendar deadline, but there is a trigger. Developers must update their disclosures whenever they release a new model or make substantial modifications to an existing one. There is no fixed periodic review schedule (like annual updates); the obligation arises with each significant change. This ensures that the public always has access to current information about the data powering the AI systems they use.