AI-Ready Biodata Is America’s Next Strategic Infrastructure

There is little doubt in Washington that AI is a powerful technology that will help determine which country rules the 21st century. Policymakers from the Hill to the White House have made U.S. AI leadership a priority and invested significant resources towards staying ahead of competitors. Yet the United States is at perhaps greater risk than ever before of losing the broader global technology competition.

Despite growing investment in AI, U.S. policymakers have failed to prepare for its convergence with biotechnology, a fusion that will define economic and national power in the coming decades. While competitors are building coordinated AI-bio ecosystems, the U.S. biodata (biological data) environment remains fragmented, underfunded, and insecure. Without a federally led effort to build AI-ready biodata as national infrastructure, the United States risks ceding leadership in both AI and biotechnology at a critical moment.

The Strategic Importance of the AI-Biotechnology Nexus

Compute, talent, and capital are necessary for AI-enabled biotechnology, but biodata is the binding constraint. Without large, representative, and interoperable biological datasets, AI models cannot generalize, scale, or translate into real-world impact.

The application of AI to biotechnology carries profound promise for national power. From stronger, bio-based armor for U.S. warfighters to patching supply chain vulnerabilities with domestic biomanufacturing, the potential is as vast as biology itself. The country that leads in AI-enabled biology will set the pace not only in health and medical discovery but also in agriculture, industrial production, and potentially even future deterrence. Seizing this potential, however, will hinge on improving America’s access to high-quality, secure biodata that is designed specifically for AI.

Biodata holds the blueprints of life and has become a new form of strategic power in the age of AI. These data, including DNA, RNA, proteins, and metabolites, are foundational to innovation in bio-based materials, fuels, agriculture, and medicine.

The National Security Commission on Emerging Biotechnology’s 2025 final report concludes that dominance in biotechnology will “hinge on who controls the most complete, accurate, and secure biological datasets.” Biodata is a strategic asset for national power in the 21st century, analogous to advanced semiconductors or critical minerals. U.S. competitors, chiefly China, are moving fast to establish AI-bio leadership.

China’s Biotech Edge

China’s advantage in AI-enabled biotechnology is not simply scale, but also coordination. Beijing’s national strategies explicitly link biotechnology, big data, and artificial intelligence under directed planning, aiming to align data generation, compute resources, and industrial translation across sectors. One example is China’s noninvasive prenatal testing ecosystem: The domestic market was valued at roughly $608 million in 2023 and is projected to exceed $1 billion by the end of the decade, reflecting widespread integration of genomic sequencing, hospital networks, and commercial bioinformatics services. Firms such as BGI Group operate large-scale sequencing and testing platforms (including the noninvasive fetal trisomy test) that generate and process substantial volumes of genomic data within an integrated ecosystem that spans clinical care, research, and industry. China has also rapidly expanded its domestic cell- and gene-therapy ecosystem, including multiple chimeric antigen receptor (CAR) T-cell therapy approvals and a growing clinical biomanufacturing base, shortening the path from research to deployment. At the same time, China is building the data substrate that makes compounding AI-bio gains possible: massive longitudinal health cohorts and national-level biodata platforms designed for large-scale integration and analysis.

At the infrastructure level, the China National GeneBank DataBase functions as a unified biological big-data portal, providing archival, sharing, and analysis services for multi-omics datasets under a coordinated national framework. By aggregating data from diverse projects into a centralized platform supported by cloud computing and bioinformatics tools, the database lowers transaction costs between data collection, model development, and downstream applications. Rather than a single dramatic breakthrough, the strategic advantage lies in the system itself — an ecosystem in which state-directed data collection, AI model training, and industrial production reinforce one another. In national security terms, this coordination reduces the friction between discovery and operational application, whether in biodefense preparedness, supply chain resilience, or advanced biomaterials.

The United States, by contrast, possesses world-class public biodata repositories, whose data China regularly imports. The National Library of Medicine’s National Center for Biotechnology Information hosts foundational databases such as GenBank, the Sequence Read Archive, and the database of Genotypes and Phenotypes, which collectively store vast volumes of genomic and biomedical data and underpin global open science. The distinction between the United States and China is not the absence of data infrastructure, but the degree of integration and strategic alignment. U.S. repositories were designed primarily for archival access and scientific openness rather than coordinated industrial translation or AI-native optimization, and governance, interoperability, and commercialization pathways remain distributed across agencies, universities, and private actors.

China’s centralized model is not without weaknesses. Rapid expansion of biomedical data generation has strained management and integration capacity, creating quality-control and governance challenges. Moreover, overlapping regulatory regimes governing human genetic resources impose strict review requirements for cross-border data transfer, complicating international collaboration and potentially constraining interoperability with global research ecosystems. Independent analyses also note that heterogeneous biomedical data sources and inconsistent standards can introduce bottlenecks that limit the practical utility of large, aggregated datasets. Centralization can reduce coordination costs, but it can also amplify systemic vulnerabilities (including personal identification, data integrity, and other cybersecurity risks) if verification, auditing, and governance mechanisms lag behind scale.

The lesson for the United States is not to replicate China’s system, but to recognize that coordination, not just innovation, determines AI-bio leadership. In an era where AI performance compounds with data scale and integration, institutional design — not just scientific talent — becomes a competitive variable. The National Security Commission on Emerging Biotechnology concluded that the United States has a roughly three-year window to reassert biotechnology leadership or risk ceding profound military, geopolitical, and economic advantages to China. U.S. policymakers should thus act quickly to patch some key vulnerabilities.

Current U.S. Challenges

The United States lacks the biodata diversity, quality, interoperability, and security required to build a globally competitive AI-bio ecosystem. These weaknesses are often conflated but are analytically distinct. Data diversity determines whether models generalize across populations and environments; data quality affects signal-to-noise and reliability; interoperability governs whether datasets can be combined; and security determines who can access sensitive information and under what conditions. Progress on one dimension does not guarantee progress on others.

Diversity

Many foundational genomic datasets remain disproportionately composed of individuals of European ancestry, limiting model performance across populations and environments. AI systems trained on homogeneous data underperform in diverse settings, reducing both equity and operational resilience. The challenge is not simply volume, but representativeness across genetic, geographic, environmental, and experimental conditions.

Quality

Biological datasets are frequently noisy, inconsistently annotated, or collected under heterogeneous conditions. Missing metadata, batch effects, and inconsistent use of technology or sampling protocols can significantly degrade downstream analytical performance. Large datasets do not automatically produce reliable models if provenance and validation standards are weak.

Interoperability

Even high-quality data are of limited value if they cannot be integrated. Multiple biomedical repositories operate using different formats and ontologies, and efforts such as the Global Alliance for Genomics and Health exist precisely to create shared standards because genomic and phenotypic data are difficult to harmonize across systems. The problem extends beyond medicine: Without agreed-upon technical frameworks, cross-domain analysis in agricultural genomics or bioindustrial phenotyping becomes far costlier and slower, increasing the “integration tax” that must be paid before AI models can be deployed.

Security

Aggregated biodata systems create high-value targets. But fragmentation does not eliminate risk; it just diffuses accountability. Biotechnology supply chains and food-sector infrastructure have faced escalating cyber threats, including ransomware targeting operational systems. Sequencing platforms and laboratory information management systems can expose sensitive intellectual property and bioindustrial processes if inadequately secured. As biodata becomes AI-ready and more tightly linked to industrial processes, the attack surface expands.

Most U.S. biodata is not built for AI. AI systems such as ChatGPT and AlphaFold were trained primarily on “found” data: data that were readily available on the Internet but not intentionally structured for model optimization. Open science has been a U.S. strength, yet much of this data reflects inconsistencies in collection, documentation, and context. Without deliberate curation, shared standards, and AI-native architecture, models risk amplifying noise rather than extracting biological signal.

Biomedical AI has produced meaningful gains, particularly in drug discovery and diagnostics. Companies like Insilico Medicine, Recursion Pharmaceuticals, and DeepMind’s Isomorphic Labs illustrate the promise of AI-enabled therapeutics. But health applications represent only a fraction of the biotechnology ecosystem. Limited support for agriculture, manufacturing, energy, defense, and environmental resilience leaves critical domains underrepresented in national biodata, constraining the development of robust cross-sector AI models.

In the absence of a national strategy to coordinate public and private biodata collection, labeling, and storage, individual stakeholders operate autonomously, following their own practices and standards. The result is a fragmented and disorganized trove of biodata and subpar AI models that can obscure underlying biological signals and produce unreliable outputs, a structural weakness in an era of AI-enabled biology.

The Case for Action

These challenges have produced a biodata ecosystem that is ill-prepared for the AI future and threatens America’s biotechnology leadership. The Trump administration has taken several steps to remedy these challenges, but all ultimately fall short.

The recently released National Security Strategy and National Defense Authorization Act acknowledge biotechnology as a critical domain of geopolitical competition and outline initiatives to better integrate biotech into U.S. defense and national security systems. But acknowledgements and pilot programs do not constitute a strategy, especially in light of the National Security Commission on Emerging Biotechnology’s 2025 warning that the U.S. scientific ecosystem is stagnating.

The Genesis Mission executive order signals growing awareness of AI-accelerated science, but without AI-ready biodata, it risks scaling existing fragmentation. Further, Congress has not yet paired Genesis with the investments and standards needed to make biological automation meaningful. Automated data generation without interoperable schemas, quality controls, and secure compute-to-data environments simply produces more inconsistent data faster.

In short, the problem is not a lack of awareness but a lack of scale and coordination. The United States is still trying to compete in AI-enabled biology without the inputs that make AI work: high-quality, secure biodata. Without serious investment in AI-ready biodata and the infrastructure to generate and govern it, federal initiatives will remain disconnected and slow-moving, even as competitors industrialize the AI-bio ecosystem. The United States now needs a focused set of actions that match the pace of the technology and the stakes of the competition.

The Path Ahead

Critics will argue that private firms are best positioned to build AI-ready biodata and that premature standardization risks slowing innovation. In narrow domains, this is often true. But firms rationally invest in datasets that are proprietary, short-horizon, and commercially monetizable, leaving gaps in longitudinal, cross-sector, and public-interest data. These gaps are precisely where national security and long-term competitiveness are decided. Public investment is not a substitute for private innovation — it is the substrate that makes private innovation scalable and transferable.

At the same time, centralizing AI-ready biodata introduces real risks. Biology is inherently dual-use. The same datasets and models that accelerate vaccine development, industrial enzymes, or climate-resilient crops can also lower barriers to harmful biological design if misused. Aggregated biological datasets, therefore, become high-value cyber and insider targets, and poorly governed access could expand both espionage and adversarial exploitation risks. A national biodata strategy should pair coordination with robust compute-to-data controls, auditability, red-teaming, and tiered access frameworks. The choice is not between innovation and security, but between unmanaged fragmentation that multiplies vulnerabilities and deliberately engineered coordination designed with security, access controls, and accountability from the start.

Treat biodata as critical national infrastructure

Congress should direct the Department of Energy, in coordination with the National Institute of Standards and Technology, National Institutes of Health, and other domain-relevant agencies, to fund AI-ready biodata commissioning: large, longitudinal datasets; standardized metadata; provenance tracking; shared security classifications; and sustained maintenance. The private sector cannot produce comprehensive, unbiased, or long-horizon datasets alone. Public investment is essential to capitalize on America’s academic strengths.

Build a secure national compute-to-data portal

A federated portal should allow vetted users to bring algorithms to sensitive datasets, using privacy-preserving machine learning, differential privacy, and continuous monitoring. Federated compute-to-data models manage, but do not eliminate, the tradeoffs between security and openness. By allowing algorithms to move while data remains in controlled environments, these systems can support sensitive civilian, defense, and commercial datasets without broad release. However, such models require sustained investment in technology infrastructure, governance, auditing, and user vetting to avoid becoming new bottlenecks.
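To make the compute-to-data idea concrete, the following is a minimal sketch, not a description of any existing federal system. It assumes a hypothetical `SecureEnclave` class that holds records in place, lets a vetted user submit a counting query, returns only a noised result using the standard Laplace mechanism for differential privacy, and logs every request. The class name, the privacy-budget policy, and the record fields are all illustrative assumptions.

```python
import random
from dataclasses import dataclass, field


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


@dataclass
class SecureEnclave:
    """Hypothetical controlled environment: raw records are never returned."""
    records: list
    epsilon_budget: float  # total privacy budget (assumed policy, not a standard)
    audit_log: list = field(default_factory=list)

    def run_count(self, user, predicate, epsilon, rng=random):
        """Run a user-supplied counting query and return only a noised count."""
        if epsilon <= 0 or epsilon > self.epsilon_budget:
            # Denied requests are logged too, so auditors can see every attempt.
            self.audit_log.append((user, "denied", epsilon))
            raise PermissionError("invalid epsilon or privacy budget exhausted")
        self.epsilon_budget -= epsilon
        true_count = sum(1 for record in self.records if predicate(record))
        # A counting query has sensitivity 1, so the Laplace scale is 1 / epsilon.
        noisy_count = true_count + laplace_noise(1.0 / epsilon, rng)
        self.audit_log.append((user, "count", epsilon))
        return noisy_count


if __name__ == "__main__":
    # Toy records with a made-up field; real biodata would be far richer.
    enclave = SecureEnclave(
        records=[{"ancestry": "A"}, {"ancestry": "B"}, {"ancestry": "A"}],
        epsilon_budget=1.0,
    )
    answer = enclave.run_count("vetted_user", lambda r: r["ancestry"] == "A", 0.5)
    print(f"noised count: {answer:.2f}; remaining budget: {enclave.epsilon_budget}")
```

The design choice worth noting is that the algorithm travels to the data and only an aggregate, noised answer travels back. A production system would add user vetting, stronger privacy accounting, and hardened infrastructure, which is where the sustained investment described above comes in.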

Convert National Defense Authorization Act pilots into binding national standards

Congress should require interoperable metadata, auditability, security classifications, and shared provenance rules across all federally funded biodata. Converting promising pilots, such as those in Sections 244 and 245 in the Fiscal Year 2026 National Defense Authorization Act, into standards requires clear authority. Congress should direct the Office of Science and Technology Policy and the National Institute of Standards and Technology, in coordination with the Department of Energy, National Institutes of Health, and other relevant mission agencies, to define baseline biodata metadata, provenance, and security standards as a condition of federal funding. Compliance would be enforced not through regulation alone, but through procurement rules, grant and contract requirements, and data-sharing eligibility mechanisms already familiar to universities and contractors.
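As an illustration of what a baseline metadata and provenance requirement could look like in practice, consider the sketch below. Everything in it is an assumption made for the example: the required field names, the three security tiers, and the hash-chaining scheme are hypothetical and are not drawn from the National Defense Authorization Act or any existing agency standard.

```python
import hashlib
import json

# Hypothetical baseline fields; a real standard would be set by the agencies.
REQUIRED_FIELDS = ("dataset_id", "organism", "assay", "collected_by", "security_tier")
# Hypothetical security tiers, for illustration only.
ALLOWED_TIERS = {"open", "controlled", "restricted"}


def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)
    ]
    tier = record.get("security_tier")
    if tier and tier not in ALLOWED_TIERS:
        problems.append(f"unknown security tier: {tier}")
    return problems


def provenance_entry(record, payload, previous_hash=""):
    """Hash metadata and data together, chained to the prior entry's hash,
    so that later changes to either are detectable on audit."""
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(previous_hash.encode("utf-8") + body + payload).hexdigest()
    return {
        "dataset_id": record.get("dataset_id"),
        "hash": digest,
        "previous": previous_hash,
    }


if __name__ == "__main__":
    record = {
        "dataset_id": "example-001",
        "organism": "Zea mays",
        "assay": "RNA-seq",
        "collected_by": "Example Lab",
        "security_tier": "controlled",
    }
    print(validate_metadata(record))  # no problems for this complete record
    print(provenance_entry(record, b"ACGT")["hash"])
```

A funding agency could run checks of this kind at deposit time, which is one way compliance might be tied to grant and contract requirements rather than to after-the-fact regulation.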

Align Genesis with a national effort to generate AI-ready biodata

The Genesis Mission executive order provides a blueprint for AI-accelerated scientific discovery, but its biotechnology ambitions will falter unless paired with a coordinated national effort to generate AI-ready biodata. The National Institute of Standards and Technology, working with other relevant domain-specific agencies, should spearhead federal automation initiatives to produce standardized, high-quality biological datasets for AI model development; integrate secure measurement systems; and create continuous data pipelines that feed model training.

The Stakes

Biology is becoming programmable, and AI is the compiler. The nation that leads in AI-bio will shape global standards, industrial supply chains, medical innovation, climate resilience, and biodefense. Sustained investment at the scale of tens of billions of dollars over a decade is not extraordinary in context. The United States requested roughly $27 billion in biodefense-related spending for FY2026 alone, while the domestic bioeconomy already supports an estimated $830 billion in annual economic impact across food, agriculture, and biomanufacturing. The real question is not whether the investment is large, but whether the United States is willing to risk becoming dependent on foreign-controlled bioindustrial supply chains for medicines, materials, and critical inputs that sustain both its economy and its military.

Without large-scale public investment in AI-ready biodata, America will fall behind competitors who are already building what we have not yet begun. Losing this race is a choice, and the window to reverse course is closing. America can still lead. But leadership requires building, now, the biodata and related infrastructure that will define the biotech century.

Michelle Holko, PhD, PMP, is a scientist and strategic innovator working at the intersection of biology, technology, and security. She has served as a White House Presidential Innovation Fellow and led projects at DARPA, NIH, DHS CISA, HHS BARDA, and the Department of Defense’s Chemical and Biological Defense Program, with expertise in genomics, bioinformatics, biosecurity, and emerging health technologies. She is currently Principal Scientific Advisor at Computercraft Corporation, an adjunct senior fellow at the Center for a New American Security, and a Principal Investigator at the International Computer Science Institute (ICSI) at Berkeley, and has previously held leadership roles at Google and advised the National Security Commission on Emerging Biotechnology.

John Wilbanks is a researcher and entrepreneur whose career across nonprofit, academic, and commercial sectors has focused on making scientific and healthcare data more accessible and useful. He founded Incellico, a knowledge graph company serving the pharmaceutical industry, and later led Science Commons and helped spin Sage Bionetworks out of Merck, advancing governance and consent models for large-scale biomedical data sharing. He has since held industry leadership roles at Biogen Digital Health and the Broad Institute, where he led product for Terra, a leading cloud platform for genomic analysis.

Sam Howell is an Associate Fellow with the Technology and National Security Program at the Center for a New American Security. Her research interests include biotechnology, quantum information science, and human performance enhancement.

Image: Sidney Hinds via DVIDS.