Your Old Data Has New Stories to Tell

April 14, 2026
Gervaise Henry, VP of Solution Engineering, Manifold

I see a repeating pattern across medicine: data collected years ago, under the constraints of its time, turns out to contain far more than anyone originally extracted from it. The science wasn’t wrong. The tools and analytical capacity simply weren't there yet.

We're living through one of those inflection points right now. In my view, it's one of the most underappreciated opportunities in life sciences research.

Old Data, New Answers

Consider a few examples that show how dramatically the analytical landscape has shifted.

The UK Biobank began enrolling participants in 2006, collecting genetic samples, health records, and lifestyle data from half a million people. The dataset was valuable from the start, but the analyses that have emerged in the last few years (e.g., genome-wide association studies at scale and multi-trait analyses connecting dozens of conditions) simply weren't computationally or methodologically feasible when the data was first gathered.

The Cancer Genome Atlas (TCGA) was launched in 2006 to systematically characterize the genomic landscape of cancer. For years, its primary utility was in identifying mutations and copy number variations within single tumor types. Today, researchers are running cross-cancer analyses, integrating TCGA genomic data with transcriptomic, epigenomic, and clinical outcome data in ways that reveal biology that was invisible to the original investigators. What changed was our ability to reason across data.

Early radiology imaging datasets were collected before modern deep learning existed. Now, AI models can detect patterns and predict outcomes that no human radiologist, and no earlier algorithm, could have identified at the time. In some cases, those models are finding signals in images that predict disease progression years before clinical symptoms appear.

The value locked in existing data exceeds what was extractable when it was first collected. The constraint was the bandwidth, the tools, and the ability to integrate across sources.

What I Found When I Went Back to My Own Data

Back when single-cell RNA sequencing was still an emerging technology about a decade ago, I published a single-cell dataset as part of my academic research. Like most researchers working at the frontier of a new method, I was constrained by three things:

  1. the sheer bandwidth required to pursue every meaningful question the data raised
  2. the analytical tools available at the time
  3. access to the computational environments to run those analytical tools

You'd see the shape of an interesting hypothesis, but actually pursuing it required integrating across modalities, cross-referencing external cohorts, and connecting expression patterns to clinical outcomes. That was a significant undertaking, to say the least. Choices got made. A lot of potentially valuable analyses simply didn't happen.

I recently pulled that dataset back out and asked: what could we learn from it now, with today's technology in place?

I combined my original single-cell data with two additional sources: GTEx and TCGA — including their bulk RNA-seq, clinical metadata, and genomic data. That combined dataset was ingested into Manifold. It spanned expression profiles from healthy tissue, tumor tissue, and individual cells, linked to patient outcomes. Both GTEx and TCGA are deeply characterized datasets. Beyond expression, they include germline and somatic variant calls, copy number data, and tissue-level QTLs. That breadth is precisely the point. My single-cell data couldn't tell the whole story on its own; integrating it with the full richness of those resources is what makes it possible to draw insights that simply weren't accessible before.
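To make the shape of that integration concrete, here is a minimal, hypothetical sketch in pandas. None of this reflects the actual pipeline: the gene names, sample IDs, and outcome column are all invented. It only illustrates the basic pattern of collapsing single-cell counts into a pseudo-bulk profile so they are comparable to bulk RNA-seq, then joining bulk expression to clinical metadata on a shared sample axis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical gene symbols

# Hypothetical single-cell counts: 100 cells x 3 genes
sc_counts = pd.DataFrame(rng.poisson(5, (100, 3)), columns=genes)

# Collapse cells into one pseudo-bulk profile, comparable to bulk RNA-seq
pseudo_bulk = sc_counts.mean(axis=0).rename("sc_pseudobulk")

# Hypothetical TCGA-style bulk expression (samples x genes) and clinical table
samples = [f"SAMPLE-{i:02d}" for i in range(20)]
bulk = pd.DataFrame(rng.poisson(5, (20, 3)), columns=genes, index=samples)
clinical = pd.DataFrame(
    {"survival_months": rng.integers(6, 120, 20)}, index=samples
)

# Join expression and outcomes on the sample axis, then ask a simple
# question of the combined table: how does each gene track with outcome?
combined = bulk.join(clinical)
corr = combined[genes].corrwith(combined["survival_months"])
print(corr)
```

In a real analysis the join keys (harmonized gene identifiers, sample barcodes), normalization, and batch correction are where most of the work lives; the point of the sketch is only that once the modalities share common axes, cross-dataset questions become ordinary table operations.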

From there, I used Manifold's AI analysis agent to integrate these heterogeneous data sources into a unified analytical framework. The agent reasoned across the data, cross-referenced the published literature, and surfaced fully formed, evidence-supported hypotheses. Each one was specific, testable, and accompanied by the evidence trail behind it.

This is a markedly different output than what I was producing ten years ago. The data was sitting there the whole time. What changed was having the right technology to actually work with it.

The Talk That Opened Some Eyes

This demonstration was the centerpiece of a session I gave at NextGen Omics, Spatial & Data US 2026 in Boston, alongside Vinay Mohta, Manifold's CEO and co-founder, and Sami Farhi, Director of the Spatial Technology Platform at the Broad Institute of MIT and Harvard.

Vinay framed the core challenge: the promise of spatial and multimodal omics is enormous, but integration and interpretation at scale remain the bottleneck. Generating data is no longer the hard problem. What’s hard is making sense of the data across modalities, cohorts, and time.

Sami brought that to life from the Broad's perspective. His team has built Celldega, a cloud-native, open-source visualization tool designed specifically for large spatial datasets, and he walked through how it addresses one of the field's persistent problems: making spatial transcriptomics data accessible and interpretable to researchers beyond the bioinformatics core. Celldega runs on Manifold, meaning collaborators can explore and interact with spatial data directly, without any infrastructure setup or data transfers on their end.

And then I ran the demo. Live. On my actual decade-old data.

The response was much greater than I expected. The people seeking us out at the booth and after my talk were not just single-cell and spatial researchers, but teams doing broad multi-source integration across genomic, proteomic, and clinical data. These folks are often under real resource pressure, with budgets tighter than they've been in years and the expectation that they'll still generate meaningful discoveries on reasonable timelines.

There was intense interest in the idea of AI agents that could work alongside research teams: handling integrations, surfacing connections, and generating hypotheses that a small team would otherwise have to choose not to pursue. They were looking to extend scientific judgment, not hand it off.

That's what my demo showed. And I think that's why it landed so well.

Let's Have This Conversation

If you're sitting on datasets that were collected years ago, or running research programs where the analytical scope has grown faster than your team's capacity to execute it, I'd love to talk.

The opportunity in existing data is larger than most teams have had the chance to explore. The right environment to pursue it simply hasn't been there. That's what Manifold is built to address, and I'm happy to show you what it looks like in practice.

Reach out directly or book time with our team.

See Manifold in Action

Request a demo to see how Manifold connects your data to the teams that depend on it, replacing months of manual friction with minutes of governed, self-service access.

  - Platform walkthrough tailored to your data environment
  - Current workflow review and fit assessment
  - Implementation timeline and resource requirements