The Shape of Data:
How Topology Finds Hidden Patterns in a World of Information
What if the most important fact about a dataset isn’t a number, a label, or a trend — but a shape?
I. The Question Nobody Asked
Imagine you receive a box containing ten thousand marbles scattered in every direction. Each marble is a different color and carries a different number. You could count them. You could sort them by shade, average their values, map their distribution on a histogram. These are all legitimate moves in the classical statistician’s playbook — and they would tell you a great deal. But they would not tell you the single most interesting thing about the arrangement: whether the marbles form a ring, a cluster, a tangled knot, or a string of loops. They would not reveal the shape.
This is the problem that a branch of mathematics called Topological Data Analysis (TDA) was built to solve. At its core, TDA asks a deceptively simple question: What is the shape of a dataset? It then uses centuries-old mathematical machinery — refined for the age of computing — to answer that question in a way that is precise, provably robust to noise, and applicable to data in hundreds of dimensions at once.
The stakes are higher than they might first appear. We now live in an era of staggering information abundance. Every day, sensors, satellites, genomes, social networks, and neural networks generate oceans of data so vast that no human analyst could hope to inspect them directly. The traditional approach (fitting equations to data, reducing dimensions, measuring correlations) still works, but it has blind spots. It can miss the skeleton, the scaffolding, the hidden architecture of a dataset. Topology, the mathematics of shape, offers a new lens. And the researchers in biology, medicine, and computer science who are beginning to look through it are seeing patterns that had been there all along but had gone unnoticed.
“An important feature of modern science and engineering is that data of various kinds is being produced at an unprecedented rate. The nature of the data we are obtaining is significantly different — often very high-dimensional, where we don’t necessarily know which coordinates are the interesting ones.”
— Gunnar Carlsson, Topology and Data, Bulletin of the American Mathematical Society, 2009
II. A Very Old Idea Finds a Very New Purpose
The Bridges of Königsberg (1736)
The story of topology begins not with data, but with a puzzle about bridges. The Prussian city of Königsberg (now Kaliningrad, Russia) was divided by the Pregel River into four distinct landmasses connected by seven bridges. The local question — can you cross each bridge exactly once and return to your starting point? — seems like a geographic problem. But the Swiss mathematician Leonhard Euler, in a 1736 paper, realized it was something deeper. The distances didn’t matter. The shapes of the islands didn’t matter. All that mattered was the pattern of connections. He proved the walk was impossible, and in doing so invented an entirely new kind of mathematics — one concerned not with measurement, but with structure and connectivity.
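Euler's reasoning can be checked in a few lines. The sketch below encodes the seven bridges as pairs of landmasses and applies the criterion that follows from his argument: in a connected network, a closed walk that uses every connection exactly once requires every landmass to touch an even number of bridges. The single-letter labels (N and S for the two banks, K for the central island, E for the eastern landmass) and the function names are illustrative assumptions, not notation from Euler's paper.

```python
from collections import Counter

# The seven bridges, each recorded only as the two landmasses it joins.
# Labels are illustrative: N/S = river banks, K = central island, E = eastern landmass.
BRIDGES = [
    ("N", "K"), ("N", "K"),
    ("S", "K"), ("S", "K"),
    ("N", "E"), ("S", "E"),
    ("K", "E"),
]

def degrees(edges):
    """Count how many bridge ends touch each landmass."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)

def has_closed_euler_walk(edges):
    """Even-degree test; assumes the network is connected (true for Königsberg)."""
    return all(d % 2 == 0 for d in degrees(edges).values())

print(degrees(BRIDGES))
print(has_closed_euler_walk(BRIDGES))
```

Every landmass touches an odd number of bridges (the island five, the others three), so the test fails. Nothing in the code refers to distances or to the shapes of the islands, which is exactly the point of Euler's abstraction.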
The word “topology” itself was coined in 1847 by the German mathematician Johann Listing, though the field only solidified with the great French polymath Henri Poincaré. In 1895, Poincaré published Analysis Situs (“Analysis of Position”), the founding text of modern algebraic topology. Poincaré introduced the concept of homology — a way of counting the holes, loops, and voids in a shape — and gave mathematicians the algebraic tools to calculate them. A donut, for instance, has one hole through its center; a coffee mug also has one hole (through its handle). They are, topologically speaking, the same object. A sphere has zero holes; a figure-eight surface has two.
From Pure Math to Data: The Long Road (1990–2002)
For most of the twentieth century, topology remained a purely theoretical pursuit: beautiful and rigorous, but far removed from anything as mundane as a spreadsheet. The bridge to data began forming in 1990, when the Italian mathematician Patrizio Frosini introduced size functions, a way of measuring topological differences between shapes. Nearly a decade later, the Australian mathematician Vanessa Robins used topological ideas to study the shapes of attractors in physical systems, the underlying patterns traced out by chaotic behavior over time.
The decisive leap came in 2002, when Herbert Edelsbrunner, David Letscher, and Afra Zomorodian published “Topological Persistence and Simplification,” the paper that introduced persistent homology groups and established the modern computational framework of TDA. Three years later, Gunnar Carlsson of Stanford University, already known for proving the Segal Conjecture, a deep problem in homotopy theory, turned his attention to this new tool. His landmark 2009 paper, “Topology and Data,” published in the Bulletin of the American Mathematical Society, brought TDA to the broader scientific world and helped launch what is now a rapidly growing international field. Carlsson also co-founded Ayasdi, a company that applies TDA methods to real-world problems in medicine, finance, and intelligence.
III. Vocabulary: Key Terms and Concepts
The following terms appear throughout this essay and throughout the TDA literature. If you encounter them in research papers or popular science writing, this glossary should clarify what they mean in plain language.
- Topology
- The branch of mathematics that studies properties of shapes that survive stretching, bending, or twisting — but not tearing or gluing. Think of it as the geometry of connectivity rather than measurement. A circle and a square are topologically the same; a circle and a figure-eight are not.
- Point Cloud
- A set of data points in space, each described by coordinates. Could be three-dimensional (a 3D scan), two-dimensional (GPS locations on a map), or any number of dimensions. TDA treats datasets as point clouds and studies their shape.
- Homology
- A mathematical tool that counts the “holes” of different dimensions in a shape: 0-dimensional (connected pieces), 1-dimensional (loops or tunnels), 2-dimensional (enclosed voids), and so on. Invented by Poincaré, it turns qualitative shape information into computable numbers.
- Betti Numbers
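- The numerical outputs of homology. β₀ (beta-zero) counts connected components. β₁ counts loops or through-holes. β₂ counts enclosed empty cavities. A hollow sphere has β₀=1, β₁=0, β₂=1. The surface of a donut (a torus) has β₀=1, β₁=2 (two independent loops, one around the central hole and one around the tube), β₂=1 (the hollow interior of the tube). A solid donut has β₀=1, β₁=1, β₂=0.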
- Filtration
- The process of gradually building up a structure by increasing a scale parameter (like a distance threshold). Starting with just the data points, you slowly connect nearby points into edges, then triangles, then higher shapes — like slowly inflating bubbles around each point until they merge. Each stage captures the data’s shape at a different resolution.
- Simplicial Complex
- A geometric object built from simple building blocks: points (0-simplices), line segments (1-simplices), triangles (2-simplices), tetrahedra (3-simplices), and their higher-dimensional cousins. These are the shapes TDA builds from point clouds during filtration.
- Persistent Homology
- The central tool of TDA. It tracks which topological features (connected components, loops, voids) appear and disappear as a filtration progresses. Features that appear at a small scale and survive a long range are interpreted as “real” structure. Features that appear and vanish almost immediately are likely noise.
- Persistence Barcode
- A visual representation of persistent homology. Each topological feature is drawn as a horizontal bar; the bar begins at the filtration scale where the feature is “born” and ends where it “dies.” Long bars represent significant structure; short bars represent noise.
- Persistence Diagram
- An alternative visualization of the same information: each feature is plotted as a point in a 2D chart, with its birth scale on one axis and its death scale on the other. Points far from the diagonal (y = x) represent long-lived, significant features; points near the diagonal are fleeting noise.
- Topological Invariant
- A property of a shape that doesn’t change under continuous deformations. Betti numbers are topological invariants. The number of holes in a donut stays the same whether the donut is small, giant, squished, or twisted — as long as you don’t tear it or glue pieces together.
- Vietoris–Rips Complex
- A specific type of simplicial complex widely used in TDA. Given a distance threshold ε, you connect every pair of data points within distance ε with an edge, fill in a triangle for every triple of points that are pairwise within ε, and so on for larger groups. As ε grows, the complex grows, and this is one way to build a filtration.
- Mapper Algorithm
- A TDA tool for visualizing the global shape of high-dimensional data as a network graph. It “summarizes” a dataset by clustering overlapping slices and connecting clusters that share data points. Often used to discover unexpected subgroups in medical or biological data.
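The glossary entries for Filtration, Simplicial Complex, and Vietoris–Rips Complex can be made concrete with a small sketch. The function below builds the edges and triangles of a Vietoris–Rips complex at a single threshold ε, and the loop afterwards shows the complex growing as ε increases. The function name `rips_complex`, the unit-square data, and the choice to stop at triangles are assumptions made for illustration; production libraries use far more efficient constructions.

```python
from itertools import combinations
from math import dist

def rips_complex(points, eps):
    """Vietoris-Rips complex at scale eps, truncated at dimension 2.

    Returns (vertices, edges, triangles), each as tuples of point indices.
    """
    vertices = list(range(len(points)))
    # An edge joins every pair of points within distance eps.
    edges = [(i, j) for i, j in combinations(vertices, 2)
             if dist(points[i], points[j]) <= eps]
    edge_set = set(edges)
    # A triangle is filled in when all three of its sides are present.
    triangles = [t for t in combinations(vertices, 3)
                 if all(pair in edge_set for pair in combinations(t, 2))]
    return vertices, edges, triangles

# Toy point cloud: the four corners of a unit square (hypothetical data).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# A crude filtration: the same points viewed at three increasing scales.
for eps in (0.5, 1.0, 1.5):
    v, e, t = rips_complex(square, eps)
    print(eps, len(v), len(e), len(t))
```

At ε = 0.5 the square is four isolated points. At ε = 1.0 the four sides appear and enclose a loop. At ε = 1.5 the diagonals join in and four triangles fill the interior.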
IV. Loops, Holes, and the Geometry of Connection
To understand why topology is useful for data, we first need to understand what topology actually measures. The core insight is this: some properties of a shape survive any amount of stretching and bending, while others don’t. Distance changes when you stretch. Angles change when you bend. But the number of holes in a shape — the number of distinct loops you can trace that cannot be shrunk to a point — stays constant unless you physically cut or glue.
Consider a rubber band. If you lay it flat on a table, it forms a closed loop. You can squish it into a circle, an oval, a rectangle, a wiggly blob, but as long as you don’t cut it or fold it over itself and fuse the layers, that fundamental loop persists. Topology would say: this shape has β₁ = 1 (one one-dimensional hole). A figure-eight, made by pinching the band at its middle and fusing the two sides together there, has β₁ = 2. A rubber sheet with no hole has β₁ = 0.
Now apply this thinking to data. Suppose you’re analyzing the positions of stars in a galaxy. Plotted in two dimensions, the data might form a ring structure — a circular arrangement of clusters with a gap in the middle. Traditional statistics might capture the average position, the spread, the density. What it might miss is the ring itself — the global circular topology that tells you the distribution has a hole. Persistent homology would catch this immediately as a long-lived β₁ feature.
V. Persistent Homology — Reading the Shape of a Cloud
The central challenge of applying topology to data is this: data is always finite, noisy, and imprecise. Classical topology describes continuous shapes, and a finite set of points, taken literally, has no loops or voids at all. How do you compute the “holes” in a cloud of ten thousand GPS points when there is measurement error, missing readings, and irregular spacing? Persistent homology is the answer: a method that measures topological features across a whole range of scales at once and separates the features likely to be real from those likely to be noise.
The process has three conceptual stages: filtration (building the structure gradually), tracking (noting when features appear and disappear), and summarizing (encoding results as a barcode or diagram).
The key output of this process — the persistence barcode — is a compact visual summary. Each feature in the data gets one horizontal bar. The bar starts where the feature is born (at a small ε) and ends where it dies (when it disappears or fills in). A long bar means the feature persisted across many scales, suggesting it reflects something genuinely structural in the data. A short bar that barely appears before vanishing is most likely measurement noise. As Robert Ghrist, a mathematician at the University of Pennsylvania and one of TDA’s leading expositors, wrote: “The genius of a barcode representation is the ability to qualitatively filter out topological noise and capture significant features.”
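Reading a barcode comes down to simple arithmetic on birth and death scales. The sketch below represents each bar as a (birth, death) pair, computes its lifespan, and splits the bars at a cutoff. The bar values, the function name `split_by_persistence`, and the cutoff of 1.0 are all hypothetical; choosing the cutoff is a judgment call that depends on the data, and nothing here prescribes a rule for it.

```python
def split_by_persistence(bars, min_length):
    """Split (birth, death) bars into long-lived and short-lived groups.

    A bar's lifespan is death - birth; min_length is a user-chosen cutoff.
    """
    significant, noise = [], []
    for birth, death in bars:
        if death - birth >= min_length:
            significant.append((birth, death))
        else:
            noise.append((birth, death))
    return significant, noise

# Hypothetical barcode for 1-dimensional features (loops).
bars = [(0.20, 0.25), (0.30, 0.32), (2.0, 4.0), (0.90, 1.05)]

significant, noise = split_by_persistence(bars, min_length=1.0)
print("long bars: ", significant)
print("short bars:", noise)
```

With these made-up numbers, only the bar from 2.0 to 4.0 survives the cutoff, and the three short bars are set aside as probable noise.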
VI. A Worked Example: Finding a Ring in the Data
Let us walk through a concrete, simplified example. The goal is to show how persistent homology detects a loop (a circular arrangement) in a set of data points — step by step, using no more than basic intuition.
Suppose eight sensors are arranged roughly in a ring. Draw the 8 sensor positions on a 2D grid and label them A through H. They form an approximate circle with a gap in the center. At this stage, with ε = 0, all you have is 8 isolated points: no edges, no connections, no structure. The Betti count is β₀ = 8 (eight disconnected components), β₁ = 0 (no loops).
Connect every pair of points that are within 1.5 units of each other with a line (edge). At this small radius, only nearest neighbors connect. Points A-B, B-C, C-D, D-E, E-F, F-G, and G-H each get connected; H and A sit slightly farther apart and remain unlinked. Now β₀ = 1 (all points joined into one component), but β₁ = 0 (the ring is not yet “closed”; it still looks like a chain, not a loop).
As the radius grows slightly, the last gap in the ring closes: H connects to A. Now we have a complete cycle of edges enclosing an empty interior. The algorithm detects this event and records: a 1-dimensional hole is born at ε = 2.0. At this moment, β₁ = 1. There is a loop. The ring structure has been detected.
At larger ε, every point connects to almost every other point. Diagonal chords form across the center: A connects to E, B connects to F, etc. These cross-connections create triangles that fill in the hole. The algorithm records: the loop dies at ε = 4.0. β₁ returns to 0.
The β₁ feature was born at ε = 2.0 and died at ε = 4.0. Its lifespan is 2.0 units, a long, stable bar in the barcode. If the sensors had been positioned randomly and any ring was coincidental, this bar would be very short (dying within 0.1 units of its birth, or less). The long lifespan is your evidence that there is genuine circular structure in the sensor placement. The shape is real.
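The walkthrough can be reproduced numerically, at least in spirit. The sketch below places eight points on a regular octagon of radius 2 (an assumption: the essay gives no coordinates, and a regular octagon closes all eight sides at the same scale, so the exact birth and death values differ from the illustrative 2.0 and 4.0 above). It builds the Vietoris–Rips complex at a given ε and computes β₀ and β₁ from the ranks of the boundary matrices, working modulo 2. All function names are hypothetical, and a real analysis would use a dedicated library rather than this brute-force approach.

```python
from itertools import combinations
from math import cos, sin, pi, dist

def rank_gf2(columns):
    """Rank over GF(2) of a matrix whose columns are integer bitmasks."""
    pivots = {}  # highest set bit -> reduced column owning that pivot
    rank = 0
    for col in columns:
        while col:
            high = col.bit_length() - 1
            if high not in pivots:
                pivots[high] = col
                rank += 1
                break
            col ^= pivots[high]  # cancel the leading bit and keep reducing
    return rank

def betti_numbers(points, eps):
    """Return (beta0, beta1) of the Vietoris-Rips complex at scale eps."""
    n = len(points)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    triangles = [t for t in combinations(range(n), 3)
                 if all(p in edge_index for p in combinations(t, 2))]
    # Boundary of an edge: its two endpoints. Boundary of a triangle: its three sides.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [sum(1 << edge_index[p] for p in combinations(t, 2)) for t in triangles]
    r1, r2 = rank_gf2(d1), rank_gf2(d2)
    return n - r1, len(edges) - r1 - r2

# Eight hypothetical sensors, evenly spaced on a circle of radius 2.
ring = [(2 * cos(2 * pi * k / 8), 2 * sin(2 * pi * k / 8)) for k in range(8)]

for eps in (1.0, 2.0, 3.0, 5.0):
    print(eps, betti_numbers(ring, eps))
```

Under these assumptions the scan reports (8, 0) at ε = 1.0, (1, 1) at ε = 2.0 and 3.0, and (1, 0) at ε = 5.0. The loop appears once neighbors link up, survives a wide range of scales, and is filled in once long chords cross the center, which is the same life story the barcode records.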
VII. Where Topology Meets the Real World
Once the theoretical foundations were in place, researchers began testing TDA on real problems — and the results were striking. The method has found productive application in biology, medicine, artificial intelligence, and climate science, often uncovering patterns that had eluded conventional analysis for years.
🧬 Cancer Genomics
In 2011, Carlsson and collaborators used the Mapper algorithm to analyze gene expression data from breast cancer patients. They discovered a previously unidentified subgroup of patients — a small cluster with a distinct molecular profile and, crucially, a 100% survival rate in the dataset. This subgroup had been invisible to standard clustering methods. TDA’s shape-aware approach revealed it because the subgroup occupied a topologically distinct region of the high-dimensional data space — a branch that conventional methods had collapsed into nearby clusters.
🦠 Viral Evolution
Researchers at Columbia University applied TDA to track the evolution of influenza and HIV. By treating genomic sequences as points in a high-dimensional space and computing their persistent homology, they could detect recombination events — moments when viral strains exchange genetic material — as topological loops in the data. These loops reflect the non-tree-like structure of evolution when genes mix between strains, a pattern that standard phylogenetic (family-tree) methods were poorly equipped to capture.
🤖 Artificial Intelligence
Inside a neural network, layers of computation transform input data through thousands of numerical dimensions. TDA is now being used to map the “shape” of these internal representations — to understand whether a network has learned genuinely structured knowledge or merely memorized training examples. Researchers have used persistent homology to identify early stopping points in training, detect compromised (trojaned) models, and analyze the topology of the latent spaces of generative AI models, asking: what is the geometric architecture of the knowledge a large language model has built?
🌍 Climate Science
In atmospheric science, persistent homology has been applied to detect weather regimes — large-scale, recurring patterns of atmospheric circulation such as the North Atlantic Oscillation. A 2022 study published in Climate Dynamics argued that the defining feature of weather regimes is precisely their non-trivial topology: they occupy distinct, hole-like regions of the atmospheric phase space. TDA-based methods have also been used to identify atmospheric rivers — the long, filamentary moisture streams responsible for extreme precipitation — with greater speed and lower false-positive rates than previous techniques.
🏦 Finance and Markets
Financial time series have been studied through the lens of TDA as a way of detecting market phase transitions — the moments just before a crash or bubble burst when the structure of market correlations changes in characteristic ways. The topology of correlation matrices, tracked over time, can reveal instabilities that are invisible in standard volatility measures. EPFL researchers have developed TDA-based early warning systems for systemic market shifts, using topological signatures of correlation data to flag anomalous states before they become apparent in prices.
🧠 Neuroscience
The brain encodes information in the collective firing patterns of millions of neurons, a high-dimensional process that is notoriously difficult to characterize. TDA researchers have used persistent homology to study the topology of neural activity spaces, revealing that certain brain regions encode stimuli in geometrically structured ways: the population activity of grid cells in the entorhinal cortex, for instance, appears to trace out a toroidal (donut-shaped) structure as an animal moves through space. This work suggests the brain may use topologically structured codes to represent information.
“Topological data analysis provides a promising path forward: using tools from the mathematical field of algebraic topology, TDA processes data in a manner that is inherently robust to noise and to particular choices of parameters.”
— Applications of Topological Data Analysis in Oncology, PMC / Frontiers in Physiology, 2021
VIII. Topology Meets Deep Learning
Perhaps the most exciting frontier is the intersection of TDA and modern artificial intelligence. As neural networks have grown larger and more capable — from BERT-style language models to image generators and scientific foundation models — the question of what these systems actually learn has become both urgent and obscure. They produce outputs that are impressive, sometimes startling, yet their internal workings remain largely opaque. TDA is emerging as one of the sharpest tools for opening that black box.
In 2017, researchers Zixuan Cang and Guo-Wei Wei reported the first integration of persistent homology with deep neural networks, creating what they called topological deep learning (TDL). Rather than treating topology as an external analysis tool, TDL embeds topological invariants directly into the learning architecture — effectively teaching the network to be aware of shape. This has proved especially powerful in molecular biology, where protein structures and drug interactions are fundamentally geometric, and where the scale of complexity would exhaust any conventional feature-engineering approach.
A 2025 review in Artificial Intelligence Review documents the breadth of this new synthesis. Persistent Laplacians, topological neural networks built on simplicial complexes, and hybrid approaches combining persistent homology with graph neural networks are now achieving state-of-the-art results in protein structure prediction, drug discovery, and materials science. The insight driving all of this work is consistent: the shape of the data is not just a curiosity. It is a compressed representation of the most stable, most meaningful structure in the information — the kind of signal that survives noise, survives parameter variation, and generalizes across tasks in ways that raw numerical features often do not.
“Topological data analysis has emerged as a powerful framework for extracting robust, multiscale, and interpretable features from complex molecular data for artificial intelligence modeling. By leveraging the high-level abstraction of topology, TDL dramatically reduces dimensionality, simplifies geometric complexity, and offers an interpretable learning framework.”
— A Review of Topological Data Analysis in Molecular Sciences, Journal of Chemical Information and Modeling, 2025
IX. Future Outlook: The Age of Shape
We are entering an era in which the volume of data generated by human civilization doubles roughly every two to three years. Large language models train on trillions of words. Genomic databases contain the DNA sequences of millions of organisms. Climate simulations run at resolutions fine enough to fill petabytes of storage. Astronomical surveys catalog billions of galaxies. In this context, the question of how to find signal in noise — structure in chaos — is not merely academic. It may be among the defining technical challenges of the coming decades.
TDA is not a replacement for statistics, machine learning, or domain expertise. It is a complement, a lens that reveals certain kinds of structure that other methods are not designed to see. Its chief advantages are robustness (small amounts of noise change a persistence diagram only slightly), coordinate-independence (the results depend on the distances between data points, not on the particular coordinate system used to record them), and multi-scale sensitivity (it examines structure across a range of scales rather than committing to one). These are precisely the virtues most needed when dealing with high-dimensional, noisy, and incompletely understood data.
The field is also maturing rapidly in its computational efficiency. Early implementations of persistent homology were computationally expensive enough to limit applications to small datasets. Libraries such as Ripser, GUDHI, and giotto-tda have dramatically reduced the computational overhead, making TDA practical on far larger datasets than before. Quantum computing researchers have also begun exploring quantum topological data analysis (QTDA), which could, in principle, offer exponential speedups for certain homology computations, though that work remains largely theoretical.
As AI systems generate ever-larger oceans of information — internal embeddings, activation patterns, latent representations — could understanding the shape of that information become more important than analyzing the individual data points themselves? Could the topological fingerprint of an AI system’s learned knowledge become the key to understanding, auditing, and improving it?
These are live questions, not rhetorical ones. A 2024 paper in the Notices of the American Mathematical Society from researchers at INRIA, Oxford, and LSCE makes a compelling case that topology is not merely useful for climate analysis but may be necessary — that the structure of weather regimes in the atmosphere’s phase space is fundamentally topological, and that no amount of statistical averaging will reveal it without topological tools. Similar arguments are now being made for neural networks, for protein folding space, and for the representation of knowledge in large language models. If these arguments hold, then topology is not a niche technique. It is a foundational tool for understanding the information age itself.
X. A Philosophical Conclusion: Shape as Knowledge
There is something philosophically resonant about topology’s sudden relevance to data science. Mathematics is often described as the language in which the universe is written. But perhaps more precisely, different branches of mathematics capture different aspects of reality — arithmetic captures quantity, geometry captures space, probability captures uncertainty. What does topology capture?
Topology, I would argue, captures relationship. It is concerned not with how large things are, but with how they are connected; not with precise position, but with the essential architecture of proximity and separation. In an era when our most important datasets are fundamentally relational — social networks, protein interaction maps, neural activation patterns, ecological webs — topology’s moment may have been a long time coming.
Leonhard Euler’s insight at the bridges of Königsberg was that some problems are not fundamentally about measurement at all. Henri Poincaré’s insight was that algebra could be used to count the indestructible features of shape. Edelsbrunner, Carlsson, and their collaborators’ insight was that this counting could be done on real, noisy, finite data — and that what it found would be genuinely new. The patient mathematician, looking at a cloud of points and asking “what shape is this?” turns out to have been asking a question that statistics alone could never fully answer.
As the data deluge continues — as AI systems grow denser, genomes multiply, climate models deepen, and neural networks accumulate ever-stranger internal geometries — the ability to read shape will become an increasingly critical scientific skill. Not to replace the equation or the algorithm or the human analyst, but to stand beside them: another eye, trained on the topology of a world that has always had more structure than our instruments could previously see.
“The continued and dramatic rise in the size of data sets has meant that new methods are required to model and analyze them. Topological data analysis offers a method for modeling data by geometric objects — graphs and their higher-dimensional versions — that preserves global shape information that statistics alone cannot capture.”
— Gunnar Carlsson & Mikael Vejdemo-Johansson, Topological Data Analysis with Applications, Cambridge University Press
Sources and Further Reading
All ten sources listed below were consulted in the preparation of this essay and represent the current state of the TDA literature. Links are provided to open-access versions where available.
- Carlsson, G. (2009). Topology and Data. Bulletin of the American Mathematical Society, 46(2), 255–308.
  https://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/
- Perea, J. A. (2018). A Brief History of Persistence. arXiv:1809.03624.
  https://arxiv.org/abs/1809.03624
- Edelsbrunner, H., Letscher, D., & Zomorodian, A. (2002). Topological Persistence and Simplification. Discrete & Computational Geometry, 28(4), 511–533. (Foundational paper introducing persistent homology groups.)
  https://link.springer.com/article/10.1007/s00454-002-2885-2
- Ghrist, R. (2008). Barcodes: The Persistent Topology of Data. Bulletin of the American Mathematical Society, 45(1), 61–75.
  https://www2.math.upenn.edu/~ghrist/preprints/barcodes.pdf
- Carlsson, G. & Vejdemo-Johansson, M. (2021). Topological Data Analysis with Applications. Cambridge University Press.
  https://www.cambridge.org/core/books/topological-data-analysis-with-applications/
- Lawson, P., Sholl, A. B., Brown, J. Q., Fasy, B. T., & Wenk, C. (2019). Persistent Homology for the Quantitative Evaluation of Architectural Features in Prostate Cancer Histology. Scientific Reports, 9, 1139. See also: Brüningk, S. C. et al. (2021). Topology of data: opportunities for cancer research. Bioinformatics, 37(19), 3091–3102.
  https://academic.oup.com/bioinformatics/article/37/19/3091/6329825
- Strommen, K. (2022). A topological perspective on weather regimes. Climate Dynamics, 58, 3537–3551.
  https://link.springer.com/article/10.1007/s00382-022-06395-x
- Faranda, D., Lacombe, T., Otter, N., & Strommen, K. (2024). Climate Science at the Interface between Topological Data Analysis and Dynamical Systems Theory. Notices of the American Mathematical Society.
  https://inria.hal.science/hal-04396161v1
- Gu, W. et al. (2025). Topological data analysis and topological deep learning beyond persistent homology: a review. Artificial Intelligence Review.
  https://link.springer.com/article/10.1007/s10462-025-11462-w
- Li, J., Jiang, Z., et al. (2025). A Review of Topological Data Analysis and Topological Deep Learning in Molecular Sciences. Journal of Chemical Information and Modeling.
  https://pubs.acs.org/doi/10.1021/acs.jcim.5c02266
Historical topology background further informed by: MacTutor History of Mathematics, University of St Andrews (mathshistory.st-andrews.ac.uk); Britannica articles on Topology; and Wikipedia articles on Persistent Homology and Topological Data Analysis for cross-reference.
