Kernel Methods for Molecules
Tanimoto kernels on Morgan fingerprints, Coulomb-matrix and SOAP descriptors for materials, FCHL atomic kernels, and GP regression on SMILES with string kernels as a calibrated baseline against GNNs.
Why This Matters
Before message-passing neural networks dominated molecular property benchmarks, the standard pipeline was a fixed descriptor plus a kernel machine. That pipeline did not go away. Kernel ridge regression on FCHL or SOAP still beats GNNs in the small-data regime that matters most in chemistry: a few hundred to a few thousand DFT-labeled structures of new chemistry, where neural models overfit and miscalibrate.
A positive-definite molecular kernel encodes a similarity prior good enough to work without training. A Gaussian process on top returns calibrated uncertainty, which matters when an active-learning loop picks the next DFT calculation or compound to synthesize.
Core Ideas
The most-used molecular kernel is Tanimoto on circular fingerprints. ECFP (Rogers-Hahn 2010, J. Chem. Inf. Model. 50) and the equivalent Morgan fingerprint encode each atom's neighborhood up to a radius $r$ (typically 2) as a hashed bit vector with $n = 1024$ or $2048$ bits. The Tanimoto similarity between fingerprints $\mathbf{a}$ and $\mathbf{b}$ is

$$k(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 - \mathbf{a} \cdot \mathbf{b}},$$

the count of shared on-bits divided by the count of bits set in either fingerprint.
This kernel is positive definite (Gower 1971), so it plugs directly into kernel ridge regression and Gaussian processes. The MinHash and SECFP extensions trade some bit collisions for better generalization on rare substructures.
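A minimal sketch of this pipeline, assuming RDKit and NumPy are installed; `morgan_bits` and `tanimoto_kernel` are illustrative names, and the Gram-matrix computation is the bit-vector identity above:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_bits(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """ECFP4-style hashed circular fingerprint as a dense 0/1 array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros(n_bits)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def tanimoto_kernel(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Gram matrix K[i, j] = x.y / (|x|^2 + |y|^2 - x.y) over binary rows."""
    dot = X @ Y.T
    sq_x = (X * X).sum(axis=1)[:, None]
    sq_y = (Y * Y).sum(axis=1)[None, :]
    return dot / (sq_x + sq_y - dot)

X = np.stack([morgan_bits(s) for s in ["CCO", "CCN", "c1ccccc1O"]])
K = tanimoto_kernel(X, X)  # positive definite, ones on the diagonal
```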
For materials and 3D chemistry the descriptor changes but the strategy does not. The Coulomb matrix (Rupp et al. 2012, Phys. Rev. Lett. 108) encodes a molecule as $M_{ij} = Z_i Z_j / \|\mathbf{R}_i - \mathbf{R}_j\|$ off the diagonal and $M_{ii} = \tfrac{1}{2} Z_i^{2.4}$ on the diagonal, sorted by row norm for permutation invariance. SOAP (Bartók-Kondor-Csányi 2013, Phys. Rev. B 87) expands a Gaussian-smoothed atomic neighbor density in radial functions and spherical harmonics, then takes the rotation- and permutation-invariant power spectrum. FCHL (Christensen et al. 2020, J. Chem. Phys. 152) builds atomic kernels from element-weighted two- and three-body distributions and is among the strongest kernel baselines on QM7b and QM9 atomization energies.
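The Coulomb matrix is simple enough to write out directly. A sketch in plain NumPy of the Rupp et al. definition; `coulomb_matrix` is an illustrative name:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """M_ij = Z_i * Z_j / |R_i - R_j| off the diagonal, M_ii = 0.5 * Z_i^2.4.
    Z: atomic numbers, shape (N,); R: Cartesian coordinates, shape (N, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dists = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    with np.errstate(divide="ignore"):  # diagonal is overwritten just below
        M = np.outer(Z, Z) / dists
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    return M

# Water: O at the origin, two H near the experimental geometry (angstrom).
M = coulomb_matrix([8, 1, 1], [[0.0, 0.0, 0.0],
                               [0.757, 0.586, 0.0],
                               [-0.757, 0.586, 0.0]])
```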
Kernel ridge regression is the workhorse fit. The $\mathcal{O}(n^3)$ solve and $\mathcal{O}(n^2)$ kernel-matrix storage cap practical dataset sizes near a few tens of thousands of structures without low-rank approximations. For SMILES, string kernels and GP Bayesian-optimization implementations like GAUCHE (Griffiths et al. 2023) make Bayesian optimization over chemical space tractable at typical discovery-campaign scale.
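With any precomputed Gram matrix, the fit itself is one regularized linear solve. A minimal sketch via Cholesky factorization; the regularizer `lam` is a placeholder to be tuned by cross-validation:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def krr_fit(K, y, lam=1e-8):
    """Solve (K + lam * I) alpha = y: O(n^3) time, O(n^2) memory."""
    factor = cho_factor(K + lam * np.eye(K.shape[0]))
    return cho_solve(factor, y)

def krr_predict(K_test_train, alpha):
    """A prediction is a kernel-weighted sum over training molecules."""
    return K_test_train @ alpha
```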
The fair comparison against a GNN is not million-molecule benchmarks. On a 500-molecule slice of new chemistry, FCHL plus kernel ridge typically matches or beats a freshly trained GNN, gives calibrated uncertainty, and trains in seconds. The GNN wins decisively once labels number in the tens of thousands or when the descriptor stops being a good prior.
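Upgrading the same solve to exact Gaussian process regression yields the predictive variance that drives active learning. A sketch assuming the train-train, test-train, and test-diagonal kernel values are precomputed with any of the kernels above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(K_train, K_cross, k_test_diag, y, noise=1e-2):
    """Exact GP regression with a precomputed molecular kernel.
    K_train: (n, n) train-train; K_cross: (m, n) test-train;
    k_test_diag: (m,) prior variances k(x*, x*) at the test points."""
    factor = cho_factor(K_train + noise * np.eye(K_train.shape[0]))
    mean = K_cross @ cho_solve(factor, y)
    v = cho_solve(factor, K_cross.T)  # (K + noise * I)^-1 K_cross^T
    var = k_test_diag - np.einsum("ij,ji->i", K_cross, v) + noise
    return mean, var

# An active-learning loop would label the candidate with the largest variance:
# next_idx = np.argmax(var)
```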
Molecular Kernel
A molecular kernel is a positive-definite similarity function between molecules after choosing a representation such as a fingerprint, Coulomb matrix, SOAP environment, or FCHL descriptor. Positive definiteness lets the kernel serve as an inner product in an implicit feature space.
Small-Data Prior Advantage
Statement
In small molecular datasets, a strong fixed descriptor plus kernel regression can outperform a neural model because the descriptor supplies more inductive bias than the data can learn.
Intuition
The kernel pipeline spends its modeling budget on a chemically meaningful similarity prior. The neural model must learn that prior from examples, which is hard when each label is expensive.
Failure Mode
The advantage disappears when the descriptor misses the relevant physics, when enough labeled data are available, or when the task requires learned 3D equivariant features.
Problem
You have 600 DFT labels for a new family of catalysts. Why is FCHL plus kernel ridge a serious baseline before training a graph neural network?
Common Confusions
Tanimoto similarity is not a metric and does not satisfy the triangle inequality
Tanimoto distance, $1 - k(\mathbf{a}, \mathbf{b})$, is a metric on binary vectors, but the similarity itself is not. More importantly, two molecules with high ECFP4 similarity (above the conventional 0.85 threshold) can have very different activities (the "activity cliff" phenomenon), and two with low similarity can share a binding mode. Treat Tanimoto as a similarity prior, not a guarantee.
Coulomb matrices are not the same descriptor across molecules
The matrix is $N \times N$ for an $N$-atom molecule, so a molecule with 12 atoms produces a different-sized object than one with 18. Standard practice pads to a fixed maximum size and sorts rows by norm, as in the sketch below, but the sort-by-norm step is only piecewise smooth in the coordinates, which matters for force-field applications.
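A sketch of the pad-and-sort step, reusing the `coulomb_matrix` function from the Core Ideas sketch; `max_atoms` should be the largest atom count in the dataset:

```python
import numpy as np

def padded_sorted_cm(Z, R, max_atoms):
    """Sort rows/columns by descending row norm, then zero-pad to a fixed size."""
    M = coulomb_matrix(Z, R)  # from the sketch in Core Ideas
    order = np.argsort(-np.linalg.norm(M, axis=1))
    M = M[order][:, order]  # permutation-invariant ordering
    out = np.zeros((max_atoms, max_atoms))
    out[: len(M), : len(M)] = M
    return out
```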
References
- Rogers, Hahn, "Extended-Connectivity Fingerprints," J. Chem. Inf. Model. 50(5), 2010, pp. 742-754.
- Rupp, Tkatchenko, Müller, von Lilienfeld, "Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning," Phys. Rev. Lett. 108, 2012, 058301.
- Bartók, Kondor, Csányi, "On representing chemical environments," Phys. Rev. B 87, 2013, 184115, arXiv:1209.3140.
- Christensen, Bratholm, Faber, von Lilienfeld, "FCHL revisited: Faster and more accurate quantum machine learning," J. Chem. Phys. 152(4), 2020, 044107.
- Griffiths et al., "GAUCHE: A Library for Gaussian Processes in Chemistry," NeurIPS Datasets and Benchmarks 2023, arXiv:2212.02314.
- Faber et al., "Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error," J. Chem. Theory Comput. 13(11), 2017, pp. 5255-5264.
Related Topics
Required prerequisites
- Kernels and Reproducing Kernel Hilbert Spaces
- Gaussian Processes for Machine Learning
Derived topics
- Gaussian Process Regression