Rapid Identification of Pathogenic Bacteria from Raman Spectra with a CNN–Transformer Hybrid Architecture
DOI:
https://doi.org/10.47852/bonviewJDSIS62027534
Keywords:
Raman spectroscopy, bacterial identification, convolutional neural networks, transformer, data augmentation
Abstract
Bacterial identification from Raman spectra offers a promising label-free and nondestructive approach, providing molecular fingerprints at the single-cell level. However, practical implementation is constrained by low signal-to-noise ratios arising from short acquisition times, severe class imbalance across bacterial species, and high inter- and intra-species spectral variability. This study presents a two-stage convolutional neural network (CNN)–Transformer pipeline evaluated on the Bacteria-ID dataset, covering 30 bacterial species across approximately 63,000 spectra. Preprocessing combined baseline subtraction, fast Fourier transform, and wavelet decomposition to improve signal quality prior to training. Class imbalance was addressed through the synthetic minority oversampling technique (SMOTE) and a class-weighted loss, while mixed-precision computation reduced GPU overhead. Hyperparameters were optimized via Bayesian search using Optuna. The CNN stem extracts local Raman peak features, while the Transformer encoder captures long-range spectral dependencies that convolutional layers alone cannot model efficiently. On the independent test set, the model achieved approximately 85% on both accuracy and weighted F1 score, surpassing ResNet (82.2%) and RamanNet (84.7%) evaluated under identical conditions. The F1 score of the lowest-performing species improved from 31% in the unoptimized baseline to approximately 70% in the final configuration. External validation on spectra from alternative instruments or clinical settings has not yet been conducted and represents the most important direction for future work. Extensions toward MRSA/MSSA classification and antibiotic response prediction are planned.
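As an illustration of the class-weighted loss mentioned in the abstract, the following is a minimal sketch using only the Python standard library. The abstract does not state which weighting scheme was used, so the inverse-frequency ("balanced") weights here are an assumption, and the function names `inverse_frequency_weights` and `weighted_cross_entropy` and the toy data are hypothetical, not taken from the authors' code.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    total = 0.0
    norm = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        norm += w
    return total / norm


# Hypothetical toy data: class 0 is common, class 1 is rare.
labels = [0, 0, 0, 1]

# Predicted class probabilities per sample; the rare sample is predicted poorly.
probs = [
    {0: 0.9, 1: 0.1},
    {0: 0.9, 1: 0.1},
    {0: 0.9, 1: 0.1},
    {0: 0.8, 1: 0.2},
]

weights = inverse_frequency_weights(labels)
uniform = {c: 1.0 for c in weights}

loss_weighted = weighted_cross_entropy(probs, labels, weights)
loss_unweighted = weighted_cross_entropy(probs, labels, uniform)

print("class weights:", weights)
print("weighted loss:", loss_weighted)
print("unweighted loss:", loss_unweighted)
```

In this toy case the rare class receives a larger weight, so the poorly predicted minority sample contributes more to the loss than it would under uniform weighting. This is the mechanism by which a class-weighted loss pushes a classifier to attend to under-represented species.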
Received: 31 August 2025 | Revised: 13 November 2025 | Accepted: 30 April 2026
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
This study uses the publicly available Bacteria-ID dataset (Ho et al., 2019). No new data were generated. The specific train/validation/test splits, trained model weights, and figure-generation outputs can be obtained from the corresponding author upon reasonable request for academic use. The source code is not currently publicly available.
Author Contribution Statement
Apoorv Patel: Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Hongying Meng: Resources, Writing – review & editing, Supervision, Project administration.
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.