Insights into Nuclear Magnetic Resonance Data Pre-processing: A Comprehensive Review

Nuclear Magnetic Resonance (NMR) and its derivatives play a pivotal role in molecular analysis across research and clinical domains. However, the intricate nature of NMR data pre-processing, which is integral for accurate analysis, is not easily understood despite the availability of numerous software tools. This comprehensive review aims to unravel the complexities of pre-processing algorithms in both the time and frequency domains. It covers essential steps such as direct current offset removal, eddy current correction, shift and linear prediction, weighting, zero filling, domain transformation, phase error correction, baseline correction, solvent filtering, calibration and alignment, reference deconvolution, binning/bucketing, peak picking, peak fitting/deconvolution, compound identification, integration and quantification, normalization, and transformation. The review uses plain language to enhance accessibility and understanding. By demystifying the algorithms behind these pre-processing steps, we seek to help researchers and practitioners navigate the nuances of NMR data pre-processing, ultimately fostering better understanding and practical application in molecular analysis.

NMR spectroscopy utilizes powerful magnetic fields to analyze samples. When exposed to the radiofrequency radiation produced by an NMR spectrometer, the nuclei in molecules absorb the energy and transition to higher energy levels when possible. This phenomenon is known as excitation. After the radiofrequency pulses are turned off, the nuclei undergo relaxation, releasing the absorbed energy and returning to their original energy levels. The decaying signal resulting from this relaxation process is captured by a receiver coil surrounding the sample tube. The weak, energy-varying currents induced by the relaxation are detected as raw signals from the molecules. Figure 1 uses a proton as an example nucleus to illustrate how a proton signal is generated.

Figure 1 Illustration of the process of generating a proton signal in NMR spectroscopy
Before becoming raw NMR data, these signals undergo amplification and digitization. Raw NMR data, depicted in the middle section of Figure 1, record how the signal changes over time and are therefore termed time domain NMR data. When multiple signals are present, they diminish over time, mix together, and become difficult to analyze. Therefore, these raw data must undergo a series of pre-processing steps to prepare them for examination in the frequency domain, where signals are separated into distinct peaks. Domain transformation is just one unavoidable pre-processing step among many.
In addition to domain transformation, pre-processing addresses various other issues inherent in raw NMR data. The presence of noise, baseline distortions, and other artifacts necessitates pre-processing to ensure accurate interpretation of the data. Furthermore, for tasks such as peak-based molecule identification and quantification, it becomes imperative to enhance peak resolution and employ intelligent peak definition and deconvolution methods. Moreover, to facilitate meaningful data comparison across spectra, calibration, alignment, normalization, and transformation steps are often indispensable.
While a plethora of software options are available, this review article does not focus on delineating the functionalities of each tool. Instead, our examination delves into the algorithms employed for common NMR preprocessing steps.
To conduct this review, pertinent literature was identified using targeted keywords such as 'NMR', 'preprocess' or 'pre-process', and the names of specific preprocessing steps. Searches were executed across databases including PubMed, IEEE, Google Scholar, Copilot, and other relevant platforms, with screening methods involving the assessment of titles, abstracts, and full-text articles.
We categorize pre-processing steps based on common practice, starting with the time domain, where NMR raw data originate, and then proceeding to the frequency domain, where NMR data analysis is conducted. Within each domain, we arrange the steps by their typical order of application, providing a structured framework for analysis.

Direct current (DC) offset removal
The initial step in NMR data pre-processing involves converting raw NMR time-domain files, referred to as Free Induction Decay (FID) data, from a binary format to text. After reading in the FID, the first issue to address is the removal of the direct current (DC) offset, a constant voltage added to the NMR signal by factors such as instrument imperfections or interference.
In Figure 2A, the signal, centred at zero with 10 cycles per second (10 Hz), is converted from the time domain to the frequency domain to distinguish signals (Figure 2B). However, if a signal detector in an NMR spectrometer has a DC voltage offset, it shifts the signal's centre away from zero in the time domain plot (Figure 2C), causing an unexpected non-signal line on the left in the frequency domain plot (Figure 2D). To tackle this, removing the DC offset in the time domain is necessary. Three methods are available:
1) Last data point method: Subtract the last data point's value from all data points.
3) Phase cycling method: Typically, only one detector is used to detect NMR signals. However, an additional detector positioned 180 degrees apart can be utilized. In this case, clean signals without DC offset can be obtained by subtracting the data of the additional detector from the original data. Note that signal amplitudes are doubled by this method. In this context, "phase" refers to the angular displacement of the NMR signal. To illustrate, considering the signal in Figure 2A, the extra signal detector begins recording the same signal when it reaches its first local minimum. While this idea may not be applicable to most 1D NMR data due to the absence of extra data, it is easily applied to MRI sequences and extended to other phase cycling angles [21].
The most reliable approach to handling DC offset is phase cycling when an extra detector is available. In cases with sufficiently long FID recording times, estimating the DC offset using the tail points can be considered. Unfortunately, no optimal solution exists for handling DC offset in other situations.
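The last-point and tail-point estimates mentioned above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the `tail_fraction` parameter, and the simulated FID are assumptions made for this example and do not come from any particular NMR package.

```python
import numpy as np

def remove_dc_last_point(fid):
    """Subtract the value of the final point from every point."""
    return fid - fid[-1]

def remove_dc_tail_mean(fid, tail_fraction=0.1):
    """Subtract the mean of the last `tail_fraction` of the points (assumed parameter)."""
    n_tail = max(1, int(len(fid) * tail_fraction))
    return fid - fid[-n_tail:].mean()

# Simulated FID (hypothetical): a decaying 10 Hz signal plus a constant complex offset.
t = np.linspace(0.0, 5.0, 2048, endpoint=False)
offset = 0.3 + 0.2j
fid = np.exp(2j * np.pi * 10.0 * t) * np.exp(-3.0 * t) + offset

corrected_last = remove_dc_last_point(fid)
corrected_tail = remove_dc_tail_mean(fid)
```

Both estimates are only as good as the assumption that the signal has fully decayed by the end of the recording, which is why a sufficiently long FID matters.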

Eddy current (EC) correction
Eddy currents are induced by the interaction of changing magnetic fields with conductive elements in both the NMR sample and machine [22]. These currents create their own magnetic fields, which subsequently affect the designed magnetic field in the NMR system. As a result, these currents lead to variations in observed frequencies, fluctuations in signal amplitude, and phase distortions in acquired NMR signals.
In Figure 3A, the absence of eddy currents results in a consistent cyclic signal in the time domain (Figure 3B), producing a single symmetric peak in the frequency domain (Figure 3C). The presence of eddy currents (Figure 3D) leads to irregular time domain signals (Figure 3E) and multiple peaks, including negative ones, in the frequency domain (Figure 3F), causing significant signal distortion. To correct eddy current effects, we discuss two NMR methods and one MRI method:
1) Phase correction with reference FID: Assuming that a reference-only FID is available and has the exact settings as an experimental FID without the reference, we can subtract the phase vector of the reference FID from that of the experimental FID, resulting in an EC-corrected phase vector. This corrected phase vector is then used to reconstruct a new FID file. The term "reference" here does not refer to a spike-in internal reference for signal quantification; instead, it pertains to a solvent, like water. In practice, using a water-unsuppressed FID as a pseudo reference-only FID and a water-suppressed FID as the experimental FID is cost-effective: the same sample only needs to be run twice under two different conditions, and water has a significantly higher concentration, allowing us to disregard metabolite signals [23].
2) Phase error correction with opposite induction directions: Employing a two-step approach with positive and negative magnetic inductions effectively eliminates eddy current-induced phase errors, resulting in mitigated and even phase corrections [23,24]. This is better than "Phase correction with reference FID", especially when a pseudo reference-only FID is used.
3) EC-induced magnetic model: This model, more applicable to MRI, establishes the relationship between the EC-induced magnetic field and spatial coordinates. An iterative optimization process, facilitated by specialized software such as "eddy" (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/eddy), determines model parameters to correct MRI data for EC effects. While low-order polynomials (up to the second order) are commonly used to model and correct eddy current-induced distortions, higher-order models (quadratic and cubic) might also find applications [23,25-28].
While EC correction is recommended, it may not be applicable when additional data is unavailable. In such cases, alternative options can be employed to partially mitigate EC effects in the subsequent steps, including domain transformation, phase error correction, solvent filtering, and chemical shift calibration.
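The first method (phase correction with a reference FID) amounts to keeping the magnitude of the experimental FID and subtracting the reference phase point by point. The sketch below assumes an on-resonance reference whose phase contains only the eddy current distortion. The function name, sampling rate, and the decaying form of the distortion are hypothetical choices for illustration.

```python
import numpy as np

def ec_correct_with_reference(fid, ref_fid):
    """Keep the magnitude of `fid`; subtract the phase of `ref_fid` point by point."""
    corrected_phase = np.angle(fid) - np.angle(ref_fid)
    return np.abs(fid) * np.exp(1j * corrected_phase)

# Hypothetical simulation: 1 kHz sampling, one signal 50 Hz away from the reference.
t = np.arange(1024) / 1000.0
ec_phase = 0.8 * np.exp(-t / 0.05)      # assumed time-dependent EC phase distortion (radians)
decay = np.exp(-t / 0.2)

ideal = np.exp(2j * np.pi * 50.0 * t) * decay        # FID without eddy currents
fid = ideal * np.exp(1j * ec_phase)                  # experimental FID with EC distortion
ref_fid = 100.0 * decay * np.exp(1j * ec_phase)      # intense on-resonance reference (e.g. water)

corrected = ec_correct_with_reference(fid, ref_fid)
```

In this idealized case the corrected FID matches the undistorted one exactly. With a pseudo reference taken from a water-unsuppressed run, residual metabolite signal in the reference would make the correction approximate.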

FID shift and linear prediction
The beginning of the FID sequence is particularly prone to distortion due to sudden radiofrequency shifts compared to the rest of the data. To mitigate this distortion, we may implement left shifts, moving points before time 0 beyond the FID, a method suitable for fully recorded FIDs with minor adjustments. Conversely, right shifts intentionally delay the FID [29], which effectively addresses significant distortions at the sequence start.
Whether using left or right shifts, data gaps inevitably occur. To address these gaps, Linear Prediction (LP) methods are employed to recover lost FID data caused by shifts. Backward LP focuses on data missing at the sequence start due to right shifts [29], while forward LP extends or fills missing segments at the sequence tail due to left shifts.
The LP formulas are [30]:
$$x_n = \sum_{m=1}^{P} a_m x_{n+m} + e_n \quad (1)$$
$$x_n = \sum_{m=1}^{P} a_m x_{n-m} + e_n \quad (2)$$
Here, m represents the base point index, P is the total number of base points, $x_n$ is the predicted point, and $x_{n+m}$ and $x_{n-m}$ are the base points used for backward (1) and forward (2) LP, respectively. $a_m$ stands for the coefficient of a base point, and $e_n$ is the random error associated with the predicted point. The prediction process involves an iterative optimization utilizing a loss function, such as squared differences between $x_n$ and $\sum_{m=1}^{P} a_m x_{n+m}$. Careful attention is necessary for FID shifts and backward LP due to potential data distortion [31]. A small left shift is generally safer than a larger right shift, and forward LP is considered safer than backward LP.
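A minimal sketch of forward LP follows, assuming the coefficients are obtained by ordinary least squares over the recorded points (one possible choice of loss; the text above does not prescribe a specific solver). The function names and the simulated two-component FID are hypothetical.

```python
import numpy as np

def lp_coefficients(x, order):
    """Least-squares fit of a_1..a_P in x[n] ~ sum_m a_m * x[n - m]."""
    # Each row holds the `order` points preceding x[n], most recent first.
    rows = [x[n - order:n][::-1] for n in range(order, len(x))]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), x[order:], rcond=None)
    return coeffs

def forward_lp(x, order, n_new):
    """Extend x by n_new points, each predicted from the previous `order` points."""
    coeffs = lp_coefficients(x, order)
    out = list(x)
    for _ in range(n_new):
        recent = np.array(out[-order:][::-1])
        out.append(np.dot(coeffs, recent))
    return np.array(out)

# Noise-free test signal: two damped complex exponentials (hypothetical values).
n = np.arange(80)
full = (np.exp((2j * np.pi * 0.05 - 0.01) * n)
        + 0.5 * np.exp((2j * np.pi * 0.12 - 0.02) * n))

# Pretend only the first 64 points were recorded and predict the last 16.
extended = forward_lp(full[:64], order=2, n_new=16)
```

For noise-free data with as many base points as signal components the prediction is essentially exact. With real, noisy FIDs the number of base points and the stability of the coefficients become practical concerns. Backward LP follows the same pattern with the time axis reversed.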
Figures 4A-B illustrate a simulated FID in the time and frequency domains. Employing a decaying exponential as a weighting function notably attenuates the FID's tail while preserving its initial segment, as demonstrated in Figure 4C. This enhances the signal-to-noise ratio (SNR) but may broaden peaks and cause overlap in the frequency domain [17,39]. When this process drives the tail values to zero, it is also termed apodization [39]. Apodization can enhance visualization but is cautioned against when preparing data for spectral analysis. Applying apodization before such analysis may compromise the statistical assumptions tied to the fitting model [27].

Figure 4 Illustration depicting the effect of weighting functions
Only the real part is shown. A. Time domain plot of a simulated FID with a single peak. B. Frequency domain plot corresponding to A. C. Time domain plot of A times an exponential decay ($e^{-2.5(j-1)/N}$). Here, j is the index of a given point, and N is the total number of data points in the FID. D. Frequency domain plot corresponding to C. E. Time domain plot of A times an exponential growth ($e^{2.5(j-1)/N}$). F. Frequency domain plot corresponding to E.
Conversely, using an increasing exponential function enhances FID resolution (Figure 4E), resulting in a narrow peak in the frequency domain (Figure 4F). However, it amplifies noise in the FID tail, reducing SNR and potentially causing asymmetric peaks [40].
Utilizing weighting functions involves a delicate balance between sensitivity and resolution, where enhancement in one aspect may come at the expense of the other and could potentially introduce distortions, complicating data recovery. These functions should be applied with caution when the data are not well understood or when no specific sensitivity or resolution goal has been set. Furthermore, maintaining consistent application of these functions throughout an experiment is vital to ensure comparability of the data.
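The exponential weights used in Figure 4 can be written down directly. The sketch below assumes the form exp(rate * (j - 1) / N) with rate = -2.5 for decay and +2.5 for growth, as given in the figure legend. The function name and the simulated noisy FID are hypothetical.

```python
import numpy as np

def exponential_weight(n_points, rate):
    """w_j = exp(rate * (j - 1) / N) for j = 1..N; rate < 0 decays, rate > 0 grows."""
    j = np.arange(1, n_points + 1)
    return np.exp(rate * (j - 1) / n_points)

# Hypothetical noisy single-component FID.
rng = np.random.default_rng(0)
N = 1024
t = np.arange(N) / N
noise = 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
fid = np.exp(2j * np.pi * 100 * t) * np.exp(-4 * t) + noise

decayed = fid * exponential_weight(N, -2.5)   # suppresses the noisy tail, broadens the peak
grown = fid * exponential_weight(N, 2.5)      # narrows the peak, amplifies tail noise
```

Because the weights multiply signal and noise alike, the choice of rate only trades sensitivity against resolution. It cannot improve both at once.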
Alternatively, to enhance SNR without sacrificing resolution, singular-value decomposition-based approaches such as Cadzow and PCA, used alongside a new wavelet transform routine, have proved effective for robustly denoising 1D and 2D NMR spectra [41]. However, these methods should be applied after molecule identification and quantification, as they could potentially distort quantification results.

Zero filling
Zero filling involves adding zeros to the end of an FID, creating the illusion of higher digital resolution [20,27]. For instance, appending as many zeros as there are experimental points doubles the spectral length and improves digital resolution, while additional zero filling aids data interpretation through interpolation [29]. However, zero filling does not contribute real signal data and may increase noise due to the introduction of zeros. Techniques like forward linear prediction and apodization decay functions [20] may aid in such scenarios, but their effectiveness varies. Zero filling is generally safer with ample data and near-zero endpoints, and it is less beneficial with very few data points (Figure 5E-F). Zero filling should be applied consistently across all FIDs within an experiment to keep the data comparable. Essentially, zero filling merely interpolates points in the frequency domain data without adding new information. Therefore, relying solely on zero filling is insufficient, and extending the signal recording time should be prioritized.
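The interpolation property can be checked numerically. In the sketch below (function name and simulated FID are hypothetical), doubling the FID length with zeros leaves every original spectral point unchanged and only inserts new points between them.

```python
import numpy as np

def zero_fill(fid, factor=2):
    """Append zeros so that the FID length becomes factor * len(fid)."""
    n_extra = (factor - 1) * len(fid)
    return np.concatenate([fid, np.zeros(n_extra, dtype=fid.dtype)])

# Hypothetical FID whose frequency falls between two frequency bins.
N = 256
t = np.arange(N) / N
fid = np.exp(2j * np.pi * 20.3 * t) * np.exp(-6 * t)

spec = np.fft.fft(fid)
spec_zf = np.fft.fft(zero_fill(fid, 2))

# Every second point of the zero-filled spectrum coincides with the original spectrum.
same_points = np.allclose(spec_zf[::2], spec)
```

The odd-indexed points of `spec_zf` are interpolated values, which is why zero filling smooths the appearance of a peak without adding information.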

Domain transformation
Domain transformation is pivotal in converting FID time domain data into the frequency domain. The primary method used for this transformation is the discrete Fourier transform (DFT), which mathematically produces the frequency content of discrete signals through formula (3):
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N} \quad (3)$$
Here is a breakdown of the formula components:
• $X_k$ represents the kth complex number in the frequency domain.
• $x_n$ represents the nth complex number in the time domain.
• N is the total number of data points in the sequence.
• k is the frequency bin index, ranging from 0 to N-1.
This transformation allows FID signals to manifest as single peaks in the resulting spectrum [39].
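As a sanity check, the DFT can be evaluated directly from its standard definition, X_k = sum over n of x_n * exp(-2*pi*i*k*n/N), and compared against NumPy's fast Fourier transform. The helper name `dft` and the test signal are illustrative assumptions.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

# A 10 Hz complex signal sampled at 64 Hz for one second.
t = np.arange(64) / 64.0
x = np.exp(2j * np.pi * 10.0 * t)

spectrum = dft(x)
peak_bin = int(np.argmax(np.abs(spectrum)))
```

With one second of data each bin is 1 Hz wide, so the single peak lands in bin 10. In practice `np.fft.fft` computes the same result far more efficiently.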
There are several alternative methods for domain transformation. The linear model, although good for complementing the Fourier transform, is generally less accurate for independently analyzing multiple-signal FIDs [42]. Bayesian methods rely on prior distributions [43,44]. The wavelet transform is adept at handling uneven frequencies [45].
In standard scenarios without eddy current issues, the Fourier transform is recommended for its reliability. However, when addressing eddy current-induced frequency alterations, the wavelet transform is preferable as it doesn't require additional data [46]. After a wavelet transform, the symmetry of frequency domain peaks might not be perfect, and information may be lost if the peaks are forcibly shaped into predetermined forms.

Phase error correction
Raw NMR signals in the time domain are complex numbers representing nuclei's energy changes along two orthogonal directions. After being transformed into the frequency domain, they remain complex numbers, with the real part referred to as absorption and the imaginary part as dispersion. Figure 6A shows a simulated absorption spectrum with three sharp and concentrated peaks. Correspondingly, Figure 6B displays the simulated dispersion spectrum. The phase, calculated as Phase = $\tan^{-1}(\text{dispersion}/\text{absorption})$, indicates the relationship between absorption and dispersion. Its corresponding plot is shown in Figure 6C. Figures 6A-C represent ideal signals with no phase errors; thus, Figure 6D shows a phase error plot with all values at 0. However, NMR data always contain phase errors, which can significantly alter absorption, dispersion, phase, and phase error plots, as illustrated in Figures 6E-H. Consequently, a naïve data analysis based on peak locations and areas under curves in the absorption plot (Figure 6E), without considering phase errors, is unreliable because the apparent peaks differ from true peaks without phase errors (Figure 6A). Therefore, phase error correction must be performed before any data analysis.
Unfortunately, phase error correction is one of the most challenging pre-processing steps in the frequency domain. Current NMR phase error correction approaches mainly rely on a simple linear model applied to the entire spectrum [17]. This model searches for the intercept (zero-order parameter) and slope (first-order parameter) through an optimization process. Different algorithms employ various optimization functions [27,47-62]. However, this simple linear model approach cannot effectively handle non-linear phase errors such as those shown in Figure 6H. Recent research continues to rely on manual phase error correction [27,29,36,38,63,64]. Unfortunately, manual phase error correction heavily depends on individual experience, leading to inconsistencies and a lack of inter-user reliability.
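Once the zero-order and first-order parameters are known, applying the simple linear model is a point-wise rotation of the complex spectrum. The sketch below assumes the phase error has the form phi0 + phi1 * k / N across the N spectral points. The function name, this parameterization, and the simulated line are assumptions for illustration, and the search for the parameters is not shown.

```python
import numpy as np

def apply_linear_phase(spectrum, phi0, phi1):
    """Rotate point k by -(phi0 + phi1 * k / N) radians."""
    k = np.arange(len(spectrum))
    return spectrum * np.exp(-1j * (phi0 + phi1 * k / len(spectrum)))

# Ideal complex Lorentzian line: the real part is pure absorption.
freq = np.linspace(-50.0, 50.0, 1001)
ideal = 1.0 / (1.0 + 1j * (freq - 10.0))

# Introduce a known linear phase error, then undo it.
k = np.arange(len(freq))
distorted = ideal * np.exp(1j * (0.6 + 1.5 * k / len(freq)))
corrected = apply_linear_phase(distorted, phi0=0.6, phi1=1.5)
```

The distorted real part shows negative lobes, while the corrected real part is positive everywhere. A non-linear phase error would leave residual distortion after this step.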
To address non-linear phase errors, researchers incorporated non-linear terms into a linear model, surpassing the performance of a simple linear model [47,52,65,66]. In contrast to standard simultaneous parameter correction, Jaroszewicz et al. [66] proposed an iterative order-by-order search for a linear model extended with quadratic terms. This involves optimizing the first-order parameter, followed by the zeroth order, and concluding with the second order in each iteration. The phase range progressively narrows between iterations. The method continues until reaching the maximum iteration limit or observing no significant changes, aligning with other optimization strategies. The authors stress the automatic nature of the approach, requiring no prior knowledge. However, this approach identifies phase values that maximize absorption and minimize dispersion respectively, and so it might overlook solutions where both objectives are optimized simultaneously.
To address all types of phase errors, whether constant, linear, or non-linear, we have recently developed a new R package, 'NMRphasing' (https://cran.r-project.org/web/packages/NMRphasing/). One algorithm in 'NMRphasing' starts with phase error-free data, such as magnitude and power spectra, which theoretically should not contain any phase errors. Subsequently, the algorithm derives the phase error-free absorption spectrum, as illustrated in Figure 6A. Alternatively, we propose multiple linear models to correct phase errors in different peak ranges. In addition, we introduced a novel optimization function aimed at minimizing the disparity between the absolute area under a curve and the net area under the same curve. This approach seeks to maximize absorption through net area while simultaneously minimizing dispersion via absolute area. A smaller absolute area of absorption implies less contamination of dispersion within the observed absorption, consequently reducing the net area of dispersion. This is desirable, as ideal dispersion should exhibit zero net area.
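The idea of minimizing the disparity between absolute area and net area can be illustrated with a zero-order grid search. This is a simplified sketch written for this review and not the 'NMRphasing' implementation. The function names, the grid size, and the single simulated Lorentzian line are assumptions.

```python
import numpy as np

def area_disparity(real_part):
    """Absolute area minus net area; zero when no point is negative."""
    return np.sum(np.abs(real_part)) - np.sum(real_part)

def search_zero_order_phase(spectrum, n_grid=720):
    """Return the candidate phase (radians) with the smallest area disparity."""
    candidates = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    scores = [area_disparity((spectrum * np.exp(-1j * phi)).real)
              for phi in candidates]
    return candidates[int(np.argmin(scores))]

# Single complex Lorentzian line with a constant phase error of 0.7 rad.
freq = np.linspace(-100.0, 100.0, 4001)
ideal = 1.0 / (1.0 + 1j * freq)
distorted = ideal * np.exp(1j * 0.7)

phi0 = search_zero_order_phase(distorted)
```

The recovered `phi0` lies close to the true 0.7 rad, within roughly the grid spacing plus a small ambiguity caused by the finite spectral window. Extending the search to first-order or piecewise models would follow the same scoring idea.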
Spatially varying phase errors can be effectively corrected using Adaptive Phase Correction (APC) in magnetic resonance imaging (MRI). Unlike traditional phase correction processes relying on regularization, APC utilizes MRI noise information for complex-valued image regularization, addressing noise bias and improving accuracy in diffusion MRI, especially in regions with diverse noise characteristics. The method involves applying a regularization operator and adjusting the phase based on noise variance estimates, resulting in a final image that accommodates noise characteristics in different regions [67]. However, phase-corrected images from this approach still contain phase errors and negative intensities. It is recommended to manually inspect these images and ensure their compatibility with subsequent processing steps.
Baseline bias estimation in all these algorithms is based on regions without signals [17]. Of course, it might be challenging to distinguish noise and signal regions when no prior information about signal locations is available. One method is to classify individual points as either signal or noise points and subsequently employ linear interpolation between noise points to establish the baseline. After the baseline is constructed, it is subtracted from the corresponding spectrum. Most baseline correction methods are automated, although semi-automatic or manual baseline correction methods also exist [29,70-73].
Regardless of the baseline correction method used, it is essential to be aware that baseline correction itself can introduce distortion and bias to the data, as it is intertwined with noise modeling [29].
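The classify-then-interpolate method described above can be sketched as follows. The classification rule used here (a point is noise when its local standard deviation is below k times the median local standard deviation) is an assumption chosen for illustration, as are the function names, the window size, and the simulated spectrum.

```python
import numpy as np

def local_std(y, window):
    """Standard deviation in a sliding window centred on each point (window assumed odd)."""
    pad = window // 2
    padded = np.pad(y, pad, mode="edge")
    views = np.lib.stride_tricks.sliding_window_view(padded, window)
    return views.std(axis=1)

def interpolated_baseline(x, y, window=25, k=3.0):
    """Classify noise points, then linearly interpolate the baseline between them."""
    sd = local_std(y, window)
    is_noise = sd < k * np.median(sd)          # assumed classification rule
    baseline = np.interp(x, x[is_noise], y[is_noise])
    return baseline, is_noise

# Hypothetical spectrum: sloped baseline + one Gaussian peak + noise.
rng = np.random.default_rng(1)
x = np.arange(2000, dtype=float)
true_baseline = 2.0 + 0.001 * x
peak = 50.0 * np.exp(-((x - 1000.0) ** 2) / (2 * 10.0 ** 2))
y = true_baseline + peak + rng.normal(0.0, 0.05, x.size)

baseline, is_noise = interpolated_baseline(x, y)
corrected = y - baseline
```

Note that this literal version passes the baseline through every noise point, so the noise itself is removed at those points. That is one concrete way in which baseline correction becomes intertwined with noise modeling. Smoothing the noise points before interpolation is a common refinement that is not shown here.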

Solvent filtering
As mentioned in Section 2.2 on Eddy Current (EC) correction, solvent filtering becomes a viable alternative when EC correction is not possible due to the unavailability of additional data. Because intense solvent peaks often capture most of the effects of eddy currents, excluding the solvent signal range in the frequency domain minimizes the impact of eddy currents [23].
When the solvent is water, as depicted in Figure 7 around 4.7-5.0 ppm, the common practice is to run samples with water suppressed [74,75]. If samples are run without solvent suppression and eddy current effects persist uncorrected in the time domain, and if phase error correction fails to rectify the distortion, solvent peaks can become severely distorted. To address this issue, common methods for solvent filtering include:
1). Subtract a solvent-only FID from the experimental FID of the sample and transform the solvent-filtered FID into the frequency domain.
2). Create a pseudo solvent-only FID by isolating data within the solvent peak range from the frequency spectrum, setting other data points to zero, and transforming it into the time domain. Subsequently, subtract this pseudo solvent-only FID from the experimental FID of the sample and transform the resulting data into the frequency domain.
3). Use specialized filters targeting the solvent's frequency range to eliminate the solvent signal [23].
4). Integrate solvent peak removal with baseline correction in the frequency domain [23].
5). Zero out data points within the solvent peak range or set them to baseline values to effectively remove the solvent peaks [70-72, 76, 77].
However, it is important to note that filtering solvent peaks may also inadvertently remove some true signals from their neighboring components.
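The simplest of the listed options, zeroing out the solvent range, is sketched below. The default range of 4.7-5.0 ppm follows the water example in the text. The function name and the flat test spectrum are hypothetical.

```python
import numpy as np

def zero_solvent_region(ppm, spectrum, low=4.7, high=5.0):
    """Return a copy of the spectrum with points in [low, high] ppm set to zero."""
    filtered = spectrum.copy()
    filtered[(ppm >= low) & (ppm <= high)] = 0.0
    return filtered

# Flat dummy spectrum on a descending ppm axis, used only to show which points change.
ppm = np.linspace(10.0, 0.0, 1001)
spectrum = np.ones_like(ppm)
filtered = zero_solvent_region(ppm, spectrum)
```

Any true signal inside the zeroed window is lost as well, which is the trade-off noted above.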

Calibration and alignment
To ensure comparability of NMR spectra across different spectrometers, frequencies are expressed in parts per million (ppm) using the ratio of a signal's frequency offset (relative to a reference signal) to the spectrometer's operating frequency. Calibration, also known as global alignment, sets the internal reference signal's ppm to zero by shifting the entire spectrum [17,63,70,72,79]. On the other hand, (local) alignment aligns each peak across a group of spectra to the same ppm position [17,68].
Regardless of the method chosen, during the alignment process, the distance between two neighboring peaks might be increased or decreased.
Figure 8 shows the effect of alignment. While alignment ensures that the same peaks are matched across different spectra, improving visual consistency, it can also alter the distances between peaks within a single spectrum. This is evident in peaks within sample 2 and sample 3, where the intra-sample peak distances have changed. The decrease in intra-sample peak distance could affect peak areas and quantification [14]. Therefore, it has been suggested that quantification should be performed on unaligned spectra to avoid this potential issue [14]. Also, calibration can be applied without alignment; however, alignment should not be applied before calibration.
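Calibration as a global shift can be sketched as follows. The conversion uses the standard relation ppm = (frequency offset in Hz) / (spectrometer frequency in MHz). The function names, the search window around 0 ppm, and the simulated two-peak spectrum are assumptions for illustration.

```python
import numpy as np

def hz_to_ppm(freq_hz, ref_hz, spectrometer_mhz):
    """Chemical shift in ppm: offset from the reference (Hz) over the field (MHz)."""
    return (freq_hz - ref_hz) / spectrometer_mhz

def calibrate(ppm, spectrum, search=(-0.5, 0.5)):
    """Shift the whole ppm axis so the tallest peak in `search` sits at 0 ppm."""
    mask = (ppm >= search[0]) & (ppm <= search[1])
    ref_ppm = ppm[mask][np.argmax(spectrum[mask])]
    return ppm - ref_ppm

# Hypothetical spectrum: reference peak drifted to 0.07 ppm, one other peak at 3.0 ppm.
ppm = np.linspace(-1.0, 10.0, 1101)

def lorentz(centre, height):
    return height / (1.0 + ((ppm - centre) / 0.01) ** 2)

spectrum = lorentz(0.07, 10.0) + lorentz(3.0, 4.0)
calibrated = calibrate(ppm, spectrum)
```

Because every point is shifted by the same amount, the distance between the two peaks is unchanged. This is the key difference from local alignment, which can move peaks relative to each other.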

Reference deconvolution
In reference deconvolution, the internal reference signal undergoes a transformation into a Lorentzian line, defining an ideal peak shape. This process then extends to all signals, eliminating lineshape distortion across the spectrum. FIDDLE (Free Induction Decay Deconvolution for Lineshape Enhancement), a widely used reference deconvolution method [29,84-86], begins with the generation of a pseudo reference-only spectrum. The spectrum is then transformed into the time domain to obtain the reference-only FID. Through simulation, this FID is deconvolved to achieve an ideal FID with a Lorentzian lineshape. Adjustments using this ideal FID are made to the full FID, resulting in a corrected whole FID, which is then finalized by transforming it back into the frequency domain.
For 2D NMR data, a combined approach integrates reference deconvolution with peak alignment using a "reference spectrum" (also known as the "average spectrum") derived from Principal Component Analysis (PCA) [87]. The calculation of the first principal component (PC1) for each peak represents it in the "average spectrum," aligning peaks across spectra with matching phase values to those in the PC1 "average spectrum." Despite not requiring a Lorentzian lineshape, the PCA-based method's alignment process may lead to discontinuous baselines and distorted overlapping peaks [88]. Some researchers adopt a hybrid approach, integrating FIDDLE and PCA, by replacing FIDDLE's lineshape with the average lineshape from PCA. While effective for groups of spectra in 2D NMR, this method assumes aligned peaks share the same shape and location, making it particularly suitable for DOSY (Diffusion-Ordered Spectroscopy) data but not universally applicable.
Reference deconvolution primarily addresses lineshape distortion from phase errors [88] that were not corrected in the phase error correction step. Given the strong assumptions inherent in all reference methods, caution is advised against the indiscriminate application of reference deconvolution.
Fixed-width binning might cause signals to be split or combined, resulting in bin summaries that are not meaningful [14]. It also struggles with overlapping peaks, and comparability is hindered by alignment issues [39,68].
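The signal-splitting problem is easy to reproduce. The sketch below sums intensities in equal-width bins and places a peak exactly on a bin boundary. The function name, the 0.04 ppm bin width, and the simulated peak are assumptions chosen for illustration.

```python
import numpy as np

def fixed_width_bins(ppm, spectrum, width=0.04):
    """Sum intensities in consecutive bins of equal width, starting at ppm.min()."""
    edges = np.arange(ppm.min(), ppm.max() + width, width)
    idx = np.digitize(ppm, edges) - 1            # bin i covers [edges[i], edges[i] + width)
    sums = np.bincount(idx, weights=spectrum, minlength=len(edges))
    return edges, sums

# One narrow peak centred at 2.0 ppm, which coincides with a bin boundary.
ppm = np.linspace(0.0, 10.0, 5001)
spectrum = 1.0 / (1.0 + ((ppm - 2.0) / 0.005) ** 2)

edges, sums = fixed_width_bins(ppm, spectrum)
left_share = sums[49] / sums.sum()
right_share = sums[50] / sums.sum()
```

The total intensity is conserved, but the single peak is divided between two neighbouring bins (`left_share` and `right_share`), so neither bin summarizes the signal on its own.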
Intelligent binning, employing AI approaches, overcomes these challenges, generating more meaningful divided ranges [68,89,90]. Techniques such as wavelet transformations, dynamic algorithms, and Gaussian or exponential functions are used to detect peak edges [14,69,91]. AI binning allows small ppm adjustments across spectra and can be applied to each bin after fixed binning when complex computations are involved [69,92].
Challenges in AI binning include peak screening, which requires thresholds to be defined from factors like signal-to-noise ratio and variance. Prior knowledge and manual intervention may also be necessary for effective peak screening [90].
NMRNet, a deep learning approach for automated peak picking [93], identifies peaks by locating points with higher intensity in the spectrum, excluding those below the noise level. The key challenge is distinguishing true peaks from noise, treated as a binary classification problem. NMRNet addresses this by inputting retained peaks into a convolutional neural network (CNN), calculating probabilities for their significance. The final step refines the peak list through rule-based filtering. This process involves normalizing resolution and intensity, aiding peak identification. However, normalized data isn't directly usable for quantification.
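The pre-screening idea (keep local maxima, discard those below the noise level) can be sketched without any neural network. This is not NMRNet's code. The function name, the threshold of k times the noise standard deviation, and the simulated noise-free spectrum are assumptions for illustration.

```python
import numpy as np

def pick_peaks(spectrum, noise_sd, k=5.0):
    """Indices of local maxima whose intensity exceeds k * noise_sd."""
    y = np.asarray(spectrum)
    is_local_max = (y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:])
    above_noise = y[1:-1] > k * noise_sd
    return np.flatnonzero(is_local_max & above_noise) + 1

# Hypothetical spectrum: two clear peaks and one bump below the threshold.
x = np.arange(500, dtype=float)

def gauss(centre, height):
    return height * np.exp(-((x - centre) ** 2) / (2 * 5.0 ** 2))

spectrum = gauss(100, 10.0) + gauss(300, 4.0) + gauss(400, 0.3)
peaks = pick_peaks(spectrum, noise_sd=0.1)
```

With an assumed noise level of 0.1, the bump at index 400 falls below the threshold of 0.5 and is excluded. On real, noisy spectra such a simple rule produces many false candidates, which is the classification problem the CNN stage is meant to address.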
Another novel algorithm, DEEP Picker, focuses on peak picking and extends its functionality to include deconvolution. Developed by Li et al. [89], DEEP Picker utilizes a deep neural network (DNN)-based approach, employing a sliding window and stacked convolutional layers for point-by-point spectrum prediction. The algorithm classifies each spectrum point into three categories (Class 2 peaks, Class 1 peaks, and Class 0 non-peaks) using a neural network architecture that includes seven 1D convolutional layers, a max-pooling layer, and a SoftMax activation function for classification. However, similar to other peak picking and intelligent binning methods, the determination of low peak amplitude cutoffs, which could vary from protein to protein and from sample to sample for the same protein, relies on prior knowledge [89].
Peak picking and intelligent binning clearly outperform fixed-width binning. Although existing methods show promise and can be automated, human inspection remains necessary to train more accurate models, and there is room for the development of new methods in the future. Additionally, if the peak picking process involves normalization, the normalized data should not be used for further analysis, especially quantification.

Peak fitting/deconvolution, and compound identification
In this step, our aim is to identify molecules from data signals or "peaks" using peak fitting and deconvolution. Peak fitting precisely defines peak characteristics, while deconvolution untangles overlapping peaks, separating different molecule contributions. Figure 9 illustrates the challenging pre-processing step of deconvolution, which involves optimization using specific loss functions, such as the sum of squared differences [90].
1). In the DEEP pipeline by Li et al. [94], a single convolutional layer with a linear activation function is employed to forecast peak characteristics, including position, amplitude, width, and the Lorentzian fraction of the peak shape.
2). Häckl et al. [71] created a user-friendly R package for fully automated deconvolution of overlapping signals using Lorentzian line-shapes. The process involves constructing individual Lorentz curves for each signal, requiring a peak selection procedure and parameter approximation method. The integral of the Lorentz curve is ultimately used as the area under the curve for singlets or multiplets.
3). Prostko et al. [60] developed a customized and automated deconvolution method for ssNMR mixture spectra, employing linear combination modeling (LCM) by integrating reference spectra of pure solid-state components.
4). Schmid et al. [95] presented a robust deep learning-based deconvolution algorithm for 1D experimental NMR spectra, leveraging a neural network trained on synthetic spectra with customized pre-processing and labeling for accurate estimation of critical peak parameters.
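To make the least-squares idea concrete, the sketch below separates two overlapping Lorentzian lines under a strong simplifying assumption: peak positions and half-widths are already known, so only the amplitudes are fitted and the problem becomes linear. It is not the method of any package cited above, and all names and values are hypothetical.

```python
import numpy as np

def lorentzian(x, centre, hwhm):
    """Unit-height Lorentzian with half-width at half-maximum `hwhm`."""
    return hwhm ** 2 / ((x - centre) ** 2 + hwhm ** 2)

def fit_amplitudes(x, y, centres, hwhms):
    """Least-squares amplitudes for Lorentzians with fixed centres and widths."""
    design = np.column_stack([lorentzian(x, c, w) for c, w in zip(centres, hwhms)])
    amplitudes, *_ = np.linalg.lstsq(design, y, rcond=None)
    areas = np.pi * amplitudes * np.asarray(hwhms)   # analytic area of each Lorentzian
    return amplitudes, areas

# Two overlapping lines (noise-free, hypothetical values).
x = np.linspace(0.0, 2.0, 2001)
centres, hwhms = [0.95, 1.05], [0.04, 0.04]
y = 1.0 * lorentzian(x, 0.95, 0.04) + 0.6 * lorentzian(x, 1.05, 0.04)

amplitudes, areas = fit_amplitudes(x, y, centres, hwhms)
```

The fit minimizes the sum of squared differences between the observed spectrum and the model. Real deconvolution must also estimate positions and widths, which makes the optimization non-linear and sensitive to starting values.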
Compound identification usually relies on libraries like the Human Metabolome Database (www.hmdb.ca) and the Biological Magnetic Resonance Data Bank (www.bmrb.wisc.edu). Lefort et al. [77] developed the R package ASICS, which includes a metabolite library comprising 190 spectra. The identification of metabolites is accomplished by fitting a mixture model to the library spectra, employing a sparse penalty, and quantifying the concentration of metabolites in a complex spectrum. Wang et al. [96] developed NMRQNet, which aims to establish a deep-learning-based pipeline for the automatic identification and quantification of predominant metabolite candidates in human plasma samples.
Challenges arise due to potential data-library incompatibility from different sources or handling methods. We must remain vigilant about these issues during compound identification [10,97].

Integration and quantification
In the integration and quantification process, we sum intensities within specific ranges and use the resulting areas for quantification. This involves determining the concentration of each molecule in the sample based on the area under the curve of the peaks [64].
Integration of signals is conceptually straightforward with raw intensities, although some researchers prefer to integrate absolute intensities [71]. The challenge lies in defining signal edges intelligently, a task initiated in the peak picking and peak fitting/deconvolution steps. A simple approach is to use a range of 24 times the signal width for integration, but caution is advised as this may inadvertently include unintended signals [98]. A more practical way is to set the integration range to be at least twice the full width at half maximum (FWHM). However, using too narrow a range may lead to challenges like overlapping peaks. Therefore, it is advisable to restrict integration to datasets featuring sparse, well-phased peaks and devoid of baseline or macromolecule (MM) interference [27].
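The trade-off between the two integration ranges can be quantified for an idealized case. The sketch below assumes a single, isolated, perfectly Lorentzian peak and takes "signal width" to mean the FWHM; both are assumptions, and the function name is hypothetical. For such a peak, a window of n times the FWHM centered on the peak captures a fraction (2/π)·arctan(n) of the total area.

```python
import math

def lorentzian_fraction(range_in_fwhm):
    """Fraction of a Lorentzian's total area inside a centred window
    whose full width is `range_in_fwhm` times the FWHM."""
    # Half-window divided by half-width equals range_in_fwhm,
    # so the arctan argument is simply the multiple itself.
    return (2.0 / math.pi) * math.atan(range_in_fwhm)

# Captured fraction for a few window sizes (in multiples of the FWHM).
fractions = {n: lorentzian_fraction(n) for n in (2, 10, 24)}
```

Under these assumptions, a window of twice the FWHM captures roughly 70% of the area, whereas 24 times the FWHM captures about 97%, which illustrates why wide windows are attractive for isolated peaks and why they become risky once neighboring signals fall inside them.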
Quantification relies on factors such as the area of a signal, the number of nuclei in the signal, the area of a reference signal, the number of nuclei in the reference, and, notably, the reference's concentration in the specimen. In the absence of an internal reference concentration, alternative methods include external references or electronic references [98]. While internal references generally offer more accurate concentration estimations, care should be taken if they interact with other signals [91]. In cases where area determination is challenging, such as in 13C NMR spectra, height may be used instead.
When multiple signals contribute to the concentration estimation of a compound, the choice is between selecting the most stable and isolated peak or calculating the mean value from multiple signals. In instances of multiple technical replicates, concentration estimation should be based on the mean value across these spectra [64].
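The relationship among these factors can be written as C = C_ref × (A / N) / (A_ref / N_ref), where A is a signal area and N the number of nuclei contributing to it. The sketch below applies this relationship to three technical replicates and averages the results; the areas, nuclei counts, and reference concentration are made-up illustrative values, and the function name is hypothetical.

```python
from statistics import mean

def concentration(area, n_nuclei, ref_area, ref_n_nuclei, ref_conc):
    """Concentration from per-nucleus areas relative to an internal reference."""
    return ref_conc * (area / n_nuclei) / (ref_area / ref_n_nuclei)

# Hypothetical example: an analyte signal from 3 equivalent nuclei, a reference
# signal from 9 equivalent nuclei, and a reference concentration of 0.5 mM.
# Each tuple is (analyte area, reference area) for one technical replicate.
replicate_areas = [(120.0, 300.0), (126.0, 300.0), (114.0, 300.0)]

estimates = [concentration(a, 3, r, 9, 0.5) for a, r in replicate_areas]

# The reported value is the mean across technical replicates.
mean_concentration = mean(estimates)
```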
Although quantification is typically based on a reference signal, for comparative analysis, Canlet et al. [63] employed additional methods in the metabolite quantification process. These methods included determining concentrations from peak areas using a regression model, a calibration curve, calibration-range solutions, and a sum of pseudo-Voigt line shapes, each a combination of Gaussian and Lorentzian functions, fitted by optimization. Other researchers may also use peak fitting for quantification [29,87,88].
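A line shape that combines Gaussian and Lorentzian functions is commonly written as a weighted sum of the two with a shared position, height, and FWHM. The sketch below shows this form; it is one common parameterization and not necessarily the one used in [63], and the function and parameter names are assumptions.

```python
import math

def pseudo_voigt(x, position, height, fwhm, eta):
    """Weighted sum of a Lorentzian and a Gaussian sharing height and FWHM.

    `eta` is the Lorentzian fraction: 1.0 is purely Lorentzian, 0.0 purely Gaussian.
    """
    hwhm = fwhm / 2.0
    d2 = (x - position) ** 2
    lorentz = hwhm ** 2 / (d2 + hwhm ** 2)
    gauss = math.exp(-math.log(2.0) * d2 / hwhm ** 2)
    return height * (eta * lorentz + (1.0 - eta) * gauss)

def spectrum_model(x, peaks):
    """Sum of line shapes; `peaks` is a list of (position, height, fwhm, eta) tuples."""
    return sum(pseudo_voigt(x, *peak) for peak in peaks)

# Far from the peak centre, the Lorentzian tail is much heavier than the Gaussian tail.
tail_lorentzian = pseudo_voigt(1.10, 1.00, 10.0, 0.02, 1.0)
tail_gaussian = pseudo_voigt(1.10, 1.00, 10.0, 0.02, 0.0)
```

Because both components drop to half height at the same distance from the center, changing `eta` alters the tails of the line without changing its FWHM.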
While these fit lines contain fewer or no random errors, they might deviate from observed spectrum data [34], leading to inaccurate peak areas and compound concentrations. Additionally, if the research goal is to identify significantly different peaks between two groups of spectra, using "error-free" numbers can potentially reduce or underestimate variance between groups and increase the false positive rate.

Normalization and transformation
This step aims to make data comparable or suitable for the assumptions needed in subsequent statistical analysis.

Normalization
Normalization makes data comparable. It can be classified into spectrum-wise and location-wise normalization, each of which can involve various approaches.

a. Spectrum-wise normalization
Spectrum-wise techniques, such as dividing peak areas by the total spectrum area [74,77,90,99], assume that the total signal quantity is equal across spectra, which may be impractical for diverse spectra. An alternative is normalizing by the area of an internal reference [14,76], which is adaptable to binned NMR data.
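The two spectrum-wise approaches described above can be sketched as follows for binned data. The example assumes uniform bin widths, so that sums of intensities stand in for areas, and the spectra and reference bin position are made up for illustration.

```python
def normalize_total_area(spectrum):
    """Divide every bin by the total area of the spectrum."""
    total = sum(spectrum)
    return [value / total for value in spectrum]

def normalize_to_reference(spectrum, reference_bins):
    """Divide every bin by the summed area of the internal reference bins."""
    reference_area = sum(spectrum[i] for i in reference_bins)
    return [value / reference_area for value in spectrum]

# Two hypothetical binned spectra of the same sample at different dilutions;
# bin 0 holds the internal reference signal.
spectrum_a = [2.0, 4.0, 10.0, 4.0]
spectrum_b = [4.0, 8.0, 20.0, 8.0]

total_a = normalize_total_area(spectrum_a)
total_b = normalize_total_area(spectrum_b)
ref_a = normalize_to_reference(spectrum_a, [0])
ref_b = normalize_to_reference(spectrum_b, [0])
```

Here both methods remove the dilution difference. They diverge when the total signal genuinely differs between samples, which is the case where the equal-total assumption fails.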
Less common spectrum-wise techniques include distribution-based strategies, such as quantile normalization [37,100,101], histogram (matching) normalization [14], and spline normalization, which align data distributions. Quantile normalization orders and transforms values across spectra so that all spectra share the same distribution. Histogram normalization scales data based on minimum and maximum values from a reference spectrum [102]. When a reference spectrum is unavailable, the average or median spectrum across a group of spectra can serve for histogram normalization. Spline normalization fits quantiles from experimental and reference spectra to a smooth cubic spline, which is then used to generate normalized data for the experimental spectrum [101,103]. Similarly, the cubic spline can be replaced by LOWESS (Locally Weighted Scatterplot Smoothing) [104].
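As an illustration of the distribution-based idea, the sketch below implements the textbook form of quantile normalization: each spectrum is sorted, the sorted values are averaged across spectra rank by rank, and each original value is replaced by the average for its rank. It assumes spectra of equal length and does not treat tied values specially; the function name and the example data are hypothetical.

```python
def quantile_normalize(spectra):
    """Give every spectrum the same distribution: the rank-wise mean across spectra."""
    n_points = len(spectra[0])
    sorted_spectra = [sorted(s) for s in spectra]
    # Mean of the r-th smallest value across all spectra.
    rank_means = [
        sum(s[rank] for s in sorted_spectra) / len(spectra)
        for rank in range(n_points)
    ]
    normalized = []
    for spectrum in spectra:
        # Indices of this spectrum ordered from smallest to largest value.
        order = sorted(range(n_points), key=lambda i: spectrum[i])
        out = [0.0] * n_points
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized

spectra = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(spectra)
```

After normalization, every spectrum contains exactly the same set of values, while the ordering of locations within each spectrum is preserved.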
Among these techniques, reference-based normalization, relying on a spike-in internal reference with a known concentration, is widely regarded as the most robust choice.

b. Location-wise normalization
Location-wise normalization ensures comparability of a variable across different locations. While methods in this section can be applied to spectrum-wise normalization (Section a.), those in Section a. are generally not applicable here. The simplest method is variable centering, which involves subtracting the mean or median across spectra for the same location and adding a constant [14].
Level scaling adjusts variables by dividing them by their mean at the same location across spectra, promoting alignment and facilitating comparative analysis [14,17]. Unit variance scaling (auto-scaling) standardizes variables by dividing each by its standard deviation, ensuring all variables contribute equally to analysis regardless of their initial scale. Vast scaling enhances sensitivity to mean differences by multiplying unit variance-scaled data by the inverse of their coefficients of variation (the mean divided by the standard deviation), which emphasizes variables with small relative variation [14,105]. Pareto scaling divides each variable by the square root of its standard deviation, mitigating the impact of large variances while preserving data structure, which makes it suitable for datasets with heterogeneous variance. Range scaling adjusts variables based on their range, facilitating comparisons across different scales by normalizing their values relative to their spread [96].
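The location-wise methods differ mainly in the divisor applied to each variable. The sketch below mean-centers each location first, as is common in the chemometrics literature, and then divides by the method-specific factor. The centering step, the use of the sample standard deviation, and all names and data are assumptions; vast scaling is omitted for brevity.

```python
from statistics import mean, stdev

def scale_variable(values, method):
    """Scale the intensities observed at one location across spectra."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    centered = [v - m for v in values]
    divisors = {
        "center": 1.0,                       # centering only
        "level": m,                          # level scaling
        "auto": s,                           # unit variance (auto) scaling
        "pareto": s ** 0.5,                  # Pareto scaling
        "range": max(values) - min(values),  # range scaling
    }
    return [c / divisors[method] for c in centered]

def scale_locations(spectra, method):
    """Apply `scale_variable` to every location (column) of a spectra matrix."""
    columns = [list(col) for col in zip(*spectra)]
    scaled = [scale_variable(col, method) for col in columns]
    return [list(row) for row in zip(*scaled)]

# Four hypothetical spectra (rows) observed at two locations (columns).
spectra = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0], [8.0, 400.0]]
auto_scaled = scale_locations(spectra, "auto")
range_scaled = scale_locations(spectra, "range")
```

After auto-scaling, both locations have unit standard deviation even though their raw intensities differ by two orders of magnitude, which is exactly why such data no longer reflect quantity differences among peaks.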
Standardization, a traditional normalization method, involves subtracting the mean and dividing by the standard deviation. However, direct standardization is not applicable to NMR data due to positivity concerns. A variation involves subtracting the mean, adding a constant, and then dividing by the standard deviation.
Vignoli et al. [106] compiled a list of 23 state-of-the-art normalization methods and noted that consensus on an optimal method remains elusive, because the choice depends on the available information and the research goals.
The absence of consensus regarding the ideal normalization method emphasizes the necessity for ongoing research and evaluation in this field. Different methods may lead to different interpretations of the data's structure and of variable significance, and may therefore affect results [105]. In current practice, it is essential to apply one normalization method consistently throughout an entire experiment to maintain data comparability. Additionally, regardless of the chosen normalization approach, there is a risk of amplifying the noise range, potentially compromising the integrity of the entire dataset if noise is misclassified as peaks during the peak picking step. Lastly, it is crucial to note that location-wise normalized data should not be utilized for quantification purposes, as it obscures quantity differences among peaks.

Transformation
Transformation is applied to each variable in NMR data to align the data with the assumptions required by a statistical method. The most commonly used transformation is the log transformation, enhancing normality and mitigating heteroscedasticity [14]. It is important to note that log transformation is unsuitable for non-positive numbers, and its nonlinear nature may lead to noise amplification. The G-log transformation refers to a generalized log transformation or its variants [107,108,109]. While high values are logarithmically transformed similar to the regular log transformation, low values or noise undergo specialized transformation to avoid noise amplification issues. Implementing G-log requires prior knowledge of high and low-value thresholds [17,90].
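The behavior of a generalized log can be illustrated with one common variant, glog(x) = ln((x + sqrt(x² + λ)) / 2). This form is an assumption chosen for illustration; the cited references may use different variants, and the parameter λ, which sets where the transformation changes from roughly linear to logarithmic, must be chosen for the data at hand.

```python
import math

def glog(x, lam):
    """One common generalized-log variant; requires lam > 0.

    Behaves like log(x) for x much larger than sqrt(lam) and stays finite
    at zero and for moderately negative values, unlike the plain log.
    """
    return math.log((x + math.sqrt(x * x + lam)) / 2.0)

# Hypothetical intensities including a zero and a slightly negative noise value.
intensities = [-1.0, 0.0, 0.5, 10.0, 1000.0]
transformed = [glog(v, 4.0) for v in intensities]

# For a large intensity, the result is close to the ordinary log.
difference_at_high_value = abs(glog(1000.0, 4.0) - math.log(1000.0))
```

The plain log would fail on the first two values and would stretch small positive noise values far apart, whereas this variant compresses them into a narrow range.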
The Box-Cox transformation aims to find the optimal power transformation for effective normalization, reducing non-normality effects and eliminating heteroscedasticity [14,110].
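A minimal sketch of the Box-Cox approach follows, using the standard definition y = (x^λ − 1)/λ for λ ≠ 0 and y = ln(x) for λ = 0, with λ selected by maximizing the usual profile log-likelihood over a grid. The grid, the example intensities, and the function names are assumptions made for illustration.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transformation of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def inverse_box_cox(y, lam):
    """Map a transformed value back to the original scale."""
    return math.exp(y) if lam == 0 else (lam * y + 1.0) ** (1.0 / lam)

def log_likelihood(values, lam):
    """Profile log-likelihood of lam (up to an additive constant)."""
    n = len(values)
    y = [box_cox(v, lam) for v in values]
    y_mean = sum(y) / n
    variance = sum((v - y_mean) ** 2 for v in y) / n
    return -0.5 * n * math.log(variance) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid):
    """Grid search for the lam with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: log_likelihood(values, lam))

# Hypothetical intensities that are symmetric on the log scale.
intensities = [math.exp(z) for z in (-2.0, -1.0, 0.0, 1.0, 2.0)]
grid = [i / 10.0 for i in range(-20, 21)]
lam_hat = best_lambda(intensities, grid)
```

For these log-symmetric intensities the search selects λ = 0, the log transformation. Individual values can be mapped back with `inverse_box_cox`, but the same back-mapping does not carry over directly to variances or confidence intervals.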
Regardless of the transformation method used, it is crucial to note that while variable values can be transformed back to the original scale, reverting variances and 95% confidence intervals to the original scale poses challenges.
With the exception of internal reference-based area spectrum-wise normalization, all other methods in Section 3.9 are tailored for statistical analysis, not molecule quantification.

Conclusion
In conclusion, this comprehensive review provides a detailed exploration of NMR data pre-processing in both the time and frequency domains. In the time domain, it carefully covers key pre-processing steps, including DC offset removal through methods like phase cycling, eddy current correction primarily addressing phase errors, two directions of FID shift and linear prediction, the impact of weighting functions, zero-filling ratio, and choices in domain transformation. Emphasizing the importance of each step for reliable data analysis, the review discusses potential distortions and provides guidelines for application.
Transitioning to the frequency domain, the article delves into the intricacies of NMR data pre-processing, spotlighting critical steps. Dealing with non-linear phase errors can be challenging, but the 'NMRphasing' R package offers potential solutions. Baseline correction methods and solvent filtering techniques are discussed with attention to potential distortions. The review also covers alignment methods and their impact on quantification. While reference deconvolution aims to address lineshape distortion, the assumptions behind it are often not practical. Additionally, the review discusses binning strategies and emerging artificial intelligence approaches, recognizing the need for human intervention. Challenges in compound identification, integration, quantification, and a comprehensive overview of normalization and transformation techniques tailored for statistical analysis are addressed, underscoring the careful selection of methods to ensure accurate NMR data interpretation.
Among these pre-processing steps, non-linear phase error correction, peak picking, intelligent binning, and peak deconvolution present notable challenges. While various methods exist, promising avenues for improvement are offered by optimization processes, particularly those aided by artificial intelligence techniques and deep learning with neural networks. However, adapting neural networks to NMR data requires balancing complexity with practical application, which poses a significant challenge similar to other deep learning applications. Additionally, the size limitation of NMR datasets poses a formidable obstacle to effectively training deep learning models.
Strategies to overcome this size limitation include aggregating NMR spectra from various sources and implementing normalization methods across datasets to create large, comparable training datasets. However, as discussed earlier, normalized spectra cannot be used for quantification, adding another layer of challenge compared to deep learning in other fields such as natural language processing. A potential solution is to apply a traceable normalization method before deep learning on training datasets, maintaining consistency with new spectra but reverting to non-normalized spectra after intelligent binning and peak deconvolution.
Alternatively, incorporating various factors, such as source differences, into deep learning models may enhance performance and ease of application, albeit at the expense of increased complexity.
Ultimately, the pursuit of innovative approaches that strike a balance between complexity and applicability will drive advancements in NMR data pre-processing. These advancements have the potential to not only improve the accuracy and reliability of NMR data analysis but also facilitate broader utilization across diverse research domains.
Looking ahead, an exciting future involves developing fully automatic AI programs capable of generating comprehensive data, including lists of components and quantities, immediately after NMR analysis. Achieving this goal requires building a robust training database from diverse samples and developing tailored deep learning algorithms for NMR data. This approach aims to simplify data analysis across scientific disciplines and enhance real-time clinical applications such as MRI and fMRI. By focusing on practical usability, these advancements aim to support researchers in various fields.

APPENDIX
Websites for NMR data preprocessing software mentioned in text

Figure 2 Effect of DC voltage on a signal

Figure 3 Illustration depicting the eddy current effect in time and frequency domains

Figure 5 Frequency plots illustrating the impact of ending value, zero filling, and signal percentage

Figure 6 Illustration of NMR phase errors

Figure 7 A partial segment of a real-world absorption spectrum illustrating the distorted water peak around 4.7-5.0 ppm in a urine sample

Figure 8 Illustration of alignment in NMR spectra. Top panel: before alignment. Bottom panel: after alignment. Each color represents a partial segment of a real-world absorption spectrum in a urine sample.

Figure 9 An example of a partial simulated 1D NMR spectrum (in black) and its three deconvolution peaks (color-coded)