Weakly Supervised Detection of Baby Cry

Detection of baby cries is an important part of baby monitoring and health care. Almost all existing methods use supervised SVM, CNN, or their variants. In this work, we propose to use weakly supervised anomaly detection to detect a baby cry. Under this weak supervision, we only need a weak annotation indicating whether there is a cry in an audio file. We design a data mining technique using the pre-trained VGGish feature extractor and an anomaly detection network on long untrimmed audio files. The obtained datasets are used to train a simple CNN feature network for cry/non-cry classification. This CNN is then used as a feature extractor in an anomaly detection framework to achieve better cry detection performance.


Introduction
Baby monitoring is an important application of video surveillance, computer vision and machine learning. It helps take better care of babies and reduces the burden on caregivers. A baby's cry is a signal that communicates needs including hunger, discomfort, or pain. It is used not only for caregiving but also for disease diagnosis.
The goal of baby cry detection is to detect the baby cry and localize its start and end positions in an audio signal. It is a challenging task since the baby cry sound may be mixed with different background noise in various environments, such as home and hospital.
In recent years, detection of baby cry has been studied using both traditional machine learning and deep learning algorithms. The traditional algorithm, typically SVM, works well on hand-crafted acoustic features in both the frequency domain and the time domain. Typical examples include MFCCs and their variants, pitch-related features, harmonic features, energy, zero crossing rate, etc. For a review of these approaches, the readers are referred to [13,4,24].
The deep learning algorithms mostly use CNN [13,4,24,8], and some use other types of neural network [15]. It has been shown in [4,24] that the performance of CNN is much better than that of traditional machine learning. On the other hand, the complexity of the CNN may be high, preventing it from being used in common embedded devices, like low-cost IP cameras or tablets. In this paper we address this issue by designing a super lightweight CNN.
Most, if not all, existing CNN methods use supervised learning. Therefore, frame-level annotation is needed. This annotation is very time consuming and prone to human mistakes. In the few datasets available online, some of the audio files [10,25,21] are trimmed and annotated, while others [9] are untrimmed with only audio-level annotations. In this work, we propose to use weakly supervised anomaly detection [22] to detect baby cry in audio signals. This weak supervision only requires weak annotation, i.e., whether there is a cry in the audio file, without frame-level annotation, and therefore makes the data annotation a lot easier.
Based on this weakly supervised anomaly detection and a well-known pretrained VGGish network [12] for audio feature extraction, we design a data mining technique to obtain frame-level datasets for supervised CNN classification. We use this dataset to train a carefully designed super lightweight CNN, which runs very fast on embedded devices. This CNN, used as the feature extractor in an anomaly detection framework, gives better performance than the CNN alone.
The contribution of this paper is three-fold:
• First, we propose to use anomaly detection to detect baby cry in audio signals. To our knowledge, we are the first to use anomaly detection for the purpose of baby cry detection.
• We design a data mining technique using the pretrained VGGish [12] feature extractor and an anomaly detector. The obtained dataset has similar performance to an annotated dataset on our lightweight CNN.
• We design a super lightweight CNN. This CNN makes it possible to run our framework on embedded devices.
arXiv:2304.10001v3 [cs.CV] 26 Nov 2023

Related Work
In this section we first review the literature on baby cry detection. Then we review the anomaly detection approaches on video signals. We borrow this anomaly detection idea and extend it to baby cry detection on audio signals.

Baby Cry Detection
The first step of baby cry detection is audio signal preprocessing. The main tasks are denoising and audio segmentation. The purpose of denoising is to filter out noise in unwanted frequency bands. Audio segmentation uses a voice activity detector (VAD) to remove silent durations. In this work we treat silent durations as non-cry segments and directly apply the cry detection to them.
The signal processing features of an audio signal can be categorized into the cepstral domain, prosodic domain, time domain, image domain, and wavelet domain [13]. The cepstral domain is the most widely used. It includes the Mel frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs), and the corresponding spectrogram. Please note that the spectrogram is 2D data, while the cepstral coefficients are a set of scalar values.
Baby cry detection is essentially a binary classification task. A variety of classification techniques can be used, including 2-D CNN, 1-D CNN, SVM, KNN, and multiple-layer perceptron (MLP). In previous work [24,3,20,6,11,8,18,16], these different features and classification methods are used on private datasets. Since the datasets are private, it is impossible to conclude which method is better. However, in the authors' own comparisons, a common finding is that the CNN outperforms the traditional machine learning methods. For a complete review, please refer to [13,4,24].
As a superset of baby cry detection, audio anomaly detection typically uses unsupervised learning [14]. The work [1] presents a large audio dataset for anomaly detection including baby cry detection. However, even though the term anomaly detection is used, in their detection algorithm audio files are first cut into small segments, then supervised learning is used to classify every segment. We are the first to explore weakly supervised learning, which requires significantly less work to prepare the dataset.

Anomaly Detection on Videos
Weakly supervised anomaly detection only uses video-level annotations. These annotations only give a binary label of abnormal or normal for a video. Sultani et al. [22] propose the MIL framework using only video-level labels and introduce the large-scale anomaly detection dataset UCF-Crime. This work inspires quite a few follow-up studies [29], [19], [5], [17], [27], [26], [7], [23], [27]. However, in the MIL-based methods, abnormal video labels are not easy to use effectively. Typically, the classification score is used to tell whether a snippet is abnormal or normal. This score is noisy in the positive bag, where a normal snippet can be mistakenly taken as the top abnormal event in an anomaly video. To deal with this problem, Zhong et al. [29] treat it as binary classification under noisy labels and use a graph convolutional network (GCN) to clean the label noise. In RTFM [23], a robust temporal feature magnitude (RTFM) is used to select the most reliable abnormal snippets from the abnormal and normal videos. They unify the representation learning and anomaly score learning by a temporal feature ranking loss, enabling better separation between normal and abnormal feature representations and improving the exploitation of weak labels compared to previous MIL methods. More details will be given later.

Anomaly Detection in Audios
The task of anomaly detection is to find and localize anomalous or abnormal events in videos. There are self-supervised methods trained only on normal datasets and weakly supervised methods trained on both abnormal and normal datasets annotated with video-level or audio-level labels. Weakly supervised anomaly detection in videos is first proposed in [22]. It uses a multiple-instance learning (MIL) framework to find the segment in the positive (abnormal) or negative (normal) data sample whose classification score is the maximum. Then the distance between the two segment scores is maximized for the best discriminability.
In this work, we extend the anomaly detection from video signals to audio signals. In videos, a frame is 2D, so a segment of frames is 3D. In audio signals, a segment is a 1D audio signal and its spectrogram is 2D. So instead of a 3D CNN backbone, a 2D CNN backbone is needed.
Let $V_a$ and $V_n$ represent the segments in the abnormal and normal audio. The MIL expects to have the following objective function,
$$\max_{i \in B_a} f(V_a^i) > \max_{i \in B_n} f(V_n^i), \tag{1}$$
where $B_a$ and $B_n$ are the bags of segments in the abnormal and normal audio, and $f$ is the predicted anomaly score in the range from 0 to 1. The max is taken over all instances in a bag. It is used because the segment-level annotation is not available. It is expected that in the positive bag, the highest-scored instance is a true abnormal segment.
The highest-scored instance in the negative bag is the one most similar to the positive bag, but is actually a negative instance. This makes the negative instance a hard one and therefore benefits the discriminability in the model training.
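To push the positive instance and negative instance further apart, the MIL ranking loss is defined as
$$l(B_a, B_n) = \max\left(0,\; 1 - \max_{i \in B_a} f(V_a^i) + \max_{i \in B_n} f(V_n^i)\right). \tag{2}$$
It is worth noting that this loss function looks similar to the contrastive loss function, which is used to separate two or more classes as far as possible. Two regularization terms, the smoothness term and the sparsity term, are added to it. So the overall loss function is [22]
$$L(B_a, B_n) = l(B_a, B_n) + \lambda_1 \sum_{i=1}^{n-1} \left(f(V_a^i) - f(V_a^{i+1})\right)^2 + \lambda_2 \sum_{i=1}^{n} f(V_a^i), \tag{3}$$
where $n$ is the number of segments in a bag.
It is expected in Eq. 1 that abnormal segments have higher scores than normal segments. However, this is not always true. A few methods [19], [17], [27], [26], [7], [23], [27] have studied how to improve the score quality so that the correct abnormal segment is chosen in the abnormal bag. In the work RTFM [23], a different approach is used. Instead of using the classification score as the criterion to choose the abnormal segment, the authors propose to use a feature magnitude, which they believe has better discriminability between abnormal and normal instances. Furthermore, they propose to use multiple instances whose feature magnitudes are the largest and call them the top-k instances. In their approach, the MIL ranking loss is defined on the feature magnitude,
$$l_{FM} = \max\left(0,\; m - d\left(f(X_a^{\text{top-}k}),\, f(X_n^{\text{top-}k})\right)\right), \tag{4}$$
where $m$ is a pre-defined margin and $d$ is a distance function.
The standard cross-entropy loss is used as the classification loss. However, it is applied on the top-k segments whose feature magnitudes are the largest. If k > 1, the scores are averaged before feeding into the cross-entropy loss function,
$$l_{CE} = -\left(y \log \bar{s} + (1 - y) \log(1 - \bar{s})\right), \tag{5}$$
where $y$ is the audio-level label and $\bar{s}$ is the mean classification score of the top-k segments. The same smoothness term and sparsity term are also used, so the overall loss function is
$$L = l_{CE} + \alpha\, l_{FM} + \lambda_1\, l_{smooth} + \lambda_2\, l_{sparse}, \tag{6}$$
where $l_{smooth}$ and $l_{sparse}$ denote the smoothness and sparsity terms of Eq. 3, and $\alpha$, $\lambda_1$ and $\lambda_2$ are pre-defined weight factors.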
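As a rough illustration of the MIL ranking loss of [22] described above, together with its smoothness and sparsity terms, the following pure-Python sketch evaluates the loss for one abnormal bag and one normal bag. The function name and the default weights `lambda1` and `lambda2` are illustrative assumptions, not values reported in this paper.

```python
def mil_ranking_loss(scores_abnormal, scores_normal, lambda1=8e-5, lambda2=8e-5):
    """Hinge ranking loss on the top-scored instance of each bag, plus regularizers.

    scores_abnormal, scores_normal: per-segment anomaly scores in [0, 1],
    listed in temporal order. lambda1 and lambda2 are illustrative weights.
    """
    # Hinge term: the best abnormal score should exceed the best normal score by 1.
    hinge = max(0.0, 1.0 - max(scores_abnormal) + max(scores_normal))
    # Smoothness: neighbouring segments of the abnormal bag should score similarly.
    smoothness = sum(
        (a - b) ** 2 for a, b in zip(scores_abnormal, scores_abnormal[1:])
    )
    # Sparsity: only a few segments of the abnormal bag should score high.
    sparsity = sum(scores_abnormal)
    return hinge + lambda1 * smoothness + lambda2 * sparsity
```

With the regularizers switched off, a bag pair whose top scores are 0.9 (abnormal) and 0.3 (normal) gives a hinge value of about 0.4, and the loss falls to zero only when the two top scores are separated by the full margin.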
In addition, a multi-scale (dilated convolution) non-local aggregation (MSNL) block is used [23] on the feature extracted from the pre-trained CNN backbone. This block is also important for the feature magnitude training. Without this block the feature is fixed and cannot be learned. The MSNL is used in [23], but other simpler networks, e.g., a few fully connected (FC) layers, may also work.
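The top-k selection by feature magnitude and the cross-entropy on the averaged top-k scores can be sketched as follows. This is a minimal sketch that assumes the feature magnitude is the L2 norm of a segment's feature vector; the function names and the clipping constant `eps` are hypothetical and are not taken from the RTFM codebase.

```python
import math


def l2_norm(vec):
    """Feature magnitude, assumed here to be the L2 norm."""
    return math.sqrt(sum(x * x for x in vec))


def topk_by_magnitude(features, k):
    """Indices of the k segments with the largest feature magnitude."""
    order = sorted(
        range(len(features)), key=lambda i: l2_norm(features[i]), reverse=True
    )
    return order[:k]


def topk_cross_entropy(features, scores, label, k=2, eps=1e-7):
    """Binary cross-entropy on the mean score of the top-k segments."""
    idx = topk_by_magnitude(features, k)
    mean_score = sum(scores[i] for i in idx) / len(idx)
    mean_score = min(max(mean_score, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(mean_score) + (1 - label) * math.log(1.0 - mean_score))
```

Note that the selection uses the feature magnitudes only; the classification scores enter after the segments have been chosen.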

Proposed Anomaly Detection Framework
The overall block diagram of our proposed network is illustrated in Fig. 1. The style of the figure is borrowed from [22]. A framework similar to RTFM [23] is used, with all necessary modifications for anomaly detection of baby cry in audios.
The abnormal or normal audio signal is first divided into a certain number of equal-length segments. We use 16 in the figure as an example. Every segment is called an instance in the positive or negative bag of instances. All instances pass through a pre-trained CNN backbone, and CNN features are extracted. In Figure 1, the CNN backbone is called BlazeNet, our carefully designed super lightweight network, whose details will be given later. This CNN feature is fed into a feature refinement network and a second CNN feature is extracted. The features of the positive instance bag form the positive feature bag, and the same holds for the negative feature bag. The top-k instances whose feature magnitudes are the largest in each bag are selected. The classification network is typically two or three FC layers. Through the back propagation of the loss function in Eq. 5, the feature magnitude and the classification are both learned at the same time.
In implementation, when the audio file is very short and the audio signal is divided into 16 segments, a segment may not be long enough for a frame. So a CNN feature is extracted for every frame, then linear interpolation is used to generate features for the 16 segments.
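The interpolation step can be illustrated with a small pure-Python sketch that resamples a sequence of per-frame feature vectors to a fixed number of segments. It assumes endpoint-aligned linear interpolation along the time axis; the exact scheme used in the implementation is not specified in the text, and the function name is hypothetical.

```python
def resample_features(frame_features, num_segments=16):
    """Linearly interpolate T frame features (each a list of D floats)
    to num_segments features along the time axis."""
    t = len(frame_features)
    if t == 0:
        raise ValueError("at least one frame feature is required")
    out = []
    for s in range(num_segments):
        # Fractional position of segment s on the frame axis [0, t - 1].
        pos = 0.0 if num_segments == 1 else s * (t - 1) / (num_segments - 1)
        lo = int(pos)
        hi = min(lo + 1, t - 1)
        w = pos - lo
        out.append(
            [(1.0 - w) * a + w * b
             for a, b in zip(frame_features[lo], frame_features[hi])]
        )
    return out
```

For example, three 128-D frame features from a 3s file would be expanded to sixteen 128-D segment features, with the first and last segments equal to the first and last frames.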

BlazeNet
The goal of this study is to design a baby cry detection framework that can work efficiently on embedded devices. So the CNN backbone network must be super lightweight and, at the same time, achieve good performance. We try the popular MobileNet, ShuffleNet, and SqueezeNet, and find that they are still too large, not to mention the VGGish network widely used in audio recognition. We take the backbone from BlazeFace [2]. We make changes so that the input size is 64x64 and the output feature size is 224. We use 16 BlazeBlocks, where the 11th BlazeBlock output is classified by a first classifier FC1(88,2), and the 16th BlazeBlock output is classified by a second classifier FC2(96,6). The outputs of these two classifiers are flattened and concatenated, then classified by the final FC(224,2) classifier. The details of our BlazeNet are listed in Table 1. The total number of parameters of this model is 89,680 in PyTorch.

Data Mining Datasets for BlazeNet
The CNN backbone in Figure 1 is pre-trained and fixed when the anomaly detection network (feature refinement network and classification network) is trained. If this CNN and the anomaly detection network are trained together end-to-end, the GPU memory can easily overflow. For this reason, all previous anomaly detection methods in videos use a pre-trained 3D CNN. So we need to pre-train the CNN backbone before the whole anomaly detection framework can work to detect baby cry in audios. To do so, we need some training data (herein also covering the validation and test datasets). One way is to prepare some audio data manually, which we do in this work. More details will be given later. However, manually annotating audio data is very time consuming and prone to human mistakes. We propose a second way: mining training data from a weakly annotated dataset using a different pre-trained CNN backbone.
In [9], the authors publish a large audio dataset called AudioSet. At the same time they publish a pre-trained VGGish backbone [12] for CNN feature extraction. With their default settings, the log-Mel-spectrogram is computed on a 0.96s frame. The output CNN feature is 128-D. We use this VGGish network to extract CNN features for audio files, then apply the anomaly detection framework to them. To make it clear, the 2D BlazeNet in Figure 1 is replaced with the pre-trained VGGish network. After the anomaly detection network (feature refinement network and classification network) is trained, the framework is set to inference mode, and all training, validation, and test datasets are processed. Please note that, in the inference mode, the audio files are not divided into 16 segments. Instead, the audio signal in [...]
The pipeline of the data mining and anomaly detection framework for baby cry is shown in Figure 2. The blocks on the left perform the data mining, and the blocks on the right perform the anomaly detection of baby cry. Please note that only the training procedure of anomaly detection is plotted; the testing procedure can be derived accordingly.
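A minimal sketch of the mining step, under the assumption (taken from the experiment section, where the top-2 segments per file are saved and the highest-scored negatives serve as hard negatives) that the k highest-scored 1s segments of each file inherit the file-level label. The function name and the list-of-scores input format are hypothetical.

```python
def mine_topk_segments(segment_scores, file_label, k=2):
    """Pick the k highest-scored 1s segments of one audio file.

    segment_scores: anomaly scores of consecutive 1s segments from inference.
    file_label: weak audio-level label (1 = contains cry, 0 = no cry).
    Returns (segment_index, label) pairs in temporal order.
    """
    order = sorted(
        range(len(segment_scores)), key=lambda i: segment_scores[i], reverse=True
    )
    return [(i, file_label) for i in sorted(order[:k])]


# A cry file: the two segments most likely to contain the cry are kept as positives.
positives = mine_topk_segments([0.10, 0.80, 0.95, 0.20, 0.05], file_label=1)
# A non-cry file: its highest-scored segments are kept as hard negatives.
negatives = mine_topk_segments([0.05, 0.30, 0.10], file_label=0)
```

The mined segments can then be cut from the audio and used as frame-level training samples for the lightweight CNN.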

Datasets
Even though there are quite a few publications on baby cry detection, none of the datasets used are publicly available. We search online and organize the datasets from the sources listed in Table 2. Please note that the backgrounds of all these datasets are clean except for AudioSet [9], whose background is very noisy. We clean the datasets and filter out the ambiguous cases. The numbers of cleaned samples are listed in the table. We will publish the dataset we organize soon.
In total we have 2624 baby cry audios and a very large number of other audios. We use all other audios in [10,25,21], then randomly collect other audios from [9]. The total number of non-cry audios is about the same as the number of baby cry audios.
For the training of BlazeNet, we have two audio frame lengths, 5s and 1s. If the length of an audio is longer than two times the frame length, then more than one frame can be cut from an audio file. After the cutting we manually check whether there is really a segment of baby cry in the audio file. The audio frames are randomly divided into training, validation and test datasets with ratio 8:1:1.
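The frame cutting and the 8:1:1 split can be sketched as below. The sketch assumes non-overlapping frames that must fit entirely inside the audio and a seeded random shuffle; both are illustrative choices, and the function names are hypothetical.

```python
import random


def frame_starts(duration_s, frame_s):
    """Start times (in seconds) of non-overlapping frames that fit in the audio."""
    n = int(duration_s // frame_s)
    return [i * frame_s for i in range(n)]


def split_8_1_1(items, seed=0):
    """Randomly split items into training, validation and test sets (8:1:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

With a 5s frame length, a 7s file yields a single frame, while a 12s file yields two frames starting at 0s and 5s.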
For the anomaly detection, we use two audio lengths: one is 5s, the other is the original audio file length. When the length is 5s, it is the same dataset as above. In this case, the frame length is only 1s. When the length is the original audio file length, the frame length is also 1s.
Very recently, we found that a new baby cry dataset was released in [28]. However, the link to the dataset is broken, so it is not really publicly available.

Implementation Details
For the BlazeNet as a standalone CNN classification network on baby cry detection, we implement it in PyTorch. SGD is used as the optimizer with a starting learning rate of 0.001 and momentum 0.9. The training runs for 60 epochs. The learning rate is decayed by a factor of 0.1 every 20 epochs. A batch size of 32 is used.
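The learning rate schedule described above is a plain step decay. The sketch below computes the rate for a given epoch in pure Python; it mirrors what a step scheduler such as PyTorch's `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.1` would produce, although the text does not state which scheduler was actually used.

```python
def step_lr(epoch, base_lr=1e-3, gamma=0.1, step_size=20):
    """Learning rate at a given 0-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rates over the 60 training epochs: 1e-3, then 1e-4, then 1e-5.
schedule = [step_lr(e) for e in range(60)]
```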
For anomaly detection, we use the RTFM codebase [23] in PyTorch. Every audio file is divided into 5 segments when the input audio file is 5s long. It is divided into 10 segments when the original audio files are used. The top-k is set to 2. Two dataset iterators, one for the abnormal data and the other for the normal data, are used. This way, the pairing of abnormal and normal data is random, even when the numbers of abnormal and normal samples are different. An initial learning rate of 1E-3 is used, and the training runs for 20000 steps (we do not use epochs because the abnormal data loader and the normal data loader are iterated independently). A batch size of 128 is used.
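The two-iterator scheme can be sketched with plain Python generators. Each generator reshuffles and restarts its own dataset independently, so the training loop is counted in steps rather than epochs. This is an illustrative sketch, not the RTFM data loader; the function names and seeds are hypothetical.

```python
import random


def endless_batches(samples, batch_size, seed):
    """Yield shuffled batches forever, reshuffling after every pass."""
    if len(samples) < batch_size:
        raise ValueError("dataset is smaller than one batch")
    rng = random.Random(seed)
    while True:
        order = list(samples)
        rng.shuffle(order)
        # Drop the last incomplete batch of each pass.
        for i in range(0, len(order) - batch_size + 1, batch_size):
            yield order[i:i + batch_size]


def paired_steps(abnormal, normal, batch_size, num_steps, seed=0):
    """Pair one abnormal batch with one normal batch at every training step."""
    it_abnormal = endless_batches(abnormal, batch_size, seed)
    it_normal = endless_batches(normal, batch_size, seed + 1)
    for _ in range(num_steps):
        yield next(it_abnormal), next(it_normal)
```

Because the two generators cycle at different rates when the datasets differ in size, the same abnormal batch meets different normal batches over time.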
For the VGGish input, the default log Mel spectrogram parameters are used. Specifically, sampling rate = 16K Hz, number of frames in batch = 96, number of Mel bands = 64, FFT window length = 0.025s, FFT hop length = 0.01s, min Mel frequency = 125 Hz, max Mel frequency = 7500 Hz, log offset = 0.01, example hop seconds = 0.96. The only change we make is example window seconds = 1s so that we have a 96x64 log Mel spectrogram output for every 1s of audio.
| Source | Length | Annotation | Cleaned number | Background |
|---|---|---|---|---|
| [10] | 5s | Baby Cry, other 3 | 108, 324 | clean |
| [25] | 7s | Baby Cry only | 482 | clean |
| ESC-50 [21] | 5s | Baby Cry, many others | 40, 1960 | clean |
| AudioSet [9] | Untrimmed | Baby Cry, many others | 1364, a lot | noisy |

Table 2. Dataset sources of baby cry we find online.

| Dataset | Val Acc | Test Acc |
|---|---|---|
| 5s audios | 0.9447 | 0.9312 |
| untrimmed audios | 0.9370 | 0.9223 |

Table 3. Performance of anomaly detection of baby cry using VGGish features. Accuracy is measured at default classification threshold = 0.5.

For the BlazeNet input, we use different log Mel spectrogram parameters and we use the Librosa library. When the example window seconds = 1s, sampling rate = 8K Hz (this is the audio signal sampling rate on most IP cameras), number of Mel bands = 64, FFT window length = 0.064s, FFT hop length = 0.01475s, min Mel frequency = 0 Hz, max Mel frequency = 8000 Hz. Other default parameters are used. Please note that we use an FFT hop length such that the spectrogram output size is 64x64, which is required by the BlazeNet. Resizing the Mel spectrogram array is not recommended since it causes performance loss. When the example window seconds = 5s, the FFT window length and FFT hop length are adjusted accordingly so that the spectrogram output size is 64x64. Please note that, in all our performance evaluations, the data unit is the 1s audio segment. In practice, we give a detection result for every 1s of audio signal input.
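The relation between the hop length and the 64x64 spectrogram size can be checked with a little arithmetic. The sketch below assumes an unpadded STFT (in Librosa terms, `center=False`), which is an assumption on our side since the padding mode is not stated; with centered padding the frame count would be larger.

```python
def num_stft_frames(num_samples, win_length, hop_length):
    """Number of frames of an STFT without padding at the signal edges."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length


sample_rate = 8000                       # Hz
win_length = round(0.064 * sample_rate)  # 512 samples
hop_length = round(0.01475 * sample_rate)  # 118 samples
frames_1s = num_stft_frames(1 * sample_rate, win_length, hop_length)
print(frames_1s)  # 64 under these assumptions
```

With 64 Mel bands this gives a 64x64 log Mel spectrogram for every 1s of audio, so no resizing is needed before the BlazeNet input.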

Data Mining using VGGish
We first test how VGGish [12] features work in the anomaly detection network. The performance must be good for the data mining to work well. From the standpoint of anomaly detection, positive instances must have larger scores than the negative instances (see Eq. 1). In other words, the highest-scored instances in the abnormal bag must be truly positive instances.
We do this test on both the trimmed 5s audio files and the untrimmed audio files. VGGish features are extracted for 1s segments sequentially without overlap in every audio file. These features and their audio labels are used in training and testing the anomaly detection network.
The experiment results are listed in Table 3. We observe that the validation and test accuracy results are all higher than 0.90. We believe this good performance will make the data mining method work well, which will be verified later with the performance of the standalone BlazeNet classification.

BlazeNet Classification Results
We first train the BlazeNet on trimmed 1s audios. The trimmed 5s audios are prepared manually. To get the 1s audio dataset, every 5s audio is cut into five 1s segments without overlap. When the 1s audios are used in training, there are two modes. In the first mode, all 5 segments of every 5s audio file are used. In the second mode, only 2 randomly selected segments are used. This is for a fair comparison with the data mining method, where only the top-2 segments are saved as the training dataset.
Please note that in all these experiments, only the training dataset changes, while the validation and test datasets stay the same. In the last experiment, long untrimmed audios are added to the training dataset. We argue that always using short 1s audio files as validation and test datasets is reasonable because, in practical applications, a decision per 1s of audio signal is preferred to avoid long latency.
The experiment results are listed in Table 4, where all the accuracy results are taken at the default threshold = 0.5. The accuracy of using all 1s segments is the best, and that of using 2 random segments is slightly worse. This is probably because the selected segments do not cover as many cases as all the 1s segments do.
Considering that the 5s audios are well annotated, any 1s segment should be usable. When the top-2 1s segments are mined from 5s audios, we expect the same performance as with two random 1s segments; however, this is not the case. The accuracy is about 1% worse than with the annotated data. We further test two cases, one using only mined positive data, the other using only mined negative data. The results show that hard negative samples with the highest scores in the mined negative data are preferred, while the positive samples with the highest scores are not preferred.
The performance of the data mined from the long untrimmed audios is even worse. This is understandable since some negative samples may be chosen as positive samples. However, as a backbone for an anomaly detection framework, the backbone does not need to be perfect. For example, in anomaly detection in videos, a pre-trained 3D CNN is used without training on the anomaly detection dataset [22,23]. Experiment results will be shown in the next subsection.

Anomaly Detection Results
In this subsection we test the anomaly detection framework shown in Figure 1 with the fixed BlazeNet, which has been trained in the previous subsection.
In the first experiment, we use this method on the manually annotated 5s audio files. The goal is to find the detection results on every 1s audio segment and measure the accuracy. In this experiment, since there are only five 1s segments in a 5s audio file, the number of segments in anomaly detection is set to 5, and the top-k is set to 2.

Table 4.

| Dataset | Val Acc | Test Acc |
|---|---|---|
| all 1s segments from 5s audios | 0.8868 | 0.8820 |
| random 2 1s-segments from 5s audios | 0.8784 | 0.8654 |
| mined top-2 1s-segments from 5s audios | 0.8609 | 0.8542 |
| mined positive top-2 1s-segments from 5s audios | 0.8748 | 0.8562 |
| mined negative top-2 1s-segments from 5s audios | 0.8748 | 0.8622 |
| mined top-2 1s-segments from untrimmed audios | 0.8142 | 0.8292 |

In the second experiment, long untrimmed audio files are used as the training dataset, while 5s audio files are used as validation and test datasets. After looking at the distribution of the audio file lengths, we set the number of segments in anomaly detection to 10. The top-k is still set to 2.
The experiment results are listed in Table 5. We observe that the performance is very close to that using VGGish features in Table 3, while the complexity of the BlazeNet is a lot lower than that of the VGGish network. For the BlazeNet trained at Table-4 Line-6, the performance is worse than for the one trained at Table-4 Line-1. So the quality of the BlazeNet feature does matter for the anomaly detection performance, and manually annotating some dataset for the BlazeNet is preferred.

Discussion: Anomaly Detection vs. Classification
Our goal is to find a solution for baby cry detection on embedded devices, so we ignore any results that directly use the VGGish network in inference mode. When comparing the results of BlazeNet in Table 4 and Table 5, we see that the performance of anomaly detection is already more than 2% better than that of the standalone BlazeNet.
Furthermore, we note that we only use the accuracy at the default threshold = 0.5 in all these experiments (we use this threshold because it is predominantly used in the training of a binary classifier). So we do some analysis in terms of the max F1 score and the ROC curve.
We collect the prediction scores of all 1s audio segments in the test dataset, then calculate the max F1 score and the corresponding threshold. Finally we calculate the test accuracy at this threshold. The results are listed in Table 6. It is observed that for the BlazeNet, since it is a binary classifier, the threshold to achieve the max F1 is near 0.5, and the test accuracy at this threshold is almost identical to the one in Table 4. For anomaly detection, due to the nature of the MIL ranking loss, the threshold to achieve the max F1 is pushed toward 1.0.
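The threshold search described above can be sketched in pure Python: every distinct prediction score is tried as a threshold, the F1 score is computed, and the accuracy is then evaluated at the best threshold. The function names are hypothetical, and a segment is assumed to be predicted as cry when its score is greater than or equal to the threshold.

```python
def f1_at(scores, labels, thr):
    """F1 score when segments with score >= thr are predicted as cry."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def max_f1_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 over all distinct scores."""
    best_thr, best_f1 = None, -1.0
    for thr in sorted(set(scores)):
        f1 = f1_at(scores, labels, thr)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


def accuracy_at(scores, labels, thr):
    """Fraction of segments classified correctly at the given threshold."""
    correct = sum(1 for s, y in zip(scores, labels) if (s >= thr) == (y == 1))
    return correct / len(labels)
```

For a toy set of scores `[0.1, 0.4, 0.35, 0.8]` with labels `[0, 0, 1, 1]`, the best threshold is 0.35 with F1 = 0.8, and the accuracy at that threshold is 0.75.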
The ROC curves of the same three cases in Table 6 are plotted in Figure 3. The gain from the anomaly detection is evident, as is the improvement in ROC AUC.
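For reference, the ROC AUC can be computed directly from the prediction scores without plotting the curve, using the fact that it equals the probability that a randomly chosen cry segment scores higher than a randomly chosen non-cry segment (ties count as one half). The sketch below is a simple quadratic-time illustration with a hypothetical function name, not the evaluation code used for Figure 3.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of cry and non-cry segments.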

Conclusion
In this paper, we study baby cry detection in audios. We design a super lightweight BlazeNet for baby cry/non-cry classification. On top of that, we propose to use anomaly detection to do the same task. Experiment results show that the anomaly detection can achieve better performance than the standalone BlazeNet with only a little extra complexity.

Figure 1. Anomaly detection of baby cry block diagram. The log-Mel-spectrogram is not shown.
In the feature-magnitude ranking loss of RTFM [23], $X_a^{\text{top-}k}$ and $X_n^{\text{top-}k}$ are the top-k segments whose feature magnitudes are the k largest in the abnormal and normal bags, $f$ is the feature magnitude function, $m$ is a pre-defined margin, and $d$ is the distance function between the two sets of top-k features. In their implementation, this function is simply the square of the mean of the top-k feature magnitudes.

Figure 2. Pipeline of data mining and anomaly detection for baby cry detection.

Figure 3. ROC curves of the standalone BlazeNet and the anomaly detection.

Table 1. Layers of our BlazeNet. The BlazeNet parameters are (number of input channels, number of output channels, kernel size, stride). The sequence of layers is from top to bottom, from left to right.

Table 4. Performance of BlazeNet as a standalone baby cry classifier. Accuracy is measured at default classification threshold = 0.5.

Table 5. Performance of anomaly detection of baby cry using BlazeNet features. Two trained BlazeNet backbones from Table 4 are used. Accuracy is measured at default classification threshold = 0.5.

Table 6. Performance comparison of standalone BlazeNet and anomaly detection with BlazeNet backbone from Table-4 Line-1. Test accuracy is at the threshold where F1-max is achieved.