A study on the robustness of speaker verification techniques using D-vectors.
Speaker verification is one of the main tasks in speaker recognition, realized through the principal means of human communication: the voice. As a form of biometrics, speaker verification stands out mainly for its non-intrusive and widely available form of data collection. This kind of technology is extremely important for access control, security, and forensic analysis in general.
Speaker verification consists of confirming, or rejecting, the claimed identity of a speaker based on the comparison of features extracted from different voice segments. Thanks to major technological advances, both in infrastructure and in modeling, the application of deep neural networks to this problem has grown considerably. One of these techniques is based on \emph{d-vectors}: fixed-size representations of the voice, extracted from the layers prior to the output layer of a deep neural network, which have become the state of the art for this problem.
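To make the notion concrete, the sketch below (in PyTorch) shows one common way a d-vector can be obtained: a network is trained to classify speakers, and at inference time the activations of the last hidden layer, averaged over the frames of an utterance, serve as the fixed-size embedding. The \texttt{SpeakerNet} architecture and its dimensions are hypothetical placeholders for illustration, not the networks evaluated in this work.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerNet(nn.Module):
    """Hypothetical classifier, used only to illustrate d-vector
    extraction from the layer prior to the output layer."""
    def __init__(self, n_mels=40, hidden=256, n_speakers=1000):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # The output layer is only used during classification training.
        self.output = nn.Linear(hidden, n_speakers)

    def forward(self, frames):
        return self.output(self.hidden_layers(frames))

    def d_vector(self, frames):
        # d-vector: last hidden activations, averaged over all frames
        # of the utterance and L2-normalized.
        with torch.no_grad():
            h = self.hidden_layers(frames)   # (n_frames, hidden)
            return F.normalize(h.mean(dim=0), dim=0)
\end{verbatim}

Under this scheme, verification reduces to comparing the cosine similarity between two d-vectors against a decision threshold.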
Unlike neural networks trained for speaker classification, d-vectors make it possible to verify speakers that do not appear in the training dataset. In this context, a need was identified to compare these techniques in a standardized way, in scenarios where no training data from the same source as the test set is available. This represents a real problem: a model must be chosen even though there are no training and testing data with the same characteristics. Therefore, three experiments were conducted. The first compared the SincNet and GE2E models trained on different training datasets, including data augmentation techniques, and tested on the same dataset. The second used one combination of training and test data from the first experiment, this time adding two more models: a ResNet with a triplet loss cost function, and the model proposed in this work, called SincNet + GE2E, an adaptation of SincNet trained with the GE2E cost function. The third and final experiment used the models already trained in the previous experiments with the same single training dataset, this time tested on several other test datasets drawn from different data sources.
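For reference, the GE2E (generalized end-to-end) cost function scores each utterance embedding against every speaker centroid in the batch and applies a softmax loss that pulls the embedding toward its own centroid, computed without the utterance itself. The sketch below is a simplified PyTorch rendering of that formulation, under assumed tensor shapes; it is not the exact training code of the compared models.

\begin{verbatim}
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(emb, w, b):
    # emb: (n_speakers, n_utterances, dim) L2-normalized d-vectors.
    # w, b: learned scalar scale and bias (w kept positive in training).
    n_spk, n_utt, _ = emb.shape
    centroids = emb.mean(dim=1)                      # one centroid per speaker
    # Exclusive centroid: when scoring an utterance against its own
    # speaker, that utterance is left out of the centroid.
    excl = (emb.sum(dim=1, keepdim=True) - emb) / (n_utt - 1)
    sim = w * F.cosine_similarity(emb.unsqueeze(2),
                                  centroids.view(1, 1, n_spk, -1),
                                  dim=-1) + b        # similarity matrix
    own = w * F.cosine_similarity(emb, excl, dim=-1) + b
    idx = torch.arange(n_spk)
    sim[idx, :, idx] = own                           # same-speaker entries
    target = idx.repeat_interleave(n_utt)
    return F.cross_entropy(sim.reshape(-1, n_spk), target)
\end{verbatim}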
The results of the first experiment showed that SincNet outperformed the GE2E model on all training datasets. The \emph{data augmentation} strategies were much less effective at reducing the equal error rate (EER) than enlarging the training set with data from the same source as the test set. The large variation in EER attributable to modeling choices, relative to the effort of augmenting the data, motivated the second experiment, in which different modeling approaches were compared. The SincNet + GE2E model performed better than GE2E, but it outperformed neither the original SincNet nor the ResNet with a triplet loss cost function.
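Since all experiments are reported in terms of EER, the helper below shows how the metric can be approximated from trial scores: the EER is the error rate at the decision threshold where the false-acceptance rate (FAR) equals the false-rejection rate (FRR). The function is an illustrative NumPy approximation, not the evaluation script used in the experiments.

\begin{verbatim}
import numpy as np

def equal_error_rate(scores, labels):
    # scores: similarity per trial (higher = more likely same speaker)
    # labels: 1 for same-speaker trials, 0 for impostor trials
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores)                 # sweep threshold high -> low
    labels = labels[order]
    n_pos = (labels == 1).sum()
    n_neg = (labels == 0).sum()
    far = np.cumsum(labels == 0) / n_neg        # impostors accepted so far
    frr = 1.0 - np.cumsum(labels == 1) / n_pos  # targets still rejected
    i = np.argmin(np.abs(far - frr))            # closest FAR/FRR crossing
    return 0.5 * (far[i] + frr[i])
\end{verbatim}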
In the third experiment, SincNet remained the best model in most comparisons, although under one of the test conditions the ResNet-based model performed better. This result indicates that, among all the models and databases evaluated in this work, and even in the presence of the variability inherent to voice data, SincNet is the best choice in most scenarios.