Speaker verification using d-vectors
Speaker verification techniques are of great importance for security in the authentication of people. Built on voice data, their performance as an authentication tool has improved with deep learning, in particular through d-vectors. Among the benefits of d-vectors, there is no need to retrain models when testing with speakers that were not present in the training databases. In this context, we noted the need to compare models that employ similar techniques in scenarios where the training data does not come from the same source as the test data, a realistic problem in which a model must be chosen even though no training and test data with the same characteristics are available. The comparison was made between the SincNet and GE2E models under varying training conditions.
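The independence from speakers seen during training follows from how d-vectors are used at test time: enrollment and test utterances are mapped to fixed-dimensional embeddings and compared directly, so no speaker-specific retraining is needed. The sketch below illustrates this, assuming a hypothetical pretrained extractor that returns one d-vector per utterance; the function names and the decision threshold are illustrative, not the actual API of either model.

```python
# Minimal sketch of d-vector verification. The d-vectors are assumed to
# come from a pretrained embedding network (e.g. a hypothetical
# `embed(utterance)` call); the threshold value is illustrative only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two d-vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(dvectors: list[np.ndarray]) -> np.ndarray:
    """Speaker model: the centroid of the enrollment d-vectors."""
    return np.mean(np.stack(dvectors), axis=0)

def verify(test_dvector: np.ndarray, speaker_centroid: np.ndarray,
           threshold: float = 0.7) -> bool:
    """Accept the identity claim if the similarity to the enrolled
    centroid exceeds the threshold; no retraining is involved."""
    return cosine_similarity(test_dvector, speaker_centroid) >= threshold
```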
Therefore, the present work proposes new ways to train speaker verification models with SincNet and GE2E using data augmentation with urban and white noise. This paper also proposes a combination of the two, called SincNet + GE2E, in which the SincNet network is adapted to be trained with the GE2E loss function.
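The following PyTorch sketch shows, under our own assumptions, what such a training setup could look like: waveforms are augmented with additive white noise at a target SNR and encoded by a SincNet-style network trained with the GE2E softmax loss. The encoder `sincnet` and the learnable scale/offset `w` and `b` are placeholders, and the original GE2E formulation also excludes each utterance from its own speaker centroid, which is omitted here for brevity.

```python
# Hedged sketch: white-noise augmentation plus the GE2E softmax loss.
import torch
import torch.nn.functional as F

def add_white_noise(wave: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix Gaussian white noise into a waveform at the given SNR (dB)."""
    signal_power = wave.pow(2).mean()
    noise = torch.randn_like(wave)
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

def ge2e_softmax_loss(emb: torch.Tensor,
                      w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """emb: (n_speakers, n_utterances, dim) batch of d-vectors."""
    n_spk, n_utt, _ = emb.shape
    emb = F.normalize(emb, dim=-1)
    centroids = F.normalize(emb.mean(dim=1), dim=-1)            # (n_spk, dim)
    # Scaled cosine similarity of every utterance to every centroid.
    sim = w * torch.einsum('jid,kd->jik', emb, centroids) + b   # (n_spk, n_utt, n_spk)
    labels = torch.arange(n_spk).unsqueeze(1).expand(n_spk, n_utt)
    return F.cross_entropy(sim.reshape(-1, n_spk), labels.reshape(-1))

# One hypothetical training step on a (n_spk, n_utt, samples) batch `waves`,
# with `sincnet` standing in for the waveform-to-d-vector encoder:
#   waves = add_white_noise(waves, snr_db=10.0)
#   emb = sincnet(waves.reshape(-1, waves.shape[-1])).reshape(n_spk, n_utt, -1)
#   loss = ge2e_softmax_loss(emb, w, b)
```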
The obtained results show that the standard SincNet performed best overall. The proposed SincNet + GE2E outperformed GE2E but did not surpass the standard SincNet. In addition to its better results, SincNet + GE2E is less dependent on the data source for generalization than GE2E.