Before answering these questions, we first need to provide a little background on biometrics. The term “biometrics” generally refers to the measurement and analysis of unique biological or behavioral characteristics. Biometric technologies are increasingly used for identity verification, with many companies now using biometrics such as fingerprints, voiceprints, and facial recognition to help confirm the identities of their customers.
So, what exactly is a voiceprint? A voiceprint is a digital model of the unique vocal characteristics of an individual. Voiceprints are created by specialized computer programs which process speech samples, typically in WAV format. The creation of a voiceprint is often referred to as “enrollment” in a biometric system. There are two general approaches to the creation and use of voiceprints.
In traditional voice biometric systems that use classical machine learning algorithms, a voiceprint is created by performing “feature extraction” on one or more speech samples. Feature extraction converts each sample into numerical vectors that capture the attributes that make the user’s speech unique. In these systems, feature extraction is also used to create a Universal Background Model, or “UBM”.
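As a toy illustration of the frame-then-summarize pattern behind feature extraction, consider the sketch below. It is not any production algorithm: real systems typically compute richer features such as mel-frequency cepstral coefficients (MFCCs), and the function name, frame sizes, and the two toy features (log energy and zero-crossing rate) are all hypothetical choices for illustration.

```python
import math

def extract_features(samples, frame_size=400, hop=160):
    """Toy feature extraction: slide a window over the signal and
    summarize each frame with log energy and zero-crossing rate.
    Real systems use richer features (e.g., MFCCs); this only
    illustrates the frame-then-summarize pattern."""
    features = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        log_energy = math.log(energy + 1e-10)
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if a * b < 0
        ) / frame_size
        features.append((log_energy, zero_crossings))
    return features

# A synthetic one-second "speech sample": a 220 Hz sine at 16 kHz
signal = [0.1 * math.sin(2 * math.pi * 220 * t / 16000)
          for t in range(16000)]
feats = extract_features(signal)
```

The resulting list of per-frame vectors stands in for the “personalized calculations or vectors” that a voiceprint is built from.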
UBMs are created from many speech samples collected from many representative users of the system – they can be thought of as a “super voiceprint”. To check a person’s identity, a process called “verification” is used: a new speech sample is scored against both the individual’s personal voiceprint and the UBM. The difference between the two scores yields a single score, which can then be interpreted as “passing” or “failing,” depending on the desired confidence for the usage scenario.
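The scoring step above can be sketched as a log-likelihood ratio between the speaker model and the UBM. This is a deliberately simplified sketch: real systems model features with multi-dimensional Gaussian mixture models, whereas here each model is a single 1-D Gaussian, and the function names, toy model parameters, and threshold are illustrative assumptions.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a 1-D Gaussian -- a stand-in for the Gaussian
    mixture models real systems use."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def verify(features, speaker_model, ubm_model, threshold=0.0):
    """Average log-likelihood ratio between the claimed speaker's
    voiceprint and the UBM; positive scores favor the speaker."""
    score = sum(
        log_gaussian(x, *speaker_model) - log_gaussian(x, *ubm_model)
        for x in features
    ) / len(features)
    return score, score >= threshold

# Toy models: (mean, variance) of a single feature dimension
speaker = (2.0, 1.0)   # enrolled speaker's voiceprint
ubm = (0.0, 4.0)       # population-wide background model

score, accepted = verify([1.8, 2.2, 1.9], speaker, ubm)
```

Raising the threshold trades convenience for security, which is the knob behind “desired confidence for the usage scenario.”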
Newer voice biometric systems use deep neural networks (DNNs) and are commonly referred to as “deep learning” approaches. These systems start by creating a composite DNN model, which is conceptually similar to a UBM. This model is derived by processing representative speech samples – often hundreds of hours of them.
To create an individual voiceprint, users provide one or more enrollment speech samples, and the DNN is fine-tuned to learn the individual’s unique speech characteristics. The DNN modeling process operates directly on the speech samples (i.e., raw WAV files) – no separate feature extraction step is needed. To verify a user, a speech sample is evaluated against the fine-tuned DNN model to produce a score, which can again be interpreted as “passing” or “failing,” depending on the desired confidence for the usage scenario.
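One common formulation of DNN-based verification (a variation on the fine-tuning approach described above) has the network map each sample to a fixed-length speaker embedding, so that verification reduces to a similarity comparison between the enrolled and probe embeddings. Assuming such embeddings are available, a minimal sketch looks like this; the embedding values, threshold, and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify_embedding(enrolled, probe, threshold=0.7):
    """Accept the claim if the probe embedding is close enough to the
    enrolled voiceprint embedding."""
    score = cosine_similarity(enrolled, probe)
    return score, score >= threshold

enrolled_voiceprint = [0.9, 0.1, 0.4]  # would come from the DNN at enrollment
probe_embedding = [0.8, 0.2, 0.5]      # DNN output for the new sample
score, accepted = verify_embedding(enrolled_voiceprint, probe_embedding)
```

Real embeddings have hundreds of dimensions; three are used here only to keep the sketch readable.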
DNN approaches are considered the current state of the art. Regardless of the voiceprint model used, however, the primary purpose of a voiceprint is to help confirm someone’s identity. Voiceprints can also be used more broadly to “identify” someone. Identification is similar to the verification process outlined above, except that instead of comparing one speech sample to one voiceprint, the speech sample is compared to many voiceprints. The output of an identification process is often the “best match,” which can be used to help establish someone’s identity when no other supporting information is available, or to help detect whether a known fraudster is trying to get into a system.
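Identification extends the same comparison from one voiceprint to many: score the probe against every enrolled voiceprint and keep the best match. A minimal sketch, assuming embedding-style voiceprints and cosine similarity (the gallery names, vectors, and threshold are made up):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

def identify(probe, enrolled_voiceprints, threshold=0.7):
    """Score the probe against every enrolled voiceprint and return the
    best match, or None if no score clears the threshold."""
    best_id, best_score = None, threshold
    for speaker_id, voiceprint in enrolled_voiceprints.items():
        score = cosine_similarity(probe, voiceprint)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id, best_score

gallery = {
    "alice": [0.9, 0.1, 0.4],
    "bob":   [-0.2, 0.8, 0.3],
    "carol": [0.1, -0.5, 0.9],
}
match, score = identify([0.8, 0.2, 0.5], gallery)
```

Returning `None` when nothing clears the threshold matters in practice: in fraud detection, “no known fraudster matched” is as useful an answer as a best match.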
In the next article, we’ll look at voiceprint security.
Check out our recent white paper on voice biometrics to learn more.