Researchers at Facebook have developed a method to isolate as many as five voices speaking into a microphone at once. The process is described in a 2020 International Conference on Machine Learning (ICML) paper.
Facebook contends that its AI outperforms current best-in-class technologies on several key speech-source separation metrics. While improved speech separation is crucial for building better video tools and voice-messaging applications, the techniques in Facebook’s paper can also tackle tasks such as background noise suppression, which could prove useful in music recording.
Facebook’s artificial intelligence model is built on a novel recurrent neural network, a class of algorithm that maintains an internal state, functioning like memory, to process inputs of varying lengths. The model uses an encoder network to map raw audio waveforms to what machine-learning practitioners call a latent representation.
Next, a voice separation network converts the data into approximate audio signals for each voice.
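The encode-separate-decode pipeline described above can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the paper’s actual architecture: the learned encoder, separation network, and decoder are stood in for by fixed linear maps, and all shapes (frame length, latent dimension) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(waveform, basis):
    # Map raw waveform frames to a latent representation
    # (learned by the encoder network in the real system).
    frames = waveform.reshape(-1, basis.shape[0])   # (n_frames, frame_len)
    return frames @ basis                           # (n_frames, latent_dim)

def separate(latent, n_speakers, weights):
    # A separation network would emit one latent estimate per speaker;
    # here fixed linear maps stand in for the learned network.
    return [latent @ w for w in weights[:n_speakers]]

def decode(latents, basis_inv):
    # Map each speaker's latent estimate back to an audio waveform.
    return [(z @ basis_inv).reshape(-1) for z in latents]

# Hypothetical sizes, chosen only for the sketch.
frame_len, latent_dim, n_speakers = 16, 32, 2
basis = rng.standard_normal((frame_len, latent_dim))
basis_inv = np.linalg.pinv(basis)
weights = [rng.standard_normal((latent_dim, latent_dim)) for _ in range(5)]

mixture = rng.standard_normal(1600)                 # a 1600-sample mixed signal
latent = encode(mixture, basis)
estimates = decode(separate(latent, n_speakers, weights), basis_inv)
print(len(estimates), estimates[0].shape)           # one waveform per speaker
```

The point of the structure, mirrored from the article, is that separation happens in the latent space rather than directly on the waveform.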
To operate, the system must know how many voices are participating in the conversation. According to the paper, a dedicated subsystem estimates the number of speakers so that the correct speech model can be selected.
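One simple way to realize that kind of model selection is a classifier that estimates the speaker count, plus a lookup of a model trained for that count. The sketch below is hypothetical; `count_classifier` and the per-count models stand in for the trained networks the paper describes.

```python
def select_model(mixture, count_classifier, models):
    # Estimate how many speakers are active, then pick the matching model.
    n = count_classifier(mixture)
    if n not in models:
        raise ValueError(f"no model trained for {n} speakers")
    return n, models[n]

# Toy stand-ins: one "model" per supported count (2 to 5 speakers, as in
# the paper), and a classifier that pretends it detected three speakers.
models = {k: (lambda mix, k=k: [mix] * k) for k in range(2, 6)}
n, model = select_model([0.1, -0.2], lambda mix: 3, models)
voices = model([0.1, -0.2])
print(n, len(voices))  # the chosen model emits one stream per speaker
```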
The paper describes how researchers successfully trained models that can separate two to five speakers. As far as applications go, Facebook’s AI has the potential to improve hearing aid audio quality. People equipped with this technology would be able to isolate voices in environments like bars and parties that have excessive background noise.
Facebook’s paper comes on the heels of Google’s recent proposal to use MixIT (mixture invariant training) for similar applications. MixIT takes an unsupervised approach to isolating, separating, and enhancing voices when several speakers appear in a single audio recording.
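The core idea behind MixIT is that a model trained on a mixture of two mixtures never needs clean reference signals: its estimated sources only have to re-assemble the two input mixtures under the best possible assignment. The brute-force numpy sketch below illustrates that loss for a handful of sources; the real method optimizes a neural network against this objective rather than computing it in isolation.

```python
import itertools
import numpy as np

def mixit_loss(sources, mix1, mix2):
    # MixIT objective: assign each estimated source to one of the two
    # input mixtures and score the best assignment (unsupervised; no
    # clean references are needed).
    best = np.inf
    for assign in itertools.product([0, 1], repeat=len(sources)):
        est1 = sum(s for s, a in zip(sources, assign) if a == 0)
        est2 = sum(s for s, a in zip(sources, assign) if a == 1)
        err = np.mean((est1 - mix1) ** 2) + np.mean((est2 - mix2) ** 2)
        best = min(best, err)
    return best

# If the estimated sources are exact, some assignment rebuilds both
# mixtures perfectly and the loss drops to zero.
rng = np.random.default_rng(1)
s1, s2, s3 = (rng.standard_normal(8) for _ in range(3))
loss = mixit_loss([s1, s2, s3], s1 + s2, s3)
print(loss)  # near zero for a perfect separation
```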