Cross-lingual Voice Conversion

Voice conversion (VC) modifies the speech of one speaker (source) to make it sound as if it is spoken by another speaker (target). Cross-lingual VC is a special case where source and target speakers speak different languages.


Monolingual PPG VC (Baseline)

A Phonetic PosteriorGram (PPG) is a time-versus-class matrix that represents the posterior probabilities of phonetic classes for each time frame of an utterance. Because PPGs are largely speaker-independent and language-independent, they have been successfully applied to cross-lingual voice conversion [1]. The training and conversion stages are shown below:

Figure 1. Block diagram of (a) training and (b) conversion workflows of the cross-lingual VC system with monolingual PPGs.
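To make the PPG representation concrete, here is a minimal NumPy sketch of how per-frame posteriors from an acoustic model become a PPG. The function name and the toy dimensions (4 frames, 3 phonetic classes) are illustrative assumptions, not taken from the paper; in practice the posteriors come from a trained speech recognizer's output layer.

```python
import numpy as np

def compute_ppg(frame_logits):
    """Convert per-frame acoustic-model logits into a Phonetic
    PosteriorGram (PPG): a (time x phonetic-class) matrix in which
    each row is a posterior distribution over phonetic classes."""
    # Numerically stable softmax along the class axis.
    shifted = frame_logits - frame_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy example: 4 frames, 3 phonetic classes of random logits.
rng = np.random.default_rng(0)
ppg = compute_ppg(rng.normal(size=(4, 3)))
```

Each row of `ppg` sums to one, so the matrix can be read as "which phonetic class is being spoken at each frame", independent of who is speaking.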

Bilingual PPG VC (Baseline)

To capture accurate phonetic information of both source and target languages, bilingual PPGs are introduced to represent the speaker-independent features of speech signals from different languages in the same feature space. In particular, a bilingual PPG is formed by stacking two monolingual PPG vectors, which are extracted from two monolingual speech recognition systems.

Figure 2. Block diagram of (a) training and (b) conversion workflows of the cross-lingual VC system with bilingual PPGs.
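The frame-wise stacking of two monolingual PPGs described above can be sketched as follows. The function name and class counts (40 English, 60 Mandarin) are illustrative assumptions; the only requirement is that both recognizers process the same utterance with the same frame rate.

```python
import numpy as np

def bilingual_ppg(ppg_en, ppg_zh):
    """Stack two monolingual PPGs frame-by-frame into one bilingual
    PPG, so speech from either language lives in the same feature
    space."""
    assert ppg_en.shape[0] == ppg_zh.shape[0], "frame counts must match"
    return np.concatenate([ppg_en, ppg_zh], axis=1)

# Toy example: 5 frames; 40 English classes and 60 Mandarin classes
# give a 100-dimensional bilingual PPG per frame.
bi_ppg = bilingual_ppg(np.ones((5, 40)) / 40, np.ones((5, 60)) / 60)
```

Because each half remains a valid posterior distribution, each bilingual frame sums to two, one unit of probability mass per language.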

Average Modeling Approach with Bilingual PPG (Proposed)

To further enhance the quality of the converted speech, we propose an average modeling approach [2], which uses acoustic and linguistic information from many speakers in both the source and target languages to train an average model. I-vectors are additionally concatenated with the bilingual PPGs to represent speaker identities.
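The i-vector conditioning above can be sketched as a simple per-frame concatenation. This is a minimal illustration under assumed dimensions (a 100-dim bilingual PPG and a 16-dim i-vector); real i-vectors are typically a few hundred dimensions and come from a trained i-vector extractor.

```python
import numpy as np

def add_speaker_identity(bi_ppg, ivector):
    """Append a fixed-dimensional speaker i-vector to every frame of
    a bilingual PPG, forming the input features for the average
    model, which then learns a speaker-conditioned mapping."""
    tiled = np.tile(ivector, (bi_ppg.shape[0], 1))  # repeat per frame
    return np.concatenate([bi_ppg, tiled], axis=1)

# Toy example: 5 frames of a 100-dim bilingual PPG plus a
# 16-dim i-vector gives 116-dim model inputs.
features = add_speaker_identity(np.zeros((5, 100)), np.arange(16.0))
```

Swapping in a different speaker's i-vector at conversion time is what steers the average model toward the target voice.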


Speech Samples

The model is evaluated with speech from two databases:

English: VCC2018[3]

Mandarin: the average-model speech synthesis library provided by Databaker (http://www.data-baker.com/hc_pm_en.html).

Enable a Monolingual English Speaker to Speak Mandarin

Female → Female

[Audio players for Samples 1 and 2: Source, Target, Monolingual Baseline, Bilingual Baseline, and Proposed.]

Male → Male

[Audio players for Samples 1 and 2: Source, Target, Monolingual Baseline, Bilingual Baseline, and Proposed.]

Enable a Monolingual Mandarin Speaker to Speak English

Female → Female

[Audio players for Samples 1 and 2: Source, Target, Monolingual Baseline, Bilingual Baseline, and Proposed.]

Male → Male

[Audio players for Samples 1 and 2: Source, Target, Monolingual Baseline, Bilingual Baseline, and Proposed.]

References

[1] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li and Helen Meng, “Personalized, Cross-lingual TTS Using Phonetic Posteriorgrams,” in INTERSPEECH, 2016, pp. 322-326.

[2] Xiaohai Tian, Junchao Wang, Haihua Xu, Eng-Siong Chng, and Haizhou Li, “Average modeling approach to voice conversion with non-parallel data,” in Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 227–232.

[3] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” in Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2018.