VCC2020 Challenge:

Improved Methods for StarGAN-Based Voice Conversion

Shengjie Huang (黄圣杰)
Institute of Artificial Intelligence, School of Information Science and Technology, Beijing Forestry University

TEAM NAME : URSpeaking
arXiv:To be published, June 2020

URSpeakingGAN-VC

To advance research on multi-domain non-parallel voice conversion (VC), we rethink the conditional methods in StarGAN-VC and propose an improved variant called URSpeakingGAN-VC. Specifically, we rethink the conditional methods in two respects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source-domain data to be converted to target-domain data. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic features in a domain-specific manner.
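As a rough illustration of what modulation-based conditioning can look like, the sketch below normalizes each feature channel and then rescales and shifts it with parameters selected by both the source and the target speaker codes. It uses conditional instance normalization, one common way to realize such modulation; whether URSpeakingGAN-VC uses exactly this layer is an assumption, and the function names, table shapes, and random values are all hypothetical.

```python
import numpy as np


def one_hot(index, num_domains):
    """Return a one-hot speaker code of length num_domains."""
    code = np.zeros(num_domains, dtype=np.float64)
    code[index] = 1.0
    return code


def conditional_instance_norm(features, src, tgt, gamma_table, beta_table, eps=1e-5):
    """Modulate features with source-and-target-dependent scale and shift.

    features:    array of shape (channels, frames)
    gamma_table: array of shape (2 * num_domains, channels), hypothetical learned scales
    beta_table:  array of shape (2 * num_domains, channels), hypothetical learned shifts
    """
    num_domains = gamma_table.shape[0] // 2
    # Concatenate the source and target codes so that both condition the layer.
    code = np.concatenate([one_hot(src, num_domains), one_hot(tgt, num_domains)])
    gamma = code @ gamma_table  # shape (channels,)
    beta = code @ beta_table    # shape (channels,)
    # Instance normalization: per-channel statistics over the time axis.
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    normalized = (features - mean) / (std + eps)
    # Domain-specific modulation of the normalized features.
    return gamma[:, None] * normalized + beta[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_domains, channels, frames = 4, 8, 100
    feats = rng.normal(size=(channels, frames))
    gammas = rng.normal(size=(2 * num_domains, channels))
    betas = rng.normal(size=(2 * num_domains, channels))
    out = conditional_instance_norm(feats, src=0, tgt=2,
                                    gamma_table=gammas, beta_table=betas)
    print(out.shape)
```

In a trained model the scale and shift tables would be learned jointly with the generator; here they are random values chosen only to make the example run.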

Network architecture (figure not shown)

Converted speech samples

Task and dataset
  • 1st task: voice conversion within the same language
    • In training, the sentence set uttered by the source speaker differs from that uttered by the target speaker, but both are in the same language. Only a small number of sentences are shared between the two sets.
    • In conversion, the source speaker's voice is converted as if it were uttered by the target speaker, while the linguistic content is kept unchanged.
    • We will provide voices of 4 source and 4 target speakers (consisting of both female and male speakers) from fixed corpora as training data. Each speaker utters a sentence set consisting of 70 sentences in English. Only 20 sentences are parallel and the other 50 sentences are nonparallel between the source and target speakers.
    • Using these data sets, voice conversion systems for all speaker-pair combinations (16 speaker-pairs in total) will be developed by each participant.
  • 2nd task: cross-lingual voice conversion
    • In training, the sentence set uttered by the source speaker is entirely different from that uttered by the target speaker, because the two speakers use different languages.
    • In conversion, the source speaker's voice in the source language is converted as if it were uttered by the target speaker, while the linguistic content is kept unchanged.
    • We will also provide voices of 6 other target speakers (consisting of both female and male speakers) from fixed corpora as training data. The source speakers are the same as in the 1st task. Each target speaker utters another sentence set of around 70 sentences in a different language: 2 target speakers speak Finnish, 2 speak German, and 2 speak Mandarin.
    • Using these nonparallel data sets, voice conversion systems for all speaker-pair combinations (24 speaker-pairs in total) will be developed by each participant.
  • We evaluated our method on the non-parallel multi-speaker VC task.
  • To evaluate our system against StarGAN-VC, we used the Voice Conversion Challenge 2018 (VCC 2018) dataset, from which we selected a subset of speakers covering all inter- and intra-gender conversions: VCC2SF1, VCC2SF2, VCC2SM1, and VCC2SM2.
    • Each speaker has 81 sentences (about 5 minutes) for training. This is relatively little for VC.
  • Our goal is to learn 4 × 3 = 12 different source-and-target mappings using only a single model.
  • Note that we did not use any extra data, module, or time alignment procedures for training.
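
The speaker-pair counts quoted in the list above follow from simple enumeration. The sketch below reproduces them in Python. The IDs SEF2, SEM2, TEF2, and TEM2 do not appear on this page and are assumed by analogy with the IDs that do; treat them as placeholders. The cross-lingual target IDs and the VCC 2018 subset are taken from the text.

```python
from itertools import permutations, product

# VCC 2020 speakers. IDs marked "assumed" do not appear on this page
# and are illustrative placeholders.
SOURCES = ["SEF1", "SEF2", "SEM1", "SEM2"]        # SEF2, SEM2 assumed
TASK1_TARGETS = ["TEF1", "TEF2", "TEM1", "TEM2"]  # TEF2, TEM2 assumed
TASK2_TARGETS = ["TFF1", "TFM1", "TGF1", "TGM1", "TMF1", "TMM1"]

# VCC 2018 subset used for the comparison with StarGAN-VC.
VCC2018_SPEAKERS = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]

# Tasks 1 and 2: every source speaker is paired with every target speaker.
task1_pairs = list(product(SOURCES, TASK1_TARGETS))
task2_pairs = list(product(SOURCES, TASK2_TARGETS))

# VCC 2018 subset: ordered pairs of distinct speakers from the same pool.
vcc2018_mappings = list(permutations(VCC2018_SPEAKERS, 2))

print(len(task1_pairs), len(task2_pairs), len(vcc2018_mappings))
```

This prints `16 24 12`, matching the 16 same-language pairs, the 24 cross-lingual pairs, and the 4 × 3 = 12 mappings learned by a single model.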
Results
We summarize the results in three ways:

NOTE: Recommended browsers are Apple Safari, Google Chrome, or Mozilla Firefox.


1. Comparison within the same language: English

Notation
  • Source is the source speech samples.
  • Target is the target speech samples. They are non-parallel with source speech. Note that we did not use these data during training.
  • URSpeakingGAN-VC is the converted speech samples, in which our proposed method URSpeakingGAN-VC was used to convert MCEPs.

Female (VCC2020-SEF1) → Female (VCC2020-TEF1)

Source | Target | URSpeakingGAN-VC (Proposed)
Sample 1
Sample 2
Sample 3

Female (VCC2020-SEF1) → Male (VCC2020-TEM1)

Source | Target | URSpeakingGAN-VC (Proposed)
Sample 1
Sample 2
Sample 3

Male (VCC2020-SEM1) → Male (VCC2020-TEM1)

Source | Target | URSpeakingGAN-VC (Proposed)
Sample 1
Sample 2
Sample 3

Male (VCC2020-SEM1) → Female (VCC2020-TEF1)

Source | Target | URSpeakingGAN-VC (Proposed)
Sample 1
Sample 2
Sample 3

2. Cross-lingual comparison: English → Finnish, German, or Mandarin

Male (VCC2020-SEM1) → Female (Finnish-TFF1), Female (German-TGF1), or Female (Mandarin-TMF1)

Real speech | VCC2020-SEM1 | Finnish-TFF1 | German-TGF1 | Mandarin-TMF1
Reference
Converted speech | VCC2020-SEM1 (Source) | Finnish-TFF1 (Converted) | German-TGF1 (Converted) | Mandarin-TMF1 (Converted)
Sample 1
Sample 2
Sample 3

Male (VCC2020-SEM1) → Male (Finnish-TFM1), Male (German-TGM1), or Male (Mandarin-TMM1)

Real speech | VCC2020-SEM1 | Finnish-TFM1 | German-TGM1 | Mandarin-TMM1
Reference
Converted speech | VCC2020-SEM1 (Source) | Finnish-TFM1 (Converted) | German-TGM1 (Converted) | Mandarin-TMM1 (Converted)
Sample 1
Sample 2
Sample 3

3. Comparison with StarGAN-VC on VCC 2018 data (to do)



References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks. arXiv:1806.02169, June 2018 (SLT, 2018).

[2] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Speaker Odyssey, 2018.

[3] K. Kobayashi and T. Toda. sprocket: Open-Source Voice Conversion Software. Speaker Odyssey, 2018.

[4] T. Kaneko and H. Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. arXiv:1711.11293, Nov. 2017 (EUSIPCO, 2018).

[5] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. ICASSP, 2019.

[6] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. ACVAE-VC: Non-parallel Many-to-Many Voice Conversion with Auxiliary Classifier Variational Autoencoder. arXiv:1808.05092, Aug. 2018 (IEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2019).

[7] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang. Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks. Interspeech, 2017.