VCC2020 Challenge:

Improved Methods for StarGAN-Based Voice Conversion

Shengjie Huang (黄圣杰)
Institute of Artificial Intelligence, School of Information Science and Technology, Beijing Forestry University

TEAM NAME: URSpeaking
arXiv:To be published, Oct 2020

URSpeakingGAN-VC

To advance research on multi-domain non-parallel VC, we rethink the conditional methods used in StarGAN-VC [1] and propose an improved variant called URSpeakingGAN-VC. We revisit the conditioning in two respects: the training objective and the network architecture. For the former, we propose a source-and-target conditional adversarial loss that encourages all source domain data to be convertible into the target domain. For the latter, we introduce a modulation-based conditional method that transforms the modulation of the acoustic features in a domain-specific manner. A code sketch of both ideas is given below.
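To make the two ideas concrete, here is a minimal PyTorch sketch, not our released implementation: the adversarial loss follows the common LSGAN formulation, the modulation layer is a conditional instance normalization, and all names (G, D, x_src, x_tgt_real, code_dim) are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalInstanceNorm1d(nn.Module):
    """Modulation-based conditioning: instance-normalize the features,
    then rescale/shift them with parameters predicted from the domain
    code, so the modulation becomes domain-specific."""
    def __init__(self, num_features: int, code_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(num_features, affine=False)
        self.scale = nn.Linear(code_dim, num_features)
        self.shift = nn.Linear(code_dim, num_features)

    def forward(self, x, code):
        # x: (batch, channels, frames); code: (batch, code_dim)
        h = self.norm(x)
        g = self.scale(code).unsqueeze(-1)  # (batch, channels, 1)
        b = self.shift(code).unsqueeze(-1)
        return (1.0 + g) * h + b

def st_adv_losses(D, G, x_src, x_tgt_real, c_src, c_tgt):
    """Source-and-target conditional adversarial loss (LSGAN form).
    The discriminator sees both the source and the target code, so the
    generator is judged per (source, target) pair, which is what pushes
    every source domain toward the target domain."""
    x_fake = G(x_src, c_src, c_tgt)
    d_real = D(x_tgt_real, c_src, c_tgt)        # real target-domain speech
    d_fake = D(x_fake.detach(), c_src, c_tgt)   # converted speech, same pair
    d_loss = (F.mse_loss(d_real, torch.ones_like(d_real)) +
              F.mse_loss(d_fake, torch.zeros_like(d_fake)))
    g_out = D(x_fake, c_src, c_tgt)
    g_loss = F.mse_loss(g_out, torch.ones_like(g_out))
    return d_loss, g_loss
```

In this sketch the generator output is detached for the discriminator step, and the modulation layer would replace plain instance normalization inside G.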

Network architecture (figure)

Converted speech samples

Task and dataset
  • 1st task: voice conversion within the same language
    • In training, the sentence set uttered by the source speaker differs from that uttered by the target speaker, but both are in the same language; only a small number of sentences are shared between the two sets.
    • In conversion, the source speaker's voice is converted so that it sounds as if it were uttered by the target speaker, while the linguistic content is kept unchanged.
    • The challenge provides voices of 4 source and 4 target speakers (both female and male) from fixed corpora as training data. Each speaker utters a set of 70 English sentences; only 20 sentences are parallel, and the other 50 are nonparallel between the source and target speakers.
    • Using these data sets, each participant develops voice conversion systems for all speaker-pair combinations (16 speaker pairs in total).
  • 2nd task: cross-lingual voice conversion
    • In training, the sentence set uttered by the source speaker is entirely different from that uttered by the target speaker, because the two speakers speak different languages.
    • In conversion, the source speaker's voice in the source language is converted so that it sounds as if it were uttered by the target speaker, while the linguistic content is kept unchanged.
    • The challenge also provides voices of 6 additional target speakers (both female and male) from fixed corpora as training data; the source speakers are the same as in the 1st task. Each target speaker utters a different sentence set of around 70 sentences in another language: 2 target speakers in Finnish, 2 in German, and 2 in Mandarin.
    • Using these nonparallel data sets, each participant develops voice conversion systems for all speaker-pair combinations (24 speaker pairs in total).
  • We evaluated our method on the non-parallel multi-speaker VC task.
  • To compare our system with StarGAN-VC [1], we used the Voice Conversion Challenge 2018 (VCC 2018) dataset [2] and selected a subset of speakers covering all inter- and intra-gender conversions: VCC2SF1, VCC2SF2, VCC2SM1, and VCC2SM2.
    • Each speaker has 81 sentences (about 5 minutes of speech) for training, which is relatively little data for VC.
  • Our goal is to learn all 4 × 3 = 12 source-and-target mappings using only a single model (see the sketch after this list).
  • Note that we did not use any extra data, modules, or time-alignment procedures for training.
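For concreteness, the sketch below (an assumption for illustration, not our training code) enumerates the 12 ordered speaker pairs and the one-hot source/target codes that a single conditional generator would consume.

```python
from itertools import permutations

SPEAKERS = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]

def one_hot(index, size=len(SPEAKERS)):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Every ordered (source, target) pair of distinct speakers.
mappings = list(permutations(range(len(SPEAKERS)), 2))
assert len(mappings) == 12  # 4 x 3 source-and-target mappings

for src, tgt in mappings:
    c_src, c_tgt = one_hot(src), one_hot(tgt)
    # A single generator G(x, c_src, c_tgt) (hypothetical name) serves
    # all pairs; no pair-specific model is trained.
    print(f"{SPEAKERS[src]} -> {SPEAKERS[tgt]}")
```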
Results
We summarize the results in three ways:

NOTE: Recommended browsers are Apple Safari, Google Chrome, or Mozilla Firefox.


1. Comparison within the same language: English

Notation
  • Source denotes the source speech samples.
  • Target denotes the target speech samples. They are nonparallel with the source speech; note that we did not use these data during training.
  • URSpeakingGAN-VC denotes the converted speech samples, produced by converting MCEPs with our proposed URSpeakingGAN-VC (a sketch of a typical conversion pipeline follows this list).
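Since only the MCEPs are converted, the remaining features follow the usual WORLD analysis/synthesis recipe. Below is a hedged sketch of such a pipeline with pyworld and pysptk; it shows a typical setup rather than necessarily our exact pipeline, and `model`, `order`, and `alpha` are illustrative placeholders.

```python
import numpy as np
import pyworld
import pysptk

def convert(wav, fs, model, c_src, c_tgt, order=34, alpha=0.42):
    """alpha is the frequency-warping constant and depends on fs."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs)            # F0 contour
    sp = pyworld.cheaptrick(wav, f0, t, fs)     # spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs)            # aperiodicity
    mcep = pysptk.sp2mc(sp, order, alpha)       # envelope -> MCEPs
    mcep_conv = model(mcep, c_src, c_tgt)       # only the MCEPs are converted
    fftlen = (sp.shape[1] - 1) * 2
    sp_conv = np.ascontiguousarray(pysptk.mc2sp(mcep_conv, alpha, fftlen))
    # F0 is typically transformed separately by a log-domain linear
    # transform using source/target statistics; aperiodicity is reused.
    return pyworld.synthesize(f0, sp_conv, ap, fs)
```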

Female (VCC2020-SEF1) → Female (VCC2020-TEF1)

         | Source  | Target  | URSpeakingGAN-VC (Proposed)
Sample 1 | (audio) | (audio) | (audio)
Sample 2 | (audio) | (audio) | (audio)
Sample 3 | (audio) | (audio) | (audio)

Female (VCC2020-SEF1) → Male (VCC2020-TEM1)

         | Source  | Target  | URSpeakingGAN-VC (Proposed)
Sample 1 | (audio) | (audio) | (audio)
Sample 2 | (audio) | (audio) | (audio)
Sample 3 | (audio) | (audio) | (audio)

Male (VCC2020-SEM1) → Male (VCC2020-TEM1)

         | Source  | Target  | URSpeakingGAN-VC (Proposed)
Sample 1 | (audio) | (audio) | (audio)
Sample 2 | (audio) | (audio) | (audio)
Sample 3 | (audio) | (audio) | (audio)

Male (VCC2020-SEM1) → Female (VCC2020-TEF1)

         | Source  | Target  | URSpeakingGAN-VC (Proposed)
Sample 1 | (audio) | (audio) | (audio)
Sample 2 | (audio) | (audio) | (audio)
Sample 3 | (audio) | (audio) | (audio)

2. Cross-lingual comparison: English → Finnish, German, or Mandarin

Male (VCC2020-SEM1) → Female (Finnish-TFF1), Female (German-TGF1), or Female (Mandarin-TMF1)

Real speech      | VCC2020-SEM1 | Finnish-TFF1 | German-TGF1 | Mandarin-TMF1
Reference        | (audio)      | (audio)      | (audio)     | (audio)

Converted speech | VCC2020-SEM1 (Source) | Finnish-TFF1 (Converted) | German-TGF1 (Converted) | Mandarin-TMF1 (Converted)
Sample 1         | (audio)               | (audio)                  | (audio)                 | (audio)
Sample 2         | (audio)               | (audio)                  | (audio)                 | (audio)
Sample 3         | (audio)               | (audio)                  | (audio)                 | (audio)

Male (VCC2020-SEM1) → Male (Finnish-TFM1), Male (German-TGM1), or Male (Mandarin-TMM1)

Real speech      | VCC2020-SEM1 | Finnish-TFM1 | German-TGM1 | Mandarin-TMM1
Reference        | (audio)      | (audio)      | (audio)     | (audio)

Converted speech | VCC2020-SEM1 (Source) | Finnish-TFM1 (Converted) | German-TGM1 (Converted) | Mandarin-TMM1 (Converted)
Sample 1         | (audio)               | (audio)                  | (audio)                 | (audio)
Sample 2         | (audio)               | (audio)                  | (audio)                 | (audio)
Sample 3         | (audio)               | (audio)                  | (audio)                 | (audio)

3. Comparison with StarGAN-VC [1] on the VCC 2018 data (To Do)



References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks. arXiv:1806.02169, June 2018 (SLT, 2018). [Paper] [Project]

[2] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Speaker Odyssey, 2018. [Paper] [Dataset]

[3] K. Kobayashi and T. Toda. sprocket: Open-Source Voice Conversion Software. Speaker Odyssey, 2018. [Paper] [Project] [Samples (zip)]

[4] T. Kaneko and H. Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. arXiv:1711.11293, Nov. 2017 (EUSIPCO, 2018). [Paper] [Project]

[5] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. ICASSP, 2019. [Paper] [Project]

[6] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. ACVAE-VC: Non-parallel Many-to-Many Voice Conversion with Auxiliary Classifier Variational Autoencoder. arXiv:1808.05092, Aug. 2018. (IEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2019). [Paper] [Project]

[7] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang. Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks. Interspeech, 2017. [Paper] [Project]