EmoVoice: Transformation of Speech Emotions

From HLT@INESC-ID

The following demos are available:

Utterance

You can either select an utterance from our speech database or upload your own. An uploaded file must be a WAV file (".wav" extension). The files available for selection were taken from the Arctic database. The computations for these files have already been performed, so the results are returned faster than for an uploaded speech file.

Speech Parameters Computations

In this section you can obtain text files with the pitchmarks, the pitch contour or the Waves transcription of a given speech file. The residual signal can also be computed from the speech signal and downloaded.

Pitchmarks

Pitchmarks correspond to the instants of glottal closure in a laryngograph waveform (see Figure 1). We use the pitchmark detector from the Entropic (ESPS) tools. Our speech transformation techniques are pitch-synchronous, so they depend on the robustness of the pitch marking algorithm. The file with the computed pitchmarks has two columns. The first contains the time instants of the pitchmarks; the second has the same length as the first and is filled with the character "1" (meaning that the pitchmarks correspond to voiced regions only). You can manually correct the pitchmarks in the downloaded file and upload the new file in sections 2 and 3 for speech transformations. To correct the pitchmarks, compute the residual signal and open it together with the pitchmarks transcription in suitable software such as WaveSurfer.
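When correcting pitchmarks by hand, it can help to check the edited file before uploading it. The following is a minimal sketch of a reader and writer for the two-column layout described above. It assumes whitespace-separated columns, times in seconds and no header lines; none of these details are confirmed by our tools, and the function names are hypothetical.

```python
def parse_pitchmarks(text):
    """Parse two-column pitchmark text into a list of time instants.

    Assumed layout (not confirmed by the tool): "<time> 1" per line,
    whitespace-separated, times in seconds, no header.
    """
    times = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        if fields[1] != "1":
            raise ValueError(f"unexpected second column: {fields[1]!r}")
        times.append(float(fields[0]))
    # Manual edits can easily leave marks out of order, so check for that.
    if any(b <= a for a, b in zip(times, times[1:])):
        raise ValueError("pitchmark times must be strictly increasing")
    return times


def format_pitchmarks(times):
    """Write the time instants back in the same two-column layout."""
    return "".join(f"{t:.6f} 1\n" for t in times)
```

For example, `format_pitchmarks(parse_pitchmarks(text))` rewrites a corrected file in a uniform layout and fails early if two marks were swapped during editing.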

Pitch contour

The pitch contour is derived from the pitchmarks. F0 values are estimated from the time interval between successive pitchmarks in the voiced regions (F0 is the inverse of that interval). Thus, the number of F0 points is equal to the number of voiced pitchmarks. In the output file the first column contains the time instants and the second column the corresponding F0 values. You can modify the computed pitch contour and use it in section 2 as the target pitch contour to transform the pitch of the speech signal. Tools such as WaveSurfer or Praat can open a speech file together with its pitch contour, which makes the contour easy to edit.
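The relation between pitchmarks and F0 can be sketched as follows. This is an illustration only: it takes F0 as the inverse of the interval between successive pitchmarks, assigns each value to the later of the two marks, and skips intervals longer than a hypothetical `max_period` so that gaps between voiced regions do not produce spurious low F0 values. How our tools treat the first pitchmark of each voiced region is not described here, so this sketch yields one point fewer per region.

```python
def f0_from_pitchmarks(times, max_period=0.02):
    """Return (time, f0) pairs from pitchmark instants given in seconds.

    max_period is a hypothetical threshold (0.02 s, i.e. 50 Hz): longer
    intervals are treated as gaps between voiced regions and skipped.
    """
    contour = []
    for previous, current in zip(times, times[1:]):
        period = current - previous
        if period <= 0.0 or period > max_period:
            continue  # not a valid pitch period
        contour.append((current, 1.0 / period))
    return contour


def format_contour(contour):
    """Two-column text: time instant, then F0 in Hz."""
    return "".join(f"{t:.6f} {f0:.2f}\n" for t, f0 in contour)
```

Pitchmarks spaced 5 ms apart give F0 values of about 200 Hz.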

Waves transcription

The speech transcription is in the Waves format. The first column contains the pitchmark instants, whereas the second column contains the voicing classification labels (see Table 1). Pitchmarks and the voiced/unvoiced classification were obtained with the Entropic (ESPS) tools. For the silence classification we used the zero-crossing count and the energy as speech features. To open the WAV file together with the pitchmarks transcription you can use, for example, WaveSurfer.

Label   Voicing classification
S       Silence
UV      Unvoiced
V       Voiced

Table 1 - Labels of Waves transcription
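As an illustration of how the labels in Table 1 can be used, the sketch below reads a two-column transcription and groups consecutive entries with the same label into runs. It assumes whitespace-separated columns and no header lines, which may not match the exact file written by our tools; the function names are hypothetical.

```python
from itertools import groupby

VALID_LABELS = {"S", "UV", "V"}  # labels from Table 1


def parse_waves_transcription(text):
    """Return a list of (time, label) pairs from two-column text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        if fields[1] not in VALID_LABELS:
            raise ValueError(f"unknown label: {fields[1]!r}")
        entries.append((float(fields[0]), fields[1]))
    return entries


def label_runs(entries):
    """Group consecutive entries sharing a label.

    Returns (label, first_time, last_time) for each run.
    """
    runs = []
    for label, group in groupby(entries, key=lambda entry: entry[1]):
        group = list(group)
        runs.append((label, group[0][0], group[-1][0]))
    return runs
```

The runs labelled "V" give the voiced regions in which pitchmarks and F0 values are meaningful.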

Residual signal

The residual signal is computed by inverse filtering the speech signal. The LPC analysis is pitch-synchronous: the LPC coefficients are computed using Hanning windows centered on the estimated pitchmarks, with a duration of 20 ms.
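The idea of pitch-synchronous inverse filtering can be sketched in pure Python as below. This is not our implementation: the LPC order (16), the autocorrelation method with Levinson-Durbin recursion, and the choice to apply each set of coefficients from the midpoint with the previous pitchmark to the midpoint with the next one are all assumptions made for the example. Pitchmarks are given as sample indices.

```python
import math


def hann(length):
    """Symmetric Hanning window."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_frame(signal, center, length):
    """Hanning-windowed frame centered on `center`, zero-padded at the edges."""
    start = center - length // 2
    window = hann(length)
    return [(signal[start + n] if 0 <= start + n < len(signal) else 0.0)
            * window[n] for n in range(length)]


def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC; returns [1, a1, ..., ap]."""
    r = [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate frame (e.g. all zeros): keep identity filter
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient
        previous = a[:]
        for j in range(1, i):
            a[j] = previous[j] + k * previous[i - j]
        a[i] = k
        error *= 1.0 - k * k
    return a


def inverse_filter_sample(signal, a, n):
    """Prediction error e[n] = sum over k of a[k] * x[n - k]."""
    return sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)


def pitch_synchronous_residual(signal, marks, sample_rate,
                               order=16, frame_ms=20.0):
    """Residual using one LPC filter per pitchmark (marks in samples)."""
    length = int(round(frame_ms * sample_rate / 1000.0))
    residual = [0.0] * len(signal)
    for i, mark in enumerate(marks):
        a = lpc_coefficients(windowed_frame(signal, mark, length), order)
        # Assumed region of validity: midpoint to midpoint between marks.
        lo = 0 if i == 0 else (marks[i - 1] + mark) // 2
        hi = len(signal) if i == len(marks) - 1 else (mark + marks[i + 1]) // 2
        for n in range(lo, hi):
            residual[n] = inverse_filter_sample(signal, a, n)
    return residual
```

At a sampling rate of 16 kHz, the 20 ms window corresponds to 320 samples around each pitchmark.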

Transformation of speech parameters

Method

The Pitch-Synchronous Time-Scaling (PSTS) method [2] is used by default for the prosodic and voice quality transformations of speech. Alternatively, you can select LP-PSOLA [1], which differs from PSTS in the way the pitch is modified.

Pitchmarks

You can either compute the pitchmarks with our tools or upload a file with the pitchmarks. The pitchmarks file must have two columns. The first contains the time instants of the pitchmarks; the second has the same length as the first and is filled with the character "1" (meaning that the pitchmarks correspond to voiced regions only). An example of a pitchmarks file is arctic_a0001m.pm. This option lets you use a different pitchmark detector, or use pitchmarks that were computed in section 1 with our tools and then manually corrected.

Prosodic parameters

  • Pitch
    Pitch frequency is equivalent to the fundamental frequency F0. You can choose to transform the mean value of the original pitch contour or to upload a file with the desired pitch contour. The text file for upload must have two columns, the first with the time instants and the second with the F0 values. An example of a pitch contour file is arctic_a0001m.f0. When modifying the mean value of the original pitch contour, the transformation factor must be given as a percentage of the pitch, expressed as a fraction. For example, a transformation factor of 0.5 increases the mean pitch by 50%. Likewise, a transformation factor of 1 doubles the mean pitch, while a factor of -0.5 halves it. The lower bound is -1. We recommend a transformation factor between -0.5 and 1.5 to avoid audible artefacts due to distortion of the speech signal. To compute the new pitch contour, the mean pitch value is first calculated and subtracted from the pitch contour to obtain the time-varying component of the pitch. The mean pitch is then multiplied by the absolute transformation factor (the transformation factor in percentage plus 1) and added back to the time-varying component.
  • Pitch range
    The pitch range transformation changes the difference between the maximum and the minimum values of the pitch contour. The time-varying component of the pitch contour is computed and multiplied by the absolute pitch range factor (the pitch range factor in percentage plus 1). We recommend a pitch range factor in percentage between -1 and 1.
  • Duration
    You can choose to transform the total duration of the speech utterance or to upload a file with the desired duration contour. The text file for upload must have two columns, the first with the time instants and the second with the transformation factors in percentage. An example of the duration file is example.dur. For example, a transformation factor of 0.5 increases the duration by 50%. The lower bound is -1. We recommend a duration factor in percentage between -0.8 and 1.5.
  • Energy
    You can choose to apply a constant energy factor to the speech utterance or to upload a file with the desired energy contour. The text file for upload must have two columns, the first with the time instants and the second with the transformation factors in percentage. An example of the energy file is example.en. For example, a transformation factor of 0.5 increases the energy by 50%. The lower bound is -1. We recommend an energy factor in percentage between -0.8 and 1.5.
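The pitch and pitch range computations described above can be sketched as follows. The function follows the steps in the text (split the contour into its mean and its time-varying component, scale each by its factor plus 1, and add them back together); applying both factors in a single call is a choice made for this example, and the function name is hypothetical.

```python
def transform_pitch(f0, mean_factor=0.0, range_factor=0.0):
    """Scale the mean and the range of a pitch contour.

    f0           -- list of F0 values in Hz (voiced points only)
    mean_factor  -- relative change of the mean pitch (0.5 means +50%)
    range_factor -- relative change of the pitch range (0.5 means +50%)
    """
    if mean_factor < -1.0 or range_factor < -1.0:
        raise ValueError("factors have a lower bound of -1")
    mean = sum(f0) / len(f0)
    new_mean = mean * (1.0 + mean_factor)  # absolute factor = factor + 1
    # The time-varying component is the contour minus its mean.
    return [new_mean + (value - mean) * (1.0 + range_factor) for value in f0]
```

For a contour of 100, 120 and 140 Hz, a mean factor of 1 gives 220, 240 and 260 Hz (the mean doubles from 120 to 240 Hz while the excursions are unchanged), and a range factor of 1 gives 80, 120 and 160 Hz (the range doubles from 40 to 80 Hz around the same mean). Note that lowering the mean without reducing the range can push low points of the contour towards zero. Duration and energy factors use the same convention: a factor of 0.5 multiplies the original value by 1.5.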

Voice quality parameters

  • Jitter
    Pitch period perturbations are introduced by multiplying each pitch period by a random factor. The random factor is equal to r*jitter/2, where r is a random number between -1 and 1 and jitter is the input jitter factor. For example, if the jitter is 0.1, the random factor takes a value in the range -0.05 to 0.05. Jitter is related to roughness in voice quality. Normal voices contain jitter, generally less than 1%. We suggest a jitter factor of up to 10%.
  • Shimmer
    Vocal shimmer is the cycle-to-cycle variation in amplitude. Each voiced speech frame is multiplied by a random factor whose range is limited by the selected shimmer factor. The random factor is equal to 1+r*shimmer/2, where r is a random number between -1 and 1 and shimmer is the input shimmer factor. For example, if the shimmer is 0.5, the random factor takes a value in the range 0.75 to 1.25. We suggest shimmer values between 0.2 and 0.5. Shimmer has the same effect on voice quality as jitter, namely roughness.
  • Aspiration noise
    Turbulent flow is expected to occur during the open phase of the glottal cycle and to be at its maximum at the glottal opening instant. To affect the perceived voice quality, the noise is high-pass filtered, amplitude modulated and synchronised with the pitch. We use a high-pass filter with a cut-off frequency in the range 1.2-4 kHz. The Gaussian noise is amplitude modulated using Hanning windows centered on the estimated glottal opening instant, with length equal to the pitch period. Aspiration noise is given in percentage of the signal energy. Our suggested range of values for the aspiration noise is between 6 and 10 (dB). Aspiration noise contributes to the sensation of breathiness.
  • Open quotient
    The open quotient (OQ) is related to the duration of the open phase of the glottal cycle (see Figure 1). To change the duration of the open phase we perform a time-scale transformation over the open phase segment of each pitch cycle. The length of the closed phase is adjusted by truncating or padding with zeros to preserve the pitch cycle duration. The OQ factor is equal to the variation of the open phase duration. We suggest limiting the open quotient factor to the range 0.5 to 1.5.
  • Speed quotient
    The speed quotient (SQ) accounts for variations in the shape of the open phase of the glottal flow. It is equal to the quotient between the opening phase duration and the closing phase duration (see Figure 1). We modify this parameter using time-scale transformations over the two phases. Large values of SQ correspond to a tense voice quality, whereas small values are characteristic of a lax or hypofunctional voice quality. We suggest limiting the SQ factor to the range 0.8 to 1.2.
  • Return quotient
    This parameter is related to the duration of the closing phase. To modify the return quotient (RQ), a time-scale transformation is applied to the return phase (see Figure 1). To maintain the pitch period and the open quotient, the peak flow is also time-scaled by an adequate factor. We suggest limiting the RQ factor to the range 0.5 to 1.5.
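The random factors used for jitter and shimmer can be sketched as below. The sketch assumes that r is drawn uniformly from -1 to 1, which is what the example range of -0.05 to 0.05 for a jitter of 0.1 implies, and that the jitter term is applied as a relative change of each pitch period (period times 1 plus the random term). Both points are interpretations for the purpose of the example rather than a description of our implementation.

```python
import random


def jitter_periods(periods, jitter, rng=None):
    """Perturb each pitch period by a relative amount r * jitter / 2."""
    if rng is None:
        rng = random.Random()
    return [period * (1.0 + rng.uniform(-1.0, 1.0) * jitter / 2.0)
            for period in periods]


def shimmer_gains(frame_count, shimmer, rng=None):
    """One amplitude gain 1 + r * shimmer / 2 per voiced frame."""
    if rng is None:
        rng = random.Random()
    return [1.0 + rng.uniform(-1.0, 1.0) * shimmer / 2.0
            for _ in range(frame_count)]
```

With a jitter of 0.1, a 10 ms pitch period ends up between 9.5 and 10.5 ms; with a shimmer of 0.5, each gain lies between 0.75 and 1.25. Passing a seeded `random.Random` instance as `rng` makes the perturbations reproducible.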


Figure 1 - (a) Glottal flow waveform, (b) Glottal flow derivative waveform.


Time instants and durations represented in Figure 1:

Pitchmark i: n0(i)

Return phase: Na=ncl

Closed phase: Nc=nop-ncl

Opening phase: Nop=np-nop

Closing phase: Ncl=N-np

Open phase: No=Na+Nop+Ncl
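The durations listed above can be computed from the instants within one pitch cycle as in the following sketch. The phase definitions and the speed quotient (opening phase over closing phase) are taken from the text; the open quotient is computed here as the open phase divided by the pitch period, which is the conventional definition and an assumption on our part, since the text only says that OQ is related to the open phase duration. The argument names are hypothetical and all values are in samples relative to the pitchmark.

```python
def glottal_phases(n_cl, n_op, n_p, period):
    """Phase durations of one glottal cycle.

    n_cl, n_op, n_p -- instants ncl, nop and np of Figure 1
    period          -- pitch period N
    """
    if not 0 <= n_cl <= n_op <= n_p <= period:
        raise ValueError("expected 0 <= ncl <= nop <= np <= N")
    phases = {
        "return": n_cl,           # Na  = ncl
        "closed": n_op - n_cl,    # Nc  = nop - ncl
        "opening": n_p - n_op,    # Nop = np - nop
        "closing": period - n_p,  # Ncl = N - np
    }
    # No = Na + Nop + Ncl
    phases["open"] = phases["return"] + phases["opening"] + phases["closing"]
    return phases


def speed_quotient(phases):
    """Opening phase duration over closing phase duration."""
    return phases["opening"] / phases["closing"]


def open_quotient(phases, period):
    """Open phase over pitch period (conventional definition, assumed)."""
    return phases["open"] / period
```

For a 100-sample cycle with ncl = 5, nop = 40 and np = 70, the closed phase lasts 35 samples and the open phase 65 samples, so the two together span the whole pitch period.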

Transformation of speech emotions

To re-synthesize speech with different emotions we transform the prosodic and voice quality parameters described in the previous section. For each emotion we defined the values of the transformation factors for the set of parameters. These values were chosen based on published studies and on our informal experiments. There are seven emotions available for selection. It is also possible to choose the intensity of the emotion, either normal or high.

Pitchmarks

You can either compute the pitchmarks with our tools or upload a file with the pitchmarks. The pitchmarks file must have two columns. The first contains the time instants of the pitchmarks; the second has the same length as the first and is filled with the character "1" (meaning that the pitchmarks correspond to voiced regions only). An example of a pitchmarks file is arcticm_a0001m.pm. This option lets you use a different pitchmark detector, or use pitchmarks that were computed in section 1 with our tools and then manually corrected.

References

[1] Moulines, E. and Charpentier, F., "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, Vol. 9, pp. 453-476, December 1990.

[2] Cabral, J. P. and Oliveira, L. C., "Pitch-Synchronous Time-Scaling for Prosodic and Voice Quality Transformations", Proc. Interspeech'2005, Lisbon, Portugal, September 2005.
