Helpful Documentation



    Index

  1. Utterance
  2. Speech parameters computations
  3. Transformation of speech parameters
  4. Transformation of speech emotions
  5. References


  1. Utterance

    You can either select an utterance from our speech database or upload your own. An uploaded file must be a WAV file (".wav" extension). The files available for selection come from the CMU ARCTIC database. The computations for these files have already been performed, so their results are returned faster than those for an uploaded speech file.
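
    Before uploading, it can help to confirm that a file really is a readable WAV file. The sketch below uses Python's standard wave module; the function name describe_wav and the in-memory test file are illustrative assumptions, not part of this tool.

```python
import io
import wave


def describe_wav(fileobj):
    """Return (channels, sample_rate_hz, duration_s) of a WAV file.

    wave.open raises wave.Error when the data is not RIFF/WAVE.
    """
    with wave.open(fileobj, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return w.getnchannels(), rate, frames / rate


# Build a tiny 16-bit mono WAV in memory as a stand-in for a real file.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)  # 1600 silent frames = 0.1 s
buf.seek(0)

print(describe_wav(buf))  # prints (1, 16000, 0.1)
```

    With a file on disk, pass the path or an open binary file object instead of the in-memory buffer.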


  2. Speech Parameters Computations

    In this section you can obtain text files with the pitchmarks, the pitch contour, or the Waves transcription of a given speech file. The residual signal can also be computed from the speech signal and downloaded.

    Pitchmarks

    Pitchmarks correspond to the instants of glottal closure in a laryngograph waveform (see Figure 1). We use the pitchmark detector from the Entropic ESPS tools. Our speech transformation techniques are pitch-synchronous, so they depend on the robustness of the pitch-marking algorithm. The file with the computed pitchmarks has two columns. The first contains the time instants of the pitchmarks; the second has the same length and is filled with the character "1", which indicates that the pitchmarks correspond to voiced regions only. You can manually correct the pitchmarks in the downloaded file and upload the corrected file in the speech transformation sections (3 and 4). To correct the pitchmarks, compute the residual signal and open it together with the pitchmarks transcription in suitable software such as WaveSurfer.
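
    As a minimal sketch of the two-column layout described above, the following Python reads a pitchmarks file, applies a manual correction, and writes it back. The function names, the whitespace separator, the six-decimal formatting, and times expressed in seconds are assumptions for illustration; check a downloaded file for the exact layout.

```python
def parse_pitchmarks(text):
    """Return the pitchmark times from two-column text ("time 1")."""
    times = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        if fields[1] != "1":
            raise ValueError(f"unexpected second column: {fields[1]!r}")
        times.append(float(fields[0]))
    return times


def format_pitchmarks(times):
    """Write pitchmark times, sorted, in the same two-column layout."""
    return "".join(f"{t:.6f} 1\n" for t in sorted(times))


sample = "0.105000 1\n0.113200 1\n0.121500 1\n"
marks = parse_pitchmarks(sample)
marks.append(0.117)  # hypothetical pitchmark added by hand
print(format_pitchmarks(marks), end="")
```

    Sorting on output keeps the file in time order even when corrected marks are appended at the end of the list.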

    Pitch contour

    The pitch contour is derived from the pitchmarks. F0 values are estimated from the time interval between successive pitchmarks in the voiced regions (F0 is the inverse of that interval), so the number of F0 points equals the number of voiced pitchmarks. In the output file, the first column contains the time instants and the second column contains the corresponding F0 values.
    You can modify the computed pitch contour and use it as the target pitch contour when transforming the pitch of the speech signal in section 3. Tools such as WaveSurfer or Praat can open a speech file together with its pitch contour, which makes the contour easy to edit.
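
    The relation between pitchmarks and F0 can be sketched as follows. This is not the tool's actual code: the function name, the max_period threshold used to detect a break between voiced regions, and the choice to give the first pitchmark of a region the period that follows it are all assumptions made so that each pitchmark receives one F0 value.

```python
def f0_from_pitchmarks(times, max_period=0.02):
    """Return (time, f0_hz) pairs, one per pitchmark.

    max_period (seconds) is an assumed threshold: a longer gap is
    treated as a break between voiced regions, not as a pitch period.
    """
    contour = []
    for i, t in enumerate(times):
        prev_gap = t - times[i - 1] if i > 0 else None
        next_gap = times[i + 1] - t if i + 1 < len(times) else None
        if prev_gap is not None and prev_gap <= max_period:
            period = prev_gap
        elif next_gap is not None and next_gap <= max_period:
            period = next_gap  # first mark of a region: look ahead
        else:
            continue  # isolated mark, no usable period
        contour.append((t, 1.0 / period))
    return contour


# Two voiced regions: 10 ms periods, then 8 ms periods.
marks = [0.100, 0.110, 0.120, 0.200, 0.208]
for t, f0 in f0_from_pitchmarks(marks):
    print(f"{t:.3f} {f0:.1f}")
```

    The printed lines follow the two-column output layout described above, with time in the first column and F0 in the second.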

    Waves transcription

    The speech transcription is in the Waves format. The first column contains the pitchmark instants, whereas the second column contains the voicing classification labels (see Table 1). The pitchmarks and the voiced/unvoiced classification were obtained with the Entropic ESPS tools. For the silence classification we used the zero-crossing count and the energy as speech features. To open the WAV file together with the pitchmarks transcription you can use, for example, WaveSurfer.


    Label   Voicing classification
    S       Silence
    UV      Unvoiced
    V       Voiced

    Table 1 - Labels of the Waves transcription
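
    The two features used for the silence classification, zero-crossing count and energy, can be computed per frame as in the sketch below. The decision rule and both thresholds (energy_max, crossings_max) are hypothetical placeholders; the values and rule actually used by this tool are not documented here.

```python
import math


def frame_features(frame):
    """Return (mean energy, zero-crossing count) for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return energy, crossings


def is_silence(frame, energy_max=1e-4, crossings_max=20):
    """Hypothetical rule: low energy and few zero crossings."""
    energy, crossings = frame_features(frame)
    return energy < energy_max and crossings < crossings_max


fs = 16000
# 10 ms of a 200 Hz tone with amplitude 0.5, and 10 ms of zeros.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(160)]
quiet = [0.0] * 160

print(is_silence(tone), is_silence(quiet))  # prints False True
```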

    Residual signal

    The residual signal is computed by inverse filtering the speech signal. The LPC analysis is pitch-synchronous: the LPC coefficients are computed using Hanning windows of 20 ms duration centered on the estimated pitchmarks.
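
    The following pure-Python sketch illustrates pitch-synchronous inverse filtering under stated assumptions. The LPC order, the autocorrelation method with the Levinson-Durbin recursion, and the rule that each coefficient set filters the samples closest to its pitchmark are assumptions; only the 20 ms Hanning window centered on each pitchmark comes from the description above.

```python
import math


def hann(n):
    """Symmetric Hanning window of n samples (n > 1)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def lpc(frame, order):
    """Autocorrelation-method LPC. Returns [1, a1, ..., ap] of A(z)."""
    r = [sum(frame[i] * frame[i + k] for i in range(len(frame) - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep current filter
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err  # reflection coefficient
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a


def residual(signal, pitchmarks, fs, order=10, win_dur=0.020):
    """Inverse filter `signal` with one LPC filter per pitchmark."""
    n = len(signal)
    half = int(round(win_dur * fs)) // 2
    win = hann(2 * half + 1)
    centers = [int(round(t * fs)) for t in pitchmarks]
    out = [0.0] * n
    for i, c in enumerate(centers):
        # Windowed frame centered on the pitchmark, zero-padded at edges.
        frame = [signal[j] * w if 0 <= j < n else 0.0
                 for j, w in zip(range(c - half, c + half + 1), win)]
        a = lpc(frame, order)
        # Assumption: filter the samples nearest to this pitchmark.
        start = 0 if i == 0 else (centers[i - 1] + c) // 2
        stop = n if i == len(centers) - 1 else (c + centers[i + 1]) // 2
        for t in range(start, stop):
            out[t] = sum(a[k] * signal[t - k]
                         for k in range(order + 1) if t - k >= 0)
    return out


# Synthetic test signal: 100 Hz pulse train through a one-pole filter.
fs = 8000
speech, state = [], 0.0
for n in range(800):
    state = 0.9 * state + (1.0 if n % 80 == 0 and n > 0 else 0.0)
    speech.append(state)
marks = [k * 0.01 for k in range(1, 10)]  # pitchmarks at the pulses
res = residual(speech, marks, fs, order=2)


def energy(s):
    return sum(v * v for v in s)


print(energy(res) / energy(speech))  # well below 1 for this signal
```

    On the synthetic signal the residual is close to the original pulse train, which is why its energy is much lower than that of the filtered signal.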


  3. Transformation of speech parameters

    Method

    The Pitch-Synchronous Time-Scaling (PSTS) method [2] is used by default for the prosodic and voice quality transformations of speech. You can instead select LP-PSOLA [1], which differs from PSTS in the way the pitch is modified.

    Pitchmarks

    You can either compute the pitchmarks using our tools or upload a file with the pitchmarks. The pitchmarks file must have two columns. The first contains the time instants of the pitchmarks; the second has the same length and is filled with the character "1", which indicates that the pitchmarks correspond to voiced regions only. This is an example of the pitchmarks file: arctic_a0001m.pm. The upload option gives you the flexibility to use a different pitchmark detector or to use manually corrected pitchmarks that were originally computed with our tools in section 2.

    Prosodic parameters

    Voice quality parameters



    Figure 1 - (a) Glottal flow waveform, (b) Glottal flow derivative waveform.


    Time instants and durations represented in Figure 1:

    Pitchmark i:    n_0(i)
    Return phase:   N_a = n_cl
    Closed phase:   N_c = n_op - n_cl
    Opening phase:  N_op = n_p - n_op
    Closing phase:  N_cl = N - n_p
    Open phase:     N_o = N_a + N_op + N_cl
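
    As a worked example of the relations above, the sketch below computes the phase durations from the time instants. It assumes the instants are sample offsets measured from the pitchmark n0(i) and that N is the length of the pitch period; the function name and the numeric values are made up for illustration.

```python
def glottal_phases(n_cl, n_op, n_p, N):
    """Phase durations (in samples) from the instants of Figure 1."""
    Na = n_cl            # return phase
    Nc = n_op - n_cl     # closed phase
    Nop = n_p - n_op     # opening phase
    Ncl = N - n_p        # closing phase
    No = Na + Nop + Ncl  # open phase
    return {"Na": Na, "Nc": Nc, "Nop": Nop, "Ncl": Ncl, "No": No}


# Hypothetical instants for an 80-sample pitch period.
phases = glottal_phases(n_cl=4, n_op=40, n_p=64, N=80)
print(phases)

# The four phases partition the period, so No equals N minus Nc.
assert phases["No"] == 80 - phases["Nc"]
```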


  4. Transformation of speech emotions

    To re-synthesize speech with different emotions, we transform the prosodic and voice quality parameters described in the previous section. For each emotion we defined the values of the transformation factors for this set of parameters. These values were chosen on the basis of published studies and our own informal experiments. Seven emotions are available for selection. You can also choose the intensity of the emotion: normal or high.

    Pitchmarks

    You can either compute the pitchmarks using our tools or upload a file with the pitchmarks. The pitchmarks file must have two columns. The first contains the time instants of the pitchmarks; the second has the same length and is filled with the character "1", which indicates that the pitchmarks correspond to voiced regions only. This is an example of the pitchmarks file: arcticm_a0001m.pm. The upload option gives you the flexibility to use a different pitchmark detector or to use manually corrected pitchmarks that were originally computed with our tools in section 2.


  5. References

    [1] Moulines, E. and Charpentier, F., "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, Vol. 9, pp. 453-467, December 1990.

    [2] Cabral, J. P. and Oliveira, L. C., "Pitch-Synchronous Time-Scaling for Prosodic and Voice Quality Transformations", Proc. Interspeech'2005, Lisbon, Portugal, September 2005.