For English synthesis, we used an existing English voice in the Festival Speech Synthesis System [3]. Although building a synthesizer targeted at conversational speech might have given a slight advantage, we did not expect it to differ significantly in quality. A few lexicon entries were added for the domain, but the existing English voice was otherwise used unchanged.
For Croatian, it was necessary to build a completely new speech synthesis voice. To do this, we used the tools available in the CMU FestVox project [1], which is designed to provide the necessary support for building new synthetic voices in new languages. A synthetic voice requires four components: text processing, a lexicon, a method for waveform synthesis, and prosodic models.
In this case, text processing was minimal. The text given to the synthesizer was fairly regular, because it was generated by the translation system (or the Croatian recognizer).
Fortunately, orthography-to-phoneme rules for Croatian are relatively simple and could be written by hand, so building a lexicon was much easier than it might be for some other languages. (The same lexicon and letter-to-sound rules were used by the recognition engine.)
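Because Croatian spelling is close to phonemic, hand-written rules of this kind can be approximated by a longest-match table lookup. The Python sketch below is illustrative only. The phone symbols and the function name `letter_to_sound` are hypothetical and do not reflect the phone set or rule format actually used in the system. It also ignores the rare words in which a letter pair such as "nj" spans a morpheme boundary and is not a digraph.

```python
import unicodedata

# Hypothetical phone symbols, chosen for this sketch only.
DIGRAPHS = {"dž": "dzh", "lj": "lj", "nj": "nj"}
SINGLE = {
    "a": "a", "b": "b", "c": "ts", "č": "ch", "ć": "cj", "d": "d",
    "đ": "dj", "e": "e", "f": "f", "g": "g", "h": "h", "i": "i",
    "j": "j", "k": "k", "l": "l", "m": "m", "n": "n", "o": "o",
    "p": "p", "r": "r", "s": "s", "š": "sh", "t": "t", "u": "u",
    "v": "v", "z": "z", "ž": "zh",
}

def letter_to_sound(word):
    """Map a Croatian word to a list of phone symbols, digraphs first."""
    # Normalize so that accented letters are single code points.
    word = unicodedata.normalize("NFC", word).lower()
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in SINGLE:
            phones.append(SINGLE[word[i]])
            i += 1
        else:
            i += 1  # skip characters outside the alphabet (assumption)
    return phones
```

For example, `letter_to_sound("ljubav")` treats the initial "lj" as one phone rather than two.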
Waveform synthesis was done using a constrained version of general unit selection techniques. From the translated utterances in the chaplain dialogs and from other Croatian text, we selected the 1000 utterances that best covered the phonetic space (using the technique described more fully in [2]). These were recorded by a male native speaker of Croatian and automatically labelled by a simple dynamic time warping technique using cross-linguistic prompts (as described in [1]). The labels were then corrected by hand.
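The labelling step can be illustrated with a minimal dynamic time warping sketch: a prompt whose phone boundaries are known is aligned frame by frame to the recorded utterance, and the boundaries are carried across the alignment. This is not the FestVox implementation. The function names, the use of Euclidean distance, and the representation of each frame as a tuple of acoustic features are assumptions made for illustration.

```python
import math

def dtw_path(ref, rec):
    """Align two sequences of feature vectors.

    Returns a monotonic list of (ref_index, rec_index) pairs.
    """
    n, m = len(ref), len(rec)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], rec[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # ref frame advances
                                 cost[i][j - 1],      # rec frame advances
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the final cell along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

def transfer_boundaries(path, ref_boundaries):
    """Map phone start frames in the prompt to frames in the recording."""
    first_match = {}
    for i, j in path:
        first_match.setdefault(i, j)  # earliest recording frame per prompt frame
    return [first_match[b] for b in ref_boundaries]
```

With a four-frame prompt whose second phone starts at frame 2, and a six-frame recording of the same two sounds spoken more slowly, `transfer_boundaries` places the second phone at the frame where the recording changes. Boundaries obtained this way are only approximate, which is consistent with the need for hand correction.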
The final required piece was a set of prosodic models for Croatian. We found a very simple rule-based phrasing method adequate for this domain, which consists mostly of short sentences. Duration models trained on the recorded Croatian speech worked well. Intonation was harder: a model trained on the relatively small amount of speech in the Croatian database did not produce good intonation. We therefore fell back on a different technique and simply used our English intonation model, adjusted to the pitch range of our Croatian speaker. In listening tests, native Croatian speakers preferred this over the natively trained model. For other languages, such shortcuts may not be as acceptable.
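The text does not say how the English model was adjusted to the Croatian speaker's range. One common way to do this, shown here purely as an assumption, is to normalize the predicted F0 targets by the mean and standard deviation of the source speaker and rescale them to those of the target speaker. The function name and all numeric values below are hypothetical.

```python
def map_f0_range(f0_targets, src_mean, src_std, tgt_mean, tgt_std):
    """Rescale F0 targets (Hz) from a source speaker's range to a target's.

    Each value is converted to a z-score under the source statistics,
    then mapped back to Hz under the target statistics.
    """
    return [tgt_mean + (f0 - src_mean) / src_std * tgt_std
            for f0 in f0_targets]

# Hypothetical values: contour predicted for a speaker with mean 120 Hz
# and standard deviation 20 Hz, mapped to a speaker with mean 100 Hz
# and standard deviation 10 Hz.
croatian_contour = map_f0_range([100, 120, 140], 120, 20, 100, 10)
```

The shape of the contour is preserved while its level and spread follow the target speaker.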
The resulting quality, although not always fluent, was understandable almost all the time, and much better than a standard diphone synthesizer. It also readily captured the voice quality of the original Croatian speaker.