We have already been asked to make Flite much smaller. By moving to spike-excited LPC, or otherwise encoding the residual in a more compact form, we can reduce the size of the diphone database from 2.1 megabytes to 423 kilobytes. Removing certain diphones is also possible, as not all diphones are distinct; e.g., a vowel going into different stops is often close enough to be rendered with the same diphone. We can probably also compress the LPC coefficients by quantizing them. In addition, our other work is already looking at acoustic modeling to find acoustically distinct units, so that databases much smaller than a standard diphone set will be possible, though of course at the cost of some degradation in quality.
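To make the idea of quantizing LPC coefficients concrete, the following is a minimal sketch of uniform scalar quantization. It is not Flite's actual storage format. The function names, the 8-bit code width, the assumed coefficient range of [-1, 1], and the sample frame values are all illustrative assumptions.

```python
def quantize(coeffs, lo, hi, bits=8):
    """Map each float in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    scale = levels / (hi - lo)
    codes = []
    for c in coeffs:
        clipped = min(max(c, lo), hi)  # values outside the range saturate
        codes.append(round((clipped - lo) * scale))
    return codes


def dequantize(codes, lo, hi, bits=8):
    """Recover approximate float values from the integer codes."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + q * step for q in codes]


# Hypothetical frame of coefficients (made-up values, not real Flite data).
frame = [0.82, -0.41, 0.13, -0.07, 0.02]

codes = quantize(frame, -1.0, 1.0)
restored = dequantize(codes, -1.0, 1.0)

# The reconstruction error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(frame, restored))
```

Storing one byte per coefficient instead of a wider numeric type shrinks the coefficient tables proportionally, and the bounded reconstruction error is one source of the quality degradation mentioned above.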
The lexicon is still the largest single item, but we can prune the exception list aggressively, reducing the lexicon footprint to a few hundred kilobytes. However, this probably deserves more study. Tailoring the lexicon to the domain in which the application will run is always a good thing to do.
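One way to picture pruning an exception list is sketched below. It assumes that an entry can be dropped whenever a set of letter-to-sound rules already predicts its stored pronunciation; the text above does not specify the pruning criterion, so this is an assumption. The names `prune_lexicon`, `lookup`, and `toy_rules`, the one-phone-per-letter rule table, and the three-word lexicon are all hypothetical.

```python
def prune_lexicon(lexicon, predict):
    """Keep only the words whose stored pronunciation the rules get wrong."""
    return {
        word: pron
        for word, pron in lexicon.items()
        if predict(word) != pron
    }


def lookup(word, exceptions, predict):
    """Consult the exception list first, then fall back to the rules."""
    if word in exceptions:
        return exceptions[word]
    return predict(word)


# Toy rule set, purely illustrative: one phone per letter.
LETTER_TO_PHONE = {
    "a": "ae", "c": "k", "d": "d", "e": "eh",
    "g": "g", "n": "n", "o": "aa", "t": "t",
}


def toy_rules(word):
    return tuple(LETTER_TO_PHONE.get(ch, ch) for ch in word)


# Hypothetical full lexicon (made-up entries).
lexicon = {
    "cat": ("k", "ae", "t"),
    "dog": ("d", "aa", "g"),
    "one": ("w", "ah", "n"),
}

# Only "one" survives, since the toy rules already cover "cat" and "dog".
pruned = prune_lexicon(lexicon, toy_rules)
```

Under this criterion every word in the original lexicon still receives its original pronunciation through `lookup`. More aggressive pruning, such as also dropping rare words, would trade some pronunciation accuracy for a smaller footprint.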
With our present models and techniques, we know we can build a full synthesizer with a footprint of 512K for code and data that uses less than 512K of memory (assuming streaming synthesis). There are still requests for smaller footprints, however, especially with respect to RAM: battery power for RAM is expensive, and some embedded systems have as little as 2K of memory. If we wish to deliver speech synthesis in games, toys, etc., meeting such targets would be good. Although we have not looked closely at this, we feel Flite is a basis from which to approach this problem.