Introduction to Text-to-Speech Programs
The basic goal of text-to-speech (TTS) synthesis is to convert arbitrary, unrestricted text into speech waveforms. Historically, the rise of TTS technology was strongly motivated by the needs of people with speech or visual impairments, especially the blind, for whom a TTS system provides convenient access to written information. With the spread of computers and the development of digital technology from the mid-1980s, the need for man-machine communication increased significantly, leading to many commercial applications: human-machine interaction in telephone services, language education tools assisted by TTS, talking toys and computer games for children, and so on. In turn, these practical applications motivate further research in this area, whose aim is to develop high-quality TTS systems.
Generally speaking, a typical text-to-speech synthesis system consists of three main components: text pre-processing (normalization), text to phonetic-prosodic translation, and speech synthesis.
Text normalization
Usually, the input text of the system is a sequence of unrestricted characters containing numbers, symbols, acronyms, and abbreviations. The text normalizer translates these into full plain text. For example, '3:15' is transformed into 'a quarter past three'. At first sight this task seems very easy. However, one important problem is often encountered during the translation: semantic ambiguity. A typical example is 'Dr', which can stand for 'Doctor' or 'Drive' depending on its context.
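The two examples above can be illustrated with a minimal Python sketch. It is only a toy: the function names, the small set of supported clock times, and the capitalization heuristic for 'Dr' are assumptions made for illustration, not the method of any particular TTS system.

```python
import re

# Hypothetical lookup table: index 0 stands for twelve o'clock.
HOURS = ["twelve", "one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven"]
TIME_RE = re.compile(r"(\d{1,2}):(\d{2})")


def expand_time(hour, minute):
    """Spell out a clock time; only a few minute values are handled here."""
    this_hour = HOURS[hour % 12]
    next_hour = HOURS[(hour + 1) % 12]
    if minute == 0:
        return f"{this_hour} o'clock"
    if minute == 15:
        return f"a quarter past {this_hour}"
    if minute == 30:
        return f"half past {this_hour}"
    if minute == 45:
        return f"a quarter to {next_hour}"
    return None  # other times are left untouched in this sketch


def normalize(text):
    """Expand clock times and the abbreviation 'Dr' token by token."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        core = tok.rstrip(".,;")   # token without trailing punctuation
        tail = tok[len(core):]     # the punctuation that was stripped
        match = TIME_RE.fullmatch(core)
        if match:
            spoken = expand_time(int(match.group(1)), int(match.group(2)))
            out.append((spoken or core) + tail)
        elif core == "Dr":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            prev = tokens[i - 1] if i > 0 else ""
            if nxt[:1].isupper():      # 'Dr. Smith' looks like a title
                out.append("Doctor" + tail.lstrip("."))
            elif prev[:1].isupper():   # 'Maple Dr.' looks like a street
                out.append("Drive" + tail.lstrip("."))
            else:
                out.append(tok)        # context unclear: leave as written
        else:
            out.append(tok)
    return " ".join(out)


print(normalize("Meet Dr. Smith at 3:15 on Maple Dr."))
# Meet Doctor Smith at a quarter past three on Maple Drive
```

A real normalizer needs far richer context than the neighboring word's capitalization, which is exactly why the ambiguity problem is harder than it first appears.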
Text to phonetic-prosodic translation
The translation from text to pronunciation is central to a full text-to-speech system. This module converts the pre-processed text into a phonetic transcription together with prosodic information such as intonation and rhythm. It is a relatively complicated process and to a large extent determines the final quality of the output speech.
Speech synthesis
In general, digital speech synthesis is an integrated technology for simulating the human process that generates speech, from a symbolic representation of an utterance to acoustic waveforms. With the rapid development of text-to-speech systems in recent years, the possibilities for speech synthesis have increased dramatically, because text written in ordinary form can be converted into a phonological representation that is not difficult to interpret. Nowadays there are many text-to-speech systems on the commercial market, and some of them are even multilingual. The following sections introduce two widely used speech synthesis approaches.
Formant synthesis
Historically, formant synthesis is also known as source-filter synthesis. It describes speech by a series of parameters, most of which relate to formant or anti-formant frequencies and bandwidths, together with glottal waveforms. These formant and anti-formant frequencies correspond closely to the frequency response characteristics of the vocal tract, so some basic knowledge of human speech production is needed before discussing formant synthesis further. Figure 2 illustrates the human speech production system. It is essentially composed of the lungs, windpipe, pharyngeal cavity (including the larynx), oral cavity, and nasal cavity. In this discussion, the oral and nasal cavities are usually combined and referred to as the vocal tract. The larynx is the organ that generates the sound. It contains two folds of tissue called the vocal folds, which repeatedly open and close as the air expelled from the lungs is forced through the opening between them. Another critical organ is the velum, at the rear of the nasal cavity, which controls the connection between the oral cavity and the nasal cavity. During the production of non-nasal sounds, the velum seals off the entrance to the nasal cavity and there is only one transmission path, via the mouth. For nasal sounds, the velum is lowered and the nasal cavity is coupled into the speech system.
According to the type of excitation, speech sounds can be roughly divided into three categories: (1) voiced sounds, (2) unvoiced sounds, and (3) plosive sounds.
Voiced sounds are generated when airflow is forced through the opening between the two vocal folds. The waveform is periodic, or at least quasi-periodic, because of the periodic vibration of the vocal folds driven by the airflow, and its spectrum is rich in harmonics of the pitch frequency, which attenuate at a rate of roughly 12 dB/octave. The pitch frequency for an adult male ranges from 50 to 250 Hz, and for an adult female the upper limit of this range can reach 500 Hz, much higher than for males. All vowels and semi-vowels in English are voiced sounds.
In contrast with voiced sounds, unvoiced sounds involve no vibration of the vocal folds. One example is the /s/ sound. The airflow is constricted at some point in the oral cavity, producing turbulence that causes a random noise excitation. Unvoiced sounds are divided into fricative and aspirated sounds.
A plosive sound may be voiced, unvoiced, or a combination of the two. It is produced by making a closure at some point in the vocal tract, building up air pressure behind it, and then releasing it suddenly.
As mentioned before, a voiced sound consists of a series of harmonics of the pitch frequency. When a neutral vowel is produced, the human vocal tract can be seen as a tube, closed at one end by the vocal folds and open at the other, at the lips. Such a tube has a series of resonances at odd multiples of the lowest one: $f_1$, $3f_1$, $5f_1$, and so on. The lowest frequency is given by the quarter-wave resonance of the vocal tract, $f_1 = c/(4L)$, where $L$ is the length of the tube and $c$ is the speed of sound in air. For a typical adult male vocal tract, $L$ is 17 cm. Taking $c = 340$ m/s, the resulting resonant frequencies are 500 Hz, 1500 Hz, 2500 Hz, and so on. These resonances of the vocal tract are called formants. In theory the vocal tract has infinitely many formants, but their amplitudes decay with increasing order at a rate of 12 dB/octave, so in practice it is only necessary to consider the first three or four.
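The arithmetic of the quarter-wave tube model can be checked with a few lines of Python. The function name and its parameters are chosen here for illustration; the formula is the odd-multiple series of the quarter-wave resonance, c/(4L), 3c/(4L), 5c/(4L), and so on.

```python
def tube_resonances(length_m, speed_of_sound=340.0, count=3):
    """Return the first `count` resonances (Hz) of a tube closed at one end."""
    f1 = speed_of_sound / (4.0 * length_m)  # quarter-wave resonance
    # Only odd multiples of f1 resonate in a closed-open tube.
    return [(2 * n - 1) * f1 for n in range(1, count + 1)]


for freq in tube_resonances(0.17):
    print(f"{freq:.0f} Hz")
# 500 Hz
# 1500 Hz
# 2500 Hz
```

A shorter vocal tract gives proportionally higher formants, which is one reason formant frequencies differ between speakers.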
The above discussion leads to the idea of building a source-filter model to simulate the speech production process. Figure 3 shows the block diagram of this model. Periodic pulses and random noise are generated by the glottal pulse source and the random noise source respectively. These two signals are then combined and passed through a set of filters whose parameters are chosen so that their resonant properties are similar to those of the vocal tract. The spectrum of the output speech is the spectrum of the input signal multiplied by the filters' frequency response. Two gain parameters in the diagram control the levels of the two input signals.
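A minimal sketch of such a source-filter model in pure Python is given below. It is an illustration under stated assumptions, not the system in Figure 3: the impulse train standing in for glottal pulses, the second-order resonator used as the vocal tract filter, the bandwidth values, and all function and parameter names are choices made here. Filtering in the time domain, as done below, is equivalent to multiplying the spectra.

```python
import math
import random


def glottal_pulses(n_samples, pitch_hz, sample_rate):
    """Periodic impulse train standing in for the glottal pulse source."""
    period = int(round(sample_rate / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def random_noise(n_samples, seed=0):
    """White noise standing in for the turbulent (unvoiced) source."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(n_samples, voiced_gain, noise_gain, formants,
               pitch_hz=100.0, sample_rate=16000):
    """Mix the two sources with their gains, then apply the formant filters."""
    pulses = glottal_pulses(n_samples, pitch_hz, sample_rate)
    noise = random_noise(n_samples)
    signal = [voiced_gain * p + noise_gain * q for p, q in zip(pulses, noise)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Formant frequencies from the tube model; the bandwidths are assumed values.
neutral_vowel = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]
speech = synthesize(1600, voiced_gain=1.0, noise_gain=0.0, formants=neutral_vowel)
print(len(speech))
```

Setting `voiced_gain` to zero and `noise_gain` to a positive value switches the excitation from voiced to unvoiced without touching the filters, which is the separation of source and filter that the model is named after.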
In practical formant synthesizers, the formant generators can be connected either in parallel or in series. Figure 4 gives the model of a typical serial formant synthesizer. There are three parallel channels in this system. The middle branch is composed of a set of formant resonators (band-pass filters) connected in series; usually three are used, with a fourth sometimes added for better quality. Each of these filters has a different center frequency, matching a resonant frequency of non-nasal sounds such as vowels and vowel-like sounds. Similarly, nasal and fricative sounds are produced by varying the parameters of the corresponding filters. Successful serial formant synthesizers have existed for years, such as the famous OVE-2 speech synthesizer of Gunnar Fant (1962). In parallel synthesizers, the formant resonators are connected in parallel and their outputs are weighted and summed to produce the synthetic speech signal. Each formant's amplitude is controlled along with its frequency. Although this adds more parameters, it gives the system more control over the spectrum of the output speech. The JSRU (Joint Speech Research Unit) synthesizer developed by Holmes in 1985 is a typical example of this kind of synthesizer. Compared with the serial synthesizer, which gives better quality for vowels and vowel-like sounds, the parallel one is more suitable for nasal and fricative sounds. To obtain more natural synthetic speech, combinations of the two methods have been developed. The general idea is that a serial connection of formant resonators produces voiced sounds, whereas nasal and fricative sounds are produced by a parallel connection.
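The difference between the two connection patterns can be sketched as follows. This is a simplified illustration rather than a model of OVE-2 or the JSRU synthesizer: the resonator difference equation, the bandwidths, the branch amplitudes, and the function names are all assumptions.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate=16000):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants):
    """Serial connection: each resonator filters the previous one's output."""
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal


def parallel(signal, formants, amplitudes):
    """Parallel connection: all resonators see the same input and their
    outputs are weighted by per-formant amplitudes and summed."""
    branches = [resonator(signal, f, bw) for f, bw in formants]
    return [sum(amp * branch[i] for amp, branch in zip(amplitudes, branches))
            for i in range(len(signal))]


impulse = [1.0] + [0.0] * 199
formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]

serial_out = cascade(impulse, formants)                      # no amplitude controls
parallel_out = parallel(impulse, formants, [1.0, 0.5, 0.25])  # one weight per formant
print(len(serial_out), len(parallel_out))
```

The serial version needs no amplitude parameters because the relative formant levels follow from the chain of filters, while the parallel version exposes one weight per formant, which is the extra control described above.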
Concatenative synthesis
Thanks to improvements in processor performance and storage devices in recent years, concatenative synthesis has become increasingly important. It is based on the concatenation of segments of recorded speech stored in a database. At playback time, the recorded segments that are needed are retrieved from the database and concatenated to produce speech according to the processed input text. There are three main subtypes of concatenative synthesis.
The first type is called Unit Selection Synthesis. This kind of synthesizer usually requires a large database. Before the database is created, the unit of the stored speech segments must be defined. The unit can be individual phones, syllables, morphemes, words, or even phrases and sentences. Once the unit has been fixed, each recorded utterance is divided into a sequence of units. Typically, this segmentation is done with a specially modified speech recognizer set to a "forced alignment" mode, with some hand correction afterward using visual representations such as the waveform and spectrogram. The segmented units are then stored in the database together with acoustic parameters such as pitch, amplitude, and duration, and with their neighboring phones. When a target utterance is required, the system chooses the best sequence of candidate units from the database. This process is called unit selection. It can give highly natural output because it does not involve much digital signal processing, which often degrades the naturalness of the output speech. However, unit selection has an obvious disadvantage: this naturalness requires very large speech databases, measured in gigabytes in some systems, which means more storage and higher cost.
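Choosing "the best sequence of candidate units" is often framed as a shortest-path search. The sketch below assumes one common formulation that is not spelled out in the text: each unit has a target cost (how well it matches the requested specification) and a join cost (how smoothly it follows the previous unit), and dynamic programming finds the sequence with the lowest total. The pitch-only cost functions and the unit records are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target so that the summed cost is minimal."""
    # best[i][j] = (cheapest total cost ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            totals = [best[i - 1][k][0] + join_cost(prev, unit)
                      for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(totals)), key=totals.__getitem__)
            row.append((totals[k_min] + target_cost(targets[i], unit), k_min))
        best.append(row)
    # Trace the cheapest path back from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


def pitch_target_cost(target, unit):
    """Hypothetical target cost: distance from the requested pitch."""
    return abs(target["pitch"] - unit["pitch"])


def pitch_join_cost(prev_unit, unit):
    """Hypothetical join cost: pitch jump at the concatenation point."""
    return abs(prev_unit["pitch"] - unit["pitch"])


targets = [{"phone": "h", "pitch": 120}, {"phone": "i", "pitch": 120}]
candidates = [
    [{"id": "h1", "pitch": 100}, {"id": "h2", "pitch": 125}],
    [{"id": "i1", "pitch": 118}, {"id": "i2", "pitch": 160}],
]
chosen = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print([unit["id"] for unit in chosen])
# ['h2', 'i1']
```

A real system would combine many features, such as the amplitude, duration, and neighboring phones stored with each unit, and would search a far larger candidate set.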
Because of the large database required by unit selection synthesis, another method known as Diphone Synthesis uses a speech database that stores all the diphones (phone-to-phone transitions) of a given language. Although the number of diphones varies widely between languages, the required database is much smaller than that of a unit selection synthesizer. At runtime, prosodic information such as intonation, duration, and amplitude is imposed on these diphones by means of digital signal processing techniques such as Linear Predictive Coding (LPC), pitch-synchronous overlap-add (PSOLA), and MBROLA. The quality of the output speech is usually not as good as that of unit selection. In addition, diphone synthesis suffers from audible glitches at the concatenation points, which make the speech sound uncomfortable. However, with the advantages of a simple speech database and reasonable naturalness, it is still widely used in many commercial applications.
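As a small illustration of how a diphone database is addressed, the sketch below turns a phone sequence into the list of diphone keys to look up. A diphone runs from the middle of one phone to the middle of the next, so each key names a pair of adjacent phones. The key format, the silence marker, and the function name are assumptions made for this example.

```python
def to_diphones(phones, silence="_"):
    """Return the diphone keys covering each phone-to-phone transition."""
    # Pad with silence so the utterance starts and ends on a transition.
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["h", "e", "l", "o"]))
# ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

With N distinct phones there are at most about N squared such pairs, which is why the database stays small compared with unit selection.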
The third type is Domain-specific Synthesis. In this kind of technique, the speech units stored in the database are often at the level of phrases and sentences. It is used in applications where the output speech is limited to a certain domain, for example telephone enquiry services or weather broadcasts. This technology is very easy to implement and relatively cheap, so it has been in commercial use for a long time in products like talking clocks, calculators, and other gadgets. The output of these systems can be very natural because the number of phrases or sentences is limited and their prosody can be matched accurately to human utterances. However, because of the limited set of phrases and sentences in the speech database, these systems cannot be used for general purposes: they are restricted to combinations of the words or phrases with which they have been pre-programmed.
About the Author
http://www.eiuoo.com