THE COMPANY NEWSLETTER






Introduction to Text to Speech Programs

The basic goal of text-to-speech (TTS) synthesis is to convert arbitrary, unrestricted text into speech waveforms. Historically, the rise of TTS technology was strongly motivated by the needs of people with disabilities, especially the blind, for whom a TTS system provides convenient access to written information. With the spread of computers and the development of digital technology from the mid-1980s, the need for man-machine communication increased significantly, leading to many commercial applications: human-machine interaction in telephone services, language education tools assisted by TTS, talking toys and computer games for children, and so on. In return, these practical applications motivate further research in this area, whose aim is to develop high-quality TTS systems.

Generally speaking, a typical text-to-speech synthesis system consists of three main components: text pre-processing (normalization), text to phonetic-prosodic translation, and speech synthesis.

Text normalization

Usually, the input text of the system is a sequence of unrestricted characters containing numbers, symbols, acronyms, and abbreviations. The text normalizer translates them into full plain text. For example, '3:15' will be transformed into 'a quarter past three'. At first sight this task seems very easy. However, an important problem is often encountered during this translation: semantic ambiguity. A typical example is 'Dr', which can represent 'Doctor' or 'Drive' depending on its context.
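The two examples above can be sketched in a few lines of Python. This is only a toy illustration, not how a production normalizer works: the function names, the handful of clock phrases, and the rule "Dr followed by a capitalized word means Doctor" are all assumptions made for the example.

```python
import re

# Spoken names for the hours on a 12-hour clock (index 0 is twelve).
HOURS = ["twelve", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten", "eleven"]


def speak_time(match):
    """Expand an H:MM match into words; only a few common cases are covered."""
    hour, minute = int(match.group(1)), int(match.group(2))
    if minute == 0:
        return f"{HOURS[hour % 12]} o'clock"
    if minute == 15:
        return f"a quarter past {HOURS[hour % 12]}"
    if minute == 30:
        return f"half past {HOURS[hour % 12]}"
    if minute == 45:
        return f"a quarter to {HOURS[(hour + 1) % 12]}"
    return match.group(0)  # leave other times untouched in this sketch


def normalize(text):
    """Expand clock times and resolve the 'Dr' ambiguity with a crude context rule."""
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b", speak_time, text)
    # Assumed heuristic: 'Dr' before a capitalized word is a title.
    text = re.sub(r"\bDr\.?(?=\s+[A-Z])", "Doctor", text)
    # Any remaining 'Dr' is treated as part of a street name.
    text = re.sub(r"\bDr\b", "Drive", text)
    return text


print(normalize("Meet Dr Smith at 3:15 on Elm Dr"))
# prints: Meet Doctor Smith at a quarter past three on Elm Drive
```

The context rule fails easily (for instance when 'Dr.' ends one sentence and the next begins with a capital letter), which is exactly the kind of ambiguity that makes this task harder than it first appears.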

Text to phonetic-prosodic translation

The translation from text to pronunciation is central to a full text-to-speech system. This module converts the pre-processed text into a phonetic transcription together with prosodic information (such as intonation and rhythm). It is a relatively complicated process and to a large extent determines the final quality of the output speech.

Speech synthesis

In general, digital speech synthesis is an integrated technology for simulating the human process that generates speech, from the symbolic representation of an utterance to acoustic waveforms. With the rapid development of text-to-speech systems in recent years, the possibilities for speech synthesis have increased dramatically, because text written in ordinary form can be described by a phonological representation that is not difficult to interpret. Nowadays, there are many text-to-speech systems on the commercial market, and some of them are even multilingual. In the following sections, two widely used speech synthesis approaches are introduced.

Formant synthesis

Historically, formant synthesis is also known as source-filter synthesis. It describes speech by a series of parameters, most of which are related to formant or anti-formant frequencies and bandwidths, together with glottal waveforms. These formant and anti-formant frequencies correspond closely to the frequency response characteristics of the vocal tract. Therefore, some basic knowledge of human speech production is necessary before discussing formant synthesis further.

Figure 2 illustrates the human speech production system. It is essentially composed of the lungs, windpipe, pharyngeal cavity (including the larynx), oral cavity, and nasal cavity. In this discussion, the oral and nasal cavities are usually combined and referred to as the vocal tract. The larynx is the organ that generates the sound. It contains two folds of tissue called the vocal folds, which repeatedly open and close as air expelled from the lungs is forced through the opening between them. Another critical organ is the velum at the rear of the nasal cavity, which controls the connection between the oral cavity and the nasal cavity. During the production of non-nasal sounds, the velum seals off the entrance to the nasal cavity, leaving only one transmission path, via the mouth. For nasal sounds, the velum is lowered and the nasal cavity is coupled into the speech system.

According to the types of excitation, the speech sounds can be roughly divided into three categories: (1) voiced sounds, (2) unvoiced sounds, and (3) plosive sounds.

Voiced sounds are generated when airflow is forced through the opening between the two vocal folds. The waveform is periodic, or at least quasi-periodic, because the airflow makes the vocal folds vibrate periodically, and its spectrum is rich in harmonics of the pitch frequency, which attenuate at a rate of roughly 12 dB/octave. The pitch frequency for an adult male ranges from 50 to 250 Hz; for an adult female the upper limit of this range can reach 500 Hz, much higher than for a male. All the vowels and semi-vowels in English are voiced sounds.

In contrast with voiced sounds, unvoiced sounds involve no vibration of the vocal folds. One example is the /s/ sound. The airflow is constricted at some point in the oral cavity, producing turbulence that acts as a random noise excitation. Unvoiced sounds are divided into fricative and aspirated sounds.

A plosive sound may be voiced, or unvoiced, or a combination of them. It is produced by making a closure at some point in the vocal tract, building the air pressure, and then releasing it suddenly.

As mentioned before, a voiced sound consists of a series of harmonics of the pitch frequency. When a neutral vowel is produced, the human vocal tract can be seen as a tube, closed at one end by the vocal folds and open at the other at the lips. Such a tube has a series of odd-order resonances at $f_1$, $3f_1$, $5f_1$, and so on. The lowest frequency is given by the quarter-wave resonance of the vocal tract: $f_1 = \frac{c}{4L}$, where $L$ is the length of the tube and $c$ is the speed of sound in air. For a typical adult male vocal tract, $L$ is 17 cm. Taking $c = 340$ m/s, the resulting resonant frequencies are 500 Hz, 1500 Hz, 2500 Hz, and so on. These resonances of the vocal tract are called formants. In theory, the vocal tract has infinitely many formants, but their amplitudes decay with increasing order at a rate of 12 dB/octave, so in practice it is only necessary to consider the first three or four formants.
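The arithmetic behind these figures is easy to check. The sketch below assumes the standard quarter-wave formula for a tube closed at one end, in which the n-th resonance is (2n - 1) * c / (4 * L); the function name is made up for the example.

```python
def tube_formants(length_m, speed_of_sound=340.0, count=3):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Only odd multiples of the quarter-wave frequency c / (4 * L) resonate.
    """
    quarter_wave = speed_of_sound / (4.0 * length_m)
    return [(2 * n - 1) * quarter_wave for n in range(1, count + 1)]


# A 17 cm vocal tract with c = 340 m/s gives roughly 500, 1500 and 2500 Hz.
for frequency in tube_formants(0.17):
    print(round(frequency), "Hz")
```

A shorter tract raises every formant in proportion: with an assumed length of 14 cm, the lowest resonance moves to about 607 Hz.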

The above discussion leads to the idea of building a source-filter model to simulate the speech production process. Figure 3 shows the block diagram of this model. Periodic pulses and random noise are generated by the glottal pulse source and the random noise source respectively. These two kinds of signals are then combined and passed through a set of filters whose parameters are chosen so that their resonant properties are similar to those of the vocal tract. The spectrum of the output speech is obtained by multiplying the spectrum of the input signal by the frequency response of the filters. Two gain controls in the diagram set the levels of the input signals.
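A minimal sketch of such a source-filter model follows. It is an illustration under stated assumptions, not the model of Figure 3 itself: the glottal source is reduced to a bare impulse train, each formant filter is a common two-pole digital resonator, and the names voiced_gain and noise_gain are hypothetical stand-ins for the two gain controls.

```python
import math
import random


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator standing in for one formant filter."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    c = -r * r
    a = 1.0 - b - c  # scales the filter to unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize(pitch_hz, formants, duration_s,
               voiced_gain=1.0, noise_gain=0.0, sample_rate=16000, seed=0):
    """Mix a pulse train and noise, then shape the mix with formant filters."""
    rng = random.Random(seed)
    period = max(1, round(sample_rate / pitch_hz))  # samples per pitch period
    source = []
    for i in range(int(duration_s * sample_rate)):
        pulse = 1.0 if i % period == 0 else 0.0   # simplified glottal pulse
        noise = rng.uniform(-1.0, 1.0)            # random noise source
        source.append(voiced_gain * pulse + noise_gain * noise)
    signal = source
    for freq_hz, bandwidth_hz in formants:        # filters applied one by one
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# A vowel-like sound: 100 Hz pitch, neutral-vowel formants, assumed bandwidths.
samples = synthesize(100, [(500, 60), (1500, 90), (2500, 120)], 0.25)
print(len(samples), "samples")
```

Setting voiced_gain to 0 and noise_gain to a positive value turns the same filters into a crude model of an unvoiced sound.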

In practical formant synthesizers, the formant generators can be connected either in parallel or in series. Figure 4 gives the model of a typical serial formant synthesizer. There are three parallel channels in this system. The middle branch is composed of a set of formant resonators connected in series (usually three, with a fourth sometimes added for better quality). Each of these resonators has a different center frequency, matching a resonant frequency of non-nasal sounds such as vowels and vowel-like sounds. Similarly, nasal and fricative sounds are produced by varying the parameters of the corresponding filters. Successful serial formant synthesizers have existed for years, such as the famous OVE-2 speech synthesizer of Gunnar Fant in 1962.

In parallel synthesizers, the formant resonators are connected in parallel, and their outputs are weighted and summed to produce the synthetic speech signal. Each formant's amplitude is controlled along with its frequency. Although this adds more parameters, it gives the system more control over the spectrum of the output speech. Compared with the serial synthesizer, which produces better-quality vowels and vowel-like sounds, the parallel one is more suitable for nasal and fricative sounds. The JSRU (Joint Speech Research Unit) synthesizer developed by Holmes in 1985 is a typical example of this kind.

To obtain more natural synthetic speech, a combination of the two methods has been developed. The general idea of this combined formant synthesizer is that a serial connection of formant resonators produces voiced sounds, whereas nasal and fricative sounds are produced by a parallel connection.
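The difference between the two connections shows up clearly in the frequency domain: a serial chain multiplies the responses of its resonators, while a parallel bank adds them with one amplitude weight per formant. The sketch below assumes a common two-pole digital resonator for each formant; the function names, bandwidths, and weights are illustrative only.

```python
import cmath
import math


def resonator_response(f_hz, formant_hz, bandwidth_hz, sample_rate=16000):
    """Complex frequency response of one two-pole formant resonator at f_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b = 2.0 * r * math.cos(2.0 * math.pi * formant_hz / sample_rate)
    c = -r * r
    a = 1.0 - b - c  # unity gain at 0 Hz
    z_inv = cmath.exp(-2j * math.pi * f_hz / sample_rate)
    return a / (1.0 - b * z_inv - c * z_inv * z_inv)


def serial_response(f_hz, formants):
    """Serial connection: the individual responses multiply."""
    total = 1.0 + 0.0j
    for formant_hz, bandwidth_hz in formants:
        total *= resonator_response(f_hz, formant_hz, bandwidth_hz)
    return total


def parallel_response(f_hz, formants, amplitudes):
    """Parallel connection: each response is weighted, then all are summed."""
    return sum(amplitude * resonator_response(f_hz, formant_hz, bandwidth_hz)
               for amplitude, (formant_hz, bandwidth_hz)
               in zip(amplitudes, formants))


formants = [(500, 60), (1500, 90), (2500, 120)]
for f_hz in (500, 1000, 1500):
    serial = abs(serial_response(f_hz, formants))
    parallel = abs(parallel_response(f_hz, formants, [1.0, 0.5, 0.25]))
    print(f_hz, "Hz", round(serial, 2), round(parallel, 2))
```

In the serial form the relative heights of the formant peaks follow automatically from the formant frequencies and bandwidths, whereas the parallel form needs an explicit amplitude for every formant. That is the extra set of parameters, and the extra control, described above.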

Concatenative synthesis

Due to the improved performance of processors and storage devices in recent years, concatenative synthesis has become increasingly important. It is based on the concatenation of segments of recorded speech stored in a database. During playback, the recorded segments required by the processed input text are extracted from the database and concatenated to produce speech. There are three main subtypes of concatenative synthesis.

The first type is called Unit Selection Synthesis. This kind of synthesizer usually requires a large database. Before the database is created, the unit of the stored speech segments must be defined. The unit can be individual phones, syllables, morphemes, words, or even phrases and sentences. Once the unit has been fixed, each recorded utterance is divided into a sequence of units. Typically, this segmentation is done with a specially modified speech recognizer set to a "forced alignment" mode, with some hand correction afterward using visual representations such as the waveform and spectrogram. The segmented units are then stored in the database together with their acoustic parameters, such as pitch, amplitude, and duration, and their neighboring phones. When a target utterance is required, the system chooses the best sequence of candidate units from the database. This process is called unit selection, and it can give very natural-sounding output because it does not involve much digital signal processing, which often degrades the naturalness of the output speech. However, unit selection has an obvious disadvantage: this naturalness requires very large speech databases, measured in gigabytes in some systems, which means more storage and higher cost.
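Choosing "the best sequence of candidate units" can be sketched as a shortest-path search. The article does not say how candidates are scored, so the sketch assumes a common formulation: a target cost (how well a unit matches the requested pitch and duration) plus a join cost (how smoothly two neighboring units connect), minimized by dynamic programming. The Unit fields, both cost functions, and the toy database are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str
    pitch_hz: float
    duration_ms: float


def target_cost(unit, pitch_hz, duration_ms):
    """Assumed cost: relative mismatch with the requested pitch and duration."""
    return (abs(unit.pitch_hz - pitch_hz) / pitch_hz
            + abs(unit.duration_ms - duration_ms) / duration_ms)


def join_cost(left, right):
    """Assumed cost: a pitch jump at the boundary between two units."""
    return abs(left.pitch_hz - right.pitch_hz) / 100.0


def select_units(targets, database):
    """Return the unit sequence with the lowest total target + join cost."""
    if not targets:
        return []
    candidates = [database[phone] for phone, _, _ in targets]
    # layers[i][j] = (best cost ending in candidate j, index of its predecessor)
    layers = []
    for i, (_, pitch_hz, duration_ms) in enumerate(targets):
        layer = []
        for unit in candidates[i]:
            cost = target_cost(unit, pitch_hz, duration_ms)
            if i == 0:
                layer.append((cost, None))
                continue
            best_prev, best_cost = min(
                ((j, layers[i - 1][j][0] + join_cost(prev, unit))
                 for j, prev in enumerate(candidates[i - 1])),
                key=lambda pair: pair[1])
            layer.append((best_cost + cost, best_prev))
        layers.append(layer)
    # Trace the cheapest path back from the last position to the first.
    index = min(range(len(layers[-1])), key=lambda j: layers[-1][j][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][index])
        index = layers[i][index][1]
    path.reverse()
    return path


database = {
    "h": [Unit("h", 100, 50), Unit("h", 200, 50)],
    "i": [Unit("i", 110, 80), Unit("i", 210, 80)],
}
chosen = select_units([("h", 200, 50), ("i", 200, 80)], database)
print([unit.pitch_hz for unit in chosen])  # prints: [200, 210]
```

With more candidates per phone the search finds better matches, which is one way to see why naturalness and database size grow together.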

Because of the large database needed for unit selection, another method known as Diphone Synthesis was created, which uses a speech database storing all the diphones (the transitions from one phone to the next) of a language. Although the number of diphones varies widely between languages, the required database is much smaller than that of a unit selection synthesizer. At runtime, prosodic information such as intonation, duration, and amplitude is imposed on these diphones by digital signal processing techniques such as Linear Predictive Coding (LPC), pitch synchronous overlap and add (PSOLA), and MBROLA. Usually, the quality of the output speech is not as good as that of unit selection. In addition, diphone synthesis suffers from the glitches at concatenation points that are typical of concatenative synthesis, which make the speech unpleasant to listen to. However, with the advantages of a simple speech database and reasonable naturalness, it is still widely used in many commercial applications.
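The lookup side of diphone synthesis is simple to sketch: a phone sequence is turned into the list of phone-to-phone transitions that must be fetched from the database. The naming scheme "left-right" and the "_" silence marker below are assumptions for the example, and the signal processing that adds prosody is not shown.

```python
def to_diphones(phones, silence="_"):
    """List the diphones needed to speak a phone sequence.

    The sequence is padded with silence so that the first and last
    phones also get a transition into and out of the utterance.
    """
    padded = [silence, *phones, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["h", "e", "l", "o"]))
# prints: ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

Because every unit is a pair of phones, a language with N phones needs at most about N * N stored diphones, which is why this database stays small compared with a unit selection inventory.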

The third type is Domain-specific Synthesis. In this kind of synthesis, the speech units stored in the database are often at the level of phrases and sentences. It is used in applications where the output speech is limited to a certain domain, for example a telephone enquiry service or a weather broadcast. This technology is very easy to implement and relatively inexpensive, so it has been in commercial use for a long time in devices such as talking clocks, calculators, and other gadgets. The output of these systems can be very natural because the number of phrases or sentences is limited and more prosodic information can be added accurately to match human speech. However, because the phrases and sentences in the speech database are limited, these systems cannot be used for general purposes; they are restricted to combinations of the words or phrases with which they have been prepared.
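A talking clock makes the idea concrete. In the sketch below the synthesizer is nothing more than a rule that picks pre-recorded prompts and plays them in order; the prompt names and the restriction to quarter hours are assumptions chosen to keep the example short.

```python
# Names of the recorded hour prompts on a 12-hour clock (index 0 is twelve).
HOURS = ["twelve", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten", "eleven"]


def clock_prompts(hour, minute):
    """Return the names of the recorded prompts to play, in order."""
    if not 0 <= hour <= 23 or minute not in (0, 15, 30, 45):
        # Anything that was never recorded simply cannot be spoken.
        raise ValueError("time is outside the recorded domain")
    if minute == 0:
        return ["the_time_is", HOURS[hour % 12], "oclock"]
    if minute == 45:
        return ["the_time_is", "a_quarter_to", HOURS[(hour + 1) % 12]]
    phrase = "a_quarter_past" if minute == 15 else "half_past"
    return ["the_time_is", phrase, HOURS[hour % 12]]


print(clock_prompts(3, 15))
# prints: ['the_time_is', 'a_quarter_past', 'three']
```

The ValueError branch is the limitation described above in miniature: the output is natural because every phrase was recorded by a person, but a request outside the prepared phrases has no answer at all.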


About the Author
http://www.eiuoo.com

 

 


Copyright ©2004 The Company Newsletter