Convert any text to lifelike speech using Amazon Polly

Utpal Kumar   4 minute read      

In this post, we will see how to use Amazon Web Services, specifically Amazon Polly, to convert any text into speech and save it as an MP3 file.

What is Amazon Polly?

Amazon Polly is, as you can guess, Amazon’s cloud service that converts text into lifelike speech. Developers can use it to build applications that increase engagement and accessibility. In this post, we will see how to use it to synthesize natural-sounding speech with high pronunciation accuracy.

Create an account on AWS

Amazon Polly is part of Amazon Web Services (AWS), so you first need an AWS account to access it. You can create an account on AWS at this link, and then follow the instructions.

After you have created an account, you can quickly try Amazon Polly by typing “Amazon Polly” in the search box and then choosing “Try Amazon Polly”. Type or paste some text into the input box. Then choose “Standard” or “Neural” for the engine. You can also select a different language and voice. To listen to the speech immediately, choose Listen. You can also download the MP3 file.

Next, we want to write a Python script that automates reading any text from your local computer. This still requires internet connectivity, because the speech is synthesized by the AWS service.

Create a user with admin privileges

AWS services such as Amazon Polly require that you provide credentials when you access them, so you need to create an IAM user for yourself. You can create one using the IAM console (type “IAM” in the search box). Then follow the instructions on the screen to create a new user and give it administrator privileges. For details, see this link.

Set up the AWS Command Line Interface (CLI)

Follow the steps here to install the AWS command line interface: Installing the AWS CLI. I recommend installing version 2.

After you have successfully installed the AWS command line interface, you should be able to run the following command:

aws --version

In my case, this returns:

aws-cli/2.3.0 Python/3.8.8 Darwin/20.6.0 exe/x86_64 prompt/off
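If a script needs to confirm which CLI version is installed, the version line can be parsed. The following is a minimal sketch, assuming the output always starts with `aws-cli/<major>.<minor>.<patch>` as in the sample above. The helper names `parse_cli_version` and `installed_cli_version` are hypothetical and are not part of the AWS CLI.

```python
import re
import shutil
import subprocess


def parse_cli_version(version_line):
    """Return (major, minor, patch) from a line such as 'aws-cli/2.3.0 Python/3.8.8 ...'."""
    match = re.match(r"aws-cli/(\d+)\.(\d+)\.(\d+)", version_line.strip())
    if match is None:
        raise ValueError(f"Unrecognized version line: {version_line!r}")
    return tuple(int(part) for part in match.groups())


def installed_cli_version():
    """Run `aws --version` and parse the result; return None if `aws` is not on PATH."""
    if shutil.which("aws") is None:
        return None
    result = subprocess.run(
        ["aws", "--version"], capture_output=True, text=True, check=True
    )
    # Check both streams in case the version line was written to stderr.
    return parse_cli_version(result.stdout or result.stderr)
```

Calling `installed_cli_version()` and checking that the first element of the returned tuple is 2 would confirm that version 2 is the one on your PATH.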

Now, you can run the command:

aws sts get-caller-identity

This command tries to find out who you are. If you haven’t configured your credentials locally, it will most likely fail with the error Unable to locate credentials. You can configure credentials by running "aws configure".

In that case, run the command aws configure and enter the access keys of your IAM user when prompted.
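To see which profiles the configuration step has written, you can inspect the shared credentials file from Python. This is a minimal sketch, assuming the file lives at `~/.aws/credentials` and uses the usual INI layout with an `aws_access_key_id` entry per profile. The function name `profiles_with_keys` is hypothetical. The sketch only looks at that one file, so credentials supplied through environment variables or other mechanisms will not show up.

```python
import configparser
from pathlib import Path


def profiles_with_keys(credentials_text):
    """Return the profile names in credentials-file text that define an access key ID."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return [
        name
        for name in parser.sections()
        if parser.has_option(name, "aws_access_key_id")
    ]


if __name__ == "__main__":
    # Assumed default location of the shared credentials file.
    credentials_file = Path.home() / ".aws" / "credentials"
    text = credentials_file.read_text() if credentials_file.exists() else ""
    profiles = profiles_with_keys(text)
    if profiles:
        print("Configured profiles:", ", ".join(profiles))
    else:
        print("No credentials found in the file; try running `aws configure`.")
```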

Verify the availability of Amazon Polly in your CLI

You can verify the availability of Amazon Polly by typing the following help command at the AWS CLI command prompt.

aws polly help

If a description of Amazon Polly and a list of valid commands appears in the AWS CLI window, Amazon Polly is available in the AWS CLI and can be used immediately.

If you get an error instead, follow the steps here.

Convert your first text to speech using the AWS command line

aws polly synthesize-speech \
    --output-format mp3 \
    --voice-id Matthew \
    --text 'Hope you like earthinversion. Please share it with your friends and colleagues' \
    earthinversion.mp3

This one uses the voice ID “Matthew”. You can list the other available voices using the command:

aws polly describe-voices
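The same list is available from Python through the boto3 `describe_voices` call, and its response can be filtered to find voices for one language or engine. The following is a minimal sketch: `voice_ids` and `fetch_voice_ids` are hypothetical helper names, the "default" profile is an assumption, and the sketch ignores pagination of the response.

```python
def voice_ids(response, language_code=None, engine=None):
    """Return sorted voice IDs from a describe_voices response, optionally filtered."""
    ids = []
    for voice in response.get("Voices", []):
        if language_code and voice.get("LanguageCode") != language_code:
            continue
        if engine and engine not in voice.get("SupportedEngines", []):
            continue
        ids.append(voice["Id"])
    return sorted(ids)


def fetch_voice_ids(profile_name="default", **filters):
    """Query Amazon Polly and return the matching voice IDs (needs credentials and network)."""
    # Imported here so that voice_ids() stays usable without boto3 installed.
    from boto3 import Session

    polly = Session(profile_name=profile_name).client("polly")
    return voice_ids(polly.describe_voices(), **filters)
```

For example, `fetch_voice_ids(language_code="en-US", engine="neural")` would return the US English voices that support the neural engine.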

Python script to convert text to speech

Now that we have the AWS command line interface installed and configured, we can install the Python library boto3 and run the script.

To install boto3:

pip install boto3

Python script to convert text input

from boto3 import Session
from botocore.exceptions import BotoCoreError, ClientError
import sys

# Create a client using the credentials and region of the "default" profile
# (stored in ~/.aws/credentials and ~/.aws/config by `aws configure`).
session = Session(profile_name="default")
polly = session.client("polly")

try:
    # Request speech synthesis
    response = polly.synthesize_speech(Text="Hello world! This is earthinversion speaking", OutputFormat="mp3",
                                        VoiceId="Matthew")
except (BotoCoreError, ClientError) as error:
    # The service returned an error, exit gracefully
    print(error)
    sys.exit(-1)

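# Access the audio stream from the response
if "AudioStream" in response:
    body = response['AudioStream'].read()

    file_name = 'speech.mp3'

    # Write the binary audio data to an MP3 file
    with open(file_name, 'wb') as file:
        file.write(body)

else:
    # The response did not contain audio data, so exit with an error
    print("Could not find an audio stream in the response")
    sys.exit(-1)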
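The console lets you pick between the “Standard” and “Neural” engines, and the same choice can be passed to `synthesize_speech` through its `Engine` parameter. The following is a minimal sketch, assuming only those two engines are of interest. `build_request` and `synthesize_to_file` are hypothetical helper names, and not every voice supports every engine, so check the voice list first.

```python
VALID_ENGINES = ("standard", "neural")


def build_request(text, voice_id="Matthew", engine="standard", output_format="mp3"):
    """Assemble the keyword arguments for polly.synthesize_speech()."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if engine not in VALID_ENGINES:
        raise ValueError(f"engine must be one of {VALID_ENGINES}, got {engine!r}")
    return {
        "Text": text,
        "OutputFormat": output_format,
        "VoiceId": voice_id,
        "Engine": engine,
    }


def synthesize_to_file(text, file_name="speech.mp3", **options):
    """Send the request to Amazon Polly and save the audio (needs credentials and network)."""
    # Imported here so that build_request() stays usable without boto3 installed.
    from boto3 import Session

    polly = Session(profile_name="default").client("polly")
    response = polly.synthesize_speech(**build_request(text, **options))
    with open(file_name, "wb") as out:
        out.write(response["AudioStream"].read())
    return file_name
```

A call such as `synthesize_to_file("Hello world!", engine="neural")` would then write `speech.mp3` using the neural engine.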

Python script to convert long text to speech

"""
Synthesizes speech from the input string of text or ssml.
Utpal Kumar, Oct, 2021
"""
from boto3 import Session
from botocore.exceptions import BotoCoreError, ClientError
import sys
from pydub import AudioSegment
import os

import argparse

info_string = ''' by Utpal Kumar, IESAS, 2021/08
'''

PARSER = argparse.ArgumentParser(
    description=info_string, epilog="")


def main(textfilename, wavOut="test.wav", tmpOutMp3File='test.mp3'):
    if os.path.exists(tmpOutMp3File):
        os.remove(tmpOutMp3File)

    # Create a client using the credentials and region of the "default" profile
    # (stored in ~/.aws/credentials and ~/.aws/config by `aws configure`).
    session = Session(profile_name="default")
    polly = session.client("polly")

    # Read the input text; each line is sent as a separate request
    with open(textfilename) as bt:
        textdata = bt.readlines()
    print('Total lines: ', len(textdata))

    for N in range(len(textdata)):
        allText0 = textdata[N]
        print(f"{N+1}/{len(textdata)}", allText0)
        # Skip blank lines, which contain nothing to synthesize
        if not allText0.strip():
            continue
        try:
            # Request speech synthesis
            response = polly.synthesize_speech(Text=allText0, OutputFormat="mp3",
                                               VoiceId="Matthew")
        except (BotoCoreError, ClientError) as error:
            # The service returned an error, so report it and stop
            print(error)
            sys.exit(-1)
        body = response['AudioStream'].read()
        # The audio stream is binary, so append it to the MP3 file in binary mode.
        with open(tmpOutMp3File, "ab") as out:
            out.write(body)

    # Convert the combined MP3 file to WAV
    sound = AudioSegment.from_mp3(tmpOutMp3File)
    sound.export(wavOut, format="wav")


if __name__ == "__main__":
    PARSER.add_argument("-f", '--textfilename', type=str, required=True,
                        help="input text file")
    PARSER.add_argument("-o", '--waveout', type=str,
                        help="wav output", default='test.wav')
    PARSER.add_argument("-m", '--tmpOutMp3File',
                        type=str, help="tmpOutMp3File", default='test.mp3')

    args = PARSER.parse_args()
    main(args.textfilename, wavOut=args.waveout,
         tmpOutMp3File=args.tmpOutMp3File)

Save the above script into a file named texttospeechaws.py. The script also needs the pydub library (pip install pydub), which relies on ffmpeg to read MP3 files. To use the script, pass the input file using the -f flag:

python texttospeechaws.py -f mytextfile.txt

For more details on usage, run the command:

python texttospeechaws.py -h
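The script sends one line of the input file per request, so a very long line can exceed the amount of text Amazon Polly accepts in a single request. The following is a minimal sketch of splitting such text at sentence boundaries first. The function name `split_text` is hypothetical, and the default of 3000 characters is an assumption that should be checked against the current Amazon Polly quotas.

```python
import re


def split_text(text, max_chars=3000):
    """Split text into chunks of at most max_chars, breaking at sentence ends when possible."""
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut into fixed-size pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            # The sentence still fits into the chunk being built.
            current = candidate
        else:
            # Close the current chunk and start a new one with this sentence.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be passed to `synthesize_speech` in place of a raw line from the file.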


