Use Free Edge Text-to-Speech with Deno

Would you like to use a free text-to-speech service that produces high-quality voices? I wrote a TypeScript Deno repository demonstrating how to use the amazing free Microsoft Edge text-to-speech service. You can find the code in the GitHub repository at https://github.com/mojocn/codeape/blob/main/edgetts.ts.

I won't walk through the step-by-step process of locating the internal text-to-speech API in the Microsoft Edge browser, but I'm happy to share the detailed Burp Suite logs of the text-to-speech WebSocket traffic.

Dive into Edge TTS

Let's dive into the Edge TTS WebSocket traffic captured with Burp Suite.

How to send a TTS message to the server

When the WebSocket connection is established, we need to send an initial message to the server.

X-Timestamp:Thu Oct 31 2024 14:39:43 GMT+0800 (China Standard Time)
Content-Type:application/json; charset=utf-8
Path:speech.config

{"context":{"synthesis":{"audio":{"metadataoptions":{"sentenceBoundaryEnabled":"false","wordBoundaryEnabled":"true"},"outputFormat":"webm-24khz-16bit-mono-opus"}}}}

After the initial message, we can send the text together with the voice, volume, and other options to the server.

X-RequestId:d6f2bbbe6a9817a6030a3d4833e967fd
Content-Type:application/ssml+xml
X-Timestamp:Thu Oct 31 2024 14:39:43 GMT+0800 (China Standard Time)Z
Path:ssml

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'  xml:lang='en-US'><voice name='Microsoft Server Speech Text to Speech Voice (en-HK, SamNeural)'><prosody pitch='+0Hz' rate ='+0%' volume='+0%'> Home | About Folklore | Quotes </prosody></voice></speak>
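
The speak method in STEP 2 relies on an ssmlStr helper to build this frame, which the repository defines but is not shown below. The following is a minimal sketch of it, assuming a TTSOptions shape with text, voice, pitch, rate, and volume fields; these field names are my assumption, modeled on the capture above:

interface TTSOptions {
    text: string;
    voice: string; // full voice name, see STEP 1
    pitch?: string; // e.g. '+0Hz'
    rate?: string; // e.g. '+0%'
    volume?: string; // e.g. '+0%'
}

// A minimal sketch of the SSML frame builder; it mirrors the headers
// and SSML body shown in the Burp Suite capture above.
function ssmlStr(options: TTSOptions): string {
    const { text, voice, pitch = '+0Hz', rate = '+0%', volume = '+0%' } = options;
    const ssml =
        `<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>` +
        `<voice name='${voice}'>` +
        `<prosody pitch='${pitch}' rate='${rate}' volume='${volume}'>${text}</prosody>` +
        `</voice></speak>`;
    // X-RequestId is a random 32-character hex string, as in the capture.
    const requestId = crypto.randomUUID().replaceAll('-', '');
    return `X-RequestId:${requestId}\r\n` +
        `Content-Type:application/ssml+xml\r\n` +
        `X-Timestamp:${new Date().toString()}Z\r\n` +
        `Path:ssml\r\n\r\n` + ssml;
}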

STEP 1: Edge Text-to-Speech Voices

Get a list of all the voices that are supported by Microsoft Edge's text-to-speech feature.

https://speech.platform.bing.com/consumer/speech/synthesize/readaloud/voices/list?trustedclienttoken=6A5AA1D4EAFF4E9FB37E23D68491D6F4

You can open the link above to get all the available voices.

// A subset of the fields returned by the voice list endpoint.
interface Voice {
    Name: string;
    ShortName: string;
    Gender: string;
    Locale: string;
    FriendlyName: string;
}

export class EdgeTts {
    websocket?: WebSocket;
    token: string;
    constructor(token?: string) {
        this.token = token ?? '6A5AA1D4EAFF4E9FB37E23D68491D6F4';
    }

    async voices(): Promise<Array<Voice>> {
        const url =
            `https://speech.platform.bing.com/consumer/speech/synthesize/readaloud/voices/list?trustedclienttoken=${this.token}`;
        const response = await fetch(url);
        const voices: Array<Voice> = await response.json();
        return voices;
    }
}
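
For example, here is a short usage sketch that filters the list down to English voices (the Locale and ShortName fields come from the voice list response):

// List the short names of all English voices.
const tts = new EdgeTts();
const voices = await tts.voices();
const english = voices.filter((v) => v.Locale.startsWith('en-'));
console.log(english.map((v) => v.ShortName));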

STEP 2: Edge Text-to-Speech

We need to handle several WebSocket events to receive the audio and the word-timing information.

First, let's connect to the WebSocket endpoint. Unlike Go or Python, where the standard library allows it, Deno cannot set custom headers on a WebSocket connection without a third-party library.

Without custom headers, the connection may run into problems or even be blocked by Microsoft.

After the WebSocket open event fires, we send an initial message. It tells the text-to-speech endpoint which details to return.


    connWebsocket(): Promise<WebSocket> {
        const url = new URL(
            `/consumer/speech/synthesize/readaloud/edge/v1?TrustedClientToken=${this.token}`,
            'wss://speech.platform.bing.com',
        );
        const ws = new WebSocket(url);
        // sentenceBoundaryEnabled = true is not supported in some countries.
        // Headers and body must be separated by \r\n exactly, so build the
        // frame by concatenation rather than a multi-line template literal.
        const initialMessage = `X-Timestamp:${new Date().toString()}\r\n` +
            'Content-Type:application/json; charset=utf-8\r\n' +
            'Path:speech.config\r\n\r\n' +
            '{"context":{"synthesis":{"audio":{"metadataoptions":' +
            '{"sentenceBoundaryEnabled":"true","wordBoundaryEnabled":"true"},' +
            '"outputFormat":"audio-24khz-96kbitrate-mono-mp3"}}}}';

        return new Promise<WebSocket>((resolve, reject) => {
            ws.addEventListener('open', () => {
                ws.send(initialMessage);
                resolve(ws);
            });
            ws.addEventListener('error', reject);
            ws.addEventListener('close', console.info);
        });
    }

The metadata options control whether the service returns timestamps for spoken words and sentences; the same configuration message also selects the audio output format.

At the same time, we need to listen for both the error and close events.

We combine the text and the speaking options into an SSML message and send it over the WebSocket. We then create a promise that handles the incoming WebSocket messages; the text-to-speech result is filled in once the promise resolves successfully.

In the WebSocket message event listener, we deal with two types of data. The Path:audio header marks frames that carry audio data, while the Path:audio.metadata header marks frames with the timestamps at which each word or sentence is spoken.

    async speak(options: TTSOptions): Promise<TtsResult> {
        const ws = await this.connWebsocket();
        this.websocket = ws;
        const textXml = ssmlStr(options);
        ws.send(textXml);
        const result = new TtsResult();
        const promise = new Promise<TtsResult>((resolve) => {
            ws.addEventListener(
                'message',
                async (message: MessageEvent<string | Blob>) => {
                    // Binary frames carry audio: everything after the
                    // 'Path:audio\r\n' header is encoded audio data.
                    if (typeof message.data !== 'string') {
                        const blob: Blob = message.data;
                        const separator = 'Path:audio\r\n';
                        const text = await blob.text();
                        const index = text.indexOf(separator) +
                            separator.length;
                        const audioBlob = blob.slice(index);
                        result.audioParts.push(audioBlob);
                        return;
                    }
                    // Text frames carry either word/sentence timestamps
                    // (Path:audio.metadata) or the end-of-turn signal.
                    if (message.data.includes('Path:audio.metadata')) {
                        const parts = message.data.split('Path:audio.metadata');
                        if (parts.length >= 2) {
                            const meta = JSON.parse(parts[1]) as AudioMetadata;
                            result.marks.push(meta.Metadata[0]);
                        }
                    } else if (message.data.includes('Path:turn.end')) {
                        // The server signals that synthesis is finished.
                        return resolve(result);
                    }
                },
            );
        });
        return await promise;
    }
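
The speak method also references two metadata types that are not defined above. Based on the audio.metadata payloads visible in the Burp Suite capture, they look roughly like the sketch below; the field names follow the service's JSON, but treat the exact shape as an assumption:

// Offset and Duration are expressed in 100-nanosecond ticks from the
// start of the audio stream; the exact shape is an assumption.
interface WordBoundary {
    Type: string; // 'WordBoundary' or 'SentenceBoundary'
    Data: {
        Offset: number;
        Duration: number;
        text: { Text: string; Length: number; BoundaryType: string };
    };
}

interface AudioMetadata {
    Metadata: Array<WordBoundary>;
}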

STEP 3: Write Audio to File

Once synthesis is complete, we can use the text-to-speech result to write an audio file or to build subtitles.

We concatenate the MP3 audio parts, then use Deno.writeFileSync to write the blob data to an MP3 file, making sure the file is saved. Now we can easily turn text into an audio file using the free text-to-speech service.

class TtsResult {
    audioParts: Array<BlobPart>;
    marks: Array<WordBoundary>;
    constructor() {
        this.audioParts = [];
        this.marks = [];
    }
    get mp3Blob(): Blob {
        return new Blob(this.audioParts, { type: 'audio/mpeg' });
    }
    async writeToFile(path?: string) {
        path = path ?? 'output.mp3';
        const blob = new Blob(this.audioParts);
        const arrayBuffer = await blob.arrayBuffer();
        const uint8Array = new Uint8Array(arrayBuffer);
        Deno.writeFileSync(path, uint8Array);
    }
}
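
Putting it all together, a minimal end-to-end run might look like this (the voice name is taken from the STEP 1 list, and the option fields follow the hypothetical TTSOptions sketch above):

const tts = new EdgeTts();
const result = await tts.speak({
    text: 'Home | About Folklore | Quotes',
    voice: 'Microsoft Server Speech Text to Speech Voice (en-HK, SamNeural)',
});
await result.writeToFile('output.mp3');
// Word marks can drive subtitles: divide the 100-nanosecond ticks
// by 10_000_000 to get seconds.
for (const mark of result.marks) {
    console.log(mark.Data.text.Text, mark.Data.Offset / 10_000_000);
}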
