Use Free Edge Text-to-Speech with Deno
Would you like to use a free text-to-speech service that produces high-quality voices? I wrote a TypeScript Deno repository demonstrating how to use the free Microsoft Edge text-to-speech service. You can find the GitHub repository at https://github.com/mojocn/codeape/blob/main/edgetts.ts.
I won't share the step-by-step process for locating the internal text-to-speech API in the Microsoft Edge browser, but I'm happy to share detailed Burp Suite logs of the text-to-speech WebSocket traffic.
Dive into Edge TTS
Let's dive into the Edge TTS WebSocket traffic captured with Burp Suite.
How to send a TTS message to the server
When the WebSocket connection is established, we need to send an initial message to the server.
X-Timestamp:Thu Oct 31 2024 14:39:43 GMT+0800 (China Standard Time)
Content-Type:application/json; charset=utf-8
Path:speech.config
{"context":{"synthesis":{"audio":{"metadataoptions":{"sentenceBoundaryEnabled":"false","wordBoundaryEnabled":"true"},"outputFormat":"webm-24khz-16bit-mono-opus"}}}}
After the initial message, we can send the text, voice, volume, and other options to the server.
X-RequestId:d6f2bbbe6a9817a6030a3d4833e967fd
Content-Type:application/ssml+xml
X-Timestamp:Thu Oct 31 2024 14:39:43 GMT+0800 (China Standard Time)Z
Path:ssml
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'><voice name='Microsoft Server Speech Text to Speech Voice (en-HK, SamNeural)'><prosody pitch='+0Hz' rate ='+0%' volume='+0%'> Home | About Folklore | Quotes </prosody></voice></speak>
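The second frame above can be produced with a small helper. Here is a minimal sketch of such a builder; the option names (text, voice, pitch, rate, volume) are my assumptions, and the real repository's TTSOptions may differ:

```typescript
// Sketch of a builder for the Path:ssml frame captured above.
// The option names here are assumptions; the repo's TTSOptions may differ.
interface TTSOptions {
  text: string;
  voice: string;   // e.g. 'Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)'
  pitch?: string;  // e.g. '+0Hz'
  rate?: string;   // e.g. '+0%'
  volume?: string; // e.g. '+0%'
}

function ssmlStr(options: TTSOptions): string {
  // 32 hex characters, matching the X-RequestId format in the capture.
  const requestId = Array.from(
    { length: 32 },
    () => Math.floor(Math.random() * 16).toString(16),
  ).join('');
  const ssml =
    `<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>` +
    `<voice name='${options.voice}'>` +
    `<prosody pitch='${options.pitch ?? '+0Hz'}' rate='${options.rate ?? '+0%'}' volume='${options.volume ?? '+0%'}'>` +
    `${options.text}</prosody></voice></speak>`;
  // Headers are separated by \r\n; a blank line separates headers from the body.
  return `X-RequestId:${requestId}\r\n` +
    `Content-Type:application/ssml+xml\r\n` +
    `X-Timestamp:${new Date().toString()}Z\r\n` +
    `Path:ssml\r\n\r\n${ssml}`;
}
```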
STEP 1: Edge Text to Speech Voices
Get a list of all the voices supported by Microsoft Edge's text-to-speech feature. You can open the URL in the code below to see all available voices.
export class EdgeTts {
  websocket?: WebSocket;
  token: string;

  constructor(token?: string) {
    this.token = token ?? '6A5AA1D4EAFF4E9FB37E23D68491D6F4';
  }

  async voices(): Promise<Array<Voice>> {
    const url =
      `https://speech.platform.bing.com/consumer/speech/synthesize/readaloud/voices/list?trustedclienttoken=${this.token}`;
    const response = await fetch(url);
    const voices: Array<Voice> = await response.json();
    return voices;
  }
}
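The endpoint returns a JSON array of voice descriptors, which you can filter locally, for example by locale. A quick sketch; the field names below match what the endpoint returned in my captures, but treat the exact shape as an assumption (the repository defines its own Voice type):

```typescript
// Assumed shape of a voice descriptor; field names from captured responses.
interface Voice {
  Name: string;
  ShortName: string;
  Gender: string;
  Locale: string;
}

// Two sample entries, abbreviated from a real voices-list response.
const sample: Voice[] = [
  {
    Name: 'Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)',
    ShortName: 'en-US-AriaNeural',
    Gender: 'Female',
    Locale: 'en-US',
  },
  {
    Name: 'Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)',
    ShortName: 'zh-CN-XiaoxiaoNeural',
    Gender: 'Female',
    Locale: 'zh-CN',
  },
];

// Keep only the voices whose locale starts with the given prefix.
function voicesForLocale(voices: Voice[], prefix: string): Voice[] {
  return voices.filter((v) => v.Locale.startsWith(prefix));
}

console.log(voicesForLocale(sample, 'en').map((v) => v.ShortName));
```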
STEP 2: Edge Text-to-Speech
We need to handle several WebSocket events to receive the audio and the spoken-word timing information.
First, let's connect to the WebSocket endpoint. Unlike Go or Python, where the standard library allows it, we cannot customize headers when creating a WebSocket connection in Deno without using a third-party library.
If we can't customize the headers of a WebSocket connection, it may cause connection problems or even get the connection blocked by Microsoft.
After the WebSocket open event, we send an initial message. This message tells the text-to-speech endpoint which details it should return.
connWebsocket(): Promise<WebSocket> {
  const url = new URL(
    `/consumer/speech/synthesize/readaloud/edge/v1?TrustedClientToken=${this.token}`,
    'wss://speech.platform.bing.com',
  );
  const ws = new WebSocket(url);
  // sentenceBoundaryEnabled = true is not supported in some countries.
  // Headers are separated by \r\n, then a blank line, then the JSON body.
  const initialMessage =
    `X-Timestamp:${new Date().toString()}\r\n` +
    `Content-Type:application/json; charset=utf-8\r\n` +
    `Path:speech.config\r\n\r\n` +
    `{"context":{"synthesis":{"audio":{"metadataoptions":` +
    `{"sentenceBoundaryEnabled":"true","wordBoundaryEnabled":"true"},` +
    `"outputFormat":"audio-24khz-96kbitrate-mono-mp3"}}}}`;
  return new Promise<WebSocket>((resolve, reject) => {
    ws.addEventListener('open', () => {
      ws.send(initialMessage);
      resolve(ws);
    });
    ws.addEventListener('error', reject);
    ws.addEventListener('close', console.info);
  });
}
Metadata options include the audio output format and timestamps for spoken words and sentences.
At the same time, we need to listen for both the error and close events.
We combine the text and the speaking options, then send the message over the WebSocket. We create a promise to handle the WebSocket messages for text-to-speech processing; the text-to-speech result is filled in as messages arrive, and the promise resolves once synthesis completes.
In the WebSocket message event listener, we deal with two types of data. Frames containing the Path:audio header carry chunks of the audio file; frames containing the Path:audio.metadata header mark the times at which words or sentences are spoken.
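For reference, a metadata frame looks roughly like this and can be parsed by splitting on the Path header. The payload shape below mirrors my captures, so treat the exact field names as assumptions; Offset and Duration appear to be in 100-nanosecond ticks:

```typescript
// Assumed shape of a Path:audio.metadata payload (from captured traffic).
interface WordBoundary {
  Type: string;
  Data: {
    Offset: number;   // 100-nanosecond ticks from the start of the audio
    Duration: number; // 100-nanosecond ticks
    text: { Text: string };
  };
}

interface AudioMetadata {
  Metadata: WordBoundary[];
}

// A synthetic frame in the captured format: headers, blank line, JSON body.
const frame =
  'X-RequestId:abc\r\nContent-Type:application/json\r\n' +
  'Path:audio.metadata\r\n\r\n' +
  '{"Metadata":[{"Type":"WordBoundary","Data":' +
  '{"Offset":8750000,"Duration":3750000,"text":{"Text":"Hello"}}}]}';

// Split on the Path header; JSON.parse tolerates the leading \r\n\r\n whitespace.
const parts = frame.split('Path:audio.metadata');
const meta = JSON.parse(parts[1]) as AudioMetadata;
console.log(meta.Metadata[0].Data.text.Text);       // Hello
console.log(meta.Metadata[0].Data.Offset / 10_000); // 875 (milliseconds)
```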
async speak(options: TTSOptions): Promise<TtsResult> {
  const ws = await this.connWebsocket();
  this.websocket = ws;
  const textXml = ssmlStr(options);
  ws.send(textXml);
  const result = new TtsResult();
  const promise = new Promise<TtsResult>((resolve) => {
    ws.addEventListener(
      'message',
      async (message: MessageEvent<string | Blob>) => {
        // Binary frames carry audio; everything after the Path:audio header is payload.
        if (typeof message.data !== 'string') {
          const blob: Blob = message.data;
          const separator = 'Path:audio\r\n';
          const text = await blob.text();
          const index = text.indexOf(separator) + separator.length;
          const audioBlob = blob.slice(index);
          result.audioParts.push(audioBlob);
          return;
        }
        // Text frames carry boundary metadata or the end-of-turn marker.
        if (message.data.includes('Path:audio.metadata')) {
          const parts = message.data.split('Path:audio.metadata');
          if (parts.length >= 2) {
            const meta = JSON.parse(parts[1]) as AudioMetadata;
            result.marks.push(meta.Metadata[0]);
          }
        } else if (message.data.includes('Path:turn.end')) {
          return resolve(result);
        }
      },
    );
  });
  return await promise;
}
STEP 3: Write Audio to File
After everything is done, we can use the text-to-speech result to write an audio file or create subtitles.
We concatenate the MP3 audio parts into a single blob, then use Deno.writeFileSync to write the blob data into an MP3 file, making sure the file is saved successfully.
Now we can easily turn our text into an audio file using the free text-to-speech service.
class TtsResult {
  audioParts: Array<BlobPart>;
  marks: Array<WordBoundary>;

  constructor() {
    this.audioParts = [];
    this.marks = [];
  }

  get mp3Blob(): Blob {
    return new Blob(this.audioParts, { type: 'audio/mpeg' });
  }

  async writeToFile(path?: string) {
    path = path ?? 'output.mp3';
    const blob = new Blob(this.audioParts);
    const arrayBuffer = await blob.arrayBuffer();
    const uint8Array = new Uint8Array(arrayBuffer);
    Deno.writeFileSync(path, uint8Array);
  }
}
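The blob assembly can be checked offline, without touching the network. A quick sketch using fake bytes in place of real MP3 chunks:

```typescript
// Offline sketch: audio chunks received from the WebSocket concatenate
// into a single MP3 Blob, the same way the mp3Blob getter does.
const audioParts: BlobPart[] = [
  new Uint8Array([0xff, 0xfb]),       // fake bytes standing in for an MP3 chunk
  new Uint8Array([0x00, 0x01, 0x02]), // another fake chunk
];
const mp3 = new Blob(audioParts, { type: 'audio/mpeg' });
console.log(mp3.size, mp3.type); // 5 audio/mpeg
```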
Last edited on December 26, 2024.