Overview
Translates audio in any supported language into English text. Unlike transcription, this endpoint always outputs English text regardless of the input language.
Request Body
file
The audio file to translate. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Maximum file size is 25 MB.
model
string
default: "whisper-1"
The model to use. Currently only whisper-1 is supported.
Optional text that guides the model’s style or continues a previous audio segment. It should be in English.
The format of the output. Options: json, text, srt, verbose_json, vtt.
The sampling temperature, between 0 and 1. Higher values like 0.8 produce more random output, while lower values like 0.2 make output more focused and deterministic.
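The limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the API: the function name and its argument names are hypothetical, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which the reference does not specify.

```python
import os

# Values copied from the parameter descriptions above.
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
# Assumption: "25 MB" is read as 25 MiB; the reference does not say which.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_translation_request(filename, size_bytes, response_format, temperature):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    # Compare the file extension (without the dot, lowercased) to the supported list.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    if response_format not in RESPONSE_FORMATS:
        problems.append(f"unknown response format: {response_format}")
    if not 0 <= temperature <= 1:
        problems.append(f"temperature {temperature} is outside [0, 1]")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem in one pass.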
Response
The translated text in English.
For verbose_json format, the response also includes:
The detected language of the input audio.
The duration of the input audio in seconds.
Segments of the translated text with timestamps.
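A verbose_json response might be consumed roughly as follows. This is a sketch under assumptions: the field names `language`, `duration`, and `segments`, and the per-segment keys `start`, `end`, and `text`, are not given in this reference and should be checked against a real response. The function name is hypothetical.

```python
import json


def summarize_verbose_translation(raw_json):
    """Turn a verbose_json response body into human-readable lines.

    Assumed (not confirmed by the reference): top-level keys "language",
    "duration", "segments"; each segment has "start", "end", "text".
    """
    payload = json.loads(raw_json)
    language = payload.get("language", "unknown")
    duration = payload.get("duration", 0)
    lines = [f"{language} audio, {duration:.1f}s"]
    for segment in payload.get("segments", []):
        # One line per segment: start and end times in seconds, then the text.
        text = segment["text"].strip()
        lines.append(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {text}")
    return lines
```

The function reads only the keys it needs, so any extra fields in the response are ignored.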
cURL
curl -X POST "https://api.lemondata.cc/v1/audio/translations" \
  -H "Authorization: Bearer sk-your-api-key" \
  -F "file=@german_audio.mp3" \
  -F "model=whisper-1"
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you from?"
}
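The same request could be made from Python using only the standard library. This is a minimal sketch rather than an official client: the helper names are hypothetical, the multipart body is assembled by hand, and `translate` assumes the default JSON response with a `text` field, as in the example above.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.lemondata.cc/v1/audio/translations"


def build_translation_request(api_key, filename, audio_bytes, model="whisper-1"):
    """Build (without sending) a multipart/form-data POST mirroring the cURL call."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + closing

    request = urllib.request.Request(API_URL, data=body, method="POST")
    request.add_header("Authorization", f"Bearer {api_key}")
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return request


def translate(api_key, path):
    """Upload the file at `path` and return the English text (performs a network call)."""
    with open(path, "rb") as handle:
        request = build_translation_request(api_key, os.path.basename(path), handle.read())
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

Calling `translate("sk-your-api-key", "german_audio.mp3")` would then correspond to the cURL command above. Keeping request construction separate from sending makes the request easy to inspect without network access.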
Translation vs Transcription
Feature             Translation                        Transcription
Output language     Always English                     Same as input
Use case            Convert foreign audio to English   Preserve original language
Language parameter  Not applicable                     Optional hint
The translation endpoint automatically detects the source language and translates to English. The language parameter from transcription is ignored.