Whisper-Large-v3-Turbo

OpenAI · Transcription
POST /v1/audio/transcriptions

Controlled, self-hosted Whisper Large v3 Turbo transcription that exposes multilingual ASR, translation to English, voice activity detection (VAD), timestamps, subtitle output, hotwords, and decoder controls.

At a glance

| Field | Value |
| --- | --- |
| Model id | whisper-large-v3-turbo |
| Input modalities | Audio |
| Output modalities | Text |
| Context window | |
| Weight precision | FP16 |
| Features | transcription, translation, multilingual, word_timestamps, hotwords, srt_vtt |
| Native inference | Yes |
| New | Yes |
| Supported endpoints | POST /v1/audio/transcriptions |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Controlled transcription | per minute of audio | $0.005 (was $0.006) |
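
To make the per-minute rate concrete, the sketch below estimates the charge for a clip of known duration. It assumes usage is prorated by the second with no minimum charge or rounding; this page does not state the billing granularity, so treat the result as an estimate only. `estimate_cost` is a hypothetical helper, not part of the API.

```python
from decimal import Decimal

# Rate from the pricing table above (USD per minute of audio).
RATE_PER_MINUTE = Decimal("0.005")


def estimate_cost(duration_seconds, rate_per_minute=RATE_PER_MINUTE):
    """Estimate the charge in USD, assuming per-second proration (an assumption)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Convert via str() so float inputs such as 90.5 do not carry binary noise.
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return minutes * rate_per_minute
```

For example, `estimate_cost(3600)` covers a one-hour recording: 60 minutes at $0.005 per minute.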

Example request

curl https://api.empiriolabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -F model=whisper-large-v3-turbo \
  -F file=@meeting.mp3

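The same upload can be sketched in Python using only the standard library. The form field names `model` and `file` are taken from the curl example; the helper names `build_transcription_request` and `transcribe_file` are hypothetical, and the sketch assumes the endpoint accepts a generic `application/octet-stream` content type for the file part.

```python
import urllib.request
import uuid

API_URL = "https://api.empiriolabs.ai/v1/audio/transcriptions"


def build_transcription_request(audio_bytes, filename, api_key,
                                model="whisper-large-v3-turbo"):
    """Build (but do not send) a multipart/form-data POST like the curl call."""
    boundary = uuid.uuid4().hex
    parts = [
        # Plain form field: model
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode("utf-8"),
        # File field: file (content type is an assumption)
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
        + audio_bytes + b"\r\n",
        # Closing boundary
        f"--{boundary}--\r\n".encode("utf-8"),
    ]
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe_file(path, api_key):
    """Read a local file, send it, and return the raw response body as text."""
    with open(path, "rb") as f:
        request = build_transcription_request(f.read(), path, api_key)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `transcribe_file("meeting.mp3", os.environ["EMPIRIOLABS_API_KEY"])` (after `import os`) would perform the request. Building the request separately from sending it keeps the encoding logic easy to inspect without network access.
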
Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio_url | string | no | | URL of the audio file to transcribe. Mutually exclusive with audio_base64. |
| audio_base64 | string | no | | Base64-encoded audio bytes. Mutually exclusive with audio_url. |
| audio_suffix | string | no | ".audio" | File extension hint (mp3, wav, m4a, etc.) when the audio source has no recognizable extension. |
| language | string | no | | ISO 639-1 language code (en, es, fr, etc.). Leave blank for auto-detection. |
| task | enum | no | "transcribe" | transcribe = same language, translate = translate to English. · Allowed: transcribe, translate |
| beam_size | integer | no | 5 | Beam search width. Higher = more accurate but slower. · Range: 1 – 32 |
| best_of | integer | no | 5 | Number of candidates sampled when temperature > 0. · Range: 1 – 32 |
| patience | number | no | 1.0 | Beam search patience factor. Higher = explore more candidates. · Range: 0.0 – 10.0 |
| length_penalty | number | no | 1.0 | Penalty applied to longer transcripts. Negative encourages shorter output. · Range: -10.0 – 10.0 |
| repetition_penalty | number | no | 1.0 | Penalty for repeating tokens. >1 reduces repetition. · Range: 0.1 – 5.0 |
| no_repeat_ngram_size | integer | no | 0 | Block any n-gram of this size from repeating in the output. · Range: 0 – 20 |
| temperature | string | no | "0,0.2,0.4,0.6,0.8,1" | Comma-separated sampling temperatures, tried in order when a segment fails the thresholds below. 0 = deterministic, higher = more variation. |
| compression_ratio_threshold | number | no | 2.4 | Treat output with compression ratio above this as failed and retry. |
| log_prob_threshold | number | no | -1.0 | Treat segments with average log-prob below this as failed and retry. |
| no_speech_threshold | number | no | 0.6 | Mark a segment as silent when no-speech probability exceeds this AND average log-prob is below log_prob_threshold. |
| condition_on_previous_text | boolean | no | true | Use prior transcript as conditioning for the next segment. |
| prompt_reset_on_temperature | number | no | 0.5 | Reset the conditioning prompt when temperature falls back during retry. · Range: 0.0 – 1.0 |
| initial_prompt | string | no | | Initial text prompt to guide vocabulary and style. |
| prefix | string | no | | Text to prepend to the first segment’s transcript. |
| suppress_blank | boolean | no | true | Suppress empty outputs at the start of each segment. |
| suppress_tokens | string | no | "-1" | Comma-separated token IDs to suppress during decoding. |
| without_timestamps | boolean | no | false | Strip per-segment timestamps from the response. |
| word_timestamps | boolean | no | false | Include per-word timestamps in the response. |
| prepend_punctuations | string | no | | Punctuation characters to merge with the following word. |
| append_punctuations | string | no | | Punctuation characters to merge with the preceding word. |
| max_initial_timestamp | number | no | 1.0 | Cap the first segment’s start time to this many seconds. · Range: 0.0 – 30.0 |
| multilingual | boolean | no | false | Allow language switching within a single audio file. |
| vad_filter | boolean | no | true | Apply Silero VAD to remove silence before decoding. |
| vad_parameters | object | no | | VAD configuration as JSON (threshold, min_speech_duration_ms, etc.). |
| max_new_tokens | integer | no | | Cap on decoded tokens per segment. |
| chunk_length | integer | no | | Length of each audio chunk in seconds before decoding. |
| clip_timestamps | string | no | "0" | Only decode within these start,end ranges, given in seconds as a flat comma-separated list. Format: "0.5,12.3,15.0,30.0". |
| hallucination_silence_threshold | number | no | | Treat long silent sections above this many seconds as hallucinations and skip them. |
| hotwords | string | no | | Comma-separated hotwords to bias decoding toward (proper nouns, jargon). |
| language_detection_threshold | number | no | 0.5 | Confidence threshold for auto language detection. |
| language_detection_segments | integer | no | 1 | Number of leading segments to use for language detection. · Range: 1 – 20 |
| include_tokens | boolean | no | false | Include raw token IDs alongside each word/segment. |
| response_format | enum | no | "verbose_json" | Format of the response body. · Allowed: verbose_json, json, text, srt, vtt |
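
Checking the documented ranges and allowed values on the client before sending a request can catch mistakes early. The sketch below is one way to do that. `build_fields` is a hypothetical helper, and it assumes that parameters travel as string form fields, that booleans are encoded as `true`/`false`, and that list-valued options are joined with commas; none of these encodings is confirmed by this page.

```python
import base64

# Inclusive numeric ranges copied from the parameter table.
RANGES = {
    "beam_size": (1, 32),
    "best_of": (1, 32),
    "patience": (0.0, 10.0),
    "length_penalty": (-10.0, 10.0),
    "repetition_penalty": (0.1, 5.0),
    "no_repeat_ngram_size": (0, 20),
    "prompt_reset_on_temperature": (0.0, 1.0),
    "max_initial_timestamp": (0.0, 30.0),
    "language_detection_segments": (1, 20),
}

# Allowed values for the enum parameters.
ENUMS = {
    "task": {"transcribe", "translate"},
    "response_format": {"verbose_json", "json", "text", "srt", "vtt"},
}


def build_fields(audio_url=None, audio_bytes=None, **params):
    """Validate parameters and return a dict of string form fields."""
    if audio_url is not None and audio_bytes is not None:
        raise ValueError("audio_url and audio_base64 are mutually exclusive")
    fields = {}
    if audio_url is not None:
        fields["audio_url"] = audio_url
    if audio_bytes is not None:
        fields["audio_base64"] = base64.b64encode(audio_bytes).decode("ascii")
    for name, value in params.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low} to {high}")
        if name in ENUMS and value not in ENUMS[name]:
            raise ValueError(f"{name} must be one of {sorted(ENUMS[name])}")
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"  # assumed encoding
        elif isinstance(value, (list, tuple)):
            fields[name] = ",".join(str(v) for v in value)  # e.g. temperature
        else:
            fields[name] = str(value)
    return fields
```

A call such as `build_fields(audio_url="https://example.com/a.mp3", task="translate", beam_size=8)` returns a flat dict that can be sent as form fields, while `build_fields(beam_size=64)` raises `ValueError` before any request is made.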

Notes

Supports URL/base64 audio, language/task, beam and temperature fallback controls, VAD/chunking, hotwords, prompts, word timestamps, punctuation controls, token debug output, and JSON/text/SRT/VTT formats.
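
The temperature fallback controls interact with `compression_ratio_threshold` and `log_prob_threshold`. The sketch below illustrates the general idea, modeled on the fallback loop in the open-source Whisper reference implementation. Whether this service behaves identically is an assumption, `decode` is a hypothetical stand-in for the model, and the `no_speech_threshold` check is omitted for brevity.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Attempt:
    text: str
    avg_logprob: float


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def parse_temperatures(value: str) -> List[float]:
    """Turn the documented string form, e.g. "0,0.2,0.4", into floats."""
    return [float(part) for part in value.split(",") if part.strip()]


def decode_with_fallback(
    decode: Callable[[float], Attempt],
    temperatures: Sequence[float],
    compression_ratio_threshold: float = 2.4,
    log_prob_threshold: float = -1.0,
) -> Attempt:
    """Try each temperature in order and stop at the first acceptable attempt."""
    if not temperatures:
        raise ValueError("temperatures must not be empty")
    attempt = None
    for temperature in temperatures:
        attempt = decode(temperature)
        too_repetitive = compression_ratio(attempt.text) > compression_ratio_threshold
        too_unlikely = attempt.avg_logprob < log_prob_threshold
        if not (too_repetitive or too_unlikely):
            break  # attempt passed both checks, no further retries
    # If every temperature failed, the last attempt is returned as-is.
    return attempt
```

With the default list `"0,0.2,0.4,0.6,0.8,1"`, decoding starts deterministically and only moves to higher temperatures when an attempt looks repetitive or improbable.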


Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/whisper-large-v3-turbo.
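
A minimal sketch of fetching that schema from Python follows. The helper names are hypothetical. This page does not say whether the endpoint requires authentication or what shape the response takes, so the API key is optional here and the body is simply parsed as JSON.

```python
import json
import urllib.request

MODELS_URL = "https://api.empiriolabs.ai/v1/models"


def build_schema_request(model_id, api_key=None):
    """Build the GET request for a model's machine-readable schema."""
    headers = {"Accept": "application/json"}
    if api_key:
        # Sending the key is an assumption; the page does not state auth rules.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{MODELS_URL}/{model_id}", headers=headers, method="GET"
    )


def fetch_schema(model_id, api_key=None):
    """Send the request and return the decoded JSON document."""
    with urllib.request.urlopen(build_schema_request(model_id, api_key)) as response:
        return json.load(response)
```

`fetch_schema("whisper-large-v3-turbo")` performs the network call; `build_schema_request` can be inspected on its own without one.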