Whisper-Large-v3-Turbo

OpenAI · Transcription
POST /v1/audio/transcriptions
Controlled self-hosted Whisper Large v3 Turbo transcription with multilingual ASR, translation, VAD, timestamps, subtitles, hotwords, and decoder controls exposed.
At a glance
| Field | Value |
|---|---|
| Model id | whisper-large-v3-turbo |
| Input modalities | Audio |
| Output modalities | Text |
| Context window | — |
| Weight precision | FP16 |
| Features | transcription, translation, multilingual, word_timestamps, hotwords, srt_vtt |
| Native inference | Yes |
| New | Yes |
| Supported endpoints | POST /v1/audio/transcriptions |
Pricing
| Charge | Spec | Rate |
|---|---|---|
| Controlled transcription | per minute of audio | $0.005 (was $0.006) |
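
As a rough illustration of the per-minute rate above, the sketch below estimates the charge from an audio duration. It assumes billing is strictly proportional to duration, with no rounding and no minimum charge; this page does not state the billing granularity. `estimate_cost_usd` is a hypothetical helper, not part of the API.

```python
from decimal import Decimal

# Rate taken from the pricing table above (USD per minute of audio).
RATE_PER_MINUTE_USD = Decimal("0.005")


def estimate_cost_usd(duration_seconds: float) -> Decimal:
    """Estimate the charge, assuming purely proportional billing (an assumption)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return minutes * RATE_PER_MINUTE_USD


# A 90-minute recording is 90 * 0.005 USD under this assumption.
print(estimate_cost_usd(90 * 60))
```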
Example request
    curl https://api.empiriolabs.ai/v1/audio/transcriptions \
      -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
      -F model=whisper-large-v3-turbo \
      -F file=@meeting.mp3
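
The same multipart upload can be sketched in Python with only the standard library. The endpoint, the `model` and `file` form fields, and the bearer header come from the curl example. The helper names `build_multipart` and `build_request` are hypothetical, and the `application/octet-stream` content type for the file part is an assumption.

```python
import os
import urllib.request
import uuid

API_URL = "https://api.empiriolabs.ai/v1/audio/transcriptions"


def build_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed content type
    ).encode()
    parts.append(file_header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(audio_path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a local audio file."""
    with open(audio_path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": "whisper-large-v3-turbo"},
        "file",
        os.path.basename(audio_path),
        audio,
    )
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )


# Sending the request (requires network access and a valid key):
# req = build_request("meeting.mp3", os.environ["EMPIRIOLABS_API_KEY"])
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```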
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_url | string | no | — | URL of the audio file to transcribe. Mutually exclusive with audio_base64. |
| audio_base64 | string | no | — | Base64-encoded audio bytes. Mutually exclusive with audio_url. |
| audio_suffix | string | no | ".audio" | File extension hint (mp3, wav, m4a, etc.) when the audio source has no recognizable extension. |
| language | string | no | — | ISO 639-1 language code (en, es, fr, etc.). Leave blank for auto-detection. |
| task | enum | no | "transcribe" | transcribe = same language, translate = translate to English. · Allowed: transcribe, translate |
| beam_size | integer | no | 5 | Beam search width. Higher = more accurate but slower. · Range: 1 – 32 |
| best_of | integer | no | 5 | Number of candidates to sample when temperature > 0. · Range: 1 – 32 |
| patience | number | no | 1.0 | Beam search patience factor. Higher = explore more candidates. · Range: 0.0 – 10.0 |
| length_penalty | number | no | 1.0 | Penalty applied to longer transcripts. Negative encourages shorter output. · Range: -10.0 – 10.0 |
| repetition_penalty | number | no | 1.0 | Penalty for repeating tokens. >1 reduces repetition. · Range: 0.1 – 5.0 |
| no_repeat_ngram_size | integer | no | 0 | Block any n-gram of this size from repeating in the output. · Range: 0 – 20 |
| temperature | string | no | "0,0.2,0.4,0.6,0.8,1" | Comma-separated sampling temperatures, tried in order as fallbacks when a decode fails the thresholds below. 0 = deterministic, higher = more variation. |
| compression_ratio_threshold | number | no | 2.4 | Treat output with compression ratio above this as failed and retry. |
| log_prob_threshold | number | no | -1.0 | Treat segments with average log-prob below this as failed and retry. |
| no_speech_threshold | number | no | 0.6 | Mark a segment as silent when no-speech probability exceeds this AND average log-prob is below log_prob_threshold. |
| condition_on_previous_text | boolean | no | true | Use prior transcript as conditioning for the next segment. |
| prompt_reset_on_temperature | number | no | 0.5 | Reset the conditioning prompt when a fallback retry uses a temperature above this value. · Range: 0.0 – 1.0 |
| initial_prompt | string | no | — | Initial text prompt to guide vocabulary and style. |
| prefix | string | no | — | Text to prepend to the first segment’s transcript. |
| suppress_blank | boolean | no | true | Suppress empty outputs at the start of each segment. |
| suppress_tokens | string | no | "-1" | Comma-separated token IDs to suppress during decoding. |
| without_timestamps | boolean | no | false | Strip per-segment timestamps from the response. |
| word_timestamps | boolean | no | false | Include per-word timestamps in the response. |
| prepend_punctuations | string | no | — | Punctuation characters to merge with the following word. |
| append_punctuations | string | no | — | Punctuation characters to merge with the preceding word. |
| max_initial_timestamp | number | no | 1.0 | Cap the first segment’s start time to this many seconds. · Range: 0.0 – 30.0 |
| multilingual | boolean | no | false | Allow language switching within a single audio file. |
| vad_filter | boolean | no | true | Apply Silero VAD to remove silence before decoding. |
| vad_parameters | object | no | — | VAD configuration as JSON (threshold, min_speech_duration_ms, etc.). |
| max_new_tokens | integer | no | — | Cap on decoded tokens per segment. |
| chunk_length | integer | no | — | Length of each audio chunk in seconds before decoding. |
| clip_timestamps | string | no | "0" | Only decode within these (start, end) second ranges. Format: “0.5,12.3,15.0,30.0”. |
| hallucination_silence_threshold | number | no | — | Treat long silent sections above this many seconds as hallucinations and skip them. |
| hotwords | string | no | — | Comma-separated hotwords to bias decoding toward (proper nouns, jargon). |
| language_detection_threshold | number | no | 0.5 | Confidence threshold for auto language detection. |
| language_detection_segments | integer | no | 1 | Number of leading segments to use for language detection. · Range: 1 – 20 |
| include_tokens | boolean | no | false | Include raw token IDs alongside each word/segment. |
| response_format | enum | no | "verbose_json" | Format of the response body. · Allowed: verbose_json, json, text, srt, vtt |
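
The ranges and allowed values in the table lend themselves to a client-side check before a request is sent. The following is a minimal sketch: `build_params` is a hypothetical helper, and the server may validate differently or more strictly. The bounds are copied from the table and treated as inclusive, which is an assumption.

```python
from typing import Any

# Ranges copied from the parameter table; inclusive bounds are assumed.
RANGES = {
    "beam_size": (1, 32),
    "best_of": (1, 32),
    "patience": (0.0, 10.0),
    "length_penalty": (-10.0, 10.0),
    "repetition_penalty": (0.1, 5.0),
    "no_repeat_ngram_size": (0, 20),
    "prompt_reset_on_temperature": (0.0, 1.0),
    "max_initial_timestamp": (0.0, 30.0),
    "language_detection_segments": (1, 20),
}

# Enum values copied from the parameter table.
ALLOWED = {
    "task": {"transcribe", "translate"},
    "response_format": {"verbose_json", "json", "text", "srt", "vtt"},
}


def build_params(**params: Any) -> dict:
    """Validate documented constraints and drop parameters set to None."""
    if params.get("audio_url") is not None and params.get("audio_base64") is not None:
        raise ValueError("audio_url and audio_base64 are mutually exclusive")
    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    for name, allowed in ALLOWED.items():
        value = params.get(name)
        if value is not None and value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    return {key: value for key, value in params.items() if value is not None}


params = build_params(
    audio_url="https://example.com/meeting.mp3",  # placeholder URL
    task="translate",
    beam_size=8,
    response_format="srt",
)
print(params)
```

Omitted parameters are left out of the result rather than filled with defaults, so the server-side defaults in the table still apply.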
Notes
Supports URL/base64 audio, language/task, beam and temperature fallback controls, VAD/chunking, hotwords, prompts, word timestamps, punctuation controls, token debug output, and JSON/text/SRT/VTT formats.
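
Two of the controls listed above take comma-separated strings: `temperature` (default "0,0.2,0.4,0.6,0.8,1") and `clip_timestamps` (format “0.5,12.3,15.0,30.0”). The sketch below builds both from Python values. The helper names are hypothetical, and the sanity checks on clip ranges (non-negative start, end after start) are this example's assumptions, not documented server rules.

```python
from typing import Iterable, List, Tuple


def format_temperatures(temps: Iterable[float]) -> str:
    """Join a fallback schedule into the comma-separated string form."""
    return ",".join(f"{t:g}" for t in temps)


def parse_temperatures(value: str) -> List[float]:
    """Split the comma-separated string form back into floats."""
    return [float(part) for part in value.split(",") if part.strip()]


def format_clip_timestamps(ranges: Iterable[Tuple[float, float]]) -> str:
    """Flatten (start, end) second pairs into "start,end,start,end"."""
    flat = []
    for start, end in ranges:
        if start < 0 or end <= start:  # assumed sanity check
            raise ValueError(f"invalid clip range: ({start}, {end})")
        flat.extend([str(float(start)), str(float(end))])
    return ",".join(flat)


print(format_temperatures([0, 0.2, 0.4, 0.6, 0.8, 1]))
print(format_clip_timestamps([(0.5, 12.3), (15.0, 30.0)]))
```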
Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/whisper-large-v3-turbo.
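
A minimal sketch of fetching that schema with the standard library follows. It assumes the response body is JSON and that a bearer token may be required; this page confirms neither. `build_schema_request` and `fetch_schema` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional

SCHEMA_URL = "https://api.empiriolabs.ai/v1/models/whisper-large-v3-turbo"


def build_schema_request(api_key: Optional[str] = None) -> urllib.request.Request:
    """Build the GET request; the Authorization header is sent only if a key is given."""
    headers = {"Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumed auth scheme
    return urllib.request.Request(SCHEMA_URL, headers=headers, method="GET")


def fetch_schema(api_key: Optional[str] = None) -> dict:
    """Send the request and decode the body as JSON (assumed format)."""
    with urllib.request.urlopen(build_schema_request(api_key), timeout=30) as resp:
        return json.load(resp)


# Requires network access:
# schema = fetch_schema(api_key="...")
# print(json.dumps(schema, indent=2))
```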
