Whisper Large v3 Turbo | EmpirioLabs AI Docs

OpenAI · Transcription

POST /v1/audio/transcriptions

Controlled, self-hosted Whisper Large v3 Turbo transcription. Exposes multilingual ASR, translation to English, VAD, segment and word timestamps, SRT/VTT subtitles, hotwords, and decoder controls.

At a glance

Field	Value
Model id	`whisper-large-v3-turbo`
Model release date	2024-10-01
Input modalities	Audio
Output modalities	Text
Context window	-
Weight precision	FP16
Features	transcription, translation, multilingual, word_timestamps, hotwords, srt_vtt
Native inference	Yes
New	No
Supported endpoints	`POST /v1/audio/transcriptions`

Pricing

Charge	Spec	Rate
Controlled transcription	per minute of audio	$0.005 (was $0.006)

Example request

$ curl https://api.empiriolabs.ai/v1/audio/transcriptions \
>   -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
>   -F model=whisper-large-v3-turbo \
>   -F file=@meeting.mp3
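
For callers who prefer Python, the same multipart request can be assembled with the standard library alone. This is a minimal sketch, not an official client: the helper name `build_transcription_request` is hypothetical, the `audio/mpeg` content type is an assumption, and the function only constructs the request without sending it.

```python
import uuid
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/audio/transcriptions"


def build_transcription_request(audio_bytes, filename, api_key,
                                model="whisper-large-v3-turbo"):
    """Build (but do not send) a multipart/form-data POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    head = (
        # Plain form field, equivalent to: -F model=...
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        # File field, equivalent to: -F file=@...
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/mpeg\r\n\r\n"  # assumed content type for an MP3
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )

# To send it, read the file and pass the request to urllib.request.urlopen:
#   with open("meeting.mp3", "rb") as f:
#       req = build_transcription_request(f.read(), "meeting.mp3", api_key)
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```

Keeping construction separate from sending makes the request easy to inspect or log before any audio leaves the machine.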

Parameters

Parameter	Type	Required	Default	Description
`audio_url`	string	no	-	URL of the audio file to transcribe. Mutually exclusive with audio_base64.
`audio_base64`	string	no	-	Base64-encoded audio bytes. Mutually exclusive with audio_url.
`audio_suffix`	string	no	`".audio"`	File extension hint (mp3, wav, m4a, etc.) when the audio source has no recognizable extension.
`language`	string	no	-	ISO 639-1 language code (en, es, fr, etc.). Leave blank for auto-detection.
`task`	enum	no	`"transcribe"`	transcribe = same language, translate = translate to English. · Allowed: `transcribe`, `translate`
`beam_size`	integer	no	`5`	Beam search width. Higher = more accurate but slower. · Range: 1 – 32
`best_of`	integer	no	`5`	Number of candidates to sample when temperature > 0. · Range: 1 – 32
`patience`	number	no	`1.0`	Beam search patience factor. Higher = explore more candidates. · Range: 0.0 – 10.0
`length_penalty`	number	no	`1.0`	Penalty applied to longer transcripts. Negative encourages shorter output. · Range: -10.0 – 10.0
`repetition_penalty`	number	no	`1.0`	Penalty for repeating tokens. >1 reduces repetition. · Range: 0.1 – 5.0
`no_repeat_ngram_size`	integer	no	`0`	Block any n-gram of this size from repeating in the output. · Range: 0 – 20
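`temperature`	string	no	`"0,0.2,0.4,0.6,0.8,1"`	Comma-separated sampling temperatures, tried in order as fallbacks when a decode fails the compression-ratio or log-prob threshold. 0 = deterministic, higher = more variation.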
`compression_ratio_threshold`	number	no	`2.4`	Treat output with compression ratio above this as failed and retry.
`log_prob_threshold`	number	no	`-1.0`	Treat segments with average log-prob below this as failed and retry.
`no_speech_threshold`	number	no	`0.6`	Mark a segment as silent when no-speech probability exceeds this AND log-prob is below threshold.
`condition_on_previous_text`	boolean	no	true	Use prior transcript as conditioning for the next segment.
`prompt_reset_on_temperature`	number	no	`0.5`	Reset the conditioning prompt when a retry decodes at a temperature above this value. · Range: 0.0 – 1.0
`initial_prompt`	string	no	-	Initial text prompt to guide vocabulary and style.
`prefix`	string	no	-	Text to prepend to the first segment’s transcript.
`suppress_blank`	boolean	no	true	Suppress empty outputs at the start of each segment.
`suppress_tokens`	string	no	`"-1"`	Comma-separated token IDs to suppress during decoding.
`without_timestamps`	boolean	no	false	Strip per-segment timestamps from the response.
`word_timestamps`	boolean	no	false	Include per-word timestamps in the response.
`prepend_punctuations`	string	no	-	Punctuation characters to merge with the following word.
`append_punctuations`	string	no	-	Punctuation characters to merge with the preceding word.
`max_initial_timestamp`	number	no	`1.0`	Cap the first segment’s start time to this many seconds. · Range: 0.0 – 30.0
`multilingual`	boolean	no	false	Allow language switching within a single audio file.
`vad_filter`	boolean	no	true	Apply Silero VAD to remove silence before decoding.
`vad_parameters`	object	no	-	VAD configuration as JSON (threshold, min_speech_duration_ms, etc.).
`max_new_tokens`	integer	no	-	Cap on decoded tokens per segment.
`chunk_length`	integer	no	-	Length of each audio chunk in seconds before decoding.
`clip_timestamps`	string	no	`"0"`	Only decode within these time ranges, given as comma-separated start,end pairs in seconds. Format: “0.5,12.3,15.0,30.0”.
`hallucination_silence_threshold`	number	no	-	Skip silent sections longer than this many seconds when a possible hallucination is detected.
`hotwords`	string	no	-	Comma-separated hotwords to bias decoding toward (proper nouns, jargon).
`language_detection_threshold`	number	no	`0.5`	Confidence threshold for auto language detection.
`language_detection_segments`	integer	no	`1`	Number of leading segments to use for language detection. · Range: 1 – 20
`include_tokens`	boolean	no	false	Include raw token IDs alongside each word/segment.
`response_format`	enum	no	`"verbose_json"`	Format of the response body. · Allowed: `verbose_json`, `json`, `text`, `srt`, `vtt`
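
Several constraints in the table (mutual exclusivity, numeric ranges, allowed enum values, comma-separated strings) can be checked on the client before a request is sent. The sketch below is illustrative only: `build_fields` and `encode_value` are hypothetical helpers, and the wire encoding they assume (every option sent as a string field, booleans as `true`/`false`, `vad_parameters` as a JSON string) is not confirmed by this page. The VAD values in the usage comment are placeholders, not recommended settings.

```python
import json

MODEL_ID = "whisper-large-v3-turbo"

# Numeric ranges copied from the parameter table.
RANGES = {
    "beam_size": (1, 32),
    "best_of": (1, 32),
    "patience": (0.0, 10.0),
    "length_penalty": (-10.0, 10.0),
    "repetition_penalty": (0.1, 5.0),
    "no_repeat_ngram_size": (0, 20),
    "prompt_reset_on_temperature": (0.0, 1.0),
    "max_initial_timestamp": (0.0, 30.0),
    "language_detection_segments": (1, 20),
}

# Allowed enum values copied from the parameter table.
ALLOWED = {
    "task": {"transcribe", "translate"},
    "response_format": {"verbose_json", "json", "text", "srt", "vtt"},
}


def encode_value(value):
    """Turn a Python value into a field string (this encoding is an assumption)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, dict):
        return json.dumps(value)
    if isinstance(value, (list, tuple)):
        return ",".join(str(item) for item in value)
    return str(value)


def build_fields(audio_url=None, audio_base64=None, **options):
    """Validate documented constraints and return a dict of string fields."""
    if audio_url is not None and audio_base64 is not None:
        raise ValueError("audio_url and audio_base64 are mutually exclusive")
    fields = {"model": MODEL_ID}
    if audio_url is not None:
        fields["audio_url"] = audio_url
    if audio_base64 is not None:
        fields["audio_base64"] = audio_base64
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}")
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED[name])}")
        fields[name] = encode_value(value)
    return fields

# Example call (values are placeholders):
#   build_fields(
#       audio_url="https://example.com/meeting.mp3",
#       task="translate",
#       temperature=[0, 0.2, 0.4],
#       clip_timestamps=[0.5, 12.3, 15.0, 30.0],
#       vad_parameters={"threshold": 0.5},
#   )
```

Validating locally turns an out-of-range `beam_size` or a misspelled `response_format` into an immediate error rather than a failed API call.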

Notes

Supports URL/base64 audio, language/task, beam and temperature fallback controls, VAD/chunking, hotwords, prompts, word timestamps, punctuation controls, token debug output, and JSON/text/SRT/VTT formats.
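
The temperature fallback controls can be pictured as a retry loop over the `temperature` list, gated by `compression_ratio_threshold` and `log_prob_threshold`. The sketch below assumes the service follows the fallback logic of the open-source Whisper decoder, which this page does not confirm. `decode_once` is a hypothetical stand-in for a single decoding pass that returns `(text, avg_logprob)`, and the `no_speech_threshold` check is omitted for brevity.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw to zlib-compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode_once,
                         temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         compression_ratio_threshold=2.4,
                         log_prob_threshold=-1.0):
    """Try each temperature in order until a decode passes both thresholds.

    Returns (temperature_used, text, avg_logprob). If every attempt fails,
    the result of the last attempt is returned.
    """
    used, text, avg_logprob = None, "", float("-inf")
    for temperature in temperatures:
        used = temperature
        text, avg_logprob = decode_once(temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < log_prob_threshold
        if not (too_repetitive or too_uncertain):
            break  # this attempt is acceptable, stop retrying
    return used, text, avg_logprob
```

Under this model, a segment that loops on the same phrase at temperature 0 compresses unusually well, fails the ratio check, and is retried at 0.2, then 0.4, and so on.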

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/whisper-large-v3-turbo.
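
The schema endpoint can be queried with the standard library. This is a minimal sketch with hypothetical helper names: whether the endpoint requires the `Authorization` header is an assumption, and because the response shape is not documented on this page, the code simply returns the parsed JSON.

```python
import json
import urllib.request

SCHEMA_URL = "https://api.empiriolabs.ai/v1/models/whisper-large-v3-turbo"


def build_schema_request(api_key=None):
    """Build the GET request; the bearer token is attached only if provided."""
    headers = {"Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(SCHEMA_URL, headers=headers, method="GET")


def fetch_schema(api_key=None, timeout=30):
    """Send the request and return the decoded JSON document."""
    request = build_schema_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```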