Question: LLMs generate responses to user queries in a two-step process. The first is prefill, where the tokens in the input prompt are processed in parallel; the second is decoding, where the response text is generated one token at a time. A user sees the first token of the response once the prefill step is finished. From then on, the time taken for the entire response depends on the time it takes to generate each subsequent output token. The latency of an LLM service is defined as the time taken by the model to generate the full response after a user submits their query.
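Under the definition above, the latency works out to the prefill time plus the per-token decoding time multiplied by the number of output tokens. The following is a minimal Python sketch of that calculation; the function name and the example values (350 ms prefill, 20 ms per token, a 200-token response) are assumptions for illustration only and are not the values used in the question itself.

def llm_latency_seconds(prefill_ms: float, per_token_ms: float, num_output_tokens: int) -> float:
    """Total response latency: prefill time plus the decode time for every output token."""
    total_ms = prefill_ms + per_token_ms * num_output_tokens
    return total_ms / 1000.0  # convert milliseconds to seconds

# Example with made-up numbers: 350 ms prefill, 20 ms per token, 200 output tokens.
print(llm_latency_seconds(prefill_ms=350, per_token_ms=20, num_output_tokens=200))  # 4.35 seconds

The same arithmetic applies to whatever numbers the question supplies: convert the total milliseconds to seconds at the end so the result matches the units of the answer choices.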
It takes ___ milliseconds in the prefill step to process a user's query prompt, while the time taken per output token is ___ milliseconds. A typical response consists of ___ tokens. What is the latency of the LLM service?
a. ___ seconds
b. ___ seconds
c. ___ seconds
d. ___ seconds
e. ___ seconds
