Question: Exercise 4 (8 points)

Training a large language model depends on various architectural choices. However, in recent papers such as Kaplan et al. (2020) and the "Chinchilla" paper (Hoffmann et al., 2022), people noticed that the performance of an LLM can be predicted quite accurately by just two quantities, i.e., (1) the number N of model parameters, and (2) the total number D of tokens the model is trained on.

The table below contains data from the training of various LLM systems.

LLM          N - Parameters (billions)   D - Tokens (billions)   Loss
GPT-2        1                           21                      2.527663
GPT-3        175                         300                     2.001097
Gopher       280                         300                     1.994691
Chinchilla   70                          1400                    1.936333
PaLM         540                         780                     1.923154

(a) Determine a power law expressing the relation between the loss, the number of parameters N and the number of tokens D used during training. (Hint: use least squares and a model of the form a N^{-0.34} + b D^{-0.28} + c.)
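Because the exponents are fixed by the hint, the model is linear in a, b and c, so ordinary linear least squares applies. The following is a minimal sketch under stated assumptions, not a reference solution: N and D are entered in billions exactly as listed in the table, and the helper names (`design_row`, `solve_linear`, `fit_power_law`) are illustrative. It builds the normal equations and solves them by Gaussian elimination; a library routine such as `numpy.linalg.lstsq` could be used instead.

```python
# Table data copied from the exercise: (name, N in billions, D in billions, loss).
DATA = [
    ("GPT-2", 1.0, 21.0, 2.527663),
    ("GPT-3", 175.0, 300.0, 2.001097),
    ("Gopher", 280.0, 300.0, 1.994691),
    ("Chinchilla", 70.0, 1400.0, 1.936333),
    ("PaLM", 540.0, 780.0, 1.923154),
]
ALPHA, BETA = 0.34, 0.28  # exponents given in the hint


def design_row(n, d):
    """One row of the design matrix: [N^-0.34, D^-0.28, 1]."""
    return [n ** -ALPHA, d ** -BETA, 1.0]


def solve_linear(A, y):
    """Solve the square system A x = y by Gaussian elimination with partial pivoting."""
    size = len(y)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for k in range(col, size + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * size
    for i in reversed(range(size)):  # back substitution
        tail = sum(M[i][k] * x[k] for k in range(i + 1, size))
        x[i] = (M[i][size] - tail) / M[i][i]
    return x


def fit_power_law(data):
    """Least-squares estimate of (a, b, c) via the normal equations X^T X p = X^T y."""
    X = [design_row(n, d) for _, n, d, _ in data]
    y = [loss for *_, loss in data]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]
    return solve_linear(XtX, Xty)


a, b, c = fit_power_law(DATA)
# Residuals: observed loss minus the fitted power law.
residuals = [loss - (a * n ** -ALPHA + b * d ** -BETA + c) for _, n, d, loss in DATA]
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
print("residuals:", [round(r, 4) for r in residuals])
```

The values of a and b depend on the units chosen for N and D (billions here), whereas c and the fitted losses do not.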

(b) Based on this power law, determine possible reductions in the loss for the five LLMs reported in the table under the following scenarios:

(1) an infinite number of parameters,

(2) an infinite amount of tokens.
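Under the fitted model, letting N go to infinity removes the term a N^{-0.34}, so the loss drops by exactly that amount; likewise, letting D go to infinity removes b D^{-0.28}. The sketch below computes both reductions per model. It is an illustration under assumptions: N and D are taken in billions as in the table, the fit is repeated here (solved with Cramer's rule) so the block runs on its own, and the names `det3`, `fit` and `reductions` are hypothetical.

```python
# Table data copied from the exercise: (name, N in billions, D in billions, loss).
DATA = [
    ("GPT-2", 1.0, 21.0, 2.527663),
    ("GPT-3", 175.0, 300.0, 2.001097),
    ("Gopher", 280.0, 300.0, 1.994691),
    ("Chinchilla", 70.0, 1400.0, 1.936333),
    ("PaLM", 540.0, 780.0, 1.923154),
]
ALPHA, BETA = 0.34, 0.28  # exponents given in the hint


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit(data):
    """Least-squares (a, b, c): solve the 3x3 normal equations with Cramer's rule."""
    X = [[n ** -ALPHA, d ** -BETA, 1.0] for _, n, d, _ in data]
    y = [row[3] for row in data]
    A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    base = det3(A)
    coeffs = []
    for col in range(3):
        # Replace column `col` of A by the right-hand side.
        Ac = [[rhs[i] if j == col else A[i][j] for j in range(3)] for i in range(3)]
        coeffs.append(det3(Ac) / base)
    return coeffs


a, b, c = fit(DATA)

# Reduction with infinite parameters: a * N^-0.34; with infinite tokens: b * D^-0.28.
reductions = {name: (a * n ** -ALPHA, b * d ** -BETA) for name, n, d, _ in DATA}

for name, (drop_n, drop_d) in reductions.items():
    print(f"{name:<11} infinite N: -{drop_n:.4f}   infinite D: -{drop_d:.4f}")
print(f"loss floor with both infinite: c = {c:.4f}")
```

In this model, c is the loss that remains when both N and D are infinite, so neither scenario alone can push a model's loss below c plus the remaining term.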
