
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Omkar Rane

Created on February 19, 2026


AMS 691 Presentation


TABLE OF CONTENTS

01. introduction
02. motivation
03. goals & benefits
04. experimental results - arithmetic reasoning
04. experimental results - commonsense reasoning
04. experimental results - symbolic reasoning
05. limitations
06. conclusion & takeaways

01. introduction

what is chain of thought? How do we use it to prompt?

A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output.
In chain-of-thought prompting, a few chain-of-thought demonstrations are given as exemplars in the prompt, i.e. the prompt contains triplets <input, chain of thought, output>.

02. motivation

background

Recent surge of language models in Natural Language Processing tasks
Scale of these models is on an ever-growing upward trend

Motivation

Reasoning abilities could now be unlocked by

Generating natural language rationales to drive intermediate steps in reasoning problems - Ling et al. (2017)
Exploiting in-context few-shot learning via 'prompting' - Brown et al. (2020)

limitations

Rationale-augmented training and finetuning require a large set of high-quality rationales, which is expensive to create
Few-shot prompting methods still work poorly on reasoning tasks, and their performance is mostly unaffected by increasing language model scale

03. Goals & benefits

goals

To show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting

benefits

Allows models to decompose multi-step problems into intermediate steps, so that additional computation can be allocated where more reasoning steps are required
Provides an interpretable window into the behaviour of the model, with opportunities to debug where a reasoning path went wrong
Applicable to a wide variety of tasks - math, commonsense, symbolic - virtually any task that can be solved by humans via language
Chain-of-thought reasoning can be elicited in any sufficiently large off-the-shelf language model, simply by including chain-of-thought exemplars in the prompt

04. i. experimental results - arithmetic reasoning

experimental setup

benchmarks

GSM8K - Benchmark of math word problems
SVAMP - Dataset of math word problems with varying structures
ASDiv - Dataset of diverse math word problems
AQuA - Dataset of algebraic word problems
MAWPS benchmark

language models

GPT-3 (InstructGPT variants): 350M, 1.3B, 6.7B, and 175B parameters
LaMDA: 422M, 2B, 8B, 68B, and 137B parameters
PaLM: 8B, 62B, and 540B parameters
UL2: 20B parameters
Codex: code-davinci-002 in the OpenAI API

standard prompting

Treated as the baseline: standard few-shot prompting, in which the model is given in-context exemplars of input-output pairs before it predicts the output for a test example

chain of thought prompting

Augment each exemplar with a chain of thought. A set of eight few-shot exemplars was manually composed (e.g. Math word problems, CSQA, StrategyQA, Date Understanding, Sports Understanding, SayCan, Last Letter Concatenation, Coin Flip)


experimental setup - example prompts
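
The two prompt formats described in the setup can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Exemplar` class, the helper names, the `Q:`/`A:` layout, and the answer-extraction regex are all assumptions made for this sketch. The only difference between the two prompts is whether each exemplar's answer is preceded by its chain of thought.

```python
import re
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One few-shot demonstration: <input, chain of thought, output>."""
    question: str
    chain_of_thought: str
    answer: str


def standard_prompt(exemplars, question):
    """Baseline: exemplars are plain input-output pairs."""
    blocks = [f"Q: {e.question}\nA: The answer is {e.answer}." for e in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def cot_prompt(exemplars, question):
    """Chain-of-thought: each answer is preceded by its reasoning steps."""
    blocks = [
        f"Q: {e.question}\nA: {e.chain_of_thought} The answer is {e.answer}."
        for e in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion):
    """Pull the final answer out of a completion (hypothetical helper)."""
    match = re.search(r"The answer is\s*(.+?)\.?$", completion.strip())
    return match.group(1) if match else None


demo = Exemplar(
    question=(
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    chain_of_thought=(
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11."
    ),
    answer="11",
)
```

Calling `cot_prompt([demo], new_question)` yields a prompt ending in an open `A:`; the model's completion would then be passed to `extract_answer` for scoring.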

results & robustness

results

Chain of thought prompting

Is an emergent ability of model scale (performance gains appear only at around 100B parameters)
Has larger performance gains for more complicated problems
Performs comparably to models finetuned for the specific task on a labeled training dataset

robustness

Robustness of chain of thought (i.e. its sensitivity to different prompts) was evaluated by comparing chains of thought written by different annotators
Despite some variance among these annotators' prompts, all of them performed better than standard prompting by a large margin
This implies that successful use of chain of thought does not depend on a particular linguistic style

ablation study & insights

equations only

Intuition: Chain of thought produces the mathematical equation to be evaluated
Tested by: Model prompted to output only a mathematical equation before giving the answer
Observed: Approach does not help much in solving the problems
Insight: The semantics of the questions are too challenging to translate directly into an equation without the natural language reasoning steps in chain of thought

variable compute only

Intuition: Chain of thought allows the model to spend more computation (i.e. intermediate tokens) on harder problems
Tested by: Model prompted to output only a sequence of dots equal to the number of characters in the equation needed to solve the problem
Observed: Approach performs about the same as the baseline (prompting without chain of thought)
Insight: Variable computation by itself is not enough; there is utility in expressing intermediate steps via natural language

chain of thought after answer

Intuition: Chain of thought prompts allow the model to better access relevant knowledge acquired during pretraining
Tested by: Giving the chain of thought only after the answer
Observed: Approach performs about the same as the baseline
Insight: Sequential reasoning embodied in chain of thought is helpful, beyond just activating prior knowledge
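
The three ablations differ only in what each exemplar shows before (or after) its answer. The sketch below shows one way the exemplar answers could be formatted; the function names and the exact wording "The answer is ..." are assumptions made for illustration, not the prompts used in the paper.

```python
def equation_only(equation, answer):
    """Ablation 1: show just the equation, with no natural language steps."""
    return f"{equation}. The answer is {answer}."


def variable_compute_only(equation, answer):
    """Ablation 2: replace the equation with one dot per character, so the
    model spends the same number of intermediate characters but sees no content."""
    dots = "." * len(equation)
    return f"{dots} The answer is {answer}."


def reasoning_after_answer(chain_of_thought, answer):
    """Ablation 3: state the answer first, then give the chain of thought,
    so the reasoning cannot contribute to producing the answer."""
    return f"The answer is {answer}. {chain_of_thought}"


equation = "5 + 6 = 11"
print(equation_only(equation, "11"))
print(variable_compute_only(equation, "11"))
print(reasoning_after_answer("5 + 6 = 11.", "11"))
```

Keeping the final "The answer is ..." sentence identical across variants means the same answer parser can score all of them against the standard and chain-of-thought prompts.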

04. ii. experimental results - commonsense reasoning

experimental setup

benchmarks

CSQA - Benchmark of commonsense questions, often requiring prior knowledge
StrategyQA - Benchmark that requires the model to infer a multi-hop strategy to answer questions
BIG-bench Date - Benchmark that requires inferring a date from a given context
BIG-bench Sports - Benchmark that requires judging the plausibility of a sentence relating to sports
SayCan - Involves mapping a natural language instruction to a sequence of robot actions

prompts

Examples were selected from the training set and combined with manually composed chains of thought to turn them into few-shot exemplars.
For datasets without a training set, examples were taken from the evaluation set and prepared in the same way.

results

For all tasks, scaling up model size improved performance of standard prompting
Chain-of-thought prompting led to further gains over standard prompting, and on some benchmarks outperformed the prior state of the art

04. iii. experimental results - symbolic reasoning

experimental setup

Tasks

Last Letter Concatenation: Asks the model to concatenate the last letters of the words in a name
Coin Flip: Asks the model whether a coin is still heads up after a certain number of people either flip or don't flip it

In-domain vs Out-of-Domain test sets

In-domain test sets had the same number of steps as the few-shot exemplars, while out-of-domain test sets had more steps than the exemplars (to elicit reasoning).
Again, chain of thought prompts were manually composed.
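
Both symbolic tasks have simple ground truths, which is what makes them convenient for checking whether a model's chain of thought reaches the right answer. The sketch below computes the expected answers; the function names and the boolean encoding of flips are assumptions chosen for illustration.

```python
def last_letter_concatenation(name):
    """Concatenate the last letter of every word in the name."""
    return "".join(word[-1] for word in name.split())


def coin_is_heads_up(flips, starts_heads_up=True):
    """Track the coin through a sequence of people.

    `flips` holds one boolean per person: True if that person flips the
    coin, False if they leave it alone.
    """
    heads_up = starts_heads_up
    for person_flips in flips:
        if person_flips:
            heads_up = not heads_up
    return heads_up


# In-domain style example: two steps, matching two-step exemplars.
print(last_letter_concatenation("Amy Brown"))    # yn
print(coin_is_heads_up([True, False]))           # False

# Out-of-domain style example: more steps than the exemplars contain.
print(last_letter_concatenation("Amy Brown Lee Stone"))   # ynee
print(coin_is_heads_up([True, False, True, False]))       # True
```

The out-of-domain cases use the same procedure with longer inputs, so a model that has only pattern-matched the exemplar length tends to fail where one that follows the steps does not.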

results

With the largest model (PaLM 540B), chain of thought prompting leads to almost 100% solve rates
Standard prompting already solves some of these problems well with the largest models, but smaller models still fail with few-shot standard prompting

05. limitations

correct reasoning paths?

Although the model produces reasoning paths, there is no guarantee that they are correct

Reasoning undertaken?

Although chain-of-thought emulates reasoning, this does not answer whether the neural network is actually 'reasoning'

real-world applications?

Chain-of-thought reasoning emerges only at large model scales, which makes it costly to serve in real-world applications

cost of finetuning?

The cost of manually augmenting exemplars with chains of thought is small in the few-shot setting, but can be prohibitive at finetuning scale

06. conclusion & takeaways

Chain of thought

Elicits multi-step reasoning behaviour in off-the-shelf large language models of sufficient scale, without the need for finetuning
Improves performance by a large margin on arithmetic reasoning; much stronger than the ablations
Improves performance on commonsense reasoning, which underscores how its linguistic nature makes it generally applicable
On symbolic reasoning, facilitates OOD generalisation to longer sequence lengths
Leads to dramatically increasing scaling curves, expanding the set of tasks that LLMs can perform successfully
Is robust to different annotators, exemplars and language models
