Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Omkar Rane

Created on February 19, 2026

Transcript

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

AMS 691 Presentation

Omkar Rane

TABLE OF CONTENTS

01.introduction

02.motivation

03.GOALS & benefits

04.Experimental results - arithmetic reasoning

04.Experimental results - commonsense reasoning

04.Experimental results - symbolic reasoning

05.limitations

06.conclusion & takeaways

01

introduction

01. introduction

what is chain of thought?

  • A series of intermediate natural language reasoning steps that lead to the final output

How do we use it to prompt?

  • A few chain-of-thought demonstrations are given as exemplars in the prompt, i.e. the prompt contains triples <input, chain of thought, output>
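A minimal sketch of how such triples could be rendered into a few-shot prompt. The `Exemplar` class, the `build_cot_prompt` helper, and the "Q:/A:" layout are assumptions made for illustration; the tennis-ball exemplar is the well-known example from the paper.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One <input, chain of thought, output> triple (hypothetical container)."""
    question: str          # input
    chain_of_thought: str  # intermediate natural language reasoning steps
    answer: str            # final output


def build_cot_prompt(exemplars, test_question):
    """Render the exemplars, then leave the test question open for the model."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    Exemplar(
        question=(
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        chain_of_thought=(
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        answer="11",
    )
]

prompt = build_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?",
)
print(prompt)
```

The prompt ends with an open "A:" so that the model is expected to continue with its own reasoning steps before stating an answer.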

02

motivation

background

  1. Recent surge of language models in Natural Language Processing tasks
  2. Scale of these models is on an ever-growing upward trend

02. motivation

Motivation

Reasoning abilities could now be unlocked by

  1. Generating natural language rationales to drive intermediate steps in reasoning problems - Ling et al. (2017)
  2. Exploiting in-context few-shot learning via 'prompting' - Brown et al. (2020)

limitations

  1. Rationale-augmented training and finetuning require a large set of high-quality rationales, which is expensive to create
  2. Few-shot prompting methods still work poorly on tasks that require reasoning, and often do not improve substantially with increasing language model scale

03

Goals & benefits

03. Goals & benefits

goals

To show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting

benefits

  1. Allows models to decompose multi-step problems into intermediate steps, so that more computation can be allocated to problems that need more reasoning steps
  2. Provides an interpretable window into the behaviour of the model, with opportunities to debug where a reasoning path went wrong*
  3. Applicable to a wide variety of tasks - math, commonsense, symbolic - and potentially to any task that humans can solve via language
  4. Chain-of-thought reasoning can be elicited in any sufficiently large off-the-shelf language model, simply by including chain-of-thought exemplars in the prompt

04

experimental results

arithmetic reasoning

04. i. experimental results - arithmetic reasoning

experimental setup

benchmarks

  1. GSM8K - Benchmark of math word problems
  2. SVAMP - Dataset of math word problems with varying structures
  3. ASDiv - Dataset of diverse math word problems
  4. AQuA - Dataset of algebraic word problems
  5. MAWPS benchmark

language models

  1. GPT-3 ~ InstructGPT with 350M, 1.3B, 6.7B, and 175B parameters*
  2. LaMDA: 422M, 2B, 8B, 68B, and 137B parameters
  3. PaLM: 8B, 62B, and 540B parameters
  4. UL2: 20B parameters
  5. Codex: code-davinci-002 in the OpenAI API

standard prompting

Treated as the baseline: standard few-shot prompting, in which the model is given in-context exemplars of input-output pairs before it outputs a prediction for a test example.

chain of thought prompting

Augment each exemplar in the few-shot prompt with a chain of thought for the associated answer. A set of eight few-shot exemplars was composed manually (e.g. for math word problems, CSQA, StrategyQA, Date Understanding, Sports Understanding, SayCan, Last Letter Concatenation, Coin Flip).

04. i. experimental results - arithmetic reasoning

experimental setup - example prompts

04. i. experimental results - arithmetic reasoning

results & robustness

results

Chain of thought prompting

  1. Is an emergent ability of model scale (performance gains appear only at around 100B parameters)
  2. Has larger performance gains for more complicated problems
  3. Performs comparably to task-specific models finetuned on a labeled training dataset

robustness

  • Robustness of chain of thought (i.e. its sensitivity to different prompts) was evaluated by comparing chains of thought written by different annotators
  • Despite some variance among these prompts, all of them performed better than standard prompting by a large margin
  • This implies that successful use of chain of thought does not depend on a particular linguistic style

04. i. experimental results - arithmetic reasoning

ablation study & insights

equation only

  1. Intuition: Chain of thought produces the mathematical equation to be evaluated
  2. Tested by: Model prompted to output only a mathematical equation before giving the answer
  3. Observed: Approach does not help much in solving the problems
  4. Insight: The semantics of the questions are too challenging to translate directly into an equation without the natural language reasoning steps in chain of thought

variable compute only

  1. Intuition: Chain of thought allows the model to spend more computation (i.e. intermediate tokens) on harder problems
  2. Tested by: Model prompted to output only a sequence of dots equal to the number of characters in the equation needed to solve the problem
  3. Observed: Approach performs about the same as the baseline (prompting without chain of thought)
  4. Insight: Variable computation by itself is not enough; there is utility in expressing intermediate steps via natural language

04. i. experimental results - arithmetic reasoning

ablation study & insights

chain of thought after answer

  1. Intuition: Chain of thought prompts allow the model to better access relevant knowledge acquired during pretraining
  2. Tested by: Only giving chain of thought after the answer
  3. Observed: Approach performs similar to baseline
  4. Insight: Sequential reasoning embodied in chain of thought is helpful, beyond just activating prior knowledge
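The three ablations (equation only, variable compute only, chain of thought after answer) differ only in what the exemplar answer contains. The sketch below shows one way the exemplar answers could be rewritten for each variant; the function names, the "The answer is ..." suffix, and counting spaces as characters of the equation are assumptions for illustration, not the paper's exact prompt text.

```python
def full_chain_of_thought(chain, answer):
    """Reference format: reasoning first, then the answer."""
    return f"{chain} The answer is {answer}."


def equation_only(equation, answer):
    """Ablation: only the equation is written before the answer."""
    return f"{equation} The answer is {answer}."


def variable_compute_only(equation, answer):
    """Ablation: as many dots as the equation has characters (spaces included
    here, which is an assumption), so the model emits extra tokens without
    expressing any reasoning."""
    return f"{'.' * len(equation)} The answer is {answer}."


def reasoning_after_answer(chain, answer):
    """Ablation: the answer comes first, so the chain cannot inform it."""
    return f"The answer is {answer}. {chain}"


chain = "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11."
equation = "5 + 2 * 3 = 11"

for variant in (
    full_chain_of_thought(chain, "11"),
    equation_only(equation, "11"),
    variable_compute_only(equation, "11"),
    reasoning_after_answer(chain, "11"),
):
    print(variant)
```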

04

experimental results

ii

commonsense reasoning

04. ii. experimental results - commonsense reasoning

experimental setup

benchmarks

  1. CSQA - Benchmark of commonsense questions, often requiring prior knowledge
  2. StrategyQA - Benchmark that requires model to infer a multi-hop strategy to answer questions
  3. BIG-bench Date - Benchmark that requires inferring a date from given context
  4. BIG-bench Sports - Benchmark that requires inferring plausibility of a sentence relating to sports
  5. SayCan - Involves mapping a natural language instruction to a sequence of robot actions

prompts

Examples were selected from the training set and combined with manually composed chains of thought to turn them into few-shot exemplars. For datasets without a training set, examples were taken from the evaluation set and handled in the same way.

04. ii. experimental results - commonsense reasoning

results

results

  1. For all tasks, scaling up model size improved performance of standard prompting
  2. Chain-of-thought prompting led to further gains over standard prompting, in some cases outperforming the prior state of the art

04

experimental results

iii

symbolic reasoning

04. iii. experimental results - symbolic reasoning

experimental setup

Task

  1. Last Letter Concatenation: Asks the model to concatenate the last letters of the words in a name
  2. Coin Flip: Asks the model whether a coin is still heads up after a certain number of people either flip or don't flip it
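Both tasks have simple ground-truth rules, which is what makes them convenient for testing. A minimal sketch of those rules follows; the function names and the list-of-booleans encoding of the coin flips are assumptions for illustration.

```python
def last_letter_concatenation(name):
    """Concatenate the last letter of each word in the name."""
    return "".join(word[-1] for word in name.split())


def coin_is_heads_up(flips):
    """The coin starts heads up. `flips` holds one boolean per person:
    True if that person flips the coin, False if they do not.
    The coin is still heads up when the number of flips is even."""
    return sum(flips) % 2 == 0


print(last_letter_concatenation("Amy Brown"))  # yn
print(coin_is_heads_up([True, False]))         # False: one flip, so tails up
```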

In-domain vs Out-of-Domain test sets

In-domain test examples have the same number of steps as the few-shot exemplars, while out-of-domain examples have more steps than the exemplars (to test whether the reasoning generalises). Again, chain of thought prompts were manually composed.
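As a rough sketch of the split, coin-flip test examples can be generated with a chosen number of steps. The names in `PEOPLE`, the question wording, and the step counts used here (2 for in-domain, 3 and 4 for out-of-domain) are assumptions for illustration.

```python
import random

PEOPLE = ["Phoebe", "Osvaldo", "Maya", "Jun"]  # hypothetical names


def make_coin_flip_example(num_people, rng):
    """Build one coin-flip question with `num_people` steps and its answer."""
    people = rng.sample(PEOPLE, num_people)
    flips = [rng.random() < 0.5 for _ in people]
    steps = [
        f"{person} {'flips' if flipped else 'does not flip'} the coin."
        for person, flipped in zip(people, flips)
    ]
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    answer = "yes" if sum(flips) % 2 == 0 else "no"
    return question, answer


rng = random.Random(0)
# In-domain: same number of steps as the exemplars (assumed to be 2 here).
in_domain = [make_coin_flip_example(2, rng) for _ in range(3)]
# Out-of-domain: more steps than any exemplar.
out_of_domain = [make_coin_flip_example(n, rng) for n in (3, 4)]

for question, answer in in_domain + out_of_domain:
    print(question, "->", answer)
```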

04. iii. experimental results - symbolic reasoning

experimental setup

results

With larger models (PaLM 540B), chain of thought prompting leads to almost 100% solve rates. Standard prompting already solves some of these problems on the largest models, but smaller models still fail.

05

limitations

05. limitations

CORRECT reasoning paths?

Although the model produces reasoning paths, there is no guarantee that they are correct

Reasoning undertaken?

Although chain-of-thought emulates reasoning, this does not answer whether the neural network is actually 'reasoning'

real-world applications?

Chain-of-thought reasoning emerges only at large model scales, which makes it costly to serve in real-world applications

cost of finetuning?

The cost of manually augmenting exemplars with chains of thought could be prohibitive at finetuning scale

06

conclusion & takeaways

06. conclusion & future work

conclusion & takeaways

Chain of thought

  1. Elicits multi-step reasoning behaviour in off-the-shelf large language models of sufficient scale, without the need for finetuning
  2. Improves performance by a large margin on arithmetic reasoning; much stronger than the ablations
  3. Improves performance on commonsense reasoning, which underscores how its linguistic nature makes it generally applicable
  4. For symbolic reasoning, facilitates OOD generalisation to longer sequence lengths
  5. Leads to scaling curves that increase dramatically, expanding the set of tasks that LLMs can perform successfully
  6. Is robust to different annotators, exemplars and language models

questions probed

  1. How much more will reasoning ability improve with further increases in model scale?
  2. What other prompting methods can prove effective?

thank you
