GPT-4 passes MIT's undergraduate math exams with a perfect score! This set of prompts is on fire
Source: Qubit
Unexpectedly, GPT-4 has cracked the MIT math exams?!
A new paper makes the high-profile claim:
On MIT's Mathematics and EECS (Electrical Engineering and Computer Science) undergraduate degree exams, GPT-4 demonstrated the ability to fully meet graduation requirements.
And it scored full marks!
Mind you, this result was measured by none other than a research team from MIT, Boston University, and Cornell University.
GPT-4 also crushed the previous champion, GPT-3.5, which managed only one-third of the same test.
As soon as the paper came out, it drew countless eyes.
GPT-4 takes the MIT exams
Specifically, the test GPT-4 took this time went like this:
The research team curated a dataset containing 4,550 problems and solutions.
These 4,550 problems and solutions come from the problem sets, midterm exams, and final exams of the courses that students in MIT's Department of Mathematics and EECS must take to earn an undergraduate degree.
The majors covered include:
6-1: Electrical Science and Engineering
6-2: Electrical Engineering and Computer Science
6-3: Computer Science and Engineering
6-4: Artificial Intelligence and Decision-Making
18-1: General Mathematics
18-2: Applied Mathematics
18-3: Pure Mathematics
18-C: Mathematics and Computer Science
Detailed classification summary of each major
The test questions all come from this MIT dataset: 228 of them were randomly drawn, restricted to problems that involve no images and have existing solutions.
By question source, the difficulty ranges from easy to hard as: exercises, problem sets, midterm exams, final exams, labs, and special projects.
By answer type, the difficulty ranges from easy to hard as: programming, open-ended, multiple-choice, numerical, expression, and image.
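For concreteness, the sampling step described above boils down to a filter plus a random draw. Here is a minimal sketch; the dict keys ("has_image", "has_solution") are assumptions for illustration, not the paper's actual field names:

```python
import random

def build_test_set(problems, k=228, seed=0):
    # Keep only text-only problems that come with a ground-truth solution,
    # then randomly draw k of them, as the article describes.
    eligible = [p for p in problems
                if not p["has_image"] and p["has_solution"]]
    return random.Random(seed).sample(eligible, k)
```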
Taking the test this time were not only GPT-4 and GPT-3.5, but also StableVicuna-13B, LLaMA-30B, and LLaMA-60B.
These models were chosen as contestants because they are the "state-of-the-art large language models".
Final Exam Score
As the data in the table shows, the tuned GPT-4 scored highest, with a scoring rate of 100%; the weakest performer was LLaMA-30B, which earned only 30% of the points.
It is worth noting that the original version of GPT-4, used out of the box with no tuning at all, still scored 90% on this MIT exam.
The tuning process included Few-Shot + CoT (chain-of-thought) + Self-critique + Experts; a sketch of how such a cascade fits together follows below.
In addition, the research team also performed engineering optimization on the prompts themselves; the specific "incantations" are given in the paper.
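The excerpt here does not spell out the exact pipeline, so the following is only a minimal sketch of such a cascade under stated assumptions: `ask_llm(prompt)` is a hypothetical helper that calls the model, and `is_correct(answer)` stands in for whatever correctness check gates each stage. None of these names come from the paper.

```python
# A sketch of a Few-Shot + CoT + Self-critique + Experts cascade:
# each stage only runs if the previous one failed the check.

FEW_SHOT_EXAMPLES = "Q: <worked example>\nA: <worked answer>\n"

def solve_with_cascade(question, ask_llm, is_correct):
    # 1) Few-shot: prepend worked examples to the question.
    answer = ask_llm(FEW_SHOT_EXAMPLES + f"Q: {question}\nA:")
    if is_correct(answer):
        return answer
    # 2) Chain-of-thought: ask for step-by-step reasoning.
    answer = ask_llm(f"Q: {question}\nLet's think step by step.")
    if is_correct(answer):
        return answer
    # 3) Self-critique: show the model its answer and ask it to fix errors.
    answer = ask_llm(f"Question: {question}\nProposed answer: {answer}\n"
                     "Review this answer for mistakes and give a corrected one.")
    if is_correct(answer):
        return answer
    # 4) Experts: retry under different expert personas.
    for persona in ("a mathematics professor", "an EECS professor"):
        answer = ask_llm(f"You are {persona}.\nQ: {question}\nA:")
        if is_correct(answer):
            return answer
    return answer  # best effort after all stages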
Wait, the grader is GPT-4 itself?
Seeing this result, many netizens felt that LLMs' progress on math exams has come a bit fast.
After all, not long ago AI was still fumbling over grade-school word problems like "Xiao Ming planted 5 lemon trees and gets 6 lemons from each tree every year; how many lemons does he get in total over 10 years?"
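(For the record, that one works out to 5 trees × 6 lemons per tree per year × 10 years = 300 lemons.)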
In an earlier, similar evaluation, a Codex-based system learned from randomly selected sample questions in MIT's undergraduate foundational mathematics courses: 25 questions were randomly drawn from each of 6 courses, plus 60 questions from an ACT-level (American college entrance exam) dataset.
A total of 210 questions, and the AI answered every one of them correctly.
That said, in that evaluation Codex was responsible for reading the problems and writing programs; the solving itself was not part of its job.
Against that backdrop, GPT-4's stellar performance this time sounds like a wonderful story~
But there are two main points of contention.
The first point worth questioning: OpenAI has never fully disclosed GPT-4's training dataset.
That means no one can prove that the 4,550 problems and solutions in the dataset are absent from GPT-4's training set.
In other words, if GPT-4 was already exposed to the test questions during pre-training, its perfect score would come as no surprise at all.
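Without access to the training corpus, even the simplest contamination check is impossible to run. For illustration only, here is a minimal sketch of the kind of n-gram overlap test critics have in mind, assuming the training text were available; the 13-gram size follows common practice in LLM evaluations, and nothing here comes from this paper:

```python
# Flag a test question as possibly contaminated if any long word n-gram
# from it appears verbatim in the training corpus.

def ngrams(text, n=13):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question, training_ngrams, n=13):
    # training_ngrams: the set of all n-grams in the training corpus,
    # which is precisely what outsiders cannot compute for GPT-4.
    return bool(ngrams(question, n) & training_ngrams)
```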
No wonder some netizens openly sniped that GPT-4 could only have gotten this result because the dataset was already baked into its training data.
Looking closer, there is a key point in Section 2.6 of the paper:
The team fine-tuned the open-source models on the dataset; then, "Given a question Q, a ground truth solution S, and an LLM answer A, we use GPT-4 to automatically score the model responses."
In practice, each model generated its answers to the test, and those answers were then handed to GPT-4 for scoring, on a scale of 0 to 5.
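Concretely, that grading protocol amounts to a loop like the sketch below. Only the (Q, S, A) setup and the 0-to-5 scale come from the paper; the prompt wording and the `ask_gpt4(prompt)` helper are illustrative assumptions.

```python
# A sketch of the GPT-4 auto-grading loop implied by Section 2.6.

GRADING_PROMPT = (
    "Question: {q}\n"
    "Ground-truth solution: {s}\n"
    "Model answer: {a}\n"
    "Grade the model answer against the ground truth on a scale "
    "from 0 to 5. Reply with a single integer."
)

def grade(question, solution, answer, ask_gpt4):
    reply = ask_gpt4(GRADING_PROMPT.format(q=question, s=solution, a=answer))
    return int(reply.strip())  # score in 0..5

def scoring_rate(test_set, generate_answer, ask_gpt4):
    # Each model's answers are produced first, then handed to GPT-4 to score.
    scores = [grade(p["question"], p["solution"],
                    generate_answer(p["question"]), ask_gpt4)
              for p in test_set]
    return sum(scores) / (5 * len(scores))  # fraction of points earned
```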
So the one who awarded GPT-4 full marks was, in fact, GPT-4 itself.
Ah, this... it's hard to say there's no whiff of "Granny Wang selling her own melons and praising them herself", as the Chinese idiom goes.
And the second point of contention: what exactly counts as a "good prompt"? That seems impossible to define.
One More Thing
A little easter egg:
Throughout the test, StableVicuna-13B, which can more or less be deployed and run on a laptop, still achieved a scoring rate of 48%.
A result that invites some reflection on the correlation between model size and capability.