GPT-4 passes MIT's undergraduate math exams with a perfect score! This set of prompts is on fire
Source: Qubit
Unexpectedly, GPT-4 has cracked the MIT math exams?!
A new paper makes the high-profile claim:
On MIT's Mathematics and EECS (Electrical Engineering and Computer Science) undergraduate degree exams, GPT-4 demonstrated the ability to fully meet graduation requirements.
And it scored full marks!
Mind you, this result was measured by none other than a research team from MIT, Boston University, and Cornell University.
GPT-4 also crushed the previous champion, GPT-3.5, which managed only one-third of the same test.
As soon as the paper came out, it drew countless eyes.
GPT-4 takes the MIT exams
Specifically, the test GPT-4 took this time went like this:
The research team curated a dataset containing 4,550 problems and solutions.
These 4,550 problems and solutions come from the problem sets, midterm exams, and final exams of the courses that students in MIT's Department of Mathematics and EECS must take to earn an undergraduate degree.
The majors covered include:
6-1: Electrical Science and Engineering
6-2: Electrical Engineering and Computer Science
6-3: Computer Science and Engineering
6-4: Artificial Intelligence and Decision-Making
18-1: General Mathematics
18-2: Applied Mathematics
18-3: Pure Mathematics
18-C: Mathematics and Computer Science
Detailed classification summary of each major
The test questions all come from this MIT dataset: 228 of them were randomly drawn, restricted to problems that involve no images and have existing solutions.
By question source, the difficulty ranges from easy to hard as: exercises, problem sets, midterm exams, final exams, labs, and special projects.
By answer type, the difficulty ranges from easy to hard as: programming, open-ended, multiple-choice, numerical, expression, and image.
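For concreteness, the sampling step described above boils down to a filter plus a random draw. Here is a minimal sketch; the dict keys ("has_image", "has_solution") are assumptions for illustration, not the paper's actual field names:

```python
import random

def build_test_set(problems, k=228, seed=0):
    # Keep only text-only problems that come with a ground-truth solution,
    # then randomly draw k of them, as the article describes.
    eligible = [p for p in problems
                if not p["has_image"] and p["has_solution"]]
    return random.Random(seed).sample(eligible, k)
```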
Taking the test this time were not only GPT-4 and GPT-3.5, but also StableVicuna-13B, LLaMA-30B, and LLaMA-60B.
These models were chosen as contestants because they are the "state-of-the-art large language models".
Final Exam Score
As the data in the table shows, the tuned GPT-4 scored highest, with a scoring rate of 100%; the weakest performer was LLaMA-30B, which earned only 30% of the points.
It is worth noting that the original version of GPT-4, used out of the box with no tuning at all, still scored 90% on this MIT exam.
The tuning process included Few-Shot + CoT (chain-of-thought) + Self-critique + Experts; a sketch of how such a cascade fits together follows below.
In addition, the research team also performed engineering optimization on the prompts themselves; the specific "incantations" are given in the paper.
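The excerpt here does not spell out the exact pipeline, so the following is only a minimal sketch of such a cascade under stated assumptions: `ask_llm(prompt)` is a hypothetical helper that calls the model, and `is_correct(answer)` stands in for whatever correctness check gates each stage. None of these names come from the paper.

```python
# A sketch of a Few-Shot + CoT + Self-critique + Experts cascade:
# each stage only runs if the previous one failed the check.

FEW_SHOT_EXAMPLES = "Q: <worked example>\nA: <worked answer>\n"

def solve_with_cascade(question, ask_llm, is_correct):
    # 1) Few-shot: prepend worked examples to the question.
    answer = ask_llm(FEW_SHOT_EXAMPLES + f"Q: {question}\nA:")
    if is_correct(answer):
        return answer
    # 2) Chain-of-thought: ask for step-by-step reasoning.
    answer = ask_llm(f"Q: {question}\nLet's think step by step.")
    if is_correct(answer):
        return answer
    # 3) Self-critique: show the model its answer and ask it to fix errors.
    answer = ask_llm(f"Question: {question}\nProposed answer: {answer}\n"
                     "Review this answer for mistakes and give a corrected one.")
    if is_correct(answer):
        return answer
    # 4) Experts: retry under different expert personas.
    for persona in ("a mathematics professor", "an EECS professor"):
        answer = ask_llm(f"You are {persona}.\nQ: {question}\nA:")
        if is_correct(answer):
            return answer
    return answer  # best effort after all stages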
Wait, the grader is GPT-4 itself?
Seeing this result, many netizens felt that LLMs' progress on math exams has come a bit fast.
After all, not long ago AI was still fumbling over grade-school word problems like "Xiao Ming planted 5 lemon trees and gets 6 lemons from each tree every year; how many lemons does he get in total over 10 years?"
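(For the record, that one works out to 5 trees × 6 lemons per tree per year × 10 years = 300 lemons.)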
In an earlier, similar evaluation, a Codex-based system learned from randomly selected sample questions in MIT's undergraduate foundational mathematics courses: 25 questions were randomly drawn from each of 6 courses, plus 60 questions from an ACT-level (American college entrance exam) dataset.
A total of 210 questions, and the AI answered every one of them correctly.
That said, in that evaluation Codex was responsible for reading the problems and writing programs; the solving itself was not part of its job.
Against that backdrop, GPT-4's stellar performance this time sounds like a wonderful story~
But there are two main points of contention.
The first point worth questioning: OpenAI has never fully disclosed GPT-4's training dataset.
That means no one can prove that the 4,550 problems and solutions in the dataset are absent from GPT-4's training set.
In other words, if GPT-4 was already exposed to the test questions during pre-training, its perfect score would come as no surprise at all.
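Without access to the training corpus, even the simplest contamination check is impossible to run. For illustration only, here is a minimal sketch of the kind of n-gram overlap test critics have in mind, assuming the training text were available; the 13-gram size follows common practice in LLM evaluations, and nothing here comes from this paper:

```python
# Flag a test question as possibly contaminated if any long word n-gram
# from it appears verbatim in the training corpus.

def ngrams(text, n=13):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question, training_ngrams, n=13):
    # training_ngrams: the set of all n-grams in the training corpus,
    # which is precisely what outsiders cannot compute for GPT-4.
    return bool(ngrams(question, n) & training_ngrams)
```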
No wonder some netizens openly sniped that GPT-4 could only have gotten this result because the dataset was already baked into its training data.
Looking closer, there is a key point in Section 2.6 of the paper:
The team fine-tuned the open-source models on the dataset; then, "Given a question Q, a ground truth solution S, and an LLM answer A, we use GPT-4 to automatically score the model responses."
In practice, each model generated its answers to the test, and those answers were then handed to GPT-4 for scoring, on a scale of 0 to 5.
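Concretely, that grading protocol amounts to a loop like the sketch below. Only the (Q, S, A) setup and the 0-to-5 scale come from the paper; the prompt wording and the `ask_gpt4(prompt)` helper are illustrative assumptions.

```python
# A sketch of the GPT-4 auto-grading loop implied by Section 2.6.

GRADING_PROMPT = (
    "Question: {q}\n"
    "Ground-truth solution: {s}\n"
    "Model answer: {a}\n"
    "Grade the model answer against the ground truth on a scale "
    "from 0 to 5. Reply with a single integer."
)

def grade(question, solution, answer, ask_gpt4):
    reply = ask_gpt4(GRADING_PROMPT.format(q=question, s=solution, a=answer))
    return int(reply.strip())  # score in 0..5

def scoring_rate(test_set, generate_answer, ask_gpt4):
    # Each model's answers are produced first, then handed to GPT-4 to score.
    scores = [grade(p["question"], p["solution"],
                    generate_answer(p["question"]), ask_gpt4)
              for p in test_set]
    return sum(scores) / (5 * len(scores))  # fraction of points earned
```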
So the one who awarded GPT-4 full marks was, in fact, GPT-4 itself.
Ah, this... it's hard to say there's no whiff of "Granny Wang selling her own melons and praising them herself", as the Chinese idiom goes.
And the second point of contention: what exactly counts as a "good prompt"? That seems impossible to define.
One More Thing
A little easter egg:
Throughout the test, StableVicuna-13B, which can more or less be deployed and run on a laptop, still achieved a scoring rate of 48%.
A result that invites some reflection on the correlation between model size and capability.