AGI-Next Summit Transcript: Tang Jie, Yang Zhilin, Lin Junyang, and Yao Shunyu Debate the Future

On January 10th, at the "AGI-Next" summit initiated by the Beijing Key Laboratory of Foundational Models at Tsinghua University, Tang Jie, Yang Zhilin, Lin Junyang, and Yao Shunyu engaged in a fierce debate, with closing remarks by Academician Zhang Bo. Here is the full transcript of the core speeches and the panel discussion.

Core Report Highlights

• Tang Jie (Zhipu AI): Report "Making Machines Think Like Humans". Core views: The competition in the Chat paradigm has ended due to DeepSeek; Scaling remains an effective path, but autonomous model Scaling should be explored; The value of AGI lies in solving high-value problems and balancing costs; Future models must possess calibratable reflection and metacognitive capabilities.
• Yang Zhilin (Moonshot AI/Kimi): Focused on Token efficiency and long-context capabilities. Released the Muon optimizer, achieving double the Token efficiency compared to Adam; Proposed that models need values and taste, and that Scaling is the co-evolution of technology, data, and aesthetics; Emphasized that long context is the key differentiator in the Agent era.
• Lin Junyang (Alibaba Qwen): Pointed out that RL is still in its early stages, and the core of the next paradigm is autonomous learning; The key to general Agents lies in solving long-tail problems; The probability of Chinese teams leading globally in 3-5 years is about 20%; Coding dominates API consumption in the US, reflecting changes in market demand.

Panel Discussion (Yang Qiang, Tang Jie, Lin Junyang, Yao Shunyu remotely)

• Yao Shunyu (Tencent): Coding is reshaping the industry, and humans are communicating with computers using natural language; The differentiation between strong and weak models in the To B market is intensifying; Model companies making applications may not necessarily outperform scenario owners; The new AGI paradigm is most likely to be born at OpenAI.
• Yang Qiang (Academia): Agents are divided into four stages (goal/planning defined by humans vs. automatically combined), currently still in the rudimentary script stage; Large models have Gödel incompleteness issues, suggesting borrowing human sleep mechanisms to improve continuous learning capabilities.
• Consensus and Divergence: Consensus that moving from Chat to Agent is the general direction; Divergence focuses on the priority of autonomous learning vs. RL deepening, and the timeline and path for Chinese teams to lead globally.

Closing Remarks by Academician Zhang Bo

Current large models have five deficiencies: referential deficiency, truth and causality deficiency, pragmatic deficiency, polysemy and dynamic context deficiency, and closed-loop behavior deficiency; Proposed five hard indicators for AGI: spatiotemporally consistent multimodal understanding, controllable online learning, verifiable reasoning, calibratable reflection, and cross-task strong generalization, emphasizing that the definition of AGI needs to be executable and verifiable.

Making Machines Think Like Humans

Speaker: Tang Jie (Chief Scientist of Zhipu AI, Professor at Tsinghua University)

Today's event is more of an academic activity, so we won't have too many preliminaries; let's get straight to the reports. I asked everyone and our team not to have a host this time; we don't need one. In the future it will be the AI era, and AI will host. AI hasn't achieved that yet, so I'll host myself for now. For the second report, Kimi can just come up directly, Junyang can also come up directly, and then the Panel. I will start my report.

My report title covers, on one hand, reporting on some of the work our foundational laboratory is doing now, and on the other hand, discussing some ideas and views on the future with everyone. My title is "Making Machines Think Like Humans". Why do I say that? Actually, when I first proposed this title back then, Academician Zhang Bo opposed me, saying you can't keep saying making machines think like humans. But I added quotation marks, so now maybe I'm allowed to say it with quotes.

Origin and Spirit of Zhipu

We started thinking in 2019, can we make machines truly think like humans, even just a little bit, where possible. So in 2019, we spun off from Tsinghua. At that time, successfully supported by the university, we established the company Zhipu, and I am now the Chief Scientist at Zhipu. We have also open-sourced a lot; you can see many open-source projects here, and on the left, there are many things about large model API calls.

I have been at Tsinghua for about 20 years. I graduated in 2006, and this year makes exactly 20 years. Actually, looking back at what I've been doing, I summarized it into just two things: first, back then I built the AMiner system; second, now I am working on large models. I always hold a view, and I am quite influenced by it myself, I call it doing things with a "coffee-like spirit". Actually, that matter is very relevant to a guest present today, Professor Yang Qiang.

I remember when I first graduated and went to HKUST. Those who have been there know that HKUST is basically one building. Meeting rooms, classrooms, laboratories, cafes, dining places, and basketball courts are all in this one building. At that time, we ran into each other often. Once I met him in the cafe, and I said I had been drinking a lot of coffee these past two days and asked if I should quit for a bit, otherwise it might be bad for my health. Professor Yang's first sentence was "Yes, you should quit a bit". Then he said, no, wait: if we did research the way you are addicted to coffee, wouldn't our research turn out very well?

That remark about coffee addiction touched me very deeply, and it has influenced me from 2008 until now. That is, doing things might require focus, and persistence in doing them. This time, luckily, I encountered AGI, which happens to be something that needs long-term investment and long-term work. It is not a quick-win project where I do it today, it blossoms tomorrow, and it ends the day after. It is very long-term, and precisely worth investing in.

In 2019, our laboratory was actually doing okay internationally in graph neural networks and knowledge graphs, but at that time we firmly paused these two directions, temporarily stopped doing them. Everyone turned to work on large models, everyone started research related to large models. To this day, we have done a little bit of work.

Evolution of Large Model Intelligence Levels

Everyone knows about globalization. Actually, this chart is from February 2025. Over the entire history of large model development, what we call the intelligence level has greatly improved. Around 2020, we saw models handling some very simple problems like MMLU and QA, which was already considered quite good at the time; today we can basically achieve a near-perfect score on them.

Slowly, from the earliest simple questions, to 2021 and 2022 where it started doing some math problems, some problems requiring reasoning—that is, problems that can only be done correctly with addition, subtraction, multiplication, and division. At this time, we can see that through post-training, the model slowly filled in these gaps, and its capabilities also greatly improved.

Then to 2023 and 2024, everyone saw the development of models from originally just some knowledge memory, to simple mathematical reasoning, to more complex ones, even able to do some graduate-level problems, and even starting to answer some of our real-world problems. For example, in SWE Bench, it has actually done many real-world programming problems.

At this time, we can see the model's capabilities and intelligence level becoming more and more complex, just like a human growing up. At first, we read more books in elementary school, slowly do math problems, slowly reach middle and high school, and then answer complex reasoning problems at the graduate level. Then after graduation, we start to complete some work problems, more difficult problems. By this year, everyone can see that the HLE (Humanity's Last Exam) task is particularly difficult. If you look inside HLE, there are even some questions that Google can't find. For example, for a certain bone of a certain bird in the world, Google can't even find the page, so the model is required to generalize. How to do this? There is no answer now, but everyone can see that its capabilities were rapidly improving in 2025.

From Scaling to Generalization

On the other hand, we can see this model, what is "from Scaling to Generalization"? We humans always hope that machines have generalization abilities. I teach it a little bit, and it can draw inferences about other cases from one instance, actually just like humans. When we teach a child, we always hope that if we teach the child three problems, he will know the fourth, the tenth, and even know ones that weren't taught before. How do we do this? Until today, our goal is to give it stronger generalization capabilities through Scaling, but until today its generalization capabilities still need to be greatly improved, and we are improving it at different levels.

In the earliest days, we trained a model with Transformer to memorize all knowledge. The more data we train, the more computing power we train with, the stronger its long-term knowledge memory ability, meaning it memorizes all the knowledge in the world, and has a certain generalization ability, can abstract, and can do simple reasoning. So if you ask a question, what is the capital of China? At this time, the model doesn't need reasoning, it just retrieves it from the knowledge base.

The second layer is aligning and reasoning with this model, allowing this model to have more complex reasoning capabilities and understand our intentions. We need continuous Scaling SFT, and even reinforcement learning. Through massive human feedback data, we are Scaling feedback data, allowing this model to become smarter and more accurate.

This year is the breakout year for RLVR (Reinforcement Learning with Verifiable Rewards). Why was this difficult to do originally? Because originally we did it through human feedback, and the model could only learn from human feedback data. But the noise in human feedback data is very high, and the scenarios it covers are very narrow. If we have a verifiable environment, we can let the machine explore by itself, discover this feedback data by itself, and grow by itself.

There is a difficulty inside this, and everyone gets it as soon as they hear it: what does verifiable mean? Math can probably be verified, programming can probably be verified, but more broadly, say we made a web page: is this web page good-looking? That might not be easy to verify; it needs humans to judge. So what problem does RLVR face now? The originally verifiable scenarios may gradually be used up. Can we move into semi-automatically verifiable, or even unverifiable, scenarios to make the model more general? This is a challenge we face.
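To make "verifiable" concrete, here is a minimal sketch of two reward functions of the kind the talk alludes to: one checks a math answer against a reference, one runs generated code against test cases. The function names (`math_reward`, `code_reward`) and the answer-extraction rule are illustrative assumptions, not the pipeline described by the speaker. Note that nothing comparable can be written for "is this web page good-looking", which is exactly the limitation raised above.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number-like token (integer, decimal, or a/b) in the text."""
    matches = re.findall(r"-?\d+(?:\.\d+|/\d+)?", completion)
    return matches[-1] if matches else None


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the final number equals the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0


def code_reward(source: str, func_name: str, cases: list) -> float:
    """Fraction of (args, expected) cases passed; 0.0 if the code raises.

    Assumption: exec() is used only for illustration. A real system would
    run untrusted code in a sandbox with time and memory limits.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        passed = sum(1 for args, expected in cases if func(*args) == expected)
    except Exception:
        return 0.0
    return passed / len(cases) if cases else 0.0
```

Both rewards are computed without a human in the loop, which is what lets the machine explore and collect its own feedback.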

In the future, machines will slowly start doing some real tasks in the physical world. For these real tasks, how do we build an agent environment? There are more challenges faced here. Everyone can see that in recent years, AI has been moving along these aspects, not just a simple Transformer, actually the entire AI has become a large system, an intelligent system.

From Chat to Doing Things: The Start of a New Paradigm

From originally mostly some reasoning in math, physics, and chemistry, from simple elementary, middle, and high school to more complex GPQA physics, chemistry, and biology complex problems, to harder problems like Olympiad gold medals, to this year everyone can see HLE very high-difficulty intelligence evaluation benchmarks, now starting to improve rapidly.

On the other hand, in real environments, like today many people are saying coding ability is particularly strong, and can complete a lot of real code. But in fact, coding models existed in 2021. At that time, we had a lot of cooperation with Junyang and Kimi Zhilin, and we also made many such models. Actually, the Coding models at that time could also program, but the programming ability at that time was far inferior to now, maybe programming ten programs at that time and getting one right, but now maybe programming one program, often it can run naturally, and it is a very complex task. To this day, we have now started using code to help senior engineers complete more complex tasks.

Everyone might ask, is intelligence getting stronger and stronger, should we just keep training the model endlessly? Actually no. Everyone knows what happened at the beginning of 2025. DeepSeek came out at the beginning of 2025. Many times it's called "born out of nowhere", I think this term is used very well, really born out of nowhere. It might be for our research community, for the industry, and even for many people, because originally everyone in this academic community and industry didn't expect DeepSeek to suddenly come out, and indeed the performance is very strong, and it suddenly shocked many people.

Later, at the beginning of 2025, we were thinking about a question. Maybe under the DeepSeek paradigm, this Chat era is basically considered solved. That is to say, no matter how well we do, maybe on Chat problems we might end up about the same as DeepSeek, perhaps we can be a bit more personalized on top of it, becoming an emotional Chat, or a bit more complex. But generally speaking, this paradigm may basically be coming to an end here, and what remains are more engineering and technical problems.

At that time, we faced a choice: in which direction should AI develop next? Our idea at the time was that perhaps the new paradigm is enabling everyone to use AI to actually get something done. This might be the next paradigm. Originally it was Chat, now it is really doing things, so a new paradigm has opened.

Choice of Technical Route: Thinking + Agentic + Coding

We also faced a choice, because once this paradigm opened, there were many ways to pursue it. I remember that at the beginning of the year there were two options: one is simple programming, doing Coding, doing Agent; the second is using AI to help us do research, similar to DeepResearch, or even to write complex research reports. These two paths might still be somewhat different, and this is also the result of a choice.

One aspect is doing Thinking, we add some Coding scenarios; on the other hand, it might be interacting with the environment, making this model more interactive, more vivid. How to do it? Later we chose the left path, we let it have Thinking ability. But we didn't give up on the right side either. We did one thing around July 28th, relatively speaking it was quite successful, integrating Coding, Agentic, and Reasoning capabilities together.

Integrating them together might not be that easy. Originally, generally speaking, when everyone makes models, Coding is relatively taken out separately, Coding becomes Coding, reasoning becomes reasoning, and even sometimes math becomes math, but this approach often loses other capabilities. So at that time, we basically combined these three capabilities together, keeping the three capabilities relatively balanced. On July 28th we released version 4.5. This version, using 12 Benchmarks at the time, we basically ran a result that was considered quite good on agents, reasoning, and code. All models, we in China, including Qwen and Kimi today, are actually chasing each other. Sometimes this one is in front, sometimes that one is in front. On that day back then, we were in front.

Challenges and Breakthroughs in Real Environments

But soon we released this 4.5 for everyone to use, saying go ahead and program, our capability is quite good now. Since we chose Coding and Agent, it can do many programming tasks, so we let it program these very complex scenarios. The result was users gave us feedback saying, for example, we want to program a Plants vs. Zombies, this model can't program it.

Because real environments are often very complex. This game is generated automatically from a Prompt, and the entire game has to be playable: users can click to score, choose which plants to use and how to fight zombies, zombies come in from the right, and the interface and the backend logic are all written automatically from a single sentence. At that time, 4.5 couldn't do it in this scenario; many bugs appeared. What happened?

Later we found that real programming environments contain many problems, and many of them have to be solved inside the editing environment itself. This is precisely where RLVR, the verifiable reinforcement learning environment, comes in. So we collected a large number of programming environments, used them for reinforcement learning, and added some SFT data, letting the two sides interact to improve the model's effect.

On the other hand, we also did some work on the Web side, utilizing Web capabilities in the Web environment, adding some feedback, adding environment verifiability. Generally speaking, it is exploring through verification. So at that time, we got a very good score on SWE Bench, including recently we also got a very good score. But the model's benchmark score is one thing, entering the main model is another very big challenge. Many people have a Benchmark, saying my Benchmark score is high, but for this capability to truly enter the main model faces even more challenges, and in real user experience, user experience is not necessarily good.

Another challenge: since there are so many RL tasks, how do we train them all together uniformly? Different tasks have different lengths, and they take different amounts of time. So we developed a fully asynchronous reinforcement learning framework to let them run asynchronously. This is part of another framework we open-sourced this year. It also allowed Agent and Coding capabilities to be greatly improved. As the final result, our recently released 4.7 has greatly improved in Agent and Coding aspects compared to the original 4.6 and 4.5.
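The idea of letting rollouts of different lengths run without waiting for each other can be sketched with Python's `asyncio`. This is only a toy illustration under stated assumptions: `rollout`, `learner`, and the step counts are hypothetical, the sleep stands in for an environment or model step, and it is not the open-sourced framework mentioned in the talk.

```python
import asyncio
import random


async def rollout(task_id: int, steps: int, queue: asyncio.Queue) -> None:
    """Simulate one episode of `steps` steps, then hand it to the learner."""
    for _ in range(steps):
        await asyncio.sleep(0.001)  # stand-in for one environment/model step
    await queue.put({"task_id": task_id, "steps": steps, "reward": random.random()})


async def learner(queue: asyncio.Queue, total: int, batch_size: int) -> list:
    """Consume finished episodes in arrival order and group them into batches."""
    batches, current = [], []
    for _ in range(total):
        episode = await queue.get()
        current.append(episode["task_id"])
        if len(current) == batch_size:
            batches.append(current)  # a real learner would update the model here
            current = []
    if current:
        batches.append(current)
    return batches


async def main(task_lengths: list, batch_size: int = 2) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [
        asyncio.create_task(rollout(i, n, queue)) for i, n in enumerate(task_lengths)
    ]
    batches = await learner(queue, total=len(task_lengths), batch_size=batch_size)
    await asyncio.gather(*workers)
    return batches


# Long and short tasks are mixed; short ones do not wait for long ones.
print(asyncio.run(main([50, 2, 30, 1])))
```

The design point is that the learner forms batches from whatever episodes have finished, so a 50-step task does not block the 1-step and 2-step tasks from contributing updates.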

User experience matters even more. Why? Because once you really release a Coding model, what users run is not exactly the same as your benchmark. Today it might be his own program; I might implement a sorting algorithm on my own data with it. Does it work well? Is the experience good? That result is what the user sees, not how high the score is. So we also conducted detailed evaluations under real usage. This evaluation was done entirely manually, with a large number of programming experts as evaluators. Of course, this is not solved yet; there are still many problems to solve. Finally, we integrated these capabilities together. At the end of 2025, we got a decent score on the Artificial Analysis list.

Device Use: From Programming to Controlling Devices

On another aspect, as we develop further, we want the model to be truly used on a large scale in an Agent environment. What is an Agent's most basic capability? Programming. After the computer finishes programming, it can execute, which is equivalent to one or two actions in an Agent. But you may want to do something more complex: on the left is computer use released by Claude, in the middle is Doubao mobile, and on the right is asynchronous ultra-long tasks done by Manus.

Suppose you want this machine to help you do tasks of dozens or hundreds of steps, or even you say "Please help me collect all the discussions about Tsinghua University on Xiaohongshu today, after discussing, organize everything about so-and-so, and generate related documents for me", at this time AI has to monitor Xiaohongshu for a day. It is automatic, completely asynchronous. You can't turn on the phone and stare at it, it works asynchronously, it is a very complex task. Such a very complex task, in short, can turn the problem just now into a Device Use, that is, how do we do it on the entire device.

There is a bigger challenge inside here. Some people say is it more about collecting data? Actually, the bigger problem is that many applications simply don't have data, it's all code, all cold start. What to do at this time? Of course, we hope more that we can generalize at once through this data. So the earliest indeed was that we collected a large amount of data, thousands of data points, we integrated them, including SFT, including reinforcement in specific domains, making it achieve good results in certain domains.

But more often you will find that phone use originally meant a person clicking buttons, while AI does not have to interact the way a human does. We originally all treated AI as a person, asking if AI can help us operate the phone, but think about it: this AI doesn't really need to operate the phone, it is better served by APIs. But for now you can't turn the phone into a purely API system and remove the buttons, so what to do? We adopt a hybrid approach, mixing API and GUI together. Where the interface is friendly to AI, use the API; where it is only friendly to humans, let AI simulate a human and perform GUI operations. Integrating these two together, we extracted a huge amount of data in a large number of environments and performed fully asynchronous reinforcement learning, thus integrating the whole thing together and giving this AI a certain generalization ability.
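The hybrid API plus GUI idea can be pictured as a dispatcher that prefers a programmatic call when one exists and otherwise falls back to simulated screen operations. The sketch below is purely illustrative: `Step`, `API_REGISTRY`, and `gui_fallback` are hypothetical names, and the strings returned stand in for real actions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Step:
    app: str
    intent: str
    args: dict = field(default_factory=dict)


# Hypothetical registry of (app, intent) pairs that expose an AI-friendly API.
API_REGISTRY: Dict[Tuple[str, str], Callable[[dict], str]] = {
    ("maps", "star_place"): lambda args: f"api:starred {args['name']}",
}


def gui_fallback(step: Step) -> str:
    """Stand-in for simulating human taps and typing on the screen."""
    return f"gui:tap-through {step.app}/{step.intent}"


def execute(step: Step) -> str:
    """Use the API when the app offers one, otherwise operate the GUI."""
    handler = API_REGISTRY.get((step.app, step.intent))
    if handler is not None:
        return handler(step.args)
    return gui_fallback(step)
```

For example, `execute(Step("maps", "star_place", {"name": "South Lake Park"}))` takes the API path, while a step for an app with no registered handler goes through the GUI path.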

I said "a certain generalization ability" because to this day this generalization ability is still far from enough, but there is some. More importantly, how do we overcome the problems brought by cold start? For example, if our data is not enough, reinforcement learning might lead the model into a trap. By the end of reinforcement learning, the model becomes stubborn: it gets stuck in a dead end, insisting "I just want it this way", and the results go off track at once.

How to pull it back at this time? So we interspersed a step of SFT in the middle, allowing this model to reinforce for a while, then do some SFT, then reinforce a bit more, becoming an alternating process, giving it a certain fault tolerance and ability to pull it back, becoming a scalable training algorithm. In the mobile environment, we achieved good improvements in effect on Android. Additionally, on multi-task large model reinforcement learning, we also did some work. Algorithmically mainly adopting multi-round reinforcement learning; in engineering, essentially it is Scaling, letting it go down on a larger scale.
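The alternating process described above (reinforce for a while, do some SFT, reinforce again) can be sketched as a simple schedule. The numbers and update rules below are a toy assumption chosen only to show the "pull it back" effect: the state is a single "drift" value, each RL step adds drift, and each SFT step halves it. None of this reflects the actual training recipe.

```python
from typing import Callable, Iterable, Iterator


def alternating_schedule(rounds: int, rl_steps: int, sft_steps: int) -> Iterator[str]:
    """Yield 'rl' and 'sft' phase labels: rl_steps of RL, then sft_steps of SFT, repeated."""
    for _ in range(rounds):
        for _ in range(rl_steps):
            yield "rl"
        for _ in range(sft_steps):
            yield "sft"


def train(
    state: float,
    schedule: Iterable[str],
    rl_update: Callable[[float], float],
    sft_update: Callable[[float], float],
) -> float:
    """Apply the update that matches each phase label in order."""
    for phase in schedule:
        state = rl_update(state) if phase == "rl" else sft_update(state)
    return state


# Toy dynamics (assumption): RL increases drift by 1.0, SFT halves it.
def rl_step(drift: float) -> float:
    return drift + 1.0


def sft_step(drift: float) -> float:
    return drift / 2.0


rl_only = train(0.0, ["rl"] * 6, rl_step, sft_step)
mixed = train(0.0, alternating_schedule(2, 3, 1), rl_step, sft_step)
print(rl_only, mixed)
```

With the same six RL steps, the interleaved run ends with less accumulated drift than the RL-only run, which is the fault-tolerance intuition the speaker describes.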

AutoGLM Open Source

This year, around December, we open-sourced AutoGLM, open-sourcing everything inside. Everyone note that the model we open-sourced is a 9B model, not a super-large model. The reason is that 9B can act very fast in human-computer interaction, execution speed is exceptionally fast. If it is especially large, its execution speed will be very slow. So we open-sourced a 9B model. Once this model was open-sourced, it got more than twenty thousand stars at that time, and obtained more than ten thousand stars in three days, which is considered not bad.

This is an example. For instance, we want to go to Changchun for fun next week, help us summarize some attractions recommended on the current page, then star these attractions on Amap, including checking ticket prices, then go to 12306 to book a high-speed train ticket from Beijing to Changchun at 10 o'clock, organize relevant information for me. This model will execute 40 steps in the background. It will call different APPs, open different APPs, then input relevant information, related queries, execution. After the entire operation of 40 steps is completed, it gives you everything. It's equivalent to this AI doing something similar to your secretary, executing the whole thing completely.

More importantly, in all Device-use there are several lists, including OSWorld, Browser use, Mobile use related Benches, we all achieved very good results. Actually, you can imagine this model as having used a lot of Agent data in training. We used a lot of Agent data training on the 9B model, actually it might lower many original language capabilities and reasoning capabilities. Meaning it is no longer a purely general model, it might be strong in Agent capabilities, but might weaken in other aspects. So it brings us a new problem, how to make it not degrade on such super-large scale Agent models in the future, this becomes a new problem.

2025: GLM Open Source Year and Contribution of Chinese Open Source Models

Our 2025 is also the GLM open source year. We open-sourced many models roughly from January to December, including language models, agent models, and also our multimodal models, GLM-4.6, 4.6V, 4.5V and other related models. And more importantly, we can see the contribution made by Chinese open source models in 2025. Here the blue ones are open source models, and the black ones are closed source models. We can see that on Artificial Analysis, the top five blue ones are basically all Chinese models, meaning we in China have made a lot of contributions in open source large models. At the beginning of 2025, or in 2024, open source in the US, including Meta LLaMA, still held an absolute advantage. After a year of development, the top five have basically all become Chinese models. The chart on the right is the large model blind test list, that is, results through manual evaluation; I took a screenshot of it.

Sober Understanding: The Gap May Be Widening

Next question: can we continue Scaling in the next step? What is our next AGI paradigm? We face more challenges. We just did some open sourcing, and maybe some people will feel very excited, feeling that China's large models seem to have surpassed the US. Actually, the real answer might be that the gap is still widening, because large models in the US are mostly still closed source. We are pleasing ourselves on the open source side, but the gap is not narrowing as we imagined. In some places we might be doing fine, but we still have to admit the challenges and gaps we face.

Future Thinking: Referencing Human Brain Cognitive Learning Process

What should we do next? I have some simple thoughts here. I think the entire development history of large models has actually been referencing the learning process of human brain cognition. The earliest large models had to memorize all long-term world knowledge, just like children reading books from a young age: memorize all knowledge first, then slowly learn reasoning, learn math problems, learn more deduction and abstraction. For the future, it is the same principle. Looking at human brain cognitive learning, what abilities do large models lack now, where humans far surpass them, that future AI should have?

First, 2025 might be the adaptation year for multimodality. Why do I say this? Maybe globally, except for a few models that suddenly attracted a lot of attention, many multimodal models including ours didn't attract much attention. Everyone is working more on improving text intelligence. For large models, the question is how to collect multimodal information and perceive it uniformly, which is what we often call native multimodal models. Later I thought about it: native multimodal models are very similar to human "sensory integration". Human sensory integration means I collect some visual information, some sound information, and some tactile information; how do I integrate this information together to perceive a thing? When we humans have problems with the brain, many times it is insufficient sensory integration, problems caused by sensory integration disorder. For models, how do we build the next multimodal sensory integration capability?

Second, the model's current memory capability and continuous learning capability are not enough. Humans have several levels of memory systems: short-term memory, working memory, and long-term memory. I even chatted with our students and people in our lab before, and I said that a person's long-term memory doesn't necessarily count as knowledge. Why? Because knowledge only counts when we humans truly record it. For me, for example, if my knowledge is not recorded on Wikipedia, then maybe after 100 years I will be gone, I will have made no contribution to this world, and it doesn't seem to be knowledge either. When large models are trained on humanity's data in the future, my knowledge would be useless too, all becoming noise. How do we extend the memory system from a single person's three levels to humanity's fourth level? This entire memory system is what we need to build for large models in the future.

Finally, reflection and self-awareness. Actually, models now have certain reflection capabilities, but self-awareness in the future is a difficult problem. Many people doubt whether large models have self-awareness capabilities. There are also many experts from foundational model labs present, some support it, some oppose it. I somewhat support it, I think this is possible, and worth our exploration.

System 1 and System 2

Human cognition is a dual system, System 1 and System 2. System 1 completes 95% of tasks. For example, if a human asks a question, what is the capital of China? Everyone's answer is System 1, because you memorized it. Or you say are you eating dinner tonight? You say yes, that's also System 1, these are all System 1 memorized. Only more complex reasoning problems, for example, I want to treat a friend from Sichuan to a big meal tonight, where to eat? At this time it becomes System 2, it has to ponder where this Sichuan friend is from, where do we go for a big meal, that is what System 2 does. System 2 only accounts for 5% in our daily life.

The same principle applies to large models. In 2020 we drew such a diagram. We were saying what an AI system referencing humans should look like. There is human System 1, human System 2, and also a self-learning. Why did we think of self-learning at that time? At that time I thought this way: First, System 1 can build a large model, letting it answer based on matching, solving System 1 problems; System 2 can add some knowledge fusion, such as instruction fine-tuning and Chain of Thought; third is if anyone has studied cognition, the human brain learns unconsciously while sleeping at night. If humans don't sleep at night, they won't become smarter. At that time in 2020, we said there must be AI self-learning mechanisms and self-learning Chain of Thought in the future, but we didn't know how to learn, just throwing the problem out first.

For System 1, we are constantly Scaling. If we constantly Scale data, this brings the improvement of the intelligence upper bound. At the same time we are also Scaling reasoning, enabling the machine to think longer, using more computation and more search to find a more accurate solution. The third aspect is we are Scaling the self-learning environment, giving this machine more opportunities to interact with the outside world and get more feedback. So through these three Scalings, we can let the machine reference human learning paradigms to get more learning opportunities.

Challenges of Transformer and Novel Architectures

For System 1, if we already have Transformer, does it mean we just need to add data and add more parameters and we are done? If 30T isn't enough, then 50T? If 50T isn't enough, then 100T, and finally growing the parameters from 100B to 1T to 3T to 5T or even larger. But we are now facing another problem. The computational complexity of Transformer is O(N²) in the context length N, so as we increase the context, GPU memory use grows and inference efficiency keeps dropping. Many problems are faced here.
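A rough back-of-the-envelope calculation shows what O(N²) means in practice. The sketch below assumes naive attention that materializes the full N×N score matrix for every head, with 32 heads and 2 bytes per value; these numbers are illustrative assumptions, and optimized kernels can avoid storing the full matrix even though the amount of computation stays quadratic. For contrast, it also shows the fixed-size state of a simple linear-attention layer, assuming a head dimension of 128.

```python
def attention_score_bytes(seq_len: int, num_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold the N x N attention scores of one layer (naive attention)."""
    return num_heads * seq_len * seq_len * bytes_per_value


def linear_state_bytes(head_dim: int = 128, num_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Bytes for a d x d recurrent state per head; independent of sequence length."""
    return num_heads * head_dim * head_dim * bytes_per_value


for n in (4_096, 32_768, 131_072):
    gib = attention_score_bytes(n) / 2**30
    print(f"context {n:>7}: naive score matrix = {gib:,.1f} GiB per layer")

print(f"linear-attention state = {linear_state_bytes() / 2**20:.1f} MiB per layer")
```

Doubling the context quadruples the score-matrix size, while the linear-attention state does not grow at all, which is the motivation for the novel architectures mentioned next.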

Recently some novel models have appeared, including linear models that try to use linear methods, taking a cue from the human brain, which stores a large amount of knowledge in a small capacity. An even more fundamental question follows. Transformers have been trained larger and larger from the start; in the earliest discussions nobody said the model had to be small, and bigger came first. But recently I have been reflecting: can we find better knowledge compression methods that pack knowledge into a smaller space? This is a new problem, and it raises two questions. First, is there a way to do it in engineering? Second, is there a way to do it in methodology? So recently many people have been saying that large models may need to return to research and cannot simply Scale as before.

Scaling is a good method, but it may also be the easiest one. It is a way for us humans to be lazy: we just Scale Up directly. For something more fundamental, we may need to find new approaches. The second point is a new Scaling paradigm. Scaling may remain a very important path, but how do we find a new paradigm that gives the machine the opportunity to Scale? Reading is one opportunity, and communicating with people is another. We need to find a new way to let the machine Scale on its own.

Some people might say we should just add more data. But adding data is something we humans impose. The machine must find a way by itself: defining some reward functions itself, defining some interaction methods or even training tasks itself, in order to Scale. This is what System 2 does. More importantly, once we have the previous two, we also need to complete ultra-long tasks in more realistic scenarios. How do we do this part? Let the machine plan like a human: do a bit, check a bit, then feed the result back. Humans work like this; can machines do the same? How do we complete an ultra-long task?
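The "do a bit, check a bit, feed back" loop can be sketched as follows. This is only an illustration under stated assumptions: `act` and `check` are hypothetical stand-ins for a model call and a verifier, and the retry policy is invented for the example rather than taken from the talk.

```python
from typing import Callable, List


def run_long_task(
    steps: List[str],
    act: Callable[[str], str],
    check: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Execute planned steps one at a time, checking each result.

    `act` performs a step and `check` decides whether the result is
    acceptable; both are hypothetical placeholders for a model or tool.
    """
    log: List[str] = []
    for step in steps:
        for attempt in range(max_retries + 1):
            result = act(step)
            if check(step, result):
                log.append(f"ok: {step}")
                break
            # Feedback: record the rejection and try the step again.
            log.append(f"rejected attempt {attempt + 1}: {step}")
        else:
            # Retries exhausted: stop instead of building on a bad result.
            log.append(f"failed: {step}")
            break
    return log


plan = ["draft idea", "run experiment", "write report"]
print(run_long_task(plan, act=lambda s: s.upper(), check=lambda s, r: r == s.upper()))
```

Stopping at the first step that cannot be verified is one possible design choice; a real system might instead replan from that point.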

For example, we wanted to get a small paper out this year. At the beginning of the year, I told our team that they had to deliver a paper for me by the end of the year, but it did not happen, and in the end it was not done. As everyone knows, some papers on the internet have begun to try this: the idea is generated by the model, the experiments are run by the model, and the report is written by the model, with the result reaching the level of a Workshop paper. But in fact it has not really been achieved yet. This is an example of a task in a real ultra-long environment. We hope to define what future AI will look like on this basis. This is some of our thinking.

Five Levels of Intelligence

Before large models, most machine learning was about learning a mapping F from X to Y: I learn a function so that a sample X can be mapped to Y. After large models arrived, we turned this into learning a mapping from X to X. Perhaps what is mapped to is not strictly X either, but we let the model use fully self-supervised learning to do multi-task self-learning.
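One common instance of this "X to X" idea is next-token prediction, where the training targets are cut out of the input sequence itself. The sketch below assumes that objective for illustration; the talk does not specify which self-supervised objective is meant, and the function name is hypothetical.

```python
from typing import List, Tuple


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Turn one unlabeled sequence into (context, target) training pairs.

    Each target is simply the next token of the same sequence, so the
    data supervises itself and no human-provided label Y is needed.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs(["the", "cat", "sat"])
# pairs == [(["the"], "cat"), (["the", "cat"], "sat")]
print(pairs)
```

In the supervised setting, each pair would instead need a separately collected label Y for every sample X.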

The second layer is that, after adding this data, we let these models learn how to reason and how to activate their underlying intelligence. Beyond that, we are teaching the machine self-reflection and self-learning capabilities, so that it can constantly self-criticize and learn which things it should do and which things it could do better. In the future, we also want to teach the machine more, for example self-awareness: letting it explain its own behavior. When AI generates a large amount of content, it should be able to explain itself: why did I generate this content, what am I, what is my goal? Ultimately, maybe one day AI will also have consciousness. We roughly define these five layers of thinking.

Three Core Capabilities of Computers

From the perspective of computers, it is not defined in such a complicated way. In my view, computers have three capabilities. First, representation and calculation: a computer represents data and can calculate on it. Second, programming: programming is the only way a computer interacts with the outside world. Third, essentially, search. Combining these abilities, representation and calculation give storage capacity far beyond that of humans; programming can create logic more complex than humans can; and search can be done faster than humans can. Superimposed, these three capabilities may bring so-called "Super Intelligence", perhaps surpassing some human capabilities.

AGI-Next 30: Vision for the Next 30 Years

I suddenly remembered 2019. This slide actually dates from when we were cooperating with Alibaba. I was asked to produce one slide, and this is the one I made: AGI-Next 30, what we should do in the next 30 years. This image is a screenshot I took, Next AI. In 2019 we said that over the next 30 years we should give machines reasoning capability, memory capability, and consciousness. We are now roughly at the point of having achieved some reasoning capability; I think there is some consensus on that. Memory capability is partly there, but consciousness is not yet, and this is what we are striving for. We are also reflecting that, if we take human brain cognition as the reference, future AI might ask "what am I" and "why am I", constructing a meaning system for the model, along with goals for individual agents and goals for entire agent groups, so that we can achieve exploration of the unknown.

Some people might say this is completely impossible. But remember, the ultimate meaning of being human is that we constantly explore unknown knowledge. What we feel is impossible is perhaps precisely what we need to explore on the road to AGI.

2026 Outlook

For me, 2026 is above all about focus and about doing some relatively new things. First, we will probably continue Scaling. Scaling the known means constantly adding data and constantly exploring the upper limit; Scaling the unknown means searching for a new paradigm we do not yet know. Second, technological innovation. We will pursue brand new model architectures to solve ultra-long context and more efficient knowledge compression, and to realize knowledge memory and continuous learning. Together, these two aspects might be an opportunity to build machines that are a little stronger than humans. Third, multimodal sensory integration, which is a hot spot and a key point this year. With this capability, AI can take on long-duration tasks inside machines. In our human working environments, for example on phones and computers, it can complete our long tasks. When it does, AI takes on a kind of job, becoming able to help us get things done as a human would. Only in this way can AI realize embodiment and enter the physical world. I also believe this year might be a breakout year for AI for Science, because many capabilities have improved greatly and we can do more things.

That concludes my report, thank you everyone!

Value and Aesthetics in Large Models

Speaker: Yang Zhilin (Founder of Moonshot AI/Kimi)

Hello everyone, I am Yang Zhilin from Moonshot AI (Dark Side of the Moon). I'm very happy to be back at Tsinghua. First, I should say up front that I didn't prepare any slides today. The main reason is that I saw Tang Jie hadn't prepared slides either. Another reason is that in the past year, or the year before, or even earlier, I gave many similar tech talks sharing many technical details. But today I want to share something different, some of our recent, relatively new thoughts.

Scaling Law and Token Efficiency

In 2024, the hottest topic at home and abroad was the Scaling Law: has Pre-training hit a wall? Opinion split into two camps, one holding that it has hit a wall and the other that it hasn't. The camp that thinks it has hit a wall has a very strong argument: high-quality data on the Internet, especially reasoning data like Math and Code, has been exhausted. With no more data, the model's intelligence naturally cannot improve further. DeepSeek recently gave a very good answer to this: using RL approaches to let the model self-evolve, breaking the ceiling of static data. This is actually a very classic method; AlphaGo Zero proved it several years ago. So I won't expand on this point today. I want to talk about another perspective, Token Efficiency: how much intelligence is contained in each Token the model generates, or how much intelligence the model can learn from each Token it is trained on.

Since the Transformer came out in 2017, the entire industry has been using the Adam optimizer for seven or eight years. This optimizer is very good, very stable, and everyone is used to it. But recently we released a new optimizer called Muon. We found that with the same amount of data, a model trained with the Muon optimizer learns twice as much intelligence as one trained with Adam. This is a startling number. It means that looking ahead to 2025 or even further, there is still huge room if we just improve the learning efficiency of the model itself. And this efficiency improvement brings a direct result: a substantial reduction in training costs.
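To see why a gain in Token efficiency translates into lower training cost, here is a rough arithmetic sketch. It assumes the widely used rule of thumb that training a dense Transformer costs about 6 × N × D floating-point operations (N parameters, D training tokens) and ignores any per-step overhead of the optimizer itself. The model size and token budget are hypothetical, and the factor of two is the speaker's claim, not something this sketch verifies.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute for a dense Transformer: ~6 * N * D."""
    return 6.0 * params * tokens


def tokens_needed(baseline_tokens: float, efficiency_gain: float) -> float:
    """Tokens needed to match the baseline if each token teaches
    `efficiency_gain` times as much."""
    return baseline_tokens / efficiency_gain


N = 10e9        # hypothetical 10B-parameter model
D_adam = 1e12   # hypothetical 1T-token baseline budget
D_new = tokens_needed(D_adam, efficiency_gain=2.0)

# Under these assumptions the compute ratio is 0.5: half the tokens, half the cost.
ratio = training_flops(N, D_new) / training_flops(N, D_adam)
print(ratio)
```

The same gain can also be spent the other way: keeping the token budget fixed and reaching a stronger model for the same compute.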

Everyone knows that the pricing of our Kimi k1.5 was recently reduced significantly. Why can we cut the price? Because our model training costs have dropped significantly. This cost reduction is not just due to cheaper graphics cards or cheaper electricity; more importantly, it comes from improvement of the algorithm itself. We use fewer resources to train smarter models. This allows us to pass the cost advantage on to users and developers, accelerating the arrival of AGI.

Long Context and the Definition of Agent

Another point I want to share is about Long Context. Everyone knows Kimi was the first to do long context. At the beginning, many people didn't understand, thinking what's the use of such a long context? Just doing RAG (Retrieval-Augmented Generation) is fine. But now everyone finds that long context is becoming standard for all models. Why? Because context is memory. Without memory, there is no intelligence.

In the past couple of days, everyone has been discussing Agents again. What is an Agent? My definition is actually very simple: an Agent is a model that can complete tasks in a complex, dynamic environment. And to survive in a complex environment, the most important ability is to define the environment. How to define the environment? It relies on Context. If a model can only remember five sentences, it cannot understand a complex project's code, it cannot understand a complete case file, and it cannot understand a person's life experience.

So I think Long Context is the infrastructure of the Agent era. Only with sufficiently long Context can the model truly understand the environment, truly understand users, and thus truly solve problems. In 2025, I think we will see the combination of RL and Long Context: using reinforcement learning to let the model learn how to use long context more effectively, how to extract key information from massive amounts of information, and how to conduct long-chain reasoning in long context. This will be an important breakout point.

Technology, Data, and Aesthetics

The third point I want to share is about the relationship between technology, data, and aesthetics. We often say Scaling Law is adding data and computing power. But simply piling up data is not enough. We found that the quality of data determines the upper limit of the model. And what determines the quality of data? It's our "Taste", our aesthetics.

For example, on the Internet now, there is a lot of AI-generated content, the kind of "marketing account style" articles, full of nonsense, using words like "delve", "tapestry", "landscape". If we feed all this data to the model, the trained model will also speak in this "AI flavor". This is not what we want. We want the model to have connotation, to have logic, to speak like a real person, a cultured person.

This requires us to be very picky when doing data selection and cleaning. We need to define what is good data. This definition process is actually imparting our values and aesthetics to the model. So I think Scaling is not just a cold calculation formula, it is a co-evolution of technology, data, and human aesthetics. We create models, and models in turn reshape our data and culture. We hope Kimi is a model with "temperature", and this temperature comes from our persistence in high-quality data.
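As a toy illustration of turning "taste" into a data-selection rule, the sketch below scores documents by the density of the marker words mentioned above and drops those over a threshold. The threshold and function names are assumptions made for the example; a real pipeline would rely on far richer quality signals than a word list, and this is not a description of how Kimi's data is actually filtered.

```python
import re
from typing import Iterable, List

# Marker words taken from the talk; the threshold below is a made-up value.
AI_FLAVOR_WORDS = ("delve", "tapestry", "landscape")


def ai_flavor_score(text: str, markers: Iterable[str] = AI_FLAVOR_WORDS) -> float:
    """Fraction of words in `text` that are marker words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    marker_set = set(markers)
    return sum(w in marker_set for w in words) / len(words)


def keep_documents(docs: List[str], max_score: float = 0.05) -> List[str]:
    """Keep documents whose marker-word density is at or below `max_score`."""
    return [d for d in docs if ai_flavor_score(d) <= max_score]


docs = [
    "Let us delve into the rich tapestry of the AI landscape.",
    "The optimizer diverged at step 4000 because the learning rate was too high.",
]
print(keep_documents(docs))  # only the second document is kept
```

A word list like this is crude: it would also penalize legitimate uses such as "loss landscape", which is exactly why defining good data is a matter of judgment rather than a formula.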

Outlook for 2026 and Definition of AGI

Finally, let me talk about the outlook for the future. Tang Jie talked about 2026, so I'll talk about it too. I think by 2026, we might see the prototype of real AGI. This AGI is not just about passing an exam or writing a piece of code, but about being able to help us expand the boundaries of knowledge.

What is AGI? I think there are three levels of definition. The first level is the capability level, able to complete any intelligence task humans can complete. This we are quickly approaching. The second level is the value level, able to improve the overall efficiency of human society and improve human quality of life. This is what we are doing, like Kimi helping everyone read papers and organize meeting minutes. The third level, which is what I value most, is the civilization level. AGI should be an extension of human life, an expansion of human civilization. It can help us understand the unknown, solve problems we couldn't solve before, like curing cancer and exploring the universe.

Kimi's answer is quite inspiring. It thinks AGI is not just a tool, but a key to raising the upper limit of human civilization and extending the boundaries of human cognition. Many difficult problems we face today, such as cancer, the energy crisis, and social problems, might find answers through it. It is an important key for us to explore the unknown world. So, despite the risks, its answer is to keep developing, because giving up development means giving up the upper limit of human civilization. We should not fear the risks of technology, but should push on to break through, and control the risks well along the way. All technological breakthroughs are accompanied by risks, but we cannot stagnate because of fear. Therefore, we hope that over the coming ten or twenty years we will keep making K4, K5, and all the way to K100 better. Thank you everyone.

AGI-Next Panel Discussion Transcript

Host: Li Guangmi

Li Guangmi: I am the host of the next Panel, Guangmi. Listening from the audience just now, I had several impressions. First, Professor Tang's appeal is very strong, and Tsinghua's talent is excellent: not only in China but also overseas, the proportion of Tsinghua people is very high. I feel this group has widened its gap with other domestic schools in this wave of AI. Second, listening to the Talks just now, I felt that everyone is not just following, not just open-sourcing, and not just doing Coding; everyone is exploring their own product forms. 2025 was the year Chinese open-source models shone, the year the "Open Source Four" shone globally, and a year in which Coding grew 10-20 times. Overseas, people are also asking where Scaling has gone and whether a new paradigm has emerged. So the next Panel, on how to proceed from here, is particularly interesting.

Next, let me invite several guests: Professor Yang Qiang, Professor Tang Jie, Junyang, and Shunyu. Let's start with the first interesting topic: several Silicon Valley companies are clearly differentiating, so we can start from the theme of differentiation. Spec actually gave Chinese models a very big inspiration. Competition in Silicon Valley is so fierce, yet it did not simply follow and do everything; it focused on enterprise, on Coding, and on Agents. I am also wondering which directions Chinese models will differentiate into. I think the theme of differentiation is quite interesting. Shunyu is online. Shunyu, please start us off, and also tell us what you have been doing recently.

Yao Shunyu: Hello everyone, am I a giant face in the venue now?

Yao Shunyu: Sorry I can't come to Beijing in person today, but I am very happy to participate in this event. Recently I've been busy making models and products... Yes, I think it is just a very normal state... It feels quite good to be back in China. Eating is much better.

Li Guangmi: Shunyu, can you expand on your thoughts on the theme of model differentiation? Silicon Valley is differentiating, and Chinese models are also open-sourcing. For example, Anthropic went for Coding, Google Gemini did not do everything either but made full-modality capability stand out first, and your old employer (OpenAI) focuses on To C. Your own experience spans China and the US; what is your perception?

Yao Shunyu: I have two main observations. One is the differentiation between To C and To B; the other is that the path of vertical integration and the path of separating models from applications have also begun to diverge. Let me talk about To C and To B first. When people think of AI Super Apps, there are now two: ChatGPT and Claude, which can be regarded as the representatives of To C and To B respectively. Interestingly, for most people, using ChatGPT today does not feel very different from a year ago. By contrast, a year ago the Coding revolution had not started, and this year, to exaggerate a little, Claude is reshaping how the entire computer industry works. People are no longer writing code, but communicating with computers in English.

The core reason is that in To C, most people most of the time do not actually need such strong intelligence. The model may have become better at abstract algebra, but most people cannot feel it; they mostly still treat it as an enhanced search engine. In To B, however, the higher the intelligence, the higher the productivity, and the more money can be earned. Another obvious point is that in the To B market many people are willing to pay a premium for the strongest model. If the strongest model costs $200/month and the second strongest $50/month, many Americans are willing to pay the premium because it helps them work more efficiently. A very strong model like OpenAI 4.5 might get eight or nine out of 10 tasks right directly, while a slightly weaker model might get only five or six right. The added problem is that you do not know which five or six are correct, so you have to spend extra effort monitoring it. So a very interesting phenomenon I have found is that in the To B market, the differentiation between strong and weak models will become more and more obvious.
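The monitoring cost in this argument can be made concrete with a small calculation. The sketch assumes that every output must be reviewed because failures cannot be identified in advance, and that each failure is then redone by hand. The success rates are midpoints of the figures quoted above, while the review and redo times are invented purely for illustration.

```python
def human_minutes(tasks: int, success_rate: float,
                  review_min: float, redo_min: float) -> float:
    """Expected human time when every output must be reviewed.

    Since the user cannot tell in advance which outputs are wrong,
    all `tasks` are reviewed and the expected failures are redone.
    """
    expected_failures = tasks * (1.0 - success_rate)
    return tasks * review_min + expected_failures * redo_min


# 8-9 of 10 correct vs. 5-6 of 10 correct, using the midpoints.
strong = human_minutes(tasks=10, success_rate=0.85, review_min=5, redo_min=30)
weak = human_minutes(tasks=10, success_rate=0.55, review_min=5, redo_min=30)
print(strong, weak)  # roughly 95 vs. 185 minutes under these assumptions
```

Under these made-up numbers the weaker model costs about twice as much human time, which is one way to read the willingness to pay a premium for the strongest model.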

The second observation is the differentiation between vertical integration and model-application separation. In the past, everyone assumed that vertical integration would do better, but today that is not necessarily so. The model layer and the application layer need different capabilities. For To B productivity scenarios, larger pre-trained models are key, which is hard for product companies to do. Conversely, using a good model well, or making use of a model's surplus capabilities, also requires a lot of work on the application side and the environment side. We find that in To C applications, vertical integration still holds: whether it is ChatGPT or Doubao, models and products are strongly coupled and iterated together. But for To B, the trend seems to be the opposite: model companies focus on making models stronger and stronger, while the application layer wants to use the best models to empower different links in the productivity chain.

Li Guangmi: You have a new identity recently. In the Chinese market, what is your ideal bet? What distinctive characteristics or keywords can you share?

Yao Shunyu: Tencent is a company with stronger To C genes. We think about how to let large models bring more value to users. We have found that the bottleneck for To C is often not a larger model or stronger reinforcement learning, but additional context and environment. An example I often give: if you ask the model "what should I eat today", the result may be very poor whether you ask ChatGPT last year or this year. To make the answer better, what is needed is not a stronger model or search engine, but more input. If the model knows it is very cold today and I want something warm, or knows that my wife is somewhere else and what she wants to eat... With this context, the quality of the answer is completely different. For example, we can forward WeChat chat records to Yuanbao, giving the model more useful input, which brings a lot of extra value to users.

As for To B, it is indeed very difficult in China. Many companies building Coding Agents are actually going after overseas markets. Here, we think about how to serve ourselves well first. One difference between a big company and a startup doing Coding is that a big company already has all kinds of application scenarios and productivity needs of its own. If our models do better in these internal scenarios, the model gains unique advantages and the company develops better, and more importantly, we can capture more diverse scenario data from the real world. Anthropic and OpenAI, for example, are startups; they need data vendors to label data, but the people those vendors can hire and the scenarios they can think of are always limited, so diversity is limited. If you are a company of 100,000 people, there may be many interesting ways to truly make good use of real-world data, rather than relying only on labelers or distillation.

Li Guangmi: Junyang, how do you view the ecological niche of Qwen next?

Lin Junyang: Companies don't necessarily have such distinct genes; perhaps they are shaped by generations of people. For example, after Shunyu went to Tencent, Tencent might become a company with Shunyu genes (laughs). Today To B and To C both serve real humans. So the essence of the question is: what should be done to make the human world better? Even To C products will differentiate further, for example leaning more towards medicine or more towards law. I am willing to believe Anthropic (can do better) not because its Coding is very powerful, but because it communicates a great deal with enterprise customers. I talk with many API vendors in the US, and they did not expect the Token consumption of Coding to be so large. In China, Coding's Token consumption is actually not that large yet. Today Anthropic is doing more finance-related work, which is also an opportunity it saw in communicating with customers. So everyone's differentiation may be a natural differentiation. I am more inclined to trust in AGI and let nature take its course.

Li Guangmi: How does Professor Yang Qiang view the problem of differentiation?

Yang Qiang: Academia has long been an observer while industry leads and races ahead, with the result that many people in academia are now doing industry work. This is a good thing. When astrophysics first started, it was mainly observation, and then theory appeared. Once the many large models reach a steady state, academia should catch up. Academia needs to solve problems that industry has not had time to solve, such as: where is the upper limit of intelligence? Given certain resources, how well can you do? More specifically, how should resources be allocated: how much to training and how much to inference? I did a small experiment in the early 90s: given a certain investment in memory, how much can memory help reasoning? Can that help turn negative? Can too much memory become noise instead? Is there a balance point? These methodological questions still apply today.

I have also been thinking about another problem recently. Computer science has an important theorem, Gödel's Incompleteness Theorem, which roughly means that a system (here, a large model) cannot prove its own consistency from within; it must have some hallucinations that cannot be eliminated. So the question is: how many resources buy how much reduction in hallucinations, or in error rate? There is a balance point in between. This balance point is very much like the balance of risk and return in economics, which is also called the "No Free Lunch Theorem". These problems are particularly suitable for academia and industry to research together.

Professor Tang Jie also mentioned continuous learning just now, which involves the concept of time. As a large model learns continuously, how do we ensure that its learning ability does not decline? Humans have a method: sleep. I suggest everyone read a book called "Why We Sleep", written by two MIT professors. It explains that sleeping every night actually cleans up noise, so that learning accuracy keeps improving the next day instead of two error rates compounding. Theoretical research like this gives rise to new computing modes. Today we may be more focused on Transformer Agent Computing, but it is necessary to make some new explorations, and industry and academia need to align.

Li Guangmi: Zhipu today looks more like it took Anthropic's route, Coding is very strong. Professor Tang Jie, what are your views on the theme of differentiation?

Tang Jie: In 2023, we were the first to build a Chat system, so our first thought then was to get Chat online quickly. But when it went online in August and September 2023, a dozen large models all launched together, and no company had that many users. Of course, today users are even more sharply differentiated. After a year of thinking, we concluded the reason is that Chat is not really solving problems. Our original prediction was that Chat would replace search. Today I believe many people have started using models in place of search, but Google has not been replaced. On the contrary, Google revolutionized its own search. From this perspective, the battle of Chat ended once DeepSeek came out. We should think about what the next Bet is. At the beginning of (2025), our team debated for a long time and decided to bet on Coding, and later we put all our energy into Coding.

Li Guangmi: Betting is a particularly interesting thing. My feeling is that in the past year, China was not only strong in open source; everyone also had their own Bet, so differentiation may come next. Everyone is not just pursuing general capabilities, but also has their own resource endowments and is doing what they are good at even better. Today, pre-training has been going for three years, RL has also become a consensus, and Silicon Valley is discussing the next new paradigm, autonomous learning. Shunyu worked at OpenAI, which promoted two paradigms, Transformer and RL. How do you think about the next paradigm?

Yao Shunyu: Autonomous learning is a very hot term right now. In the streets and cafes of Silicon Valley everyone is talking about it, and it has become a consensus. But from what I observe, everyone defines and views it differently. I'll make two points:

First, autonomous learning is not a question of methodology, but of data or tasks: in what kind of scenario, and based on what kind of reward function, is the autonomous learning done? When you are chatting, the model becoming more and more personalized is one type of autonomous learning. When writing code, becoming more and more familiar with each company's unique environment or documentation is another. When exploring new science, going from understanding what organic chemistry is to becoming an expert in the field is also a type of autonomous learning. The challenges, or the methodology, for each type are not quite the same.

Second, I don't know whether this counts as non-consensus, but it has already happened. ChatGPT is already using user data to keep fitting the style of human chat; is that a type of self-learning? Today 95% of the code for the Claude project is already written by Claude itself, so it is helping itself become better; is that a type of self-learning? In 2022 and 2023, I went to Silicon Valley to present my work, and my first slide at the time said that the most important point of AGI is autonomous learning. An AI system essentially has two parts: first a model, and second a code base that determines how you use the model. Is it used for reasoning or as an Agent? Each has a corresponding code base. The Claude system today essentially has two parts: one is the code for the deployment environment, and the other is a huge pile of code for operations. These examples of autonomous learning may still be limited to specific scenarios and do not yet feel very powerful. My personal view is that autonomous learning is more like a gradual change than a sudden mutation.

Li Guangmi: What signals do you think can be seen for autonomous learning in 2026? What practical problems still need to be broken through?

Yao Shunyu: Many people talk about seeing signals of autonomous learning in 2026, but the signals were already visible in 2025. Cursor learns from the latest user data every few hours, and its new models are also trained on this data from real environments. People feel this progress is not yet earth-shattering because these companies lack pre-training capabilities, so their models are indeed not as good as OpenAI's. But this is clearly a signal of autonomous learning.

The biggest problem is imagination. We can easily imagine what the reinforcement learning or reasoning paradigm looks like once realized. We can picture OpenAI o1: it used to score 10 points on math problems and now scores 80. Through reinforcement learning, o1 has a very strong chain of thought for solving math problems. But if a new model or system realizes self-learning in 2026 or 2027, what tasks should we use, and what results should it show, to make you believe it has been realized? Is it a money-making trading system? Or has it solved scientific problems humans could not solve before? We may need to imagine what it looks like first.

Li Guangmi: OpenAI already has two paradigm innovations to its name. If a new paradigm comes out in 2026 or 2027, which company in the world do you think is most likely to continue leading this paradigm innovation?

Yao Shunyu: OpenAI probably has a higher probability. Its commercialization and various other changes have weakened its innovation gene, but it may still be the place most likely to give birth to a new paradigm.

Lin Junyang: From a more practical perspective, the RL paradigm is also still at an early stage. RL compute has not been scaled that fully yet, much of its potential has not been unleashed, and we also see many Infra problems. Similar problems exist globally, though. As for the next-generation paradigm, I think one candidate is autonomous learning. I was chatting with a friend earlier who said that "humans can't make AI more powerful": for example, as you keep interacting with an AI, its context just gets longer and longer, and the AI gets stupider and stupider. This is very annoying.

Whether Test-time scaling can truly happen, meaning the model emits more tokens and becomes stronger as a result, is worth thinking about. I at least think the o series achieved this to a certain extent. Doing transcendent things may be hard today, but might it be feasible through Coding? The AI scientist work people are doing today is actually quite meaningful, because it takes on difficult things, even things humans have not done; could they be achieved in three days? From this perspective, AI definitely needs this kind of autonomous evolution. But does it require updating parameters? Opinions differ, and everyone may have different technical means of achieving it.

The second point is whether AI can achieve stronger initiative. Today AI must be prompted by a human to start; in the future, could the environment prompt it, so that it thinks and acts autonomously? But this raises a new problem: safety. What worries me most is not AI saying things it shouldn't say, but AI doing things it shouldn't do, for example spontaneously coming up with an idea today such as throwing a bomb into this venue. We definitely do not want such unsafe things to happen, so, as with raising a child, we may need to instill the right directions in it. Still, proactive learning might be quite an important paradigm.

Li Guangmi: Yes, Junyang again mentioned initiative (in AI learning). Initiative may be a critical bet in 2026. If autonomous learning shows signals in 2026, which tasks do you think will show it first? Models training models, with the strongest model improving itself? Or automated AI researchers?

Lin Junyang: I think automated AI researchers don't even need that much autonomous learning. We may soon see AI training AI. Watching what our classmates do every day, I feel they could be replaced soon. The first signals may come more from continuously understanding users; personalization, for example, is quite important. In the recommendation-system days, user information was fed in continuously and made the whole system stronger. But when AI covers every aspect of human life, what is the right metric for personalization? We don't really know. So the bigger technical challenge is that we don't yet know how to do evaluation.

Li Guangmi: If "Memory" is realized, will it be a breakthrough leap in 2026?

Lin Junyang: In my personal view, many so-called technology breakthroughs are really a matter of observation: the technology develops linearly, and people just perceive it very strongly. Even the appearance of ChatGPT was linear growth to those of us building large models. Is the current technical approach to Memory correct? Many approaches aren't a matter of right or wrong, but as for the results, let me use our own to show my incompetence. Our own memory feature looks like it knows what I did in the past, but it has only memorized past events and doesn't seem very smart. Will memory reach a critical point where it really feels like a living person, as in the movie "Her", understanding your memories and human feelings? That probably still needs about a year. Technology often doesn't develop that fast; it is just that everyone is caught up in involution and feels there is something new every day, while the technology itself develops linearly. What we do every day is really quite unglamorous, and the bugs are embarrassing to even mention. If we have achieved these results working this way, then once algorithms and infra are better combined there may be much more we can do.

Li Guangmi: Professor Yang Qiang, please.

Yang Qiang: I have always worked on federated learning, whose main idea is multiple centers collaborating. I increasingly see parties that have insufficient local resources but whose local data carries strong privacy and security requirements. As large models become more capable, how can general large models collaborate with local specialized small models, or domain expert models? Such collaboration is becoming more and more feasible. Take Zoom in the US: the AI system built by Huang Xuedong has a large base that everyone can plug into. In a decentralized setting it can both protect privacy and communicate and collaborate effectively with general large models. This open-source mode is particularly good: one part is open source of knowledge, the other is open source at the code and model level. Especially in scenarios like healthcare and finance, we will see this more and more.

Li Guangmi: Professor Tang, please.

Tang Jie: Continuous learning, Memory, even multi-modality could all see new paradigm shifts. Why would such paradigms emerge now? Industry used to run far ahead of academia. I remember that when I returned to Tsinghua last year and the year before, many professors had almost zero cards. Industry had 10,000 cards while the school had 0 or 1, a 10,000-fold gap. But now many schools have many cards, and many professors have done a lot of research on large models; in Silicon Valley too, many professors have started research on model architecture and continuous learning. We used to feel that industry dominated all this, but by the end of 2025 and the beginning of 2026 that may largely no longer be the case. There may still be a 10-fold gap between schools and industry, but the seeds have been planted.

First, academia has the gene for innovation. Second, an innovation appears when there has been massive investment in something and efficiency bottlenecks have emerged. Large models now receive huge investment, but efficiency is not high. Continuing to scale definitely still brings returns: maybe we needed 10T of data at the beginning of 2025 and need 30T now, and we could even scale to 100T. But after scaling, how large is the return, and what is the compute cost? If you don't innovate and spend 1 billion, then 2 billion, for a small return, it isn't worth it. On the other hand, if every new intelligence innovation requires retraining a base model and retraining RL, the return efficiency also shrinks. In the future we may define a new paradigm for measuring return. On one hand, if we want to raise the upper limit of intelligence, the dumbest way is scaling. On the other hand, we should define intelligence efficiency: obtaining the same improvement in intelligence with less scaling. So a paradigm change will definitely happen in 2026, and we are working hard in the hope that it happens with us.

Li Guangmi: Like Professor Tang, I am very optimistic. At every leading model company, compute grows about 10 times every year; everyone has more computing resources, and more and more talent is flowing in. With more cards everywhere, some experimental project may break through at some point. Everyone has high expectations for Agents in 2026: that they can automate one to two weeks of human work and no longer be just a tool. This may be the key year for Agents to create economic value. Several companies in Silicon Valley have gone end-to-end from model to AGI. Shunyu, you have spent a lot of time on Agent research. In 2026, can Agents really help humans automate one to two weeks of work? From a model company's standpoint, how do you think about this?

Yao Shunyu: To B and To C may differ. On the To B side, Agents are on a steadily rising curve with no sign of slowing down. Anthropic is very interesting: it doesn't do flashy innovation, it just scales up pre-training, does RL well, and solves real-world tasks, so the model gets smarter and brings more value. In To B all the goals are consistent: the more intelligent the model, the more tasks it solves and the higher the revenue. To C is different. We all know OpenAI's To C problem: DAU and model intelligence are often unrelated, or even inversely related. This is another important reason Anthropic can stay focused: it just keeps making the model better, its revenue keeps rising, and everything is aligned.

Besides the model itself, there are currently two bottlenecks. One is the environment and deployment problem. Before OpenAI, I interned at a To B customer service company and learned a lot. Even if models stopped improving today, simply deploying existing models at companies around the world might bring 10 or 100 times today's revenue and affect GDP by 5%-10%, whereas today the impact on GDP is far below 1%.

The other very important thing is education. The gap between people is widening: it is not AI replacing people, but people who can use AI tools replacing people who can't. It is like when computers were first invented: the difference between someone who switched to learning programming and someone still calculating with a slide rule was huge. Perhaps one of the most meaningful things China can do today is better education: teaching everyone how to make better use of products like Claude or ChatGPT. Of course, Claude may not be usable in China, but we can use domestic models like Kimi or Zhipu.

Li Guangmi: Junyang, Qwen also has an ecosystem. How do you weigh building Agents yourselves against supporting general Agents in the ecosystem? Can you share your thoughts?

Lin Junyang: This touches on product philosophy. Products like Manus are of course very successful. Is wrapping a shell around a model the future? That is a real question. I lean toward the view that "Model is Product". I have talked with people from TML (Thinking Machines Lab), and they hold that "Researcher is Product": many researchers can act as their own product managers and do things end-to-end. Our own researchers today all want to do more things that face the real world. I believe upcoming Agents can do what was just described, and that this is closely tied to the self-evolution and active learning mentioned earlier. An agent that works for that long has to evolve along the way and decide for itself what to do, because the instruction it receives is a very general task. So our agents increasingly look like managed agents rather than something I interact with back and forth. This demands high model capability: the model is the agent, and the agent is the product, all integrated. From this perspective, if we keep raising the upper limit of model capability, including through test-time scaling, it can indeed do this.

Another point is interaction with the environment. We currently interact with computer environments, which are not complex enough. I have a friend working on AI for science. If you do AlphaFold-style work for drug discovery, today's AI may not help that much, because you have to run experiments; you can't do everything in the computer, and you have to direct robots to run experiments to get feedback. Human efficiency here is currently very low, and we even have to hire many contractors to run experiments. If AI can interact with the real physical world, that is where I imagine Agents doing long-duration work, rather than only inside the computer. Some tasks in computer environments may be completed very soon, within this year, but over the next three to five years Agent tasks may combine with embodied intelligence, which will be more interesting.

Li Guangmi: Let me follow up with a sharper question. From your perspective, is the general Agent an opportunity for entrepreneurs?

Lin Junyang: Building foundation models doesn't qualify me to be an entrepreneurship mentor. I can only borrow a line from a successful person: Peak (Manus co-founder) said the most interesting thing about a general Agent is solving long-tail problems, or put differently, the greater charm of AI today lies in the long tail. Head problems are actually easy to solve. When I worked on recommendation, we saw that recommendations were very concentrated, with the goods all in the head; we wanted to push tail items, but my attempt back then went disastrously. For someone from NLP and multi-modality, trying to solve the Matthew effect was basically a dead end. I think today's so-called AGI is solving exactly this problem. A user thinks: I searched everywhere and couldn't find anyone to help me with this problem. And at that moment they feel the power of AI: something they couldn't find in any corner of the world, the AI can help solve. Maybe this is AI's greatest charm. Should you build a general Agent? If you are a "shell wrapping" expert who can wrap better than the model companies, then go ahead. But without that confidence, the problem may be better left to the model companies, because when they hit a problem they can just train the model a bit, burn some cards, and maybe the problem is solved. So it's a matter of opinion.

Li Guangmi: And for long-tail problems, model companies have compute plus data, so it seems you can solve them quite fast too, right?

Lin Junyang: The most interesting thing about RL (Reinforcement Learning) today is that we've found fixing problems is easier than before. It used to be hard. Take a B-end customer case: they say they want to do SFT (Supervised Fine-Tuning) themselves and ask us for the ratio of general data to mix in. That gives us a headache every time, because we feel the customer doesn't really know how to do SFT; their data is actually poor, though they may believe it is useful. With RL you may need only a very small amount of data, and you don't even need labels: you just need the queries and a reward (reward function), train a bit, and merging the result back is actually very easy.

Yang Qiang: I think Agents should emerge in four stages, depending on whether the goal and the planning are defined by humans or automatically by AI. We are at the most elementary stage: the goal is defined by humans and the planning is done by humans. So today's Agent software systems are basically a more elaborate prompt language. I predict that large models will observe human work and use human process data, and eventually both goal and planning will be defined by large models; the Agent should be a system native to the large model.

Li Guangmi: Professor Tang Jie, please.

Tang Jie: Several factors determine the future of Agents. First, does the Agent actually solve a human problem, and how valuable is that? Early Agents, such as the many built with GPTs, turned out to be very simple; people found that a prompt could do the job, and most of those Agents slowly died. So the first factor is how valuable the problem the Agent solves is, and whether it really helps people. Second, how large is the cost of doing it? A very high cost is also a problem; as Junyang said, maybe calling an API solves it. But conversely, if calling an API can solve it, then once the API provider sees the thing is valuable it will build it in. This is a contradiction; the base model and applications are always in tension. Finally, the speed of building applications. If I have a time window, say half a year, and can quickly satisfy the application's need, then after half a year the question becomes how to iterate, how to connect, and how to move forward. Large models so far have been more a competition of speed and time. Maybe we got it right and will go further in this area; maybe it fails after half a year, and that half year is gone. This year we did only a little in Coding and Agents; our Coding call volume is now quite good, and that is one direction, while doing Agents is another direction for the future.

Li Guangmi: Thanks. In the past, model companies had to chase general capabilities, so they may not have prioritized this kind of exploration. Now that general capabilities have caught up, we look forward to Zhipu and Qwen having more of their own Claude moments and Memory moments in 2026. Fourth question, which also looks to the future: in three to five years, how likely is it that the world's leading AI company is a Chinese team? To go from today's follower to tomorrow's leader, what key conditions are needed? Shunyu, you have experienced both Silicon Valley and the Chinese market. What is your estimate of the probability, and what key conditions are needed?

Yao Shunyu: The probability is quite high; I am still quite optimistic. As things stand, anything once discovered can be quickly reproduced in China and done better in many respects; this has happened repeatedly, for example in manufacturing and electric vehicles. I think there are several key points. One is whether China's lithography machines can break through: if compute ultimately becomes the bottleneck, can we solve the compute problem? We currently have very good advantages in power and infrastructure. The major bottlenecks are production capacity, including lithography machines, and the software ecosystem; solving these would help a great deal. Another question is whether, beyond To C, there can be a more mature To B market, or an opportunity to compete in the international business environment. Today many productivity or To B models and applications are still born in the US, because willingness to pay is stronger and the culture is more favorable. Doing this in China is difficult today, so people choose to go overseas or internationalize. These two are the major objective factors. More important are the subjective ones. I have been talking with many people recently, and our sense is that China has a great many very strong people: once something is proven doable, many will try it eagerly and want to do it better. But there may not be enough people in China who want to break through to new paradigms or do very risky things, for reasons that include the economic environment, the business environment, and culture. If I add one more point, subjectively we need more people with an entrepreneurial or adventurous spirit who really want to do frontier exploration or new-paradigm work. As things stand, once a paradigm appears we can do better locally with very few cards and very high efficiency. But can we lead a new paradigm? This may be the only problem China needs to solve today, because in everything else, whether business, industrial design, or engineering, we already do better than the US to some extent.

Li Guangmi: Let me follow up with one more question for Shunyu. Is there anything you would call for regarding research culture in Chinese labs? You have also experienced OpenAI and DeepMind. What are the differences between Chinese and American research cultures, and what fundamental impact do they have on an AI-native company? Any calls or suggestions?

Yao Shunyu: Research culture differs from place to place; the differences among American labs may be bigger than those between Chinese and American labs, and the same is true within China. I personally see two points. One is that in China people still prefer to do safer things. Pre-training, for example, has been proven doable. It is actually very hard, with many technical problems to solve, but once something is proven doable we are all confident we can figure it out within a few months or some period of time. But asking someone today to explore long-term memory or continuous learning, where nobody knows how to do it or whether it can be done, is still relatively difficult. It may not just be that people prefer certain things and are less willing to do innovative ones. A very important point is that cultural accumulation, or overall understanding, takes time to settle. OpenAI started doing this in 2022 and domestic teams started in 2023, so the understanding will differ somewhat, though the gap for China is perhaps not that large. Much of it is just a matter of time: as a deeper culture or foundation accumulates, it subtly shapes how people work. But it is very subtle and hard to capture in rankings. China also places more weight on chasing leaderboard rankings or numbers. One reason DeepSeek does relatively well is that they may care less about ranking numbers and more about, first, what the right thing to do is, and second, what feels good or bad in their own experience. This is quite interesting: Claude models may not be the highest on programming or software engineering leaderboards, but everyone knows they are the best to use. This requires people to step outside the constraints of these rankings and persist with the process they believe is right.

Li Guangmi: Thanks, Shunyu. Junyang, please talk about the probability and the challenges.

Lin Junyang: This is a dangerous question; in principle one shouldn't pour cold water on an occasion like this. On probability, I'd like to talk about the differences I have felt between China and the US. American compute may be 1-2 orders of magnitude larger than ours overall, but what I see is that OpenAI and others invest massive compute into next-generation research. We, by comparison, are stretched thin: delivery alone may already take up the vast majority of our compute. This is a big difference, and perhaps a question as old as history: does innovation happen in the hands of the rich or the poor? The poor are not without opportunities. We feel these rich brothers really waste cards, training a lot of things that may turn out useless. Being poor, we work on things like joint optimization of algorithms and infra; if you are very rich, you have no motivation to do that. Shunyu just mentioned the lithography machine problem. Another possible point for the future is hardware-software co-design: is it really possible to build the next-generation model and chip together? When I was building large models in 2021, because Alibaba makes chips, their people came to me and asked: can you predict whether the model will still be a Transformer in three years, and whether it will be multi-modal? Why three years? He said they needed three years to tape out. My answer then was: in three years I don't know whether I'll still be at Alibaba. But today I am still at Alibaba, and sure enough it is still Transformer and still multi-modal. I really regret not urging him to do it back then. Our communication at the time was like a chicken talking to a duck: he told me a bunch of things I completely didn't understand, and when I talked, he didn't know what we were doing either, so we missed the opportunity. Could this opportunity come again? We may be a group of poor people, but poverty gives rise to change; will the opportunity for innovation arise here? Our education is getting better. I was born in the early 90s, Shunyu in the late 90s, and our team has many people born in the 00s; I feel everyone's adventurous spirit is getting stronger. Americans naturally have a very strong adventurous spirit. A typical example: when electric cars first came out, even when the roof leaked and driving one might get you killed, many tycoons were still willing to do it. In China, I believe tycoons would not; everyone would do very safe things. But today the adventurous spirit is improving and the Chinese business environment is improving too, so I think some innovation is possible. The probability is not that high, but it is really possible.

Li Guangmi: If you had to pick a number? The probability that, three to five years from now, the world's leading company is a Chinese company.

Lin Junyang: I think 20%, and 20% is already very optimistic, because there are many reasons rooted in historical accumulation.

Li Guangmi: One more follow-up. Regarding the gap between Chinese and American models: in some places China is catching up, while in others, such as compute, the gap is widening. Are you very afraid of the gap widening?

Lin Junyang: In this line of work you can't be afraid; you need a very strong mentality. As I see it, being able to do this work at all, to work on large models, is already very lucky. It still depends on what your original intention is. Shunyu mentioned a point just now: on the C end your model doesn't necessarily have to be that strong. I'd look at it from another angle: what value has our model brought to human society? As long as I believe what I build can bring enough value to society and help people, I am willing to accept it not being the strongest.

Li Guangmi: Thanks, Junyang. Professor Yang, please. You have lived through many AI cycles and seen many Chinese AI companies become the strongest in the world. What is your judgment on this question?

Yang Qiang: We can look back at the development of the Internet. It also started in the US, but China caught up very quickly, and applications like WeChat are number one in the world. I think AI is a technology, not an end product, and we in China have plenty of ingenuity and will push products to the extreme, whether To B or To C. I am more bullish on To C, because a hundred flowers bloom when Chinese people pool their ideas. To B may face some constraints, such as willingness to pay and corporate culture, but these are changing too. I have also been looking at the business side recently and discussing it with business school students. For example, the US has a company called Palantir. One of its ideas is that whatever stage AI has reached, there is always something good in AI to apply to enterprises, and there is always a gap in between that needs bridging. It has a method called ontology. From what I have seen, the idea is roughly the transfer learning we used to do: applying a general solution to a specific practice, using the ontology for knowledge transfer. This method is very clever. Of course it is delivered through an engineering approach called the Forward Deployed Engineer (FDE). In any case, this is well worth learning from. Chinese enterprises such as AI-native companies should develop such To B solutions, and I believe they will. So To C will definitely see a hundred flowers bloom, and To B will also catch up very quickly.

Li Guangmi: Thanks, Professor Yang. Professor Tang, please.

Tang Jie: First, we have to admit that in research, especially at industry AI labs, China has a gap with the US. That said, China is getting better and better, especially the enterprises of the 90s and 00s generations, which are far better than before. I once said at a meeting that our generation is the unluckiest: the previous generation is still working, we are working too and haven't had our day yet, and unfortunately the next generation has already emerged; the world has been handed to them, seamlessly skipping our generation. That is a joke. As for China's chances: first, there is now a group of smart people who really dare to do very risky things. They exist in the 00s generation and the 90s generation; Junyang, Kimi, and Shunyu are all very willing to take such risks. Second, our environment could be better, whether the national environment, for example competition between large and small enterprises and the problems startups face, or the business environment. As Junyang just said, he is still doing delivery. If this environment were built better, so that smart people who dare to take risks have more time to innovate, that is perhaps something our government and our country can help improve. Third, it comes back to each of us: can we persist? Can we be willing and daring enough to take a risk on one road, given that the environment is not bad? The environment will never be the best; never expect it to be. We are actually lucky to have lived through an era in which the environment went from not so good to gradually better. We are witnesses, and perhaps the ones who have gained the most. If we persist stubbornly, maybe we will be the ones who make it to the end. Thank you, everyone!

Li Guangmi: Thank you, Professor Tang. We also want to call for more resources and funding to be invested in China's AGI industry and for more compute, so that more young AI researchers can get their hands on cards. After perhaps three to five years of that, China will have several Ilya Sutskevers of its own. This is what we very much look forward to over the next three to five years.
