My Paper List from the Last Three Months
In the past three months, I embarked on an ambitious project in Project Zomboid to build an intelligent society and civilization made up of AI agents (more context here). During this endeavor, I began reading all the critical papers in this field. I learned far more than I expected; previous research has revealed a wonderful world and shown me just how beautiful “intelligence” can be.
So, I want to share some of these papers here and list my key takeaways.
1. Scaling Law
This is perhaps one of the most significant papers from the “pre-training era.” It shows that more training compute, larger datasets, and larger parameter counts form a sort of “chemical reaction formula” that gave rise to larger, smarter models and led to the earliest large language models.
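For reference, the rough shape of the scaling laws from that paper, written in my own notation (the constants and exponents are empirical fits, not values I'm quoting):

```latex
% Test loss as a power law in parameters N, dataset size D, and compute C,
% when the other two factors are not the bottleneck.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```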
Although this is just a Twitter thread rather than a formal paper, it points toward an important direction. The “chemistry” of pre-training scaling can’t go on forever because of data limitations, so researchers are exploring ways to overcome this bottleneck. A series of studies (centered on “o1,” among others) suggests that both train-time compute (e.g., more RL and self-play) and test-time compute (e.g., more reasoning time) can significantly boost performance in the post-training phase. This opens up a promising new phase for post-training strategies.
2. Models
This part, covering the original Transformer and BERT architectures as well as the GPT series (GPT-1 through GPT-4), has been one of my favorite readings. These papers highlight how the GPT models evolved and how research directions shifted over time, helping me better understand the fundamental LLM architecture.
I also highly recommend reading about Llama 3.1 and DeepSeek-v3.
1) The Llama 3.1 report details how they curated their data, their full training process, and their hardware infrastructure—very helpful for practical engineering.
2) DeepSeek-v3 uses an MoE architecture alongside a corresponding reward model, and the full training cost is astonishingly low given the team's resource constraints.
3. Post-Training
I put these two papers together because they both essentially used human-feedback data and PPO to fine-tune the base model, which I think had a huge impact on LLMs at the time and turned GPT from a small toy into a genuinely usable productivity tool.
This work was also important for LLM safety: getting the model to output safe content, align with human values, and so on. (Anthropic even released Constitutional AI, which uses a set of written principles to critique and correct model outputs.)
These methods are also especially important today for training domain-specific models: fine-tuned with high-quality instructions and data, small 1B/2B models can outperform much larger models like Llama 405B, or even o1, within a very specific domain.
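To make the shared recipe concrete, the core objective in this line of work can be written roughly as: maximize the learned reward while penalizing drift from the supervised baseline (my notation, a simplified sketch rather than either paper's exact loss):

```latex
% \pi_\phi is the policy fine-tuned with PPO, \pi^{SFT} the supervised baseline,
% r_\theta the reward model trained on human preference data, \beta the KL penalty weight.
\max_{\pi_\phi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot\mid x)}
\left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
```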
4. Fine-Tuning
Codex (the paper behind GitHub Copilot) exemplifies the power of fine-tuning. By training GPT specifically on code tasks, it achieved code-generation capabilities far beyond those of the original model. This paper shows why fine-tuning is so powerful and why it will likely play an increasingly important role: in real-world domains, users often need specialized capabilities (e.g., coding, finance, biology) rather than a purely general model.
LoRA has become one of the most widely adopted fine-tuning techniques. Its core idea is to freeze most parameters of the base model and introduce low-rank matrices in selected layers to learn new properties for a specific task. This dramatically reduces training and storage overhead.
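A minimal sketch of the idea in PyTorch (illustrative only; real implementations such as the peft library handle initialization, scaling, and weight merging more carefully):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (B A)x * scale."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the original weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T * self.scale
```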
HydraLoRA, presented at NeurIPS 2024 (oral), extends LoRA with an MoE-style approach. Instead of training a separate LoRA module for each task, HydraLoRA routes different task types to different modules, as sketched below. I think this is especially important for areas like robotics or game NPCs, where an MoE approach suggests more generalizable capabilities.
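Here is my rough mental model of that structure: a shared down-projection with several task-specific up-projection heads mixed by a small router. This is only an illustration of the MoE-style idea under my own assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MultiHeadLoRA(nn.Module):
    """Illustrative MoE-style LoRA: one shared A matrix, several B heads, and a router that mixes them."""
    def __init__(self, base: nn.Linear, rank: int = 8, num_heads: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # base weights stay frozen
            p.requires_grad = False
        self.shared_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.heads_B = nn.Parameter(torch.zeros(num_heads, base.out_features, rank))
        self.router = nn.Linear(base.in_features, num_heads)

    def forward(self, x):
        gate = torch.softmax(self.router(x), dim=-1)                   # (..., num_heads)
        low_rank = x @ self.shared_A.T                                 # (..., rank)
        per_head = torch.einsum("...r,hor->...ho", low_rank, self.heads_B)
        delta = torch.einsum("...h,...ho->...o", gate, per_head)       # router-weighted mixture
        return self.base(x) + delta
```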
Transformer^2, proposed by Sakana AI Labs, is the most fascinating recent paper I've read. While it can split user-input tasks in a manner similar to an MoE approach (e.g., handling code and math in different blocks), it differs from LoRA by leveraging SVD (Singular Value Decomposition) and SVF (Singular Value Fine-Tuning).
Its goal is even broader: to adapt to unfamiliar external environments over the model’s entire lifecycle by dynamically updating its own weights, ultimately achieving the ability to learn and evolve. This concept of continuous update could become one of the main fine-tuning techniques for our PZ project.
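As a rough sketch of what singular-value fine-tuning means under my reading (not Sakana's actual implementation): decompose each weight matrix once with SVD, freeze U and V, and train only a small per-singular-value scaling vector:

```python
import torch
import torch.nn as nn

class SVFLinear(nn.Module):
    """Illustrative SVF: W = U diag(s) V^T stays frozen; only the scale vector z over singular values is trained."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        U, S, Vh = torch.linalg.svd(base.weight.detach(), full_matrices=False)
        self.register_buffer("U", U)      # (out, k), frozen
        self.register_buffer("S", S)      # (k,),    frozen
        self.register_buffer("Vh", Vh)    # (k, in), frozen
        if base.bias is not None:
            self.register_buffer("bias", base.bias.detach())
        else:
            self.bias = None
        self.z = nn.Parameter(torch.ones_like(S))   # trainable per-singular-value scale

    def forward(self, x):
        W = self.U @ torch.diag(self.S * self.z) @ self.Vh   # rebuild the adapted weight
        out = x @ W.T
        return out if self.bias is None else out + self.bias
```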
5. Prompt Engineering
These three papers focus on prompting techniques. I’ve read many other papers on the topic, but most are less clear than a blog post by lilianweng. If you’re interested, I highly recommend it.
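As a trivial example of the kind of technique these cover, here is a few-shot chain-of-thought prompt (the wording is mine, purely illustrative):

```python
# Minimal few-shot chain-of-thought prompt: one worked example, then the real question.
prompt = """Q: A survivor has 3 cans of food and finds 2 more. How many cans now?
A: Let's think step by step. Start with 3 cans, find 2 more, and 3 + 2 = 5. The answer is 5.

Q: A safehouse holds 4 survivors and 3 more arrive. How many survivors now?
A: Let's think step by step."""
```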
6. Agent
This research is quite interesting: its memory management system lets the agents record and reflect on their behaviors through a combination of long- and short-term memory.
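A sketch of the retrieval idea as I understand the paper: each stored memory is scored by recency, importance, and relevance to the current situation, and the top-scoring memories are pulled into the agent's context (the weights, decay rate, and field names below are placeholders of mine):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieval_score(memory, query_embedding, now_hours,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0, decay=0.995):
    """Score one memory record; higher-scoring memories are recalled first."""
    recency = decay ** (now_hours - memory["last_accessed_hours"])   # exponential decay over time
    importance = memory["importance"] / 10.0                         # e.g. an LLM-rated 1-10 score
    relevance = cosine_similarity(memory["embedding"], query_embedding)
    return w_recency * recency + w_importance * importance + w_relevance * relevance
```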
I believe this paper spurred many subsequent projects on game agents and agent research. Another noteworthy project, Project Sid, uses a Minecraft environment for a similar world simulation, showing agents collaborating to achieve fascinating outcomes.
This overview by lilianweng is my go-to reference whenever someone asks about LLM-based agents. If you’re curious about agent approaches, this is an excellent place to start.
7. o1
Although asking GPT about o1's technical secrets triggers a blocking message, everyone is working on uncovering them. In Scaling of Search and Learning, you can find a summary of how o1 may use RL and MCTS (Monte Carlo Tree Search) methods.
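For reference, the standard node-selection rule in MCTS is the UCT formula below; whether and how o1 actually uses anything like this is speculation in these write-ups, not something confirmed by OpenAI:

```latex
% Q(s,a): average return of child a; N(s): visits to the parent; N(s,a): visits to the child;
% c balances exploitation against exploration.
a^{*} = \arg\max_{a}\left[\, Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \,\right]
```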
8. CNN and RL
Early in my research, I was inspired by AlphaStar, so I spent more time studying CNN and RL-based algorithms, especially ResNet and PPO.
However, when I moved into LLMs, I decided to abandon the purely PPO-driven approach, switching instead to an LLM-centric method that leverages PPO fine-tuning and self-play for in-game improvements.
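Since PPO keeps coming up, here is its clipped surrogate objective in its standard form (not tied to any specific paper above):

```latex
% r_t(\theta) is the probability ratio between the new and old policy; \hat{A}_t is the
% advantage estimate; clipping keeps each update close to the previous policy.
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```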
9. Game AI Survey
Game AI is one of the most fascinating parts of the field imo. As mentioned before, game environments can serve as perfect testing grounds for researchers to freely try out even the craziest ideas.
My main takeaways are as follows:
1) The core techniques involved: RL algorithms, self-play, embedding tricks, and so on, along with experimental setups and benchmarking methods.
2) The history of AI in gaming tracks the overall development of AI: for example, DQN, AlphaGo, and AlphaStar are all major milestones. More researchers focus on open-world games today; many rely on Minecraft, but we chose the PZ environment.
3) Trying something new or setting ambitious goals can lead to extraordinary results. As Ilya mentioned in an interview with Jensen Huang, OpenAI’s challenge in Dota 2 (OpenAI Five) directly influenced the development of RLHF with PPO - a technique originally unrelated to next-word prediction that helped pave the way for a “ChatGPT moment.”
10. LLM Interpretability
I believe this area will become one of the most critical and intriguing in the future. Chris Olah’s interview drew my attention to it.
(Shout-out to Chris Olah 🫡)
Interpretability is like viewing LLMs as living organisms: looking at which words activate their neural networks and which don’t. In one of Olah’s examples, the network’s representation of the word “trump” is particularly notable, possibly because it’s one of the most frequent human names in the dataset.
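A toy example of what “looking at activations” can mean in practice: attach a forward hook to one transformer block and see which hidden units fire for each token. This is only a minimal probe, nowhere near the sparse-autoencoder tooling Anthropic's interpretability work actually uses:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

captured = {}
def hook(module, inputs, output):
    captured["acts"] = output[0].detach()         # hidden states from this block

handle = model.h[6].register_forward_hook(hook)   # hook one middle transformer block
tokens = tokenizer("The word trump appears here", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

# For each token, which hidden units have the largest activations?
top_units = captured["acts"][0].abs().topk(5, dim=-1).indices
print(top_units)
```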
Reading these articles made me appreciate the beauty of LLMs: just as we seek to uncover the secrets of the human brain, we find ourselves exploring another fascinating, complex “species” by examining these networks closely.
11. Human Personality
One of the most interesting aspects for my ongoing Project Zomboid research is crafting unique personalities for agents and observing how they evolve over a game session. This can result in fascinating “chemical reactions” as their behaviors and interactions unfold. I’m also curious whether multiple agents might form cultures (e.g., capitalist or communist leanings) or produce figures akin to humanity’s great leaders, following the guidance of a single agent.
However, to study this effectively, we need to revisit the ubiquitous intelligence around us—the human brain. There’s a wealth of insights to be found in that domain. For example, one relevant paper I read addressed the Big Five personality framework, while another examined identical twins raised in different environments to explore whether personality is innate or acquired.
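For our agents, a persona along these lines could be as simple as a Big Five trait profile injected into the system prompt. The structure and numbers below are entirely hypothetical, just to show how we might encode it:

```python
# Hypothetical Big Five profile for one Project Zomboid agent, each trait scored 0.0-1.0.
persona = {
    "name": "Dana",
    "openness": 0.8,           # curious, experiments with new survival strategies
    "conscientiousness": 0.4,  # disorganized, hoards supplies haphazardly
    "extraversion": 0.7,       # seeks out other survivors, proposes group plans
    "agreeableness": 0.3,      # argumentative, slow to trust strangers
    "neuroticism": 0.6,        # panics under zombie pressure
}

system_prompt = (
    f"You are {persona['name']}, a survivor. "
    + " ".join(f"{trait}: {score:.1f}." for trait, score in persona.items() if trait != "name")
)
```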