OpenAI o1 Replication Progress: DeepSeek-R1

DeepSeek AI recently released DeepSeek-R1 (DeepSeek-AI, 2025), whose reasoning performance on multiple benchmarks approaches that of OpenAI’s o1 (OpenAI, 2024), marking a significant step for the open-source community in replicating o1. Related code can be found in open-r1, Hugging Face’s open-source attempt to replicate R1. While previous research has often relied on massive amounts of supervised data to enhance the performance of Large Language Models (LLMs), the success of DeepSeek-R1 and its earlier experiment, DeepSeek-R1-Zero, demonstrates the potential of large-scale reinforcement learning alone to improve the reasoning capabilities of LLMs. This success reinforces the insight proposed by Richard Sutton in “The Bitter Lesson”: ...

2025-01-27 · 48 min · 10156 words · Yue Shui

Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA

Background The Transformer (Vaswani et al., 2017) is a model based on the encoder-decoder architecture. It has demonstrated outstanding performance in natural language processing (NLP) and has led to a series of optimized models built on it, such as BERT (Devlin et al., 2018), which uses only the encoder, the GPT series (Radford et al., 2018), which uses only the decoder, and subsequent large language models (LLMs) like LLaMA (Touvron et al., 2023) and GPT-4 (OpenAI et al., 2024), most of which adopt a decoder-only architecture. ...

2025-01-16 · 29 min · 6139 words · Yue Shui

Building Domain-Specific LLMs

Background With the widespread application of Large Language Models (LLMs) across various industries, enterprises and research teams face an urgent need to adapt general-purpose models to specific domains. Foundational LLMs often fail to meet deep domain-specific requirements when handling specialized tasks. For example, when working with closed-source programming languages, existing open-source models lack sufficient understanding of the languages’ syntax and semantics, leading to poor performance in tasks such as code generation and error correction. Therefore, injecting domain knowledge and training dedicated LLMs has become a key step in enhancing development efficiency and code quality. ...

2025-01-05 · 21 min · 4340 words · Yue Shui

Building a Home Deep Learning Rig with Dual RTX 4090 GPUs

Rent a GPU or Buy Your Own? Before setting up a deep learning environment, consider usage duration, budget, data privacy, and maintenance overhead. If you have long-term needs (e.g., over a year) and require strict data security, building your own GPU server often provides lower overall costs and a more controllable environment. On the other hand, for short-term projects or when data privacy is not critical, renting cloud GPUs (e.g., Azure, AWS, GCP) or using free platforms (Colab, Kaggle) offers greater flexibility. ...

2024-12-21 · 10 min · 1988 words · Yue Shui

Stock Price Prediction and Quantitative Strategy Based on Deep Learning

Abstract The stock market is a crucial component of the financial market. As the market has developed rapidly in recent years, research on stock price prediction and quantitative investment strategies has attracted scholars from various fields. With advances in Artificial Intelligence (AI) and Machine Learning (ML), researchers have shifted from traditional statistical models to AI algorithms. Particularly after the deep learning boom, neural networks have achieved remarkable results in stock price prediction and quantitative investment strategy research. The objective of deep learning is to learn multi-level features, constructing abstract high-level features by combining low-level ones and thereby mining the distributed feature representations of data. This approach enables complex nonlinear modeling for prediction tasks. Recurrent Neural Networks (RNNs) have been widely applied to sequential data, such as natural language and speech. Daily stock prices and trading information are also sequential data, which has led many researchers to use RNNs for stock price prediction. However, basic RNNs suffer from vanishing gradients when the number of layers is excessive. The advent of Long Short-Term Memory (LSTM) networks addressed this problem, followed by variants such as Gated Recurrent Units (GRUs), Peephole LSTMs, and Bidirectional LSTMs (BiLSTMs). Traditional stock prediction models often overlook temporal factors or consider only unidirectional temporal relationships. Therefore, this paper employs the BiLSTM model for stock price prediction. In principle, the BiLSTM model fully leverages the contextual relationships in both the forward and backward temporal directions of time series data. It also mitigates the vanishing and exploding gradient problems in long sequences, enabling better learning of long-term dependencies. ...

2021-04-21 · 65 min · 13702 words · Yue Shui