QwQ-32B: Embracing the Power of Reinforcement Learning

Original link: https://qwenlm.github.io/blog/qwq-32b/

QwQ-32B is a 32-billion-parameter large language model that demonstrates the power of reinforcement learning (RL) for enhancing large language models, achieving performance comparable to the much larger DeepSeek-R1. By building on a strong pretrained foundation model and applying outcome-based reinforcement learning, QwQ-32B excels at mathematical reasoning, coding, and general problem solving.

The RL process runs in two stages: the first stage focuses on math and code, using an accuracy verifier and a code execution server; the second stage extends to general capabilities, using a reward model and rule-based verifiers. This approach improves instruction following, alignment with human preferences, and agent performance without sacrificing math and coding ability. QwQ-32B also integrates agent capabilities, enabling critical thinking, tool use, and adaptive reasoning based on environmental feedback. The model is open-sourced under the Apache 2.0 license and is accessible via Hugging Face, ModelScope, and Qwen Chat, demonstrating the potential of reinforcement learning to advance artificial general intelligence (AGI). Future work will focus on stronger foundation models, RL powered by scaled compute, and integrating agents for long-horizon reasoning.

This Hacker News thread discusses QwQ-32B, a reinforcement-learning-trained language model with a large context window of roughly 130K tokens. Users are impressed by its reasoning ability, particularly on math and coding tasks, and see it as potentially rivaling much larger models such as DeepSeek R1. However, some users report that it "overthinks", leading to slow generation, and suffers from "catastrophic forgetting" in long reasoning chains.

The discussion also covers the practicalities of running QwQ-32B locally with Ollama. Users note that Ollama's default context length (2048 tokens) is misleading and must be adjusted manually, and they compare parameter settings and quantization levels. Some users criticize Ollama's minimalist approach to implementation.

Overall, the thread highlights QwQ-32B as a promising model while pointing out the challenges of local deployment and the need for better infrastructure for testing and evaluating AI models. The thread also touches on broader topics, such as international competition in AI, the impact of tariffs, and where computing resources are headed.
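
As a concrete illustration of the context-length issue raised in the thread, the sketch below overrides Ollama's 2048-token default through its REST API. This is a minimal example, not taken from the thread; the local model tag ("qwq") and the chosen num_ctx value are assumptions that depend on how the model was pulled and how much memory is available.

    # Minimal sketch: query a locally served QwQ-32B through Ollama's REST API while
    # raising the default 2048-token context window. The model tag and num_ctx value
    # are assumptions; adjust them to your local setup.
    import requests

    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "qwq",  # assumed local model tag
            "messages": [
                {"role": "user", "content": "How many r's are in the word \"strawberry\"?"}
            ],
            "options": {"num_ctx": 32768},  # override the 2048-token default
            "stream": False,  # return a single JSON object instead of a token stream
        },
        timeout=600,
    )
    response.raise_for_status()
    print(response.json()["message"]["content"])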

  • Original Article

    QWEN CHAT Hugging Face ModelScope DEMO DISCORD

    Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods. Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models. For instance, DeepSeek R1 has achieved state-of-the-art performance by integrating cold-start data and multi-stage training, enabling deep thinking and complex reasoning.

    Our research explores the scalability of Reinforcement Learning (RL) and its impact on enhancing the intelligence of large language models. We are excited to introduce QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge. Furthermore, we have integrated agent-related capabilities into the reasoning model, enabling it to think critically while utilizing tools and adapting its reasoning based on environmental feedback. These advancements not only demonstrate the transformative potential of RL but also pave the way for further innovations in the pursuit of artificial general intelligence.
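
    As a rough illustration of what such tool use can look like at inference time, the sketch below shows a generic loop in which the model proposes a tool call, the environment executes it, and the observation is appended to the conversation before the model continues reasoning. This is a hypothetical outline, not QwQ-32B's actual agent interface; call_model and run_tool are placeholder helpers supplied by the caller.

    # Hypothetical agent loop: the model alternates between reasoning and tool use,
    # folding each tool result back into the conversation as environmental feedback.
    # call_model() and run_tool() are placeholders, not part of any Qwen API.
    def agent_loop(call_model, run_tool, user_query, max_turns=8):
        messages = [{"role": "user", "content": user_query}]
        for _ in range(max_turns):
            reply = call_model(messages)          # model output: text plus an optional tool request
            messages.append({"role": "assistant", "content": reply["content"]})
            if "tool_call" not in reply:          # no tool requested: the answer is final
                return reply["content"]
            result = run_tool(reply["tool_call"])  # execute the requested tool in the environment
            messages.append({"role": "tool", "content": result})  # feed the observation back
        return messages[-1]["content"]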

    QwQ-32B is open-weight on Hugging Face and ModelScope under the Apache 2.0 license and is accessible via Qwen Chat.

    Performance

    QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.

    Reinforcement Learning

    We began with a cold-start checkpoint and implemented a reinforcement learning (RL) scaling approach driven by outcome-based rewards. In the initial stage, we scaled RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated code successfully passes predefined test cases. As training episodes progressed, performance in both domains showed continuous improvement. After the first stage, we added a second stage of RL for general capabilities, trained with rewards from a general reward model and some rule-based verifiers. We found that this stage of RL training, even with a small number of steps, can increase performance on other general capabilities, such as instruction following, alignment with human preferences, and agent performance, without a significant performance drop in math and coding.
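
    The outcome-based rewards described above can be pictured as simple pass/fail checks rather than learned reward models. The sketch below illustrates that idea under two assumptions not stated in the post: a ground-truth answer is available for each math problem, and each coding problem comes with predefined test cases. It is an illustrative sketch, not the verifier or code execution server actually used to train QwQ-32B.

    # Illustrative outcome-based rewards, assuming reference answers for math problems
    # and (stdin, expected_stdout) test cases for coding problems. This is NOT the
    # QwQ-32B training code; real pipelines would sandbox code execution.
    import subprocess
    import sys

    def math_reward(model_answer: str, reference_answer: str) -> float:
        """Accuracy verifier: reward 1.0 only if the final answer matches the reference."""
        return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

    def code_reward(generated_code: str, test_cases: list[tuple[str, str]]) -> float:
        """Execution-based reward: fraction of predefined test cases the code passes."""
        passed = 0
        for stdin_text, expected_stdout in test_cases:
            proc = subprocess.run(
                [sys.executable, "-c", generated_code],
                input=stdin_text, capture_output=True, text=True, timeout=10,
            )
            if proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip():
                passed += 1
        return passed / len(test_cases) if test_cases else 0.0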

    Use QwQ-32B

    Below are brief examples demonstrating how to use QwQ-32B via Hugging Face Transformers and the Alibaba Cloud DashScope API. The first example runs the model locally with Transformers; the second streams responses from the OpenAI-compatible DashScope endpoint.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/QwQ-32B"

    # Load the model weights and tokenizer from the Hugging Face Hub
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    prompt = "How many r's are in the word \"strawberry\""
    messages = [
        {"role": "user", "content": prompt}
    ]
    # Wrap the user message in the model's chat template and append the generation prompt
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generate the response; reasoning models can emit long chains of thought,
    # so a generous token budget is allowed here
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=32768
    )
    # Strip the prompt tokens so only newly generated tokens are decoded
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
    
    # Example 2: stream a response from the Alibaba Cloud DashScope API, which exposes
    # an OpenAI-compatible endpoint and returns the model's reasoning separately from
    # the final answer
    from openai import OpenAI
    import os

    # Initialize the OpenAI client against the DashScope endpoint
    client = OpenAI(
        # If the environment variable is not configured, replace with your API key: api_key="sk-xxx"
        # How to get an API key: https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
    )

    reasoning_content = ""  # accumulated chain-of-thought tokens
    content = ""            # accumulated final-answer tokens

    is_answering = False    # becomes True once the final answer starts streaming

    completion = client.chat.completions.create(
        model="qwq-32b",
        messages=[
            {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
        ],
        stream=True,
        # Uncomment the following lines to return token usage in the last chunk
        # stream_options={
        #     "include_usage": True
        # }
    )

    print("\n" + "=" * 20 + "reasoning content" + "=" * 20 + "\n")

    for chunk in completion:
        # If chunk.choices is empty, this is the final usage chunk
        if not chunk.choices:
            print("\nUsage:")
            print(chunk.usage)
        else:
            delta = chunk.choices[0].delta
            # Stream the reasoning tokens as they arrive
            if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
                print(delta.reasoning_content, end='', flush=True)
                reasoning_content += delta.reasoning_content
            else:
                # Print a divider the first time answer tokens appear
                if delta.content != "" and is_answering is False:
                    print("\n" + "=" * 20 + "content" + "=" * 20 + "\n")
                    is_answering = True
                # Stream the final answer tokens
                print(delta.content, end='', flush=True)
                content += delta.content
    

    Future Work

    This marks Qwen’s initial step in scaling Reinforcement Learning (RL) to enhance reasoning capabilities. Through this journey, we have not only witnessed the immense potential of scaled RL but also recognized the untapped possibilities within pretrained language models. As we work towards developing the next generation of Qwen, we are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence (AGI). Additionally, we are actively exploring the integration of agents with RL to enable long-horizon reasoning, aiming to unlock greater intelligence with inference-time scaling.
