New Apple Study Shows LLMs Can Tell What You're Doing from Audio and Motion Data

Original link: https://9to5mac.com/2025/11/21/apple-research-llm-study-audio-motion-activity/

## Apple Explores Using LLMs for Activity Recognition

Apple researchers have demonstrated the potential of large language models (LLMs) to accurately infer user activity, even *without* task-specific training. Their study, "Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition," has an LLM combine insights from audio descriptions and motion tracking (via IMU), rather than the raw data itself, to identify activities such as cooking, working out, or watching TV.

Using the Ego4D dataset, the study shows that LLMs achieve accuracy significantly above chance in both zero-shot (no prior examples) and one-shot (a single example) classification. This "late fusion" approach, which combines the outputs of specialized models with an LLM, is especially valuable when training data is limited.

Apple stresses that this could make activity analysis more precise, particularly when sensor data is incomplete. Notably, the researchers have publicly released their experimental materials to encourage further work in this area, potentially paving the way for more nuanced, context-aware health and activity tracking features.

A new Apple study featured on Hacker News shows that large language models (LLMs) can infer user activity from audio and motion data alone. Rather than processing the raw data directly, the system first converts audio and motion into natural-language descriptions, *then* feeds those to an LLM for analysis.

This sparked discussion about ever-expanding surveillance capabilities, with commenters drawing comparisons to Orwell's *1984* and recalling earlier concerns about data collection by apps such as Facebook. A key point raised: data collected today, even if it seems harmless now, could be repurposed as technology advances; future progress in quantum computing, for example, might decrypt it. Users voiced concern about how hard it is for individuals to protect themselves from this growing surveillance and about the long-term implications of data retention. The research is technically interesting, but it also raises privacy questions.

Original Article

Apple researchers have published a study that looks into how LLMs can analyze audio and motion data to get a better overview of the user’s activities. Here are the details.

They’re good at it, but not in a creepy way

A new paper titled “Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition” offers insight into how Apple may be considering incorporating LLM analysis alongside traditional sensor data to gain a more precise understanding of user activity.

This, they argue, has great potential to make activity analysis more precise, even in situations where there isn’t enough sensor data.

From the researchers:

“Sensor data streams provide valuable information around activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data. We curated a subset of data for diverse activity recognition across contexts (e.g., household activities, sports) from the Ego4D dataset. Evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion from modality-specific models can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Additionally, LLM-based fusion can enable model deploying without requiring additional memory and computation for targeted application-specific multimodal models.”

In other words, LLMs are actually pretty good at inferring what a user is doing from basic audio and motion signals, even when they’re not specifically trained for that. Moreover, when given just a single example, their accuracy improves even further.

One important distinction is that in this study, the LLM wasn't fed the actual audio recording, but rather short text descriptions generated by audio models and an IMU-based motion model (which tracks movement through accelerometer and gyroscope data).
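To make that concrete, here's a minimal sketch of how text outputs from modality-specific models might be stitched into a single prompt for an LLM. The function name, captions, and prompt wording below are illustrative assumptions, not the prompts Apple actually used.

```python
# Hypothetical sketch of late fusion via text: outputs from modality-specific
# models (an audio captioner and an IMU activity classifier) are combined into
# one natural-language prompt for an LLM. Names and wording are assumptions,
# not the prompts used in the Apple paper.

def build_fusion_prompt(audio_caption: str,
                        audio_labels: list[str],
                        imu_prediction: str) -> str:
    """Combine per-modality text outputs into a single classification prompt."""
    return (
        "You are given descriptions of a 20-second sensor recording.\n"
        f"Audio caption: {audio_caption}\n"
        f"Audio tags: {', '.join(audio_labels)}\n"
        f"Motion (IMU) model prediction: {imu_prediction}\n"
        "Based on this evidence, what activity is the person most likely doing? "
        "Answer with a short activity name."
    )

# Example inputs that upstream models might produce for one segment.
prompt = build_fusion_prompt(
    audio_caption="Water running and dishes clinking in a sink.",
    audio_labels=["water", "dishes", "indoor"],
    imu_prediction="standing, repetitive arm movement",
)
print(prompt)  # This text, not raw audio or IMU samples, is what the LLM sees.
```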

Diving a bit deeper

In the paper, the researchers explain that they used Ego4D, a massive dataset of media shot in first-person perspective. The data contains thousands of hours of real-world environments and situations, from household tasks to outdoor activities.

From the study:

“We curated a dataset of day-to-day activities from the Ego4D dataset by searching for activities of daily living within the provided narrative descriptions. The curated dataset includes 20 second samples from twelve high-level activities: vacuum cleaning, cooking, doing laundry, eating, playing basketball, playing soccer, playing with pets, reading a book, using a computer, washing dishes, watching TV, workout/weightlifting. These activities were selected to span a range of household and fitness tasks, and based on their prevalence in the larger dataset.”

The researchers ran the audio and motion data through smaller models that generated text captions and class predictions, then fed those outputs into different LLMs (Gemini-2.5-pro and Qwen-32B) to see how well they could identify the activity.

Then, Apple compared the performance of these models in two different situations: one in which they were given the list of the 12 possible activities to choose from (closed-set), and another where they weren’t given any options (open-ended).
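Here's a hedged sketch of how those two settings might differ in prompt form, with an optional one-shot example tacked on; the wording and helper function are hypothetical, not the prompts released with the paper.

```python
# Illustrative sketch of the closed-set vs. open-ended prompting settings for
# the 12-class task, with an optional one-shot example. The wording and helper
# are hypothetical, not the prompts released with the paper.

ACTIVITIES = [
    "vacuum cleaning", "cooking", "doing laundry", "eating",
    "playing basketball", "playing soccer", "playing with pets",
    "reading a book", "using a computer", "washing dishes",
    "watching TV", "workout/weightlifting",
]

def make_prompt(evidence: str, closed_set: bool, one_shot_example: str = "") -> str:
    parts = []
    if one_shot_example:
        # One-shot: show a single worked example before the test segment.
        parts.append("Example:\n" + one_shot_example + "\n")
    parts.append("Sensor-derived description:\n" + evidence + "\n")
    if closed_set:
        # Closed-set: the model must pick one of the 12 candidate activities.
        parts.append("Choose exactly one activity from this list: " + "; ".join(ACTIVITIES))
    else:
        # Open-ended: no candidate list is given; the model answers freely.
        parts.append("In a few words, what activity is the person doing?")
    return "\n".join(parts)

evidence = ("Audio caption: ball bouncing and sneakers squeaking on a court.\n"
            "IMU prediction: running with frequent stops and jumps.")
print(make_prompt(evidence, closed_set=True))
```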

For each test, the models were given different combinations of audio captions, audio labels, IMU activity predictions, and extra context, and their performance was compared across these configurations.
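The study reports F1 scores rather than plain accuracy, so as a rough, purely hypothetical illustration of what "significantly above chance" means for a 12-way problem, the toy snippet below scores made-up predictions with macro F1 against a uniform random-guess baseline (roughly 1/12 ≈ 8% expected accuracy). None of the data here comes from the paper.

```python
# Toy illustration (not the paper's data): macro F1 for 12-class activity
# predictions vs. a uniform random-guess baseline.
import random
from sklearn.metrics import f1_score

ACTIVITIES = [
    "vacuum cleaning", "cooking", "doing laundry", "eating",
    "playing basketball", "playing soccer", "playing with pets",
    "reading a book", "using a computer", "washing dishes",
    "watching TV", "workout/weightlifting",
]

# Fabricated ground truth and hypothetical LLM outputs, for illustration only.
ground_truth = ["cooking", "eating", "watching TV", "washing dishes", "reading a book", "cooking"]
llm_predictions = ["cooking", "eating", "watching TV", "cooking", "reading a book", "cooking"]

llm_f1 = f1_score(ground_truth, llm_predictions, labels=ACTIVITIES,
                  average="macro", zero_division=0)

# Chance baseline: guessing uniformly among 12 classes gives ~1/12 ≈ 0.083
# expected accuracy, and a similarly low macro F1.
random.seed(0)
random_guesses = [random.choice(ACTIVITIES) for _ in ground_truth]
chance_f1 = f1_score(ground_truth, random_guesses, labels=ACTIVITIES,
                     average="macro", zero_division=0)

print(f"LLM macro F1: {llm_f1:.3f}  |  random-guess macro F1: {chance_f1:.3f}")
```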

In the end, the researchers note that the results of this study offer interesting insights into how combining multiple models can benefit activity and health analysis, especially in cases where raw sensor data alone is insufficient to provide a clear picture of the user's activity.

Perhaps more importantly, Apple published supplemental materials alongside the study, including the Ego4D segment IDs, timestamps, prompts, and one-shot examples used in the experiments, to assist researchers interested in reproducing the results.


