SATO: Stable Text-to-Motion Framework

ACM MULTIMEDIA 2024

Wenshuo Chen*, Hongru Xiao*, Erhang Zhang*, Lijie Hu, Lei Wang, Mengyuan Liu, Chen Chen

ABSTRACT

Is the text-to-motion model robust? Recent advancements in text-to-motion models have primarily stemmed from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated to stable attention, stable prediction, and balancing the accuracy-robustness trade-off, respectively. We present a methodology for constructing a SATO that satisfies both attention and prediction stability. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while maintaining high accuracy.

Existing Challenges

Text-to-Motion task challenge: Variability of textual inputs

A fundamental challenge inherent in text-to-motion tasks stems from the variability of textual inputs. Even when conveying similar or identical meanings and intentions, texts can exhibit considerable variations in vocabulary and structure due to individual user preferences or linguistic nuances. Despite the considerable advancements made in these models, we find a notable weakness: all of them exhibit unstable predictions when encountering minor textual perturbations, such as synonym substitutions. We establish a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. A model's stability manifests in the consistency of its textual attention and its ability to handle perturbations in text features, which plays a pivotal role in mitigating such errors.

Motivation

Figure: attention visualization on original vs. perturbed text. JSD (Jensen-Shannon divergence) measures the difference between attention vectors; the smaller, the better. The intensity of the color represents the magnitude of the attention weight.
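
For reference, a minimal numerical sketch of this divergence between two attention-weight vectors; the weights below are hypothetical and serve only to illustrate the metric, not to reproduce the paper's figures:

```python
# Minimal sketch (illustrative, not the paper's code): Jensen-Shannon
# divergence between two attention-weight vectors.
import numpy as np

def jsd(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = p / p.sum(), q / q.sum()        # normalize to probabilities
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical attention over the same four words before/after perturbation.
attn_original  = np.array([0.05, 0.40, 0.35, 0.20])
attn_perturbed = np.array([0.06, 0.38, 0.36, 0.20])
print(jsd(attn_original, attn_perturbed))  # small value -> similar attention
```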

The model's inconsistent outputs are accompanied by unstable attention patterns. We further elucidate the experimental findings above: when perturbed text is provided as input, the model exhibits unstable attention, often neglecting critical text elements necessary for accurate motion prediction. This instability further complicates encoding the text into consistent embeddings, leading to a cascade of temporal motion generation errors.

Our Approach

Attention Stability. For the original text input, we can directly observe the model's attention vector over the text. This attention vector reflects the model's attentional ranking of the words, indicating how important each word is to the text encoder's prediction. A stable attention vector should maintain a consistent ranking even after perturbations.
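
One plausible way to encode this constraint, assuming the text encoder exposes its attention weights as probability distributions over tokens, is to penalize the divergence between the original and perturbed attention. The sketch below is illustrative, not the released SATO implementation:

```python
# Minimal PyTorch sketch (an assumed formulation, not the paper's code):
# penalize the Jensen-Shannon divergence between the attention the text
# encoder assigns to the original and to the perturbed text.
import torch
import torch.nn.functional as F

def attention_stability_loss(attn_orig: torch.Tensor,
                             attn_pert: torch.Tensor,
                             eps: float = 1e-12) -> torch.Tensor:
    # attn_orig, attn_pert: (batch, seq_len), rows summing to 1.
    m = 0.5 * (attn_orig + attn_pert)
    kl = lambda p, q: F.kl_div((q + eps).log(), p, reduction="batchmean")
    # A small JSD keeps the two distributions, and in practice the word
    # rankings they induce, close to each other.
    return 0.5 * kl(attn_orig, m) + 0.5 * kl(attn_pert, m)
```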

Prediction Robustness. Even with stable attention, we still cannot guarantee stable results: perturbations alter the text embeddings even when the attention vectors remain similar. This requires us to impose a further restriction on the model's predictions. Specifically, in the face of perturbations, the model's output should remain consistent with the original distribution, i.e., robust to perturbations.
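
A minimal sketch of such a restriction, under the assumption that the model predicts a distribution over discrete motion tokens (as in GPT-style generators like T2M-GPT); this illustrates the idea rather than reproducing the exact SATO objective:

```python
# Minimal PyTorch sketch (assumed formulation): pull the perturbed-text
# prediction toward the original-text prediction.
import torch
import torch.nn.functional as F

def prediction_robustness_loss(logits_orig: torch.Tensor,
                               logits_pert: torch.Tensor) -> torch.Tensor:
    # logits_*: (batch, num_motion_tokens) raw scores from the decoder.
    p = F.softmax(logits_orig.detach(), dim=-1)  # original output as reference
    log_q = F.log_softmax(logits_pert, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")  # KL(p || q)
```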

Balancing the Accuracy-Robustness Trade-off. Accuracy and robustness are naturally in tension. Our objective is to bolster stability while minimizing the decline in model accuracy, thereby mitigating catastrophic errors arising from input perturbations. Consequently, we require a mechanism that upholds the model's performance on the original input.
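
Putting the pieces together, a hedged sketch of how the three terms might be combined; lambda_attn and lambda_pred are hypothetical weights that set the operating point on the accuracy-robustness curve:

```python
# Minimal sketch (assumed formulation): the task loss on the original input
# anchors accuracy, while the two stability terms regularize attention and
# predictions under perturbation.
def sato_objective(task_loss, attn_loss, pred_loss,
                   lambda_attn: float = 1.0, lambda_pred: float = 1.0):
    return task_loss + lambda_attn * attn_loss + lambda_pred * pred_loss
```

Larger weights favor robustness at some cost in clean-text accuracy, which is exactly the trade-off this module is meant to balance.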

Comparisons to Prior Work

Quantitative evaluation on the HumanML3D and KIT-ML datasets.

Visual Comparison to the State-of-the-art Approaches

(T2M-GPT, MDM, MoMask)

(1) Synonym Replacement: Replacing one or more words or phrases within a sentence with synonyms. (SATO (T2M-GPT) denotes SATO fine-tuned from T2M-GPT.) A minimal sketch of this kind of perturbation appears after the examples below.
Example 1
Original text: a man kicks something or someone with his left leg.
Perturbed text: a human boots something or someone with his left leg.

Explanation: T2M-GPT, MDM, and MoMask all fail to produce the kicking ("boots") motion. This is a catastrophic error.

Example 2
Original text: Walking forward in an even pace.
Perturbed text: Going ahead in an even pace.

Explanation: T2M-GPT lacks the action of moving forward, while MoMask's forward movement is uneven.

Example 3
Original text: A person uses his right arm to help himself to stand up.
Perturbed text: A human utilizes his right arm to help himself to stand up.

Explanation: T2M-GPT, MDM, and MoMask all lack the action of transitioning from squatting to standing up, resulting in a catastrophic error.

Example 4
Original text: Person is walking normally in a circle.
Perturbed text: Human is walking usually in a loop.

Explanation: None of T2M-GPT, MDM, and MoMask walks in a loop.
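
For illustration, a minimal sketch of automatic synonym substitution using nltk's WordNet interface. The released perturbation set is curated for semantic fidelity, so treat this naive helper as a hypothetical stand-in for the idea, not the dataset-construction pipeline:

```python
# Minimal sketch (hypothetical helper, not the dataset pipeline):
# replace one random word in a sentence with a WordNet synonym.
# Requires: nltk.download("wordnet")
import random
from nltk.corpus import wordnet

def replace_with_synonym(sentence: str) -> str:
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    if not candidates:
        return sentence  # nothing replaceable
    i = random.choice(candidates)
    synonyms = {l.name().replace("_", " ")
                for s in wordnet.synsets(words[i]) for l in s.lemmas()}
    synonyms.discard(words[i])
    if synonyms:
        words[i] = random.choice(sorted(synonyms))
    return " ".join(words)

print(replace_with_synonym("a man kicks something with his left leg"))
```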

(2) Cross-dataset evaluation (unseen text data): train on HumanML3D, test on KIT-ML.
Example
Original text: A human walks a quarter of a circle to the right.
Perturbed text: A native motions a quarter of a loop to the right.

Explanation: T2M-GPT, MDM, and MoMask all remained stationary instead of moving a quarter of a loop to the right, resulting in a catastrophic error.

(3) Component Deletion (sentence structure perturbation): removing a part of the sentence.
Example 1
Original text: A person runs and jumps forward.
Perturbed text: Runs and jumps forward.

Explanation: MDM produces no motion, while T2M-GPT and MoMask lack the running action.

Example 2
Original text: A person leaps forward then stands straight.
Perturbed text: Leaps forward then stands straight.

Explanation: T2M-GPT, MDM, and MoMask all merely step forward without leaping.