In the "Major Revision (Differences Highlighted)" document, we use blue text to indicate additions (e.g., add words), purple for deletions (e.g., delete wordsdelete), and red for text replacements (e.g., new wordsoriginal wordsreplace). This convention is also applied to modifications within tables. For revised figures, changes are documented in the corresponding caption (e.g., new imgoriginal wordsreplace img).
A. Q1: A concrete measurement of the reduction in labeling effort for a small percentage of log instances (e.g. 2%).
We have revised the content related to Experiment RQ2 (Section IV, Paragraphs B and D2, and Figure 5) to investigate how the performance of LogAction varies as the labeling effort increases from 0% to 5% in increments of 0.5%. This analysis assesses whether LogAction can maintain its effectiveness with a small percentage of labeled log instances. Our findings indicate that LogAction achieves an F1 score exceeding 89.96% with a labeling effort of only 0.5%.
Furthermore, on most datasets, LogAction attains optimal performance when the labeling percentage reaches 2%. We emphasize that the 2% labeling cost was set for fairness. The objective of LogAction is to balance labeling cost and performance, allowing the proportion of labels to be adjusted according to practical needs. This point is further discussed in the Threats to Validity section (Section IV, Paragraph E). Please refer to the revisions in Section IV, Paragraphs B, D2, and E, and Figure 5.
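To make the measurement protocol concrete, a minimal sketch of such a budget sweep is given below; `train_logaction` and `evaluate_f1` are hypothetical stand-ins for our actual training and evaluation routines.

```python
import numpy as np

# Hypothetical helpers: train_logaction(label_budget=...) trains LogAction
# with the given fraction of actively labeled target logs, and
# evaluate_f1(model) returns the F1 score on the held-out target test set.
def sweep_labeling_budget(train_logaction, evaluate_f1):
    """Measure F1 as the labeling budget grows from 0% to 5% in 0.5% steps."""
    budgets = np.linspace(0.0, 0.05, 11)  # 0.0, 0.005, ..., 0.05
    scores = {}
    for budget in budgets:
        model = train_logaction(label_budget=budget)
        scores[round(float(budget), 3)] = evaluate_f1(model)
    return scores
```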
B. Q2: The missing details
We have addressed the previously missing details in our revisions, including the hyperparameters, a discussion of the fixed-size log window, and the rationale underlying the two-step active learning process.
1) The settings of the hyperparameters: We have added a list of hyperparameters along with their descriptions in the Experimental section (Section IV, Paragraph A and Table II).
Additionally, we provide a detailed explanation of the rationale behind our hyperparameter choices and analysis in the Threats to Validity section of the Experimental part (Section IV, Paragraph E).
2) A discussion of the fixed-size log window: Since the other baselines employ a fixed-size log window when processing the BGL, Thunderbird, and Zookeeper datasets, LogAction also adopts a fixed-size log window to ensure fairness. This is discussed in the Threats to Validity section of the Experimental part (Section IV, Paragraph E).
Additionally, the experimental results of LogAction on the HDFS dataset (Section IV, Table III), which uses a variable-size log window to generate log sequences, also demonstrate effective performance.
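To illustrate the difference between the two segmentation schemes, the sketch below contrasts fixed-size windowing (BGL, Thunderbird, Zookeeper) with session-based, variable-size windowing (HDFS); the window size of 20 is an example, not our actual setting.

```python
from collections import defaultdict

def fixed_size_windows(events, window_size=20):
    """Group a parsed log-event stream into fixed-size sequences,
    as done for BGL, Thunderbird, and Zookeeper."""
    return [events[i:i + window_size]
            for i in range(0, len(events), window_size)]

def session_windows(events, session_ids):
    """Group events by session identifier (e.g., an HDFS block id),
    yielding variable-size sequences."""
    sessions = defaultdict(list)
    for event, sid in zip(events, session_ids):
        sessions[sid].append(event)
    return list(sessions.values())
```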
3) The rationale behind the two-step active learning process: We provide a further explanation of the rationale behind the two-step active learning process in the Approach section (Section III, Paragraph C).
Additionally, we include ablation experiments on the two active learning sampling methods in the experimental section (Section IV, Paragraph D3 and Table IV). The results demonstrate that both sampling methods contribute to the improved performance of LogAction.
C. Q3: Other issues
Revisions addressing the other reviewers' questions are explained in the following three sections.
A. Q1: Hyperparameters of LSTM (Metareview Q2-1)
Consistent with Metareview Q2-1, we have added a list of hyperparameters and discussed them in the Threats to Validity section. Please refer to the revisions in Section IV, Paragraphs A, E, and Table II.
B. Q2: Overhead of the proposed approach
We have added experiments on the efficiency of the method in the Experimental section (Section IV, Paragraph D4, and Figure 7). It can be observed that the efficiency of LogAction is lower than that of DeepLog and comparable to MetaLog and ACLog. Overall, the efficiency of LogAction is acceptable. Please refer to the revisions in Section IV, Paragraph D4, and Figure 7.
C. Q3: Gap-bridging ability of encoder
We have included a comparison between the encoder of LogAction and baseline methods in addressing distribution discrepancies across different systems within the experimental section (Section IV, Paragraph D3, and Figure 6). Specifically, we employed the t-SNE dimensionality reduction technique to visualize the data distributions of log vectors after encoding. Notably, the encoder of LogAction demonstrates superior performance, as the encoded log vectors from various systems form the most tightly clustered groups. This highlights the outstanding gap-bridging capability of LogAction’s encoder. Please refer to the revisions in Section IV, Paragraph D3, and Figure 6.
A. Q1: Labels for encoder
We have added a description of the label sources used for training the encoder in the approach section (Section III, Paragraph A). During the LogAction process, the encoder is trained jointly with the downstream anomaly detection model, and the labels come from the active learning process. Please refer to the revisions in Section III, Paragraph A.
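A minimal sketch of such a joint update is shown below, assuming the encoder and detector are PyTorch modules sharing one optimizer; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def joint_training_step(encoder, detector, optimizer, batch, labels):
    """One joint update: gradients flow through both the anomaly
    detector and the encoder; `labels` are the annotations obtained
    from the active learning process."""
    optimizer.zero_grad()
    z = encoder(batch)            # shared log-sequence representation
    logits = detector(z)          # anomaly scores
    loss = F.cross_entropy(logits, labels)
    loss.backward()               # updates encoder and detector together
    optimizer.step()
    return loss.item()
```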
B. Q2: Splitting strategy
We have added details regarding the partitioning of the training and testing sets in the experimental section (Section IV, Paragraph A). To prevent data leakage, we strictly split the training and testing sets by time, keeping a time gap between them. Please refer to the revisions in Section IV, Paragraph A.
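A sketch of this splitting scheme, with an illustrative 80/20 ratio and a one-hour gap (not our exact values), is:

```python
import pandas as pd

def time_based_split(sequences, timestamps, train_ratio=0.8, gap="1h"):
    """Chronological split with a buffer gap so that no test sequence
    overlaps in time with the training data."""
    order = sorted(range(len(sequences)), key=lambda i: timestamps[i])
    cut = int(len(order) * train_ratio)
    boundary = timestamps[order[cut]]
    train = [sequences[i] for i in order[:cut]]
    test = [sequences[i] for i in order[cut:]
            if timestamps[i] >= boundary + pd.Timedelta(gap)]
    return train, test
```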
C. Q3: 2% labeling cost (Metareview Q1)
Consistent with Metareview Q1, we investigated a concrete measurement of the reduction in labeling effort for a small percentage of logs. The method effectively balances labeling cost and performance, allowing for dynamic adjustment based on practical requirements. Please refer to the revisions in Section IV, Paragraphs B, D2, E, and Figure 5.
D. Q4: The labels used for MetaLog
We have added a description of the labels used for MetaLog in the experimental section (Section IV, Paragraph D1). We emphasize that MetaLog is trained using 1% of anomaly labels in the target system. However, due to the rarity of anomalies, obtaining this 1% of anomaly labels requires screening a vast number of normal logs, thereby increasing the labeling cost. Please refer to the revisions in Section IV, Paragraph D1.
E. Q5: HDFS dataset
We have added experiments on the HDFS dataset in the experimental section (Section IV, Tables III and IV). The results demonstrate that the method maintains excellent performance on HDFS. Due to the addition of a new dataset, some comparative results related to LogAction have been recalculated and revised. Across the four log datasets, it achieves an average F1 score improvement of over 25.28% compared to existing methods. Please refer to the revisions in Section IV, Tables III and IV.
F. Q6: Some typos
Several typos have been corrected. For example, in Section I, “code start problem” has been revised to “cold start problem.”
A. Q1: Ablation study of the sampling strategy (Metareview Q2-3)
Consistent with Metareview Q2-3, we have added ablation experiments on the two sampling strategies. Please refer to the revisions in Section IV, Paragraphs D3 and Table IV.
B. Q2: Fixed time window (Metareview Q2-2)
Consistent with Metareview Q2-2, we have added a discussion on the fixed-time window in the Threats to Validity section of the experiments. Additionally, we have included experiments on the HDFS dataset, which is segmented using variable-size windows. Please refer to the revisions in Section IV, Paragraph E, and Table III.
C. Q3: Hyperparameters (Metareview Q2-1)
Consistent with Metareview Q2-1, we have added a list of hyperparameters and discussed them in the Threats to Validity section. Please refer to the revisions in Section IV, Paragraphs A, E, and Table II.
1. Hyperparameters of LSTM: The encoder's ability to generalize is driven by the contrastive learning framework, with the LSTM serving merely as a standard feature extractor. In our experiments, the LSTM hyperparameters were intentionally fixed across all datasets, based on prior classical works (LogTransfer [1], DeepLog [2]). Below is the list of our hyperparameters:
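The full hyperparameter list appears in Table II of the revised paper. Purely for illustration, a DeepLog-style LSTM feature extractor might be instantiated as follows; every value below is an example, not a setting from our table.

```python
import torch.nn as nn

# Example values only, in the style of DeepLog-like setups;
# they are not the settings reported in the paper's table.
encoder_lstm = nn.LSTM(
    input_size=300,    # dimension of the log-event embedding
    hidden_size=128,   # LSTM hidden units
    num_layers=2,      # stacked LSTM layers
    batch_first=True,
)
```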
2. Overhead of the proposed approach: We conducted an efficiency analysis of LogAction on the BGL and Zookeeper datasets; the results are shown below. The reported training and inference times include the processing time for encoding and anomaly detection of a single log sequence. It can be observed that the efficiency of LogAction is lower than that of DeepLog and comparable to MetaLog and LogTransfer.
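The timing protocol can be sketched as follows; this is a simplified wall-clock measurement with hypothetical `encoder` and `detector` callables, not our exact benchmarking harness.

```python
import time

def per_sequence_latency(encoder, detector, sequences):
    """Average wall-clock time for encoding plus anomaly detection
    of a single log sequence (a simplified version of the protocol)."""
    start = time.perf_counter()
    for seq in sequences:
        z = encoder(seq)    # encoding time
        _ = detector(z)     # anomaly detection time
    return (time.perf_counter() - start) / len(sequences)
```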
3. Gap-bridging ability of encoder: The distribution differences between logs shown in Figure 6 of the paper provide a direct demonstration of the encoder's gap-bridging ability. We further compared the gap-bridging ability of the encoding component against baseline methods (BERT-based encoder, LogTAD, MetaLog) using this approach, with the results available below.
a. BERT: The distribution of data points for BERT shows a relatively scattered pattern. Normal and anomalous points from both the source and target domains are interspersed without clear separation, indicating a limited ability to distinguish different log categories effectively.
b. LogTAD: LogTAD shows some improvement in the distribution compared to BERT. However, the clustering of points is still not very distinct. Source and target domain points, especially anomalous ones, do not form well-defined clusters, suggesting moderate performance in separating log types.
c. MetaLog: MetaLog presents a better-organized distribution. Data points start to form more discernible clusters. Normal points (both source and target) are relatively well grouped, and anomalous points also show some clustering tendency, implying an enhanced capability to differentiate log categories.
d. LogAction: LogAction exhibits the most favorable distribution. Points are highly clustered, with clear separation between normal and anomalous log data from both source and target domains. A distinct boundary can be observed, demonstrating superior ability in distinguishing and clustering different log types.
Among these, LogAction’s encoder performs most prominently, as the encoded log vectors from different systems are the most tightly clustered.
4. Intuition behind sampling strategy: These two sampling methods select the most valuable samples for labeling based on the data distribution and the model's uncertainty distribution, respectively. The intuitive assumption is that the log distribution follows a normal distribution. Although this assumption does not always hold, both sampling strategies have demonstrated effectiveness. Ablation studies on these sampling strategies, specifically energy-based sampling and uncertainty-based sampling (LogAction_we and LogAction_wu), are presented below and confirm the effectiveness of both methods. If space permits, we will include these results in the paper.
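As a sketch of how such a two-step selection can be realized, consider the following; free energy and predictive entropy are used here as stand-ins for our exact scoring functions.

```python
import torch
import torch.nn.functional as F

def two_step_active_sampling(logits, k_energy, k_final):
    """Two-step selection: keep the k_energy samples with the highest
    free energy (least familiar under the model's distribution), then
    pick the k_final most uncertain among them (maximum entropy)."""
    energy = -torch.logsumexp(logits, dim=1)        # energy-based score
    cand = torch.topk(energy, k_energy).indices     # step 1: energy
    probs = F.softmax(logits[cand], dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    picked = torch.topk(entropy, k_final).indices   # step 2: uncertainty
    return cand[picked]                             # indices to label
```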
1. Labels for encoder: During the LogAction process, the encoder is trained jointly with the downstream anomaly detection model, and the labels come from the active learning process. We will include this explanation in the paper.
2. Splitting strategy: To prevent data leakage, we strictly split the training and testing sets by time, keeping a time gap between them.
3. 2% labeling:
a. Regarding whether 2% of labeled data covers all log templates: the principle of active learning is that a small subset of informative samples can effectively substitute for training on the entire dataset. In log data, such valuable samples may correspond to the same log templates or semantically similar log events, which are specific to the software system and cannot be adequately captured by merely identifying log templates; they require active learning. Furthermore, randomly selecting 2% of logs does not achieve the same effectiveness as active learning. This is evidenced by the ablation study, where replacing active learning with random sampling (LogAction_wa) results in a significant performance decline.
b. Regarding labeling costs: the 2% labeling ratio represents the optimal performance of LogAction rather than a strict requirement. As shown in Figure 5, the model still performs well with 1% or less labeled data (0% on the BGL dataset). In practice, the labeling ratio can be adjusted according to the difficulty of human annotation to balance cost and accuracy. Notably, reducing the labeling ratio from 100% to 2% significantly decreases human labeling effort; for example, it saves 34,300 labeling instances on the BGL dataset.
4. The labels used for MetaLog: We want to emphasize that MetaLog used 1% anomaly labels, not 1% of all labels, in its experiments. Since anomalies are rare, the cost of labeling anomalous data is very high (e.g., labeling 10 log sequences might yield only one anomalous log sequence). In our experiments, we trained MetaLog using 2% of all labels, where the number of anomaly labels was far less than 1%.
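The screening cost can be illustrated with a simple expected-value calculation, assuming anomalies occur at a fixed rate; the function and rate below are illustrative.

```python
def sequences_to_screen(num_anomaly_labels, anomaly_rate):
    """Expected number of log sequences a human must inspect to collect
    the requested number of anomaly labels, assuming anomalies occur at
    a fixed rate (a simplified model of the labeling cost)."""
    return num_anomaly_labels / anomaly_rate

# At a 10% anomaly rate, collecting 100 anomaly labels requires
# inspecting roughly 1,000 sequences ("label 10 to find 1").
print(sequences_to_screen(100, 0.10))  # -> 1000.0
```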
5. HDFS dataset: Due to space limitations, we did not include results on the HDFS dataset in the paper. We have provided experimental results on HDFS and Zookeeper below. If space permits, we will include these results in the paper.
1. Ablation study of the sampling strategy: We provide experimental results in our response to Review #440A, item 4, which demonstrate that both strategies play a crucial role. If space permits, we will include these results in the paper.
2. Fixed time window: Since other baselines (such as LogTransfer [1]) use fixed time windows when working with the three datasets (BGL, Thunderbird, and Zookeeper), we also adopt fixed time window segmentation to ensure fairness in the experiments. The time window size is selected with reference to classic works such as DeepLog [2]. Additionally, the experimental results on the HDFS dataset, which uses variable-size windows to create log sequences, are available on the website.
3. Hyperparameters: We have listed the hyperparameters in our response to Review #440A, item 1, and if space allows, we will include them in the main text.
Issues: Cold-start problems and high labeling efforts in log-based anomaly detection
On one hand, during the initial deployment phase of software systems, it is impossible to accumulate sufficient logs to train an effective anomaly detection model. On the other hand, the massive and complex log data is difficult to accurately label, requiring substantial labeling costs.
Key technologies: Incorporating active learning and transfer learning
Transfer learning is employed to address the cold-start problem: LogAction trains models on mature systems and transfers them to new systems with insufficient data accumulation. Meanwhile, active learning is utilized to reduce the number of labels required during the model transfer process, thereby lowering manual labeling costs.
Our proposed method: LogAction
As a result, we propose that transfer learning and active learning should be combined to solve the label-scarcity problem. We define this scenario as consistent cross-system anomaly detection (CCAD): leveraging the features extracted from abundant historical labeled logs of mature systems (source systems) to build anomaly detection models for new systems (target systems), and consistently optimizing the models with online human labels on the target systems. In this paper, we focus on the CCAD scenario and aim to build a high-performance anomaly detection model with very few or no anomalous labels. We propose LogAction, a consistent cross-system anomaly detection model via active domain adaptation.
LogAction outperforms state-of-the-art methods and shows effective performance on the CCAD task.
All components play important roles in improving the performance of LogAction.
LogAction demonstrates relatively stable performance when hyperparameter values fluctuate.
We use t-SNE to visualize the data distributions of logs before and after encoding. As shown in the figure, blue/red points represent normal/anomalous logs, and triangles/circles denote the source/target systems. Before encoding, the source and target logs (normal and anomalous) form separate clusters, showing distinct distributions. After encoding, the clusters of both systems' normal and anomalous logs converge, indicating reduced distribution gaps and improved alignment.
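A minimal sketch of how such a visualization can be produced, following the figure's color and marker conventions, is given below; the array names are illustrative, and the inputs are assumed to be NumPy arrays.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(vectors, is_anomaly, is_target, title):
    """Project log vectors to 2-D with t-SNE and plot them using the
    figure's conventions: blue/red = normal/anomalous logs,
    triangle/circle = source/target system."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(vectors)
    for anom in (False, True):
        for tgt in (False, True):
            mask = (is_anomaly == anom) & (is_target == tgt)
            plt.scatter(emb[mask, 0], emb[mask, 1],
                        c="red" if anom else "blue",
                        marker="o" if tgt else "^", s=10)
    plt.title(title)
    plt.show()
```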