Minireviews Open Access
Copyright ©The Author(s) 2024. Published by Baishideng Publishing Group Inc. All rights reserved.
World J Psychiatry. Feb 19, 2024; 14(2): 225-233
Published online Feb 19, 2024. doi: 10.5498/wjp.v14.i2.225
Automatic recognition of depression based on audio and video: A review
Meng-Meng Han, Wei-Li Xia, Ya-Fei Liu, Qing-Xiang Wang, Shandong Mental Health Center, Shandong University, Jinan 250014, Shandong Province, China
Meng-Meng Han, Xing-Yun Li, Xin-Yu Yi, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, Shandong Province, China
Xing-Yun Li, Xin-Yu Yi, Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, Shandong Province, China
Xing-Yun Li, Xin-Yu Yi, Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan 250353, Shandong Province, China
Yun-Shao Zheng, Department of Ward Two, Shandong Mental Health Center, Shandong University, Jinan 250014, Shandong Province, China
ORCID number: Meng-Meng Han (0009-0003-1259-5048); Xing-Yun Li (0000-0002-5125-281X); Xin-Yu Yi (0009-0004-8734-075X); Qing-Xiang Wang (0000-0002-8159-7739).
Author contributions: Han MM, Li XY, Yi XY, Zheng YS and Wang QX designed the research study; Xia WL and Liu YF conducted literature retrieval; Han MM, Li XY, Yi XY, Zheng YS, and Wang QX summarized and analyzed relevant literature; Zheng YS provided medical knowledge; Han MM, Li XY, Yi XY, and Wang QX were responsible for writing and revising the manuscript; Wang QX reviewed the manuscript and approved its publication. All authors have read and approved the final manuscript.
Supported by Shandong Province Key R and D Program, No. 2021SFGC0504; Shandong Provincial Natural Science Foundation, No. ZR2021MF079; and Science and Technology Development Plan of Jinan (Clinical Medicine Science and Technology Innovation Plan), No. 202225054.
Conflict-of-interest statement: There is no conflict of interest associated with the senior author or any of the coauthors who contributed their efforts to this manuscript.
Open-Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/Licenses/by-nc/4.0/
Corresponding author: Qing-Xiang Wang, PhD, Associate Professor, Shandong Mental Health Center, Shandong University, No. 49 Wenhua East Road, Jinan 250014, Shandong Province, China. wangqx@qlu.edu.cn
Received: November 25, 2023
Peer-review started: November 25, 2023
First decision: December 6, 2023
Revised: December 18, 2023
Accepted: January 24, 2024
Article in press: January 24, 2024
Published online: February 19, 2024
Processing time: 72 Days and 15.2 Hours

Abstract

Depression is a common mental health disorder. In current depression detection practice, specialized physicians often rely on conversations and physiological examinations guided by standardized scales as auxiliary measures for depression assessment. Non-biological markers, typically classified as verbal or non-verbal and regarded as crucial evaluation criteria for depression, have not been effectively utilized. Specialized physicians usually require extensive training and experience to capture changes in these features. Advancements in deep learning technology have provided technical support for capturing non-biological markers. Several researchers have proposed automatic depression estimation (ADE) systems based on audio and video to assist physicians in capturing these features and conducting depression screening. This article summarizes commonly used public datasets and recent research on audio- and video-based ADE from three perspectives: Datasets, deficiencies in existing research, and future development directions.

Key Words: Depression recognition, Deep learning, Automatic depression estimation System, Audio processing, Image processing, Feature fusion, Future development

Core Tip: The automatic recognition of depression based on deep learning has gradually become a research hotspot. Researchers have proposed automatic depression estimation (ADE) systems utilizing sound and video data to assist physicians in screening for depression. This article provides an overview of the latest research on ADE systems, focusing on sound and video datasets, current research challenges, and future directions.



INTRODUCTION

With societal developments, the diagnosis and treatment of depression have become increasingly crucial. Depression is a prevalent psychological disorder characterized by symptoms such as low mood, diminished appetite, and insomnia in affected individuals[1]. Patients with severe depression may also exhibit a tendency towards suicide. In the field of medicine, researchers aspire to conduct comprehensive investigations of depression from both biological and non-biological perspectives. Li et al[2] summarized biological markers, revealing associations between depression and indicators such as gamma-glutamyl transferase, glucose, triglycerides, albumin, and total bilirubin. Non-biological markers can be broadly categorized into verbal and non-verbal features. Verbal features typically pertain to a subject’s intonation, speech rate, and emotional expressions in speech extracted from audio recordings. Early studies by Cannizzaro et al[3] and Leff et al[4] identified differences in the speech of individuals with psychiatric disorders compared to the general population. Non-verbal features typically refer to the facial expressions and body movements commonly embedded in video files. The Facial Action Coding System[5], a frequently employed tool for facial expression analysis, decomposes facial muscles into multiple action units (AUs) with corresponding numerical identifiers. For instance, AU1 and AU2 represent the inner brow raiser and outer brow raiser, respectively. A graphical representation of the AUs can be accessed at https://imotions.com/blog/learning/research-fundamentals/facial-action-coding-system/. While Girard et al[6] found differences in AUs 10, 12, 14, and 15 between individuals with depression and the general population, a unified research framework for bodily changes is yet to be established, with the core challenge lying in quantifying alterations in body movements. Joshi et al[7] demonstrated the potential of studying body movements for automatic depression estimation (ADE) using a method based on space-time interest points and a bag of words to analyze patients' upper-body movements.

During clinical assessments, specialized physicians detect and treat depression based on diagnostic criteria manuals issued by the relevant organizations. For instance, the World Health Organization released the 11th revision of the International Classification of Diseases in 2022, providing detailed classifications of various mental disorders. The American Psychiatric Association published the Diagnostic and Statistical Manual of Mental Disorders (DSM)-4[8] in 1994 and its updated version, DSM-5[9], in 2013. In 2001, China released the Chinese Classification and Diagnostic Criteria of Mental Disorders, Third Edition. Guided by these diagnostic manuals, specialized physicians assess the severity of depression based on the scores obtained from rating scales. Rating scales are typically categorized into self-report and observer-report scales. The Patient Health Questionnaire[10] is a lightweight self-report scale, whereas the Hamilton depression rating scale (HAMD)[11] is a common observer-report scale. Observer-report scales require specialized physicians to interview patients and score each item based on the scale. Completing an interview based on the HAMD scale typically takes 15-20 min.

In addition to detecting clinical depression based on rating scales, biological markers have been employed to assist with the assessment. Physicians use biochemical indicators obtained through techniques such as blood tests to aid their judgment. With advancements in detection technologies, biological markers can be quantitatively measured, allowing specialized physicians to refer directly to numerical values to determine the clinical significance of a test. However, non-biological markers, which are crucial features of depression, have not been extensively utilized, which can be attributed to several factors. First, changes in non-biological markers, such as facial expressions and intonation, are often subtle. Specialized physicians require extensive training and accumulated experience to capture these changes; such training is typically time-consuming and inefficient. Second, unlike biological markers, non-biological markers lack systematic, quantifiable patterns of change; capturing them depends on the ability to extract spatial and temporal information, a challenging task for early computer technologies. The development of deep-learning technology and the growth of computational power provide an opportunity to address these challenges. Deep learning, with its robust capability to capture temporal and spatial information, offers new avenues for constructing assistive systems. ADE has thus become a significant research direction in the field of computational medicine, and several ADE methods have been proposed.

A complete ADE study typically comprises three steps. The first step is data collection, which can be categorized according to whether data are recorded freely or under specific emotional stimulus experiments. The former typically uses devices such as cameras and microphones to capture the audio-visual information of subjects during medical consultations or in natural states. The latter requires the design of specific emotional paradigms, followed by recording the subjects' audio-visual information under emotional stimuli. The second step is the construction of deep-learning models for ADE. In this phase, researchers design different deep-learning architectures based on data characteristics to capture information relevant to ADE. Finally, the model undergoes training and testing. ADE essentially involves two tasks: classification, i.e., distinguishing whether an individual is a patient with depression or further categorizing the severity (non-depressed, mild, moderate, and severe), and scoring, i.e., predicting the subjects' assessment scale scores. Depending on the task, researchers choose different evaluation metrics to train and test the model.
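For illustration, the sketch below computes the evaluation metrics most frequently reported for these two task types: mean absolute error (MAE) and root mean squared error (RMSE) for scoring, and accuracy and F1 score for classification. The predicted and true values are hypothetical placeholders.

```python
# Illustrative computation of commonly reported ADE metrics:
# MAE/RMSE for scale-score prediction, accuracy/F1 for classification.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, f1_score

true_scores = np.array([3, 14, 22, 8, 30])      # e.g., true BDI scores (placeholder)
pred_scores = np.array([5, 12, 25, 6, 27])      # model-predicted scores (placeholder)
mae = mean_absolute_error(true_scores, pred_scores)
rmse = np.sqrt(mean_squared_error(true_scores, pred_scores))

true_labels = np.array([0, 1, 1, 0, 1])         # 0 = non-depressed, 1 = depressed (placeholder)
pred_labels = np.array([0, 1, 0, 0, 1])
acc = accuracy_score(true_labels, pred_labels)
f1 = f1_score(true_labels, pred_labels)

print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, accuracy={acc:.2f}, F1={f1:.2f}")
```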

Early ADE typically required manual feature extraction followed by machine learning methods, such as decision trees and support vector machines, for feature classification. Peng et al[12] first constructed a sentiment lexicon, counted word frequencies, and then input these features into a support vector machine for ADE. Alghowinem et al[13] first used the openSMILE tool to extract audio features and then employed machine learning methods for ADE. Wen et al[14] extracted dynamic feature descriptors from facial region sub-volumes and used sparse coding to implicitly organize the extracted descriptors for depression diagnosis. With the development of deep learning and computational capabilities, deep models can extract features from complex data, eliminating the need for manual feature extraction; notably, owing to the specificity of audio information, certain manual feature extraction steps remain. Consequently, a series of deep learning-based ADE methods have been proposed. In this review, we focus primarily on recent ADE methods based on deep learning. We first introduce commonly used publicly available ADE datasets and then provide an overview and summary of recent outstanding audio-visual ADE models; all reviewed methods are summarized in Table 1. Finally, we summarize the existing challenges and future directions of ADE.
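As a schematic illustration of this early pipeline, the sketch below trains a support vector machine on pre-extracted handcrafted features. The feature matrix and labels are synthetic placeholders rather than data from any of the cited studies; in practice the features would come from a toolkit such as openSMILE.

```python
# Minimal sketch of an early-style ADE pipeline: pre-extracted handcrafted features + an SVM.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))        # 200 recordings x 88 acoustic descriptors (synthetic)
y = rng.integers(0, 2, size=200)      # binary labels: 0 = control, 1 = depressed (synthetic)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```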

Table 1 Summary of advanced automatic depression estimation methods.
| Method | Year | Framework | Dataset | Modality | MAE | RMSE | Accuracy | F1-score |
|---|---|---|---|---|---|---|---|---|
| He and Cao[18] | 2018 | 2D CNN | AVEC2013 | A | 8.201 | 10.001 | - | - |
| | | | AVEC2014 | A | 8.191 | 9.999 | - | - |
| SIDD | 2023 | - | DAIC-WOZ | A | - | - | - | 0.601 |
| MSCDR | 2022 | 1D CNN/RNN | DAIC-WOZ | A | - | - | 0.771 | 0.746 |
| DALF | 2023 | 2D CNN | DAIC-WOZ | A | - | - | - | 0.784 |
| STFN | 2023 | 1D CNN | DAIC-WOZ | A | 5.38 | 6.36 | 0.780 | - |
| SpeechFormer++ | 2023 | Transformer | DAIC-WOZ | A | - | - | 0.733 | - |
| Mao et al[24] | 2022 | CNN/RNN | DAIC-WOZ (5-class) | A | - | - | - | 0.958 |
| DLGA-CNN | 2020 | 2D CNN | AVEC2013 | V-F | 6.59 | 8.39 | - | - |
| | | | AVEC2014 | V-F | 6.51 | 8.30 | - | - |
| SAN | 2022 | 2D CNN | AVEC2013 | V-F | 7.02 | 9.37 | - | - |
| | | | AVEC2014 | V-F | 6.59 | 9.24 | - | - |
| Zhao et al[27] | 2023 | 2D CNN | AVEC2013 | V-F | 5.97 | 7.36 | - | - |
| | | | AVEC2014 | V-F | 5.85 | 7.23 | - | - |
| PRA-Net | 2023 | 2D CNN | AVEC2013 | V-F | 6.08 | 7.59 | - | - |
| | | | AVEC2014 | V-F | 6.04 | 7.98 | - | - |
| Yuan and Wang[32] | 2019 | MLP | Private | V-E | - | - | 0.831 | - |
| EnSA | 2022 | Transformer | Private | V-E | - | - | 0.955 | - |
| SATCN | 2022 | 1D CNN | Private | V-B | - | - | 0.758 | - |
| Zhao and Wang[35] | 2022 | Transformer | Private | V-B | - | - | 0.729 | - |
| ULCDL | 2023 | RNN | DAIC-WOZ | A + V | - | - | 0.830 | 0.900 |
| Niu et al[37] | 2020 | 2D/3D CNN | AVEC2013 | A + V | 6.14 | 8.16 | - | - |
| | | | AVEC2014 | A + V | 5.21 | 7.03 | - | - |
| Shao et al[38] | 2021 | RNN/CNN | Private | V + V | - | - | 0.854 | - |
| TAMFN | 2022 | 2D CNN | D-Vlog | A + V | - | - | - | 0.750 |
| Uddin et al[41] | 2022 | 2D CNN | AVEC2013 | A + V | 5.38 | 6.83 | - | - |
| | | | AVEC2014 | A + V | 5.03 | 6.16 | - | - |

A: Audio; V-F: Video (facial); V-E: Video (eye movement); V-B: Video (body); V + V: Two video-derived modalities (RGB and skeleton); MAE: Mean absolute error; RMSE: Root mean squared error; -: Not reported; DAIC-WOZ (5-class): Five-class severity classification on DAIC-WOZ.

DATASETS

Data form the foundation of ADE research. However, owing to the inherent challenges in collecting depression data, such as strong privacy concerns, lengthy collection periods, and limited data volumes, obtaining subject authorization for public sharing is difficult, resulting in a scarcity of publicly available audio-visual datasets. Commonly utilized public datasets primarily originate from the Audio/Visual Emotion Challenge (AVEC) series, specifically the AVEC2013[15], AVEC2014[16], and Distress Analysis Interview Corpus/Wizard-of-Oz (DAIC-WOZ)[17] datasets.

AVEC2013

The AVEC2013 dataset was released as part of the third AVEC challenge. It comprises 340 video segments collected from 292 participants. AVEC2013 required participants to perform tasks such as vowel phonation, reading, recounting memories, and narrating a story based on a picture, with their audio-visual information recorded. Beck Depression Inventory (BDI) scores serve as the labels for AVEC2013.

AVEC2014

The AVEC2014 dataset was released as part of the fourth AVEC challenge. It comprises 150 audio-video data segments involving a total of 84 subjects. As a subset of AVEC2013, AVEC2014 required each participant to complete two tasks, Northwind and Freeform, which involved reading an excerpt from an article aloud and answering specific questions, respectively. Like AVEC2013, AVEC2014 uses BDI scores as data labels.

DAIC-WOZ

This dataset encompasses the audio-visual information of subjects collected through various interview formats, with each data type provided independently. The dataset includes a maximum of 263 audio-visual data points; rather than raw footage, the video information is provided as converted facial features (e.g., gaze directions, facial key points, and AU features).

AUDIO-BASED DEPRESSION ESTIMATION

Audio-based methods are crucial for ADE. In this process, researchers often combine manual (handcrafted) features with deep features for ADE. Manual features typically include time-domain and frequency-domain features. Deep features are typically obtained from graphical audio representations using deep-learning models; these representations often include the raw waveform, spectrogram, Mel spectrogram, or other processed forms of the raw audio.
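A minimal sketch of this preprocessing step, assuming a PyTorch/torchaudio workflow, is shown below; the waveform is synthetic and the transform parameters are illustrative rather than taken from any specific study.

```python
# Sketch: converting a raw waveform into a log-Mel spectrogram that a 2D CNN can consume.
import torch
import torchaudio

sr = 16000
waveform = torch.randn(1, sr * 5)                        # stands in for a 5-s mono recording
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)(waveform)                                              # (channels, 80, frames)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)     # log scale is common for speech models

cnn_input = log_mel.unsqueeze(0)                         # (batch=1, 1, 80, frames) for a 2D CNN
print(cnn_input.shape)
```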

He and Cao[18] combined manually extracted audio features with deep-learning features for ADE. They divided the model into two parts: the first part employed a deep network to extract deep features from spectrograms and raw speech waveforms, while the other part extracted median robust extended local binary patterns from spectrograms and low-level descriptors from raw speech. Finally, these features were fused using a fusion model to make the final decision. This approach achieved root mean squared error (RMSE) and mean absolute error (MAE) values of 10.001 and 8.201 on the AVEC2013 dataset and 9.999 and 8.191 on the AVEC2014 dataset. Zuo and Mak[19] recognized the potential performance decline associated with limited audio data. With a smaller dataset, capturing the patterns of depressive expression becomes challenging, and deep models tend to learn audio features specific to individual subjects, leading to overfitting. To address this issue, they proposed a speaker-invariant depression detector, which achieved an F1 score of 0.601 on the DAIC-WOZ dataset. Du et al[20] incorporated patients' vocal tract changes into conventional speech perceptual features and developed a machine speech chain model for depression recognition (MSCDR). The MSCDR extracts speech features from both the generation and perception aspects and uses recurrent neural networks (RNNs) to extract time-domain features for depression detection. The MSCDR achieved accuracy and F1 scores of 0.771 and 0.746, respectively, on the DAIC-WOZ dataset.
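The following sketch illustrates the general hybrid idea of combining a deep spectrogram branch with a handcrafted-feature branch through feature-level fusion. It is a generic illustration, not a reproduction of the cited architectures, and all dimensions are placeholders.

```python
# Generic sketch of a hybrid audio model: a 2D-CNN branch for spectrograms and an MLP branch
# for handcrafted descriptors, fused by concatenation before the prediction head.
import torch
import torch.nn as nn

class HybridAudioModel(nn.Module):
    def __init__(self, n_handcrafted=64, n_classes=2):
        super().__init__()
        self.spec_branch = nn.Sequential(           # deep features from a log-Mel spectrogram
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.hand_branch = nn.Sequential(           # handcrafted low-level descriptors
            nn.Linear(n_handcrafted, 32), nn.ReLU(),
        )
        self.head = nn.Linear(32 + 32, n_classes)   # feature-level fusion + decision

    def forward(self, spectrogram, handcrafted):
        fused = torch.cat([self.spec_branch(spectrogram), self.hand_branch(handcrafted)], dim=1)
        return self.head(fused)

model = HybridAudioModel()
logits = model(torch.randn(4, 1, 80, 300), torch.randn(4, 64))   # dummy batch
print(logits.shape)   # (4, 2)
```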

Yang et al[21] observed that many ADE models based on manually designed features lack good interpretability, with features not being fully utilized. They therefore proposed DALF, a depression detection model built on attention-guided learnable time-domain filterbanks. The learnable filters in DALF can decompose audio signals more effectively and retain useful features, and analyzing the automatically learned filters allows a deeper understanding of the areas on which the model focuses. This method achieved an F1 score of 0.784 on the DAIC-WOZ dataset. Han et al[22] introduced a spatial-temporal feature network (STFN) to capture audio features. The STFN first captures the deep features of the audio information and then uses a novel mechanism called hierarchical contrastive predictive coding loss, replacing the commonly used RNN, to capture temporal information. This design reduces the parameter count of the model, making it easier to train. The STFN achieved accuracy, RMSE, and MAE values of 0.780, 6.36, and 5.38, respectively, on the DAIC-WOZ dataset. Chen et al[23] focused on integrating the Transformer architecture with audio features. Their proposed model, SpeechFormer++, utilized prior knowledge to guide feature extraction, achieving an accuracy of 0.733 on the DAIC-WOZ dataset. Mao et al[24] recognized that the text features in audio are also important for capturing the patterns of depressive expression. Consequently, they proposed an attention-based fused representation of text and speech features. This approach first inputs the text information and the low-level features of the raw speech into an encoder and subsequently employs the encoded features for depression detection, achieving an F1 score of 0.958 on a five-class classification task using the DAIC-WOZ dataset.
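The notion of learnable audio filters can be illustrated schematically with a one-dimensional convolution applied directly to the raw waveform, whose kernels act as filters learned during training. This is a conceptual sketch only and does not reproduce the DALF architecture.

```python
# Schematic illustration of learnable time-domain filters: a Conv1d layer applied to the raw
# waveform acts as a bank of filters whose shapes are learned during training.
import torch
import torch.nn as nn

waveform = torch.randn(4, 1, 16000)                 # dummy batch: 4 one-second clips at 16 kHz
filterbank = nn.Conv1d(in_channels=1, out_channels=40, kernel_size=400, stride=160)
filtered = torch.relu(filterbank(waveform))         # (4, 40, frames): 40 learned sub-bands
print(filtered.shape)
```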

Overall, the design of audio-based ADE models relies on the initial feature selection. Because audio information cannot be utilized directly by deep models, it is typically transformed before deep feature extraction. These transformations are diverse, including directly using the raw waveform, applying the Fourier transform or fast Fourier transform to obtain time-frequency information, converting the audio into Mel spectrograms, and directly extracting audio features such as frame intensity, frame energy, and fundamental frequency. Diverse feature selection methods provide various possibilities for ADE, leading to discussions regarding which audio representation is most beneficial for ADE. The construction of the model must be aligned with the selected features for effective feature extraction. Given that depression datasets are often small, methods that limit the model's learning of speaker-specific features, as demonstrated by Zuo and Mak[19], should be carefully considered.
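The sketch below illustrates how a few of the frame-level descriptors mentioned above (frame energy, zero-crossing rate, and fundamental frequency) might be extracted with the librosa library; the input signal is synthetic and stands in for a speech recording.

```python
# Sketch: commonly used frame-level audio descriptors extracted with librosa.
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr * 2) / sr).astype(np.float32)  # stand-in for speech

energy = librosa.feature.rms(y=y)                      # frame energy
zcr = librosa.feature.zero_crossing_rate(y)            # zero-crossing rate
f0, voiced_flag, voiced_prob = librosa.pyin(           # fundamental frequency estimate
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
print(energy.shape, zcr.shape, f0.shape)
```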

VIDEO-BASED DEPRESSION ESTIMATION

Video information often preserves changes in participants' expressions during exposure to stimulus paradigms; these expressions include both facial and bodily expressions. In medical research, video-based ADE models typically incorporate various attention mechanisms to enhance local facial features. He et al[25] proposed an ADE framework called the deep local global attention convolutional neural network (DLGA-CNN). The DLGA-CNN introduces multiple attention mechanisms to extract multiscale local and global features, which are then fused and employed for depression detection. The DLGA-CNN achieved RMSE and MAE values of 8.39 and 6.59 on AVEC2013 and 8.30 and 6.51 on AVEC2014. He et al[26] also recognized the presence of annotation noise in depression datasets, which could negatively affect feature extraction and result in suboptimal ADE performance. They therefore proposed a self-adaptation network (SAN) to relabel erroneous annotations in the datasets. SAN achieved RMSE and MAE values of 9.37 and 7.02 on AVEC2013 and 9.24 and 6.95 on AVEC2014.
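The sketch below illustrates the basic principle behind such attention mechanisms: a learned mask re-weights the locations of a facial feature map so that informative regions contribute more to the decision. It is a minimal conceptual example, not the multiscale attention used in the cited models.

```python
# Minimal spatial-attention sketch: a learned mask re-weights locations of a facial feature map.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, feat):                 # feat: (batch, channels, H, W)
        return feat * self.mask(feat)        # element-wise re-weighting of spatial locations

feat = torch.randn(2, 64, 28, 28)            # dummy CNN feature map of a face crop
print(SpatialAttention(64)(feat).shape)
```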

Zhao et al[27] acknowledged the significance of local and global information and proposed an ADE architecture based on facial images. To enhance the quality of facial images, the architecture initially utilizes the Gamma Correction[28] and DeblurGAN-v2[29] algorithms to balance brightness and contrast and improve image clarity. The architecture employs ConvFFN[30] as the main framework and designs the Hi-Lo attention module to enhance the features in different facial regions. Ultimately, this method achieved RMSE and MAE values of 7.36 and 5.97 on the AVEC2013 dataset and 7.23 and 5.85 on the AVEC2014 dataset. Liu et al[31] introduced another approach, Part-and-Relation Attention Network (PRA-Net), for feature extraction from facial regions for ADE. PRA-Net initially segments the extracted facial feature maps by region; these segmented regions are fed into a self-attention mechanism to capture interregional correlations. The classifier merges the regional feature maps with weights for the final decision. PRA-Net achieved RMSE and MAE values of 7.59 and 6.08 on the AVEC2013 dataset and 7.98 and 6.04 on the AVEC2014 dataset.
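The region-plus-relation idea can be sketched as follows: a facial feature map is pooled into region tokens, and self-attention models the correlations between regions before the decision. This is a simplified illustration with assumed dimensions, not the published PRA-Net.

```python
# Simplified sketch of the "regions + relations" idea: region tokens + self-attention.
import torch
import torch.nn as nn

class RegionRelationHead(nn.Module):
    def __init__(self, channels=64, grid=4, n_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(grid)                 # one token per grid region
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4, batch_first=True)
        self.fc = nn.Linear(channels, n_classes)

    def forward(self, feat):                                   # feat: (batch, C, H, W)
        tokens = self.pool(feat).flatten(2).transpose(1, 2)    # (batch, grid*grid, C)
        related, _ = self.attn(tokens, tokens, tokens)         # inter-region correlations
        return self.fc(related.mean(dim=1))                    # merge regions for the decision

feat = torch.randn(2, 64, 28, 28)
print(RegionRelationHead()(feat).shape)                        # (2, 2)
```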

In addition to extracting features from the entire face, Yuan and Wang[32] explored the use of gaze features for ADE. They employed a fully connected network to extract gaze features from the participants and achieved an accuracy value of 0.831. Subsequently, Zhao and Wang[33] designed an attention-based architecture, EnSA, for ADE that achieved an accuracy of 0.955.

In addition to using facial expressions for ADE, utilizing body expressions is also an important approach. Yu et al[34] first captured the participants' body skeleton change sequences using Kinect. They then constructed a spatial attention-dilated temporal convolutional network (SATCN) based on an improved temporal convolutional network. SATCN achieved a maximum accuracy of 0.758 on binary classification tasks and a maximum accuracy of 0.643 on multiclass tasks. Similarly, Zhao and Wang[35] employed body skeletal information for ADE. They observed differences in reaction times between the case and control groups for specific tasks and consequently used reaction time as prior knowledge, inputting it together with the skeletal information into a Transformer for ADE. This approach achieved an accuracy of 0.729. Compared with the abundance of facial-based ADE models, the number of models based on body expressions is relatively limited, warranting further research and exploration.
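As a rough illustration of this line of work, the sketch below treats a body-skeleton sequence as a multichannel time series and applies dilated temporal convolutions. The joint counts, frame numbers, and layer sizes are placeholders, and the code is not the SATCN implementation.

```python
# Sketch: temporal (1D) convolutions over a body-skeleton sequence.
import torch
import torch.nn as nn

# Dummy batch: 8 clips, 25 joints x 3 coordinates = 75 channels, 120 frames each.
skeleton_seq = torch.randn(8, 75, 120)

temporal_net = nn.Sequential(
    nn.Conv1d(75, 64, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=9, padding=8, dilation=2), nn.ReLU(),  # dilated temporal conv
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 2),                                                    # depressed vs. control
)
print(temporal_net(skeleton_seq).shape)   # (8, 2)
```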

Unlike audio information, image information can be utilized directly by deep models owing to the advancement of convolutional networks. Consequently, the construction of end-to-end ADE models has become mainstream in recent years. The inputs for these models do not require complex preprocessing, typically only region cropping and lighting balancing. Extracting local information has become a crucial aspect of model construction and a primary research direction for video-based ADE.

FUSION OF AUDIO- AND VISUAL-BASED DEPRESSION ESTIMATION

In addition to using unimodal information for depression prediction, depression-detection models that jointly utilize multiple modalities are being developed. Because different modalities provide complementary information, multi-modal models can achieve higher accuracy than unimodal models, with the combination of audio and visual information being a commonly used approach.

Yang et al[36] designed uncertainty-aware label contrastive and distribution learning (ULCDL) to integrate facial, audio, and text information for ADE. ULCDL introduces a contrastive learning framework into ADE to enhance the model's learning capability, achieving an accuracy of 0.830 and an F1 score of 0.900 on the DAIC-WOZ dataset. Niu et al[37] combined facial sequences with audio spectrograms for ADE. Leveraging the characteristics of both features, they proposed spatiotemporal attention and multi-modal attention feature fusion networks to enhance the two features and obtain cross-modal attention between them. This architecture achieved RMSE and MAE values of 8.16 and 6.14 on AVEC2013 and 7.03 and 5.21 on AVEC2014. Shao et al[38] observed that different features derived from the same data can be complementary. They combined the participants' RGB body images and body skeleton images for ADE, achieving an accuracy of 0.854 on a dataset comprising 200 participants. Zhou et al[39] approached ADE from the perspective of video blogs. Their proposed time-aware attention-based multi-modal fusion depression detection network (TAMFN) extracts and fuses multi-modal information from three aspects: Global features, inter-modal correlations, and temporal changes. TAMFN obtained an F1 score of 0.75 on the D-Vlog[40] dataset. Uddin et al[41] first segmented the audio and video into equally sized segments and then used volume local directional structural patterns and temporal attention pooling to encode the facial and audio information and obtain the importance of each video and audio segment. Finally, multi-modal factorized bilinear pooling was employed to fuse the features and make decisions. This method achieved RMSE and MAE values of 6.83 and 5.38 on AVEC2013 and 6.16 and 5.03 on AVEC2014.

Multimodality represents a new approach to ADE. Multimodal information mimics the way clinicians assess patients from multiple perspectives during clinical examinations. The most crucial aspect of multimodal ADE is the exploration and integration of the hidden relationships among the various types of information. Initially, feature fusion and decision fusion were the primary methods for combining features; however, these two approaches are simple and do not consider deep feature integration. With further research, ADE will demand multiscale and deep fusion of multimodal features, and cross-modal fusion methods are no longer limited to feature and decision fusion. When constructing new fusion methods, identifying the relationships between different types of information, and the means to capture these relationships, becomes crucial.
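The sketch below contrasts the two basic fusion styles discussed above: simple feature-level concatenation and cross-modal attention in which one modality queries another. The embedding sizes and sequence lengths are placeholders, and neither snippet reproduces the cited fusion networks.

```python
# Sketch contrasting feature-level concatenation with cross-modal attention fusion.
import torch
import torch.nn as nn

audio_feat = torch.randn(4, 10, 128)   # dummy audio embeddings: (batch, audio steps, dim)
video_feat = torch.randn(4, 30, 128)   # dummy video embeddings: (batch, video frames, dim)

# 1) Feature-level fusion: pool each modality and concatenate.
fused_concat = torch.cat([audio_feat.mean(dim=1), video_feat.mean(dim=1)], dim=-1)  # (4, 256)

# 2) Cross-modal attention: audio queries attend to video keys/values.
cross_attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
attended, weights = cross_attn(query=audio_feat, key=video_feat, value=video_feat)
fused_cross = attended.mean(dim=1)                                                  # (4, 128)

head = nn.Linear(fused_cross.shape[-1], 2)     # shared decision head for illustration
print(head(fused_cross).shape)                 # (4, 2)
```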

DISCUSSION

Facial information has been favored by most researchers developing ADE methods based on video information. Studies such as that of Girard et al[6] subdivided the face into multiple AUs for investigation. Inspired by such studies, researchers recognized the importance of local facial information, and a series of attention mechanisms were proposed and employed to help models focus on local information. Although some studies[34,35] have explored aspects such as gait and body movements in individuals with depression, compared with the extensive attention paid to ADE methods based on facial information, methods based on body expressions remain relatively scarce. Notably, databases analyzing the body movements of individuals with depression are often not publicly accessible, and publicly available depression databases rarely contain body information. The lack of visibility and the difficulty of data collection are significant reasons for the limited development of ADE methods based on body expressions. Despite these challenges, we believe that body expression is as important a research direction as facial expression, and we hope that more ADE methods based on body changes will be developed. Research on human motion recognition, keypoint capture, and skeleton tracking may serve as a valuable reference for constructing ADE models based on body expressions.

For audio-based ADE methods, current approaches primarily involve combining handcrafted features or their transformed versions with deep features. Unlike facial expressions, audio information possesses richer individual characteristics, making feature selection more difficult. Finding a unified and effective feature selection pattern, along with deep learning architectural methods, remains a crucial task for future research.

The integration of multi-modal features is crucial for future depression detection. In clinical assessments, specialized doctors evaluate the subjects from various perspectives. Similarly, ADE based on deep learning should mimic this approach by extracting and merging features from multiple perspectives and modalities. In particular, methods for feature fusion should be carefully designed by considering common tendencies, temporal synchronicity, and the dynamic nature of modalities. With ongoing enhancements in computational power, ADE methods based on large models will continue to be proposed.

However, data collection and availability remain significant limitations for the development of ADE. First, owing to privacy policies and research ethics, existing open-source datasets are scarce. Second, open-source datasets rarely include multi-modal data: while the DAIC-WOZ dataset provides transcripts, audio, and desensitized video information, datasets offering other features potentially relevant to depression detection are lacking. Third, most current research on AI-based depression diagnosis and treatment has a relatively small sample size, making it difficult to accurately reflect the characteristics of the overall population with depression. Fourth, data collection across different research groups does not follow a unified standard.

In practical applications, deep learning-based ADE methods are still in the early stages of development. Nemesure et al[42] assessed the mental well-being of student populations by combining electronic health records with machine learning methods. Aguilera et al[43] developed applications and applied them to primary care. We believe that the interpretability of deep learning is a major limitation in its application. Future research should focus on two directions to enhance model credibility. The first is the construction of knowledge-guided ADE models. The research framework proposed by Hitzler and Sarker[44] is a novel research direction. The second is the incorporation of relevant analyses for model interpretability. Researchers can analyze the operating mechanism of a model using techniques such as visualization and feature capture.
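As an example of the visualization route, the sketch below implements a minimal Grad-CAM-style analysis in PyTorch: the activations of a convolutional layer are weighted by the gradients of the predicted class score to produce a heat map over the input. The toy network and input are placeholders standing in for a trained ADE model.

```python
# Minimal Grad-CAM-style sketch: combine a conv layer's activations with the gradients of the
# predicted score to highlight the image regions that drive the decision.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
)

activations, gradients = {}, {}
conv = model[0]
conv.register_forward_hook(lambda m, inp, out: activations.update(feat=out))
conv.register_full_backward_hook(lambda m, gin, gout: gradients.update(grad=gout[0]))

image = torch.randn(1, 3, 112, 112)        # stands in for a face crop
score = model(image)[0, 1]                 # logit of the hypothetical "depressed" class
score.backward()

weights = gradients["grad"].mean(dim=(2, 3), keepdim=True)             # channel importance
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True)) # weighted activation map
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
print(cam.shape)                           # (1, 1, 112, 112) heat map over the input
```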

In summary, researchers should pay more attention to ADE based on bodily expressions. Additionally, unified and effective methods for audio feature extraction should continue to be explored. When constructing ADE models, special attention should be paid to their interpretability, and we hope that future research will introduce new perspectives and methods to address this aspect. Regarding data collection, research groups should consider publicly sharing their research paradigms, psychological effect evaluations, and desensitized data, which would considerably advance the construction of large models and overall research progress in ADE.

CONCLUSION

In this paper, we provided an overview of prominent audio- and video-based ADE models in recent years, covering the aspects of audio, video, and fusion. An analysis of the relevant research revealed a lack of exploration of the body expressions of individuals with depression. We encourage researchers to delve further into audio feature extraction. In addition, we believe that the construction of large models is crucial for future research. We hope that researchers will develop outstanding ADE models in the future.

Footnotes

Provenance and peer review: Invited article; Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Psychiatry

Country/Territory of origin: China

Peer-review report’s scientific quality classification

Grade A (Excellent): 0

Grade B (Very good): B

Grade C (Good): 0

Grade D (Fair): 0

Grade E (Poor): 0

P-Reviewer: Horkaew P, Thailand S-Editor: Qu XL L-Editor: A P-Editor: Zheng XM

References
1.  Benazzi F. Various forms of depression. Dialogues Clin Neurosci. 2006;8:151-161.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 38]  [Cited by in F6Publishing: 49]  [Article Influence: 2.7]  [Reference Citation Analysis (0)]
2.  Li X, Mao Y, Zhu S, Ma J, Gao S, Jin X, Wei Z, Geng Y. Relationship between depressive disorders and biochemical indicators in adult men and women. BMC Psychiatry. 2023;23:49.  [PubMed]  [DOI]  [Cited in This Article: ]  [Reference Citation Analysis (0)]
3.  Cannizzaro M, Harel B, Reilly N, Chappell P, Snyder PJ. Voice acoustical measurement of the severity of major depression. Brain Cogn. 2004;56:30-35.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 108]  [Cited by in F6Publishing: 79]  [Article Influence: 4.0]  [Reference Citation Analysis (0)]
4.  Leff J, Abberton E. Voice pitch measurements in schizophrenia and depression. Psychol Med. 1981;11:849-852.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 29]  [Cited by in F6Publishing: 30]  [Article Influence: 0.7]  [Reference Citation Analysis (0)]
5.  Ekman P  What the face reveals: Basic and applied studies of spontaneous expression using the facial action coding system (facs). 2nd ed. Rosenberg E, editor. Cary, NC: Oxford University Press, 2005.  [PubMed]  [DOI]  [Cited in This Article: ]
6.  Girard JM, Cohn JF, Mahoor MH, Mavadati S, Rosenwald DP. Social Risk and Depression: Evidence from Manual and Automatic Facial Expression Analysis. Proc Int Conf Autom Face Gesture Recognit. 2013;1-8.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 94]  [Cited by in F6Publishing: 79]  [Article Influence: 7.2]  [Reference Citation Analysis (0)]
7.  Joshi J, Goecke R, Parker G, Breakspear M. Can body expressions contribute to automatic depression analysis? IEEE Xplore. .  [PubMed]  [DOI]  [Cited in This Article: ]
8.  American Psychiatric Association  Dsm-iv-tr: Diagnostic and statistical manual of mental disorders. 4th ed. Arlington, TX: American Psychiatric Press; 2000.  [PubMed]  [DOI]  [Cited in This Article: ]
9.  First MB. Diagnostic and statistical manual of mental disorders, 5th edition, and clinical utility. J Nerv Ment Dis. 2013;201:727-729.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 99]  [Cited by in F6Publishing: 166]  [Article Influence: 15.1]  [Reference Citation Analysis (0)]
10.  Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16:606-613.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 21545]  [Cited by in F6Publishing: 24846]  [Article Influence: 1080.3]  [Reference Citation Analysis (0)]
11.  Hamilton M. A rating scale for depression. J Neurol Neurosurg Psychiatry. 1960;23:56-62.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 21041]  [Cited by in F6Publishing: 21778]  [Article Influence: 340.3]  [Reference Citation Analysis (0)]
12.  Peng Z, Hu Q, Dang J. Multi-kernel SVM based depression recognition using social media data. Int J Mach Learn Cybern. 2019;10:43-57.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 37]  [Cited by in F6Publishing: 15]  [Article Influence: 2.1]  [Reference Citation Analysis (0)]
13.  Alghowinem S, Goecke R, Wagner M, Epps J, Gedeon T, Breakspear M, Parker G. A comparative study of different classifiers for detecting depression from spontaneous speech. IEEE Xplore. 2013;.  [PubMed]  [DOI]  [Cited in This Article: ]
14.  Wen L, Li X, Guo G, Zhu Y. Automated depression diagnosis based on facial dynamic analysis and sparse coding. IEEE Trans Inf Forensics Secur. 2015;10:1432-1441.  [PubMed]  [DOI]  [Cited in This Article: ]
15.  Valstar M, Schuller B, Smith K, Eyben F, Jiang B, Bilakhia S, Schnieder S, Cowie R, Pantic M. AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge. ACM. 2013;.  [PubMed]  [DOI]  [Cited in This Article: ]
16.  Valstar M, Schuller B, Smith K, Almaev T, Eyben F, Krajewski J, Cowie R, Pantic M. AVEC 2014: 3D Dimensional Affect and Depression Recognition Challenge. Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM. 2014;.  [PubMed]  [DOI]  [Cited in This Article: ]
17.  Gratch J, Artstein R, Lucas GM, Stratou G, Scherer S, Nazarian A, Wood R, Boberg J, Devault D, Marsella S, Traum DR. The Distress Analysis Interview Corpus of Human and Computer Interviews. Proc of LREC. 2014;3123-3128.  [PubMed]  [DOI]  [Cited in This Article: ]
18.  He L, Cao C. Automated depression analysis using convolutional neural networks from speech. J Biomed Inform. 2018;83:103-111.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 72]  [Cited by in F6Publishing: 77]  [Article Influence: 12.8]  [Reference Citation Analysis (0)]
19.  Zuo L, Mak MW. Avoiding dominance of speaker features in speech-based Depression detection. Pattern Recognit Lett. 2023;50-56.  [PubMed]  [DOI]  [Cited in This Article: ]
20.  Du M, Liu S, Wang T, Zhang W, Ke Y, Chen L, Ming D. Depression recognition using a proposed speech chain model fusing speech production and perception features. J Affect Disord. 2023;323:299-308.  [PubMed]  [DOI]  [Cited in This Article: ]  [Reference Citation Analysis (0)]
21.  Yang W, Liu J, Cao P, Zhu R, Wang Y, Liu JK, Wang F, Zhang X. Attention guided learnable time-domain filterbanks for speech depression detection. Neural Netw. 2023;165:135-149.  [PubMed]  [DOI]  [Cited in This Article: ]  [Reference Citation Analysis (0)]
22.  Han ZJ, Shang YY, Shao ZH, Liu JY, Guo GD, Liu T, Ding H, Hu Q. Spatial-Temporal Feature Network for Speech-Based Depression Recognition. IEEE Trans Cogn Dev Syst. 2023;.  [PubMed]  [DOI]  [Cited in This Article: ]
23.  Chen W, Xing X, Xu X, Pang J, Du L. SpeechFormer++: A hierarchical efficient framework for paralinguistic speech processing. IEEE ACM Trans Audio Speech Lang Process. 2023;31:775-788.  [PubMed]  [DOI]  [Cited in This Article: ]
24.  Mao K, Zhang W, Wang DB, Li A, Jiao R, Zhu Y, Wu B, Zheng T, Qian L, Lyu W, Ye M, Chen J. Prediction of depression severity based on the prosodic and semantic features with bidirectional LSTM and time distributed CNN. IEEE Trans Affect Comput. 1-1.  [PubMed]  [DOI]  [Cited in This Article: ]
25.  He L, Chan JCW, Wang Z. Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing. 2021;422:165-175.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 17]  [Cited by in F6Publishing: 18]  [Article Influence: 6.0]  [Reference Citation Analysis (0)]
26.  He L, Tiwari P, Lv C, Wu W, Guo L. Reducing noisy annotations for depression estimation from facial images. Neural Netw. 2022;153:120-129.  [PubMed]  [DOI]  [Cited in This Article: ]  [Reference Citation Analysis (0)]
27.  Zhao J, Zhang L, Cui Y, Shi J, He L. A novel Image-Data-Driven and Frequency-Based method for depression detection. Biomed Signal Process Control. 2023;86:105248.  [PubMed]  [DOI]  [Cited in This Article: ]
28.  Huang SC, Cheng FC, Chiu YS. Efficient contrast enhancement using adaptive gamma correction with weighting distribution. IEEE Trans Image Process. 2013;22:1032-1041.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 633]  [Cited by in F6Publishing: 133]  [Article Influence: 12.1]  [Reference Citation Analysis (0)]
29.  Kupyn O, Martyniuk T, Wu J, Wang Z. DeblurGAN-v2: Deblurring (orders-of-magnitude) faster and better. Proc IEEE Int Conf Comput Vis. 2019;.  [PubMed]  [DOI]  [Cited in This Article: ]
30.  Ding X, Zhang X, Zhou Y, Han J, Ding G, Sun J. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2022;.  [PubMed]  [DOI]  [Cited in This Article: ]
31.  Liu Z, Yuan X, Li Y, Shangguan Z, Zhou L, Hu B. PRA-Net: Part-and-Relation Attention Network for depression recognition from facial expression. Comput Biol Med. 2023;157:106589.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in F6Publishing: 4]  [Reference Citation Analysis (0)]
32.  Yuan Y, Wang Q. Detection model of depression based on eye movement trajectory. IEEE Proc Int Conf Data Sci Adv Anal. 2019;.  [PubMed]  [DOI]  [Cited in This Article: ]
33.  Zhao J, Wang Q. Eye movement attention based depression detection model. IEEE Proc Int Conf Data Sci Adv Anal. 2022;.  [PubMed]  [DOI]  [Cited in This Article: ]
34.  Yu Y, Li W, Zhao Y, Ye J, Zheng Y, Liu X, Wang Q. Depression and Severity Detection Based on Body Kinematic Features: Using Kinect Recorded Skeleton Data of Simple Action. Front Neurol. 2022;13:905917.  [PubMed]  [DOI]  [Cited in This Article: ]  [Reference Citation Analysis (0)]
35.  Zhao X, Wang Q. Depression detection based on human simple kinematic skeletal data. IEEE Proc Int Conf Data Sci Adv Anal. 2022;.  [PubMed]  [DOI]  [Cited in This Article: ]
36.  Yang B, Wang P, Cao M, Zhu X, Wang S, Ni R, Yang C. Uncertainty-Aware Label Contrastive Distribution Learning for Automatic Depression Detection. IEEE Trans Comput Soc Syst. 2023;.  [PubMed]  [DOI]  [Cited in This Article: ]
37.  Niu M, Tao J, Liu B, Huang J, Lian Z. Multimodal spatiotemporal representation for automatic depression level detection. IEEE Trans Affect Comput. 2023;14:294-307.  [PubMed]  [DOI]  [Cited in This Article: ]
38.  Shao W, You Z, Liang L, Hu X, Li C, Wang W, Hu B. A Multi-Modal Gait Analysis-Based Detection System of the Risk of Depression. IEEE J Biomed Health Inform. 2022;26:4859-4868.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 2]  [Cited by in F6Publishing: 1]  [Article Influence: 0.3]  [Reference Citation Analysis (0)]
39.  Zhou L, Liu Z, Shangguan Z, Yuan X, Li Y, Hu B. TAMFN: Time-Aware Attention Multimodal Fusion Network for Depression Detection. IEEE Trans Neural Syst Rehabil Eng. 2023;31:669-679.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in F6Publishing: 3]  [Reference Citation Analysis (0)]
40.  Yoon J, Kang C, Kim S, Han J. D-vlog: Multimodal Vlog Dataset for Depression Detection. Proc Conf AAAI Artif Intell. 2022;36:12226-12234.  [PubMed]  [DOI]  [Cited in This Article: ]
41.  Uddin MA, Joolee JB, Sohn KA. Deep multi-modal network based automated depression severity estimation. IEEE Trans Affect Comput. 2022;1-1.  [PubMed]  [DOI]  [Cited in This Article: ]
42.  Nemesure MD, Heinz MV, Huang R, Jacobson NC. Predictive modeling of depression and anxiety using electronic health records and a novel machine learning approach with artificial intelligence. Sci Rep. 2021;11:1980.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 32]  [Cited by in F6Publishing: 47]  [Article Influence: 15.7]  [Reference Citation Analysis (0)]
43.  Aguilera A, Figueroa CA, Hernandez-Ramos R, Sarkar U, Cemballi A, Gomez-Pathak L, Miramontes J, Yom-Tov E, Chakraborty B, Yan X, Xu J, Modiri A, Aggarwal J, Jay Williams J, Lyles CR. mHealth app using machine learning to increase physical activity in diabetes and depression: clinical trial protocol for the DIAMANTE Study. BMJ Open. 2020;10:e034723.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 38]  [Cited by in F6Publishing: 29]  [Article Influence: 7.3]  [Reference Citation Analysis (0)]
44.  Hitzler P, Sarker MK.   Neuro-Symbolic Artificial Intelligence: The State of the Art. IOS Press. 2021. Available from: https://ebooks.iospress.nl/volume/neuro-symbolic-artificial-intelligence-the-state-of-the-art.  [PubMed]  [DOI]  [Cited in This Article: ]