Abstract: As face forgery technology has developed rapidly in recent years, synthesized faces have become extremely difficult for the human eye to identify, and the abuse of this technology by criminals has seriously threatened social stability and personal privacy, making forged face detection technology increasingly important. This review systematically discusses the current status of forged face detection technology from two aspects: forged face image detection and forged face video detection. For forged face image detection, methods based on the image spatial and frequency domains, identity consistency detection, and the application of face region localization technology are discussed. For forged face video detection, research focuses on the integration of spatio-temporal features, the utilization of physiological features, and the combination of audiovisual information. In addition, the study introduces commonly used evaluation metrics and systematically analyzes several important datasets, including their characteristics and application scenarios. It also points out limitations in the current literature, such as the lack of robustness against adversarial samples and the poor adaptability of detection methods to new forgery techniques. Based on these analyses, the study puts forward possible future research directions, including the optimization of cross-domain detection technology, the exploration of new algorithms, and the study of model interpretability. This review not only provides researchers with a comprehensive understanding of forged face detection technology but also points out the development direction for subsequent research, possessing both theoretical value and practical significance.
Abstract: The neural radiance field (NeRF) has significant advantages in generating high-fidelity maps thanks to its neural implicit scene representation. Applying NeRF to simultaneous localization and mapping (SLAM), namely the NeRF-based SLAM method, enables continuous 3D modeling while achieving high-precision localization, enhancing the quality and detail of scene reconstruction by rendering new perspectives and predicting unknown regions. To track the latest research results in this field, this study reviews and summarizes the key NeRF-based SLAM algorithms of recent years. Firstly, the core principle of NeRF technology is introduced and a comprehensive overview of the framework of NeRF-based SLAM methods is given, followed by a focus on improvements and optimizations of NeRF-based SLAM, including improving the efficiency of neural implicit representation, addressing large-scale scene reconstruction, adding loop closure and global optimization to achieve global consistency, and handling dynamic interference. Finally, an outlook on NeRF-based SLAM methods is presented to provide valuable references for related researchers and to promote more innovative research.
Abstract: To address the poor accuracy of monocular 3D object detection algorithms caused by the scale differences of objects at different depths in monocular images, a detection algorithm based on fused sampling and depth-scale constraints is proposed. Firstly, to enhance the ability of the sampled features to represent objects at different scales, a multi-scale fusion module (MFM) is constructed. It fuses the sampled features at different levels and scales through hierarchical and iterative aggregation, thereby improving the extraction of the objects' implicit scale features. In addition, a depth-scale correlation module (DSCM) is constructed. It uses the linear projection constraint between depth and scale to compensatorily scale objects of different sizes to the same feature level, balancing the model's attention to objects at different distances. Quantitative results on the KITTI and Waymo datasets show that the proposed algorithm improves the overall average precision AP3D by 1.56 and 3.07 percentage points, respectively, compared with similar algorithms across multiple difficulty levels, verifying its effectiveness and generalization. Meanwhile, qualitative results on the two datasets confirm that the algorithm significantly mitigates the impact of object scale differences on detection performance.
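The depth-scale constraint this abstract relies on follows from the pinhole projection model, under which an object's image-plane size is inversely proportional to its depth. A minimal sketch of that relation (the focal length and pedestrian height below are hypothetical values, not taken from the paper):

```python
def projected_height(f_pixels, obj_height_m, depth_m):
    """Pinhole model: image-plane height (pixels) = f * H / Z."""
    return f_pixels * obj_height_m / depth_m

# The same 1.6 m pedestrian appears half as tall when twice as far away,
# which is the scale difference the DSCM compensates for.
near = projected_height(720.0, 1.6, 10.0)   # 115.2 px
far = projected_height(720.0, 1.6, 20.0)    # 57.6 px
```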
Abstract: To address the issues of limited sample size and imbalanced categories in existing rural road image datasets, a data augmentation method based on an improved StyleGAN is proposed. This approach introduces a decoupled mapping network into the original StyleGAN framework to reduce the coupling degree of the W-space latent code. By integrating the advantages of convolution and Transformer, this study designs a convolution-coupled transfer block (CCTB). The core cross-window self-attention mechanism within this module enhances the network’s ability to capture complex context and spatial layouts. These two improvements significantly boost network performance. Ablation experiments comparing the original and improved StyleGAN networks show that the IS index increases from 42.38 to 77.31, and the FID value decreases from 25.09 to 12.42, demonstrating a substantial improvement in data generation quality and authenticity. To verify the impact of data augmentation on model performance, two classic and mainstream object detection algorithms are used for testing. Performance differences between the original and augmented datasets are compared, further confirming the effectiveness of the improved methods.
Abstract: Various methods currently exist for identifying lies, including polygraph tests. However, these methods are of limited practical effectiveness: they require physical contact with the subject and operators with professional knowledge, making them inconvenient and less effective. Psychological research shows that micro-expressions are subtle facial muscle movements of extremely short duration that can reveal a person's true inner state when they occur. Related studies show that micro-expression features can serve as clues for deception recognition. This study focuses on deception recognition based on micro-expression features. Firstly, a dataset called MED, which contains micro-expression data collected while people are lying, is constructed. Secondly, a micro-expression feature learning model named MEDR, based on a multi-layer self-attention mechanism, is designed; it can recognize lies from the learned micro-expression features in both lying and non-lying situations. Finally, experimental comparisons between the proposed model and existing models are conducted on the newly constructed dataset. Experimental results show that the proposed model achieves an accuracy of 94.33% on the self-built high-quality dataset, indicating excellent performance in deception recognition.
Abstract: With the widespread use of network video platforms (NVP), network videos often face copyright infringement and cross-platform copyright detection issues when shared across different platforms. Therefore, this study proposes a blockchain-based cross-platform network video copyright protection scheme (BCVCP), which protects network video copyrights across platforms by means of blockchain together with ownership sequence (OS) generation and detection. The scheme includes identity authentication, keyframe extraction, ownership sequence generation and detection, and network video control management. Specifically, before operations such as video uploading or access, identity authentication is carried out to ensure the security of identity information. During the uploading of a network video, an ownership sequence is generated and stored in distributed nodes. Then, the keyframes of the video are extracted and the generated ownership sequence is embedded into them. Finally, smart contracts are invoked for cross-platform ownership sequence detection and network video dissemination management to prevent infringement. Experiments verify the robustness of ownership encoding quality and ownership recognition during cross-platform network video transmission, thereby protecting the copyright of network videos.
Abstract: The intelligent diagnosis of premium threaded connections (PTC) is crucial for ensuring the stability and sealing of oil pipes under high-temperature, high-pressure, and acidic gas conditions. Accurate diagnosis relies on analyzing PTC curves under different operating conditions to reflect the make-up quality of the connection, but obtaining a large amount of valid data in actual industrial inspections is challenging. This study introduces an end-to-end classification model that combines an asynchronously optimized 2D deep convolutional generative adversarial network (AoT-DCGAN) and a 2D convolutional neural network (P-CNN), aiming to improve classification performance with small sample sizes. The proposed method first utilizes AoT-DCGAN to learn the distribution of the original samples and generate corresponding synthetic samples. At the same time, a novel weight optimization strategy, asynchronous optimization (AO), is implemented to alleviate the gradient vanishing problem during the generator optimization phase. Subsequently, a novel P-CNN model is designed and trained on the expanded dataset to achieve automatic classification of PTC curves. The method is evaluated in terms of recall, specificity, F1 score, precision, and confusion matrix under different data augmentation ratios. The results indicate that as the dataset size increases, the model's classification ability improves, peaking at a dataset size of 1,200. In addition, on the same training set, the P-CNN model outperforms traditional machine learning and deep learning models, achieving optimal classification accuracies of 95.9%, 95.5%, and 96.7% on the AC, ATI, and NDT curves, respectively. Finally, the research confirms that applying asynchronous optimization during DCGAN training yields a more stable decrease in the loss function.
Abstract: Road damage poses a great threat to the service life and safety of roads, and early detection of road defects facilitates maintenance and repair. Traditional road defect detection methods typically rely on manual visual inspection and vehicle-mounted pavement monitoring systems, but these methods depend heavily on the experience of road maintenance personnel. With the advancement of deep learning, a growing number of researchers have applied it to road defect detection, among which the YOLO series of object detection methods and their variants are the most common. However, most of these methods require post-processing operations, which hinder model optimization, impair robustness, and delay the detector's inference. To address these issues, as well as the multi-scale challenges in road defect detection, an improved RT-DETR model is proposed. The backbone network is fine-tuned, and the MSaE attention module is introduced. In the encoder, GhostConv convolution and the DySample module are used to optimize upsampling, while the ADown module optimizes downsampling. Comparative experiments are conducted on the public SVRDD dataset. Experimental results show that the proposed method achieves an mAP@50 of 72.5% on the SVRDD dataset, 3.8 percentage points higher than the baseline RT-DETR-R18, significantly enhancing road defect detection performance.
Abstract: To address the low accuracy and high missed-detection rates in pedestrian detection caused by complex background interference, this study proposes an adaptive dual-branch dense pedestrian detection algorithm, DACD-YOLO, incorporating improved attention mechanisms. First, the backbone network employs an adaptive dual-branch structure that fuses different features through dynamic weighting and introduces depthwise separable convolution to reduce computational cost, effectively mitigating the information loss of traditional single-branch networks. Second, an adaptive vision center is proposed to enhance intra-layer feature extraction through dynamic optimization, with channel numbers reconfigured to balance accuracy and computational load. A coordinate dual-channel attention mechanism is then introduced, combining a heterogeneous convolution kernel design with a lightweight fusion module to reduce computational complexity and better capture key features. Lastly, a dilated convolution detection head is utilized, fusing multi-scale features through convolutions with varying dilation rates, effectively enhancing feature extraction for small and occluded objects. Experimental results show that, compared with the original YOLOv8n, the proposed algorithm improves mAP@0.5 and mAP@0.5:0.95 by 2.3% and 2.2%, respectively, on the WiderPerson dataset, and by 3.5% and 4.6%, respectively, on the CrowdHuman dataset. The experiments demonstrate that the proposed algorithm significantly enhances accuracy in dense pedestrian detection compared with the original method.
Abstract: Sign language is a communication tool commonly used by people with hearing impairments or those unable to communicate verbally. It uses gestures to convey actions and to represent images or syllables that form specific meanings or words. With the continuous development of computer vision and deep learning, sign language recognition technology has emerged and continued to develop, making it increasingly possible for hearing individuals to communicate with deaf or speech-impaired people. However, the complexity and variability of dynamic sign language still pose challenges to its accurate detection and recognition. To promote research in this field, this study conducts an in-depth review of existing dynamic sign language recognition methods and technologies. First, the development history and current research status of dynamic sign language recognition, commonly used dynamic sign language datasets, and evaluation metrics for sign language recognition methods are reviewed. Second, deep learning models frequently used in dynamic sign language recognition are examined, and the challenges faced by the technology, along with corresponding solutions, are discussed. Finally, based on the current status of sign language recognition, the challenges of dynamic sign language recognition are summarized, and an analysis and outlook are provided on potential improvements to recognition performance in the next stage.
Abstract: High-definition, low-latency display of Chinese paintings is essential for Chinese painting VR exhibition applications. Due to the limited memory and GPU resources on mobile VR headsets, it is challenging to display a large number of high-resolution Chinese painting textures simultaneously. Moreover, direct viewing of fine details is hindered by mipmap management and the low resolution of mobile VR devices. This study proposes an improved virtual texture method, optimizing both tile request calculation and tile loading stages based on existing virtual texture methods. In the tile request calculation phase, the method incorporates tile request computations for magnified perspectives. Compute Shader is utilized to parallelize the processing of tile request parameters, and hashing is applied to minimize overhead when constructing result caches. In the tile loading phase, lock-free queues are implemented to enhance loading efficiency. A direct loading strategy for request tiles, constrained by a quantity threshold, reduces display latency. The performance and texture display effects are evaluated in scenarios with single or multiple Chinese paintings, simulating user behavior. Results show that the proposed method supports high-definition, low-latency display of high-resolution Chinese painting textures on mobile VR devices. Magnification-assisted perspectives allow for clear viewing of the finest texture details. Compared to existing virtual texture methods, such as Unreal SVT, the proposed method achieves higher frame rates and reduces display latency for high-resolution texture tiles of multiple Chinese paintings.
Abstract: An improved YOLOv8 model (FCU-YOLOv8) is proposed to enhance the accuracy and efficiency of rice disease detection, addressing the challenges of diverse rice diseases, complex backgrounds, and subtle differences between disease characteristics. The FasterNeXt module replaces the C2f module in the YOLOv8 backbone network; by optimizing the network structure, it reduces computation and memory access while improving feature extraction efficiency, thus lowering the model's inference cost. The C3K module (a multi-scale convolution module) and the CPSA module (a convolutional attention mechanism) are designed to further enhance the model's perception of disease region features: the C3K module adapts to disease characteristics at various scales through flexible convolutional kernel selection, while the CPSA module employs an attention mechanism to strengthen the capture of key information. To improve the quality of detection boxes and the detection of dense disease targets, the optimized unified intersection over union (UIoU) loss function is adopted, which balances the accuracy and consistency of bounding boxes during the regression phase. On a custom-built image dataset of eight common rice diseases, FCU-YOLOv8 shows significant improvements over the original YOLOv8 across several performance metrics: mAP@0.5 reaches 94.7%, a 2.4-percentage-point improvement over the baseline, and mAP@0.5:0.95 reaches 67.2%, a 3.3-percentage-point improvement. In terms of lightweighting, the model's parameters are reduced by 24.2% and its floating-point operations by 28.7% compared with the baseline. The proposed algorithm also outperforms current mainstream detection algorithms, demonstrating the effectiveness of the network.
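The UIoU variant used above is not specified here, but every IoU-style bounding-box loss builds on the plain intersection-over-union of two axis-aligned boxes, which can be sketched as follows (corner-format boxes are an assumption for illustration):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents clamp to zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

An IoU-based regression loss is then typically `1 - iou(pred, target)` plus variant-specific penalty terms.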
Abstract: Clickbait refers to the use of sensational or exaggerated headlines to attract users into clicking, a practice that has proliferated in recent years across online platforms such as news portals and social media. This trend has led to user dissatisfaction and, in some cases, facilitated online fraud. Large language models (LLMs), known for their strong natural language understanding and text generation capabilities, have demonstrated outstanding performance across various natural language processing tasks. However, when faced with specific challenges like clickbait detection, where decision boundaries are often unclear, LLMs are prone to hallucination. To address this issue, a method based on a dual-layer multi-agent large language model is proposed, which significantly enhances clickbait detection accuracy without fine-tuning the entire model. Specifically, internal voting within each agent in the first layer and cross-voting among different agents in the second layer jointly enhance detection performance. Validation on three benchmark datasets shows that the proposed method outperforms state-of-the-art large models and prompt learning techniques by nearly 13% and 10% in accuracy, respectively.
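The two-layer voting scheme described above can be sketched with simple majority votes; the agent answers, labels, and tallying rule below are illustrative assumptions, not details from the paper:

```python
from collections import Counter

def internal_vote(samples):
    """First layer: majority over repeated answers from a single agent."""
    return Counter(samples).most_common(1)[0][0]

def cross_vote(agent_answers):
    """Second layer: majority across the per-agent verdicts."""
    return Counter(agent_answers).most_common(1)[0][0]

# Three agents, each queried three times on the same headline.
agents = [
    ["clickbait", "clickbait", "not"],
    ["not", "not", "not"],
    ["clickbait", "clickbait", "clickbait"],
]
verdicts = [internal_vote(s) for s in agents]
final = cross_vote(verdicts)   # "clickbait"
```

Repeated sampling inside an agent smooths out single-response hallucinations; cross-voting then smooths out per-agent biases.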
Abstract: Large vision-language models (LVLMs) demonstrate remarkable capabilities in understanding visual information and generating verbal expressions. However, LVLMs are often affected by object hallucination, where outputs appear plausible but do not align with the visual information in the images. This discrepancy between the generated text and the images presents a significant challenge for accurate image-to-text alignment. To address this issue, this study identifies the lack of object attention as a key factor contributing to object hallucination and proposes an image contrast enhancement (ICE) method to mitigate it. ICE is a simple, user-friendly approach that contrasts the output distributions obtained from the original and the augmented visual inputs. This enhances the model's ability to perceive images accurately, ensuring that the generated content aligns closely with the visual input and remains contextually consistent. Experimental results demonstrate that the ICE method effectively mitigates object hallucination across various LVLMs without requiring additional training or external tools. Furthermore, the method performs well on the MME benchmark for large vision-language models, indicating its broad applicability and effectiveness. The code will be released at ChangGuiyong/ICE.
Abstract: Medical image segmentation serves as a fundamental and critical component in numerous clinical applications. Recent advancements in interactive segmentation methods have attracted significant attention due to their high accuracy and robustness in complex clinical tasks. However, current deep learning-based interactive segmentation methods exhibit limitations in leveraging user interactions, particularly in interaction encoding design and pixel classification. To address these limitations, this study proposes a hybrid interaction design incorporating “near-center points” and “outer-edge points”, which ensures low interaction costs while accurately capturing user intent. Additionally, the existing geodesic distance encoding method is enhanced with a Gaussian attenuation function to mitigate image noise interference and improve the robustness and accuracy of interaction encoding. Furthermore, a Gaussian process classification method based on a hybrid kernel function is integrated to fully exploit user interaction information during pixel classification, enhancing segmentation accuracy while endowing the model with interpretability. Extensive experiments on five segmentation tasks across four representative subsets of the medical segmentation decathlon (MSD) dataset demonstrate that the proposed method achieves consistently high segmentation accuracy. In particular, for complex tasks such as pancreas tumor and colon image segmentation, the method achieves significantly better Dice coefficients and ASSD values than existing methods, showing its strengths in precise segmentation and boundary refinement.
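The Gaussian attenuation idea mentioned above can be sketched as a map from a per-pixel distance (geodesic in the paper; any distance works for illustration) to a bounded interaction weight. The exact attenuation formula and the bandwidth value below are assumptions, not taken from the paper:

```python
import math

def gaussian_attenuation(dist, sigma=10.0):
    """Map a (geodesic) distance to a click into a [0, 1] interaction weight.
    Pixels near a user click get weight ~1; distant pixels decay smoothly,
    which damps the influence of noisy distance estimates."""
    return math.exp(-dist * dist / (2.0 * sigma * sigma))

# Weight falls off smoothly with distance from the interaction point.
weights = [round(gaussian_attenuation(d), 3) for d in (0, 5, 10, 30)]
```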
Abstract: Rolling bearings are crucial components in mechanical systems. Because some faults occur only rarely, the corresponding data samples are scarce, making the related data difficult to collect and process; yet if not properly addressed, such faults can lead to severe safety hazards and substantial economic losses. To deal with this problem, this study proposes a dual-path fault diagnosis model that integrates traditional signal processing methods with a convolutional neural network (CNN) and a multilayer perceptron (MLP). For feature extraction, the study combines discrete wavelet transform (DWT) and continuous wavelet transform (CWT) with average downsampling to extract multi-scale time-frequency and time-domain features from the raw signals. The model contains two paths: one applies a residual CNN with an embedded efficient channel attention (ECA) mechanism to the engineered time-frequency features, while the other uses an MLP to process the down-sampled multi-scale time-domain features; the two paths are finally fused for classification. Small-sample evaluation shows that the feature engineering method achieves an average diagnostic accuracy of 99.34% on the Case Western Reserve University (CWRU) dataset, higher than the 98.97% achieved by the traditional method. The hybrid CNN-MLP dual-path model achieves accuracies of 99.90% on the CWRU dataset and 98.38% on the Jiangnan University (JNU) dataset, showing its potential for small-sample rolling bearing fault diagnosis.
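The average-downsampling step feeding the MLP branch can be sketched as follows; the scale factors and the trailing-remainder policy are illustrative assumptions, not details from the paper:

```python
def avg_downsample(signal, factor):
    """Average-downsample a 1D signal by a factor (trailing remainder dropped)."""
    n = len(signal) // factor
    return [sum(signal[i * factor:(i + 1) * factor]) / factor for i in range(n)]

def multiscale(signal, factors=(2, 4, 8)):
    """One smoothed time-domain view per scale, e.g. as input to an MLP branch."""
    return [avg_downsample(signal, f) for f in factors]

views = multiscale(list(range(16)))   # views of length 8, 4, 2
```

Each coarser view suppresses high-frequency noise while keeping the slow envelope of the vibration signal.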
Abstract: With the widespread adoption of electronic health records (EHR), retrieving similar cases has become a critical task in supporting clinical decision-making, such as auxiliary diagnosis and treatment planning. However, EHR data are characterized by high dimensionality, heterogeneity, and large volume. To effectively integrate multimodal clinical data and achieve efficient retrieval, this study proposes MCDF, a deep hashing-based multimodal clinical data retrieval model for similar cases. The model extracts features with methods tailored to each modality: a multi-layer perceptron (MLP) for structured text data, BioBERT for unstructured text data, and BioMedCLIP for image data, followed by feature fusion through a self-attention mechanism. A triplet loss function guides the model to directly generate hash codes that effectively represent the samples, enabling rapid comparisons for retrieval; this not only enhances retrieval accuracy but also significantly improves efficiency. On the publicly available MIMIC-III dataset, the MCDF model is evaluated against traditional hashing methods (such as spectral hashing) and advanced methods (such as deep hashing network) using mean normalized discounted cumulative gain (MNDCG) and mean average precision (MAP) as evaluation metrics. Experimental results demonstrate that the MCDF model outperforms all baseline models, validating the superiority of the proposed approach.
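The triplet loss that drives the hash codes can be sketched as below; the squared-Euclidean distance, the margin value, and the toy embeddings are illustrative assumptions, not details from the paper:

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a,p) - d(a,n) + margin),
    with squared Euclidean distance. Pulls similar cases together and
    pushes dissimilar ones at least `margin` further away."""
    d = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

# A similar case already sits much closer than a dissimilar one: zero loss.
loss_good = triplet_loss([0, 0], [0.1, 0], [3, 0])
# The ordering is violated, so the triplet is penalized.
loss_bad = triplet_loss([0, 0], [2, 0], [0.5, 0])
```

In a deep-hashing setting the embeddings would be the network's continuous relaxations of the hash codes, later binarized with a sign function.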
Abstract: Smoke detection is very important for early fire warning. Existing detection algorithms are mostly based on deterministic convolutional neural networks. However, deterministic networks tend to give highly confident predictions even in regions where it is unclear whether a target object is present at all. In particular, the smoke edge region is relatively transparent, making it extremely easy to confuse with the surrounding environment; detection algorithms therefore judge this region poorly and produce a large number of false positives. Hence, an improved DeepLabV3+ algorithm is proposed. First, the algorithm optimizes DeepLabV3+ with Bayesian ideas to output non-deterministic feature encodings, quantifying the uncertainty in the predicted image and calibrating the model's learning process. Second, the feature encodings are preprocessed to reduce the information carried by irrelevant interfering features, and the feature fusion capability of the DeepLabV3+ network is strengthened to make full use of the multi-scale features extracted by the network. Finally, the upsampling operator in DeepLabV3+ is replaced with the CARAFE operator to reduce the loss of important information during upsampling. The model achieves good performance on the public SMOKE5K dataset, with an MIoU of 92.41%.
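One standard way to quantify the kind of per-pixel uncertainty the Bayesian formulation targets is the predictive entropy of the mean distribution over several stochastic forward passes; this is a generic sketch of that idea, not the paper's specific formulation:

```python
import math

def predictive_entropy(prob_samples):
    """Entropy of the mean class distribution over stochastic forward passes.
    High entropy flags regions (e.g. transparent smoke edges) where a
    confident point prediction should not be trusted."""
    k = len(prob_samples[0])
    n = len(prob_samples)
    mean = [sum(s[i] for s in prob_samples) / n for i in range(k)]
    return -sum(p * math.log(p) for p in mean if p > 0)

# Passes agree -> low entropy; passes disagree -> entropy near ln(2).
confident = predictive_entropy([[0.99, 0.01], [0.98, 0.02]])
uncertain = predictive_entropy([[0.9, 0.1], [0.1, 0.9]])
```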
Abstract: Land cover classification of remote sensing images is crucial for urban planning, land use, environmental monitoring, and land surface temperature inversion. This study proposes a U-shaped Transformer network, U-BiFormer, to address the misclassification of similar land cover types and the imbalance of land cover classes in remote sensing images. Building upon BiFormer, the model employs a U-shaped decoder and uses the outputs of all decoder stages to predict the segmentation map, enhancing its ability to capture details and contextual information and allowing better segmentation of similar classes. The unique hybrid attention module of the U-shaped decoder is improved by increasing the proportion of current-stage features in the mixed features, enabling the decoder to focus more on refining features at the current stage and enhancing segmentation performance for similar classes. Additionally, a CE+Focal hybrid loss function replaces the conventional cross-entropy loss to address class distribution imbalance in remote sensing images. Experiments demonstrate that the proposed method achieves better segmentation results for similar classes on the GID large-scale remote sensing image dataset, outperforming current mainstream models with an accuracy (Acc) of 81.99% and a mean intersection over union (mIoU) of 71.04%.
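The CE+Focal hybrid loss combines the standard cross-entropy term with the focal term, which down-weights easy examples so that rare classes contribute more to the gradient. A per-pixel sketch (the mixing weight `lam` and the alpha/gamma values are common defaults, assumed here rather than taken from the paper):

```python
import math

def cross_entropy(p_true):
    """CE for the probability assigned to the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal term: (1 - p_t)^gamma shrinks the loss of easy examples."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

def hybrid_loss(p_true, lam=0.5):
    """One plausible CE+Focal mix; the paper's exact weighting is unspecified."""
    return lam * cross_entropy(p_true) + (1.0 - lam) * focal_loss(p_true)

easy = focal_loss(0.95)   # well-classified pixel: nearly zero loss
hard = focal_loss(0.10)   # misclassified pixel dominates the signal
```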
Abstract: Existing trajectory generation methods based on generative adversarial imitation learning (GAIL) mostly use the Markov decision process (MDP) to model human movement patterns. With limited training data, it is difficult to learn the latent relationship between action selection and locations, and distance constraints between locations are not taken into account in the state transition function, so the quality of the generated trajectories needs improvement. For this reason, this study proposes an improved GAIL-based trajectory generation method. The method first incorporates prior knowledge of the location-related action distribution into the generator to help the model understand how actions change at specific locations, guiding it to better model a policy function that conforms to real scenarios. In addition, distance constraints are introduced into the state transition function to ensure the rationality of the generated trajectories. Experiments on two real datasets show that the proposed method achieves a Rank index of 0.0268, 39% better than the best baseline method, and its accuracy on the next-position prediction task is 6% higher than that of the best baseline.
Abstract: Due to the complex background of fundus images, thin and blurred capillaries, and noise interference, traditional retinal vessel segmentation algorithms often suffer from inaccurate recognition and vessel discontinuities. To address these problems, a retinal blood vessel segmentation algorithm based on an improved U-Net and attention mechanism (MRAU-Net) is proposed. To resolve insufficient feature extraction, a multi-scale residual convolution block (MSRCB) is designed to replace the traditional convolution blocks of U-Net. To reduce information loss and noise interference, a dual-dimensional attention optimization module (DAOM) is embedded in the bottleneck layer. To further mitigate information loss during encoding and decoding, a new multi-scale dense convolution block (MDCB) is constructed and combined with the traditional skip connections. Experiments on the two public datasets DRIVE and CHASE_DB1 yield F1 scores of 82.92% and 83.75%, AUCs of 98.87% and 98.96%, sensitivities of 84.50% and 83.82%, and accuracies of 97.11% and 97.63%, respectively, showing that MRAU-Net outperforms existing state-of-the-art algorithms.
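For binary vessel masks, the reported F1 score coincides with the Dice coefficient, which can be computed directly from the flattened masks; the toy masks below are illustrative:

```python
def dice_score(pred, target):
    """Dice / F1 for flat binary masks: 2|P∩T| / (|P| + |T|)."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total else 1.0

# Two of three predicted vessel pixels overlap the ground truth.
score = dice_score([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```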
Abstract: To address the challenges in multi-organ segmentation of abdominal CT images, such as varying organ sizes and shapes, difficulties in distinguishing boundaries between adjacent organs, and low contrast, this study proposes a feature-enhanced dual-branch multi-organ segmentation model. The model adopts an encoder-decoder architecture, with a master-slave dual-branch structure in the encoder. The master branch leverages Mamba to capture global dependencies among organs, while the slave branch employs a CNN to hierarchically extract local features of multiple organs. A cascade context module is introduced to transfer detailed local features from the slave branch to the master branch. In the decoder, a multi-scale feature fusion module integrates cross-level feature information to enhance boundary sharpness in multi-organ segmentation, and a deep feature enhancement module applies a cross-attention mechanism to improve the contrast between organ foregrounds and backgrounds, mitigating the interference of background noise. Experimental results on two public datasets, Synapse and ACDC, demonstrate that the proposed model achieves notable improvements in the Dice and HD95 metrics compared with recent baseline models.
Abstract: Few-shot image classification aims to learn a classifier from a limited amount of labeled data. Despite significant progress made by existing methods, challenges remain in extracting useful features and accurately classifying images due to the limited number of training samples, large intra-class variance, and small inter-class variance, which lead to confusion between support and query samples. To address these issues, this study proposes a novel multi-embedding enhanced network. This lightweight and efficient network represents images by generating a set of feature embeddings, rather than relying solely on single-image-level features. It is capable of generating various hierarchical structures to learn richer feature representations, thereby reducing intra-class variance and increasing inter-class variance. In addition, the study proposes a set-based metric combined with a dynamic self-adaptive weighting mechanism to measure the similarity between query and support sets. Experimental results demonstrate the excellent performance of the proposed model on the miniImageNet, tieredImageNet, and CUB datasets. Using a 1-shot setting in the ResNet-12 network, the model achieves accuracies of 72.22%, 75.43%, and 85.02%, respectively, outperforming the baseline models by 1.09%, 2.93%, and 1.47%.
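The set-based metric described in the abstract above can be illustrated with a minimal sketch: each query embedding is matched to its closest support embedding by cosine similarity, and the matches are averaged. The paper's dynamic self-adaptive weighting is not reproduced here; a uniform average stands in for it, so all names and details below are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def set_similarity(query_set, support_set):
    """Set-to-set similarity: match each query embedding to its closest
    support embedding and average the matches. A simplified stand-in for
    the paper's set-based metric; the dynamic self-adaptive weighting is
    replaced here by a uniform average."""
    scores = [max(cosine(q, s) for s in support_set) for q in query_set]
    return sum(scores) / len(scores)
```

Representing each image as a *set* of embeddings (rather than one global vector) is what lets this metric tolerate large intra-class variance: only the best-matching parts of the two sets need to agree.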
Abstract: In personalized explainable recommendation systems, the user ID is an important identifier for personalization. Existing algorithms usually adopt an encoder-decoder architecture to generate personalized explainable recommendations; however, this approach increases the complexity and computational cost of the algorithm and limits its accuracy. To address this problem, this study proposes a personalized explainable recommendation algorithm (PERSP) that incorporates a self-attention mechanism and prompt learning. The algorithm enhances interpretability by introducing and fine-tuning prompt learning in the input layer of BERT. To overcome the inability of BERT to directly use user IDs for personalized recommendations, the algorithm uses a self-attention mechanism to concatenate user IDs with the other inputs and feeds the sequences into the input layer of BERT for training and inference. To verify the effectiveness of the algorithm, comparative experiments are conducted on the TripAdvisor, Amazon, and Yelp datasets. On the TripAdvisor dataset, the PERSP algorithm improves the root mean squared error (RMSE) and mean absolute error (MAE) by 3.7% and 4.7%, respectively, compared with other baseline algorithms; on the Amazon dataset, the improvements are 1.05% and 4.1%, respectively; and on the Yelp dataset, the improvements are 1% and 2.5%, respectively. The results show that the algorithm performs better in personalized explainable recommendation tasks, effectively improving the accuracy and interpretability of recommendation systems.
Abstract: In offline-to-online reinforcement learning, though the agent can leverage pre-collected offline data for initial policy learning, the online fine-tuning phase often exhibits instability in the early stages, and the performance improvement after fine-tuning is relatively small. To address this issue, two key designs are proposed: 1) a simulated annealing-based dynamic offline-online replay buffer and 2) simulated annealing-based behavior constraint attenuation. The first design dynamically selects offline data or online interaction experiences during training using the simulated annealing concept to obtain an optimized update strategy, dynamically balancing the stability of online training and fine-tuning performance. The second design introduces a behavior cloning constraint with a cooling mechanism to mitigate the sharp performance drop caused by using online experience updates in the early fine-tuning stage, gradually relaxing the constraint in the later stage to enhance model performance. Experimental results demonstrate that the proposed dynamic replay buffer and time decaying constraints (DRB-TDC) algorithm improves performance by 45%, 65%, and 21% on the HalfCheetah, Hopper, and Walker2d tasks from the MuJoCo benchmark after online fine-tuning, respectively. The average normalization score of all tasks exceeds the best baseline algorithm by 10%.
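The first design above, the simulated annealing-based dynamic replay buffer, can be sketched as a temperature schedule that governs how often a batch is drawn from offline data versus online experience. The exponential cooling schedule and the linear temperature-to-probability mapping below are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def offline_sample_probability(step, total_steps, t0=1.0, t_min=0.01):
    """Probability of drawing a batch from the offline buffer.

    A simulated-annealing-style temperature cools from t0 to t_min, so
    early fine-tuning favors stable offline data while later training
    favors fresh online experience. (Illustrative schedule.)
    """
    # Exponential cooling schedule (an assumed form).
    temperature = max(t_min, t0 * (t_min / t0) ** (step / total_steps))
    # Map temperature in [t_min, t0] to a probability in [0, 1].
    return (temperature - t_min) / (t0 - t_min)

def sample_batch(offline_buffer, online_buffer, step, total_steps, rng=random):
    """Pick the buffer to sample from at the current training step."""
    if rng.random() < offline_sample_probability(step, total_steps):
        return offline_buffer
    return online_buffer
```

The same cooling idea carries over to the second design: the behavior cloning constraint weight can follow an analogous decaying schedule, tight early on and relaxed later.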
Abstract: Accurate identification of tissues, organs, and lesion regions is one of the most important tasks in medical image analysis. Models based on the U-Net structure dominate the existing research on semantic segmentation of medical images. Combining the advantages of CNN and Transformer, TransUNet excels at capturing long-range dependencies and extracting local features, but it is still not accurate enough in extracting and recovering the locations of features. To address this problem, a medical image segmentation model with a multi-attention fusion mechanism, MAF-TransUNet, is proposed. The model first adds a multi-attention fusion module (MAF) before the Transformer layer to enhance the representation of location information. It then applies the MAF again in the skip connections so that location information can be efficiently transmitted to the decoder. Finally, a deep convolutional attention module (DCA) is used in the decoding stage to retain more spatial information. The experimental results show that MAF-TransUNet improves the Dice coefficients on the Synapse multi-organ segmentation dataset and the ACDC automated cardiac diagnosis dataset by 3.54% and 0.88%, respectively, compared with TransUNet.
Abstract: Most current graph contrastive learning-driven recommender models tend to rely on a single view for training, which inevitably limits their ability to fully capture the features of complex data. To this end, this study proposes a recommendation algorithm, multi-view knowledge contrastive learning recommendation (MKCLR), that integrates multi-view contrastive learning and a knowledge graph. First, three view enhancement methods, namely random edge dropping, the addition of uniform noise perturbation, and a random walk algorithm, are used to construct three contrastive views for the knowledge graph and the user-item graph. Second, LightGCN is used to encode the knowledge graph, and multiple contrastive learning tasks are constructed, aiming to maximize the extraction and utilization of the rich information in the multi-view data. Finally, the main recommendation task is combined with contrastive learning for joint training. Experiments conducted on MIND, Last-FM, and Alibaba-iFashion show average increases of 5.78% and 8.68% for MKCLR in terms of Recall and NDCG, respectively, validating the effectiveness of the proposed method.
Abstract: CBCT is widely used in image-guided radiation therapy due to its integration with modern linear accelerator systems. However, its inferior image quality compared with CT poses significant challenges to achieving optimal treatment planning. This study proposes a new model named DDFGAN (dual-domain feature fusion generative adversarial network), aiming to bring the image quality of CBCT as close as possible to that of CT. The model adopts a dual-branch architecture: the first branch extracts multi-scale features in the spatial domain by introducing an RFB module; the second branch employs a frequency-domain feature extraction module designed specifically for CBCT-to-CT synthesis. By fusing features from both branches, DDFGAN significantly enhances the imaging quality of CBCT. Additionally, the model incorporates a geometric consistency loss, transforming the traditional bidirectional generative network into a unidirectional one, which not only aligns better with clinical application requirements but also substantially reduces training time. Experimental results show that DDFGAN outperforms four comparative methods in generating synthetic CT images with fewer artifacts, and the HU values of the synthetic images are closer to those of CT images, significantly improving the accuracy of adaptive radiation therapy.
Abstract: Multidimensional time series data are widely used across various fields, and their effective representation is critical for subsequent analysis and mining tasks. Traditional shapelet transform methods extract features by projecting the single-dimensional time series into the shapelet space and then fusing them without considering the complex coupling relationships between different dimensions. Moreover, the restriction on shapelet length hinders the acquisition of long-range dependencies on sequences. To address these issues, a multidimensional time series representation method, CDT-ShapeNet, coupling both dimensional dependencies and long-range dependencies is proposed in this study. In this method, the dimensional information representation module captures the dependencies between different dimensions through a dimensional attention mechanism, while the long-term information representation module learns long-term temporal dependencies using an attention mechanism and a long-short-term memory network. Experiments conducted on nine UEA datasets show that this method enhances the average accuracy by 6.8% in comparison with other methods, validating its effectiveness in multidimensional time series representation.
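The traditional shapelet transform that the abstract above builds on rests on one primitive: the minimum Euclidean distance between a shapelet and all equal-length subsequences of a series, which projects each series into shapelet space. The sketch below shows only that single-dimensional baseline; the paper's dimensional attention and long-range modules are not reproduced.

```python
import math

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and every
    equal-length subsequence of a one-dimensional series."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        d = math.sqrt(sum((series[start + j] - shapelet[j]) ** 2 for j in range(m)))
        best = min(best, d)
    return best

def shapelet_transform(series, shapelets):
    """Project a series into shapelet space: one distance per shapelet."""
    return [shapelet_distance(series, s) for s in shapelets]
```

Because each shapelet has a fixed (and usually short) length, this projection cannot see dependencies longer than the shapelet itself, which is exactly the long-range limitation the proposed method targets.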
Abstract: Federated learning (FL) is an emerging distributed machine learning framework aimed at addressing issues of data privacy protection and efficient distributed computing. It allows multiple clients to collaboratively train a global model without sharing their data. However, due to the heterogeneity in the data distribution of each client, a single global model often fails to meet the personalized needs of different clients. To address this issue, this paper proposes a federated learning algorithm that combines self-distillation and decoupled knowledge distillation. The algorithm retains the client’s historical model as a teacher model to distill and guide the training of the local model, and after obtaining a new local model, it is uploaded to the server for weighted averaging and aggregation. In the knowledge distillation process, the decoupled distillation of target class knowledge and non-target class knowledge allows for a more thorough transmission of personalized knowledge. Experimental results show that the proposed method outperforms existing federated learning methods in classification accuracy on the CIFAR-10 and CIFAR-100 datasets.
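The decoupled distillation of target-class and non-target-class knowledge mentioned above can be sketched as splitting the usual KD objective into two KL terms, following the decomposition of decoupled knowledge distillation (DKD). The weights and temperature below are illustrative, not the paper's settings.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decoupled_kd_loss(teacher_logits, student_logits, target,
                      alpha=1.0, beta=1.0, temperature=1.0):
    """Decoupled knowledge distillation: split KD into a target-class
    term (TCKD) and a non-target-class term (NCKD), weighted separately
    so personalized non-target knowledge is transmitted more thoroughly."""
    pt = softmax(teacher_logits, temperature)
    ps = softmax(student_logits, temperature)

    # TCKD: binary KL over the (target, not-target) probability mass.
    bt = [pt[target], 1.0 - pt[target]]
    bs = [ps[target], 1.0 - ps[target]]
    tckd = sum(t * math.log(t / s) for t, s in zip(bt, bs))

    # NCKD: KL over the non-target classes, renormalized to sum to 1.
    qt = [p / (1.0 - pt[target]) for i, p in enumerate(pt) if i != target]
    qs = [p / (1.0 - ps[target]) for i, p in enumerate(ps) if i != target]
    nckd = sum(t * math.log(t / s) for t, s in zip(qt, qs))

    return alpha * tckd + beta * nckd
```

Weighting the two terms independently (alpha for TCKD, beta for NCKD) is what "decouples" them: the non-target distribution can be emphasized without inflating the target-class signal.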
Abstract: The main task of discourse element identification is to identify discourse element units and classify them. Aiming at the lack of understanding of context dependence in discourse element identification, this study proposes a discourse element identification model based on BiLSTM-Attention to improve the accuracy of discourse element identification in argumentative essays. The model uses sentence structure and positional encoding to identify sentence component relationships and further acquires deep context-related information through a bidirectional long short-term memory (BiLSTM) network. An attention mechanism is introduced to optimize the model's feature vectors and improve the accuracy of text classification. Finally, inter-sentence multi-head self-attention is used to capture the relationships between the content and structure of sentences, so as to compensate for long-distance sentence dependencies. Compared with baseline models such as HBiLSTM and BERT, the accuracy on Chinese and English datasets is improved by 1.3% and 3.6%, respectively, under the same parameters and environmental conditions, which verifies the effectiveness of the model in the discourse element identification task.
Abstract: To improve the accuracy of predicting and diagnosing glaucoma and avoid the accumulation of errors caused by manual screening, this study proposes an automatic glaucoma screening method guided by position attention. The proposed method includes two parts: attention prediction of fundus images and glaucoma disease classification. First, a U-shaped network based on the combination of deep understanding convolution kernels and channel excitation connection spatial pyramids is proposed to predict the attention of fundus images. Feature maps in the decoding process are used as spatial information to guide glaucoma classification. Second, a position attention mechanism used in the glaucoma classification model is proposed, which combines channel information and spatial information from different sources to dynamically adjust the feature maps from external encoders. The main branch of the glaucoma classification model stacks multiple position attention modules and residual modules to fulfill the classification task. At the same time, an auxiliary branch for segmentation tasks is designed to assist in model training and optimization to improve classification accuracy. The precision, recall, and AUC of the proposed method based on the glaucoma LAG dataset test reach 97.84%, 97.75%, and 98.57% respectively, which outperform all the comparative models. The model decision attention area obtained by visualizing the attention activation heat map is more accurate, assisting in locating the lesions in clinical diagnosis and providing an effective reference for the results of clinical diagnosis.
Abstract: Deep learning-based artificial intelligence diagnostic models rely heavily on high-quality and exhaustively annotated data for algorithm training, but they are affected by noise label information. To enhance the robustness of the model and prevent memorization of noise labels, this study proposes a noise label image classification method based on multi-dimensional contrastive learning. This method can effectively integrate multi-dimensional contrastive learning and semi-supervised learning to combat label noise. Specifically, the proposed method consists of three carefully designed components. A mixed feature embedding module with a momentum update mechanism is designed to extract abstract distributed feature representations using mixed augmented images as input. Simultaneously, the study adjusts the features in the feature space from different dimensions by employing a multi-dimensional contrastive learning module, which combines instance contrastive learning and inter-class contrastive learning. Additionally, a noise-robust loss function is utilized to ensure that samples with correct labels dominate the learning process. Experiments conducted on CIFAR-10 and CIFAR-100 datasets demonstrate that the proposed method achieves better results than existing methods.
Abstract: Currently, in traffic prediction, deep learning-based spatio-temporal separation modeling methods have difficulty in effectively expressing the spatio-temporal coupling correlations in data. Although spatio-temporal joint modeling methods can compensate for the shortcomings of spatio-temporal separation modeling to some extent, they suffer from deficiencies such as insufficient expressive ability and high computational complexity in constructing spatio-temporal hypergraphs. To address these issues, this study proposes an improved spatio-temporal joint modeling method, the window spatial-temporal attention network (W-STANet). W-STANet mainly comprises three parts: a data embedding layer, a spatio-temporal correlation modeling layer, and a prediction head. The spatio-temporal correlation modeling layer learns the spatio-temporal correlation features of traffic data by stacking multiple spatio-temporal attention blocks. Meanwhile, by introducing a local window calculation method together with data shifting and permutation operations, the computational complexity of the modeling process is greatly reduced, and modeling from both local and global perspectives within the spatio-temporal graph is achieved. Experimental results on five real-world public traffic datasets demonstrate superior prediction performance compared with other spatio-temporal joint modeling methods, as well as superior prediction performance on large-scale road network datasets compared with spatio-temporal separation modeling methods.
Abstract: Since 2006, deep neural networks have achieved great success in big data processing and artificial intelligence, in areas such as image/video recognition and autonomous driving. Unsupervised learning methods, which proved successful in the pre-training of deep neural networks, play an important role in deep learning. This paper therefore gives a brief introduction to and analysis of unsupervised learning methods in deep learning, covering two main types: Auto-Encoders, which are based on deterministic theory, and Contrastive Divergence for Restricted Boltzmann Machines, which is based on probability theory. The applications of the two methods in deep learning are then introduced. Finally, a brief summary and outlook on the challenges faced by unsupervised learning methods in deep neural networks are given.
Abstract: Based on a study of current video transcoding solutions, we propose a distributed transcoding system. Video resources are stored in HDFS (Hadoop Distributed File System) and transcoded by MapReduce programs using FFmpeg. This paper discusses video segmentation strategies on distributed storage and how they affect access time. We also define the metadata of video formats and the transcoding parameters. The distributed transcoding framework is built on the MapReduce programming model: segmented source videos are transcoded in map tasks and merged into the target video in the reduce task. Experimental results show that transcoding time depends on the segment size and the size of the transcoding cluster. Compared with a single PC, the proposed distributed video transcoding system implemented on 8 PCs decreases transcoding time by about 80%.
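The map/reduce split described above can be sketched as command construction: each map task cuts and transcodes one segment, and the reduce task merges the results with FFmpeg's concat demuxer. The segment naming, container choice, and exact flags are illustrative assumptions, not the paper's configuration.

```python
def segment_commands(src, seg_seconds, total_seconds):
    """Build one ffmpeg command per map task, each cutting a fixed-length
    segment with stream copy (the actual transcode of each segment would
    replace '-c copy' with the target codec options)."""
    cmds = []
    idx, start = 0, 0
    while start < total_seconds:
        cmds.append(
            ["ffmpeg", "-ss", str(start), "-t", str(seg_seconds),
             "-i", src, "-c", "copy", f"seg_{idx:04d}.ts"]
        )
        start += seg_seconds
        idx += 1
    return cmds

def concat_command(num_segments, target):
    """Reduce step: merge transcoded segments with ffmpeg's concat demuxer.
    Returns the list-file contents and the merge command (illustrative names)."""
    listing = "\n".join(f"file 'seg_{i:04d}.ts'" for i in range(num_segments))
    return listing, ["ffmpeg", "-f", "concat", "-i", "list.txt", "-c", "copy", target]
```

Because the map tasks are independent, segment size directly trades off parallelism against per-task overhead, which matches the observed dependence of transcoding time on segment size and cluster size.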
Abstract: Although the deep learning method has made a huge breakthrough in machine learning, it requires a large amount of manual work for data annotation. Limited by labor costs, however, many applications are expected to reason and judge the instance labels that have never been encountered before. For this reason, zero-shot learning (ZSL) came into being. As a natural data structure that represents the connection between things, the graph is currently drawing more and more attention in ZSL. Therefore, this study reviews the methods of graph-based ZSL systematically. Firstly, the definitions of ZSL and graph learning are outlined, and the ideas of existing solutions for ZSL are summarized. Secondly, the current ZSL methods are classified according to different utilization ways of graphs. Thirdly, the evaluation criteria and datasets concerning graph-based ZSL are discussed. Finally, this study also specifies the problems to be solved in further research on graph-based ZSL and predicts the possible directions of its future development.
Abstract: The essential problem of an open platform is the validation and authorization of users. Nowadays, OAuth is the international standard authorization method. Its characteristic is that third-party applications can access users' protected resources without requiring users to enter their usernames and passwords into those applications. The latest version of OAuth is OAuth 2.0, whose validation and authorization procedures are simpler and safer. This paper investigates the principle of OAuth 2.0, analyzes the refresh token procedure, and offers a design proposal for an OAuth 2.0 server together with specific application examples.
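The refresh token procedure analyzed above follows Section 6 of RFC 6749: the client posts `grant_type=refresh_token` with its current refresh token to the token endpoint. The sketch below only assembles the request parameters; transport and the exact client authentication scheme vary by server, and the helper name is our own.

```python
def build_refresh_request(token_endpoint, refresh_token,
                          client_id, client_secret, scope=None):
    """Assemble an OAuth 2.0 refresh-token request (RFC 6749, Section 6).
    Credentials in the body are one permitted client-authentication style;
    many servers prefer HTTP Basic auth instead."""
    params = {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    if scope:
        # Optional: request a narrower scope than originally granted.
        params["scope"] = scope
    return {"method": "POST", "url": token_endpoint, "data": params}
```

On success the server returns a new access token (and often a rotated refresh token), which is what lets long-lived sessions avoid re-entering user credentials.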
Abstract: According to the actual needs of smart homes, environmental monitoring, and similar applications, this paper designs a wireless sensor node for a long-distance communication system. The system uses the second-generation SoC CC2530, which integrates an RF transceiver and a controller on one chip, as its core module, externally connected with a CC2591 RF front-end power amplifier module. Based on the ZigBee 2006 protocol stack, it implements each application layer function on ZStack. The paper also introduces a wireless data acquisition network constructed on the ZigBee protocol and gives the hardware design schematic diagram and software flow charts of the sensor node and coordinator node. Experiments prove that the node performs well and communicates reliably, with the communication distance increased markedly compared with the first-generation TI product.
Abstract: A knowledge graph is a knowledge base that represents objective concepts/entities and their relationships in the form of a graph, and it is one of the fundamental technologies for intelligent services such as semantic retrieval, intelligent question answering, and decision support. Currently, the connotation of the knowledge graph is not clear enough, and the usage/reuse rate of existing knowledge graphs is relatively low due to a lack of documentation. This paper clarifies the concept of the knowledge graph by differentiating it from related concepts such as the ontology: the ontology is the schema layer and the logical basis of a knowledge graph, while the knowledge graph is the instantiation of an ontology. Research results on ontologies can therefore serve as the foundation of knowledge graph research to promote its development and application. Existing generic/domain knowledge graphs are briefly documented and analyzed in terms of building, storage, and retrieval methods, and future research directions are pointed out.
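The graph-form representation described above reduces, at its simplest, to (subject, predicate, object) triples with pattern-based retrieval. The toy below illustrates that storage/retrieval idea only; it is not any of the systems the survey documents, and the class and sample facts are our own.

```python
class TripleStore:
    """Minimal in-memory knowledge graph: (subject, predicate, object)
    triples with pattern matching, where None acts as a wildcard.
    An illustrative toy, not a production graph database."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        # Each fact is one edge of the graph: s --p--> o.
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        # Return all triples matching the partially specified pattern.
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]
```

In ontology terms, the schema layer would constrain which predicates may link which entity types; the triples stored here are the instantiation of that schema.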
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-3