VISIBLE AND INFRARED OBJECT TRACKING BASED ON MULTIMODAL HIERARCHICAL RELATIONSHIP MODELING

Visible RGB and Thermal infrared (RGBT) object tracking has emerged as a prominent area of focus within the realm of computer vision. Nevertheless, the majority of existing RGBT tracking methods, which predominantly rely on Transformers, primarily emphasize the enhancement of features extracted by convolutional neural networks. Unfortunately, the latent potential of Transformers in representation learning has been inadequately explored. Furthermore, most studies tend to overlook the significance of distinguishing between the importance of each modality in the context of multimodal tasks. In this paper, we address these two critical issues by introducing a novel RGBT tracking framework centered on multimodal hierarchical relationship modeling. Through the incorporation of multiple Transformer encoders and the deployment of self-attention mechanisms, we progressively aggregate and fuse multimodal image features at various stages of image feature learning. Throughout the process of multimodal interaction within the network, we employ a dynamic component feature fusion module at the patch-level to dynamically assess the relevance of visible information within each region of the tracking scene. Our extensive experimentation, conducted on benchmark datasets such as RGBT234, GTOT, and LasHeR, substantiates the commendable performance of our proposed approach in terms of accuracy, success rate, and tracking speed.


INTRODUCTION
Visible RGB and thermal infrared (RGBT) object tracking is an emerging direction in the field of object tracking. It aims to exploit the complementary advantages of the visible and infrared modalities to overcome environmental interference and obtain richer feature representations.
Because visible and infrared images are captured by cameras operating in different spectral bands, they undergo significantly different imaging processes and cover distinct wavelength ranges, which results in notable modality differences. The first critical challenge in RGBT object tracking research is therefore how to overcome this heterogeneity between modalities. Current RGBT tracking methods often use dual-branch networks based on convolutional neural networks and employ a fusion strategy to address the heterogeneity. There are roughly three categories of methods for fusing multimodal features. In the first category, shown in Fig. 1(a), fusion is performed only on the high-level features extracted by the two branches, using a strategy such as concatenation, element-wise addition, or attention (Li et al., 2019b; Mei et al., 2021). In the second category, shown in Fig. 1(b), a progressive approach fuses features from multiple layers, alternating between feature extraction and feature fusion (Xiao et al., 2022). However, these methods still have limitations. First, because multimodal images may be imperfectly aligned, simple linear operations can lose discriminative information in the features. Second, the fusion of visible and infrared features is often confined to the higher layers and lacks early-stage interaction, which can lose important low-level details. In addition, existing Transformer-based methods typically use separate Transformer encoder and decoder layers for feature enhancement and interaction, simply stacking convolutional layers and Transformer layers without fully exploiting the Transformer's strength in modeling long-term dependencies.
To address these issues, we propose a Multimodal Hierarchical Relationship Modeling (MHRM) method, shown in Fig. 1(c). It uses a multi-layer Transformer encoder structure to establish a multi-directional and free information flow that connects visible-infrared image pairs and template-search image pairs. Feature interaction and fusion are performed simultaneously during early-stage feature learning. In the interaction process, we model intra-modal fine-grained information relationships and extract modality-related and discriminative features.
The second key challenge in RGBT tracking is determining the importance of infrared information relative to visible information. How well infrared information complements visible information varies with factors such as lighting, object shape, size, and occlusion, and the interaction between the two modalities is not always beneficial for visual tasks. In practical scenarios, the importance of infrared and visible information therefore differs across videos, across frames, and even across regions within the same frame. Discerning whether infrared information enhances visible information, and predicting the extent of this enhancement, is thus crucial. However, existing methods often treat both modalities equally and overlook their varying importance for the tracking task. To address this challenge, we design a patch-level Dynamic Component Fusion Module (DCFM) that dynamically solves for the importance of visible information in each region of the tracking scene. It adaptively adjusts the interaction between visible and infrared information during tracking to better handle complex tracking scenarios. To quantify the importance, we introduce an illumination decoupling network to compute the illumination of visible images and use a trainable neural network to calculate a weight for each patch. These weights describe the importance of visible information within each region. Through patch-level weight allocation, the module assigns more reasonable weights to regions containing objects, highlighting the object while reducing the impact of the background.
In summary, the main contributions of this paper are as follows:
- We propose a single-stream RGBT tracking framework based on multimodal hierarchical relationship modeling. By stacking multiple Transformer encoder layers, we establish a multi-directional and free information flow connecting visible and infrared image pairs, progressively aggregating and fusing multimodal image features at multiple stages of feature learning.
- We design a patch-level dynamic component fusion module to dynamically solve for the importance of visible information in each region of the tracking scene. It adaptively adjusts the interaction between visible and infrared information during tracking to better handle complex tracking scenarios.
- Extensive experiments on the RGBT234, GTOT, and LasHeR datasets demonstrate that our method achieves competitive performance in terms of precision, success rate, and tracking speed.

RELATED WORK
RGBT OBJECT TRACKING

METHOD
NETWORK ARCHITECTURE
The proposed RGBT tracking method consists of three stages: Multimodal Hierarchical Relationship Modeling (MHRM), the Dynamic Component Fusion Module (DCFM), and the prediction head. The overall structure is shown in Fig. 2. Multiple ViT (Dosovitskiy et al., 2021) encoders form the backbone of the Siamese network, which performs feature learning and interaction between the template and the search image, and between the visible and infrared modalities. In the encoding and weighted addition phases, the DCFM computes the weight of the visible modality for each region of the visible image, in order to distinguish the different levels of importance of the visible and infrared images. Finally, the resulting visible and infrared search region features are fused again and reshaped into spatial features, which are input to the prediction head for object classification and regression.

MULTIMODAL HIERARCHICAL RELATIONSHIP MODELING
We design an MHRM module that incrementally aggregates and fuses image features at multiple stages of image feature learning through multiple stacked Transformer encoders. Unlike other Transformer-based methods, we do not use decoder structures for feature interaction. We use only encoder structures and combine the multimodal inputs into one feature sequence that is fed into the encoders simultaneously. In the multi-layer encoder structure, self-attention constructs a free flow of information, and the visible-infrared image pairs are connected by a multi-directional information flow. The modalities guide each other's feature extraction, and each token embedding in the input sequence can interact globally across both pairs.
The proposed MHRM structure is shown in the middle part of Fig. 2. The input is a pair of visible and infrared images from one frame of a video sequence in the RGBT dataset. The images are cropped to obtain the visible template and search image as well as the infrared template and search image. We first divide the template image and the search image separately. According to their sizes, the template image is divided into n × n patches and the search image into N × N patches. These patches are then arranged into sequences to obtain the visible template patch sequence $z^v$, the infrared template patch sequence $z^i$, the visible search patch sequence $x^v$, and the infrared search patch sequence $x^i$. The linear projection layer flattens $z^v$, $z^i$, $x^v$ and $x^i$ to 2-dimensional features and adds the learnable position embeddings $p_z$ and $p_x$ to mark the position of each patch. The projection layer outputs the token embedding sequences Z and X. For the visible modality, the process can be described as:
$$Z^v = [z^v_1 P;\ z^v_2 P;\ \dots;\ z^v_{n^2} P] + p_z$$
$$X^v = [x^v_1 P;\ x^v_2 P;\ \dots;\ x^v_{N^2} P] + p_x$$
where P is the learnable parameter of the linear projection layer. The infrared token embedding sequences are obtained in the same way. According to the distribution of token embeddings, the visible and infrared token embeddings are sequentially cross-arranged and concatenated into a sequence, which is then fed in parallel into an MHRM consisting of L Transformer encoders.
The encoder uses the ViT structure, which has been applied many times to downstream tasks, with some modifications to make it more suitable for multimodal tasks, as shown in the left part of Fig. 2. Each encoder consists of two layer normalization layers, followed by a multi-head self-attention layer and a multi-layer perceptron layer respectively, with two residual connections. A variety of pre-trained ViT models are publicly available, which greatly improves the efficiency of training our model. In the encoder, the input token embedding sequence covering both modalities undergoes multiple self-attention operations. Unlike cross-attention between two separate inputs, self-attention lets the token embeddings of one sequence interact with each other and enhances their features through an attention matrix. This process not only enhances each modality's own feature representation, but also carries out the interactive fusion between the template and the search image, and between the visible and infrared images. The weight allocation network of Section 3.3 also plays a role in this process. After obtaining the weight matrix W, we calculate the ratio matrix between W and (1 − W) before encoding, and then multiply it with the infrared token sequence. Because this weighting is applied at the same time as the interaction, the fusion of visible and infrared information is discriminative and guided during training. Compared with cross-attention, self-attention over the concatenated features makes the whole framework highly parallel. Although the input to ViT is a visible-infrared image pair, the highly parallel structure limits the impact on inference speed.
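The patch partitioning and concatenation described above determine the length of the joint token sequence that the encoders process. The following is a minimal sketch of that bookkeeping, not the authors' implementation. The image sizes (192 px template, 384 px search), the 16 px patch size, the function names, and the simple block ordering [Z^v; Z^i; X^v; X^i] are all assumptions made for illustration; the text does not state them, and its "cross-arranged" ordering may differ.

```python
from typing import List


def patches_per_side(image_size: int, patch_size: int) -> int:
    """Patches along one side when a square image is cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return image_size // patch_size


def build_token_labels(template_size: int, search_size: int, patch_size: int) -> List[str]:
    """Label every token of the joint sequence, ordered as [Z^v; Z^i; X^v; X^i]."""
    n = patches_per_side(template_size, patch_size)    # template: n x n patches
    big_n = patches_per_side(search_size, patch_size)  # search: N x N patches
    labels: List[str] = []
    for name, side in (("Zv", n), ("Zi", n), ("Xv", big_n), ("Xi", big_n)):
        labels.extend(f"{name}[{k}]" for k in range(side * side))
    return labels


# Assumed sizes (hypothetical, not given in the text).
tokens = build_token_labels(template_size=192, search_size=384, patch_size=16)
print(len(tokens))  # 2 * 12 * 12 + 2 * 24 * 24 = 1440 tokens
```

Under these assumed sizes, a single self-attention layer lets all 1440 tokens attend to one another, which is what allows template-search and visible-infrared interaction to happen in the same pass.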
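We use a token subsequence $[Z^v_j; Z^i_j; X^v_j; X^i_j]$ to illustrate the principle of the method. The process can be analyzed from two perspectives. First, the attention mechanism can be expressed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
From the perspective of the multimodal relationship, the attention weight map can be expressed as:
$$W = \mathrm{softmax}\left(\frac{[Q^v; Q^i]\,[K^v; K^i]^{\top}}{\sqrt{d_k}}\right) = \begin{bmatrix} W_{vv} & W_{vi} \\ W_{iv} & W_{ii} \end{bmatrix}$$
where $W_{iv}$ can be considered a measure of the similarity between the visible and infrared images. The resulting self-attention output is:
$$\begin{bmatrix} W_{vv} & W_{vi} \\ W_{iv} & W_{ii} \end{bmatrix} \begin{bmatrix} V^v_{\{z,x\}} \\ V^i_{\{z,x\}} \end{bmatrix} = \begin{bmatrix} W_{vv}V^v_{\{z,x\}} + W_{vi}V^i_{\{z,x\}} \\ W_{iv}V^v_{\{z,x\}} + W_{ii}V^i_{\{z,x\}} \end{bmatrix}$$
where $W_{iv}V^v_{\{z,x\}}$ is responsible for the fusion between the visible and infrared modalities, while $W_{vv}V^v_{\{z,x\}}$ and $W_{ii}V^i_{\{z,x\}}$ aggregate the features of each image itself. The global relationship modeling of the L-layer Transformer encoders therefore achieves a more thorough perceptual fusion of visible and infrared information.
Similarly, from the perspective of the relationship between the template and the search image, the attention weight map can be expressed as:
$$W = \mathrm{softmax}\left(\frac{[Q_z; Q_x]\,[K_z; K_x]^{\top}}{\sqrt{d_k}}\right) = \begin{bmatrix} W_{zz} & W_{zx} \\ W_{xz} & W_{xx} \end{bmatrix}$$
where $W_{zx}$ can be considered a measure of the similarity between the template and search images. The resulting self-attention output is:
$$\begin{bmatrix} W_{zz}V^{\{v,i\}}_z + W_{zx}V^{\{v,i\}}_x \\ W_{xz}V^{\{v,i\}}_z + W_{xx}V^{\{v,i\}}_x \end{bmatrix} \tag{7}$$
where $W_{xz}V^{\{v,i\}}_z$ is responsible for the relational interaction between the template image and the search image, and $W_{xx}V^{\{v,i\}}_x$ is the attention-based feature aggregation of the image itself.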
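The two perspectives above rest on one observation: self-attention over a concatenated sequence can be split into blocks, so each output token is a sum of an intra-group aggregation term and a cross-group interaction term. The sketch below checks this numerically for a toy visible/infrared split. It is an illustration only: it uses a single head, identity projections (Q = K = V = tokens), and made-up 2-dimensional token values, none of which come from the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax_rows(m: Matrix) -> Matrix:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention; returns (attention weights, output)."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v)


# Toy tokens (hypothetical values): two visible and two infrared embeddings.
visible = [[1.0, 0.0], [0.5, 0.5]]
infrared = [[0.0, 1.0], [0.2, 0.8]]
sequence = visible + infrared
weights, output = self_attention(sequence, sequence, sequence)

# Partition the weight map into the four blocks W_vv, W_vi, W_iv, W_ii.
n_v = len(visible)
w_vv = [row[:n_v] for row in weights[:n_v]]
w_vi = [row[n_v:] for row in weights[:n_v]]
w_iv = [row[:n_v] for row in weights[n_v:]]
w_ii = [row[n_v:] for row in weights[n_v:]]

# Block-wise output: aggregation within a modality plus fusion across modalities.
out_visible = add(matmul(w_vv, visible), matmul(w_vi, infrared))
out_infrared = add(matmul(w_iv, visible), matmul(w_ii, infrared))
blockwise = out_visible + out_infrared
```

The same partition applies when the sequence is split by template and search tokens instead of by modality; only the grouping of rows and columns changes.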

DYNAMIC COMPONENT FEATURE FUSION
We focus on how to assign reasonable weights to the multimodal information in order to regulate the importance of visible and infrared information in the tracking task, and thus guide the interaction between the modalities. We design a patch-level DCFM that uses an illumination decoupling network to obtain the illumination map of the input visible image, and then dynamically derives a weight for each patch with a trainable neural network.
Since there may be significant scene changes between videos and even between frames, the weighting of the interaction between the modalities is dynamic. Since there may also be significant illumination differences between regions of the same frame, the DCFM operates at the image patch level. The DCFM estimates a deterministic value α ∈ (0, 1) that describes the trustworthiness of the visible information in each region by measuring the illumination of the visible image. α and (1 − α) are used as modality weights to dynamically guide the interaction of the visible and infrared modalities throughout the network. Specifically, we directly adopt the illumination decoupling network of the publicly available pre-trained KinD++ model (Zhang et al., 2021d). Based on the Retinex illumination enhancement theory, two branches decompose the visible image S into its illumination component I and its reflectance component R, denoted as:
$$S = R \circ I$$
We keep only the illumination component as the illumination map I for subsequent operations. For the actual tracking task, we only need a deterministic value to regulate the multimodal fusion. Therefore, similar to TNet (Cong et al., 2022), we use a trainable network to map the illumination map to (0, 1) to describe the trustworthiness of the visible illumination. Specifically, the illumination map is divided into regions according to the image patch partitioning rule. Each region is resized by global average pooling, its channels are transformed by a 1×1 convolution, and finally the features are mapped to a fractional value by a fully connected layer and a sigmoid activation function. The process can be expressed as:
$$\alpha_i = \sigma(\mathrm{FC}(\mathrm{Conv}(\mathrm{GAP}(I_i))))$$
where $I_i$ denotes the feature of the i-th region, Conv and FC denote the convolution and fully connected layers, GAP and σ denote global average pooling and the sigmoid activation function, and $\alpha_i$ denotes the final weight of the i-th region.
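The per-patch weighting pipeline described above (patch partition, global average pooling, 1×1 convolution, fully connected layer, sigmoid) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' network: the illumination map is a single-channel grid with made-up values, the "trained" parameters `conv_w`, `conv_b`, `fc_w`, and `fc_b` are hypothetical constants, and the final `fuse` helper assumes the weighted addition takes the form α·visible + (1 − α)·infrared.

```python
import math
from typing import List

Grid = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def patch_means(illum: Grid, patch: int) -> Grid:
    """Global average pooling over each patch x patch region of the map."""
    out = []
    for r in range(0, len(illum), patch):
        out_row = []
        for c in range(0, len(illum[0]), patch):
            vals = [illum[y][x] for y in range(r, r + patch) for x in range(c, c + patch)]
            out_row.append(sum(vals) / len(vals))
        out.append(out_row)
    return out


def patch_weight(pooled: float, conv_w: List[float], conv_b: List[float],
                 fc_w: List[float], fc_b: float) -> float:
    """alpha = sigmoid(FC(Conv(pooled))) for one region."""
    # On a pooled 1x1 feature, a 1x1 convolution is a per-channel affine map (1 -> C channels).
    hidden = [w * pooled + b for w, b in zip(conv_w, conv_b)]
    # The fully connected layer maps the C channels to one logit.
    logit = sum(w * h for w, h in zip(fc_w, hidden)) + fc_b
    return sigmoid(logit)


def weight_matrix(illum: Grid, patch: int, conv_w: List[float], conv_b: List[float],
                  fc_w: List[float], fc_b: float) -> Grid:
    """Weight matrix W holding one alpha per patch."""
    return [[patch_weight(g, conv_w, conv_b, fc_w, fc_b) for g in row]
            for row in patch_means(illum, patch)]


def fuse(alpha: float, visible: List[float], infrared: List[float]) -> List[float]:
    """Assumed weighted addition: alpha * visible + (1 - alpha) * infrared."""
    return [alpha * v + (1.0 - alpha) * i for v, i in zip(visible, infrared)]


# Hypothetical 4x4 illumination map: bright left half, dark right half.
illumination = [[0.9, 0.9, 0.1, 0.1] for _ in range(4)]
W = weight_matrix(illumination, patch=2,
                  conv_w=[2.0, 1.0], conv_b=[0.0, 0.0], fc_w=[1.0, 1.0], fc_b=-1.5)
```

With these hypothetical parameters, well-lit patches receive a larger α than dark ones, so visible features dominate where illumination is good and infrared features dominate where it is poor.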
After calculating α for each region, we obtain the weight matrix W for the search image. W initializes the weights of the visible and infrared modalities across the entire network and is adjusted through a trainable network. W is applied in the encoders of the main backbone and in the final weighted summation module, so that it globally regulates the fusion-related effects of multimodal integration.
Patch-level weight assignment can effectively regulate the global multimodal information fusion. To a certain extent, assigning more reasonable weights to the patches containing objects highlights the object and weakens the background. The weight matrix W is input to the encoder structure in the backbone and to the final weighted addition module to globally regulate the fusion-related effects of the modalities.

PREDICTION HEAD AND LOSS FUNCTION
A sequence of token embeddings containing multimodal information is reshaped into a spatial feature map, which is then fed into a fully convolutional network consisting of m convolution-normalization-ReLU layers. The network outputs the response map M and the local offsets O, from which the final classification results and object coordinates are obtained. During training, the entire tracking network uses both classification and regression loss functions. We use the weighted focal loss (Law and Deng, 2018) as the classification loss, and the GIoU loss (Rezatofighi et al., 2019) together with the mean absolute error loss as the regression losses.
The weighted focal loss shifts the training focus onto challenging samples by dynamically reducing the weights of easily distinguishable samples during training. It is calculated as:
$$L_{cls} = -\sum_{x,y} \begin{cases} (1 - M_{xy})^{\beta} \log(M_{xy}), & \text{if } \hat{M}_{xy} = 1 \\ (1 - \hat{M}_{xy})^{\eta} (M_{xy})^{\beta} \log(1 - M_{xy}), & \text{otherwise} \end{cases}$$
where $M_{xy}$ is the prediction score at position (x, y) in the predicted response map, $\hat{M}_{xy}$ denotes the ground-truth heat map generated with a Gaussian kernel, and β and η are hyperparameters set to 2 and 4, respectively, during training.
The IoU loss uses the Intersection over Union (IoU) metric to address the problem that predicted boxes at the same L-norm distance from the ground truth box can have different IoU values, which makes a distance loss alone a poor proxy for overlap. However, when two bounding boxes do not intersect, the IoU is 0 regardless of how far apart they are, so the IoU loss is constant and provides no gradient for backpropagation. Hence, the Generalized IoU (GIoU) loss is used. It introduces a minimum enclosing box to confine the overlap range. It is calculated as:
$$L_{GIoU} = 1 - \mathrm{IoU} + \frac{A_c - \mu}{A_c}$$
where $A_c$ represents the area of the minimum box enclosing the ground truth bounding box and the predicted bounding box, and µ is the area of the union of the two boxes. The total loss of the network is:
$$L = L_{cls} + \lambda_1 L_{GIoU} + \lambda_2 L_1$$
where $L_1$ is the mean absolute error loss, and $\lambda_1$ and $\lambda_2$ are the balancing parameters, set to 2 and 5 in the experiments.
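The behavior of these losses is easiest to see on small inputs. The sketch below implements the weighted focal loss in its usual heat-map form (as in Law and Deng, 2018), the GIoU loss, and a weighted total. It is a hedged illustration rather than the authors' code: boxes are assumed to be in (x1, y1, x2, y2) corner format, the function names are invented, and attaching λ1 = 2 to the GIoU term and λ2 = 5 to the mean absolute error term is an assumption about the ordering.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed corner format (x1, y1, x2, y2)


def weighted_focal_loss(pred: List[List[float]], target: List[List[float]],
                        beta: float = 2.0, eta: float = 4.0, eps: float = 1e-12) -> float:
    """Penalty-reduced focal loss summed over all heat-map positions."""
    loss = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if t == 1.0:
                loss -= (1.0 - p) ** beta * math.log(p)
            else:
                loss -= (1.0 - t) ** eta * p ** beta * math.log(1.0 - p)
    return loss


def giou_loss(pred: Box, gt: Box) -> float:
    """1 - IoU + (A_c - union) / A_c, with A_c the minimum enclosing box area."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    enclose = ((max(pred[2], gt[2]) - min(pred[0], gt[0]))
               * (max(pred[3], gt[3]) - min(pred[1], gt[1])))
    return 1.0 - inter / union + (enclose - union) / enclose


def total_loss(cls: float, giou: float, l1: float,
               lam1: float = 2.0, lam2: float = 5.0) -> float:
    """Assumed combination: L_cls + lam1 * L_GIoU + lam2 * L_1."""
    return cls + lam1 * giou + lam2 * l1


# Two non-overlapping predictions: plain IoU is 0 for both,
# but the GIoU loss still grows as the prediction moves away.
gt_box = (0.0, 0.0, 1.0, 1.0)
near = giou_loss((2.0, 0.0, 3.0, 1.0), gt_box)
far = giou_loss((4.0, 0.0, 5.0, 1.0), gt_box)
```

The `near` and `far` values differ even though neither prediction overlaps the ground truth, which is the property that motivates replacing the plain IoU loss with GIoU.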

EXPERIMENTS
DATASETS AND METRICS
GTOT includes 50 pairs of highly aligned visible and infrared videos captured in different scenes and conditions. Each frame is manually annotated with the coordinates of the object's bounding box and with attributes indicating challenging conditions (Li et al., 2016).
RGBT234 is an extension of the RGBT210 dataset and comprises 234 pairs of highly aligned visible and infrared videos. It also includes manually annotated object bounding boxes and attributes indicating challenging conditions. The annotations are more accurate and the attributes are richer, taking various environmental challenges into account (Li et al., 2019a).
LasHeR is a large RGBT dataset consisting of 1224 pairs of visible and infrared videos with greater scene complexity. Of these, 979 video sequences are allocated to the training set and 245 sequences to the test set (Li et al., 2021).
On the GTOT, RGBT234, and LasHeR test sets, we use the same two evaluation metrics: Precision Rate (PR) and Success Rate (SR). We calculate the center location error (CLE) between the predicted and ground truth bounding boxes for all frames and set a threshold, where the CLE is defined as:
$$\mathrm{CLE} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
where $(x_1, y_1)$ and $(x_2, y_2)$ are the center point coordinates of the predicted bounding box and the ground truth. PR is the percentage of all frames whose center location error is less than this threshold. Similarly, we compute the overlap between the predicted bounding box and the ground truth for all frames and set a threshold, where the overlap is defined as:
$$\mathrm{overlap}(a, b) = \frac{|a \cap b|}{|a \cup b|}$$
where a and b are the predicted bounding box and the ground truth. SR is the percentage of all frames whose overlap is greater than the threshold, i.e. the percentage of frames tracked successfully.
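The two metrics defined above can be computed directly from per-frame boxes. The following sketch follows the thresholded definitions given in the text. The (x, y, w, h) box format, the function names, the three example frames, and the threshold values (20 pixels for PR, 0.5 for SR) are assumptions chosen for illustration, since the text does not state which thresholds were used.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x, y, w, h), top-left corner plus size


def center_error(pred: Box, gt: Box) -> float:
    """Euclidean distance between the two box centers (CLE)."""
    px, py = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return math.hypot(px - gx, py - gy)


def overlap(a: Box, b: Box) -> float:
    """Intersection area divided by union area."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of frames whose center error is below the threshold."""
    hits = sum(center_error(p, g) < threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def success_rate(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of frames whose overlap exceeds the threshold."""
    hits = sum(overlap(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Three hypothetical frames: exact hit, half-shifted box, complete miss.
gts = [(0.0, 0.0, 10.0, 10.0)] * 3
preds = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0), (30.0, 40.0, 10.0, 10.0)]
pr = precision_rate(preds, gts, threshold=20.0)
sr = success_rate(preds, gts, threshold=0.5)
```

In this toy case the half-shifted box counts toward PR (its center is only 5 pixels off) but not toward SR (its overlap is one third), which shows how the two metrics can disagree on the same frame.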

EXPERIMENT SETTING
We conduct experiments on a computer equipped with an NVIDIA RTX 4090 GPU running the Ubuntu 20.04 operating system.

EVALUATION ON RGBT TRACKING DATASETS
RGBT234 Dataset. The comparison of tracking results on the RGBT234 dataset is shown in Fig. 4. The results show that our method also performs well on RGBT234. It achieves the best success rate, 59.9%, and an accuracy of 80.5%. The accuracy and success rate are 5.6% and 3.6% higher than those of the baseline method OSTrack, respectively. The success rate is 2% higher and the accuracy is 2.2% lower than those of the state-of-the-art method APFNet.
Visualization. The tracking visualization results on GTOT and RGBT234 subsequences are shown in Fig. 6, and those on LasHeR subsequences are shown in Fig. 7. Compared with APFNet and OSTrack, the bounding boxes produced by our method are closer to the ground truth and less prone to tracking drift on video sequences captured in poor conditions such as nighttime or low light.

EVALUATION OF ATTRIBUTE CHALLENGE
The GTOT dataset is annotated with seven attributes, corresponding to seven external environmental challenges: target occlusion (OCC), large scale variation (LSV), fast motion (FM), low illumination (LI), thermal crossover (TC), small objects (SO), and object deformation (DEF). To evaluate performance under these challenges, we compare the precision and success rate of seven tracking methods on the GTOT dataset, as shown in Fig. 8. Our method achieves the best SR and PR on the LSV and LI attributes. On the OCC, FM, and TC attributes it achieves the best SR, while its PR mostly ranks second. On SO and DEF, its PR or SR ranks second in some cases. In summary, our method achieves the best or second-best result on both metrics for most attributes, especially under scale variation and low illumination. Compared with the attribute-based method APFNet, our method also shows certain advantages. The experimental results indicate that our method handles external environmental challenges well.

ABLATION STUDY
Module Ablation Experiments. To verify the effectiveness of the proposed MHRM and DCFM, we evaluated different combinations of the network's components. To further validate the effectiveness of the DCFM, we also built a Static Component Fusion Module (SCFM) in which α is manually fixed at 0.6. Ablation experiments were conducted on two datasets, RGBT234 and GTOT. The comparison results are shown in Table 1. The results show that adding the MHRM and DCFM improves both accuracy and success rate, that the weight assignment provided by the DCFM improves performance over static assignment, and that combining the two yields the largest improvement. This demonstrates the effectiveness of each proposed module.
Candidate Elimination Ablation Experiments. The Candidate Elimination (CE) module is a key module in the baseline method OSTrack. We compared the method with CE retained against the method with CE removed. The comparison results are shown in Table 1. The experimental results show that retaining the CE module leads to a slight decrease in both metrics on both datasets. We therefore removed the CE module.

EFFICIENCY ANALYSIS
The efficiency analysis experiments were conducted in the same environment. We compare the tracking speed of our method with that of several well-performing tracking methods: APFNet (Xiao et al., 2022), MANet (Li et al., 2019b), and MANet++ (Lu et al., 2021). The comparison results are shown in Table 2. Our method reaches a tracking speed of 87 FPS, which is sufficient for real-time tracking. Compared with the best-performing method, APFNet, the average FPS of our method is 78.9 higher. Compared with the fastest

CONCLUSION
In this paper, we propose an RGBT tracking framework that uses stacked Transformer encoders to progressively aggregate and fuse multimodal image features. Throughout the multimodal interaction process of the network, a DCFM dynamically solves for the importance of visible information in each region of the tracking scene, thereby regulating the interaction between visible and infrared information during tracking. Experimental results on three datasets demonstrate the competitive performance of the proposed method.

Fig. 2: The framework of the multimodal hierarchical relationship modeling tracking model. It includes three parts: multimodal hierarchical relationship modeling, dynamic component feature fusion, and the prediction head.
During training, OSTrack-384 is used as the baseline. First, the pretrained parameters of the ViT model trained with MAE (He et al., 2022) are loaded to initialize the encoder parameters, and the number of encoders, L, is set to 12. Subsequently, the DCFM parameters are initialized from the Retinex-based illumination decoupling network. The entire model is trained on the LasHeR training set, with data augmentation strategies including flipping, rotation, and brightness jitter. The initial learning rate for the encoder backbone is set to $4 \times 10^{-5}$, while the learning rate for the other network structures in the model is set to $4 \times 10^{-4}$. The AdamW optimizer is used to optimize the model with a weight decay of $10^{-4}$, and the training iteration count is set to 100. During testing, the trained model parameters are loaded and kept fixed. The classification map is simply multiplied by a Hann window of the same size, and the bounding box with the highest score after multiplication is selected as the tracking result.
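The Hann-window step at test time can be illustrated with a small score map. The sketch below builds a 2D Hann window as the outer product of two 1D windows, multiplies it with the classification map, and returns the position of the highest windowed score. It is a minimal sketch: the symmetric window definition, the function names, and the 5 × 5 score map are assumptions, and mapping the selected position back to a bounding box through the predicted offsets is omitted.

```python
import math
from typing import List, Tuple


def hann_1d(n: int) -> List[float]:
    """Symmetric Hann window of length n (zero at both ends, one at the center)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def apply_hann(score_map: List[List[float]]) -> List[List[float]]:
    """Multiply the score map element-wise by a 2D Hann window of the same size."""
    rows, cols = len(score_map), len(score_map[0])
    win_r, win_c = hann_1d(rows), hann_1d(cols)
    return [[score_map[r][c] * win_r[r] * win_c[c] for c in range(cols)]
            for r in range(rows)]


def select_peak(score_map: List[List[float]]) -> Tuple[int, int]:
    """Return (row, col) of the highest score after windowing."""
    windowed = apply_hann(score_map)
    best = max((value, r, c) for r, row in enumerate(windowed) for c, value in enumerate(row))
    return best[1], best[2]


# Hypothetical scores: a strong distractor in the corner, the target near the center.
scores = [[0.0] * 5 for _ in range(5)]
scores[0][0] = 0.9
scores[2][2] = 0.8
peak = select_peak(scores)
```

The window suppresses responses far from the center of the search region, so the corner distractor loses to the central response even though its raw score is higher. This encodes the prior that the target moves little between consecutive frames.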

Fig. 3: Evaluation results on the GTOT dataset.
GTOT Dataset. We compare our method (Ours) with several state-of-the-art tracking methods in terms of both accuracy and success rate, including the RGBT tracking methods APFNet (Xiao et al., 2022), MANet (Li et al., 2019b), JMMAC (Zhang et al., 2021b), M5L (Tu et al., 2022), HDINet (Mei et al., 2021), DAFNet (Gao et al., 2019), and DAPNet (Zhu et al., 2019), and the traditional tracking method OSTrack (Ye et al., 2022). The comparison of tracking results on the GTOT dataset is shown in Fig. 3. Our method shows higher accuracy and success rate than most of the compared tracking methods and achieves the best success rate, 74.5%, with an accuracy of 90.2%. The accuracy and success rate are 11.5% and 9.2% higher than those of the baseline method OSTrack, respectively. The success rate is 0.8% higher and the accuracy is only 0.3% lower than those of the state-of-the-art method APFNet. In addition, the proposed method is based on SiameseFC, which gives it a significant speed advantage over APFNet, which is based on MDNet (Nam and Han, 2016). The speed comparison results are shown in detail in Section 4.5.

Fig. 5: Evaluation results on the LasHeR dataset.
LasHeR Dataset. We compare our method (Ours) with several RGBT tracking methods, including APFNet (Xiao et al., 2022), MANet (Li et al., 2019b), MANet++ (Lu et al., 2021), DAFNet (Gao et al., 2019), DAPNet (Zhu et al., 2019), FANet (Zhu et al., 2021), mfDiMP (Zhang et al., 2019), and OSTrack (Ye et al., 2022). The comparison results are shown in Fig. 5. Our method achieves the best performance in both accuracy and success rate. Compared with APFNet, the accuracy is improved by 7.3% and the success rate by 8.9%. The LasHeR dataset has more complex scenes and more challenging environments, and the proposed method focuses on modulating how infrared information enhances the visible modality, so it is more advantageous in more challenging environments.

Table 1: PR/SR scores of ablation experiments on GTOT and RGBT234.
[Table 1 columns: RGB, T, MHRM, SCFM, DCFM, CE (Ye et al.)]
method MANet++, the average FPS is improved by 35.6. The data in the table show that our method performs well in terms of accuracy, success rate, and tracking speed.

Table 2: Comparison of efficiency and real-time performance (PR/SR/FPS) of four methods.