Question: I am facing an issue because I am trying to concatenate a one-hot encoding vector of size 5 with the patches of the image

I am facing an issue because I am trying to concatenate a one-hot encoding vector of size 5 with the patches of the image (I know the code below uses a different approach). I am using a pretrained vision transformer model, vit_base_patch16_224, and it is a multilabel classification problem.
Ideally, I would feed the model the image and the resolution, and the model would predict the most suitable labels. An image of the dataset is attached below.
This is the model I am using; I have done all the necessary steps:
import torch
import torch.nn as nn
from transformers import ViTModel

# MultilabelImageClassificationBase is my own base class (not shown here)
class ResizingModel(MultilabelImageClassificationBase):
    def __init__(self, num_classes):
        super(ResizingModel, self).__init__()
        self.num_classes = num_classes
        self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
        # The classifier input is the CLS embedding plus the 5-dim one-hot resolution vector
        self.fc = nn.Linear(self.vit.config.hidden_size + 5, num_classes)
        # print(f"Input size of self.cls_embedding_size: {self.vit.config.hidden_size + 5}")
        # print(f"Output size of self.fc: {num_classes}")

    def forward(self, x, target_resolution_one_hot):
        # Process input image through Vision Transformer
        vit_output = self.vit(x)
        # Extract the CLS token embedding (position 0) from the last hidden state
        cls_token_embedding = vit_output.last_hidden_state[:, 0, :]  # Use only the CLS token
        # print("Shape of cls_token_embedding:", cls_token_embedding.shape)
        # print("Shape of target res:", target_resolution_one_hot.shape)
        # Append the one-hot resolution vector along the feature dimension
        concatenated_input = torch.cat([cls_token_embedding, target_resolution_one_hot], dim=1)
        # Apply linear layer
        logits = self.fc(concatenated_input)
        return logits
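The question mentions concatenating the one-hot vector with the image patches, whereas the model above appends it to the CLS embedding. As a rough sketch of the patch-level variant, the one-hot vector could be projected to the hidden size and appended as one extra token to the token sequence. The tensors below are random stand-ins for ViT embeddings; `resolution_proj` and the shapes (197 tokens, hidden size 768) are assumptions for illustration and are not part of the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed ViT-Base shapes: 1 CLS token + 196 patch tokens, hidden size 768
batch_size, num_tokens, hidden_size = 2, 197, 768

# Random stand-in for the token embeddings a ViT would produce
token_embeddings = torch.randn(batch_size, num_tokens, hidden_size)

# One-hot resolution vectors of size 5 (indices chosen arbitrarily here)
resolution_idx = torch.tensor([1, 3])
resolution_one_hot = F.one_hot(resolution_idx, num_classes=5).float()

# Hypothetical projection that lifts the 5-dim vector to the hidden size
resolution_proj = nn.Linear(5, hidden_size)
resolution_token = resolution_proj(resolution_one_hot).unsqueeze(1)  # (batch, 1, hidden)

# Append the resolution token to the token sequence along the token dimension
tokens_with_resolution = torch.cat([token_embeddings, resolution_token], dim=1)
print(tokens_with_resolution.shape)  # torch.Size([2, 198, 768])
```

Whether this works better than concatenating at the CLS level is something that would have to be checked experimentally.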
I have also attached an image of the sample output. My main concern is that the accuracy is extremely low (~0.29).
I would appreciate it if you could help me solve this problem.
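The post does not say how the accuracy was computed, so one thing worth checking is which definition is in use. In a multilabel setting, exact-match accuracy (every label of a sample correct) can be much lower than per-label accuracy. The sketch below uses made-up predictions and targets, and the function names are illustrative only.

```python
def exact_match_accuracy(preds, targets):
    """Fraction of samples where every label is predicted correctly."""
    matches = sum(1 for p, t in zip(preds, targets) if p == t)
    return matches / len(targets)


def per_label_accuracy(preds, targets):
    """Fraction of individual label decisions that are correct."""
    correct = sum(pi == ti for p, t in zip(preds, targets) for pi, ti in zip(p, t))
    total = sum(len(t) for t in targets)
    return correct / total


# Made-up example with 4 labels per sample (e.g. CR, SC, SNS, SCL)
targets = [[0, 0, 1, 0], [1, 1, 1, 1], [1, 0, 1, 1], [1, 0, 0, 0]]
preds = [[0, 0, 1, 0], [1, 0, 1, 1], [1, 0, 1, 0], [1, 1, 0, 0]]

print(exact_match_accuracy(preds, targets))  # 0.25
print(per_label_accuracy(preds, targets))    # 0.8125
```

Here only one of four samples is fully correct, yet 13 of the 16 individual label decisions are right, so the two numbers tell quite different stories.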
| image | resolution (one-hot, 5 values) | CR | SC | SNS | SCL |
|---|---|---|---|---|---|
| images\1_city.jpg | 0 1 0 0 0 | 0 | 0 | 1 | 0 |
| images\1_city.jpg | 0 0 0 1 0 | 1 | 1 | 1 | 1 |
| images\1_city.jpg | 1 0 0 0 0 | 0 | 0 | 1 | 0 |
| images\1_city.jpg | 0 0 1 0 0 | 0 | 0 | 1 | 0 |
| images\1_city.jpg | 0 0 0 0 1 | 1 | 1 | 1 | 1 |
| images\1_crowd.jpg | 0 0 0 1 0 | 1 | 0 | 1 | 1 |
| images\1_crowd.jpg | 1 0 0 0 0 | 1 | 0 | 0 | 0 |
| images\1_crowd.jpg | 0 0 0 0 1 | 1 | 1 | 1 | 1 |
| images\1_crowd.jpg | 0 1 0 0 0 | 1 | 0 | 1 | 0 |
| | 0 0 1 0 0 | 1 | 0 | 1 | 0 |
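Assuming each row of the table holds the image path, then the 5-value one-hot resolution, then the four labels (CR, SC, SNS, SCL), a row could be split as sketched below. The column layout is inferred from the table rather than stated in the post, and `parse_row` is a hypothetical helper.

```python
LABEL_NAMES = ["CR", "SC", "SNS", "SCL"]
NUM_RESOLUTIONS = 5  # size of the one-hot resolution vector


def parse_row(row):
    """Split a flat row into (image_path, resolution_one_hot, labels)."""
    image_path = row[0]
    values = [int(v) for v in row[1:]]
    expected = NUM_RESOLUTIONS + len(LABEL_NAMES)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    resolution_one_hot = values[:NUM_RESOLUTIONS]
    if sum(resolution_one_hot) != 1:
        raise ValueError("resolution vector is not one-hot")
    labels = dict(zip(LABEL_NAMES, values[NUM_RESOLUTIONS:]))
    return image_path, resolution_one_hot, labels


# First row of the table
row = ["images\\1_city.jpg", 0, 1, 0, 0, 0, 0, 0, 1, 0]
path, resolution, labels = parse_row(row)
print(resolution)  # [0, 1, 0, 0, 0]
print(labels)      # {'CR': 0, 'SC': 0, 'SNS': 1, 'SCL': 0}
```

The one-hot check is a cheap guard against misaligned columns, which would silently feed wrong targets to the model.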