

Question: I am facing an issue because I am trying to concatenate a one-hot encoding vector of size 5 with the patches of the image

I am facing an issue because I am trying to concatenate a one-hot encoding vector of size 5 with the patches of the image (I know I am using a different approach in the code). I am using a pretrained vision transformer model, vit_base_patch16_224.

It is a multilabel classification problem.

Ideally, I would feed the model the image and the resolution, and I would like the model to predict the most suitable labels. An image of the dataset is attached below.

And this is the model I am using; I have done all the necessary steps:

    import torch
    import torch.nn as nn
    from transformers import ViTModel

    # MultilabelImageClassificationBase is the author's own base class (not shown here).
    class ResizingModel(MultilabelImageClassificationBase):
        def __init__(self, num_classes):
            super(ResizingModel, self).__init__()
            self.num_classes = num_classes
            self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
            self.fc = nn.Linear(self.vit.config.hidden_size + 5, num_classes)
            # print(f"Input size of self.cls_embedding_size: {self.vit.config.hidden_size + 5}")
            # print(f"Output size of self.fc: {num_classes}")

        def forward(self, x, target_resolution_one_hot):
            # Process input image through Vision Transformer
            vit_output = self.vit(x)
            # Extract the CLS token embedding from the Vision Transformer output from the last hidden state
            # the hidden state refers to the information
            cls_token_embedding = vit_output.last_hidden_state[:, 0, :]  # Use only the CLS token
            # print("Shape of cls_token_embedding:", cls_token_embedding.shape)  # Shape
            # print("Shape of target res:", target_resolution_one_hot.shape)  # Shape
            concatenated_input = torch.cat([cls_token_embedding, target_resolution_one_hot], dim=1)
            # Apply linear layer
            logits = self.fc(concatenated_input)
            return logits
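The shapes involved in the model above can be checked with plain arithmetic. The sketch below assumes the usual ViT-Base settings (224x224 input, 16x16 patches, hidden size 768), which the question does not state explicitly. The helper name `vit_shapes` is hypothetical.

```python
def vit_shapes(image_size=224, patch_size=16, hidden_size=768, one_hot_size=5):
    """Return token and feature sizes for a ViT encoder plus a one-hot side input.

    The defaults are assumed ViT-Base values, not read from the model config.
    """
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2   # number of image patches
    seq_len = num_patches + 1             # patches plus the CLS token
    return {
        "num_patches": num_patches,
        "seq_len": seq_len,
        # last_hidden_state has shape (batch, seq_len, hidden_size); index 0 is CLS
        "cls_dim": hidden_size,
        # width of torch.cat([cls, one_hot], dim=1), i.e. in_features of self.fc
        "fc_in_features": hidden_size + one_hot_size,
    }


shapes = vit_shapes()
print(shapes)
```

Under these assumptions the one-hot vector is joined to the CLS embedding only, not to the individual patch tokens, which is the "different approach" mentioned in the question.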

I have also attached an image of the sample output. My main concern is that the accuracy is extremely low (0.29).

I would appreciate it if you could help me solve this problem.
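How the 0.29 figure was computed is not shown, and for multilabel outputs the number depends heavily on how accuracy is defined. As a minimal sketch, assuming the logits are passed through a sigmoid and thresholded at 0.5, the code below compares exact-match accuracy (every label of a sample must be right) with per-label accuracy. The function names and the example logits are made up for illustration. The targets assume that the last four values of each row in the table below are the labels.

```python
import math


def predict_labels(logits, threshold=0.5):
    """Apply a sigmoid to each logit and threshold it (assumed decision rule)."""
    return [[1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in row]
            for row in logits]


def exact_match_accuracy(preds, targets):
    """Fraction of samples whose full label vector is predicted correctly."""
    hits = sum(1 for p, t in zip(preds, targets) if p == t)
    return hits / len(targets)


def per_label_accuracy(preds, targets):
    """Fraction of individual label decisions that are correct."""
    total = sum(len(t) for t in targets)
    hits = sum(pi == ti for p, t in zip(preds, targets) for pi, ti in zip(p, t))
    return hits / total


# Label values taken from the first three rows of the table.
targets = [[0, 0, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0]]
# Hypothetical logits, not real model output.
logits = [[-2.0, -1.0, 3.0, -0.5], [1.5, -0.2, 2.0, 0.7], [-1.0, 0.4, 2.5, -3.0]]

preds = predict_labels(logits)
exact = exact_match_accuracy(preds, targets)
per_label = per_label_accuracy(preds, targets)
print(exact, per_label)
```

In this toy case only one of three samples is fully correct even though ten of twelve label decisions are right, so it may be worth checking which of the two definitions the reported accuracy uses.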

image             | resolution (one-hot) | CR | SC | SNS | SCL
images\1_city.jpg  | 0 1 0 0 0            | 0  | 0  | 1   | 0
images\1_city.jpg  | 0 0 0 1 0            | 1  | 1  | 1   | 1
images\1_city.jpg  | 1 0 0 0 0            | 0  | 0  | 1   | 0
images\1_city.jpg  | 0 0 1 0 0            | 0  | 0  | 1   | 0
images\1_city.jpg  | 0 0 0 0 1            | 1  | 1  | 1   | 1
images\1_crowd.jpg | 0 0 0 1 0            | 1  | 0  | 1   | 1
images\1_crowd.jpg | 1 0 0 0 0            | 1  | 0  | 0   | 0
images\1_crowd.jpg | 0 0 0 0 1            | 1  | 1  | 1   | 1
images\1_crowd.jpg | 0 1 0 0 0            | 1  | 0  | 1   | 0
                   | 0 0 1 0 0            | 1  | 0  | 1   | 0
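The header lists six names while each row holds ten cells, so the sketch below assumes that the five values after the image path are the one-hot resolution and the last four are the labels. `split_row` is a hypothetical helper, not part of the original code.

```python
def split_row(row, one_hot_size=5):
    """Split a table row into (image, resolution_one_hot, labels).

    Assumed layout: image path, `one_hot_size` resolution flags, then the labels.
    """
    image, values = row[0], [int(v) for v in row[1:]]
    one_hot, labels = values[:one_hot_size], values[one_hot_size:]
    # A valid one-hot vector contains only 0/1 and exactly one 1.
    if any(v not in (0, 1) for v in one_hot) or sum(one_hot) != 1:
        raise ValueError(f"resolution is not one-hot: {one_hot}")
    return image, one_hot, labels


row = ["images\\1_city.jpg", 0, 1, 0, 0, 0, 0, 0, 1, 0]  # first row of the table
image, one_hot, labels = split_row(row)
resolution_index = one_hot.index(1)
print(image, resolution_index, labels)
```

A check like this, run over the whole CSV before training, can rule out misaligned columns as a cause of poor results.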
