Image captioning using Deep Learning
General Instructions:
You are recommended to use Google Colab or Jupyter notebook.
You do not need to upload the raw data; using pre-processed data in the form of Python pickle or JSON files is allowed.
The URL of the data source has to be clearly mentioned in the notebook. Use TensorFlow and Keras only for model building.
You are expected to provide the Python notebook file (.ipynb) and a PDF export of the notebook clearly showing the outputs.
Task Response and Task Completion
All the models should be logically sound and have decent accuracy.
There are a lot of subparts, so answer each completely and correctly, as no partial marks will be awarded for partially correct subparts.
The model layers, parameters, hyperparameters, evaluation metrics, etc. should be properly implemented.
Please organize your code with correct line spacing and indentation, and add comments to make your code more readable.
Problem Statement:
Topic: Generate Image Captions using CNN+LSTM (you can use pre-trained models).
Dataset Type: Common Objects in Context (COCO)
You can use any of the data sources given below to obtain the dataset.
Data Source 1: https://cocodataset.org/#download
Data Source 2: https://www.tensorflow.org/datasets/catalog/coco
Data Source 3: https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset
Definition: Image captioning is the process of generating a textual description of an image. It combines Natural Language Processing and Computer Vision to generate the captions.
The encoder is a CNN: the input image is fed to the CNN to extract features, and the last hidden state of the CNN is connected to the decoder. The decoder is an LSTM that performs word-level language modeling; at the first time step it receives the encoded output from the encoder along with the START vector.
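To make the encoder concrete, here is a minimal sketch assuming InceptionV3 as the pre-trained CNN (any ImageNet model works the same way); the 299x299 input size is InceptionV3-specific and the image path is a placeholder:

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Drop the classification head; pooling="avg" yields one 2048-d feature
# vector per image (the encoder output handed to the decoder).
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])  # add batch dim, scale to [-1, 1]
    return encoder.predict(x, verbose=0)      # shape: (1, 2048)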
Steps to Perform in the Jupyter Notebook
1. Import Libraries/Dataset
Import the required libraries.
Check that a GPU is available (recommended: use the free GPU provided by Google Colab).
ii. Data Processing
Convert the data into the correct format that can be used by the DL model.
Plot at least two samples and their captions (use matplotlib/seaborn/any other library).
Load the data into train and test sets in the required format. A sketch of this step is given below.
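A minimal sketch of this step, assuming the COCO 2017 caption annotations are available locally (the file and folder paths below are examples and depend on which data source you pick):

import json
import matplotlib.pyplot as plt
import tensorflow as tf

# Check that a GPU is visible (Colab: Runtime -> Change runtime type -> GPU).
print("GPUs:", tf.config.list_physical_devices("GPU"))

with open("annotations/captions_train2017.json") as f:  # example path
    coco = json.load(f)

# Map image_id -> list of captions.
captions = {}
for ann in coco["annotations"]:
    captions.setdefault(ann["image_id"], []).append(ann["caption"])

# Plot two sample images with one of their captions each.
id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
for i, (img_id, caps) in enumerate(list(captions.items())[:2]):
    plt.subplot(1, 2, i + 1)
    plt.imshow(plt.imread("train2017/" + id_to_file[img_id]))  # example path
    plt.title(caps[0], fontsize=8)
    plt.axis("off")
plt.show()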
2. Model Building
Use any pre-trained model trained on the ImageNet dataset (publicly available on Google) for image feature extraction.
Create a 2-layer LSTM model and other relevant layers for image caption generation.
Add one layer of dropout at the appropriate position and give reasons.
Choose the appropriate activation function for all the layers.
Print the model summary.
Justify the choice of the number of layers, activation functions, and any other hyperparameters used (a model sketch is given below).
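A minimal sketch of such a decoder, assuming 2048-d InceptionV3 features and placeholder values for the vocabulary size, maximum caption length, and layer widths (all of these are assumptions to adjust to your data):

from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_LEN, EMBED_DIM, UNITS = 8000, 35, 256, 256  # example values

# Image branch: dropout right after the high-dimensional CNN features
# regularizes the model where overfitting is most likely, then a Dense
# layer projects the features into the decoder's hidden size.
img_in = Input(shape=(2048,))
img_feat = Dropout(0.5)(img_in)
img_feat = Dense(UNITS, activation="relu")(img_feat)

# Text branch: word embedding followed by a 2-layer LSTM.
txt_in = Input(shape=(MAX_LEN,))
seq = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_in)
seq = LSTM(UNITS, return_sequences=True)(seq)
seq = LSTM(UNITS)(seq)

# Merge both branches and predict the next word (softmax over the vocabulary).
merged = add([img_feat, seq])
out = Dense(UNITS, activation="relu")(merged)
out = Dense(VOCAB_SIZE, activation="softmax")(out)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.summary()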
3. Model Compilation
Compile the model with the appropriate loss function.
Use an appropriate optimizer.
Justify the choice of the learning rate, optimizer, loss function, and any other hyperparameters used (a compilation sketch is given below).
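A compilation sketch under the same assumptions: the softmax output over the vocabulary pairs naturally with cross-entropy (sparse if the targets are integer word ids, categorical if they are one-hot), and Adam with its default 1e-3 learning rate is a common starting point:

from tensorflow.keras.optimizers import Adam

model.compile(
    loss="sparse_categorical_crossentropy",  # assumes integer word-id targets
    optimizer=Adam(learning_rate=1e-3),
    metrics=["accuracy"],
)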
4. Model Training
Train the model for an appropriate number of epochs.
Print the train and validation loss for each epoch.
Use an appropriate batch size.
Plot the loss and accuracy history graphs for both the train and validation sets.
Print the total time taken for training. A sketch of this step is given below.
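A training sketch, assuming arrays X_img_train, X_txt_train, y_train (and validation counterparts) were prepared in the data-processing step; the variable names, epoch count, and batch size are placeholders:

import time
import matplotlib.pyplot as plt

start = time.time()
history = model.fit(
    [X_img_train, X_txt_train], y_train,
    validation_data=([X_img_val, X_txt_val], y_val),
    epochs=20, batch_size=64, verbose=1,  # example hyperparameters
)
print(f"Total training time: {time.time() - start:.1f} s")

# Loss and accuracy curves for train vs. validation.
for i, key in enumerate(["loss", "accuracy"]):
    plt.subplot(1, 2, i + 1)
    plt.plot(history.history[key], label="train")
    plt.plot(history.history["val_" + key], label="val")
    plt.title(key)
    plt.legend()
plt.show()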
5. Model Evaluation
Take 5 random images from Google and generate a caption for each of them.
Print the confusion matrix and classification report for the test data. A caption-generation sketch is given below.
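A greedy-decoding sketch for captioning new images, assuming a fitted Keras tokenizer whose training captions were wrapped in "startseq"/"endseq" markers, plus the extract_features helper and MAX_LEN constant from the sketches above (all assumptions):

from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(image_path):
    feat = extract_features(image_path)  # encoder sketch above
    text = "startseq"
    for _ in range(MAX_LEN):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=MAX_LEN)
        probs = model.predict([feat, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(probs.argmax()))
        if word is None or word == "endseq":  # stop at the end marker
            break
        text += " " + word
    return text.replace("startseq", "").strip()

print(generate_caption("downloaded_image.jpg"))  # repeat for all 5 images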