Question: I am working on the data frame below, but I am not sure how to get the total duration of scan time in nanoseconds using PySpark.

Assume the data frame is already defined. Below are the data and my attempt in pandas-style Python:

   timestamp                   event    value
1  2020-11-17_19:15:33.438102  scan     start
2  2020-11-17_19:18:33.433002  scan     end
3  2020-11-17_26:25:21.538105  scan     start
4  2020-11-17_29:13:09.538102  scan     end
5  2020-11-17_32:13:09.538102  pending  start
6  2020-11-17_34:13:09.538102  pending  end
7  2020-11-17_35:13:09.538102  pending  start
.........

column types:

timestamp: timestamp, event: string, value: string

# get scan start time
scan_start = df[(df['event'] == 'scan') & (df['value'] == 'start')]
scan_start_time = scan_start['timestamp']

# get scan end time (taken from scan_end, not scan_start)
scan_end = df[(df['event'] == 'scan') & (df['value'] == 'end')]
scan_end_time = scan_end['timestamp']

# the duration of each scan
each_duration = scan_end_time.values - scan_start_time.values

# total duration
total_duration_ns = each_duration.sum()

But I am not sure how to do the above steps in PySpark. Can someone please provide some sample code?

Thank you
