Question: I am working on the data frame below, but I'm not sure how to get the total duration of scan time in nanoseconds using PySpark. Assume the data frame is already defined. Below are the data, the column types, and some of my thoughts in Python:
| | timestamp | event | value |
|---|---|---|---|
| 1 | 2020-11-17_19:15:33.438102 | scan | start |
| 2 | 2020-11-17_19:18:33.433002 | scan | end |
| 3 | 2020-11-17_26:25:21.538105 | scan | start |
| 4 | 2020-11-17_29:13:09.538102 | scan | end |
| 5 | 2020-11-17_32:13:09.538102 | pending | start |
| 6 | 2020-11-17_34:13:09.538102 | pending | end |
| 7 | 2020-11-17_35:13:09.538102 | pending | start |
| ... | ... | ... | ... |
Column types: `timestamp`: timestamp, `event`: string, `value`: string
```python
# get scan start times
scan_start = df[(df['event'] == 'scan') & (df['value'] == 'start')]
scan_start_time = scan_start['timestamp']

# get scan end times
scan_end = df[(df['event'] == 'scan') & (df['value'] == 'end')]
scan_end_time = scan_end['timestamp']  # note: was scan_start['timestamp'], a typo

# the duration of each scan
each_duration = scan_end_time.values - scan_start_time.values

# total duration
total_duration_ns = each_duration.sum()
```
But I am not sure how to do the above steps in PySpark. Can someone please provide me with some sample code?
Thank you
