(PYTHON)
I am trying to convert a text file with lines like this:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
into a pandas data frame like this:
| host | timestamp | method | url | version | response_code | content_size |
| 199.72.81.55 | 01/Jul/1995:00:00:01 -0400 | GET | /history/apollo/ | HTTP/1.0 | 200 | 6245 |
| unicomp6.unicomp.net | 01/Jul/1995:00:00:06 -0400 | GET | /shuttle/countdown/ | HTTP/1.0 | 200 | 3985 |
I am really close with this method:
df = pandas.read_csv(src_log_filepath, sep="\s-\s-\s\[|\s(?=/)|\]\s\"|\"(?=\s)|\s(?=\d+)", names=["host", "timestamp", "method", "url", "version", "response_code", "content_size"])
Except that it does not separate the "url" contents from what should go into the "version" column. The url column ends up looking like this, and the version column is just NaN:
| url | version |
| /history/apollo/ HTTP/1.0 | NaN |
Everything else is fine, though. When I add "|\s(?=HTTP)" to the "sep" argument, it fixes this issue, but then the rest of the data columns get messed up: the host column and the timestamp column now both show the IP for some reason:
Example host: 10.223.157.186 15/Jul/2009:14:58:59 -0700
Example timestamp: 10.223.157.186 GET
Somehow, adding "|\s(?=HTTP)" to "sep" causes this. The full argument then looks like this:
sep="\s-\s-\s\[|\s(?=/)|\]\s\"|\"(?=\s)|\s(?=\d+)|\s(?=HTTP)"
Why does this happen and how can I separate the URL from the method without this occurring?
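One way to investigate is to apply both separators to the sample line with `re.split`, on the assumption that pandas splits each line in much the same way when `sep` is a regular expression. This is a minimal sketch, not a confirmed account of pandas internals; the variable names are illustrative only.

```python
import re

# Sample line taken from the question.
line = ('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
        '"GET /history/apollo/ HTTP/1.0" 200 6245')

# The original separator and the extended one, written as raw strings.
sep_original = r'\s-\s-\s\[|\s(?=/)|\]\s\"|\"(?=\s)|\s(?=\d+)'
sep_extended = sep_original + r'|\s(?=HTTP)'

fields_original = re.split(sep_original, line)
fields_extended = re.split(sep_extended, line)

# The closing quote and the space before the status code are both
# separators, so an empty field appears between them in either case.
print(len(fields_original), fields_original)
print(len(fields_extended), fields_extended)
```

With the original separator the split yields seven fields, and the empty one happens to land in the "version" slot, which would explain the NaN. With the extra alternative there are eight fields for seven names. pandas appears to treat a surplus leading field as the index in that situation, which would shift every column one place to the left and would match the IP showing up next to the timestamp and the method.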
(Some requirements for the assignment ask me to clean up the string before I put it into the table. That's why my regex is so weird.)
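If cleaning the string before it reaches the table is acceptable, one alternative to a separator regex is to match each whole line against a single pattern with named groups. The sketch below assumes every line follows the format shown in the question; `LOG_PATTERN` and `parse_log_lines` are hypothetical names, and lines that do not match are simply skipped.

```python
import re

# One named group per target column; assumes the layout shown in the question.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<response_code>\d{3}) (?P<content_size>\d+)'
)

def parse_log_lines(lines):
    """Return one dict per line that matches LOG_PATTERN."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:  # non-matching lines are skipped in this sketch
            rows.append(match.groupdict())
    return rows

sample = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
    '"GET /history/apollo/ HTTP/1.0" 200 6245',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] '
    '"GET /shuttle/countdown/ HTTP/1.0" 200 3985',
]
rows = parse_log_lines(sample)
print(rows[0])
```

Assuming pandas is installed, `pandas.DataFrame(rows)` would then build a frame with the seven columns, with all values held as strings until they are converted.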
