How to Find Strange Lines in a Text File with Python

You have a text file with far too many lines to read. Skimming through it would take forever, but you need to find the strange lines that differ from the rest of the file. This article describes how you can detect those anomalies.

For testing purposes you can copy the following lines to a file called access.log.

192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.3 - - [11/May/2014:08:03:01 -0700] "GET /cat.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:07:55 -0700] "GET /cat.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.4 - - [11/May/2014:08:15:32 -0700] "GET /plane.png HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:19:33 -0700] "GET /image.png HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.77 - - [11/May/2014:08:44:20 -0300] "GET /modules.php?name=Downloads&d_op=modifydownloadrequest& %20lid=-1%20UNION%20SELECT%200, username,user_id,user_password, name,%20user_email,user_level,0, 0%20FROM%20nuke_users HTTP/1.1" 200 9918 "http://www.example.com/start.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" "www.example.com"
192.168.0.4 - - [11/May/2014:09:01:19 -0700] "GET /picture.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:09:11:11 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.3 - - [11/May/2014:09:12:10 -0700] "GET /test.png HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:09:13:44 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:09:13:55 -0700] "GET /page.php HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.3 - - [11/May/2014:09:14:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.5 - - [11/May/2014:09:15:10 -0700] "GET /cat.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.6 - - [11/May/2014:09:15:55 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.5 - - [11/May/2014:09:16:05 -0700] "GET /cat.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.4 - - [11/May/2014:09:16:15 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:16:45 -0700] "GET /plane.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:17:19 -0700] "GET /fig.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.3 - - [11/May/2014:08:19:01 -0700] "GET /dog.png HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:19:56 -0700] "GET /dog.png HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"
192.168.0.2 - - [11/May/2014:08:01:19 -0700] "GET /cake.jpg HTTP/1.0" 200 3330 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1" "www.example.com"

We’ll be doing this in Python, so fire up your favorite code editor and start typing.

1. First things first

We need to make a request to the CAP server. For Python, the Requests library is ideal for the task. Install the library, for example using pip in the command prompt:

pip install requests

Now it is time to import the library into your script. Write the following lines to a file and run it as a script, or type them directly into the Python interpreter.

import requests

Next, we’ll set the request timeout (in seconds) and the name of the file to inspect.

TIMEOUT = 60
filename = 'access.log'

2. CAP does the work

Set the request URL using your personal API key. You can find the key in the header of the CAP pages when you are logged in, or on the documentation page. Then open the file and send it in a POST request.

url = "https://api.capdatatechnologies.com/file/YOUR_API_KEY_HERE"
with open(filename, 'rb') as f:
    resp = requests.post(url, stream=True, data=f, verify=True, timeout=TIMEOUT)
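
If you would rather not hardcode the key and want failed requests to be noticed early, the step above can be wrapped in two small helpers. This is only a sketch: the names build_url and send_file and the CAP_API_KEY environment variable are made up for illustration, and it assumes the service signals errors with HTTP status codes, which the article does not confirm. The post argument is expected to behave like requests.post.

```python
# Base URL taken from the article; the API key is appended to it.
API_BASE = "https://api.capdatatechnologies.com/file/"


def build_url(api_key):
    """Return the request URL for the given API key (hypothetical helper)."""
    if not api_key:
        raise ValueError("API key is empty")
    return API_BASE + api_key


def send_file(filename, url, post, timeout=60):
    """Upload the file with the supplied post callable, e.g. requests.post.

    Hypothetical helper: it raises if the response carries an HTTP error
    status, assuming the service uses status codes to report failures.
    """
    with open(filename, "rb") as f:
        resp = post(url, data=f, timeout=timeout)
    resp.raise_for_status()  # requests raises HTTPError on 4xx/5xx
    return resp


# Example usage (needs the requests library and a real key in the
# hypothetical CAP_API_KEY environment variable):
#   import os, requests
#   url = build_url(os.environ["CAP_API_KEY"])
#   resp = send_file("access.log", url, requests.post)
```

Passing the post callable in as an argument keeps the helper easy to try out without touching the network. Certificate verification is left at the Requests default, which is already on.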

3. Results just in!

Let’s wait a bit. The time needed depends on the size of your file. Once the request completes, you have the response and can print its text, which is in JSON format.

print(resp.text)

The output will look something like this (from the example file above):

[{"line_number": 6, "anomality_level": "1.402", "is_anomaly": 1, "line": "192.168.0.77 - - [11/May/2014:08:44:20 -0300] \"GET /modules.php?name=Downloads&d_op=modifydownloadrequest& %20lid=-1%20UNION%20SELECT%200, username,user_id,user_password, name,%20user_email,user_level,0, 0%20FROM%20nuke_users HTTP/1.1\" 200 9918 \"http://www.example.com/start.html\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" \"www.example.com\""}, {"line_number": 30, "anomality_level": "1.096", "is_anomaly": 1, "line": ""}]
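
Since the response is JSON, you will probably want to parse it rather than just print it. The sketch below uses an abbreviated copy of the sample output above (the "line" values are shortened here by hand) and lists the flagged lines, most anomalous first. The function name anomalous_lines is made up for this example, and the field meanings are inferred from the sample output alone: is_anomaly equal to 1 is taken to mean the line was flagged, and anomality_level is a number sent as a string.

```python
import json

# Abbreviated copy of the sample response; with a live request you would
# pass resp.text instead. The "line" text is shortened for readability.
sample = (
    '[{"line_number": 6, "anomality_level": "1.402", "is_anomaly": 1, '
    '"line": "192.168.0.77 - - [11/May/2014:08:44:20 -0300] ..."}, '
    '{"line_number": 30, "anomality_level": "1.096", "is_anomaly": 1, '
    '"line": ""}]'
)


def anomalous_lines(text):
    """Return the flagged entries, highest anomality_level first."""
    results = json.loads(text)
    flagged = [r for r in results if r["is_anomaly"] == 1]
    # anomality_level arrives as a string, so convert it before sorting.
    return sorted(flagged, key=lambda r: float(r["anomality_level"]), reverse=True)


for entry in anomalous_lines(sample):
    # Print the line number, the score, and the start of the line itself.
    print(entry["line_number"], entry["anomality_level"], entry["line"][:60])
```

In the sample, the second entry has an empty "line" value and a line number one past the 29 log lines, so it presumably refers to a trailing blank line in the uploaded file.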

That’s it! Now you know which lines are not normal in your file. Sign up and start finding anomalies.


For further information from CAP

CAP Data Technologies
Tuomo Sipola, Ph.D., CEO
tuomo.sipola@capdatatechnologies.com, +358 40 753 2169