This is my first attempt at using R for network analytics. For this example I'm going to use the dataset from Raffy's blog post "Cleaning Up Network Traffic Logs - VAST 2013 Challenge".
As you can see in the network topology, the netflow collector logs all traffic to and from the internet.
The data.table package loads the data very quickly, as you can see:
    library(data.table)
    system.time(nf1 <- fread("netflow/VAST2013MC3_NetworkFlow/nf-chunk1.csv", sep=",", header=TRUE))
    Read 15172767 rows and 19 (of 19) columns from 1.777 GB file in 00:01:11
       user  system elapsed
      47.75    2.56   71.65
    setkey(nf1, TimeSeconds)
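To show what setkey does, here is a minimal sketch on a tiny made-up table. The table `flows` and its `bytes` column are hypothetical and exist only for illustration; only `TimeSeconds` comes from the code above.

```r
library(data.table)

# Hypothetical toy table standing in for nf1 (values are made up)
flows <- data.table(
  TimeSeconds = c(30, 10, 20, 10),
  bytes       = c(300, 100, 200, 150)
)

# setkey sorts the table by reference and marks TimeSeconds as the key
setkey(flows, TimeSeconds)
print(key(flows))

# A keyed lookup uses binary search instead of scanning every row
at_10 <- flows[.(10)]
print(at_10)
```

On a table with millions of rows, keyed lookups like this are usually much faster than a filter that scans the whole column.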
Now that the data is loaded, let's draw a couple of graphs to understand it:
As you can see:
- TCP is by far the most used protocol
- ICMP gets few responses, which is common on networks where a firewall blocks ICMP coming from the internet
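A per-protocol count like the one behind these observations could be sketched as follows. This is not necessarily how the original graphs were produced: the column name `ipLayerProtocolCode` is an assumption about the dataset, and the rows below are made up so the example runs without the CSV.

```r
library(data.table)

# Made-up sample rows; ipLayerProtocolCode is an assumed column name
flows <- data.table(
  TimeSeconds         = c(1, 2, 3, 4, 5, 6),
  ipLayerProtocolCode = c("TCP", "TCP", "TCP", "UDP", "TCP", "ICMP")
)

# Count flows per protocol, most frequent first
proto_counts <- flows[, .N, by = ipLayerProtocolCode][order(-N)]
print(proto_counts)

# Simple base-graphics bar chart of the counts
barplot(proto_counts$N,
        names.arg = proto_counts$ipLayerProtocolCode,
        main = "Flows per protocol",
        ylab = "Number of flows")
```

With the real data you would run the same aggregation on nf1 instead of the toy table.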
Other interesting graphs:
Port usage matters because the ephemeral source port selection strategy differs between operating systems, so the source ports seen in the flows can help identify the OS.
Furthermore, most internet services (http, smtp, pop, …) run below port 1024.
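One way to look at this split is to label each port as below 1024 or not and count the flows in each group. The sketch below uses made-up rows; the column names `firstSeenSrcPort` and `firstSeenDestPort` are assumptions about the dataset, and `port_class` is a hypothetical helper.

```r
library(data.table)

# Made-up sample rows; both column names are assumptions
flows <- data.table(
  firstSeenSrcPort  = c(49152, 1025, 32768, 51000, 80),
  firstSeenDestPort = c(80, 25, 110, 443, 50000)
)

# Hypothetical helper: "well-known" for ports below 1024, "high" otherwise
port_class <- function(p) ifelse(p < 1024, "well-known", "high")

# Add the labels by reference
flows[, src_class := port_class(firstSeenSrcPort)]
flows[, dst_class := port_class(firstSeenDestPort)]

# Count flows per destination port class
dst_share <- flows[, .N, by = dst_class]
print(dst_share)

# Range of the high source ports, a rough hint at the ephemeral range in use
src_range <- flows[src_class == "high", range(firstSeenSrcPort)]
print(src_range)
```

On real traffic, the observed range of high source ports per host can be compared against the ephemeral ranges documented for each OS.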