Wednesday, October 22, 2014

Network Ports: Categorical or Numerical

Raffael Marty gave me some great advice about how to treat network ports for visualization.

Network Ports

Port numbers are used by the transport layer to provide host-to-host connectivity for protocol services. The Internet Assigned Numbers Authority (IANA) is responsible for maintaining the official assignments of port numbers for specific uses.
We can usually distinguish three ranges of ports:
  • Well-known ports [0 to 1023]: They are used by system processes that provide widely used types of network services. This range contains the most common protocols, such as HTTP (80), HTTPS (443), SMTP (25), DNS (53), …
  • Registered ports [1024 to 49151]: They are assigned by IANA for a specific service upon application by a requesting entity.
  • Ephemeral ports [49152 to 65535]: This range is used for custom or temporary purposes and for automatic allocation. Team Cymru has a great compilation of default ephemeral port usage and source port selection strategies known to be used by a variety of systems.
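The three ranges above can be sketched as a small R helper. This is only an illustration: `port_range` is a hypothetical name, and the breaks are chosen so that each one is the last port of its range, since `cut()` intervals are closed on the right by default.

```r
# Hypothetical helper: map port numbers to the three ranges listed above.
port_range <- function(port) {
  stopifnot(all(port >= 0 & port <= 65535))
  # Intervals are (-1,1023], (1023,49151], (49151,65535]
  cut(port,
      breaks = c(-1, 1023, 49151, 65535),
      labels = c("Well Known", "Registered", "Ephemeral"))
}

# 80 and 443 are well known, 1024 and 5353 are registered, 49152 is ephemeral
port_range(c(80, 443, 1024, 5353, 49152))
```

Returning a factor keeps the ranges ordered, which is convenient for tables and plots.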

Types of variables: Numerical or Categorical

I bought the book OpenIntro Statistics to learn R and statistics, and Section 1.2.2 explains the types of variables.
In summary, a numerical variable can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. On the other hand, the average, sum, and difference of a categorical variable have no clear meaning.
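A quick way to see the difference is to try both treatments on a handful of port numbers. The values below are made up for illustration and `dst_ports` is a hypothetical variable name.

```r
# Made-up destination ports: three HTTP flows and two HTTPS flows
dst_ports <- c(80, 443, 80, 80, 443)

# Treated as numerical: the mean is a number that is not a port anyone used
port_mean <- mean(dst_ports)

# Treated as categorical: counts per port are directly interpretable
port_counts <- table(factor(dst_ports))

port_mean
port_counts
```

The mean lands somewhere between 80 and 443 and describes no service, while the counts tell you how many flows went to each one.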

Test lab

I collected some NetFlow data from one host on my home network. First, let's consider network ports as numerical:
[Figure: network ports treated as a numerical variable]
Analyzing destination ports:
[Figure: destination ports]
The destination TCP ports are 80 (HTTP) and 443 (HTTPS). The average, sum, and difference of these values have no clear meaning, so this variable can't be numerical.
On the other hand, if we look at source ports:
[Figure: source ports]
We can see that most UDP and TCP source ports are between 49000 and 52000. UDP also uses source ports 123 (NTP) and 5353 (mDNS).
NOTE: Searching Google for UDP ports 123 and 5353 turns up articles about AirPlay, Apple TV, and other Apple-related technologies.
With this information, and looking at the Team Cymru table, we can assume that this device is running Apple iOS (it's my iPad).


I think that most of the time, when you are analyzing network flows, you should treat ports as a categorical variable.

If we consider only flows that originate from the local network, SrcPort is the local port and DestPort is the port being connected to.

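library(data.table)  # data.table syntax, %like%, melt
library(vcd)         # mosaic

# Keep only flows originating from the local network
n <- nf[V4 %like% "192.168.1."]
# Clean nfdump columns to srcip, srcport, dstport and protocol
# (the code that builds nports from n is not shown)
# Remove ICMP flows
nports <- nports[Proto != "ICMP"]
# Melt the two port columns (SrcPort, DestPort) into long format for plotting
n.melt <- melt(nports, measure.vars = 2:3)
# cut() intervals are closed on the right, so each break is the last port of its range
n.melt$PortCat <- cut(n.melt$value, c(-1, 1023, 49151, 65535), labels = c("Well Known", "Registered", "Ephemeral"))
# Table with data
table(n.melt$PortCat, n.melt$Proto, n.melt$variable)
## , ,  = SrcPort
##                 TCP    UDP
##   Well Known      2   2592
##   Registered  44502  33690
##   Ephemeral   91647  19634
## , ,  = DestPort
##                 TCP    UDP
##   Well Known 134664  31216
##   Registered   1459  24699
##   Ephemeral      28      1
mosaic( ~ PortCat + Proto + variable, data = n.melt)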

[Figure: mosaic plot of PortCat by Proto and variable (SrcPort, DestPort)]

In this graph you can see that most TCP connections originate from ephemeral and registered ports and go to well-known ports. UDP connections behave differently: they are spread more evenly across the port ranges. ICMP is removed because nfdump uses the ICMP code as the destination port.
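The same pattern can be read from the numbers by turning the DestPort counts in the table above into per-protocol shares. This is a minimal sketch that retypes the published counts by hand; `dest` and `dest_share` are hypothetical names.

```r
# DestPort counts copied from the table above (rows: port range, columns: protocol)
dest <- matrix(c(134664, 1459, 28,     # TCP
                 31216, 24699, 1),     # UDP
               nrow = 3,
               dimnames = list(PortCat = c("Well Known", "Registered", "Ephemeral"),
                               Proto   = c("TCP", "UDP")))

# margin = 2 normalizes each protocol column so that it sums to 1
dest_share <- prop.table(dest, margin = 2)
round(dest_share, 3)
```

Nearly all TCP flows go to well-known ports, while UDP flows are split between well-known and registered ports.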

However, if you are working with ephemeral ports to try to detect the OS of the host, you need to treat them as a numerical variable.
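As a rough sketch of what the numerical treatment looks like, the snippet below summarizes a set of source ports and checks how many fall inside a candidate allocation range. The port values are made up, `share_in_range` is a hypothetical helper, and the only range used is the IANA ephemeral range listed earlier; ranges for specific operating systems would have to come from a reference such as the Team Cymru table.

```r
# Made-up source ports for illustration; not taken from the capture above
src_ports <- c(49152, 49310, 49999, 50021, 51877)

# Numeric summaries are meaningful here: they describe the allocation range
observed <- range(src_ports)

# Hypothetical helper: share of ports inside a candidate range [low, high]
share_in_range <- function(ports, low, high) {
  mean(ports >= low & ports <= high)
}

# Share of the observed ports inside the IANA ephemeral range
iana_share <- share_in_range(src_ports, 49152, 65535)

observed
iana_share
```

Comparing the observed minimum and maximum against several candidate ranges is one simple way to narrow down which allocation strategy a host is using.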