Eric Li's Blog



Research on container network of Kubernetes

Posted on 2016-11-14 | In container , network |

Network provider research

| Provider | k8s version | k8s network policy | Pros | Cons | Throughput (% of direct) |
| --- | --- | --- | --- | --- | --- |
| flannel vxlan | >= 1.2 | no | 1) easy to configure<br>2) easy to span VLANs/datacenters | 1) broadcast flood to 192.168.0.0/16, since no exact ip routes are set<br>2) performance downgrade<br>3) network isolation needs extra subnet management effort | 45% |
| flannel host-gw | >= 1.2 | no | 1) easy to configure<br>2) no obvious performance downgrade | 1) spanning multiple subnets in one VLAN needs extra steps to add routing rules<br>2) cannot span multiple VLANs or datacenters<br>3) does not support network policy<br>4) network isolation needs extra subnet management effort | 93% |
| calico | >= 1.3 | yes | 1) bird agents configure routes with BGP on each node<br>2) flexible subnet expansion with IP address pool management<br>3) supports k8s network policy<br>4) enabling ipip supports crossing L2 VLAN boundaries | 1) complex architecture: managing bird and felix means a steeper learning curve for deployment, debugging, and operation<br>2) enabling ipip introduces additional packet encapsulation with a significant performance downgrade | BGP: 93%, BGP+ipip: 64% |
| canal (calico + flannel vxlan) | >= 1.3 | yes | 1) supports vxlan to cross L2<br>2) network policy support extended from calico's Felix<br>3) smooth migration from existing flannel to calico | 1) significant performance downgrade due to packet encapsulation and broadcast flood<br>2) double the complexity | 45% |
| calico + IaaS IP addresses (SL portable IP) + host affinity | >= 1.3 | yes | 1) consistent IP address space with the IaaS environment<br>2) network policy support | 1) no L2 support to cross VLANs<br>2) needs host-affinity subnet setup and integration with IaaS IP address space allocation | 93% |

Flannel

Summary of flannel over “vxlan” and “host-gw”

:star: vxlan is the default backend type of the ubuntu k8s deployment. The container-to-container result below (1.37 Gbits/sec) is ~45% of the raw host-to-host result (3.02 Gbits/sec).
:star: “host-gw” leverages the kernel route table (“ip route”) to route traffic to the target host. The container-to-container result below (2.84 Gbits/sec) is ~93% of the raw host-to-host result (3.02 Gbits/sec).
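
As a side note (not part of the original tests), the backend in use and the routes it installs can be checked directly; a minimal sketch, assuming flannel keeps its config under the default etcd prefix /coreos.com/network and that the pod network is the 172.31.0.0/16 range seen in the results below:

```bash
# Show the flannel network config; the backend type is under "Backend"/"Type".
etcdctl get /coreos.com/network/config
# e.g. {"Network":"172.31.0.0/16","Backend":{"Type":"host-gw"}}

# With host-gw each remote pod subnet appears as a plain kernel route via the
# peer host's IP; with vxlan the routes go through the flannel.1 vxlan device.
ip route | grep 172.31.
```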

container to container over vxlan

```
$ docker run -it --rm networkstatic/iperf3 -c 172.31.71.4
Connecting to host 172.31.71.4, port 5201
[  4] local 172.31.15.4 port 57807 connected to 172.31.71.4 port 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.60 GBytes  1.37 Gbits/sec  398    sender
[  4]   0.00-10.00  sec  1.60 GBytes  1.37 Gbits/sec         receiver
```

container to container over host-gw

```
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  3.31 GBytes  2.85 Gbits/sec  284    sender
[  4]   0.00-10.00  sec  3.31 GBytes  2.84 Gbits/sec         receiver
```

Calico

Summary of calico BGP and calico BGP+ipip

:star: “calico BGP (node-to-node mesh)” behaves like “host-gw”: about 93%+ of the performance of a direct connection.
:star: “calico BGP + ipip (node-to-node mesh)” behaves like “vxlan”: about 64% of the performance of a direct connection due to packet encapsulation, though it beats “vxlan” because proper ip routes avoid the broadcast flood. The performance impact is significant even for connections within the same VLAN.
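
A quick way to see which calico mode is actually in effect on a node (not part of the original tests; calicoctl and the tunl0 device name are assumptions based on a standard calico install):

```bash
# Show BGP peers in the node-to-node mesh.
calicoctl node status

# With plain BGP, routes to remote pod CIDRs are installed by bird against the
# host's interface; with ipip enabled they point at the tunl0 tunnel device.
ip route | grep -E 'bird|tunl0'
```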

BGP without ipip

host to host (in the same VLAN)

```
Connecting to host 10.177.83.70, port 5201
[  4] local 10.177.83.83 port 39122 connected to 10.177.83.70 port 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  3.71 GBytes  3.19 Gbits/sec  0      sender
[  4]   0.00-10.00  sec  3.71 GBytes  3.18 Gbits/sec         receiver
```

Container to container over BGP node-to-node mesh (same VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  3.45 GBytes  2.97 Gbits/sec  608    sender
[  4]   0.00-10.00  sec  3.45 GBytes  2.96 Gbits/sec         receiver
```

Container to container over BGP + ipip (same VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.37 GBytes  2.03 Gbits/sec  3      sender
[  4]   0.00-10.00  sec  2.36 GBytes  2.03 Gbits/sec         receiver
```

Container to container over BGP + ipip (cross VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.47 GBytes  2.12 Gbits/sec  556    sender
[  4]   0.00-10.00  sec  2.46 GBytes  2.12 Gbits/sec         receiver
```

host to host (cross VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  4.43 GBytes  3.81 Gbits/sec  1742   sender
[  4]   0.00-10.00  sec  4.43 GBytes  3.81 Gbits/sec         receiver
```

AWS VPC network

Summary of aws-vpc

:star: “aws-vpc” behaves like “host-gw”: about 93%+ of the performance of a direct connection.

```
host to host
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   853 MBytes   716 Mbits/sec  0      sender
[  4]   0.00-10.00  sec   853 MBytes   715 Mbits/sec         receiver

container to container
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   831 MBytes   697 Mbits/sec  7      sender
[  4]   0.00-10.00  sec   830 MBytes   696 Mbits/sec         receiver
```
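
For reference, a minimal sketch of selecting this backend, assuming “aws-vpc” here is flannel's aws-vpc backend with its default etcd prefix (the CIDR is illustrative); it programs the VPC route table instead of encapsulating traffic, which is why it tracks “host-gw” so closely:

```bash
# Write the flannel network config with the aws-vpc backend.
etcdctl set /coreos.com/network/config \
  '{"Network": "172.31.0.0/16", "Backend": {"Type": "aws-vpc"}}'
```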

NetworkPolicy for isolation (TBD)

kubectl annotate ns gamestop "net.beta.kubernetes.io/network-policy={\"ingress\": {\"isolation\": \"DefaultDeny\"}}"

```yaml
apiVersion: extensions/v1beta1
kind: NetworkPolicy
metadata:
  name: access-gamestop
  namespace: gamestop
spec:
  podSelector:
    matchLabels:
      role: db
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          project: gamestop
    ports:
    - protocol: TCP
      port: 50000
    - protocol: TCP
      port: 50001
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443
```
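
A quick way to check that DefaultDeny plus the policy above behave as intended (the pod IP and namespace labeling are placeholders/assumptions, not from the original post):

```bash
# Assumes the gamestop namespace is labeled project=gamestop and <db-pod-ip>
# is the IP of a pod labeled role=db.
# From a namespace selected by the policy the port should be reachable...
kubectl run np-test --rm -it --restart=Never --image=busybox --namespace=gamestop -- \
  nc -w 3 -zv <db-pod-ip> 50000
# ...while from any other namespace the connection should time out.
kubectl run np-test --rm -it --restart=Never --image=busybox --namespace=default -- \
  nc -w 3 -zv <db-pod-ip> 50000
```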

Performance Tuning of Spark ML Job

Posted on 2016-08-25 | In bigdata , analytic , tech |

Abstract

This Spark advanced tuning material is designed for attendees of the Big Data University online course. The article discusses some common ways to tune a Spark ML job.

Setup

Clone spark-ml-pipeline project

git clone git@github.com:big-data-university/spark-ml-pipeline.git
git checkout -b advanced

Tuning

1. Change parallelism (default partition number of RDD)

1.1 Default core number

val conf = new SparkConf().setAppName("CrossValidation Pipeline").setMaster("local[4]") // Use 4 cores running locally

(screenshot: core number)

1.2 Default parallelism
```scala
/*
 * Increase parallelism; the local default equals the number of cores (e.g. 8 on a Mac).
 */
conf.set("spark.default.parallelism", "4")
```
1.3 Repartition RDD to change parallelism on the fly
```scala
train.repartition(1)
train.repartition(10)
```

2. Efficient Serialize

  • Enable KryoSerializer for efficient persistence/caching and memory usage

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  • Add the kryo library in build.sbt

libraryDependencies += "com.esotericsoftware" % "kryo" % "4.0.0"

  • Persist/cache the RDD with serialized storage

train.persist(StorageLevel.MEMORY_ONLY_SER) // requires import org.apache.spark.storage.StorageLevel
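
To confirm that serialized caching actually shrinks the in-memory footprint, cached RDDs can be inspected through Spark's monitoring REST API; a minimal sketch, assuming the driver UI runs on the default port 4040 (<app-id> is a placeholder taken from the first call):

```bash
# List running applications, then inspect cached RDDs (name, storage level,
# memory used).
curl -s http://localhost:4040/api/v1/applications
curl -s http://localhost:4040/api/v1/applications/<app-id>/storage/rdd
```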

3. Reduce times of full GC

3.1 Enable GC verbose output for analysis

Add JVM parameters to enable verbose GC details: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops

/**
 * Enable GC trace for executor
 */
conf.set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops")
3.2 Determine the number of full GCs during the whole execution from the standard output.
cat <your_log_file> | grep "Full GC" | wc -l
3.3 Evaluate the size of RDD partitions
  1. dfs.block.size - the default value in Hadoop 2.0 is 128 MB. In local mode the corresponding parameter is fs.local.block.size (default 32 MB). It defines the default partition size.
  2. numPartitions - the default value is 0, which effectively defaults to 1. This is the parameter passed to sc.textFile(inputPath, numPartitions). It is used to decrease the partition size (increase the number of partitions) from the default defined by dfs.block.size.
  3. mapreduce.input.fileinputformat.split.minsize - the default value is 1 byte. It is used to increase the partition size (decrease the number of partitions) from the default defined by dfs.block.size.

println("fs.local.block.size: " + sc.getConf.get("fs.local.block.size"))
println("mapreduce.input.fileinputformat.split.minsize: " + sc.getConf.get("mapreduce.input.fileinputformat.split.minsize"))
println("size of trainRDD: " + SizeEstimator.estimate(train) + " Byte")
println("size of testRDD: " + SizeEstimator.estimate(test) + " Byte")
3.4 Adjust parallelism to reduce number of full GC
  1. Reduce parallelism when size of RDD < block size
  2. Increase parallelism when size of RDD > block size
3.5 Increase java heap for both driver JVM and executor to reduce GC
conf.set("spark.driver.memory", "2G") // Increase java heap for driver, -Xmx is illegal setting

conf.set("spark.executor.memory", "1G") //Increase to 2G per executor from default value 1G. -Xmx1G is illegal setting for spark executor.
3.6 Validate the number of full GCs and conclusion

Conclusion: execution time drops significantly after the number of full GCs is reduced.

4. Increase the number of executor instances in a cluster environment

Check the size and partition count of the RDD.

When the RDD size >> block size and the number of partitions >> executors * cores, increasing the number of executor instances allowed in the cluster significantly improves performance.

/**
 * Increase executor numbers to increase parallelism, only works for cluster env. standalone, yarn, mesos
 */
conf.set("spark.executor.instances", "4")

5. Reserve as much memory as possible for executors, but do not overload them

conf.set("spark.executor.memory", "1G") 

All in one
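
As a rough "all in one" recap of the knobs discussed above, the same settings can be collected into a single submit command; a sketch only, with the main class and jar path as illustrative placeholders:

```bash
# All tuning settings from this article in one submit command.
spark-submit \
  --class com.example.CrossValidationPipeline \
  --driver-memory 2G \
  --executor-memory 1G \
  --conf spark.default.parallelism=4 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.executor.instances=4 \
  --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops" \
  target/scala-2.11/spark-ml-pipeline.jar
```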

Research on Ubernetes (High availability zone) and federation clusters for Kubernetes

Posted on 2016-08-05 | In tech |

Abstract

  1. Minion nodes in multiple zones share the same master node (api server, etcd, etc.), so the high availability of the master node needs serious consideration. Nodes are labeled with their zone and failure domain so that workloads can be distributed across zones.

  2. PV affinity is used for zone awareness, but it is not ready yet. Most cloud providers (at least GCE and AWS) cannot attach their persistent volumes across zones, so when a pod with an attached volume is scheduled, the volume dictates the zone. This will be implemented using a new scheduler predicate (a hard constraint): VolumeZonePredicate. The PV is also labeled with the zone and region it was created in. For version 1.3.4, dynamic persistent volumes are always created in the zone of the cluster master (here us-central1-a / us-west-2a); this will be improved in a future version (issue #23330).

  3. The current kube-up.sh scripts depend on each cloud provider's kube-up/kube-down/kube-push implementation and SDK. gce and aws are well supported, but other cloud providers are not today (v1.3.4).

  4. An alternative solution is to set up multiple availability zones with the simple ubuntu/centos cloud provider.

    4.1 Key spec

    • shared etcd
    • shared flannel network
    • shared docker registry
    • shared apiserver, MASTER_INTERNAL_IP=172.20.0.9

    4.2 Solution

    Create the VMs first, then run kube-up.sh for ubuntu (centos) workers only, reusing the same master.

Prerequisite

pip install --upgrade awscli

Init install

curl -sS https://get.k8s.io | MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2 NUM_NODES=1 bash

Setup an multizone cluster on aws

  1. Set up a 1-node cluster in zone us-west-2a

    MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a NUM_NODES=1 ./cluster/kube-up.sh
    
  2. Add a 1-node cluster in zone us-west-2b, reusing the existing master

    KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2b NUM_NODES=1 KUBE_SUBNET_CIDR=172.20.1.0/24 MASTER_INTERNAL_IP=172.20.0.9 kubernetes/cluster/kube-up.sh
    
  3. Node labels for availability zones

```
kubectl get nodes --show-labels
NAME                                         STATUS     AGE   LABELS
ip-172-20-0-31.us-west-2.compute.internal    NotReady   2d    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.micro,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,kubernetes.io/hostname=ip-172-20-0-31.us-west-2.compute.internal
ip-172-20-1-138.us-west-2.compute.internal   Ready      2d    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.micro,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2b,kubernetes.io/hostname=ip-172-20-1-138.us-west-2.compute.internal
```
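
These failure-domain labels are what zone-aware scheduling and spreading rely on; a small sketch of using them directly (label keys taken from the output above):

```bash
# List only the nodes in a given availability zone.
kubectl get nodes -l failure-domain.beta.kubernetes.io/zone=us-west-2a

# Check how pods end up spread across the nodes (and therefore zones).
kubectl get pods -o wide --all-namespaces
```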

Create a PV with zone affinity, provisioned in the zone of the master node

```bash
kubectl create -f - <<EOF
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "claim1",
    "annotations": {
      "volume.alpha.kubernetes.io/storage-class": "foo"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "1Gi"
      }
    }
  }
}
EOF
```
```
NAME                                       CAPACITY   ACCESSMODES   STATUS   CLAIM            REASON   AGE   LABELS
pvc-feb20d75-593a-11e6-afe8-06d12fd78863   1Gi        RWO           Bound    default/claim1            42s   failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a
```

Destroy multiple-zone cluster

```bash
KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2b kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a kubernetes/cluster/kube-down.sh
```

Multiple cluster management

Federation panel

```bash
# Mandatory to build your own federation images with source code
FEDERATION=true KUBE_RELEASE_RUN_TESTS=n make quick-release
FEDERATION=true FEDERATION_PUSH_REPO_BASE=<Your private registry> ./build/push-federation-images.sh

# Setup federation container, dns, router, lb, kubeconfig and federation namespace
GOROOT=/root/go KUBE_ROOT=/root/k8s/src/k8s.io/kubernetes KUBERNETES_PROVIDER=aws DNS_ZONE_NAME=myfederation.aws FEDERATION_NAME=myfederation FEDERATION_PUSH_REPO_BASE=<your private registry> /root/k8s/src/k8s.io/kubernetes/federation/cluster/federation-up.sh

GOROOT=/root/go KUBE_ROOT=/root/k8s/src/k8s.io/kubernetes KUBERNETES_PROVIDER=aws DNS_ZONE_NAME=myfederation.aws FEDERATION_NAME=myfederation FEDERATION_PUSH_REPO_BASE=<your private registry> /root/k8s/src/k8s.io/kubernetes/federation/cluster/federation-down.sh
service "federation-apiserver" deleted
secret "default-token-k6v8j" deleted
secret "federation-apiserver-kubeconfig" deleted
namespace "federation" deleted
```