Eric Li's Blog



Research on container network of Kubernetes

Posted on 2016-11-14 | In container , network |

Network provider research

| Provider | k8s version | k8s network policy | Pros | Cons | Throughput (% of direct) |
| --- | --- | --- | --- | --- | --- |
| flannel vxlan | >= 1.2 | no | 1) easy to configure<br>2) easy to span VLANs/datacenters | 1) broadcast flood to 192.168.0.0/16, since no exact ip routes are set<br>2) performance downgrade<br>3) network isolation needs extra subnet management effort | 45% |
| flannel host-gw | >= 1.2 | no | 1) easy to configure<br>2) no obvious performance downgrade | 1) spanning multiple subnets in one VLAN needs extra steps to add routing rules<br>2) cannot span multiple VLANs or datacenters<br>3) does not support network policy<br>4) network isolation needs extra subnet management effort | 93% |
| calico | >= 1.3 | yes | 1) bird agents configure routes with BGP on each node<br>2) flexible subnet expansion with IP address pool management<br>3) supports k8s network policy<br>4) enabling ipip supports crossing L2 VLAN boundaries | 1) complex architecture: managing bird and felix means a steeper learning curve for deployment, debugging, and operation<br>2) enabling ipip introduces additional packet encapsulation with a significant performance downgrade | BGP: 93%, BGP+ipip: 64% |
| canal (calico + flannel vxlan) | >= 1.3 | yes | 1) supports vxlan to cross L2<br>2) network policy support extended from calico's Felix<br>3) smooth migration from existing flannel to calico | 1) significant performance downgrade due to packet encapsulation and broadcast flood<br>2) double the complexity | 45% |
| calico + IaaS IP addresses (SL portable IP) + host affinity | >= 1.3 | yes | 1) consistent IP address space with the IaaS environment<br>2) network policy support | 1) no L2 support to cross VLANs<br>2) needs host-affinity subnet setup and integration with IaaS IP address space allocation | 93% |

Flannel

Summary of flannel over “vxlan” and “host-gw”

:star: vxlan is the default backend type of the ubuntu k8s deployment. The container-to-container result below (1.37 Gbits/sec) is ~45% of the raw host-to-host result (3.02 Gbits/sec).
:star: “host-gw” leverages the kernel route table (“ip route”) to route traffic to the target host. The container-to-container result below (2.84 Gbits/sec) is ~93% of the raw host-to-host result (3.02 Gbits/sec).
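
As a side note (not part of the original tests), the backend in use and the routes it installs can be checked directly; a minimal sketch, assuming flannel keeps its config under the default etcd prefix /coreos.com/network and that the pod network is the 172.31.0.0/16 range seen in the results below:

```bash
# Show the flannel network config; the backend type is under "Backend"/"Type".
etcdctl get /coreos.com/network/config
# e.g. {"Network":"172.31.0.0/16","Backend":{"Type":"host-gw"}}

# With host-gw each remote pod subnet appears as a plain kernel route via the
# peer host's IP; with vxlan the routes go through the flannel.1 vxlan device.
ip route | grep 172.31.
```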

container to container over vxlan

```
$ docker run -it --rm networkstatic/iperf3 -c 172.31.71.4
Connecting to host 172.31.71.4, port 5201
[  4] local 172.31.15.4 port 57807 connected to 172.31.71.4 port 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.60 GBytes  1.37 Gbits/sec  398    sender
[  4]   0.00-10.00  sec  1.60 GBytes  1.37 Gbits/sec         receiver
```

container to container over host-gw

```
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  3.31 GBytes  2.85 Gbits/sec  284    sender
[  4]   0.00-10.00  sec  3.31 GBytes  2.84 Gbits/sec         receiver
```

Calico

Summary of calico BGP and calico BGP+ipip

:star: “calico BGP (node-to-node mesh)” behaves like “host-gw”: about 93%+ of the performance of a direct connection.
:star: “calico BGP + ipip (node-to-node mesh)” behaves like “vxlan”: about 64% of the performance of a direct connection due to packet encapsulation, though it beats “vxlan” because proper ip routes avoid the broadcast flood. The performance impact is significant even for connections within the same VLAN.
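
A quick way to see which calico mode is actually in effect on a node (not part of the original tests; calicoctl and the tunl0 device name are assumptions based on a standard calico install):

```bash
# Show BGP peers in the node-to-node mesh.
calicoctl node status

# With plain BGP, routes to remote pod CIDRs are installed by bird against the
# host's interface; with ipip enabled they point at the tunl0 tunnel device.
ip route | grep -E 'bird|tunl0'
```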

BGP without ipip

host to host (in the same VLAN)

```
Connecting to host 10.177.83.70, port 5201
[  4] local 10.177.83.83 port 39122 connected to 10.177.83.70 port 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  3.71 GBytes  3.19 Gbits/sec  0      sender
[  4]   0.00-10.00  sec  3.71 GBytes  3.18 Gbits/sec         receiver
```

Container to container over BGP node-to-node mesh (same VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  3.45 GBytes  2.97 Gbits/sec  608    sender
[  4]   0.00-10.00  sec  3.45 GBytes  2.96 Gbits/sec         receiver
```

Container to container over BGP + ipip (same VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.37 GBytes  2.03 Gbits/sec  3      sender
[  4]   0.00-10.00  sec  2.36 GBytes  2.03 Gbits/sec         receiver
```

Container to container over BGP + ipip (cross VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.47 GBytes  2.12 Gbits/sec  556    sender
[  4]   0.00-10.00  sec  2.46 GBytes  2.12 Gbits/sec         receiver
```

host to host (cross VLAN)

```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  4.43 GBytes  3.81 Gbits/sec  1742   sender
[  4]   0.00-10.00  sec  4.43 GBytes  3.81 Gbits/sec         receiver
```

AWS VPC network

Summary of aws-vpc

:star: “aws-vpc” behaves like “host-gw”: about 93%+ of the performance of a direct connection.

```
host to host
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   853 MBytes   716 Mbits/sec  0      sender
[  4]   0.00-10.00  sec   853 MBytes   715 Mbits/sec         receiver

container to container
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   831 MBytes   697 Mbits/sec  7      sender
[  4]   0.00-10.00  sec   830 MBytes   696 Mbits/sec         receiver
```
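
For reference, a minimal sketch of selecting this backend, assuming “aws-vpc” here is flannel's aws-vpc backend with its default etcd prefix (the CIDR is illustrative); it programs the VPC route table instead of encapsulating traffic, which is why it tracks “host-gw” so closely:

```bash
# Write the flannel network config with the aws-vpc backend.
etcdctl set /coreos.com/network/config \
  '{"Network": "172.31.0.0/16", "Backend": {"Type": "aws-vpc"}}'
```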

NetworkPolicy for isolation (TBD)

kubectl annotate ns gamestop "net.beta.kubernetes.io/network-policy={\"ingress\": {\"isolation\": \"DefaultDeny\"}}"

```yaml
apiVersion: extensions/v1beta1
kind: NetworkPolicy
metadata:
  name: access-gamestop
  namespace: gamestop
spec:
  podSelector:
    matchLabels:
      role: db
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          project: gamestop
    ports:
    - protocol: TCP
      port: 50000
    - protocol: TCP
      port: 50001
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443
```
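
A quick way to check that DefaultDeny plus the policy above behave as intended (the pod IP and namespace labeling are placeholders/assumptions, not from the original post):

```bash
# Assumes the gamestop namespace is labeled project=gamestop and <db-pod-ip>
# is the IP of a pod labeled role=db.
# From a namespace selected by the policy the port should be reachable...
kubectl run np-test --rm -it --restart=Never --image=busybox --namespace=gamestop -- \
  nc -w 3 -zv <db-pod-ip> 50000
# ...while from any other namespace the connection should time out.
kubectl run np-test --rm -it --restart=Never --image=busybox --namespace=default -- \
  nc -w 3 -zv <db-pod-ip> 50000
```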

Performance Tuning of Spark ML Job

Posted on 2016-08-25 | In bigdata , analytic , tech |

Abstract

This Spark advanced tuning material is designed for attendees of the Big Data University online course. The article discusses some common ways to tune a Spark ML job.

Setup

Clone spark-ml-pipeline project

git clone git@github.com:big-data-university/spark-ml-pipeline.git
git checkout -b advanced

Tuning

1. Change parallelism (default partition number of RDD)

1.1 Default core number

val conf = new SparkConf().setAppName("CrossValidation Pipeline").setMaster("local[4]") // Use 4 cores running locally

(screenshot: core number)

1.2 Default parallelism
```scala
/*
 * Increase parallelism; the local default equals the number of cores (e.g. 8 on a Mac).
 */
conf.set("spark.default.parallelism", "4")
```
1.3 Repartition RDD to change parallelism on the fly
```scala
train.repartition(1)
train.repartition(10)
```

2. Efficient Serialize

  • Enable KryoSerializer for efficient persistence/caching and memory usage

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  • Add the kryo library in build.sbt

libraryDependencies += "com.esotericsoftware" % "kryo" % "4.0.0"

  • Persist/cache the RDD with serialized storage

train.persist(StorageLevel.MEMORY_ONLY_SER) // requires import org.apache.spark.storage.StorageLevel
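
To confirm that serialized caching actually shrinks the in-memory footprint, cached RDDs can be inspected through Spark's monitoring REST API; a minimal sketch, assuming the driver UI runs on the default port 4040 (<app-id> is a placeholder taken from the first call):

```bash
# List running applications, then inspect cached RDDs (name, storage level,
# memory used).
curl -s http://localhost:4040/api/v1/applications
curl -s http://localhost:4040/api/v1/applications/<app-id>/storage/rdd
```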

3. Reduce times of full GC

3.1 Enable GC verbose output for analysis

Add JVM parameters to enable verbose GC details: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops

/**
 * Enable GC trace for executor
 */
conf.set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops")
3.2 Determine the number of full GCs during the whole execution from the standard output.
cat <your_log_file> | grep "Full GC" | wc -l
3.3 Evaluate the size of RDD partitions
  1. dfs.block.size - the default value in Hadoop 2.0 is 128 MB. In local mode the corresponding parameter is fs.local.block.size (default 32 MB). It defines the default partition size.
  2. numPartitions - the default value is 0, which effectively defaults to 1. This is the parameter passed to sc.textFile(inputPath, numPartitions). It is used to decrease the partition size (increase the number of partitions) from the default defined by dfs.block.size.
  3. mapreduce.input.fileinputformat.split.minsize - the default value is 1 byte. It is used to increase the partition size (decrease the number of partitions) from the default defined by dfs.block.size.

println("fs.local.block.size: " + sc.getConf.get("fs.local.block.size"))
println("mapreduce.input.fileinputformat.split.minsize: " + sc.getConf.get("mapreduce.input.fileinputformat.split.minsize"))
println("size of trainRDD: " + SizeEstimator.estimate(train) + " Byte")
println("size of testRDD: " + SizeEstimator.estimate(test) + " Byte")
3.4 Adjust parallelism to reduce number of full GC
  1. Reduce parallelism when size of RDD < block size
  2. Increase parallelism when size of RDD > block size
3.5 Increase java heap for both driver JVM and executor to reduce GC
conf.set("spark.driver.memory", "2G") // Increase java heap for driver, -Xmx is illegal setting

conf.set("spark.executor.memory", "1G") //Increase to 2G per executor from default value 1G. -Xmx1G is illegal setting for spark executor.
3.6 Validate the number of full GCs and conclusion

Conclusion: execution time drops significantly after the number of full GCs is reduced.

4. Increase the number of executor instances in a cluster environment

Check the size and partition count of the RDD.

When the RDD size >> block size and the number of partitions >> executors * cores, increasing the number of executor instances allowed in the cluster significantly improves performance.

/**
 * Increase executor numbers to increase parallelism, only works for cluster env. standalone, yarn, mesos
 */
conf.set("spark.executor.instances", "4")

5. Reserve as much memory as possible for executors, but do not overload them

conf.set("spark.executor.memory", "1G") 

All in one
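
As a rough "all in one" recap of the knobs discussed above, the same settings can be collected into a single submit command; a sketch only, with the main class and jar path as illustrative placeholders:

```bash
# All tuning settings from this article in one submit command.
spark-submit \
  --class com.example.CrossValidationPipeline \
  --driver-memory 2G \
  --executor-memory 1G \
  --conf spark.default.parallelism=4 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.executor.instances=4 \
  --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops" \
  target/scala-2.11/spark-ml-pipeline.jar
```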

Research on Ubernetes (High availability zone) and federation clusters for Kubernetes

Posted on 2016-08-05 | In tech |

Abstract

  1. Minion nodes in multiple zones share the same master node (api server, etcd, etc.), so the high availability of the master node needs serious consideration. Nodes are labeled with their zone and failure domain so that workloads can be distributed across zones.

  2. PV affinity is used for zone awareness, but it is not ready yet. Most cloud providers (at least GCE and AWS) cannot attach their persistent volumes across zones, so when a pod with an attached volume is scheduled, the volume dictates the zone. This will be implemented using a new scheduler predicate (a hard constraint): VolumeZonePredicate. The PV is also labeled with the zone and region it was created in. For version 1.3.4, dynamic persistent volumes are always created in the zone of the cluster master (here us-central1-a / us-west-2a); this will be improved in a future version (issue #23330).

  3. The current kube-up.sh scripts depend on each cloud provider's kube-up/kube-down/kube-push implementation and SDK. gce and aws are well supported, but other cloud providers are not today (v1.3.4).

  4. An alternative solution is to set up multiple availability zones with the simple ubuntu/centos cloud provider.

    4.1 Key spec

    • shared etcd
    • shared flannel network
    • shared docker registry
    • shared apiserver, MASTER_INTERNAL_IP=172.20.0.9

    4.2 Solution

    Create the VMs first, then run kube-up.sh for ubuntu (centos) workers only, reusing the same master.

Prerequisite

pip install --upgrade awscli

Init install

curl -sS https://get.k8s.io | MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2 NUM_NODES=1 bash

Setup an multizone cluster on aws

  1. Set up a 1-node cluster in zone us-west-2a

    MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a NUM_NODES=1 ./cluster/kube-up.sh
    
  2. Add a 1-node cluster in zone us-west-2b, reusing the existing master

    KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2b NUM_NODES=1 KUBE_SUBNET_CIDR=172.20.1.0/24 MASTER_INTERNAL_IP=172.20.0.9 kubernetes/cluster/kube-up.sh
    
  3. Node labels for availability zones

```
kubectl get nodes --show-labels
NAME                                         STATUS     AGE   LABELS
ip-172-20-0-31.us-west-2.compute.internal    NotReady   2d    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.micro,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,kubernetes.io/hostname=ip-172-20-0-31.us-west-2.compute.internal
ip-172-20-1-138.us-west-2.compute.internal   Ready      2d    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.micro,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2b,kubernetes.io/hostname=ip-172-20-1-138.us-west-2.compute.internal
```
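
These failure-domain labels are what zone-aware scheduling and spreading rely on; a small sketch of using them directly (label keys taken from the output above):

```bash
# List only the nodes in a given availability zone.
kubectl get nodes -l failure-domain.beta.kubernetes.io/zone=us-west-2a

# Check how pods end up spread across the nodes (and therefore zones).
kubectl get pods -o wide --all-namespaces
```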

Create a PV with zone affinity, provisioned in the zone of the master node

```bash
kubectl create -f - <<EOF
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "claim1",
    "annotations": {
      "volume.alpha.kubernetes.io/storage-class": "foo"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "1Gi"
      }
    }
  }
}
EOF
```
```
NAME                                       CAPACITY   ACCESSMODES   STATUS   CLAIM            REASON   AGE   LABELS
pvc-feb20d75-593a-11e6-afe8-06d12fd78863   1Gi        RWO           Bound    default/claim1            42s   failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a
```

Destroy multiple-zone cluster

```bash
KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2b kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a kubernetes/cluster/kube-down.sh
```

Multiple cluster management

Federation panel

```bash
# Mandatory to build your own federation images with source code
FEDERATION=true KUBE_RELEASE_RUN_TESTS=n make quick-release
FEDERATION=true FEDERATION_PUSH_REPO_BASE=<Your private registry> ./build/push-federation-images.sh

# Setup federation container, dns, router, lb, kubeconfig and federation namespace
GOROOT=/root/go KUBE_ROOT=/root/k8s/src/k8s.io/kubernetes KUBERNETES_PROVIDER=aws DNS_ZONE_NAME=myfederation.aws FEDERATION_NAME=myfederation FEDERATION_PUSH_REPO_BASE=<your private registry> /root/k8s/src/k8s.io/kubernetes/federation/cluster/federation-up.sh

GOROOT=/root/go KUBE_ROOT=/root/k8s/src/k8s.io/kubernetes KUBERNETES_PROVIDER=aws DNS_ZONE_NAME=myfederation.aws FEDERATION_NAME=myfederation FEDERATION_PUSH_REPO_BASE=<your private registry> /root/k8s/src/k8s.io/kubernetes/federation/cluster/federation-down.sh
service "federation-apiserver" deleted
secret "default-token-k6v8j" deleted
secret "federation-apiserver-kubeconfig" deleted
namespace "federation" deleted
```