Setting Up Cluster Autoscaler on EKS
The Cluster Autoscaler documentation describes it as follows:
The cluster autoscaler on AWS scales worker nodes within any specified autoscaling group. It will run as a Deployment in your cluster.
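In other words, it manages the number of EKS worker nodes for you. This article walks through how to use it, based on the official documentation. The versions used are:
- Kubernetes (EKS) 1.14.9
- eksctl 0.13.0
- Cluster Autoscaler v1.14.7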
Preparing the EKS cluster
Create the EKS cluster with eksctl.
Creating the Master Node
Create an eksctl manifest file for the Master Node (the EKS control plane).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: "cluster-sample"
  region: "ap-northeast-1"
  version: "1.14"
  tags:
    'cfn-key-string': 'cfn-value-string'
vpc:
  id: "vpc-xxx"
  cidr: "xx.xx.xx.xx/xx"
  # autoAllocateIPv6: boolean
  clusterEndpoints:
    privateAccess: true
    publicAccess: true
  # extraCIDRs:
  #   cidr: String
  # nat:
  #   gateway: Disable, Single, HighlyAvailable
  # publicAccessCIDRs:
  #   - "xx.xx.xx.xx/32"
  # securityGroup: String
  # sharedNodeSecurityGroup: xxx
  subnets:
    public:
      ap-northeast-1a:
        id: "subnet-xxx"
        cidr: "xx.xx.xx.xx/xx"
      ap-northeast-1c:
        id: "subnet-xxx"
        cidr: "xx.xx.xx.xx/xx"
      ap-northeast-1d:
        id: "subnet-xxx"
        cidr: "xx.xx.xx.xx/xx"
# cloudWatch:
#   clusterLogging:
#     enableTypes: ["api", "audit", "authenticator", "controllerManager", "scheduler"]
$ eksctl create cluster -f cluster.yml
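Creating the Managed Node Group
A Managed Node Group corresponds to the EKS worker nodes. AWS describes it as follows:
With Amazon EKS managed node groups, you do not need to separately provision or register the EC2 instances that provide compute capacity to run your Kubernetes applications. You can create, update, or terminate nodes for your cluster with a single command. Nodes run using the latest EKS-optimized AMI in your AWS account, while node updates and terminations gracefully drain nodes so that your applications stay available.
Amazon EKS adds support for provisioning and managing Kubernetes worker nodes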
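As the quote says, EKS takes care of the kubectl drain step for you. kubectl drain safely evicts the pods running on a worker node that is about to be stopped, so that they can be rescheduled onto worker nodes that remain in service.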
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.
When kubectl drain returns successfully, that indicates that all of the pods (except the ones excluded as described in the previous paragraph) have been safely evicted (respecting the desired graceful termination period, and respecting the PodDisruptionBudget you have defined). It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
Use kubectl drain to remove a node from service
Create an eksctl manifest file for the Managed Node Group.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: "cluster-sample"
  region: "ap-northeast-1"
managedNodeGroups:
  - name: "sample-node-group"
    desiredCapacity: 2
    maxSize: 3
    minSize: 1
    volumeSize: 20
    amiFamily: "AmazonLinux2"
    availabilityZones:
      - "ap-northeast-1a"
      - "ap-northeast-1c"
      - "ap-northeast-1d"
    iam:
      # instanceProfileARN: String
      # instanceRoleARN: String
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      instanceRoleName: String
      withAddonPolicies:
        albIngress: true
        appMesh: true
        autoScaler: true
        certManager: true
        cloudWatch: true
        ebs: true
        efs: true
        externalDNS: true
        fsx: true
        imageBuilder: true
        xRay: true
    instanceType: "t3.small"
    labels:
      'label-key': 'label-value'
    ssh:
      allow: true
      # publicKey: String
      publicKeyName: "eks-worker-node"
      # publicKeyPath: String
      # sourceSecurityGroupIds:
      #   - String
    tags:
      tag-key: tag-value
The desiredCapacity, maxSize, and minSize parameters are the auto scaling settings for the worker nodes.
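As a rough illustration of how these three values interact, the sketch below clamps a requested node count to the range allowed by minSize and maxSize. The function name clamp_node_count is hypothetical and is not part of eksctl or Cluster Autoscaler; it only models the bounds, using the values from the manifest above as defaults.

```python
def clamp_node_count(requested: int, min_size: int = 1, max_size: int = 3) -> int:
    """Keep a requested node count within the node group's [min_size, max_size].

    Defaults mirror minSize/maxSize in the nodegroup manifest. This is an
    illustrative model only, not code from eksctl or Cluster Autoscaler.
    """
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(requested, max_size))


print(clamp_node_count(2))  # desiredCapacity from the manifest stays as-is
print(clamp_node_count(5))  # a larger request is capped at maxSize
print(clamp_node_count(0))  # a smaller request is raised to minSize
```

With these settings the node group can grow by only one node beyond the initial two, which is worth keeping in mind when testing scale-out.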
$ eksctl create nodegroup -f nodegroup.yml
Check the worker nodes.
$ kubectl get node -o wide
NAME                                            STATUS   ROLES    AGE    VERSION              INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-0-94.ap-northeast-1.compute.internal    Ready    <none>   3m5s   v1.14.8-eks-b8860f                               Amazon Linux 2   4.14.154-128.181.amzn2.x86_64   docker://18.9.9
ip-10-0-2-134.ap-northeast-1.compute.internal   Ready    <none>   3m2s   v1.14.8-eks-b8860f                               Amazon Linux 2   4.14.154-128.181.amzn2.x86_64   docker://18.9.9
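In the AWS console, you can see that an Auto Scaling group for the worker nodes has been created.
Configuring Cluster Autoscaler
IAM settings for the worker nodes
Because Cluster Autoscaler relies on EC2 Auto Scaling, the EKS worker nodes need permission to call the AWS Auto Scaling APIs. Check that the IAM role attached to the worker nodes includes the required policy. The required IAM policy is listed in the following official documentation.
Cluster Autoscaler node group considerations - Node group IAM policy
If you tag the Auto Scaling group (ASG) on the AWS side, Cluster Autoscaler can automatically discover which ASGs it should manage. Add the following tags to the ASG for that purpose.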
Key | Value |
k8s.io/cluster-autoscaler/<cluster-name> | owned |
k8s.io/cluster-autoscaler/enabled | true |
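The tag keys in the table are the same keys that appear in the --node-group-auto-discovery argument of the Cluster Autoscaler manifest used in this article. The sketch below derives both from a cluster name so that the two stay in sync. The helper names autoscaler_tags and auto_discovery_flag are hypothetical, introduced only for illustration.

```python
ENABLED_KEY = "k8s.io/cluster-autoscaler/enabled"


def cluster_key(cluster_name: str) -> str:
    """Tag key that marks an ASG as belonging to the given cluster."""
    return f"k8s.io/cluster-autoscaler/{cluster_name}"


def autoscaler_tags(cluster_name: str) -> dict:
    """Tags to put on the ASG, as listed in the table above."""
    return {cluster_key(cluster_name): "owned", ENABLED_KEY: "true"}


def auto_discovery_flag(cluster_name: str) -> str:
    """Matching command-line argument for the cluster-autoscaler container."""
    keys = [ENABLED_KEY, cluster_key(cluster_name)]
    return "--node-group-auto-discovery=asg:tag=" + ",".join(keys)


print(autoscaler_tags("cluster-sample"))
print(auto_discovery_flag("cluster-sample"))
```

For the cluster name cluster-sample, the generated argument is the same string that the manifest in this article passes to the container.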
Applying Cluster Autoscaler
Create the Cluster Autoscaler Deployment. Start by downloading the sample manifest file from GitHub.
In the sample manifest file, set the following parameters for the cluster-autoscaler command.
parameter | description |
node-group-auto-discovery | One or more definition(s) of node group auto-discovery |
balance-similar-node-groups | Detect similar node groups and balance the number of nodes between them |
skip-nodes-with-system-pods | If true cluster autoscaler will never delete nodes with pods from kube-system (except for DaemonSet or mirror pods) |
What are the parameters to CA?
Add the cluster-autoscaler.kubernetes.io/safe-to-evict="false" annotation to the cluster-autoscaler pod. With this setting, the worker node on which Cluster Autoscaler itself is running is no longer removed by scale-in.
What types of pods can prevent CA from removing a node?
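To make the effect of the annotation concrete, here is a deliberately simplified sketch of the check. It looks only at the safe-to-evict annotation and ignores every other condition that the FAQ entry above lists, so it is not how Cluster Autoscaler is actually implemented. The function name blocks_scale_in is hypothetical.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def blocks_scale_in(pod_annotations_on_node: list) -> bool:
    """Return True if any pod on the node is annotated safe-to-evict=false.

    Each list item is the annotations dict of one pod scheduled on the node.
    Simplified model: the real autoscaler evaluates more conditions.
    """
    return any(a.get(SAFE_TO_EVICT) == "false" for a in pod_annotations_on_node)


node_with_autoscaler = [
    {SAFE_TO_EVICT: "false"},          # the cluster-autoscaler pod
    {"prometheus.io/scrape": "true"},  # some other pod
]
node_without_autoscaler = [{}, {"prometheus.io/scrape": "true"}]

print(blocks_scale_in(node_with_autoscaler))
print(blocks_scale_in(node_without_autoscaler))
```

Note that the check reads the annotations of pods, which is why the annotation has to end up on the pod rather than only on the Deployment object.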
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
        # Set on the pod template: Cluster Autoscaler reads this annotation
        # from pods, so placing it only on the Deployment has no effect.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.14.7
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/cluster-sample
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
          env:
            - name: AWS_REGION
              value: ap-northeast-1
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"
Note that the Cluster Autoscaler version should match the Kubernetes version, so take care when running an older version of EKS. The Cluster Autoscaler documentation states:
We recommend using Cluster Autoscaler with the Kubernetes master version for which it was meant.
The sample above uses Cluster Autoscaler v1.14.7, which matches the latest EKS version (1.14) at the time of writing.
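A small check like the following can catch a mismatch before applying the manifest. It compares only the major and minor components, which is an assumption about what "the version for which it was meant" means in practice. The function names are hypothetical.

```python
def minor_version(version: str) -> tuple:
    """Extract (major, minor) from strings such as '1.14', 'v1.14.7',
    or 'v1.14.8-eks-b8860f'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def versions_match(cluster_version: str, autoscaler_version: str) -> bool:
    """True when both versions share the same major.minor (assumed rule)."""
    return minor_version(cluster_version) == minor_version(autoscaler_version)


print(versions_match("1.14", "v1.14.7"))                # cluster config vs. image tag
print(versions_match("v1.14.8-eks-b8860f", "v1.14.7"))  # node version vs. image tag
print(versions_match("1.13", "v1.14.7"))                # an older cluster
```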
$ kubectl apply -f autoscale.yml
serviceaccount/cluster-autoscaler created
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
deployment.apps/cluster-autoscaler created
$ kubectl get deployment/cluster-autoscaler -o wide -n kube-system
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS           IMAGES                                  SELECTOR
cluster-autoscaler   1/1     1            1           50m   cluster-autoscaler   k8s.gcr.io/cluster-autoscaler:v1.14.7   app=cluster-autoscaler
apiVersion: v1
kind: Namespace
metadata:
  name: sample
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: sample
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: container-nginx
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            limits:
              cpu: 200m
              memory: 512Mi
            requests:
              cpu: 200m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: sample
spec:
  type: ClusterIP
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: nginx
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: sample-pdb
  # A PDB only selects pods in its own namespace, so it must live
  # in the same namespace as the nginx pods.
  namespace: sample
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: nginx
$ kubectl apply -f nginx.yml
namespace/sample created
deployment.apps/nginx created
service/nginx created
poddisruptionbudget.policy/sample-pdb created
$ kubectl get pods -o wide -n sample
NAME                     READY   STATUS    RESTARTS   AGE     IP       NODE                                            NOMINATED NODE   READINESS GATES
nginx-69ffbfc87b-dbrxk   1/1     Running   0          2m22s            ip-10-0-0-94.ap-northeast-1.compute.internal    <none>           <none>
nginx-69ffbfc87b-q2jt8   1/1     Running   0          2m22s            ip-10-0-2-134.ap-northeast-1.compute.internal   <none>           <none>
$ kubectl scale --replicas=5 deployment/nginx -n sample
The number of pods is now five, but one pod remains Pending.
$ kubectl get pods -o wide -n sample
NAME                     READY   STATUS    RESTARTS   AGE     IP       NODE                                            NOMINATED NODE   READINESS GATES
nginx-69ffbfc87b-875lp   0/1     Pending   0          18s     <none>   <none>                                          <none>           <none>
nginx-69ffbfc87b-dbrxk   1/1     Running   0          8m52s            ip-10-0-0-94.ap-northeast-1.compute.internal    <none>           <none>
nginx-69ffbfc87b-lvq5f   1/1     Running   0          18s              ip-10-0-0-94.ap-northeast-1.compute.internal    <none>           <none>
nginx-69ffbfc87b-pgl9v   1/1     Running   0          18s              ip-10-0-2-134.ap-northeast-1.compute.internal   <none>           <none>
nginx-69ffbfc87b-q2jt8   1/1     Running   0          8m52s            ip-10-0-2-134.ap-northeast-1.compute.internal   <none>           <none>
Checking the resource usage of a worker node shows that its memory requests are already close to the limit.
$ kubectl describe nodes ip-10-0-2-134.ap-northeast-1.compute.internal
Name:               ip-10-0-2-134.ap-northeast-1.compute.internal
...
Non-terminated Pods:         (5 in total)
  Namespace     Name                                  CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------     ----                                  ------------   ----------   ---------------   -------------   ---
  kube-system   aws-node-jrcm4                        10m (0%)       0 (0%)       0 (0%)            0 (0%)          23m
  kube-system   cluster-autoscaler-54c755c8f9-sfghr   100m (5%)      100m (5%)    300Mi (21%)       300Mi (21%)     18m
  kube-system   kube-proxy-xv45g                      100m (5%)      0 (0%)       0 (0%)            0 (0%)          23m
  sample        nginx-69ffbfc87b-pgl9v                200m (10%)     200m (10%)   512Mi (37%)       512Mi (37%)     5m39s
  sample        nginx-69ffbfc87b-q2jt8                200m (10%)     200m (10%)   512Mi (37%)       512Mi (37%)     14m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                     Requests       Limits
  --------                     --------       ------
  cpu                          610m (31%)     500m (25%)
  memory                       1324Mi (96%)   1324Mi (96%)
  ephemeral-storage            0 (0%)         0 (0%)
  attachable-volumes-aws-ebs   0              0
...
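The worker nodes in this setup are t3.small instances (2 vCPUs, 2 GiB of memory), so a third nginx pod requesting 512Mi does not fit on this node.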
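The arithmetic behind the Pending pod can be checked directly from the kubectl describe output. The node's allocatable memory is elided from that output, so the value of about 1379Mi used below is an estimate back-calculated from "1324Mi (96%)" and should be treated as an assumption. The function name fits is hypothetical.

```python
# Memory requests already on the node, in Mi, from `kubectl describe nodes`.
REQUESTED_MI = 300 + 512 + 512  # cluster-autoscaler + two nginx pods

# Assumption: allocatable memory estimated from "1324Mi (96%)"; the real
# value is not shown in the output above.
ALLOCATABLE_MI_ESTIMATE = 1379

NGINX_REQUEST_MI = 512  # memory request of one more nginx pod


def fits(allocatable_mi: int, requested_mi: int, pod_request_mi: int) -> bool:
    """True if the pod's memory request still fits on the node."""
    return requested_mi + pod_request_mi <= allocatable_mi


print(REQUESTED_MI)                            # total memory requested on the node
print(ALLOCATABLE_MI_ESTIMATE - REQUESTED_MI)  # estimated headroom in Mi
print(fits(ALLOCATABLE_MI_ESTIMATE, REQUESTED_MI, NGINX_REQUEST_MI))
```

The scheduler decides on requests, not on actual usage, so the pod stays Pending even if the running containers use far less memory than they request.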
After a while, Cluster Autoscaler launches a new worker node.
$ kubectl get node
NAME                                            STATUS   ROLES    AGE     VERSION
ip-10-0-0-94.ap-northeast-1.compute.internal    Ready    <none>   21m     v1.14.8-eks-b8860f
ip-10-0-1-35.ap-northeast-1.compute.internal    Ready    <none>   3m37s   v1.14.8-eks-b8860f
ip-10-0-2-134.ap-northeast-1.compute.internal   Ready    <none>   21m     v1.14.8-eks-b8860f
The pod that was Pending then starts on the new worker node.
$ kubectl get pods -o wide -n sample
NAME                     READY   STATUS    RESTARTS   AGE     IP       NODE                                            NOMINATED NODE   READINESS GATES
nginx-69ffbfc87b-875lp   1/1     Running   0          3m54s            ip-10-0-1-35.ap-northeast-1.compute.internal    <none>           <none>
nginx-69ffbfc87b-dbrxk   1/1     Running   0          12m              ip-10-0-0-94.ap-northeast-1.compute.internal    <none>           <none>
nginx-69ffbfc87b-lvq5f   1/1     Running   0          3m54s            ip-10-0-0-94.ap-northeast-1.compute.internal    <none>           <none>
nginx-69ffbfc87b-pgl9v   1/1     Running   0          3m54s            ip-10-0-2-134.ap-northeast-1.compute.internal   <none>           <none>
nginx-69ffbfc87b-q2jt8   1/1     Running   0          12m              ip-10-0-2-134.ap-northeast-1.compute.internal   <none>           <none>
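If you reduce the number of replicas, the number of worker nodes decreases as well. Once Cluster Autoscaler judges a worker node to be unneeded, termination of the node starts 10 minutes later by default, because the Cluster Autoscaler parameter scale-down-unneeded-time defaults to 10 minutes.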
What are the parameters to CA?
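The timing rule can be modeled in a few lines. This is only a sketch of the scale-down-unneeded-time threshold described above, with a hypothetical function name; the real autoscaler tracks more state than a single timestamp.

```python
from datetime import datetime, timedelta

# Default of the scale-down-unneeded-time parameter, as stated above.
SCALE_DOWN_UNNEEDED_TIME = timedelta(minutes=10)


def eligible_for_scale_down(
    unneeded_since: datetime,
    now: datetime,
    unneeded_time: timedelta = SCALE_DOWN_UNNEEDED_TIME,
) -> bool:
    """True once the node has been continuously unneeded for long enough."""
    return now - unneeded_since >= unneeded_time


marked = datetime(2020, 1, 1, 12, 0)  # arbitrary example timestamp
print(eligible_for_scale_down(marked, marked + timedelta(minutes=9)))
print(eligible_for_scale_down(marked, marked + timedelta(minutes=10)))
```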
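The nginx Deployment used in this test has the following PodDisruptionBudget configured.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: sample-pdb
  # A PDB only selects pods in its own namespace, so it must live
  # in the same namespace as the nginx pods.
  namespace: sample
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: nginx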
An Application Owner can create a PodDisruptionBudget object (PDB) for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions.
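For example, when Cluster Autoscaler scales in and a worker node is terminated, the pods on that node are evicted at the same time. If pods are unevenly distributed across worker nodes and all pods of a given Deployment sit on that one node, there is a moment when none of its pods is running. (An uneven pod distribution is a problem in its own right, of course.)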
The .spec.maxUnavailable set above is the number of pods matching the selector that may be unavailable after an eviction. The Kubernetes documentation describes the two available options as follows:
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
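The following sketch shows how the two settings translate into the number of evictions that are allowed at a given moment. It is a simplified model rather than the Kubernetes implementation: the function names are hypothetical, and rounding percentages up is stated here as my understanding of the documented behavior.

```python
def resolve(value, expected_pods: int) -> int:
    """Turn an absolute number or a percentage string such as '50%' into a
    pod count. Percentages are rounded up (assumed to match Kubernetes)."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return -(-expected_pods * percent // 100)  # integer ceiling division
    return int(value)


def allowed_disruptions(expected_pods, healthy_pods,
                        max_unavailable=None, min_available=None) -> int:
    """Number of pods that may be evicted right now under the budget."""
    if (max_unavailable is None) == (min_available is None):
        raise ValueError("specify exactly one of max_unavailable / min_available")
    if max_unavailable is not None:
        desired_healthy = max(0, expected_pods - resolve(max_unavailable, expected_pods))
    else:
        desired_healthy = resolve(min_available, expected_pods)
    return max(0, healthy_pods - desired_healthy)


# sample-pdb from this article: maxUnavailable: 1 with five nginx replicas.
print(allowed_disruptions(5, 5, max_unavailable=1))  # all pods healthy
print(allowed_disruptions(5, 4, max_unavailable=1))  # one pod already down
```

With maxUnavailable: 1, a node drain can evict only one nginx pod at a time and has to wait for its replacement to become healthy before evicting the next one.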