Prevent corner case for infinite PrepareForMaintenance#3095
Prevent corner case for infinite PrepareForMaintenance#3095yadvr merged 2 commits intoapache:4.11from
Conversation
|
@blueorangutan package |
|
@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2490 |
|
@blueorangutan test |
|
@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Trillian test result (tid-3255)
|
| return CollectionUtils.isEmpty(failedMigrations) ? | ||
| setHostIntoMaintenance(host) : | ||
| setHostIntoErrorInMaintenance(host, failedMigrations); | ||
| } else if (retryHostMaintenance.containsKey(host.getId())) { |
There was a problem hiding this comment.
you do not need the else here. I mean, the other if body will always return. Therefore, you can simply follow with an IF. This will change a bit the formatting.
There was a problem hiding this comment.
You're right, I'll refactor it, thanks
| setHostIntoErrorInMaintenance(host, failedMigrations); | ||
| } else if (retryHostMaintenance.containsKey(host.getId())) { | ||
| Integer retriesLeft = retryHostMaintenance.get(host.getId()); | ||
| if (retriesLeft != null) { |
There was a problem hiding this comment.
If the retriesLeft is null, does it mean that ACS will retry forever?
There was a problem hiding this comment.
Yes, until the VMs are migrated to another host
GabrielBrascher
left a comment
There was a problem hiding this comment.
Code LGTM. Thanks, @nvazquez!
|
@blueorangutan package |
|
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2505 |
|
@blueorangutan test |
|
@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Trillian test result (tid-3273)
|
|
|
||
| private SearchBuilder<HostGpuGroupsVO> _gpuAvailability; | ||
|
|
||
| private Map<Long,Integer> retryHostMaintenance = new HashMap<>(); |
There was a problem hiding this comment.
This may be called by concurrent threads of execution, @nvazquez can you use a concurrent hashmap and/or instead of integer perhaps one of the atomic ints? I understand that normally per host we may hit this, but just to have a defensive implementation and usage.
|
@blueorangutan package |
|
@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2507 |
|
@blueorangutan test |
|
@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Trillian test result (tid-3275)
|
…e#3095) A corner case was found on 4.11.2 for apache#2493 leading to an infinite loop in state PrepareForMaintenance To prevent such cases, in which failed migrations are detected but still running on the host, this feature adds a new cluster setting host.maintenance.retries which is the number of retries before marking the host as ErrorInMaintenance if migration errors persist. How Has This Been Tested? - 2 KVM hosts, pick one which has running VMs as H - Block migrations ports on H to simulate failures on migrations: iptables -I OUTPUT -j REJECT -m state --state NEW -m tcp -p tcp --dport 49152:49215 -m comment --comment 'test block migrations' iptables -I OUTPUT -j REJECT -m state --state NEW -m tcp -p tcp --dport 16509 -m comment --comment 'test block migrations - Put host H in Maintenance - Observe that host is indefinitely in PrepareForMaintenance state (after this fix it goes into ErrorInMaintenance after retrying host.maintenance.retries times) (cherry picked from commit 13c81a8) Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Description
A corner case was found on 4.11.2 for #2493 leading to an infinite loop in state
PrepareForMaintenanceTo prevent such cases, in which failed migrations are detected but still running on the host, this feature adds a new cluster setting
host.maintenance.retrieswhich is the number of retries before marking the host asErrorInMaintenanceif migration errors persist.Types of changes
Screenshots (if appropriate):
How Has This Been Tested?
iptables -I OUTPUT -j REJECT -m state --state NEW -m tcp -p tcp --dport 49152:49215 -m comment --comment 'test block migrations' iptables -I OUTPUT -j REJECT -m state --state NEW -m tcp -p tcp --dport 16509 -m comment --comment 'test block migrationsPrepareForMaintenancestate (after this fix it goes intoErrorInMaintenanceafter retryinghost.maintenance.retriestimes)