xds: Fix load reporting when pick first is used for locality-routing.#11495
xds: Fix load reporting when pick first is used for locality-routing.#11495DNVindhya merged 6 commits intogrpc:masterfrom
Conversation
…mption that all addresses for a subchannel are in the same locality.
larry-safran
left a comment
There was a problem hiding this comment.
I've recommended a number of minor changes to comments
| /** | ||
| * An internal class. Do not use. | ||
| * | ||
| * <p>An interface to provide the address connected by subchannel. |
There was a problem hiding this comment.
This says it provides the address, but the method provides address attributes.
There was a problem hiding this comment.
Updated the interface name to InternalSubchannelAddressAttributes.
| public void shutdown() { | ||
| if (localityStats != null) { | ||
| localityStats.release(); | ||
| if (localityAtomicReference.get() != null) { |
There was a problem hiding this comment.
This can't possibly be null AFAICT
There was a problem hiding this comment.
You are right. Removed the null check.
| // attributes with its locality, including endpoints in LOGICAL_DNS clusters. | ||
| // In case of not (which really shouldn't), loads are aggregated under an empty | ||
| // locality. | ||
| if (locality == null) { |
There was a problem hiding this comment.
It shouldn't be, but is possible that locality is set but locality name is null. You should probably have an else clause that does a null check on localityName.
There was a problem hiding this comment.
This is existing code, and it seems in general code assumes the values to be non-null. I'd much rather we handle that centrally in io.grpc.xds.client.Locality.create() than lots of random not-possible-to-trigger checks. The data is primarily coming from a proto, so it will be "" when unset.
There was a problem hiding this comment.
Are you suggesting to move the locality name null check to io.grpc.xds.client.Locality.create()?
There was a problem hiding this comment.
Oh, I see now Larry's comment wasn't about the Locality struct, but instead the attribute. That had been discussed when the code was originally introduced:
#11133 (comment)
This PR is likely to be reverted (in some way) later, once the old PF policy goes away. So changes to the existing code are likely to be lost.
| } | ||
|
|
||
| /** | ||
| * Represents the locality attributes of a cluster. |
There was a problem hiding this comment.
It is actually the stats and the name. Name could be considered an attribute, the the stats aren't.
| checkState(allDropStats.containsKey(cluster) | ||
| && allDropStats.get(cluster).containsKey(edsServiceName), | ||
| "stats for cluster %s, edsServiceName %s not exits", cluster, edsServiceName); | ||
| "stats for cluster %s, edsServiceName %s not exist", cluster, edsServiceName); |
There was a problem hiding this comment.
s/not exist/do not exist/
| // attributes with its locality, including endpoints in LOGICAL_DNS clusters. | ||
| // In case of not (which really shouldn't), loads are aggregated under an empty | ||
| // locality. | ||
| if (locality == null) { |
There was a problem hiding this comment.
This is existing code, and it seems in general code assumes the values to be non-null. I'd much rather we handle that centrally in io.grpc.xds.client.Locality.create() than lots of random not-possible-to-trigger checks. The data is primarily coming from a proto, so it will be "" when unset.
|
|
||
| // This clusterLocality ideally should not be utilized. We derive locality | ||
| // information from the first address, even prior to the subchannel becoming ready. | ||
| // This is primarily for the purpose of facilitating load reporting in the pre-READY |
There was a problem hiding this comment.
There's no load reporting pre-READY. But an LB API could return the subchannel even before it is ready, and PF does this. We generally won't end up reporting load for such picks because the channel will ignore the selected (not-ready) subchannel, but we needed to handle the case.
There was a problem hiding this comment.
I see, updated the comment.
There was a problem hiding this comment.
Did you miss pushing the update?
There was a problem hiding this comment.
Discussed offline, we both are able to see the updated comment.
| private static final Attributes.Key<AtomicReference<ClusterLocality>> | ||
| ATTR_CLUSTER_LOCALITY = |
There was a problem hiding this comment.
You can put ATTR_CLUSTER_LOCALITY on the same line as the type.
This fixes Incorrect load reporting when using pick first as locality-routing policy.
Cause: ClusterImplLoadBalancer used to assume all the address are in the same locality and used first address in the list of addresses to determine locality. This is not always true leading to incorrect load reporting.
Fix: Get locality from address connected by the subchannel.
Closes #11434.