Is your feature request related to a problem? Please describe.
In Compass, when inserting a new Asset, the payload looks as follows:
```json
{
  "urn": "main-postgres:my-database.orders",
  "type": "table",
  "service": "postgres",
  "name": "orders",
  "data": {
    "database": "my-database",
    "namespace": "tenant-1"
  }
}
```
Separating assets by tenant can only be done via key/value attributes inside data. On insertion, a namespace key can be set, and queries can then filter on namespace as the key. This design looks suitable for small-scale deployments but will start to break as the number of tenants increases. Compass uses two data repositories (Postgres & ES), and both stores suffer from this flaw, which will eventually limit scale.
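For illustration, on the Postgres side a tenant-scoped lookup today has to filter on the JSON attribute; a minimal sketch, assuming the table is named assets with a JSONB column data:

```sql
-- Every tenant-scoped read digs into the generic JSON blob.
SELECT *
FROM assets
WHERE data->>'namespace' = 'tenant-1';
```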
Postgres
One quick fix is to create a btree index on (data->>'namespace') in the assets table. It can improve querying speed, but the design still doesn't mandate whether each asset really belongs to a namespace or not. The next step is adding an application-level handler that simply adds a default namespace id if one is not supplied in the insert request. That means we are mandating a key in a blob of JSON data that was supposed to be generic. Okay, maybe we can add this to the documentation, but I still feel we should have a dedicated namespace column, which in the future could be utilized for sharding the database if Postgres ever grows to a substantial size, which I feel it will, looking at the use case of what Compass is doing (ingesting a whole company's metadata).
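A minimal sketch of that quick fix, under the same table-name assumption as above (the index name is made up):

```sql
-- Expression index (btree by default) on the JSON attribute; speeds up
-- equality filters on data->>'namespace' but cannot enforce that the
-- key is present on every asset.
CREATE INDEX idx_assets_data_namespace ON assets ((data->>'namespace'));
```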
ElasticSearch
For each service type (kafka/table/etc.), Compass creates its own index in ElasticSearch. All tenants share all the shards and indexes across the ES cluster. To segregate tenant-specific queries, Compass uses filters over data keys, just like with Postgres. One problem with this design is that every query spans the complete ES cluster to work out which document belongs to which tenant. For example, if tenant A has 1 GB of data and tenant B has 100 GB, every request either tenant makes through Compass will search 101 GB of documents. Ideally, a tenant A search should be limited to its 1 GB (or, in the worst case, a couple more GB), not the whole cluster.
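For illustration, a tenant-scoped search today looks roughly like the sketch below (the exact field names are assumptions based on the payload above). Without routing, this request fans out to every shard of the index:

```json
GET /table/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "data.namespace": "tenant-1" } }
      ],
      "must": [
        { "match": { "name": "orders" } }
      ]
    }
  }
}
```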
Describe the solution you'd like
To support multi-tenancy from the ground up, I propose the following.
Postgres
Add a new namespace column to the Assets table. Our APIs can choose to accept the namespace as one more field; if it's passed in the request, it will be populated in the database, but if not, it can be set to the null uuid (0000...). So in the case where users don't care about multi-tenancy, everything will be pushed to a single null namespace, but individual tenants can also query assets by namespace id.
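A sketch of the corresponding migration, assuming the same assets table as above; the all-zero nil UUID stands in for "no tenant":

```sql
-- Dedicated tenant column; defaulting to the nil UUID means existing
-- single-tenant deployments keep working without any API change.
ALTER TABLE assets
  ADD COLUMN namespace uuid NOT NULL
  DEFAULT '00000000-0000-0000-0000-000000000000';

-- Plain btree index so tenant-scoped queries no longer scan the JSON blob.
CREATE INDEX idx_assets_namespace ON assets (namespace);
```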
ElasticSearch
One index for all services and tenants, but all requests for a single tenant should be routed to a unique shard. Compass can categorize tenants into two tiers, shared and dedicated. For shared tenants, all requests will be routed by namespace id to a single shard of the shared index. For dedicated tenants, each tenant can have its own index. Note, a single index will have N number of types, the same as the number of services supported in Compass. This design will ensure that all document insert/query requests are confined to a single shard (in the shared case) or a single index (in the dedicated case).
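ElasticSearch supports this shard pinning via custom routing; a minimal sketch for a shared tenant, assuming a single compass index (the index name is made up) and the namespace id as the routing key:

```json
PUT /compass/_doc/main-postgres:my-database.orders?routing=tenant-1
{
  "urn": "main-postgres:my-database.orders",
  "type": "table",
  "service": "postgres",
  "name": "orders",
  "namespace": "tenant-1"
}

GET /compass/_search?routing=tenant-1
{
  "query": {
    "bool": {
      "filter": [ { "term": { "namespace": "tenant-1" } } ],
      "must": [ { "match": { "name": "orders" } } ]
    }
  }
}
```

The filter is still needed on top of the routing, because several shared tenants can hash to the same shard; routing only narrows the search from the whole cluster to that one shard.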

