The Technical Stack
Decades of evolution in IT teach us that the boundaries and interactions between the parts of a system are a key part of its design. Efficiency is at stake, but even more important is the ability to maintain the program. Each part of the system is exposed to the others via its API.
Let's be honest: designing a good user interface requires different skills than designing a good backend. I have no recent UI experience, and any UI I designed would end up as server-side generated HTML pages. Instead, I dedicate my time to the backend: a well-designed set of interconnected services, each exposing a clean API with the proper instrumentation to secure and observe it.
My personal conviction leads me to split the system into isolated services (where a "service" stands for a set of network processes serving its API). Despite the burden of maintaining connections between the services and the latency introduced in their communications, this split opens the door to per-service scalability and instrumentation.
Standards have appeared, with different trade-offs. I chose mine: all the backend code is written in Golang. gRPC is the middleware shared by all our microservices. The persistence layer of each service currently depends on the type of service, but the trend is to stick to either object storage or key/value storage:
- maps: the local filesystem is currently used, with one file per map. There are plans to locate the maps on an Object Storage platform.
- events: a local RocksDB database is currently used. There are plans for a TiKV storage.
- region: the local filesystem is currently used. There are plans for a TiKV storage.
Should we centralize the logs? No.
It sounds weird, doesn't it? The purpose of such centralization is to follow the journey of a request across the whole distributed system. There are other ways to achieve this, and I want to place a bet on observability instead.
Log traces remain very interesting, and I'll never claim the opposite, particularly in debug mode. They matter much less in production, except in abnormal situations (slow requests, errors). One line in the access log per request seems acceptable. But no centralization: just a local dump, a retention of 2 or 3 days, and log rotation.
Observability at the core
Fine-grained observability of the execution of RPC calls is provided by OpenTelemetry. Traces are generated in a side-car agent. By default, the sandbox installation offers an all-in-one setup with its own Prometheus storage and a dashboard.
Best-in-class Authorization & Authentication
The ORY suite is used for AAA purposes. At the gate, an HAProxy enforces the security policy with the help of an ORY Oathkeeper. OpenID Connect is used for authentication via ORY Hydra. A local user registry is available thanks to ORY Kratos, acting as a source for ORY Hydra. The authorization of users on their cities is achieved with ORY Keto. Security is also enforced behind the gate, in the API microservice, which requires a valid JSON Web Token for each RPC.
The monitoring of the platform is achieved with Prometheus. A set of exporters is in place for the gRPC services.
The monitoring of the hosts is achieved with Netdata. No data from Netdata is centralized in Prometheus.