PalmInfrastructure

## General Infrastructure and Services Overview

Description of the infrastructure setup, including Kubernetes on EKS.

Our infrastructure is built upon the foundation of Kubernetes running on Amazon Elastic Kubernetes Service (EKS). Kubernetes serves as the orchestrator for our containerized applications, providing a robust and scalable platform for deploying, managing, and scaling our microservices architecture.

### Kubernetes on EKS

Amazon EKS simplifies the process of deploying, managing, and scaling Kubernetes clusters in the AWS cloud. By leveraging EKS, we benefit from Amazon's managed Kubernetes service, which handles underlying infrastructure management tasks such as server provisioning, patching, and maintenance, allowing us to focus on deploying and managing our applications.

Key features and benefits:

- **Scalability:** EKS seamlessly scales our Kubernetes clusters to accommodate varying workload demands, ensuring high availability and performance.
- **Reliability:** EKS provides a highly available control plane across multiple AWS Availability Zones, enhancing the resilience of our Kubernetes clusters.
- **Integration:** EKS integrates seamlessly with other AWS services, enabling us to leverage a wide range of AWS features and services within our Kubernetes environment.

## Communication Mechanisms Between Services

In our architecture, communication between services is crucial for seamless interaction and data exchange. We use several mechanisms to facilitate communication and collaboration between the different components of our system.

### Messaging Queue with RabbitMQ

The PalmDAO and PalmValidator microservices communicate with each other using RabbitMQ, a robust messaging queue system. RabbitMQ enables asynchronous communication between services, facilitating decoupling and scalability. Both PalmDAO and PalmValidator are scalable, allowing them to handle varying workloads efficiently.

### Data Storage and Retrieval with MongoDB

PalmDAO writes information to MongoDB, which serves as the primary data store for our application. MongoDB provides a flexible and scalable NoSQL database solution, enabling PalmDAO to store and retrieve data efficiently.

### PalmValidator-Discord Service

PalmValidator-Discord is a service dedicated to verifying Discord accounts. Unlike PalmDAO and PalmValidator, PalmValidator-Discord does not scale horizontally and runs as a single instance.

### Metrics Retrieval with PalmPlatformMetrics

PalmPlatformMetrics is responsible for retrieving metrics from MongoDB related to platform usage and user activity. It connects to MongoDB to gather the relevant metrics, which Grafana then uses for visualization and analysis. This service contributes to monitoring and analyzing platform performance and user engagement.

## Scalability Considerations

Scalability is a fundamental aspect of our infrastructure design, ensuring that the system can efficiently handle varying workload demands and accommodate growth without compromising performance or reliability. We employ several key strategies within our Kubernetes environment.

### Horizontal Pod Autoscaling (HPA)

The Horizontal Pod Autoscaler (HPA) dynamically adjusts the number of replicas for our microservices based on observed CPU utilization or other custom metrics. This automatic scaling allows the system to respond to changes in workload demand, ensuring optimal resource utilization and consistent performance under varying loads.

### Efficient Resource Allocation

We carefully allocate resources such as CPU and memory to our microservices based on their individual requirements and performance characteristics. By right-sizing resource requests and limits, we ensure efficient utilization of the underlying infrastructure, maximizing scalability and minimizing resource contention.

### Stateless Architecture

Embracing a stateless architecture for our microservices enables horizontal scaling by allowing multiple instances of the same service to handle incoming requests independently. Stateless services are easy to deploy and scale, as new instances can be added or removed dynamically without impacting overall system functionality.

### Load Balancing

Load balancing distributes incoming traffic across multiple instances of our microservices, ensuring optimal utilization of resources and preventing any single instance from becoming overwhelmed. Load balancing enhances both scalability and fault tolerance, enabling the system to handle increased traffic seamlessly.

### Horizontal Pod Autoscaler (HPA) Manifest Snippet

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: palmvalidator-autoscaler
  namespace: palmdao
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: palmvalidator
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 95
```

## Brief Rationale Behind the Chosen Architecture

Our architecture is crafted to optimize scalability, resilience, automation, and ease of management, leveraging a blend of modern technologies and best practices. Here is a succinct explanation of why we chose this architecture.

### Microservices Architecture

We embraced a microservices architecture to foster agility and scalability. By breaking the application down into smaller, modular services, we enable independent development, deployment, and scaling, facilitating faster iteration and innovation.

### Kubernetes Orchestration

Kubernetes serves as the backbone of our infrastructure, providing robust container orchestration capabilities. With Kubernetes on EKS, we benefit from a managed environment that automates cluster management, scaling, and updates, allowing us to focus on building and deploying resilient applications.

### CI/CD Pipelines with GitHub Actions

Integration with GitHub repositories and CI/CD pipelines using GitHub Actions streamlines our development and deployment workflows. Automated testing, builds, and deployments ensure the rapid and reliable delivery of changes to production, enhancing developer productivity and software quality.
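To make the autoscaling behaviour concrete, here is a hedged sketch of the core scaling rule the HPA controller applies (desired replicas scale with the ratio of observed to target utilization, clamped to the bounds set in the HPA manifest above); the function name and sample values are illustrative, not part of our configuration:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Sketch of the HPA scaling rule: scale proportionally to the ratio of
    observed to target utilization, clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# With the manifest above (min 2, max 6, target 95% CPU), sustained 190%
# average utilization on 2 replicas suggests scaling to 4 replicas.
print(desired_replicas(2, 190, 95, 2, 6))  # → 4
```

The clamp to `minReplicas`/`maxReplicas` is what keeps the service between two and six pods regardless of how far utilization drifts from the target.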
### Infrastructure Management with GitOps

We adopted a GitOps approach for managing our Kubernetes configurations and deployments. ArgoCD, a GitOps continuous delivery tool, automates the deployment process by synchronizing the desired state stored in Git repositories with the actual state of our Kubernetes clusters, ensuring consistency and reliability.

### High Availability and Disaster Recovery

Our architecture prioritizes high availability and disaster recovery. Services such as RabbitMQ and MongoDB are deployed as replica sets with multiple instances to withstand failures. Additionally, robust backup and failover procedures are in place to mitigate the impact of catastrophic events and ensure business continuity.

### Monitoring and Alerting

Comprehensive monitoring tools, including Grafana, Prometheus, and Loki, provide real-time visibility into the health and performance of our infrastructure and applications. UptimeKuma enhances proactive monitoring with alerting via Telegram, enabling a timely response to incidents and performance anomalies.

### Security and Secrets Management

Security is ingrained into our architecture from the ground up. External Secrets, integrated with AWS Secrets Manager, centralizes and secures the management of sensitive data and credentials within our Kubernetes environment, reducing the risk of exposure and unauthorized access.

## Disaster Recovery Document

### Overview of Disaster Recovery Strategies

**Risk assessment and business impact analysis.** This section focuses on conducting a comprehensive risk assessment and business impact analysis, identifying potential threats and their potential impact on business operations. It evaluates the likelihood and severity of various disaster scenarios, such as natural disasters, cyberattacks, or system failures, and assesses their potential consequences for critical business functions and services. By understanding the risks and their potential impact, organizations can prioritize resources and develop effective disaster recovery strategies to mitigate these risks and ensure business continuity.

**Backup and data protection.** This section focuses on implementing backup and data protection measures to safeguard critical data and ensure its availability in the event of a disaster. It includes strategies for regular data backups, off-site storage, and encryption to protect data integrity and confidentiality. Robust backup and data protection mechanisms minimize the risk of data loss and expedite recovery efforts in the event of a disaster.

**High availability architecture.** This section elaborates on implementing high availability architecture to ensure continuous operation and resilience in the face of disruptions. It includes deploying services as replica sets with redundant components, such as load balancers and failover mechanisms, to maintain service availability and data durability. By leveraging high availability architecture, organizations can minimize downtime and ensure uninterrupted access to critical services during disasters.

**Disaster recovery planning and testing.** Here the focus is on developing comprehensive disaster recovery plans and conducting regular testing exercises to validate their effectiveness. This involves defining roles and responsibilities, establishing communication protocols, and documenting step-by-step procedures for disaster response and recovery. Regular testing ensures that recovery procedures stay up to date and that personnel are prepared to execute them effectively during emergencies.

**Failover and redundancy.** This section emphasizes the importance of failover and redundancy mechanisms in ensuring continuous operation and minimizing service disruptions. It includes configuring redundant components, such as servers, networks, and data centers, to automatically take over operations in the event of a failure. By implementing failover and redundancy strategies, organizations can mitigate the impact of hardware or software failures and maintain service availability during disasters.

**Cloud-based disaster recovery.** Here the focus is on leveraging cloud-based disaster recovery solutions to enhance resilience and flexibility. This includes utilizing cloud infrastructure and services for data backup, replication, and recovery, as well as implementing Disaster Recovery as a Service (DRaaS) solutions for rapid recovery and scalability. By embracing cloud-based disaster recovery, organizations can achieve cost-effective and scalable recovery capabilities while minimizing infrastructure complexity.

**Incident response and communication.** This section highlights the importance of establishing effective incident response procedures and communication channels to coordinate response efforts and ensure timely notifications during disasters. It includes defining incident response teams, establishing escalation paths, and implementing communication tools and protocols for notifying stakeholders and coordinating response activities. By fostering a proactive incident response culture and maintaining open communication channels, organizations can minimize the impact of disasters and expedite recovery efforts.

## Specific Disaster Scenarios and Their Potential Impact

While our platform is designed with resilience and redundancy in mind, it is crucial to anticipate and prepare for potential disaster scenarios that could impact the availability, integrity, and security of our services. Below we outline specific disaster scenarios and their potential impact on our platform.

- **Infrastructure outage.** An infrastructure outage, whether due to hardware failure, network issues, or cloud provider downtime, could lead to service disruptions and downtime for our platform. This could impact user access, disrupt business operations, and lead to potential revenue loss.
- **Natural disaster.** Natural disasters such as earthquakes, hurricanes, or floods could physically damage data centers or disrupt network connectivity. This could result in prolonged service outages, data loss, and financial repercussions for our organization.
- **Cyberattack.** A cyberattack, such as a distributed denial-of-service (DDoS) attack or ransomware infection, could disrupt service availability, compromise data integrity, and lead to financial extortion or data theft. Recovery efforts could be time-consuming and costly.
- **Cloud provider outage.** A cloud provider outage affecting services or regions where our infrastructure is hosted could result in widespread service disruptions and data unavailability. This could impact our ability to serve customers and fulfill business obligations.
- **Software or configuration error.** Human error, software bugs, or misconfigurations could lead to unintended consequences such as service downtime, data corruption, or security vulnerabilities. Rapid identification and mitigation of such errors are essential to minimizing their impact on our platform.
- **Third-party service outage.** Dependency on third-party services such as payment gateways, APIs, or external integrations introduces the risk of service outages or disruptions. This could impact the functionality of our platform and impair the user experience.
- **Loss of key personnel.** The loss of key personnel due to illness, resignation, or unforeseen circumstances could disrupt operations, hinder decision-making processes, and impact the continuity of our platform's development and support.

## Mitigation and Recovery Strategies

To mitigate the impact of these disaster scenarios and ensure the resilience of our platform, we implement a comprehensive set of mitigation and recovery strategies, including the following.

### Implementing Robust Backup and Recovery Mechanisms

- Regularly backing up critical data stored in MongoDB to both Kubernetes Persistent Volume Claims (PVCs) and Amazon S3 for comprehensive data protection and redundancy.
- Utilizing automated backup and recovery solutions to minimize data loss and expedite recovery in the event of a disaster.

### Enhancing Network and Infrastructure Security Measures

In addition to the robust disaster recovery strategies outlined above, we further strengthen our network and infrastructure security measures to safeguard against potential threats and vulnerabilities.

**Implementing stringent network security controls.** We implement stringent network security controls to fortify our infrastructure against unauthorized access and potential data breaches. This includes deploying firewalls, implementing network segmentation strategies, and utilizing encryption protocols to protect sensitive data in transit and at rest. These measures ensure that only authorized entities can access our network resources and data, reducing the risk of security incidents and data compromise.

**Employing AWS ALBs for traffic distribution.** We leverage AWS Application Load Balancers (ALBs) to efficiently distribute incoming traffic across multiple Availability Zones. Distributing traffic in this manner enhances the resilience of our infrastructure against distributed denial-of-service (DDoS) attacks and ensures the high availability of our services. Additionally, ALBs provide advanced features such as Web Application Firewall (WAF) and web ACL (access control list) support, which allow us to implement additional protections against common web-based attacks and unauthorized access attempts.

### Monitoring for Suspicious Activities and Implementing Threat Detection Mechanisms

- Utilizing advanced monitoring and logging tools, such as Prometheus, Grafana, and AWS CloudWatch, to continuously monitor system performance, detect anomalies, and identify potential security threats.
- Implementing intrusion detection systems and security information and event management (SIEM) solutions to proactively identify and respond to security incidents in real time.
### Diversifying Infrastructure Across Multiple Regions or Cloud Providers

- Deploying Kubernetes clusters with nodes distributed across different Availability Zones within AWS regions to minimize the impact of localized outages or infrastructure failures.
- Considering multi-cloud strategies to further diversify infrastructure and reduce the risk of service disruptions caused by cloud-provider-specific issues or regional outages.

### Conducting Regular Security Audits and Vulnerability Assessments

- Leveraging tools like Snyk for vulnerability assessment to proactively identify and remediate security vulnerabilities in software dependencies and container images.
- Performing regular security audits and penetration testing exercises to evaluate the effectiveness of existing security controls and identify areas for improvement.

### Providing Training and Awareness Programs for Employees to Prevent Human Error

- Conducting regular security awareness training sessions to educate employees about best practices for data security, password hygiene, and phishing awareness.
- Establishing clear security policies and procedures, and enforcing role-based access controls to minimize the risk of human error and insider threats.

By proactively identifying potential disaster scenarios and implementing appropriate mitigation and recovery strategies, we aim to minimize the impact on our platform and ensure the continued delivery of reliable and secure services to our users.

## GitHub Repositories

### Utilization of GitHub Actions for CI

In our CI (continuous integration) workflow, we leverage GitHub Actions to automate the deployment of Docker images to AWS ECR (Elastic Container Registry) and to manage the configuration of Kubernetes resources using Helm charts. This streamlined process ensures the efficient and reliable delivery of our applications while maintaining consistency across different environments.

### Deployment Workflow Overview

**Building and pushing Docker images.** GitHub Actions orchestrates the build process for the Docker images defined in our application repositories. Upon successful build completion, GitHub Actions tags each Docker image with a version identifier concatenated with today's date (e.g., 1.2.3-2024-02-28) and pushes it to AWS ECR.

**Configuration management with Helm charts.** After Docker images are pushed to AWS ECR, the CI process initiates a workflow to manage the configuration of Kubernetes resources using Helm charts stored in a private GitHub repository called palmtemplates. Access to this repository is authenticated using a private SSH key.

**Dynamic values configuration.** Inside the palmtemplates repository, we maintain Helm templates for each application, along with corresponding values files for the different environments (e.g., values-dev.yaml, values-stg.yaml, values-prod.yaml). These values files define configuration parameters such as image tags, environment-specific settings, and resource allocations.

**Updating image tags.** The CI process dynamically updates the image tag value in the appropriate values file to match the newly tagged Docker image pushed to AWS ECR. This ensures that Kubernetes resources reference the latest version of the application image.

**Deployment trigger with ArgoCD.** ArgoCD, configured within our Kubernetes cluster, continuously monitors the palmtemplates repository for changes. Upon detecting a change, specifically a version update in the Helm values files, ArgoCD initiates a new deployment of the corresponding application with the updated Docker image.

Below is an example workflow illustrating the process for the development environment. This visual representation outlines the automated steps involved in deploying application updates and configurations within the development (dev) environment.

## RabbitMQ and MongoDB

In our infrastructure, we use replica sets to ensure high availability and fault tolerance for critical components such as RabbitMQ and MongoDB. By deploying these services as replica sets with three replicas each, we enhance resilience and mitigate the risk of service disruptions or data loss.
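The three-replica sizing follows standard quorum arithmetic: a replica set needs a strict majority of voting members to elect a primary, so three members tolerate the loss of one. A minimal sketch of that arithmetic (illustrative only, not tied to any driver or broker API):

```python
def majority(members: int) -> int:
    """Votes needed for a replica set of `members` voting nodes to elect a primary."""
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """How many members can fail while a primary can still be elected."""
    return members - majority(members)

# A three-member set (as used for RabbitMQ and MongoDB here) tolerates one
# failure; a five-member set would tolerate two.
print(tolerated_failures(3), tolerated_failures(5))  # → 1 2
```

This is also why even member counts add little: four members still only tolerate one failure, at the cost of an extra node.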
### RabbitMQ Replica Set

RabbitMQ, the message queue system central to our architecture, is deployed as a replica set with three replicas. This configuration ensures that even if one replica becomes unavailable due to hardware failure or maintenance, the remaining replicas can continue to process messages seamlessly, maintaining uninterrupted communication between microservices.

Additionally, RabbitMQ is configured with a high availability policy that replicates messages across all replicas in real time. This policy guarantees that even in the event of a node failure, messages remain accessible and processing can continue without interruption.

### MongoDB Replica Set

MongoDB, the primary data store for our applications, is deployed as a replica set with three replicas. This architecture provides data redundancy and automatic failover capabilities, ensuring data integrity and high availability even in the event of node failures or network partitions.

In our configuration, three replicas are deployed: one primary and two secondaries. The primary replica handles read and write operations, while the secondary replicas act as backups, keeping synchronized copies of the data.

MongoDB replica sets are equipped with automatic failover mechanisms. If the primary replica fails or becomes unresponsive, MongoDB automatically initiates a failover: one of the secondary replicas is elected as the new primary, ensuring seamless continuity of operations without manual intervention.

Microservices within our architecture connect to the MongoDB replica set using a connection string that specifies all members of the replica set. This connection string typically includes the hostnames or IP addresses of all replicas along with the replica set name. Microservices use it to establish connections to the replica set, allowing them to perform read and write operations against the primary replica and read operations from secondary replicas.

MongoDB replica sets ensure data consistency across replicas through replication. Write operations performed on the primary are replicated to the secondaries in real time, so all replicas maintain synchronized copies of the data. This replication process uses the oplog (operation log) to capture and replicate write operations, maintaining consistency across the entire replica set.

Deploying MongoDB as a replica set with three replicas yields several benefits:

- **High availability.** Even if one or more replicas become unavailable, the remaining replicas can continue serving read and write requests, minimizing downtime and ensuring high availability of data.
- **Automatic failover.** The automatic failover mechanism ensures continuous availability of the database if the primary replica fails, reducing the need for manual intervention and minimizing service disruptions.
- **Data redundancy.** Replica sets keep multiple copies of the data across replicas, reducing the risk of data loss and enhancing data resilience in the face of failures.

By implementing RabbitMQ and MongoDB as replica sets with three replicas each, we establish a resilient foundation for our infrastructure, capable of withstanding failures and maintaining consistent service availability. These replica sets play a critical role in ensuring the reliability and scalability of our applications, enabling seamless operation and data integrity in the face of various challenges.

## MongoDB Backup Microservice

### Backup Mechanism to Kubernetes Persistent Volume Claim (PVC) and Historical Backup to S3

In our infrastructure, we have implemented a comprehensive backup solution that ensures the integrity and availability of our data. This solution consists of two primary components: backup to a Kubernetes Persistent Volume Claim (PVC) for real-time data protection, and historical backup to Amazon S3 for long-term retention and archival purposes.
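The replica-set connection string described earlier can be assembled as follows. The hostnames, replica set name, and database name are illustrative placeholders; the real values are held in AWS Secrets Manager, never in code:

```python
def mongo_replica_set_uri(hosts, replica_set, database):
    """Build a MongoDB connection string listing every replica set member,
    so the driver can discover the current primary and fail over automatically."""
    host_list = ",".join(hosts)
    return f"mongodb://{host_list}/{database}?replicaSet={replica_set}"

# Illustrative members only -- real hostnames live in AWS Secrets Manager.
uri = mongo_replica_set_uri(
    ["mongodb-0.mongodb:27017", "mongodb-1.mongodb:27017", "mongodb-2.mongodb:27017"],
    replica_set="rs0",
    database="palm",
)
print(uri)
```

Listing all three members (rather than just the primary) is what lets a driver reconnect transparently after an election promotes a secondary.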
### Backup to Kubernetes Persistent Volume Claim (PVC) for Real-Time Data Protection and Recovery

Within our Kubernetes environment, we leverage a backup mechanism built on Kubernetes Persistent Volume Claims (PVCs). This approach involves taking periodic snapshots of the data stored within the PVCs associated with MongoDB. These snapshots are stored locally within the Kubernetes cluster, providing an efficient and scalable backup solution that ensures minimal data loss and rapid recovery in the event of failures or data corruption.

### Historical Backup to Amazon S3

In addition to real-time backups to PVCs, we implement historical backups to Amazon S3 for long-term data retention and archival purposes. This involves regularly transferring backup data from the Kubernetes PVC snapshots to Amazon S3 buckets, where it is stored securely and durably. By leveraging the scalability and cost-effectiveness of Amazon S3, we ensure that historical backup data is preserved and accessible for extended periods, meeting compliance requirements and enabling disaster recovery scenarios.

### Deployment Manifest Snippet

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mongodb-backup
  namespace: mongodb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mongodb-backup
  template:
    metadata:
      labels:
        app: mongodb-backup
    spec:
      containers:
        - name: mongodb-backup
          image: tiredofit/db-backup
          volumeMounts:
            - name: backup-volume
              mountPath: /backup
            - name: post-script
              mountPath: /assets/scripts/post
          resources:
            limits:
              memory: 512Mi
              cpu: 600m
            requests:
              memory: 512Mi
              cpu: 600m
          env:
            - name: TIMEZONE
              value: "America/Vancouver"
            - name: CONTAINER_NAME
              value: "db-backup"
            - name: CONTAINER_ENABLE_MONITORING
              value: "false"
            - name: DEBUG_MODE
              value: "true"
            - name: BACKUP_JOB_CONCURRENCY
              value: "1"
            - name: DEFAULT_CHECKSUM
              value: "none"
            - name: DEFAULT_COMPRESSION
              value: "zstd"
            - name: DEFAULT_BACKUP_INTERVAL
              value: "1440"
            - name: DEFAULT_BACKUP_BEGIN
              value: "0000"
            - name: DEFAULT_CLEANUP_TIME
              value: "20160"
            - name: DB01_TYPE
              value: "mongo"
            - name: DB01_HOST
              value: "<db host>"
            - name: DB01_NAME
              value: "<db name>"
            - name: DB01_USER
              value: "<db user>"
            - name: DB01_PASS
              value: "<db pass>"
            - name: DB01_BACKUP_INTERVAL
              value: "720"
            - name: DB01_BACKUP_BEGIN
              value: "+1"
            - name: DB01_CLEANUP_TIME
              value: "20160"
            - name: DB01_CHECKSUM
              value: "sha1"
            - name: DB01_COMPRESSION
              value: "gz"
            - name: DEFAULT_S3_BUCKET
              value: "<bucket name>"
            - name: DEFAULT_S3_KEY_ID
              value: "<aws iam key id>"
            - name: DEFAULT_S3_KEY_SECRET
              value: "<aws iam secret>"
            - name: DEFAULT_S3_PATH
              value: "production"
            - name: DEFAULT_S3_REGION
              value: "us-east-1"
            - name: DEFAULT_BACKUP_LOCATION
              value: "s3"
            - name: DEFAULT_AUTH
              value: "<db auth>"
            - name: DB02_TYPE
              value: "mongo"
            - name: DB02_HOST
              value: "<db host>"
            - name: DB02_NAME
              value: "<db name>"
            - name: DB02_USER
              value: "<db user>"
            - name: DB02_PASS
              value: "<db pass>"
            - name: DB02_BACKUP_INTERVAL
              value: "720"
            - name: DB02_BACKUP_BEGIN
              value: "+30"
            - name: DB02_CLEANUP_TIME
              value: "7200"
            - name: DB02_CHECKSUM
              value: "sha1"
            - name: DB02_COMPRESSION
              value: "gz"
            - name: DB02_BACKUP_LOCATION
              value: "filesystem"
            - name: DB02_SIZE_VALUE
              value: "megabytes"
            - name: DB01_SIZE_VALUE
              value: "megabytes"
      volumes:
        - name: backup-volume
          persistentVolumeClaim:
            claimName: mongodb-backups-pvc
        - name: post-script
          configMap:
            name: post-script
            defaultMode: 0744
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-backups-pvc
  namespace: mongodb
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 25Gi # Adjust the storage size as needed
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: post-script
  namespace: mongodb
data:
  post-script.sh: |
    #!/bin/bash
    token="<telegram token>"
    chat_id="<telegram group chat id>"
    curl -s -X POST https://api.telegram.org/bot$token/sendMessage \
      -d chat_id=$chat_id \
      -d text="Backup process finished. Code: $1, Host: $3, Database name: $4, Process duration: $7 [seconds], Backup file name: $8, Backup file size: $9" > /dev/null
```

## UptimeKuma and Telegram Alerts

To ensure continuous monitoring of our infrastructure and a timely response to potential issues, we use UptimeKuma in conjunction with Telegram alerts.

### Configuration and Activation of Alerts

We have configured UptimeKuma to monitor the availability and performance of our critical services and applications. UptimeKuma periodically checks the status of these services and reports any instances of downtime or errors.

Additionally, we have set up a dedicated status site (status.palmdao.app) that provides real-time updates on the status of our principal services within the cluster. This status site serves as a centralized dashboard for monitoring service availability and performance metrics.

In the event of any anomalies or errors detected by UptimeKuma, automated alerts are triggered and sent to our notifications channel on Telegram. These alerts provide instant notifications to our operations team, enabling them to promptly investigate and address any issues that may arise. By leveraging Telegram alerts, we ensure that potential disruptions are swiftly identified and resolved, minimizing the impact on our users and maintaining the reliability of our services.

## External Secrets and AWS Secrets Manager

### Management of Environment Variables Within Kubernetes Microservices

In our architecture, the management of sensitive environment variables within Kubernetes microservices is a critical aspect of ensuring the security and integrity of our applications. We employ External Secrets, coupled with AWS Secrets Manager, to securely store and retrieve sensitive information such as credentials, API keys, and other configuration parameters.

### External Secrets Integration

External Secrets is a Kubernetes operator that enables the automatic provisioning of Kubernetes Secrets from external secret management systems such as AWS Secrets Manager. By integrating External Secrets into our Kubernetes clusters, we centralize the management of sensitive data and simplify the process of injecting secrets into our microservices.
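Conceptually, the operator reads a secret's key/value pairs from the external store and materializes them as a Kubernetes Secret whose `data` fields are base64-encoded. A minimal sketch of that mapping (the key names and values are illustrative; the real retrieval is performed by the External Secrets operator, not by application code):

```python
import base64

def to_kubernetes_secret(name: str, namespace: str, values: dict) -> dict:
    """Render key/value pairs (as fetched from an external secret store)
    into the shape of a Kubernetes Secret manifest with base64-encoded data."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode()).decode()
            for key, value in values.items()
        },
    }

# Illustrative values only -- real credentials never appear in manifests.
secret = to_kubernetes_secret(
    "palm", "palmdao",
    {"mongodb_host": "mongodb-0.mongodb", "mongodb_port": "27017"},
)
print(secret["data"]["mongodb_port"])
```

The resulting Secret is what a Deployment's `secretKeyRef` entries resolve against at container start, so the plaintext values exist only inside the cluster and the secret store.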
data and simplify the process of injecting secrets into our microservices aws secrets manager aws secrets manager serves as the centralized repository for storing and managing our application secrets it provides robust security features such as encryption at rest and in transit, fine grained access control, and automatic rotation of secrets, ensuring that sensitive data remains protected throughout its lifecycle secure secret retrieval external secrets retrieves secrets from aws secrets manager and injects them into kubernetes secrets, which are then mounted as environment variables within our microservices' containers this approach ensures that sensitive information is never exposed directly in kubernetes manifests or configuration files, reducing the risk of accidental exposure or unauthorized access granular access control access to secrets stored in aws secrets manager is tightly controlled using aws identity and access management (iam) policies we define granular permissions to restrict access to only authorized entities, such as kubernetes service accounts, ensuring that sensitive data is accessible only to those with the appropriate permissions below is a simplified graphical representation illustrating the workflow of external secrets with aws secrets manager spec replicas {{ values replicacount }} selector matchlabels app {{ values selectorlabels }} template metadata labels app {{ values templatelabels }} spec containers \ name {{ chart name }} image "{{ values image repository }} {{ values env app version }}" imagepullpolicy {{ values image pullpolicy }} ports \ containerport 3000 env \ name "mongodb host" valuefrom secretkeyref name palm key mongodb host \ name "mongodb port" valuefrom secretkeyref name palm key mongodb port \ name "mongodb username" valuefrom secretkeyref name palm key mongodb username \ name "mongodb password" valuefrom secretkeyref name palm key mongodb password in this code snippet, we define a kubernetes deployment configuration for 
secret retrieval the "palm" secret, referenced within the deployment, enables the seamless integration of external secrets, retrieving key value pairs for each environment variable securely stored within aws secrets manager this ensures that sensitive data, such as credentials and configuration parameters, remains protected throughout the deployment process additionally, the deployment configuration ensures that each environment variable retrieved from the "palm" secret is securely injected as an environment variable within the respective microservice's container this approach shields sensitive information from direct exposure in kubernetes manifests or configuration files, thereby enhancing the security posture of our application by leveraging external secrets in conjunction with kubernetes deployments, we streamline the management of sensitive data, centralize access control, and fortify the security of our microservices architecture monitoring and logging stack utilization of grafana, loki, and prometheus for metric visualization and log analysis in our infrastructure, we rely on grafana, loki, and prometheus as integral components for metric visualization and log analysis this powerful combination empowers us to gain valuable insights into the performance, health, and behavior of our applications and infrastructure components grafana grafana stands as our primary dashboarding and visualization platform it offers a user friendly interface for monitoring and analyzing metrics, allowing us to create customizable dashboards that aggregate data from various sources with grafana's extensive library of plugins and integrations, we visualize key performance indicators (kpis), system metrics, and application specific metrics in diverse formats, ranging from simple line charts to complex heatmaps and histograms additionally, we showcase palm services logs on grafana dashboards, alongside metrics such as memory usage and pod metrics, providing a comprehensive view of our 
application's health and performance.

Loki

Loki serves as our log aggregation and querying system, designed specifically for cloud-native environments. By ingesting logs from microservices, containers, and infrastructure components, Loki provides a centralized repository for log storage and analysis. Its efficient indexing mechanism and Prometheus-inspired query language enable us to quickly search, filter, and analyze logs, facilitating troubleshooting, anomaly investigation, and insight into system behavior.

Prometheus

Prometheus acts as our monitoring and alerting toolkit, collecting time-series data from various sources, including application metrics, system metrics, and custom exporters. Its powerful query language, PromQL, enables ad hoc analysis and the creation of alerting rules based on predefined thresholds or complex conditions. Integrating Prometheus with Grafana allows us to build dynamic dashboards that combine metrics from Prometheus with other data sources, facilitating comprehensive monitoring and analysis.

Integration of Prometheus with ArgoCD

To incorporate Prometheus into our infrastructure, we import the Prometheus Helm chart into ArgoCD. Below is a snippet showcasing how we configure Prometheus for alerting:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: 51.2.0
    chart: kube-prometheus-stack
    helm:
      values: |
        defaultRules:
          create: false
        nodeExporter:
          enabled: false
        additionalPrometheusRulesMap:
          rule-name:
            groups:
              - name: palm
                rules:
                  - alert: NewPalmdaoPodOnReadyState[Production]
                    expr: changes(kube_pod_status_ready{pod=~"palmdao.*", condition="true"}[2m]) == 1
                    for: 1s
                    labels:
                      severity: page
                    annotations:
                      summary: New palmdao pod ready
                  - alert: NewPalmvalidatorPodOnReadyState[Production]
                    expr: changes(kube_pod_status_ready{pod=~"palmvalidator.*", condition="true"}[2m]) == 1
                    for: 1s
                    labels:
                      severity: page
                    annotations:
                      summary: New palmvalidator pod ready
                  - alert: PalmPodLoopRestarting[Production]
                    expr: kube_pod_container_status_restarts_total{pod=~"palm.*"} > 0
                    for: 1s
                    labels:
                      severity: page
                    annotations:
                      summary: Palm pod restart count is different than 0
                  - alert: PalmvalidatorScalingProcess[Production]
                    expr: kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="palmvalidator-autoscaler"} > 1
                    for: 1s
                    labels:
                      severity: page
                    annotations:
                      summary: Palmvalidator scaling process started
        prometheus:
          prometheusSpec:
            ruleSelectorNilUsesHelmValues: false
            podMonitorSelectorNilUsesHelmValues: false
            serviceMonitorSelectorNilUsesHelmValues: false
            probeSelectorNilUsesHelmValues: false
            additionalAlertmanagerConfigs:
              - scheme: http
                static_configs:
                  - targets:
                      - "alertmanager-service.prometheus.svc.cluster.local:9093"
                    labels:
                      group: prometheus
            alertingEndpoints:
              - name: ""
                namespace: "prometheus"
                port: http-web
                scheme: http
                pathPrefix: ""
                apiVersion: v2
        alertmanager:
          alertmanagerSpec:
            useExistingSecret: true
        grafana:
          env:
            GF_INSTALL_PLUGINS: flant-statusmap-panel
          adminPassword: <reallySecurePassword>
          ingress:
            enabled: true
            ingressClassName: alb
            annotations:
              kubernetes.io/ingress.class: alb
              alb.ingress.kubernetes.io/scheme: internet-facing
              alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
              alb.ingress.kubernetes.io/group.name: palm-ingress
            labels: {}
            hosts:
              - <host>
            path: /
            tls:
              - hosts:
                  - <host>
          service:
            portName: http-web
            type: NodePort
  destination:
    server: https://kubernetes.default.svc
    namespace: prometheus
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    automated:
      prune: true
      selfHeal: true
```

Integration of Promtail and Loki for log shipping

Promtail, together with Loki, is integrated into our logging pipeline to efficiently collect logs from various sources and ship them to Loki for centralized storage and analysis. Below is a snippet demonstrating how we install and configure Loki using ArgoCD, followed by the DaemonSet we use to run Promtail on every node:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 2.9.11
    chart: loki-stack
    helm:
      values: |
        loki:
          auth_enabled: false
          commonConfig:
            replication_factor: 1
          storage:
            type: 'filesystem'
        singleBinary:
          replicas: 1
  destination:
    server: https://kubernetes.default.svc
    namespace: prometheus
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    automated:
      prune: true
      selfHeal: true
```

```yaml
# daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail-daemonset
  namespace: prometheus
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      serviceAccount: promtail-serviceaccount
      containers:
        - name: promtail-container
          image: grafana/promtail
          args:
            - -config.file=/etc/promtail/promtail.yaml
          env:
            - name: 'HOSTNAME' # needed when using kubernetes_sd_configs
              valueFrom:
                fieldRef:
                  fieldPath: 'spec.nodeName'
          volumeMounts:
            - name: logs
              mountPath: /var/log
            - name: promtail-config
              mountPath: /etc/promtail
            - mountPath: /var/lib/docker/containers
              name: varlibdockercontainers
              readOnly: true
      volumes:
        - name: logs
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: promtail-config
          configMap:
            name: promtail-config
```

kube-state-metrics

We utilize kube-state-metrics for Kubernetes metric collection, providing insight into the state and health of our Kubernetes clusters. Below is a snippet demonstrating how we integrate it into ArgoCD for streamlined monitoring and management of Kubernetes resources:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-state-metrics
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://kubernetes-sigs.github.io/metrics-server
    targetRevision: 3.11.0
    chart: metrics-server
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
    automated:
      prune: true
      selfHeal: true
```

This snippet illustrates how we define an ArgoCD Application resource to deploy and manage kube-state-metrics within our Kubernetes clusters, enabling seamless integration of Kubernetes metric collection into our monitoring stack.
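The Promtail DaemonSet above mounts a promtail-config ConfigMap at /etc/promtail, but the file's contents are not shown in this document. Below is a minimal, hypothetical sketch of what that promtail.yaml could contain; the Loki push URL, ports, and relabeling rules are assumptions, not our actual configuration (and expanding ${HOSTNAME} requires running Promtail with the -config.expand-env=true flag).

```yaml
# Hypothetical promtail.yaml for the promtail-config ConfigMap referenced by
# the DaemonSet. Endpoints and relabeling rules are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: prometheus
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push   # assumed Loki service address
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods scheduled on this node; HOSTNAME is set from
          # spec.nodeName in the DaemonSet (needs -config.expand-env=true).
          - source_labels: [__meta_kubernetes_pod_node_name]
            regex: ${HOSTNAME}
            action: keep
          # Expose the pod name as a queryable Loki label.
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
```

With a `pod` label attached at scrape time, the Palm service logs can then be filtered per pod when building the Grafana dashboards described earlier.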
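Because the chart is installed with alertmanagerSpec.useExistingSecret: true, Alertmanager expects its routing configuration to be supplied out of band as a Secret. The following is a hedged sketch of what such a Secret might look like; the Secret name, the Slack receiver, and the webhook placeholder are assumptions, and the route simply matches the severity: page label attached to the alert rules above.

```yaml
# Hypothetical Alertmanager configuration Secret. The name must match what the
# prometheus-operator expects for the Alertmanager instance; the receiver and
# webhook below are placeholders, not our real alert destinations.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager   # assumed name
  namespace: prometheus
type: Opaque
stringData:
  alertmanager.yaml: |
    route:
      receiver: default
      routes:
        - receiver: pager
          matchers:
            - severity="page"      # matches the labels on the Palm alert rules
    receivers:
      - name: default
      - name: pager
        slack_configs:
          - api_url: <slack-webhook-url>   # placeholder
            channel: '#palm-alerts'        # illustrative channel
```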