Improve Application Reliability while you accelerate software delivery for your rapid Digital Transformation
Why do we need SRE (Site Reliability Engineering) Consulting?
Key Highlights of our SRE Consulting:
Data Driven Approach to Operations
Culture that drives Efficiency
Automation driven Risk Mitigation
Tool Based Hypothesis driven Methodology
Industry segments we work with to provide SRE Consulting services:
How our SRE Consulting services support the industry?
SRE Initiation & Transformation
Get a granular understanding of your current state and develop an SRE implementation roadmap
DevOpsLabs follows the DevOps Institute SRE Blueprint, a structured approach to baselining operations and rolling out changes along the People, Process, and Technology dimensions across the following areas:
Culture
Toil Reduction
SLAs and SLIs
Measurements
Antifragility
Work Sharing
Deployments
Performance Management
Incident Management
Looking for a detailed maturity assessment to transform your SRE practices?
Our SRE Consulting Approach
SRE Initiation & Transformation
Assess the SRE maturity of your organization and envision a target state for your operations aligned to your business strategy and priorities.
Our proprietary assessment tool helps organisations to assess their SRE maturity by:
SRE Initiation & Transformation Milestones
- Culture
- Toil Reduction
- SLAs/SLIs
- Measurements
- Antifragility
- Work Sharing
- Deployments
- Performance Management
- Incident Management
Culture
Initial
Little, if any, communication or collaboration between IT Operations and application developers. The SRE role is not defined.
Managed
Communication and collaboration between Operations and Development is controlled and limited by management. The SRE role is defined. A team structure to support SRE development is defined.
Defined
Open communication and collaboration between SREs and Development according to defined rules. A team structure to support SRE development is operating.
Quantitatively Managed
Communication and collaboration between SREs and Development is encouraged and managed to measured goals. SREs have developed capabilities for most of the 9 pillars of SRE.
Optimizing
Culture of continuous communication and collaboration between SREs and Development. SREs have achieved mastery of SRE practices. SREs are embedded in most development teams.
Toil Reduction
Initial
Operations tasks are generally manual. Toil automation is not routinely practiced.
Managed
Some Toil reduction tasks are automated. Toil automation occurs on a very limited scale with very limited budget.
Defined
Many Toilsome tasks are automated. Innovative toil automation work is encouraged and routinely practiced.
Quantitatively Managed
Most Operations tasks are supported by automation. Toil automation work is practiced extensively with measurable goals. SRE team topologies are tailored to suit platforms and services.
Optimizing
Toil automation is a strategic priority for the organization. Optimization of SRE tasks through automation is proactively pursued.
SLAs/SLIs
Initial
SLOs and SLIs are not defined for most, if any, services.
Managed
Some SLOs and SLIs are defined and used for some services.
Defined
SLOs, SLIs and Error Budgets are defined and used for many services.
Quantitatively Managed
The organization has a policy to measure the performance of services using SLOs, SLIs and Error Budgets.
Optimizing
The organization continuously reviews and optimizes SLOs, SLIs and Error Budgets for all services.
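As a rough illustration of how SLOs, SLIs and Error Budgets relate, the sketch below computes an availability SLI from request counts and reports how much of the error budget is left. The function name, the request-based SLI, and the sample numbers are assumptions made for this example; they are not taken from the assessment tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize an availability SLI against an SLO target (hypothetical helper)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the fraction of requests that succeeded in the measurement window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in the same window.
    allowed_failures = (1 - slo_target) * total_requests
    # Fraction of the budget still unspent; negative means the SLO is breached.
    budget_remaining = 1 - failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
    }


if __name__ == "__main__":
    # Assumed sample window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget(0.999, 1_000_000, 250))
```

With these assumed numbers the SLO tolerates about 1,000 failed requests, so 250 failures consume roughly a quarter of the budget.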
Measurements
Initial
Few services and infrastructure systems in production are monitored for health and performance.
Managed
Many services and infrastructure systems in production are monitored for health and performance.
Defined
Logs and traces for services and infrastructure systems in production are collectively analyzed for health and performance.
Quantitatively Managed
Logs and traces for services and infrastructure systems in production are analyzed and measured for health and performance to identify anomalies relative to normal states.
Optimizing
Observability tools provide intelligent inferences and recommended actions for applications and infrastructure systems in production.
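One simple way to flag anomalies relative to a normal state is to compare recent measurements with a baseline using a z-score. The sketch below is only a minimal illustration: the function name, the threshold of 3 standard deviations, and the latency samples are assumptions, and production observability tools typically use far richer models.

```python
from statistics import mean, stdev


def find_anomalies(baseline, recent, threshold=3.0):
    """Return values in `recent` that deviate strongly from `baseline` (hypothetical helper)."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs at least 2 baseline points
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as anomalous.
        return [x for x in recent if x != mu]
    # Keep values whose distance from the baseline mean exceeds the threshold.
    return [x for x in recent if abs(x - mu) / sigma > threshold]


if __name__ == "__main__":
    normal_latency_ms = [100, 102, 98, 101, 99]  # assumed "normal state" samples
    latest_latency_ms = [100, 103, 140]          # assumed recent samples
    print(find_anomalies(normal_latency_ms, latest_latency_ms))
```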
Antifragility
Initial
Business Continuity, Disaster Prevention and Recovery, Fire Drills, and Production Security procedures are not practiced for many services.
Managed
Business Continuity, Disaster Prevention and Recovery, Fire Drills, and Production Security procedures are practiced to some extent for most services.
Defined
Business Continuity, Disaster Prevention and Recovery, Fire Drills, and Production Security procedures are practiced consistently according to policies for all services.
Quantitatively Managed
Business Continuity, Disaster Prevention and Recovery, Fire Drills, and Production Security procedures are practiced consistently and measured according to policies for all services.
Optimizing
Chaos Engineering practices are used to proactively seek improvements to reliability and security of applications and infrastructure systems in production.
Work Sharing
Initial
Little, if any, work is shared between Operations and Development.
Managed
Operations and Development occasionally collaborate to improve application performance in production.
Defined
A policy is defined that ensures SREs and Development collaborate so that applications perform according to Operations requirements in production.
Quantitatively Managed
A policy is defined that measures and controls the extent to which SREs and Development collaborate so that applications perform according to Operations requirements in production.
Optimizing
SREs and Development collaboratively create innovative approaches that ensure application performance in production will continuously improve.
Deployments
Initial
Application deployments to production environments are mostly manual and often a source of stress for the organization and stakeholders.
Managed
Some Application deployments to production environments are supported by automation and may use deployment strategies such as Blue-Green, Canary or Feature-Flag Rollouts.
Defined
Automation of Application Deployments to production environments is governed by policies and may use deployment strategies such as Blue-Green, Canary or Feature-Flag Rollouts.
Quantitatively Managed
Automation of Application Deployments to production environments is measured and may use deployment strategies such as Blue-Green, Canary or Feature-Flag Rollouts.
Optimizing
SREs continuously improve Automation of Application Deployments to production environments.
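To make the Canary strategy mentioned above more concrete, the sketch below shows one possible promotion gate that compares the canary's error rate with the baseline's. The function name, the thresholds, and the three outcomes ("wait", "promote", "rollback") are assumptions for illustration, not a description of any particular deployment tool.

```python
def canary_decision(baseline_errors, baseline_total, canary_errors, canary_total,
                    max_ratio=1.5, floor=0.001, min_requests=100):
    """Decide what to do with a canary release (hypothetical gate)."""
    if canary_total < min_requests:
        # Too little traffic to judge the canary yet.
        return "wait"

    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total

    # Tolerate up to max_ratio times the baseline error rate, with a small
    # absolute floor so a near-zero baseline does not block every canary.
    limit = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= limit else "rollback"


if __name__ == "__main__":
    # Assumed traffic counts for the current release and the canary.
    print(canary_decision(10, 10_000, 1, 1_000))
    print(canary_decision(10, 10_000, 20, 1_000))
```

A real rollout would usually evaluate several signals (latency, saturation, business metrics) over multiple intervals before promoting.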
Performance Management
Initial
Application Performance Monitoring and Proactive Capacity testing are not performed.
Managed
Application Performance Monitoring and Proactive Capacity testing are performed for many services.
Defined
Application Performance Monitoring and Proactive Capacity testing are performed for services consistently according to policies.
Quantitatively Managed
Application Performance Monitoring and Proactive Capacity testing for services are measured.
Optimizing
SREs routinely seek innovative approaches to advance Application Performance Monitoring and Proactive Capacity testing for services.
Incident Management
Initial
Emergency response does not follow consistent procedures that ensure blameless retrospectives are performed for critical events.
Managed
Emergency response procedures are managed to ensure blameless retrospectives are performed for many events.
Defined
Emergency response procedures are managed by policies to ensure blameless retrospectives are consistently performed for events that match defined criteria.
Quantitatively Managed
The effectiveness and completeness of Emergency response procedures are measured to ensure blameless retrospectives are consistently performed for events that match defined criteria.
Optimizing
The results of blameless post-mortems are routinely used to optimize emergency response practices.
Adapted from DevOps Institute & Engineering DevOps by Marc Hornbeek