Posted in

成为 SRE 的系统工程学习资源 | Google Cloud Blog_AI阅读总结 — 包阅AI

包阅导读总结

1. 关键词:Systems Engineering、SRE、Reliable Systems、Learning Resources、Best Practices

2. 总结:文本介绍了成为 SRE 所需的系统工程学习资源,包括相关论文、设计方法、实践工作坊、演讲和研究论文,帮助读者了解并掌握系统工程知识和实践。

3. 主要内容:

– 系统工程与 SRE 的关系:

– 阐述系统工程是创建和实现可靠系统的学科,被 Google SRE 广泛应用。

– 学习资源介绍:

– 《The Systems Engineering Side of Site Reliability Engineering》论文探讨系统工程师与其他角色的差异及关键视角和方法。

– 《Non-abstract Large System Design》介绍可靠可扩展系统设计需实用方法。

– 《Distributed ImageServer workshop》提供实践 NALSD 原则的工作坊。

– 《Google Production Environment》YouTube 演讲介绍 Google 生产环境。

– 《Reliable Data Processing with Minimal Toil》研究论文探讨可靠数据处理。

思维导图:

文章地址:https://cloud.google.com/blog/products/devops-sre/systems-engineering-learning-resources-to-become-an-sre/

文章来源:cloud.google.com

作者:Max Saltonstall,Salim Virji

发布时间:2024/6/13 0:00

语言:英文

总字数:862字

预计阅读时间:4分钟

评分:86分

标签:DevOps & SRE,系统工程,站点可靠性工程,Google Cloud


以下为原文内容

本内容来源于用户推荐转载,旨在分享知识与观点,如有侵权请联系删除 联系邮箱 media@ilingban.com

Creating and implementing reliable systems (of code, infrastructure, or anything really) forms the discipline of systems engineering, which is used extensively by Google site reliability engineers (SREs). To help you learn more about systems engineering, and how to get hands-on with best practices, we’ve assembled some resources for you to get started.

The Systems Engineering Side of Site Reliability Engineering (USENIX paper)
What is a systems engineer, and how might they be different from a given SRE, software engineer or a SysAdmin? This paper explores the key perspectives and approaches that systems engineers take, the way they look at intersecting sets of services, and how they continue to grow their own knowledge. Read this short report to get an inside view from Google SREs on how they tackle investigating and architecting applications.

Non-abstract Large System Design (in the SRE Workbook)
Designing systems reliably and scalably requires a focused, practical approach, which we call non-abstract large system design (NALSD). Doing this well requires iterating and refining on designs in consideration of feasibility, resilience and efficiency, and also focusing design on real-world resource constraints or expectations. If you want to learn more about NALSD after reading this chapter, take a look at some other real-world examples (Reliable Cron across the Planet, Making “Push on Green” a Reality, or SRE Best Practices for Capacity Management) and dig into the research behind capacity planning, component isolation and graceful degradation.

Distributed ImageServer workshop (in the SRE Classroom)
Want to put these systems design and engineering concepts into practice? This self-guided workshop helps you code and deploy a large-scale system using NALSD principles, helping bridge the gap between the theoretical and the practical. You will make a robust, scalable, reliable system, and see what it takes to iterate on designs. To go further, check out the other workshops in SRE classroom or join an SRE community in your area.

Google Production Environment (YouTube talk)
Curious how Google runs its production environment? This lively 15 minute talk digs into software, hardware, and the numerous subsystems that power the online services used by billions of people every day. Watch to learn about our physical network infrastructure, the Borg cluster management system, persistent file systems, a massive mono-repo, and much more. Hungry for more? Hear more from SREs at Google on how they handle capacity management in this tech talk.

Reliable Data Processing with Minimal Toil (research paper)
Ensuring reliability in data processing can be tricky, especially with batch jobs and automation. And while batching can save costs, it also adds risks around data corruption and downstream delays. A structured approach to safety, validation and testing can make batch pipelines more reliable and consistent. To dive deeper, check out the video, learn more about A/B testing and canary deployments, and try out managed data tools such as BigQuery and Dataflow.