Walmart Inc.

Senior Software Engineer - Site Reliability Operations

Posted on: 12 Apr 2021

Sunnyvale, CA

Job Description

Position Summary...

What you'll do...

As a Senior Site Reliability Operations Engineer within the Global Technology Platforms (GTP) CCC team, you will work with other CCC, TDO, SRE, DevOps and Engineering practitioners to proactively maintain mission-critical infrastructure, cloud platforms, micro-services, tools, and processes that ensure the highest levels of availability and reliability across our Global Technology platforms.

You're right for the job if you are comfortable leading our major incident response as part of a technical team of engineers laser-focused on restoring service across complex distributed systems. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work directly with our SRE, Engineering and DevOps teams to support our next-generation, always-up, cloud-based e-commerce platforms.

The CCC Senior Site Reliability Operations Engineer is responsible for proactively monitoring, detecting and resolving site issues before they affect customers or availability. Technically, you will understand the full end-to-end stack and use this knowledge to detect errors and failures and take corrective action to mitigate them. During a major incident, you will draw on your technical skills and knowledge to triage and troubleshoot, differentiating between symptom and cause, to help resolve impacting issues. Your ability to continuously challenge yourself and develop a strong network within your peer group will see you excel in this role. Our goal is to protect the customer experience and deliver outstanding levels of availability. To do so, you will need strong skills in the following areas:

* Expert level understanding of incident management processes and procedures.
* Calm under pressure when participating in major incident response.
* Deep technical understanding of core infrastructure, cloud services, platforms and micro-services.
* Ability to understand and capture key data from logs at an expert level.
* Ability to understand traffic flows and key dependencies between services.
* Ability to triage effectively and to distinguish symptom from cause.
* Detect and quantify impact.
* Expert-level troubleshooting skills using a diverse set of tools and methods.
* Analyze trends to proactively prevent incidents.
* Focus on immediate restoration vs root cause.
* Research and recommend alternative actions for incident resolution. Develop procedures and documentation to support this.
* Create and maintain procedural documentation.
* Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
* Absorb knowledge and understand complex distributed systems - ability to share and impart this knowledge into your peer group and beyond.
* Build tools to improve visibility, proactively detect issues and restore system availability.
* Develop automation and self-healing with DevOps, Engineering and SRE partners.
* Strong focus on collecting and inferring metrics.
* Clear communication skills.
* Ability to contribute to multiple incidents at any given time.
* Analyze systems and make recommendations to prevent possible problems. Take the lead on issue resolution activities using knowledge of complex and company-wide systems.
* Scripting and software development to automate and help enhance existing solutions.
* Experience owning, developing and evangelizing a product.
* Ability to gather requirements and build solutions into a product.
* Evangelize operational excellence.

Additional responsibilities may include:

* Actively provide data for and participate in root cause analysis.
* Define CCC onboarding processes and ensure they are adhered to when accepting new systems into service.
* Share knowledge globally between CCC teams.
* Analyze systems and make recommendations to prevent possible incidents.
* Strive for continuous improvement and make recommendations based on CCC process.
* Act as a technical focal point for the CCC team.
* Other duties and responsibilities as assigned.

Qualifications:

* 4+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems.
* Bachelor's Degree in Computer Science or a related field, or relevant work experience.
* Strong and demonstrable incident management skills with relevant experience in an enterprise organization.
* Experience and exposure working in a 24/7 operations support environment.
* Methodical and systematic problem-solving approach, combined with a solid awareness of ownership, initiative and drive.
* Experience investigating, analyzing and troubleshooting large scale enterprise systems.
* Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
* Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell.
* Experience administering Unix/Linux in a production environment.
* Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
* Experience working with and developing enterprise monitoring/tooling solutions like Grafana, Kibana, Splunk, Graphite, Nagios, New Relic and Dynatrace.
* Working knowledge of one or more cloud technologies such as Azure, GCP and OpenStack.
* Working knowledge of CI/CD pipelines.



Minimum Qualifications...

Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.

Bachelor's degree in Computer Science and 3 years' experience in software engineering or related field OR 5 years' experience in software engineering or related field.



Preferred Qualifications...

Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.

Master's degree in Computer Science or related field and 2 years' experience in software engineering or related field.



Primary Location...

640 W California Avenue, Sunnyvale, CA 94086-4828, United States of America

Walmart Inc.

Bentonville, AR

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores. Headquartered in Bentonville, Arkansas, the company was founded by Sam Walton in 1962 and incorporated on October 31, 1969. It also owns and operates Sam's Club retail warehouses. As of April 30, 2019, Walmart has 11,368 stores and clubs in 27 countries, operating under 55 different names. The company operates under the name Walmart in the United States and Canada, as Walmart de México y Centroamérica in Mexico and Central America, as Asda in the United Kingdom, as the Seiyu Group in Japan, and as Best Price in India. It has wholly owned operations in Argentina, Chile, Canada, and South Africa. Since August 2018, Walmart has held only a minority stake in Walmart Brasil, with 20% of the company's shares; private equity firm Advent International holds the remaining 80%.

Walmart is the world's largest company by revenue (over US$500 billion, according to the Fortune Global 500 list in 2018) as well as the largest private employer in the world, with 2.2 million employees. It is a publicly traded family-owned business, as the company is controlled by the Walton family. Sam Walton's heirs own over 50 percent of Walmart through their holding company, Walton Enterprises, and through their individual holdings. Walmart was the largest U.S. grocery retailer in 2019, and 65 percent of Walmart's US$510.329 billion sales came from U.S. operations.

The company was listed on the New York Stock Exchange in 1972. By 1988, Walmart was the most profitable retailer in the U.S., and by October 1989, it had become the largest in terms of revenue. Originally geographically limited to the South and lower Midwest, by the early 1990s, the company had stores from coast to coast: Sam's Club opened in New Jersey in November 1989 and the first California outlet opened in Lancaster in July 1990. A Walmart in York, Pennsylvania opened in October 1990: the first main store in the Northeast.

Walmart's investments outside North America have seen mixed results: its operations and subsidiaries in the United Kingdom, South America, and China are highly successful, whereas its ventures in Germany and South Korea failed.

 
