GSK group of companies

Data Engineer II

Posted on: 11 Dec 2025

San Francisco, CA

Job Description

The Onyx Research Data Platform organization represents a major investment by GSK R&D and Digital & Tech, designed to deliver a step change in our ability to leverage data, knowledge, and prediction to find new medicines. We are a full-stack shop consisting of product and portfolio leadership, data engineering, infrastructure and DevOps, data / metadata / knowledge platforms, and AI/ML and analysis platforms, all geared toward:

Building a unified, automated, next-generation data experience for GSK’s scientists, engineers, and decision-makers, increasing productivity, and reducing data friction

Providing best-in-class AI/ML, GenAI and data analysis environments to accelerate our predictive capabilities and attract top-tier talent

Aggressively engineering our data at scale to unlock the value of our combined data assets and predictions in real-time

Data Engineering is responsible for the design, delivery, support, and maintenance of industrialised, automated end-to-end data services and pipelines. They apply standardised data models and mapping to ensure data is accessible for end users in end-to-end user tools through the use of APIs. They define and embed best practices and ensure compliance with Quality Management practices and alignment to automated data governance. They also acquire and process internal and external, structured and unstructured data in line with Product requirements.

As a Data Engineer II, you are a technical contributor who can take a well-defined specification for a function, pipeline, service, or other sort of component, devise a technical solution, and deliver it at a high level. You are aware of, and adhere to, best practice for software development in general (and data engineering in particular), including code quality, documentation, DevOps practices, and testing. You ensure robustness of our services and serve as an escalation point in the operation of existing services, pipelines, and workflows. You will work across structured, unstructured, and scientific data domains, applying modern engineering and automation best practices to deliver reliable, scalable, and governed data products. You will also contribute to emerging GenAI-enabled data capabilities, such as embedding pipelines, vectorized data flows, and LLM-ready data products.  

You should be deeply familiar with the most common tools (languages, libraries, etc.) in the data space, such as Spark, Kafka, and Storm, and aware of the open-source communities that revolve around these tools. You have a strong focus on the operability of your tools and services, and you develop, measure, and monitor key metrics for your work, seeking opportunities to improve them.

Key responsibilities include:

Build modular code / libraries / services / etc. using modern data engineering tools (Python/Spark, Kafka, Storm, …) and orchestration tools (e.g. Google Workflow, Airflow Composer)

Produce well-engineered software, including appropriate automated test suites and technical documentation

Develop, measure, and monitor key metrics for all tools and services and consistently seek to iterate on and improve them 

Apply platform abstractions consistently to ensure quality and consistency with respect to logging and lineage

Be fully versed in coding best practices and ways of working, participate in code reviews, and partner with teammates to improve the team’s standards

Adhere to QMS framework and CI/CD best practices 

Provide L3 support to existing tools / pipelines / services  

Why you?

Basic Qualifications:

We are looking for professionals with these required skills to achieve our goals:

Bachelor’s degree in Data Engineering, Computer Science, Software Engineering, or a related discipline

4+ years of data engineering experience

Software engineering experience 

Orchestration tooling experience

Cloud experience (GCP, Azure or AWS)

Experience in automated testing and design

Preferred Qualifications:

If you have the following characteristics, it would be a plus:

New PhD or a Master’s degree with 2+ years of experience

Experience overcoming high-volume, high-compute challenges

Knowledge and use of at least one common programming language: e.g., Python, Scala, Java, including toolchains for documentation, testing, and operations / observability 

Strong experience with modern software development tools / ways of working (e.g. git/GitHub, DevOps tools, metrics / monitoring, …) 

Cloud experience (e.g., AWS, Google Cloud, Azure, Kubernetes) 

Applied experience with CI/CD implementations using git and a common CI/CD stack (e.g. Jenkins, CircleCI, GitLab, Azure DevOps)

Experience with agile software development environments using Jira and Confluence  

Demonstrated experience with common tools and techniques for data engineering (e.g. Spark, Kafka, Storm, …) 

Knowledge of data modelling, database concepts and SQL 

Exposure to GenAI or ML data workflows (vector stores, embeddings, feature pipelines, etc.)


GSK group of companies

Philadelphia, PA

We are a science-led global healthcare company with a special purpose: to help people do more, feel better, live longer.

We have three global businesses that research, develop and manufacture innovative pharmaceutical medicines, vaccines and consumer healthcare products.

Our goal is to be one of the world’s most innovative, best performing and trusted healthcare companies.

Our values and expectations are at the heart of everything we do and help define our culture - so that together we can deliver extraordinary things for our patients and consumers and make GSK a brilliant place to work. 

Our values are Patient Focus, Transparency, Respect, Integrity.


Our expectations are Courage, Accountability, Development, Teamwork. 

Across the US, we employ more than 15,000 people, from our Vaccines R&D headquarters in Maryland to our R&D Hub in Pennsylvania to our nearly 10 manufacturing sites across America. Our employees and our values are at the heart of everything we do.

What we do

We aim to bring differentiated, high-quality and needed healthcare products to as many people as possible, with our three global businesses, scientific and technical know-how and talented people.

Our Pharmaceuticals business has a broad portfolio of innovative and established medicines with commercial leadership in respiratory and HIV. Our R&D approach focuses on science related to the immune system, the use of genetics, and advanced technologies.

 
