What Companies Ask For When Hiring Data Engineers (Early 2026)
A snapshot from early 2026. The goal is to separate what companies reliably hire for from what generates the most noise online. The two overlap less than you'd expect.
Summary:
- Baseline: SQL, Python, one cloud platform - these appear in the majority of postings
- Role-defining: Airflow, Spark, dbt, a warehouse - the technologies that separate DE from other roles
- The rest is situational - streaming, infrastructure-as-code, and specific databases all depend on the company
- The long tail matters less than you think. If you only learn four things, learn the baseline plus one role-defining tool.
See methodology at the bottom for how the data was collected and processed.
The Baseline
The most frequently mentioned technologies across data engineering postings, plus an aggregated "AI tools" category. Each bar shows what percentage of postings mention that technology.
SQL and Python sit at the top for a reason. They function less as "skills" and more as prerequisites - roughly the same way reading and writing are prerequisites for a journalism job. About 12% of DE postings don't explicitly list either one - but those jobs still ask for Kafka, Airflow, Snowflake, and Spark. They assume the basics rather than stating them. The interesting question is what comes after.
The three cloud platforms (AWS at 46%, Azure at 35%, GCP at 29%) split the market unevenly, with AWS leading. In practice, the specific platform matters less than understanding one of them well enough to operate infrastructure on it. The concepts - object storage, IAM, managed services - transfer between providers. The CLI commands do not, but that's a weekend of adjustment, not a career shift.
Further down, the shape of the role becomes clearer. Spark and Airflow show up consistently across these postings. dbt is newer but appears frequently enough that it's a standard expectation rather than a differentiator. The warehouse layer - Snowflake (29%), BigQuery (21%), Databricks (32%) - appears across a wide range of postings. Analytical workloads show up more often in purpose-built warehouses than in general-purpose databases.
What Defines Data Engineering
Many of the technologies above also appear in backend, DevOps, and machine learning postings. Python is everywhere. SQL is everywhere. The more useful question is: which technologies appear disproportionately in data engineering postings compared to other technical roles?
The chart below shows, for each technology, the share of data engineering postings that mention it (right) versus the share of all other tech postings (left). Technologies where the bar extends mostly to the right are the most distinctive to data engineering.
Python and SQL lead, but what's notable is the size of the gap even when compared to other technical roles. Python appears in 79% of DE postings versus 34% of other tech postings. SQL shows an even larger relative gap: 74% versus 24%. They are common across tech, but data engineering leans on them harder than most.
Airflow has the clearest separation after that: 40% of DE postings versus 2% for other tech roles. It is the clearest DE signal after the basics. Spark, dbt, Databricks, and Snowflake follow a similar pattern - they appear in 29-39% of DE postings but only 2-5% of other tech roles.
The cloud platforms and Kafka sit in between. AWS appears in 46% of DE postings and 23% of other tech roles - it's widespread, but DE still over-indexes. Kafka (27% DE vs 6% other) has become common enough in broader tech that its percentage-point gap over other roles is more modest than you might expect.
At the bottom of the chart, AI tools are the only category where data engineering falls behind other tech roles: 5% of DE postings versus 11% of other tech roles. Data engineering is under-indexing on AI relative to the broader tech market. This makes sense - AI tooling (LLM APIs, frameworks like LangChain, coding assistants) is showing up more in application development and ML engineering roles than in data infrastructure work. Data engineers build the pipelines that feed AI systems, but the AI tools themselves aren't yet a standard part of the DE toolkit. That may change, but in early 2026, the hiring data doesn't reflect it yet.
Technology Pairs
Technologies are rarely mentioned in isolation. When a company asks for Spark, what else tends to appear in the same posting? The network below shows the strongest co-occurrence patterns across data engineering job ads. Hover over a node to see its top co-occurring technologies.
Nodes are sized by overall mention frequency. Edge thickness reflects how often two technologies appear together. Colors indicate the technology category.
The dense cluster around Python, SQL, Airflow, and Spark is expected - these co-occur in a large share of postings because they represent the baseline toolkit. The Python-SQL pair alone appears in 65% of all DE postings.
dbt is one of the most connected nodes in the graph, linking to Python, SQL, Airflow, Snowflake, and AWS. Its strongest connections are to Python (28%) and SQL (27%), but the Airflow-dbt pair (21%) and Snowflake-dbt pair (16%) stand out - these reflect the modern analytics engineering workflow where dbt models run on a warehouse, orchestrated by Airflow.
Azure pairs strongly with Databricks - that connection is the clearest cloud-warehouse link in the graph. BigQuery connects to Python and SQL but not to GCP directly in the top pairs, suggesting it's treated as a standalone warehouse rather than as part of a GCP stack. Snowflake connects most strongly to SQL, Python, and AWS, with dbt close behind - it runs on all three clouds, and the data reflects that platform-agnostic positioning.
Kafka links to Python, SQL, Spark, and Airflow. Its strongest pair is with Python (21%), followed by SQL and Spark (both 19%). The Kafka-Spark connection is expected - streaming data often feeds into Spark for processing. Terraform sits at the periphery with only two connections: Python and SQL.
What This Means in Practice
The data engineering hiring market is more legible than it looks. Beneath the noise of new tools and shifting terminology, the core expectations have been stable for years: SQL, Python, a cloud platform, and some combination of Airflow, Spark, and a warehouse.
If you're entering the field, the priority order is clear. SQL and Python are non-negotiable. Keep in mind, though, that data engineering is not rocket science - or data science, for that matter. The Python written here is not deep algorithms or concurrent live systems; it's doing simple things reliably and observably.
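To make "simple things reliably and observably" concrete, here is a hypothetical sketch of the kind of Python the role actually involves - the function and field names are invented for illustration, not taken from any posting:

```python
import csv
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def load_rows(path: str) -> list[dict]:
    """Read a CSV, drop rows missing an id, and log what happened.

    Nothing clever: the value is in the validation and the log line,
    which make the step observable when it runs unattended.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    kept = [r for r in rows if r.get("id")]
    log.info("read %d rows, kept %d", len(rows), len(kept))
    return kept
```

The logic is trivial on purpose; in practice the same shape repeats with S3 objects or warehouse tables instead of local files.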
After that, pick a cloud platform (whichever one the companies you're interested in use), learn Airflow or a modern alternative, get comfortable with at least one warehouse, and learn dbt. Everything else - Kafka, Kubernetes, Terraform - is situational. Important, but dependent on the specific role rather than the profession.
If you're already working in data engineering, the more useful signal is what's not here. The long tail of tools that generate online discussion - the latest orchestrator, the newest table format - rarely shows up in actual hiring data at meaningful scale. The tools that define the role today are largely the same ones that defined it two years ago. That's not stagnation; it's maturity.
The technology pairs chart tells the most practical story: technologies cluster around workflows, not vendors. Companies ask for Python + SQL + Airflow + Spark because those map to real pipeline work. The strongest cloud-warehouse pairing is Azure-Databricks - the rest of the warehouses (Snowflake, BigQuery) connect more to the language-and-tooling core than to any specific cloud. That's useful: it means switching clouds is an infrastructure change, not a skills change.
Methodology
This analysis uses
Role definition: A posting is classified as "data engineering" if its title contains "data engineer" or "data platform engineer" (case-insensitive). This captures variants like "Senior Data Engineer," "Data Engineering Manager," and "Data Platform Engineer" while excluding "Data Analyst" or "Data Scientist."
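The rule above amounts to a case-insensitive substring check; a minimal sketch (the function name is ours, not from the analysis code):

```python
def is_data_engineering(title: str) -> bool:
    """Classify a posting as data engineering by its title.

    Matches "data engineer" (which also covers "Data Engineering
    Manager") or "data platform engineer", case-insensitively.
    """
    t = title.lower()
    return "data engineer" in t or "data platform engineer" in t
```

Note that "data platform engineer" does not contain "data engineer" as a substring, which is why both checks are needed.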
Time window: All analyses use postings from early 2026.
Technology pairs: Co-occurrence within the same job posting. If a company lists Spark and Airflow together, it's a strong signal the role will touch both. Job ads are imperfect - some include aspirational or templated requirements - but patterns at scale are informative.
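Pair counting of this kind can be sketched in a few lines - this is an illustrative reconstruction, not the analysis code itself; each posting is represented as a set of mentioned technologies:

```python
from collections import Counter
from itertools import combinations

def pair_shares(postings: list[set[str]]) -> dict[tuple[str, str], float]:
    """For each technology pair, the share of postings mentioning both."""
    counts: Counter = Counter()
    for techs in postings:
        # sorted() makes ("python", "sql") and ("sql", "python") one key
        for pair in combinations(sorted(techs), 2):
            counts[pair] += 1
    n = len(postings)
    return {pair: c / n for pair, c in counts.items()}
```

Edge thickness in the network chart corresponds to these shares; node size corresponds to single-technology shares computed the same way.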
Data source: