After going through multiple job descriptions, and based on my experience in the field, I have put together a detailed list of the skill sets needed to become a competent data engineer. If you are a backend developer, some of your skills will overlap with the list below. Yes, the jump is much easier for you, provided the skill gaps are addressed.

🎯 Must Haves

1️⃣ The Art of Scripting and Automating
Can't stress this enough. The ability to write reusable code, and knowing the common Python libraries and frameworks for:
* Data wrangling operations - Pandas, NumPy, re
* Data scraping - requests/BeautifulSoup/lxml/Scrapy
* Interacting with external APIs and other data sources, logging
* Parallel processing libraries - Dask, multiprocessing
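For instance, a small pipeline chains several of these together. A minimal sketch, assuming a hypothetical JSON API endpoint and made-up column names:

```python
import requests
import pandas as pd

# Hypothetical endpoint returning a JSON list of records
URL = "https://example.com/api/orders"

def fetch_orders(url: str) -> pd.DataFrame:
    """Pull raw records from an external API into a DataFrame."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return pd.DataFrame(response.json())

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Reusable wrangling step: coerce types, drop unusable rows."""
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df.dropna(subset=["order_date", "amount"])

if __name__ == "__main__":
    print(clean_orders(fetch_orders(URL)).head())
```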
2️⃣ Cloud Computing Platforms
The rise of cloud storage and computing has changed a lot for data engineers - so much that being well versed in at least one cloud platform is required.
* Serverless computing, virtual instances, managed Docker and Kubernetes services
* Security standards, user authentication and authorization, Virtual Private Cloud, subnets
Start with the services of any one major provider.
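To make this concrete, here is a minimal sketch of talking to one managed service (S3) with boto3; the bucket name and file paths are placeholders, and credentials are assumed to be configured in the environment:

```python
import boto3

# Assumes AWS credentials are already set up (env vars, ~/.aws, or an IAM role)
s3 = boto3.client("s3")

BUCKET = "my-data-lake-bucket"  # placeholder bucket name

# Upload a local extract, then list what landed under the prefix
s3.upload_file("daily_orders.csv", BUCKET, "raw/orders/daily_orders.csv")

response = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/orders/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```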
3️⃣ Linux OS
The importance of working with the Linux OS is often overlooked.
* File system commands
* Running daemon processes
* Bash scripting concepts like control flow, looping, and passing input parameters

4️⃣ Database Management - Relational Databases, OLAP vs OLTP, NoSQL
* Creating tables; read, write, update, and delete operations; joins; procedures; materialized views; aggregated views; window functions
* Database vs data warehouse; star and snowflake schemas; fact and dimension tables
Common relational databases preferred: PostgreSQL, MySQL, etc.
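A self-contained illustration of the table, join, and aggregation basics, using Python's built-in sqlite3 so no database server is needed (table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create tables and insert rows
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")
cur.execute("INSERT INTO orders VALUES (1, 1, 25.0), (2, 1, 40.0), (3, 2, 10.0)")

# Join plus aggregate - the kind of query behind an aggregated view
cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
print(cur.fetchall())  # [('Ada', 65.0), ('Grace', 10.0)]

# Update and delete round out the basic operations
cur.execute("UPDATE orders SET amount = 30.0 WHERE id = 3")
cur.execute("DELETE FROM orders WHERE id = 1")
conn.commit()
```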
5️⃣ Distributed Data Storage Systems
* Knowledge of how a distributed data store works
* Understanding concepts like partitioned data storage, sorting keys, SerDes, data replication, caching, and persistence
Some of the most widely used: HDFS, AWS S3, or any NoSQL database (MongoDB, DynamoDB, Cassandra)
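The core idea behind partitioned storage fits in a few lines of plain Python - a record's key decides which partition (and hence which node) it lands on. A conceptual sketch, not any particular store's implementation:

```python
import hashlib

NUM_PARTITIONS = 4  # a real store would map partitions to nodes

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash partitioning: the same key always routes to the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

for key in ["user:42", "user:7", "user:42"]:
    print(key, "-> partition", partition_for(key))
# Both "user:42" records land on the same partition, which is what makes
# key-based lookups and replica placement predictable.
```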
6️⃣ Distributed Data Processing Systems
* Common techniques and patterns for data processing, such as partitioning, predicate pushdown, sorting by partition, maintaining the size of shuffle blocks, and window functions
* Leveraging all the cores and memory available in the cluster to improve concurrency
Common distributed processing frameworks: MapReduce, Apache Spark (start with PySpark if you are already comfortable with Python)
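A minimal PySpark sketch of two of those patterns - filtering early so the engine can push the predicate down to the reader, and a window function over partitioned data (the dataset path and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("patterns-demo").getOrCreate()

# Hypothetical partitioned dataset; filtering on the partition column first
# lets Spark prune partitions and push the predicate down to the scan.
orders = spark.read.parquet("s3://my-bucket/orders/")
recent = orders.filter(F.col("order_date") >= "2023-01-01")

# Window function: rank each customer's orders by amount
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
ranked = recent.withColumn("rank", F.row_number().over(w))

ranked.filter("rank = 1").show()  # each customer's largest recent order
```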
7️⃣ ETL/ELT Tools and Modern Workflow Management Frameworks
Different companies pick ETL frameworks in different ways. One with an in-house data engineering team will typically prefer ETL jobs set up on a properly managed workflow management tool for batch processing.
* ETL vs ELT, data connectivity, mapping, metadata, types of data loading
* When to use a workflow management system - Directed Acyclic Graphs (DAGs), cron scheduling, operators
ETL tools: Informatica, Talend
Workflow management frameworks: Airflow, Luigi
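A minimal Airflow sketch of a daily batch DAG with two dependent tasks; the DAG id is made up and the task bodies are placeholders (`schedule` is the Airflow 2.4+ spelling, older versions use `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling from the source...")  # placeholder task body

def load():
    print("loading into the warehouse...")  # placeholder task body

with DAG(
    dag_id="daily_orders_etl",      # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",              # cron-style scheduling
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # an edge in the directed acyclic graph
```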
🎯 Good To Have

8️⃣ Java / JVM-Based Frameworks
Knowledge of a JVM-based language such as Java or Scala will be extremely useful.
* Understand both functional and object-oriented programming concepts
* Many of the high-performance data processing frameworks built on top of Hadoop are written in Scala or Java
JVM-based frameworks: Apache Spark, Apache Flink, etc.
9️⃣ Message Queuing Systems
* Understanding how data ingestion happens in message queues
* What producers and consumers are, and how they are implemented
* Sharding, data retention, replay, de-duplication
Popular message queues: Kafka, RabbitMQ, Kinesis, SQS, etc.
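A minimal producer/consumer sketch using the kafka-python client; the broker address, topic, and group id are placeholders:

```python
import json

from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"  # placeholder broker address
TOPIC = "orders"           # placeholder topic

# Producer: serialize events as JSON and publish them to the topic
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 1, "amount": 25.0})
producer.flush()

# Consumer: join a consumer group and read events off the topic
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="orders-loader",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(json.loads(message.value))
```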
🔟 Stream Data Processing
* Differentiating between real-time, stream, and batch processing
* Sharding, repartitioning, poll wait time, topics/groups, brokers
Commonly used frameworks: AWS Kinesis Streams, Apache Spark, Storm, Samza, etc.
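To see the stream vs batch distinction concretely, here is a Spark Structured Streaming sketch that aggregates an unbounded source over one-minute windows. It uses Spark's built-in `rate` test source, so it runs without any external infrastructure:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The "rate" source emits rows forever - an unbounded input, unlike a batch read
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Aggregate over one-minute event-time windows instead of over a finite table
counts = events.groupBy(F.window("timestamp", "1 minute")).count()

query = (
    counts.writeStream
    .outputMode("update")  # emit window counts as they change
    .format("console")
    .start()
)
query.awaitTermination()
```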
Follow for updates.