<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>A Thoughtful Data Nerd</title><link href="https://www.alex-antonison.com/" rel="alternate"/><link href="https://www.alex-antonison.com/feeds/all.atom.xml" rel="self"/><id>https://www.alex-antonison.com/</id><updated>2026-04-05T00:56:12.380244+00:00</updated><subtitle>Data engineering, architecture, and leadership</subtitle><entry><title>How to Build a Data Platform</title><link href="https://www.alex-antonison.com/posts/how-to-build-a-data-platform/" rel="alternate"/><published>2025-04-06T00:00:00+00:00</published><updated>2026-04-05T00:56:12.380244+00:00</updated><author><name>Alex Antonison</name></author><id>tag:www.alex-antonison.com,2025-04-06:/posts/how-to-build-a-data-platform/</id><summary type="html">&lt;p&gt;When building a data platform, it’s tempting, and admittedly a lot of fun, to dive straight into the latest and greatest technologies, running proof-of-concepts and benchmarking tools to find the top performer. But over the course of my career, I’ve seen this approach fall short time and time …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When building a data platform, it’s tempting, and admittedly a lot of fun, to dive straight into the latest and greatest technologies, running proof-of-concepts and benchmarking tools to find the top performer. But over the course of my career, I’ve seen this approach fall short time and time again.&lt;/p&gt;
&lt;p&gt;At the end of the day, the role of a Data team is to deliver value to the business through data products. While it’s easy to get caught up in building a platform that aims to “do it all,” this often results in projects that take significantly more time and budget than the business is willing to support. Focus should always start with delivering business value, not chasing perfection in infrastructure.&lt;/p&gt;
&lt;h2 id="business-questions"&gt;&lt;a class="toclink" href="#business-questions"&gt;Business Questions&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The first step you should take is to answer the following business questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Clear Business Problem Statement:&lt;/strong&gt; Without a clear problem statement, a data platform can quickly spiral into trying to solve every possible problem. A well-defined problem sharpens the project’s focus and helps identify practical requirements, such as whether data needs to be updated hourly, daily, or in real time. As with any successful project, clearly understanding the business problem is essential before any work begins.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Team skillsets:&lt;/strong&gt; Is the team made up of experienced Data Platform Engineers familiar with running open-source big data solutions, or is it a smaller team, perhaps composed of Data Analysts with some engineering knowledge? If the team lacks deep engineering experience or bandwidth, it's often wiser to lean toward user-friendly managed services. And when bringing in external consultants, be sure that any recommended solution is maintainable by the existing team. Otherwise, you risk becoming dependent on specialized hires just to keep the platform running.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Existing Technology:&lt;/strong&gt; What tools are currently in use across the company? Is there already a Cloud Data Warehouse in place? Are other teams using an orchestrator? Before introducing new solutions, it's important to first assess the existing tooling. Other teams may have already invested time and effort into setting up tools, managing deployments, and negotiating pricing; leveraging this existing work can save significant time and resources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; What is the timeline for implementing a solution? If the deadline is tight, it may be necessary to start with a more expensive, narrowly focused managed service to meet immediate needs, while keeping a longer-term plan in place for building a more scalable and cost-efficient solution. Even with a longer timeline, it's still important to start small and gather iterative feedback from stakeholders to ensure the solution evolves in the right direction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Infrastructure Budget:&lt;/strong&gt; Establishing a realistic budget is essential, though often challenging. The necessary funding will be directly influenced by factors like team capabilities, project timelines, and leveraging existing technologies. To operate within budget constraints while delivering initial value, you may need to strategically narrow the project's scope, such as addressing a subset of key markets initially. This focused strategy ensures the platform delivers tangible value early, within budget, while building a foundation for future scaling.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="data-questions"&gt;&lt;a class="toclink" href="#data-questions"&gt;Data Questions&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Next, once the key business questions are sufficiently answered (and documented), you can move on to the 5Vs + S of data. This topic has been thoroughly covered elsewhere, so I will just highlight them below:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Volume: How much data is needed?&lt;/li&gt;
&lt;li&gt;Velocity: How frequently does the data need to be processed?&lt;/li&gt;
&lt;li&gt;Variety: What kind of data are you working with? Structured? Unstructured?&lt;/li&gt;
&lt;li&gt;Veracity: How trustworthy is the data source? Do you need to build data quality checks into the ingestion pipeline to catch bad data coming in?&lt;/li&gt;
&lt;li&gt;Value: How valuable is the data source? It is important to balance the value of a dataset versus how much it costs to ingest and manage.&lt;/li&gt;
&lt;li&gt;(Extra) Sensitive: When you are dealing with sensitive data, it is crucial to ensure that it is sufficiently protected in motion and at rest.&lt;/li&gt;
&lt;/ol&gt;
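&lt;p&gt;To make the Veracity question concrete, here is a minimal sketch in Python of an ingestion-time data quality check that quarantines records failing validation. The field names and rules are hypothetical, purely for illustration:&lt;/p&gt;

```python
# Hypothetical ingestion-time veracity check: validate each raw record
# before it enters the pipeline, and quarantine anything that fails.

def validate_record(record: dict) -> list[str]:
    """Return a list of data quality problems found in a single record."""
    problems = []
    if not record.get("order_id"):
        problems.append("missing order_id")
    amount = record.get("amount")
    if not (isinstance(amount, (int, float)) and amount >= 0):
        problems.append("amount must be a non-negative number")
    return problems

def split_good_and_bad(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition incoming records into clean rows and a quarantine pile."""
    good, bad = [], []
    for record in records:
        (bad if validate_record(record) else good).append(record)
    return good, bad

raw = [
    {"order_id": "a1", "amount": 19.99},
    {"order_id": "", "amount": -5},  # fails both checks
]
good, bad = split_good_and_bad(raw)
print(len(good), len(bad))  # prints "1 1"
```

&lt;p&gt;The clean rows continue through the pipeline while the quarantined ones can be logged and reviewed, so bad source data never silently lands in a managed dataset.&lt;/p&gt;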
&lt;h2 id="poc-time"&gt;&lt;a class="toclink" href="#poc-time"&gt;POC Time!&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Congratulations! With this information in hand, you can now start considering what tools will make sense to start doing POCs for a data platform.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="other-data-topics-to-consider"&gt;&lt;a class="toclink" href="#other-data-topics-to-consider"&gt;Other data topics to consider&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Manage only data that matters:&lt;/strong&gt; Storing data in low-cost storage is perfectly acceptable, but active data management should be reserved for data that directly supports specific business problems. Resist the urge to "manage everything just in case"; this often leads to scope creep, missed deadlines, and budget overruns. Instead, prioritize managing high-impact data and keep the rest in accessible storage. Raw data can still be made available to analysts and stakeholders to help uncover what’s truly valuable over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Development Workflow:&lt;/strong&gt; Creating a streamlined data development workflow is essential. When the feedback loop for testing and iterating on ideas is too slow, it becomes a major bottleneck, dramatically reducing productivity and slowing innovation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Testing:&lt;/strong&gt; Just like in software engineering, testing is essential, and in data work, it's all about trust. Data teams must ensure the reliability of the data they deliver. Without proper testing, it’s nearly impossible to catch issues introduced by pipeline changes that could break downstream data products. Data tests not only catch technical failures but also surface incorrect assumptions that might otherwise go unnoticed. When thoughtfully designed, they can flag issues early, well before they impact stakeholders or end users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Development Standards:&lt;/strong&gt; In the absence of development standards, whether it be naming conventions, code formatting, or rules around documentation, even simple data platforms can become challenging to maintain. Fixing these problems down the road can also take a significant amount of work and requires rigorous testing to ensure nothing is broken during the refactor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dataset Release Cycle:&lt;/strong&gt; Whenever a dataset is released, the absence of clear release standards can make it difficult to iterate, improve, or even safely remove columns, since such changes may break downstream dashboards or integrations. To avoid this, it’s important to establish clear standards from the start for labeling datasets as alpha, beta, or production-grade. These classifications set expectations for stability and help teams manage change more effectively.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Governance:&lt;/strong&gt; Data should only be shared with the appropriate users across the organization. It’s far easier to establish the right access controls from the start, rather than trying to restrict access later, which can disrupt business processes and potentially lead to compliance or regulatory issues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dataset Discoverability:&lt;/strong&gt; Ensuring datasets are well-documented and effectively communicated across the organization is essential. In the early stages, this might simply mean using your team's existing documentation tools. However, as the data platform scales, adopting a dedicated data catalog can help maintain discoverability and ensure that datasets remain accessible and understandable to all stakeholders.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you have gotten this far, I am sure you have noticed that I have not mentioned any specific technologies throughout this post, largely because the tools themselves do not matter. What is important is figuring out how to break down business problems into solvable parts and identifying the right tool based on the above constraints.&lt;/p&gt;
&lt;p&gt;Happy building!&lt;/p&gt;</content><category term="posts"/></entry><entry><title>Introduction to a Data Lakehouse with AWS Athena, dbt, and Apache Iceberg - Part 1</title><link href="https://www.alex-antonison.com/posts/introduction-to-a-data-lakehouse-part-1/" rel="alternate"/><published>2025-04-04T00:00:00+00:00</published><updated>2026-04-05T00:56:12.380244+00:00</updated><author><name>Alex Antonison</name></author><id>tag:www.alex-antonison.com,2025-04-04:/posts/introduction-to-a-data-lakehouse-part-1/</id><summary type="html">&lt;h3 id="introduction"&gt;&lt;a class="toclink" href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The concept of a Data Lakehouse has been around for a few years now but the barrier to entry for building a Data Lakehouse can seem fairly high. In this blog post series, I will talk through the pros and cons of a Data Lakehouse and then step through …&lt;/p&gt;</summary><content type="html">&lt;h3 id="introduction"&gt;&lt;a class="toclink" href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The concept of a Data Lakehouse has been around for a few years now, but the barrier to entry for building one can seem fairly high. In this blog post series, I will talk through the pros and cons of a Data Lakehouse, then step through how to stand one up, and finally how to maintain it.&lt;/p&gt;
&lt;p&gt;In this post, I introduce what a Data Lakehouse is, contrast it with a Data Lake and a Cloud Data Warehouse, and then talk through the pros and cons. In Part 2, I will dive into setting up a Data Lakehouse using AWS Athena, Apache Iceberg, and dbt. In Part 3, I will dig into the tasks required to maintain a Data Lakehouse in order to manage costs and performance.&lt;/p&gt;
&lt;h3 id="what-is-a-data-lakehouse"&gt;&lt;a class="toclink" href="#what-is-a-data-lakehouse"&gt;What is a Data Lakehouse?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A Data Lakehouse is effectively a Data Lake, except that it uses an Open Table Format (such as Apache Iceberg, Hudi, or Delta Lake) which allows you to perform &lt;a href="https://en.wikipedia.org/wiki/Create,_read,_update_and_delete"&gt;CRUD (Create, Read, Update, Delete)&lt;/a&gt; operations with &lt;a href="https://en.wikipedia.org/wiki/ACID"&gt;ACID (Atomicity, Consistency, Isolation, and Durability)&lt;/a&gt; transactions within a Data Lake. Prior to Open Table Formats, in a Data Lake it was only possible to work within the context of partitions: you could append to an existing partition, overwrite the whole partition, or delete it. If you wanted to perform CRUD operations at the row level, you had to use a Cloud Data Warehouse, which in the context of AWS most commonly means AWS Redshift or Snowflake. However, with Open Table Formats like Apache Iceberg, you can now use any popular compute framework, such as Trino or Spark via AWS EMR, or AWS Athena, to perform CRUD operations with ACID transactions directly in the Data Lake(house).&lt;/p&gt;
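&lt;p&gt;To make the row-level CRUD point concrete, here is a hedged sketch, in Python with hypothetical table and column names, of the kind of MERGE (upsert) statement an engine like Athena or Trino can run against an Iceberg table. Actually executing it would require a configured AWS environment, so the sketch only constructs the SQL:&lt;/p&gt;

```python
# Hypothetical helper that builds an Athena/Trino-style MERGE statement
# for an Iceberg table. The table and column names are illustrative only.

def build_iceberg_merge(target, source, key, columns):
    """Construct a MERGE statement that upserts source rows into target."""
    set_clause = ", ".join(f"{c} = s.{c}" for c in columns)
    insert_cols = ", ".join([key] + columns)
    insert_vals = ", ".join(f"s.{c}" for c in [key] + columns)
    return (
        f"MERGE INTO {target} AS t USING {source} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

sql = build_iceberg_merge("orders", "orders_updates", "order_id", ["status"])
print(sql)
```

&lt;p&gt;This single statement updates matching rows and inserts new ones atomically, the row-level operation that plain partition-oriented Data Lakes could not express.&lt;/p&gt;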
&lt;h3 id="data-lakehouse-vs-cloud-data-warehouse"&gt;&lt;a class="toclink" href="#data-lakehouse-vs-cloud-data-warehouse"&gt;Data Lakehouse vs Cloud Data Warehouse&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;First, I would like to say it is entirely possible for an organization to use a Data Lakehouse in combination with a Cloud Data Warehouse. One possible scenario is to have a Data Platform team that manages a Data Lakehouse where an organization’s data is centralized; smaller Data Analytics or Data Science teams can then use a Cloud Data Warehouse such as Snowflake or AWS Redshift to read in the data they need for building data products.&lt;/p&gt;
&lt;h4 id="where-a-data-lakehouse-excels"&gt;&lt;a class="toclink" href="#where-a-data-lakehouse-excels"&gt;Where a Data Lakehouse Excels&lt;/a&gt;&lt;/h4&gt;
&lt;h5 id="storage-costs"&gt;&lt;a class="toclink" href="#storage-costs"&gt;Storage Costs&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Even though Cloud Data Warehouses have made great efforts to reduce the cost of storing data, they are still much more expensive than S3, especially considering the different storage tiers that AWS S3 provides.&lt;/p&gt;
&lt;h5 id="bring-your-own-compute-framework"&gt;&lt;a class="toclink" href="#bring-your-own-compute-framework"&gt;Bring your own Compute Framework&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Any group can choose its own compute framework, such as Spark or Trino via AWS EMR, AWS Athena, or even a Cloud Data Warehouse, to pull data from a Data Lakehouse. Outside of the S3 API GET requests (which I will later discuss how to minimize), the bulk of the cost falls on the data consumer in whatever compute solution they decide to use.&lt;/p&gt;
&lt;h5 id="data-governance-and-access"&gt;&lt;a class="toclink" href="#data-governance-and-access"&gt;Data Governance and Access&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Data Governance tools like AWS Lake Formation have come a long way in streamlining the granting of fine-grained access (both column and row level) to datasets in a Data Lakehouse.&lt;/p&gt;
&lt;h4 id="challenges-of-a-data-lakehouse"&gt;&lt;a class="toclink" href="#challenges-of-a-data-lakehouse"&gt;Challenges of a Data Lakehouse&lt;/a&gt;&lt;/h4&gt;
&lt;h5 id="complexity"&gt;&lt;a class="toclink" href="#complexity"&gt;Complexity&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;A Data Lakehouse, just like a Data Lake, is much more complex to build and maintain than a Cloud Data Warehouse. First, it requires a metastore like the AWS Glue Data Catalog to manage the metadata about the files and partitions stored within AWS S3. Then, to interact with a Data Lakehouse, you have to bring your own compute framework, such as Spark, Trino, AWS Athena, or even a Cloud Data Warehouse, to access or modify the data.&lt;/p&gt;
&lt;h5 id="regular-maintenance"&gt;&lt;a class="toclink" href="#regular-maintenance"&gt;Regular Maintenance&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Regular maintenance operations are required in order to manage cost and performance. In most Cloud Data Warehouses, maintenance operations are either handled behind the scenes or are at least very streamlined.&lt;/p&gt;
&lt;h3 id="there-is-always-a-tradeoff"&gt;&lt;a class="toclink" href="#there-is-always-a-tradeoff"&gt;There is always a tradeoff&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Like all things in the world of data, there is a trade-off: while a Data Lakehouse is cheaper to run (if well maintained), it requires a specific set of skills to stand up and operate, whereas a Cloud Data Warehouse may carry a larger upfront cost but is easier to stand up, operate, and maintain.&lt;/p&gt;
&lt;h3 id="stay-tuned"&gt;&lt;a class="toclink" href="#stay-tuned"&gt;Stay tuned!&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In my next post, I will be talking through tools you can use to stand up your first Data Lakehouse! These tools will include the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://aws-sdk-pandas.readthedocs.io/en/stable/#"&gt;awswrangler aka AWS SDK for Pandas&lt;/a&gt;: This is a handy python package built and maintained by the AWS Professional Services team that makes building solutions in AWS much easier.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dbt-labs/dbt-adapters/tree/main/dbt-athena"&gt;dbt-athena&lt;/a&gt;: This is the dbt adapter for AWS Athena that makes doing transformations with Athena and Iceberg much more streamlined.&lt;/li&gt;
&lt;li&gt;AWS Services: &lt;a href="https://aws.amazon.com/athena/"&gt;Athena&lt;/a&gt;, &lt;a href="https://aws.amazon.com/glue/"&gt;Glue&lt;/a&gt;, &lt;a href="https://aws.amazon.com/s3/"&gt;S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org/"&gt;Apache Iceberg&lt;/a&gt;: In the AWS ecosystem, Apache Iceberg has come out on top as one of the better supported Open Table Formats over Hudi and Delta Lake.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note: The scope of this blog post series will not include how to run this in production as it could be its own blog post series. Popular solutions include using &lt;a href="https://dagster.io/"&gt;Dagster&lt;/a&gt;, &lt;a href="https://airflow.apache.org/"&gt;Airflow&lt;/a&gt; with &lt;a href="https://www.astronomer.io/docs/learn/airflow-dbt/"&gt;Cosmos&lt;/a&gt;, or &lt;a href="https://aws.amazon.com/step-functions/"&gt;AWS Step Functions&lt;/a&gt; with &lt;a href="https://aws.amazon.com/ecs/"&gt;ECS Fargate Tasks&lt;/a&gt;.&lt;/p&gt;</content><category term="posts"/></entry><entry><title>HubSpot for Job Hunting</title><link href="https://www.alex-antonison.com/posts/hubspot-for-job-hunting/" rel="alternate"/><published>2023-06-02T00:00:00+00:00</published><updated>2026-04-05T00:56:12.380244+00:00</updated><author><name>Alex Antonison</name></author><id>tag:www.alex-antonison.com,2023-06-02:/posts/hubspot-for-job-hunting/</id><summary type="html">&lt;h2 id="the-problem"&gt;&lt;a class="toclink" href="#the-problem"&gt;The Problem&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When I initially posted to LinkedIn that I was looking for my next job opportunity, I was a bit overwhelmed with the amount of responses I received. I feel very fortunate to have an amazing network of colleagues that reached out to see how they could help. However …&lt;/p&gt;</summary><content type="html">&lt;h2 id="the-problem"&gt;&lt;a class="toclink" href="#the-problem"&gt;The Problem&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When I initially posted to LinkedIn that I was looking for my next job opportunity, I was a bit overwhelmed by the number of responses I received. I feel very fortunate to have an amazing network of colleagues that reached out to see how they could help. However, I quickly realized I needed to start tracking conversations, recommended companies, and job applications. My first idea was to spin up a Google Sheet that included a Contacts, Companies, Job Applications, and Activities tab.&lt;/p&gt;
&lt;p&gt;As I started to fill this out, I quickly identified a handful of issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This was going to be &lt;strong&gt;very&lt;/strong&gt; manual&lt;/li&gt;
&lt;li&gt;Typos would be the end of me&lt;/li&gt;
&lt;li&gt;I would have to use something like Google Data Studio to combine information across tabs&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="a-more-automated-approach"&gt;&lt;a class="toclink" href="#a-more-automated-approach"&gt;A More Automated Approach&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As I was looking at how I had organized my Job Hunting Google Sheet, I realized it looked very similar to how I have worked with data out of CRMs like Salesforce and HubSpot. After some quick googling, I found this blog post by HubSpot, &lt;a href="https://blog.hubspot.com/customers/job-search-in-hubspot-crm" target="_blank"&gt;How to Organize Your Job Hunt in HubSpot CRM&lt;/a&gt;, and it inspired me to give it a shot.&lt;/p&gt;
&lt;h2 id="getting-hubspot-setup-for-job-hunting"&gt;&lt;a class="toclink" href="#getting-hubspot-setup-for-job-hunting"&gt;Getting HubSpot setup for Job Hunting&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="setting-up-companies"&gt;&lt;a class="toclink" href="#setting-up-companies"&gt;Setting up Companies&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The first thing I did was go into HubSpot and customize the entry forms and table views for Companies and Contacts to line up with the kind of information I wanted to capture. One of the first parts of HubSpot I really liked was that when adding a company, all I had to do was input the company's website and HubSpot would automatically pull in information about the company such as industry, employee size, etc. Once I had a company added, I could go in and add people to that company to establish a relationship between the people I was talking with and the companies they worked at.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This is what my company view looks like with the fake company I set up for testing purposes." src="/images/hubspot-companies-example.png"&gt;&lt;/p&gt;
&lt;p&gt;One more involved adjustment I wanted to make was to give myself a way of specifying a Type of company. By default, HubSpot &lt;code&gt;Company.Type&lt;/code&gt; was tailored more towards sales but fortunately, you can go into their data model and customize various elements. You can accomplish this by going to Profile &amp;amp; Preferences &amp;gt; Data Management &amp;gt; Objects &amp;gt; Companies. From there, I changed Type to be Startup, Consulting, Recruitment, Company, and University.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This is the custom Company Type field dropbox I created" src="/images/hubspot-company-type-dropdown.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="You get to Company &amp;gt; Type Edit property via Profile &amp;amp; Preferences &amp;gt; Data Management &amp;gt; Objects &amp;gt; Companies" src="/images/hubspot-company-field-properties.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Once in the Type Object, you can customize the Types" src="/images/hubspot-company-type-edit.png"&gt;&lt;/p&gt;
&lt;h3 id="setting-up-contacts"&gt;&lt;a class="toclink" href="#setting-up-contacts"&gt;Setting up Contacts&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For Contacts, I took the approach of keeping it as simple as possible, so I cut down the input form quite a bit. While they had a field for Twitter, HubSpot did not include one for LinkedIn. I thought it would be helpful to look at people's LinkedIn profiles prior to a call or in-person meeting, so I co-opted the Website URL field to hold people's LinkedIn URLs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Contact Entry Form" src="/images/hubspot-contact-entry-form.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="This is what my contacts view looks like. I recommend adding yourself as a contact as you can try things out like e-mail, scheduling, etc." src="/images/hubspot-contacts-example.png"&gt;&lt;/p&gt;
&lt;p&gt;One feature that has been helpful for me is HubSpot's built-in data quality check, which will call out if you have already added someone with the same e-mail as an existing contact.&lt;/p&gt;
&lt;p&gt;&lt;img alt="HubSpot Identifies Contacts with duplicate e-mails" src="/images/hubspot-contact-data-quality.png"&gt;&lt;/p&gt;
&lt;h3 id="setting-up-job-applications-aka-deals"&gt;&lt;a class="toclink" href="#setting-up-job-applications-aka-deals"&gt;Setting up Job Applications (aka Deals)&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With all of my companies and contacts added, I then set up Deals as Job Applications. For this, I went into the Profile &amp;amp; Preferences &amp;gt; Data Management &amp;gt; Objects &amp;gt; Deals and changed the Pipelines to map to the different parts of the application process.&lt;/p&gt;
&lt;p&gt;&lt;img alt="My initial pass at setting up the different stages of a job application process." src="/images/hubspot-deals-to-job-applications.png"&gt;&lt;/p&gt;
&lt;p&gt;With that set up, I then tailored the Deal submit form down to be short so it would be quick to add Job Applications. I can also tie those job applications to people and/or companies. In the Deal Description section, I put the job's URL so I can quickly bring up the information about the job.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Job Application/Deal Entry Form" src="/images/hubspot-submit-job-application.png"&gt;&lt;/p&gt;
&lt;p&gt;Since I can connect both a Company and Contacts to a Deal, in my Deals view I can see which job applications have had recent activity with the target company. I also included an "Amount" column that will be populated in the event I receive an offer from a company.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This is what my HubSpot Deals view looks like" src="/images/hubspot-deals-view.png"&gt;&lt;/p&gt;
&lt;h2 id="using-hubspot"&gt;&lt;a class="toclink" href="#using-hubspot"&gt;Using HubSpot&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now that I have Companies, Contacts, and Job Applications set up, my workflow for using HubSpot involves logging e-mails, calls, and meetings with a Contact.&lt;/p&gt;
&lt;p&gt;&lt;img alt="My example contact showing what you can capture" src="/images/hubspot-contact-activities.png"&gt;&lt;/p&gt;
&lt;p&gt;When you log anything, there is an option at the bottom to set up a "Follow up task" with either a default or custom follow up date. I have to say this is easily one of my favorite features since it streamlines setting up reminders to follow up.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example logging an activity" src="/images/hubspot-logging-an-email.png"&gt;&lt;/p&gt;
&lt;h2 id="other-interesting-features"&gt;&lt;a class="toclink" href="#other-interesting-features"&gt;Other Interesting Features&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Since it only took a few hours to get Companies, Contacts, and Job Applications set up, I decided to explore some other features of HubSpot's free tier that could help automate other aspects of my job hunt.&lt;/p&gt;
&lt;h3 id="free-meeting-scheduler"&gt;&lt;a class="toclink" href="#free-meeting-scheduler"&gt;Free Meeting Scheduler&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you are open to connecting your Google e-mail to your HubSpot platform, &lt;a href="https://www.hubspot.com/products/sales/schedule-meeting" target="_blank"&gt;HubSpot has a free meeting scheduler&lt;/a&gt; that can connect directly with your Google Calendar to streamline people scheduling time with you.&lt;/p&gt;
&lt;h3 id="sending-e-mails-from-hubspot"&gt;&lt;a class="toclink" href="#sending-e-mails-from-hubspot"&gt;Sending E-mails from HubSpot&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To help connect e-mail conversations with people you are networking with to potential job opportunities, you can use &lt;a href="https://knowledge.hubspot.com/email/send-and-reply-to-one-to-one-emails" target="_blank"&gt;HubSpot to send one-on-one e-mails to a contact&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="feel-free-to-reach-out"&gt;&lt;a class="toclink" href="#feel-free-to-reach-out"&gt;Feel free to reach out&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I hope other people can find HubSpot to be as helpful as I have in their job hunt! If you have any questions or better ideas on how I could improve this, please feel free to reach out!&lt;/p&gt;</content><category term="posts"/></entry><entry><title>The Process of Data</title><link href="https://www.alex-antonison.com/posts/the-process-of-data/" rel="alternate"/><published>2023-04-24T00:00:00+00:00</published><updated>2026-04-05T00:56:12.380244+00:00</updated><author><name>Alex Antonison</name></author><id>tag:www.alex-antonison.com,2023-04-24:/posts/the-process-of-data/</id><summary type="html">&lt;p&gt;Do you have complex processes that you are struggling to streamline or automate? If so, my recommended approach has some key benefits:&lt;/p&gt;
&lt;p&gt;⭐ Shorter timelines for improving an overall process&lt;/p&gt;
&lt;p&gt;⭐ Faster ROI when investing in technology&lt;/p&gt;
&lt;p&gt;⭐ Purpose built tools that are less expensive and easier to use&lt;/p&gt;
&lt;p&gt;⭐ Improved data quality&lt;/p&gt;
&lt;p&gt;⭐ Improved …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Do you have complex processes that you are struggling to streamline or automate? If so, my recommended approach has some key benefits:&lt;/p&gt;
&lt;p&gt;⭐ Shorter timelines for improving an overall process&lt;/p&gt;
&lt;p&gt;⭐ Faster ROI when investing in technology&lt;/p&gt;
&lt;p&gt;⭐ Purpose built tools that are less expensive and easier to use&lt;/p&gt;
&lt;p&gt;⭐ Improved data quality&lt;/p&gt;
&lt;p&gt;⭐ Improved data accessibility and ease of use&lt;/p&gt;
&lt;p&gt;🙈 Problem: Building tools or platforms that streamline an entire or multiple parts of a process can be complex to build, complex to use, and expensive. This can also prevent you from being able to use existing purpose built tools since any one tool may only be able to streamline a portion of your process. 🙈&lt;/p&gt;
&lt;p&gt;💡 My recommended steps are: 💡 &lt;/p&gt;
&lt;p&gt;1️⃣ Document the whole process.
This could include an overall process flow along with procedural steps for each part. This step is the most important since it is challenging to streamline an undocumented process.&lt;/p&gt;
&lt;p&gt;2️⃣ Break the process down into separate parts.
When a process is broken down into separate parts, it allows for less complex solutions to be used to automate it versus attempting to streamline an entire process with one tool.&lt;/p&gt;
&lt;p&gt;3️⃣ Identify one part of the process that is the most important and/or time consuming.
Rarely are all parts in a process made equal. Usually there are a handful of parts that are either crucial to success and/or time consuming.&lt;/p&gt;
&lt;p&gt;4️⃣ Automate or Streamline
Automate or streamline the part of the process identified in 3️⃣. This could involve using an existing tool or building a new tool. ⚠ If a tool needs to be built, it is important to keep in mind this new tool is only meant to solve a single part of the overall process. ⚠&lt;/p&gt;
&lt;p&gt;5️⃣ Iterate. Start back over at 1️⃣
Once you have automated or streamlined the most important and/or time consuming part of a process, it is helpful to go through the whole exercise again. This will help inform what worked, what didn’t work, and things to keep in mind when continuing to streamline a process.&lt;/p&gt;
&lt;p&gt;Some final notes:&lt;/p&gt;
&lt;p&gt;🗒 Key people that do the process should be involved. There are always nuances to a process that can be missed in documentation.&lt;/p&gt;
&lt;p&gt;🗒 It is helpful to try and keep an overall technology architecture in mind. Selecting tools that can't work with one another leads to data silos and forces parts of a process to stay manual.&lt;/p&gt;
&lt;p&gt;🗒 Have a high level understanding of what an ideal end state/process looks like. This is helpful in thinking beyond the current process and can help inform technologies to use.&lt;/p&gt;</content><category term="posts"/></entry><entry><title>Tool Selection Framework</title><link href="https://www.alex-antonison.com/posts/tool-selection-framework/" rel="alternate"/><published>2023-03-01T00:00:00+00:00</published><updated>2026-04-05T00:56:12.380244+00:00</updated><author><name>Alex Antonison</name></author><id>tag:www.alex-antonison.com,2023-03-01:/posts/tool-selection-framework/</id><summary type="html">&lt;h3 id="overview"&gt;&lt;a class="toclink" href="#overview"&gt;Overview&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The goal of the Data Tools Conceptual Framework is to highlight various tools that can be used to process and analyze data and data storytelling. While most tools can be used to accomplish a task, there are instances where some tools may be better suited to complete a given …&lt;/p&gt;</summary><content type="html">&lt;h3 id="overview"&gt;&lt;a class="toclink" href="#overview"&gt;Overview&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The goal of the Data Tools Conceptual Framework is to highlight various tools that can be used to process and analyze data and support data storytelling. While most tools can be used to accomplish a task, there are instances where some tools may be better suited to a given project.&lt;/p&gt;
&lt;p&gt;To distill a project into something that can be evaluated, it can be broken down into individual work units. A project is then evaluated on the volume and complexity of its work units; projects with complex work units, a large volume of work units, or both will notably change which tools are better suited.&lt;/p&gt;
&lt;p&gt;The key metrics used when evaluating data tools are the learning curve and how productive a person can be with the tool. While some tools have a lower learning curve and are easier to pick up, for a large and/or complex project, a tool that makes a person more productive can allow the project to be completed faster and with higher quality.&lt;/p&gt;
&lt;h3 id="defining-a-project"&gt;&lt;a class="toclink" href="#defining-a-project"&gt;Defining a Project&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A project consists of one or many work units. A work unit is a loose concept meant to represent the steps taken to complete a part of a project. When evaluating work units, it is important to keep in mind the volume and complexity of work units for a project.&lt;/p&gt;
&lt;h4 id="complexity"&gt;&lt;a class="toclink" href="#complexity"&gt;Complexity&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The complexity of a work unit is a function of how many steps it contains and whether it has nuances that depend on the input. Work units with higher complexity take longer to complete and increase the opportunity for human error.&lt;/p&gt;
&lt;h4 id="volume"&gt;&lt;a class="toclink" href="#volume"&gt;Volume&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The volume of work units in a project is how many items need to be processed. This could range from working with a single data file downloaded from a data platform to processing three hundred sensor data files. Volume is important to keep in mind because even when individual work units have low complexity, a significant volume can make the overall project take a long time and increase the likelihood of errors.&lt;/p&gt;
&lt;h3 id="data-tool-metrics"&gt;&lt;a class="toclink" href="#data-tool-metrics"&gt;Data Tool Metrics&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When looking at data tools, a comparison will be made between the learning curve of the tool and how productive a person can be with that tool.&lt;/p&gt;
&lt;h4 id="learning-curve"&gt;&lt;a class="toclink" href="#learning-curve"&gt;Learning Curve&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The learning curve represents how long it will take to initially learn the tool.&lt;/p&gt;
&lt;p&gt;For scoring, a high score represents a lower learning curve, and a low score represents a higher learning curve.&lt;/p&gt;
&lt;h4 id="productivity"&gt;&lt;a class="toclink" href="#productivity"&gt;Productivity&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Productivity is determined by how efficiently a person can complete an individual work unit. This matters when a project has a large volume of work units, or when a work unit must be repeated because an error was discovered or a step was missed.&lt;/p&gt;
&lt;p&gt;For scoring, a high score represents a tool that allows a person to be very productive, while a low score means the tool is more manual.&lt;/p&gt;
&lt;h3 id="data-tools"&gt;&lt;a class="toclink" href="#data-tools"&gt;Data Tools&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Data Tools will be evaluated across the following categories: Spreadsheet, Data Visualization, and Programming Language. An overview will be provided of where each tool excels and struggles in relation to its score.&lt;/p&gt;
&lt;h4 id="spreadsheet"&gt;&lt;a class="toclink" href="#spreadsheet"&gt;Spreadsheet&lt;/a&gt;&lt;/h4&gt;
&lt;h5 id="overview_1"&gt;&lt;a class="toclink" href="#overview_1"&gt;Overview&lt;/a&gt;&lt;/h5&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Learning Curve&lt;/th&gt;
&lt;th&gt;Productivity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Excel and Google Sheets are both good examples of a spreadsheet tool. What is great about them is that both have a low learning curve to get started, and over the years they have added many features that make it easy to go from structured data to insights, ranging from pivot tables that allow for easily aggregating and summarizing data to a variety of pre-baked charts that allow a decent amount of customization.&lt;/p&gt;
&lt;p&gt;However, the lower learning curve comes with low productivity. Since spreadsheet tools only allow manual input, each work unit must be done by hand. This becomes an issue for projects with a large number of work units or work units with high complexity.&lt;/p&gt;
&lt;p&gt;One caveat: macros can be written in a spreadsheet tool to enhance its functionality, but at that point it would be more beneficial to take a step back and evaluate whether a Data Preparation Tool or Programming Language would be better suited for the project.&lt;/p&gt;
&lt;h5 id="capabilities"&gt;&lt;a class="toclink" href="#capabilities"&gt;Capabilities&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Data Preparation&lt;/li&gt;
&lt;li&gt;Data Storytelling&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="ideal-project"&gt;&lt;a class="toclink" href="#ideal-project"&gt;Ideal Project&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Excel and Google Sheets shine when a project has a low volume of work units and the complexity of those work units is low. In this scenario, the project takes little time to complete, and the extra time can be spent ensuring its quality.&lt;/p&gt;
&lt;h5 id="cost-and-availability"&gt;&lt;a class="toclink" href="#cost-and-availability"&gt;Cost and Availability&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;While Google Sheets is free with a Google Account, Microsoft Excel must be purchased unless you are at a company or school that provides it. Additionally, Microsoft Excel works best on a Windows PC, while Google Sheets is browser based and works on any operating system.&lt;/p&gt;
&lt;h4 id="data-visualization-tools"&gt;&lt;a class="toclink" href="#data-visualization-tools"&gt;Data Visualization Tools&lt;/a&gt;&lt;/h4&gt;
&lt;h5 id="overview_2"&gt;&lt;a class="toclink" href="#overview_2"&gt;Overview&lt;/a&gt;&lt;/h5&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Learning Curve&lt;/th&gt;
&lt;th&gt;Productivity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Data Visualization Tools such as Tableau, Juicebox, and PowerBI allow for creating data visualizations from structured, cleansed data. Their additional features allow a person to be more productive than with a Spreadsheet tool.&lt;/p&gt;
&lt;h5 id="capabilities_1"&gt;&lt;a class="toclink" href="#capabilities_1"&gt;Capabilities&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Data Visualization&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="ideal-project_1"&gt;&lt;a class="toclink" href="#ideal-project_1"&gt;Ideal Project&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Data Visualization tools excel on projects where the data has already been prepared. They work best with a low volume of lower-complexity work units, though their additional features do allow for some added complexity. They will still struggle with a higher volume of work.&lt;/p&gt;
&lt;h5 id="cost-and-availability_1"&gt;&lt;a class="toclink" href="#cost-and-availability_1"&gt;Cost and Availability&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Data Visualization Tools must be purchased, which makes them more costly and less accessible.&lt;/p&gt;
&lt;h4 id="programming-language"&gt;&lt;a class="toclink" href="#programming-language"&gt;Programming Language&lt;/a&gt;&lt;/h4&gt;
&lt;h5 id="overview_3"&gt;&lt;a class="toclink" href="#overview_3"&gt;Overview&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;While Programming Languages like R and Python have the greatest learning curve, they also allow a person to be the most productive and make it easier to produce higher-quality end products. Additionally, because R and Python are both open source, they are free and very accessible.&lt;/p&gt;
&lt;p&gt;One of the greatest challenges to getting started with R and Python has been setting them up on a computer. However, there are free offerings that allow both R and Python to run in a browser without any setup on a personal computer.&lt;/p&gt;
&lt;h5 id="capabilities_2"&gt;&lt;a class="toclink" href="#capabilities_2"&gt;Capabilities&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Data Preparation&lt;/li&gt;
&lt;li&gt;Data Visualization&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="ideal-project_2"&gt;&lt;a class="toclink" href="#ideal-project_2"&gt;Ideal Project&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Programming Languages have packages that support a wide variety of project types. However, taking into consideration the learning curve and the time it takes to set up a project, ideal projects for Programming Languages are ones that require reproducibility, involve processing a large number of items, and have complex steps.&lt;/p&gt;
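&lt;p&gt;As a minimal sketch of why a programming language pays off at volume, the hypothetical Python loop below applies the same counting step to every CSV file in a folder. The file names and the per-file rule are illustrative assumptions, not part of the framework:&lt;/p&gt;

```python
import csv
import pathlib
import tempfile

def clean_file(path: pathlib.Path) -> int:
    """Count rows in a CSV file that have a non-empty 'value' field."""
    with path.open(newline="") as f:
        return sum(1 for row in csv.DictReader(f) if row.get("value"))

# Build a tiny stand-in for "three hundred sensor data files".
folder = pathlib.Path(tempfile.mkdtemp())
for i in range(3):
    (folder / f"sensor_{i}.csv").write_text("value\n1\n2\n")

# One loop handles every work unit the same way, however many there are.
counts = {p.name: clean_file(p) for p in sorted(folder.glob("*.csv"))}
print(counts)
```

&lt;p&gt;Whether the folder holds three files or three hundred, the code stays the same size, which is exactly where a programming language pulls ahead of manual work units.&lt;/p&gt;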
&lt;h5 id="cost-and-accessibility"&gt;&lt;a class="toclink" href="#cost-and-accessibility"&gt;Cost and Accessibility&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;There is no cost to using R or Python, and they are accessible to anyone with a computer and an internet connection.&lt;/p&gt;
&lt;h5 id="overall-scoring"&gt;&lt;a class="toclink" href="#overall-scoring"&gt;Overall Scoring&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Below is a list of scores for several types of tools.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Learning Curve&lt;/th&gt;
&lt;th&gt;Productivity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Spreadsheet Tool (Excel, Google Sheets)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Visualization (Tableau, Juicebox, PowerBI)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Programming Language (Python, R)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="conclusion"&gt;&lt;a class="toclink" href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For projects that have a low volume of work with low complexity, a Spreadsheet Tool will be sufficient. However, if either the complexity or volume of work is high, over the lifetime of a project it would be beneficial to explore using a combination of programming languages and data visualization tools.&lt;/p&gt;</content><category term="posts"/></entry><entry><title>Impactful Data &gt; Big Data</title><link href="https://www.alex-antonison.com/posts/impactful-data-over-big-data/" rel="alternate"/><published>2022-11-22T00:00:00+00:00</published><updated>2026-04-05T00:56:12.380244+00:00</updated><author><name>Alex Antonison</name></author><id>tag:www.alex-antonison.com,2022-11-22:/posts/impactful-data-over-big-data/</id><summary type="html">&lt;h3 id="preface"&gt;&lt;a class="toclink" href="#preface"&gt;Preface&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In the world of data, data is categorized as either "big data" or just "data." This can be problematic since when an organization doesn't have "big data," more often than not they do not take time to automate processing their data nor take care in efficiently managing it. On …&lt;/p&gt;</summary><content type="html">&lt;h3 id="preface"&gt;&lt;a class="toclink" href="#preface"&gt;Preface&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In the world of data, data is categorized as either "big data" or just "data." This can be problematic: when an organization doesn't have "big data," more often than not it does not take the time to automate processing its data or take care to manage it efficiently. On the other hand, an organization may have what it considers "big data" and spend a significant amount of effort and money managing it, even with no clear use-case. Because of this, I think the focus should shift from "big data" to "impactful data."&lt;/p&gt;
&lt;h3 id="what-is-data"&gt;&lt;a class="toclink" href="#what-is-data"&gt;What is data?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before explaining what impactful data is, a quick definition of data. As a Data Engineer, to me data is a collection of values representing a transaction, measurement, or event, along with information that provides context. An example could be data coming off sensors placed in a river. The sensor is set to collect measurements of pH, temperature, and dissolved oxygen. Each measurement has a timestamp associated with it, along with information about when and where the sensor was placed, when it was collected, quality-check information about the sensor, etc. For a more detailed definition of data, &lt;a href="https://en.wikipedia.org/wiki/Data"&gt;Wikipedia&lt;/a&gt; has a good article on it.&lt;/p&gt;
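&lt;p&gt;The sensor example above can be sketched as a small record type: values plus the context that makes them data. The field names and values below are illustrative assumptions, not a real schema:&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative only: field names and values are assumptions, not a real schema.
@dataclass
class SensorReading:
    sensor_id: str          # which river sensor took the measurement
    measured_at: datetime   # timestamp providing context for the values
    ph: float
    temperature_c: float
    dissolved_oxygen_mg_l: float

reading = SensorReading(
    sensor_id="river-04",
    measured_at=datetime(2022, 6, 1, 12, 30, tzinfo=timezone.utc),
    ph=7.2,
    temperature_c=18.5,
    dissolved_oxygen_mg_l=8.1,
)
print(reading.sensor_id, reading.ph)
```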
&lt;h3 id="what-is-impactful-data"&gt;&lt;a class="toclink" href="#what-is-impactful-data"&gt;What is impactful data?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To put it simply, impactful data has a clear use-case that can help solve a problem and/or answer a question that will inform a decision. This could range from data behind a KPI metric dashboard, to data collected in a scientific study, to data that helps drive government policy. In the absence of a clear use-case, my recommendation would be to document the data and archive it in cheaper object storage, such as &lt;a href="https://aws.amazon.com/s3/"&gt;AWS S3&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="should-you-automate-processing-and-management-of-small-volumes-of-impactful-data"&gt;&lt;a class="toclink" href="#should-you-automate-processing-and-management-of-small-volumes-of-impactful-data"&gt;Should you automate processing and management of small volumes of impactful data?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Yes! Regardless of the volume of the impactful data, taking the time to streamline processing and managing the data can have a wide variety of benefits.&lt;/p&gt;
&lt;h4 id="automating-data-processing"&gt;&lt;a class="toclink" href="#automating-data-processing"&gt;Automating Data Processing&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Automating data processing reduces the time it takes to gain access to the data and to insights from it. It also reduces the potential for human error when processing data manually. This could start as a simple Python or R script run locally. As the volume and frequency of data increase, this could evolve into building a serverless data pipeline in AWS using services like &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt;, &lt;a href="https://aws.amazon.com/step-functions/"&gt;AWS Step Functions&lt;/a&gt;, and/or &lt;a href="https://aws.amazon.com/fargate/"&gt;AWS Fargate&lt;/a&gt;. Depending on the situation, there are pros and cons to running locally versus deploying a data pipeline to a cloud platform. Regardless of the implementation, automating data processing leads to faster access to the data and insights, along with fewer human errors.&lt;/p&gt;
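&lt;p&gt;A simple local script of the kind described above might look like the following sketch. The record format and the &lt;code&gt;temp_c&lt;/code&gt; field are hypothetical; the point is that the same rules run identically every time, with no copy/paste errors:&lt;/p&gt;

```python
import json

def process(raw_lines):
    """Parse JSON-lines records, drop malformed ones, and summarize."""
    readings = []
    for line in raw_lines:
        try:
            record = json.loads(line)
            readings.append(float(record["temp_c"]))
        except (ValueError, KeyError):
            continue  # skip malformed records instead of failing the whole run
    return {"count": len(readings), "avg_temp_c": sum(readings) / len(readings)}

# Hypothetical raw input, including one malformed line.
raw = ['{"temp_c": 18.0}', 'not json', '{"temp_c": 20.0}']
print(process(raw))
```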
&lt;h4 id="data-management"&gt;&lt;a class="toclink" href="#data-management"&gt;Data Management&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When does it make sense to build a data warehouse and marts? Well..."it depends." If you are pulling in data from multiple disparate sources that you need to normalize into a single harmonized data model, then a data warehouse and curated data marts may be appropriate. If the data is coming from a single source, it may be sufficient to streamline processing it and serve the results up via a &lt;a href="https://aws.amazon.com/quicksight/"&gt;Quicksight Dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="when-does-volume-of-data-matter"&gt;&lt;a class="toclink" href="#when-does-volume-of-data-matter"&gt;When does volume of data matter?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There are some cases where the volume of data does matter, such as when evaluating a data platform architecture or exploring using Machine Learning to solve a problem.&lt;/p&gt;
&lt;h4 id="evaluating-data-platform-architecture"&gt;&lt;a class="toclink" href="#evaluating-data-platform-architecture"&gt;Evaluating Data Platform Architecture&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When building a data platform, it is important to take into consideration the 3 V's of data (volume, velocity, and variety). This will inform the kinds of technologies a Data Engineer should consider in order to ensure the data stack can handle the use-case. For example, for a platform managing terabytes of data, a Data Engineer should consider frameworks that support parallel computing, such as Spark on AWS EMR; for gigabytes of data, something simpler, such as Lambda functions to process the data and AWS Athena to query it, may be enough.&lt;/p&gt;
&lt;h4 id="machine-learning-applications"&gt;&lt;a class="toclink" href="#machine-learning-applications"&gt;Machine Learning Applications&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When Machine Learning came onto the scene, it brought with it the need for large volumes of data to train an algorithm to perform a task such as image classification or natural language processing. In these cases, the larger the training dataset, the better the trained algorithm will usually perform. This of course assumes the training dataset is well suited to the problem being solved, but that deserves its own blog post.&lt;/p&gt;
&lt;h3 id="conclusion"&gt;&lt;a class="toclink" href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is important to think critically about the actual value of a dataset over its perceived value. Also, for any dataset that is considered valuable, there are numerous benefits to investing time and energy in automating the processing and management of the data. This will help improve data quality and ensure data is accessible in a timely manner.&lt;/p&gt;
&lt;p&gt;Have you ever tried to share a project with a colleague and they struggled to run it? Have you ever spent hours trying to get a machine learning library to work? If so, there are solutions out there that can help with this!  In this post, I will talk …&lt;/p&gt;</summary><content type="html">&lt;h3 id="preface"&gt;&lt;a class="toclink" href="#preface"&gt;Preface&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Have you ever tried to share a project with a colleague and they struggled to run it? Have you ever spent hours trying to get a machine learning library to work? If so, there are solutions that can help! In this post, I will talk about virtual environments, a utility that can be used to manage project-specific packages.&lt;/p&gt;
&lt;h3 id="venvrenv"&gt;&lt;a class="toclink" href="#venvrenv"&gt;venv/renv&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When starting up a project, the first step I take is to spin up a virtual environment. In Python, my personal favorite is the virtual environment utility built directly into the language, &lt;a href="https://docs.python.org/3/tutorial/venv.html"&gt;python venv&lt;/a&gt;. In R, I like to use the package &lt;a href="https://rstudio.github.io/renv/"&gt;renv&lt;/a&gt; supported by RStudio. I do this for the following reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clean slate: When installing packages in a virtual environment, you can either use the latest versions of packages without worrying about impacting other projects or choose to use specific versions of packages.&lt;/li&gt;
&lt;li&gt;Collaboration: If I am working on a project with others, I can share the virtual environment configuration files, and they can spin up a virtual environment and run my code with the same packages.&lt;/li&gt;
&lt;li&gt;Reproducibility: When I set the project down and I want to pick it up at a later time, I can ensure I am using the same packages as when I first started the project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For Python, you can run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/path/to/project
&lt;span class="c1"&gt;# creates the virtual environment&lt;/span&gt;
python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;venv&lt;span class="w"&gt; &lt;/span&gt;venv
&lt;span class="c1"&gt;# activates the virtual environment in your terminal&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once the environment is created and activated, you can then install all of the packages you need for your project and save them to a &lt;code&gt;requirements.txt&lt;/code&gt; file that should be included with your project when sharing.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip&lt;span class="w"&gt; &lt;/span&gt;freeze&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;requirements.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
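&lt;p&gt;On the receiving end, a collaborator can recreate the same environment from that file. A sketch, assuming a POSIX shell:&lt;/p&gt;

```shell
cd /path/to/project
python -m venv venv
source venv/bin/activate
# installs the exact package versions captured by pip freeze
pip install -r requirements.txt
```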

&lt;p&gt;For R, you can run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# You must first install `renv`&lt;/span&gt;
Rscript&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;install.packages(&amp;#39;renv&amp;#39;)&amp;quot;&lt;/span&gt;

&lt;span class="c1"&gt;# Once installed, you can then run&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/path/to/project
Rscript&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;renv::init()&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once set up and activated, you can start working on your project and install packages as needed. When you are ready to snapshot your environment, you can run the following command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Rscript&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;renv::snapshot()&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will update the &lt;code&gt;renv.lock&lt;/code&gt; file, which should be shared with your code so that a new environment can be set up from it. This workflow is easiest when working with R inside an RStudio project.&lt;/p&gt;
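&lt;p&gt;On the other end, a collaborator (or future you) can recreate the package library recorded in &lt;code&gt;renv.lock&lt;/code&gt; with &lt;code&gt;renv::restore()&lt;/code&gt;, sketched here with the same &lt;code&gt;Rscript&lt;/code&gt; pattern as above:&lt;/p&gt;

```shell
cd /path/to/project
# recreates the package library recorded in renv.lock
Rscript -e "renv::restore()"
```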
&lt;h3 id="limitations"&gt;&lt;a class="toclink" href="#limitations"&gt;Limitations&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;While virtual environments are great at managing the packages needed to run code, they are limited to just that. There are other aspects of running a project, such as application or operating system dependencies. To completely reproduce the environment for running code, you can explore &lt;a href="https://www.docker.com/"&gt;docker&lt;/a&gt;, a utility that captures your entire environment, including the operating system and supporting applications. There is a steeper learning curve to this, and I plan on dedicating entire post(s) to how docker is a more comprehensive solution to reproducibility.&lt;/p&gt;
&lt;h3 id="other-virtual-environment-tools"&gt;&lt;a class="toclink" href="#other-virtual-environment-tools"&gt;Other virtual environment tools&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For Python, &lt;a href="https://pipenv.pypa.io/en/latest/"&gt;pipenv&lt;/a&gt; and &lt;a href="https://python-poetry.org/"&gt;poetry&lt;/a&gt; are virtual environment utilities for managing the versions of packages in a project, ranging from pinning a specific version of a package to allowing any version to be installed. While these tools can be helpful in managing packages and package dependencies, they come with additional complexity and a learning curve.&lt;/p&gt;
&lt;p&gt;The intended audience for this post are people interested in getting into data or businesses looking to use data to drive business value. With this in mind, I attempted to stay high level and provide enough information and key words for someone to find more in-depth resources on any …&lt;/p&gt;</summary><content type="html">&lt;h3 id="preface"&gt;&lt;a class="toclink" href="#preface"&gt;Preface&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The intended audience for this post is people interested in getting into data or businesses looking to use data to drive business value. With this in mind, I attempted to stay high level and provide enough information and key words for someone to find more in-depth resources on any given topic. Additionally, when highlighting skills, I focus on what I have observed being commonly used in industry, based on experience, combing through job postings, and networking.&lt;/p&gt;
&lt;h3 id="working-in-data"&gt;&lt;a class="toclink" href="#working-in-data"&gt;Working in data&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In my experience, working in data can at times be a bit confusing and overwhelming because of the sheer breadth of skills and techniques needed to process, manage, and analyze data. In this post, I am going to group these sets of skills into different roles, highlight the order in which they should be built out in an organization, and emphasize the importance of only focusing on one role at a time. As someone who has done work in each of these areas, my effectiveness was significantly reduced when I had to wear multiple hats. The four roles I will be focusing on in this post are Data Analyst, Data Engineer, Data Scientist, and Machine Learning Engineer. While some of these roles can be broken down further, my goal in this post is to stay high level and emphasize broad concepts around each role. Additionally, there are other important roles, such as Data Governance and Privacy and Data Curation, which I will not be covering, as this post is meant to be an introduction to a data team in an organization.&lt;/p&gt;
&lt;p&gt;And one last thing, although I do not advise trying to perform multiple roles at once, I think it can be beneficial if given the opportunity to try different roles and see what you enjoy most.&lt;/p&gt;
&lt;h3 id="the-different-roles"&gt;&lt;a class="toclink" href="#the-different-roles"&gt;The different roles&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Data team roles diagram" src="/images/data-organization-diagram.png"&gt;&lt;/p&gt;
&lt;h4 id="the-data-analyst-describes-data"&gt;&lt;a class="toclink" href="#the-data-analyst-describes-data"&gt;&lt;strong&gt;The Data Analyst - Describes Data&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In my opinion, the Data Analyst is at the heart of a data team. Their focus should be to describe data to help drive business value.&lt;/p&gt;
&lt;p&gt;A Data Analyst should be the first role a company should hire and the expectations should be for them to come in and work with stakeholders and leadership to understand a company's goals and business problems and how data can help drive value. Once a company's goals and business problems have been clearly defined, the Data Analyst should seek out data, wrangle it to make it &lt;code&gt;tidy&lt;/code&gt; (see &lt;a href="https://vita.had.co.nz/papers/tidy-data.pdf"&gt;Tidy Data&lt;/a&gt; by Hadley Wickham), and then deliver reports and/or dashboards to communicate the results. The Data Analyst is what I consider more of an operational role that helps a business measure key indicators to help understand the state of their business as well as make data driven decisions.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Key areas:&lt;/em&gt; data wrangling - data exploration - business intelligence - building dashboards (Key Performance Indicators (KPI), Operational) using tools like &lt;a href="https://www.tableau.com/"&gt;Tableau&lt;/a&gt;, &lt;a href="https://powerbi.microsoft.com/en-us/"&gt;PowerBI&lt;/a&gt;, &lt;a href="https://looker.com/"&gt;Looker&lt;/a&gt; - building reports using tools like SQL Server Reporting Services (&lt;a href="https://docs.microsoft.com/en-us/sql/reporting-services/create-deploy-and-manage-mobile-and-paginated-reports?view=sql-server-2017"&gt;SSRS&lt;/a&gt;) - SQL - (optional but recommended) R or Python&lt;/p&gt;
&lt;h4 id="the-data-engineer-scales-data"&gt;&lt;a class="toclink" href="#the-data-engineer-scales-data"&gt;&lt;strong&gt;The Data Engineer - Scales Data&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The Data Engineer's role is to scale an organization's data ingestion and management.&lt;/p&gt;
&lt;p&gt;Once a company has a firm understanding of its goals and the business problems where data can help drive value, a Data Engineer can come in to scale and build upon existing data processes. By scale, I mean the &lt;a href="https://whatis.techtarget.com/definition/3Vs"&gt;3Vs of data&lt;/a&gt; (volume, variety, and velocity) and the ability to build automated data pipelines capable of bringing in data at the frequency a business needs to support applications, reports, dashboards, and analyses. Not only should a Data Engineer focus on &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load"&gt;Extract, transform, load (ETL)&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Extract,_load,_transform"&gt;Extract, load, transform (ELT)&lt;/a&gt;, they should also have a good understanding of how to build optimized data warehouses/marts/stores that are useful for end users. Lastly, a Data Engineer should understand the concept of production code: it needs to work consistently and reliably. In the event of an inevitable failure, it needs to fail reliably and have adequate logging so the issue can be debugged and resolved as efficiently as possible.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Key Areas:&lt;/em&gt; concept of production code - optimized data warehousing/mart/store - &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load"&gt;Extract, transform, load (ETL)&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Extract,_load,_transform"&gt;Extract, load, transform (ELT)&lt;/a&gt; - data pipelines - automation - streaming data - web scraping - tools like &lt;a href="https://www.talend.com/"&gt;Talend&lt;/a&gt; or &lt;a href="https://www.informatica.com"&gt;Informatica&lt;/a&gt; - SQL - Python - Docker&lt;/p&gt;
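&lt;p&gt;A minimal sketch of what "failing reliably with adequate logging" can look like in a pipeline step. The &lt;code&gt;load_records&lt;/code&gt; function and its validation rule are hypothetical, invented purely for illustration:&lt;/p&gt;

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
logger = logging.getLogger("pipeline")

def load_records(rows):
    """Load a batch of rows, validating up front so bad data fails fast and loudly."""
    for i, row in enumerate(rows):
        if "id" not in row:
            # Log the offending row before raising so the failure is debuggable
            logger.error("row %d rejected: missing 'id' field: %r", i, row)
            raise ValueError(f"row {i} missing 'id'")
    logger.info("loaded %d rows", len(rows))
    return len(rows)

print(load_records([{"id": 1}, {"id": 2}]))  # → 2
```

The point of the sketch is that a bad row stops the load with enough context in the logs to resolve the issue, rather than silently loading partial data.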
&lt;p&gt;&lt;strong&gt;---------------------------------------------------------------&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Most companies can be highly successful by doing Data Analytics and Data Engineering well. Without these two areas in place, a Data Scientist will be ineffective, since they will simply end up doing a combination of Data Analytics and Engineering. This often leads to mismatched skill sets and expectations on the part of both the organization and the Data Scientist.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;---------------------------------------------------------------&lt;/strong&gt;&lt;/p&gt;
&lt;h4 id="the-data-scientist-models-data"&gt;&lt;a class="toclink" href="#the-data-scientist-models-data"&gt;&lt;strong&gt;The Data Scientist - Models Data&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The role of the Data Scientist is to use machine learning and statistical techniques to develop models of data capable of predictions or prescriptions.&lt;/p&gt;
&lt;p&gt;These models should focus on enhancing a business' existing processes or developing new products that would not be feasible without machine learning techniques. This is more of a strategic role, where the goal is often to build upon existing Data Analytics work. Similar to a Data Analyst, a Data Scientist will need to do data exploration and visualization to gain a better understanding of the data. Additionally, they will need to interact with business stakeholders to gain a firm understanding of company goals and business problems, which helps guide what models can be built to drive value.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Key Areas:&lt;/em&gt; feature engineering - statistical analysis - building / tuning / evaluating machine learning models - natural language processing - computer vision - R or Python - SQL&lt;/p&gt;
&lt;h4 id="the-machine-learning-engineer-scale-models"&gt;&lt;a class="toclink" href="#the-machine-learning-engineer-scale-models"&gt;&lt;strong&gt;The Machine Learning Engineer - Scale Models&lt;/strong&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The role of the Machine Learning Engineer is to scale models by deploying and managing them in a production environment.&lt;/p&gt;
&lt;p&gt;A Machine Learning Engineer should work closely with the Data Science, Data Engineering, and Software Development teams to understand how a model needs to be deployed and to build out the necessary tools, APIs, or batch processes. Additionally, a Machine Learning Engineer needs to put tools in place that evaluate model performance and watch for model drift, in order to determine whether a model needs to be retrained (a great blog post on this topic is &lt;a href="https://mlinproduction.com/model-retraining/"&gt;The Ultimate Guide to Model Retraining&lt;/a&gt;). Like a Data Engineer, they need to understand production code: once a model is deployed, it is often their responsibility to ensure predictions are not only being returned but also meet the necessary expectations.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Key Areas:&lt;/em&gt; model drift - model retraining - the concept of production code - evaluation of models in production - machine learning - feature engineering - Python - SQL - Docker - automation&lt;/p&gt;
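&lt;p&gt;To make the drift check concrete, here is a deliberately simple sketch that compares live accuracy against the accuracy measured at deployment time and flags the model once it degrades past a tolerance. The function names and the 5% tolerance are illustrative assumptions; real monitoring typically also tracks the input feature distributions.&lt;/p&gt;

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def needs_retraining(baseline_acc, recent_preds, recent_labels, tolerance=0.05):
    """Flag the model when live accuracy drops more than `tolerance` below baseline."""
    return accuracy(recent_preds, recent_labels) < baseline_acc - tolerance

# The model scored 0.90 at deployment; a recent window only gets 6 of 8 right (0.75)
print(needs_retraining(0.90, [1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 1, 1, 1, 1, 1, 0]))  # → True
```

In practice this check would run on a schedule against labeled production data, with the retraining decision fed back to the Data Science team.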
&lt;h3 id="overall-suggestions-when-working-in-data"&gt;&lt;a class="toclink" href="#overall-suggestions-when-working-in-data"&gt;Overall Suggestions when working in Data&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Focus on the business problem&lt;/strong&gt; - When starting data projects, it is easy to get distracted by all of the different available technologies. However, it is important to focus first on solving the business problem at hand rather than on using the current "state of the art" solutions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Be intentional with technology selection&lt;/strong&gt; - Whenever introducing a new piece of technology or programming language to a company, it is crucial to consider the consequences. Each new technology increases the overall complexity of a company's technology stack, and can often reduce collaboration and the reuse of work across teams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source Control&lt;/strong&gt; - Regardless of role, I highly recommend that all code (including SQL) be managed in a source control platform. Managing code through a source control platform ensures the code is accessible and that changes to it over time can be tracked. Depending on your team's maturity and needs, I suggest checking out either a &lt;a href="https://trunkbaseddevelopment.com/"&gt;Trunk-Based&lt;/a&gt; or a &lt;a href="https://nvie.com/posts/a-successful-git-branching-model/"&gt;Gitflow&lt;/a&gt; development strategy.&lt;ul&gt;
&lt;li&gt;Some popular source control platforms are &lt;a href="https://github.com/"&gt;GitHub&lt;/a&gt;, &lt;a href="https://about.gitlab.com/"&gt;GitLab&lt;/a&gt;, and &lt;a href="https://bitbucket.org/product"&gt;BitBucket&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;In instances where Data Analysts or Data Scientists are doing analyses or ad-hoc requests, I recommend having a single repository where the code for these requests is stored and managed with loose processes around merging. This helps ensure the code lives in a central, discoverable location. Alongside it, one or more restrictive repositories should be in place, requiring reviews, for managing code related to specific project work.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud computing&lt;/strong&gt; is becoming more commonplace when working with data in organizations. AWS, Microsoft Azure, and Google Cloud all offer free tiers that let you get some hands-on experience even if your organization does not use cloud computing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Package management&lt;/strong&gt; - With respect to R and Python, I suggest looking into utilities to manage packages to help with reproducibility.&lt;ul&gt;
&lt;li&gt;For R, I suggest using &lt;a href="https://github.com/rstudio/renv"&gt;renv&lt;/a&gt; over packrat, as it is much more efficient at managing packages.&lt;/li&gt;
&lt;li&gt;For Python, I suggest using &lt;a href="https://docs.python.org/3/library/venv.html"&gt;venv&lt;/a&gt;. It is simple, and I have yet to run into issues with it.&lt;/li&gt;
&lt;li&gt;If you are new to Python, I also recommend checking out &lt;a href="https://www.anaconda.com/distribution/"&gt;Anaconda&lt;/a&gt;. However, I will admit my experiences using it to manage and share environments have not been great. I recommend Anaconda only because it makes it easier to get started in Python.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Docker&lt;/strong&gt; - While managing packages is helpful, Docker allows you to manage the entire environment of a script, which ensures reproducibility. &lt;em&gt;DevOps people will love you.&lt;/em&gt;&lt;ul&gt;
&lt;li&gt;A good resource for doing this in R is &lt;a href="https://colinfay.me/docker-r-reproducibility/"&gt;An Introduction to Docker for R Users&lt;/a&gt; by Colin Fay&lt;/li&gt;
&lt;li&gt;A good resource for Python is the &lt;a href="https://www.fullstackpython.com/docker.html"&gt;Docker&lt;/a&gt; page on Full Stack Python&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
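&lt;p&gt;As a small illustration of the venv suggestion above, Python's standard library can create an isolated environment directly. The temporary path here is only for demonstration; in a real project you would create &lt;code&gt;.venv&lt;/code&gt; in the repository root and install pinned dependencies into it.&lt;/p&gt;

```python
import tempfile
import venv
from pathlib import Path

# Create an isolated environment (with_pip=False keeps this demo fast and offline)
env_dir = Path(tempfile.mkdtemp()) / ".venv"
venv.create(env_dir, with_pip=False)

# pyvenv.cfg is the marker file that identifies a virtual environment
print((env_dir / "pyvenv.cfg").exists())  # → True
```

From the command line, the equivalent is `python -m venv .venv`, followed by activating the environment and installing packages from a pinned requirements file.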
&lt;h3 id="updates"&gt;&lt;a class="toclink" href="#updates"&gt;Updates&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id="2019-12-08"&gt;&lt;a class="toclink" href="#2019-12-08"&gt;2019-12-08&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;renv is now offered on CRAN, yay!&lt;/li&gt;
&lt;li&gt;Changed the suggested method of Python environment management to venv, based on research and testing around consistently and reliably setting up and managing environments in Python.&lt;/li&gt;
&lt;/ul&gt;</content><category term="posts"/></entry><entry><title>Ethical use of algorithms with data</title><link href="https://www.alex-antonison.com/posts/ethical-use-algorithms-and-data/" rel="alternate"/><published>2019-03-09T00:00:00+00:00</published><updated>2026-04-05T00:56:12.380244+00:00</updated><author><name>Alex Antonison</name></author><id>tag:www.alex-antonison.com,2019-03-09:/posts/ethical-use-algorithms-and-data/</id><summary type="html">&lt;h3 id="preface"&gt;&lt;a class="toclink" href="#preface"&gt;Preface&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before I talk about my views on ethics in the realm of Data Science, I first want to talk about how I got into Data Science.  I spent the first two years of my career doing some analysis accompanied by mostly Data Engineering before I knew it was Data …&lt;/p&gt;</summary><content type="html">&lt;h3 id="preface"&gt;&lt;a class="toclink" href="#preface"&gt;Preface&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before I talk about my views on ethics in the realm of Data Science, I first want to talk about how I got into Data Science. I spent the first two years of my career doing some analysis accompanied by mostly Data Engineering, before I knew it was Data Engineering. At that two-year mark, in 2014, I was at a point where I wanted to figure out where to take my career next. After a bit of searching, I landed on the field of Data Science, since it seemed like a perfect fit with my love of statistics and working with data. As with anything, my first approach was to search out as much information on the topic as possible. I took a few Coursera courses, followed &lt;em&gt;top data scientists on Twitter&lt;/em&gt;, read blogs, and listened to podcasts. More often than not, the topics were applications of machine learning, interesting papers on machine learning models, or interviews with industry experts. Periodically, though, a post or podcast on a topic like "gender biased word embeddings" would catch my eye. I quickly realized that although Data Science has the propensity for good, it also has the potential to harm individuals or entire communities. I began actively searching for these issues and came upon a couple of good articles that covered a wide variety of them, such as &lt;a href="https://www.technologyreview.com/s/608248/biased-algorithms-are-everywhere-and-no-one-seems-to-care/"&gt;Biased Algorithms Are Everywhere, and No One Seems to Care&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, before I go any further, I would like to introduce two concepts that are at the center of this post.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Machine Learning:&lt;/strong&gt; "Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task." - &lt;a href="https://en.wikipedia.org/wiki/Machine_learning"&gt;https://en.wikipedia.org/wiki/Machine_learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Algorithmic Bias:&lt;/strong&gt; "Algorithmic bias occurs when a computer system reflects the implicit values of the humans who are involved in coding, collecting, selecting, or using data to train the algorithm." - &lt;a href="https://en.wikipedia.org/wiki/Algorithmic_bias"&gt;https://en.wikipedia.org/wiki/Algorithmic_bias&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With that covered, I will move on to discuss a handful of cases where models are either biased, unethically created, or unethically used.&lt;/p&gt;
&lt;h3 id="gender-biased-word-embeddings"&gt;&lt;a class="toclink" href="#gender-biased-word-embeddings"&gt;Gender biased word embeddings&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before I talk about how a &lt;a href="https://en.wikipedia.org/wiki/Word_embedding"&gt;word embedding&lt;/a&gt; can be gender biased, I would first like to discuss what a word embedding is. A word embedding is a useful natural language processing tool in which a model represents the relationships between words as mathematical values, allowing for associations such as "man:king" with "woman:queen" and "paris:france" with "tokyo:japan". Cool, right? However, these associations can also encode bias, extending to "man:programmer" with "woman:homemaker". Not cool. A commonly used model is Word2Vec, developed by Google back in 2013 and trained on Google News text. Unfortunately, because the data this model was trained on was gender biased, so are the results. But there is hope! Researchers have done work both to quantify the bias and to come up with methods to "debias" the embeddings in &lt;a href="https://arxiv.org/abs/1607.06520"&gt;Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings&lt;/a&gt;. For more on this, I recommend &lt;a href="https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/"&gt;How Vector Space Mathematics Reveals the Hidden Sexism in Language&lt;/a&gt;.&lt;/p&gt;
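&lt;p&gt;To make the analogy arithmetic concrete, here is a toy sketch with hand-made two-dimensional "embeddings". Real embeddings such as Word2Vec are learned from text and have hundreds of dimensions; these four vectors are invented purely to show the king - man + woman ≈ queen mechanic.&lt;/p&gt;

```python
import math

# Hand-made toy "embeddings" (2-D for illustration; real ones are learned from text)
vectors = {
    "man":   [1.0, 0.1],
    "woman": [0.1, 1.0],
    "king":  [1.0, 0.9],
    "queen": [0.1, 1.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Analogy arithmetic: king - man + woman should land closest to queen
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
best = max(vectors, key=lambda word: cosine(vectors[word], target))
print(best)  # → queen
```

The same vector arithmetic is what surfaces the biased "man:programmer, woman:homemaker" association when the vectors come from biased training text.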
&lt;h3 id="mortgage-loan-interest-rate-bias"&gt;&lt;a class="toclink" href="#mortgage-loan-interest-rate-bias"&gt;Mortgage loan interest rate bias&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I find this to be a more traditional instance, where financial institutions set out to use "big data" with "machine learning" to find ways to infer interest rates based on geography or characteristics of applicants. This is referred to as "Algorithmic Strategic Pricing". As a result, &lt;a href="http://faculty.haas.berkeley.edu/morse/research/papers/discrim.pdf"&gt;based on a study done by UC Berkeley&lt;/a&gt;, African American and Latino borrowers pay more on purchase and refinance loans than White and Asian borrowers. "The lenders may not be specifically targeting minorities in their pricing schemes, but by profiling non-shopping applicants they end up targeting them," said &lt;a href="http://faculty.haas.berkeley.edu/morse/"&gt;Adair Morse&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For more information, you can check out &lt;a href="http://newsroom.haas.berkeley.edu/minority-homebuyers-face-widespread-statistical-lending-discrimination-study-finds/"&gt;Minority homebuyers face widespread statistical lending discrimination, study finds&lt;/a&gt; or the study itself, &lt;a href="http://faculty.haas.berkeley.edu/morse/research/papers/discrim.pdf"&gt;Consumer-Lending Discrimination in the Era of FinTech&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="image-recognition-bias"&gt;&lt;a class="toclink" href="#image-recognition-bias"&gt;Image recognition bias&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Image recognition has become an everyday utility in society, with many big tech companies, like Google, Microsoft, and most notably Facebook, using it in their product offerings. However, without industry benchmarks to ensure that these facial recognition applications perform well on people of all races, genders, and ages, there are instances where the systems either simply do not work or are offensive.&lt;/p&gt;
&lt;p&gt;A notable instance of a facial recognition system failing to work was discovered by Joy Buolamwini, an African American PhD student at MIT's Center for Civic Media. At the time of her discovery, she was a Computer Science undergraduate at Georgia Tech, working on a research project to teach a computer to play "peek-a-boo". She found that although the system had no issues recognizing her lighter-skinned roommates, it had difficulty working with her. Her workaround was to wear a white Halloween mask, which the system would then detect as a face. More on this can be found in &lt;a href="https://www.pbs.org/wgbh/nova/article/ai-bias/"&gt;Ghosts in the Machine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This issue is not limited to Joy Buolamwini's research; it could also be seen in Microsoft's Kinect for the Xbox. Back in 2010, it was observed that the Kinect often would not work on people with darker skin (&lt;a href="https://www.pcworld.com/article/209708/Is_Microsoft_Kinect_Racist.html"&gt;Is Microsoft’s Kinect Racist?&lt;/a&gt;). Also worth noting is that Microsoft has since advocated for more regulation around image recognition in their blog post &lt;a href="https://blogs.microsoft.com/on-the-issues/2018/07/13/facial-recognition-technology-the-need-for-public-regulation-and-corporate-responsibility/"&gt;Facial recognition technology: The need for public regulation and corporate responsibility&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="microsoft-tay-twitter-bot"&gt;&lt;a class="toclink" href="#microsoft-tay-twitter-bot"&gt;Microsoft Tay twitter bot&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I like using Microsoft's Tay Twitter bot as an example regarding ethics, since it is an instance where the researchers themselves weren't being unethical but failed to consider how their model could be interacted with and manipulated. To provide a brief summary: Tay was a research experiment in which Microsoft built an artificial intelligence Twitter bot that was supposed to learn to mimic the speech of a 19-year-old American girl by interacting with people on Twitter. What they failed to consider was a group of internet users deciding to bombard the bot with hateful speech. The end result: Microsoft had to turn the bot off after 16 hours.&lt;/p&gt;
&lt;p&gt;The lesson here is that even well-intentioned researchers can fail to take into consideration how their work could be manipulated. As we build products, it is important to think not only about what the purpose of a model is but also about how it could be used to harm other people.&lt;/p&gt;
&lt;h3 id="facebook-cambridge-analytica-scandal"&gt;&lt;a class="toclink" href="#facebook-cambridge-analytica-scandal"&gt;Facebook Cambridge Analytica scandal&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;An ethics blog post would be incomplete without mentioning the &lt;a href="https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal"&gt;Facebook–Cambridge Analytica data scandal&lt;/a&gt;. To briefly summarize, this is an instance where an organization used a survey app on Facebook to collect information from users for supposedly academic purposes. However, by manipulating Facebook's app design, they were also able to collect information not only from the users who agreed to the survey but from all of those users’ friends as well. Furthermore, instead of using this information for academic purposes, they used it for both the Ted Cruz and Donald Trump political campaigns.&lt;/p&gt;
&lt;p&gt;The two main takeaways here are that Cambridge Analytica both collected people's information without their consent and then used that information for purposes beyond the consent given. Needless to say, collecting people's information without their consent is clearly unethical. However, even when collecting people's personal information ethically, it is important that measures are taken to ensure their information is protected and not misused.&lt;/p&gt;
&lt;h3 id="ways-to-improve"&gt;&lt;a class="toclink" href="#ways-to-improve"&gt;Ways to improve&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A few closing thoughts on ways the Data Science industry can improve:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build teams of people from diverse backgrounds to ensure underrepresented communities are not negatively impacted by biased models.&lt;/li&gt;
&lt;li&gt;Audit algorithms AND the data sets used to train models.&lt;/li&gt;
&lt;li&gt;Encourage companies to provide more information to users and researchers to help them better understand potential pitfalls and biases that may exist in their tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="additional-sources"&gt;&lt;a class="toclink" href="#additional-sources"&gt;Additional sources&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If you are still interested in looking more into this topic, I highly recommend checking out the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Articles&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.technologyreview.com/s/610192/were-in-a-diversity-crisis-black-in-ais-founder-on-whats-poisoning-the-algorithms-in-our/"&gt;“We’re in a diversity crisis”: cofounder of Black in AI on what’s poisoning algorithms in our lives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing"&gt;Machine bias risk assessments in criminal sentencing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Podcasts&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataskeptic.com/blog/episodes/2018/data-ethics"&gt;Data Skeptic - Data Ethics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://lineardigressions.com/episodes/2018/12/30/facial-recognition-society-and-you"&gt;Linear Digressions - Facial recognition, society, and the law&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://lineardigressions.com/episodes/2018/2/25/when-is-open-data-too-open"&gt;Linear Digressions - When is open data too open?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Books&lt;ul&gt;
&lt;li&gt;&lt;a href="https://weaponsofmathdestructionbook.com/"&gt;Weapons of Math Destruction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="posts"/></entry></feed>