Importing Data with Python

Importing data is a key step in the data science workflow, and it carries a lot of responsibility. How you import (or connect to) a dataset has consequences for how you work with that data throughout a project, because a Pandas DataFrame (say) requires Pandas-specific code. Similarly, your data constrains your code: if it fits in memory on a single computer, the constraints are different than if it is so large that its storage must be distributed. Data-import is also a key place where a data project can go wrong. If you import a dataset without validating that it contains sensible values, a nasty surprise may await you…
Python has wonderful libraries for data manipulation and analysis, and you can readily work with a variety of data sources. For example, sources:
- that are too big to hold in memory;
- that are distributed across a network;
- that update rapidly;
- or that don’t easily conform to a tabular, relational form.
The Python data stack is sufficiently mature that there are multiple libraries for all of these settings. There are also some moves to introduce a standardised syntax for common data-frame functionality across libraries. At Jumping Rivers, we have a number of Python training courses, teaching how to use Pandas, PySpark and NumPy, and how to work with databases via SQL from Python.
Importing into memory
First, we will compare two in-memory data-frame packages: Pandas and Polars. We will work in a virtual environment, so that the packages we install are managed independently of the system-wide Python (see our Barbie-themed blog post on virtual environments).
# [Linux terminal commands]
# Create & activate a new virtual environment
python -m venv .venv
source .venv/bin/activate
# Install pandas and polars into the environment
pip install pandas polars
We will generate a simple dataset to import with the two packages.
PyPI download numbers for a specific Python package can be obtained using the pypistats package. After installing it, we will pull out the number of downloads for the pytest package - see our recent blog posts for an introduction to this testing library.
# [Linux terminal commands]
# Install pypistats
pip install pypistats
# Obtain download-statistics for `pytest` in tab-separated format
pypistats python_minor -f tsv pytest > data/pytest-downloads.tsv
The structure of that file is straightforward. It records both the number, and the percentage, of pytest downloads across each minor version of Python ("3.8", "3.9" and so on) over the last 180 days (the default time-span).
# [Linux terminal commands]
head data/pytest-downloads.tsv
## "category" "percent" "downloads"
## "3.11" "23.40%" 282,343,944
## "3.9" "18.78%" 226,548,604
## "3.10" "18.74%" 226,155,405
## "3.12" "15.48%" 186,819,921
## "null" "8.41%" 101,489,156
## "3.8" "6.13%" 73,965,471
## "3.13" "4.91%" 59,253,846
## "3.7" "3.36%" 40,551,618
## "3.6" "0.54%" 6,546,017
So it should be trivial to import it into Python using either Pandas or Polars.
files = {
    "pytest_data": "data/pytest-downloads.tsv"
}
import pandas as pd
downloads_pd = pd.read_csv(files["pytest_data"], sep="\t")
downloads_pd.head()
## category percent downloads
## 0 3.11 23.40% 282,343,944
## 1 3.9 18.78% 226,548,604
## 2 3.10 18.74% 226,155,405
## 3 3.12 15.48% 186,819,921
## 4 NaN 8.41% 101,489,156
import polars as pl
downloads_pl = pl.read_csv(files["pytest_data"], separator="\t")
downloads_pl.head()
## shape: (5, 3)
## ┌──────────┬─────────┬─────────────┐
## │ category ┆ percent ┆ downloads │
## │ --- ┆ --- ┆ --- │
## │ str ┆ str ┆ str │
## ╞══════════╪═════════╪═════════════╡
## │ 3.11 ┆ 23.40% ┆ 282,343,944 │
## │ 3.9 ┆ 18.78% ┆ 226,548,604 │
## │ 3.10 ┆ 18.74% ┆ 226,155,405 │
## │ 3.12 ┆ 15.48% ┆ 186,819,921 │
## │ null ┆ 8.41% ┆ 101,489,156 │
## └──────────┴─────────┴─────────────┘
We can see that pytest was downloaded ~280 million times on Python 3.11, and that this accounted for about 23% of downloads.
The syntax for importing the dataset using the two libraries is almost identical. The only difference for the read_csv() function is that you use sep=... in Pandas and separator=... in Polars.
Polars is more memory-efficient than Pandas in a lot of settings. One of the memory-efficiencies that Polars allows is filtering by rows and columns during import. To take advantage of this, you can use the scan_csv() function in Polars. This creates a lazy data-frame that can be manipulated directly with Polars DataFrame methods; those method calls are applied at the point when the data is loaded, rather than to an in-memory data-frame. The Polars website has far more details about the lazy API and the benefits it can bring to your work.
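To give a flavour of the lazy API, here is a minimal sketch using the same TSV file as above: scan_csv() only builds a query plan, and no data is read until collect() is called.
# Build a lazy query; nothing is read from disk yet
lazy_downloads = pl.scan_csv(files["pytest_data"], separator="\t")
# Column selections and row trims are recorded in the query plan, so
# Polars only materialises the data we actually ask for
top_rows = (
    lazy_downloads
    .select(["category", "downloads"])
    .head(5)
    .collect()  # the file is read (and trimmed) only at this point
)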
Here, we will be working with eagerly-loaded, in-memory data-frames - a good choice during exploratory work.
Validating a dataset
Have we loaded what we wanted?
The Pandas/Polars .dtypes attribute gives you information about the data-types in each column of a data-frame:
print(downloads_pd.dtypes)
## category object
## percent object
## downloads object
## dtype: object
print(downloads_pl.dtypes)
## [String, String, String]
Again, the formatting of the output looks a little different between the two packages, but the results are broadly similar: all of our data (the version-strings in 'category', the percentage values and the download counts) have been read as character strings. Pandas tells us they are 'object's, but we know that this typically means they are strings:
type(downloads_pd["category"][0])
## <class 'str'>
So Python has imported our data incorrectly. In a real project we would want to know about this, so we might validate the data in our datasets. We would also want to prevent data-import mistakes as far as possible (see later) by being more explicit about how each column is imported and converted.
Python has a range of packages for validating both the schema for, and the values in, a dataset.
If you want to check the data-types stored in the fields of a general Python class, you could use Pydantic:
# [Terminal]
pip install pydantic
from pydantic import BaseModel, NonNegativeInt

class Person(BaseModel):
    name: str
    age: NonNegativeInt
Correct data-types cause no fuss:
Person(name="Russ", age=47)
## Person(name='Russ', age=47)
But incorrect data (here, a negative age) throws an error:
Person(name="Buddy", age=-1)
## pydantic_core._pydantic_core.ValidationError: 1 validation error for Person
## age
## Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-1, input_type=int]
## For further information visit https://errors.pydantic.dev/2.11/v/greater_than_equal
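Pydantic models validate one record at a time, so to check a whole data-frame this way you could loop over its rows. A rough sketch (DownloadRecord is a model we introduce here for illustration); applied to the string-typed columns we imported above, every row would be flagged:
from pydantic import BaseModel, ValidationError

class DownloadRecord(BaseModel):
    category: str
    percent: float
    downloads: int

# Validate each row as a record; the string-valued "percent" and
# "downloads" columns fail coercion to float/int, so every row is reported
for index, row in enumerate(downloads_pd.to_dict(orient="records")):
    try:
        DownloadRecord(**row)
    except ValidationError as err:
        print(f"Row {index} is invalid: {err}")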
There are extensions of Pydantic that work with data-frames. For example, Pandera can work with Pandas, Polars and several other data-frame libraries. You need to install a different Pandera extra depending on the data-frame library you are working with ("pandera[pandas]" for Pandas, "pandera[polars]" for Polars, and so on).
# [Terminal]
pip install "pandera[pandas]"
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "category": pa.Column(str, nullable=True),
    "percent": pa.Column(float, checks=pa.Check.in_range(0, 100)),
    "downloads": pa.Column(int)
})
schema.validate(downloads_pd)
## pandera.errors.SchemaError: non-nullable series 'percent' contains null values:
## 17 NaN
## 18 NaN
## Name: percent, dtype: object
Validation has identified an issue with the dataset: there is a missing value in the "percent" column towards the end. There are other issues (the two numeric columns are currently strings) but let's look at the issue that Pandera has identified first.
This is the end of the dataset:
downloads_pd.tail()
## category percent downloads
## 14 3.3 0.00% 2,259
## 15 2.6 0.00% 87
## 16 3.2 0.00% 68
## 17 Total NaN 1,206,596,056
## 18 Date range: 2024-12-31 - 2025-07-07 NaN NaN
The final couple of lines of the dataset contain metadata that should be excluded at import. Let's just drop them by limiting the number of rows we read (this isn't a very robust solution, but it is fine for now).
downloads_pd = pd.read_csv(files["pytest_data"], sep="\t", nrows=17)
downloads_pd.tail()
## category percent downloads
## 12 3.40 0.00% 12,777
## 13 3.15 0.00% 2,882
## 14 3.30 0.00% 2,259
## 15 2.60 0.00% 87
## 16 3.20 0.00% 68
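As an aside, a slightly less brittle alternative (a sketch; we stick with nrows below) is Pandas' skipfooter argument, which drops a fixed number of trailing lines rather than hard-coding the total row count. It requires the slower Python parsing engine.
# Equivalent result: drop the two trailing metadata lines, however
# many data rows the file contains (only the Python engine supports this)
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    skipfooter=2,
    engine="python"
)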
Now if we validate the Pandas data-frame, another issue has been identified.
schema.validate(downloads_pd)
## pandera.errors.SchemaError: expected series 'category' to have type str:
## failure cases:
## index failure_case
## 0 0 3.11
## 1 1 3.90
## 2 2 3.10
## 3 3 3.12
## 4 5 3.80
## 5 6 3.13
## 6 7 3.70
## 7 8 3.60
## 8 9 2.70
## 9 10 3.14
## 10 11 3.50
## 11 12 3.40
## 12 13 3.15
## 13 14 3.30
## 14 15 2.60
## 15 16 3.20
This is a more substantial issue - the Python version-strings (3.2, 3.3 and so on) have been converted into floating-point numbers during import. So now Python version “3.2” is Python 3.20 in the data-frame.
Rich Iannone has written a useful blog post on the Posit-dev blog comparing various data-validation libraries for Polars. The tools mentioned in his post can check more substantial matters than the missing-data, data-type and data-range issues mentioned above. In particular, pointblank can create data-validation summary reports that can be used to report back to data-collection teams or to analysts. Python also has longer-standing data-validation tools like Great Expectations and test-driven data analysis (TDDA).
Importing with data-type constraints
I knew the structure of the tab-separated file before attempting to load it with Pandas and Polars. That is, I knew that the percentages looked like "12.34%" and that the download counts looked like "123,456,789" (with commas separating the thousands and millions). Neither package can convert these number formats into the types it requires without a bit of help: even if we specified the post-import data-type for each column, the two libraries wouldn't be able to parse the input values directly.
Both Polars and Pandas allow you to provide a pre-defined schema when importing data. For Pandas, you provide a dtype dictionary which specifies what the output column data-type should be. For Polars, you provide a schema argument, where the data-types are specified in a Polars-specific format (because the data is stored in Rust, the Python int and str data-types don't work for Polars).
If we import our data using a schema, Pandas and Polars will complain:
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows=17,
    dtype={"category": str, "percent": float, "downloads": int}
)
## ValueError: could not convert string to float: '23.40%'
downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17,
    schema={"category": pl.Utf8, "percent": pl.Float64, "downloads": pl.Int64}
)
## polars.exceptions.ComputeError: could not parse `"23.40%"` as dtype `f64` at column 'percent' (column number 2)
##
## The current offset in the file is 7 bytes.
##
## You might want to try:
## - increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
## - specifying correct dtype with the `schema_overrides` argument
## - setting `ignore_errors` to `True`,
## - adding `"23.40%"` to the `null_values` list.
##
## Original error: ```remaining bytes non-empty```
The errors arise because we also need to specify how to convert our data into the expected data-types when the conversion isn't obvious. In Pandas, this is done using converters:
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows=17,
    dtype={"category": str},
    converters={
        # "12.34%" -> 12.34
        "percent": lambda x: float(x.strip("%")),
        # "123,456,789" -> 123456789
        "downloads": lambda x: int(x.replace(",", ""))
    }
)
downloads_pd.head()
## category percent downloads
## 0 3.11 23.40 282343944
## 1 3.9 18.78 226548604
## 2 3.10 18.74 226155405
## 3 3.12 15.48 186819921
## 4 NaN 8.41 101489156
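The same clean-up could alternatively be done after a plain string import, using Pandas' vectorised string methods. A sketch (raw is a throwaway name introduced here for illustration):
# Read everything as strings first...
raw = pd.read_csv(files["pytest_data"], sep="\t", nrows=17, dtype=str)
# ...then clean the numeric columns up with vectorised string methods
raw["percent"] = raw["percent"].str.rstrip("%").astype(float)
raw["downloads"] = raw["downloads"].str.replace(",", "").astype(int)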
In Polars, the with_columns() method allows conversion to the expected data-types.
downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17
).with_columns(
    # "12.34%" -> 12.34
    pl.col("percent").str.replace("%", "").cast(pl.Float64),
    # "123,456,789" -> 123456789
    pl.col("downloads").str.replace_all(",", "").cast(pl.Int64)
)
downloads_pl.head()
## shape: (5, 3)
## ┌──────────┬─────────┬───────────┐
## │ category ┆ percent ┆ downloads │
## │ --- ┆ --- ┆ --- │
## │ str ┆ f64 ┆ i64 │
## ╞══════════╪═════════╪═══════════╡
## │ 3.11 ┆ 23.4 ┆ 282343944 │
## │ 3.9 ┆ 18.78 ┆ 226548604 │
## │ 3.10 ┆ 18.74 ┆ 226155405 │
## │ 3.12 ┆ 15.48 ┆ 186819921 │
## │ null ┆ 8.41 ┆ 101489156 │
## └──────────┴─────────┴───────────┘
Now if we validate our datasets against the Pandera schema, we should have a little more success:
schema.validate(downloads_pd)
## category percent downloads
## 0 3.11 23.40 282343944
## 1 3.9 18.78 226548604
## 2 3.10 18.74 226155405
## 3 3.12 15.48 186819921
## 4 NaN 8.41 101489156
## 5 3.8 6.13 73965471
## 6 3.13 4.91 59253846
## 7 3.7 3.36 40551618
## 8 3.6 0.54 6546017
## 9 2.7 0.20 2371860
## 10 3.14 0.02 300816
## 11 3.5 0.02 231325
## 12 3.4 0.00 12777
## 13 3.15 0.00 2882
## 14 3.3 0.00 2259
## 15 2.6 0.00 87
## 16 3.2 0.00 68
Nice.
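The Polars data-frame can be validated in much the same way. A minimal sketch, assuming the pandera[polars] extra is installed; the schema uses Polars data-types rather than Python ones:
import pandera.polars as plpa

polars_schema = plpa.DataFrameSchema({
    "category": plpa.Column(pl.Utf8, nullable=True),
    "percent": plpa.Column(pl.Float64, checks=plpa.Check.in_range(0, 100)),
    "downloads": plpa.Column(pl.Int64)
})

# Returns the validated data-frame if every check passes
polars_schema.validate(downloads_pl)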
Summary
In this post we have looked at importing data with both Pandas and Polars. We have seen that the import functions are similar (read_csv(...)), but that there are some subtle differences in how you specify things: the column-separator argument, the data-schema and so on. Explicit conversion of the imported data is also performed differently between the two packages.
Once your data has been imported, you should check that it contains what it is supposed to: are the columns you need all present, do they store the correct data-types, and do the values within those columns sit within the expected ranges? Data-validation libraries like Pointblank and Pandera are really useful for checking data-schemas and data-values before you do anything critical with your dataset.
