Importing Data with Python

Importing data is a key step in the data science workflow, and it carries a lot of responsibility. How you import (or connect to) a dataset has consequences for how you work with that data throughout a project, because a Pandas DataFrame (say) requires Pandas-specific code. Similarly, your data constrains your code: if it fits in memory on a single computer, the constraints are different than if it is so large that its storage must be distributed. Data-import is also a key place where a data project can go wrong. If you import a dataset without validating that it contains sensible values, a nasty surprise may await you…
Python has wonderful libraries for data manipulation and analysis, and you can readily work with a variety of data sources. For example, sources:
- that are too big to hold in memory;
- that are distributed across a network;
- that update rapidly;
- or that don’t easily conform to a tabular, relational form.
The Python data stack is sufficiently mature that there are multiple libraries for all of these settings. There are also some moves to introduce a standardised syntax for common data-frame functionality across libraries. At Jumping Rivers, we have a number of Python training courses, teaching how to use Pandas, PySpark and NumPy, and how to work with databases via SQL from Python.
Importing into memory
First, we will compare two in-memory data-frame packages: Pandas and Polars. We will work in a virtual environment, so that the packages we install are managed independently of the system-wide Python (see our Barbie-themed blog post on virtual environments).
# [Linux terminal commands]
# Create & activate a new virtual environment
python -m venv .venv
source .venv/bin/activate
# Install pandas and polars into the environment
pip install pandas polars
We will generate a simple dataset to import with the two packages.
PyPI download numbers for a specific Python package can be obtained using the pypistats package. After installing it, we will pull out the number of downloads for the pytest package - see our recent blog posts for an introduction to this testing library.
# [Linux terminal commands]
# Install pypistats
pip install pypistats
# Obtain download-statistics for `pytest` in tab-separated format
pypistats python_minor -f tsv pytest > data/pytest-downloads.tsv
The structure of that file is straightforward. It records both the number, and the percentage, of pytest downloads across each minor version of Python ("3.8", "3.9" and so on) over the last 180 days (the default time-span).
# [Linux terminal commands]
head data/pytest-downloads.tsv
## "category" "percent" "downloads"
## "3.11" "23.40%" 282,343,944
## "3.9" "18.78%" 226,548,604
## "3.10" "18.74%" 226,155,405
## "3.12" "15.48%" 186,819,921
## "null" "8.41%" 101,489,156
## "3.8" "6.13%" 73,965,471
## "3.13" "4.91%" 59,253,846
## "3.7" "3.36%" 40,551,618
## "3.6" "0.54%" 6,546,017
So it should be trivial to import it into Python using either Pandas or Polars.
files = {
    "pytest_data": "data/pytest-downloads.tsv"
}
import pandas as pd
downloads_pd = pd.read_csv(files["pytest_data"], sep="\t")
downloads_pd.head()
## category percent downloads
## 0 3.11 23.40% 282,343,944
## 1 3.9 18.78% 226,548,604
## 2 3.10 18.74% 226,155,405
## 3 3.12 15.48% 186,819,921
## 4 NaN 8.41% 101,489,156
import polars as pl
downloads_pl = pl.read_csv(files["pytest_data"], separator="\t")
downloads_pl.head()
## shape: (5, 3)
## ┌──────────┬─────────┬─────────────┐
## │ category ┆ percent ┆ downloads │
## │ --- ┆ --- ┆ --- │
## │ str ┆ str ┆ str │
## ╞══════════╪═════════╪═════════════╡
## │ 3.11 ┆ 23.40% ┆ 282,343,944 │
## │ 3.9 ┆ 18.78% ┆ 226,548,604 │
## │ 3.10 ┆ 18.74% ┆ 226,155,405 │
## │ 3.12 ┆ 15.48% ┆ 186,819,921 │
## │ null ┆ 8.41% ┆ 101,489,156 │
## └──────────┴─────────┴─────────────┘
We can see that pytest was downloaded ~280 million times on Python 3.11, and that this accounted for about 23% of downloads.
The syntax for importing the dataset using the two libraries is almost identical. The only difference for the read_csv() function is that you use sep=... in Pandas and separator=... in Polars.
Polars is more memory-efficient than Pandas in a lot of settings. One of the memory-efficiencies that Polars allows is filtering by rows and columns during import. To take advantage of this, you can use the scan_csv() function in Polars. This creates a lazy data-frame that can be manipulated directly with Polars DataFrame methods; those method calls are applied at the point when the data is loaded, rather than to an in-memory data-frame. The Polars website has far more details about the lazy API and the benefits it can bring to your work.
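To give a flavour of the lazy API, here is a minimal sketch using the same TSV file as above: scan_csv() only builds a query plan, and no data is read until collect() is called.
# Build a lazy query; nothing is read from disk yet
lazy_downloads = pl.scan_csv(files["pytest_data"], separator="\t")
# Column selections and row trims are recorded in the query plan, so
# Polars only materialises the data we actually ask for
top_rows = (
    lazy_downloads
    .select(["category", "downloads"])
    .head(5)
    .collect()  # the file is read (and trimmed) only at this point
)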
Here, we will be working with eagerly-loaded, in-memory data-frames - a good choice during exploratory work.
Validating a dataset
Have we loaded what we wanted?
The Pandas/Polars .dtypes attribute gives you information about the data-types in each column of a data-frame:
print(downloads_pd.dtypes)
## category object
## percent object
## downloads object
## dtype: object
print(downloads_pl.dtypes)
## [String, String, String]
Again, the formatting of the output looks a little different between the two packages, but the results are broadly similar: all of our data (the version-strings in 'category', the percentage values and the download counts) have been read as character strings. Pandas tells us they are 'object's, but we know that this typically means they are strings:
type(downloads_pd["category"][0])
## <class 'str'>
So Python has imported our data incorrectly. In a real project we would want to know about this, so we might validate the data in our datasets. We would also want to prevent data-import mistakes as far as possible (see later) by being more explicit about how each column is imported and converted.
Python has a range of packages for validating both the schema for, and the values in, a dataset.
If you want to check the data-types stored in the fields of a general Python class, you could use Pydantic:
# [Terminal]
pip install pydantic
from pydantic import BaseModel, NonNegativeInt

class Person(BaseModel):
    name: str
    age: NonNegativeInt
Correct data-types cause no fuss:
Person(name="Russ", age=47)
## Person(name='Russ', age=47)
But incorrect data (here, a negative age) throws an error:
Person(name="Buddy", age=-1)
## pydantic_core._pydantic_core.ValidationError: 1 validation error for Person
## age
## Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-1, input_type=int]
## For further information visit https://errors.pydantic.dev/2.11/v/greater_than_equal
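Pydantic models validate one record at a time, so to check a whole data-frame this way you could loop over its rows. A rough sketch (DownloadRecord is a model we introduce here for illustration); applied to the string-typed columns we imported above, every row would be flagged:
from pydantic import BaseModel, ValidationError

class DownloadRecord(BaseModel):
    category: str
    percent: float
    downloads: int

# Validate each row as a record; the string-valued "percent" and
# "downloads" columns fail coercion to float/int, so every row is reported
for index, row in enumerate(downloads_pd.to_dict(orient="records")):
    try:
        DownloadRecord(**row)
    except ValidationError as err:
        print(f"Row {index} is invalid: {err}")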
There are extensions of Pydantic that work with data-frames. For example, Pandera can work with Pandas, Polars and several other data-frame libraries. You need to install a different Pandera extra depending on the data-frame library you are working with ("pandera[pandas]" for Pandas, "pandera[polars]" for Polars, and so on).
# [Terminal]
pip install "pandera[pandas]"
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "category": pa.Column(str, nullable=True),
    "percent": pa.Column(float, checks=pa.Check.in_range(0, 100)),
    "downloads": pa.Column(int)
})
schema.validate(downloads_pd)
## pandera.errors.SchemaError: non-nullable series 'percent' contains null values:
## 17 NaN
## 18 NaN
## Name: percent, dtype: object
Validation has identified an issue with the dataset: there is a missing value in the "percent" column towards the end. There are other issues (the two numeric columns are currently strings) but let's look at the issue that Pandera has identified first.
This is the end of the dataset:
downloads_pd.tail()
## category percent downloads
## 14 3.3 0.00% 2,259
## 15 2.6 0.00% 87
## 16 3.2 0.00% 68
## 17 Total NaN 1,206,596,056
## 18 Date range: 2024-12-31 - 2025-07-07 NaN NaN
The final couple of lines of the dataset contain metadata that should be excluded at import. Let's just drop them by limiting the number of rows we read (this isn't a very robust solution, but it is fine for now).
downloads_pd = pd.read_csv(files["pytest_data"], sep="\t", nrows=17)
downloads_pd.tail()
## category percent downloads
## 12 3.40 0.00% 12,777
## 13 3.15 0.00% 2,882
## 14 3.30 0.00% 2,259
## 15 2.60 0.00% 87
## 16 3.20 0.00% 68
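As an aside, a slightly less brittle alternative (a sketch; we stick with nrows below) is Pandas' skipfooter argument, which drops a fixed number of trailing lines rather than hard-coding the total row count. It requires the slower Python parsing engine.
# Equivalent result: drop the two trailing metadata lines, however
# many data rows the file contains (only the Python engine supports this)
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    skipfooter=2,
    engine="python"
)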
Now if we validate the Pandas data-frame, another issue has been identified.
schema.validate(downloads_pd)
## pandera.errors.SchemaError: expected series 'category' to have type str:
## failure cases:
## index failure_case
## 0 0 3.11
## 1 1 3.90
## 2 2 3.10
## 3 3 3.12
## 4 5 3.80
## 5 6 3.13
## 6 7 3.70
## 7 8 3.60
## 8 9 2.70
## 9 10 3.14
## 10 11 3.50
## 11 12 3.40
## 12 13 3.15
## 13 14 3.30
## 14 15 2.60
## 15 16 3.20
This is a more substantial issue - the Python version-strings (3.2, 3.3 and so on) have been converted into floating-point numbers during import. So now Python version “3.2” is Python 3.20 in the data-frame.
Rich Iannone has written a useful blog post on the Posit-dev blog comparing various data-validation libraries for Polars. The tools mentioned in his post can check more substantial matters than the missing-data, data-type and data-range issues mentioned above. In particular, pointblank can create data-validation summary reports that can be used to report back to data-collection teams or to analysts. Python also has longer-standing data-validation tools like Great Expectations and test-driven data analysis (TDDA).
Importing with data-type constraints
I knew the structure of the tab-separated file before attempting to load it with Pandas and Polars. That is, I knew that the percentages looked like "12.34%" and that the download counts looked like "123,456,789" (with commas separating the thousands and millions). Neither package can convert these number formats into the types it requires without a bit of help: even if we specified the post-import data-type for each column, the two libraries wouldn't be able to parse the input values directly.
Both Polars and Pandas allow you to provide a pre-defined schema when importing data. For Pandas, you provide a dtype dictionary which specifies what the output column data-type should be. For Polars, you provide a schema argument, where the data-types are specified in a Polars-specific format (because the data is stored in Rust, the Python int and str data-types don't work for Polars).
If we import our data using a schema, Pandas and Polars will complain:
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows=17,
    dtype={"category": str, "percent": float, "downloads": int}
)
## ValueError: could not convert string to float: '23.40%'
downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17,
    schema={"category": pl.Utf8, "percent": pl.Float64, "downloads": pl.Int64}
)
## polars.exceptions.ComputeError: could not parse `"23.40%"` as dtype `f64` at column 'percent' (column number 2)
##
## The current offset in the file is 7 bytes.
##
## You might want to try:
## - increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
## - specifying correct dtype with the `schema_overrides` argument
## - setting `ignore_errors` to `True`,
## - adding `"23.40%"` to the `null_values` list.
##
## Original error: ```remaining bytes non-empty```
The errors arise because we also need to specify how to convert our data into the expected data-types when the conversion isn't obvious. In Pandas, this is done using converters:
downloads_pd = pd.read_csv(
    files["pytest_data"],
    sep="\t",
    nrows=17,
    dtype={"category": str},
    converters={
        # "12.34%" -> 12.34
        "percent": lambda x: float(x.strip("%")),
        # "123,456,789" -> 123456789
        "downloads": lambda x: int(x.replace(",", ""))
    }
)
downloads_pd.head()
## category percent downloads
## 0 3.11 23.40 282343944
## 1 3.9 18.78 226548604
## 2 3.10 18.74 226155405
## 3 3.12 15.48 186819921
## 4 NaN 8.41 101489156
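The same clean-up could alternatively be done after a plain string import, using Pandas' vectorised string methods. A sketch (raw is a throwaway name introduced here for illustration):
# Read everything as strings first...
raw = pd.read_csv(files["pytest_data"], sep="\t", nrows=17, dtype=str)
# ...then clean the numeric columns up with vectorised string methods
raw["percent"] = raw["percent"].str.rstrip("%").astype(float)
raw["downloads"] = raw["downloads"].str.replace(",", "").astype(int)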
In Polars, the with_columns() method allows conversion to the expected data-types.
downloads_pl = pl.read_csv(
    files["pytest_data"],
    separator="\t",
    n_rows=17
).with_columns(
    # "12.34%" -> 12.34
    pl.col("percent").str.replace("%", "").cast(pl.Float64),
    # "123,456,789" -> 123456789
    pl.col("downloads").str.replace_all(",", "").cast(pl.Int64)
)
downloads_pl.head()
## shape: (5, 3)
## ┌──────────┬─────────┬───────────┐
## │ category ┆ percent ┆ downloads │
## │ --- ┆ --- ┆ --- │
## │ str ┆ f64 ┆ i64 │
## ╞══════════╪═════════╪═══════════╡
## │ 3.11 ┆ 23.4 ┆ 282343944 │
## │ 3.9 ┆ 18.78 ┆ 226548604 │
## │ 3.10 ┆ 18.74 ┆ 226155405 │
## │ 3.12 ┆ 15.48 ┆ 186819921 │
## │ null ┆ 8.41 ┆ 101489156 │
## └──────────┴─────────┴───────────┘
Now if we validate our datasets against the Pandera schema, we should have a little more success:
schema.validate(downloads_pd)
## category percent downloads
## 0 3.11 23.40 282343944
## 1 3.9 18.78 226548604
## 2 3.10 18.74 226155405
## 3 3.12 15.48 186819921
## 4 NaN 8.41 101489156
## 5 3.8 6.13 73965471
## 6 3.13 4.91 59253846
## 7 3.7 3.36 40551618
## 8 3.6 0.54 6546017
## 9 2.7 0.20 2371860
## 10 3.14 0.02 300816
## 11 3.5 0.02 231325
## 12 3.4 0.00 12777
## 13 3.15 0.00 2882
## 14 3.3 0.00 2259
## 15 2.6 0.00 87
## 16 3.2 0.00 68
Nice.
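The Polars data-frame can be validated in much the same way. A minimal sketch, assuming the pandera[polars] extra is installed; the schema uses Polars data-types rather than Python ones:
import pandera.polars as plpa

polars_schema = plpa.DataFrameSchema({
    "category": plpa.Column(pl.Utf8, nullable=True),
    "percent": plpa.Column(pl.Float64, checks=plpa.Check.in_range(0, 100)),
    "downloads": plpa.Column(pl.Int64)
})

# Returns the validated data-frame if every check passes
polars_schema.validate(downloads_pl)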
Summary
In this post we have looked at importing data with both Pandas and Polars. We have seen that the import functions are similar (read_csv(...)), but that there are some subtle differences in how you specify things: the column-separator argument, the data-schema and so on. Explicit conversion of the imported data is also performed differently between the two packages.
Once your data has been imported, you should check that it contains what it is supposed to: are the columns you need all present, do they store the correct data-types, and do the values within those columns sit within the expected ranges? Data-validation libraries like Pointblank and Pandera are really useful for checking data-schemas and data-values before you do anything critical with your dataset.
