
Working with Tabular Data in Python

High-performance data processing and schema validation for tabular data, built on Polars (a Rust-based DataFrame library).

pip install fairspec

The table package provides core utilities for working with tabular data:

  • normalize_table - Convert table data to match a schema
  • denormalize_table - Convert normalized data back to raw format
  • infer_table_schema_from_table - Automatically infer schema from table data
  • inspect_table - Get table structure information
  • query_table - Query tables using SQL-like syntax

For example:

from fairspec import load_csv_table, infer_table_schema_from_table, Resource
table = load_csv_table(Resource(data="data.csv"))
schema = infer_table_schema_from_table(table)
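
The remaining helpers from the list above, inspect_table and query_table, are not demonstrated elsewhere on this page. A minimal sketch follows; the return value of inspect_table and the query syntax accepted by query_table are assumptions, not documented here:

from fairspec import load_csv_table, inspect_table, query_table, Resource
table = load_csv_table(Resource(data="data.csv"))
# Assumption: inspect_table returns structural information (column names, types, row counts).
info = inspect_table(table)
# Assumption: query_table accepts an SQL-like query string, per the utility list above.
subset = query_table(table, "SELECT * FROM table WHERE price > 10")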

Automatically infer a Table Schema from data:

import polars as pl
from fairspec import infer_table_schema_from_table
table = pl.DataFrame({
"id": ["1", "2", "3"],
"price": ["10.50", "25.00", "15.75"],
"date": ["2023-01-15", "2023-02-20", "2023-03-25"],
"active": ["true", "false", "true"],
}).lazy()
schema = infer_table_schema_from_table(table, sample_rows=100, confidence=0.9)
# Result: automatically detected integer, number, date, and boolean types

Convert table data to match a Table Schema (type conversion):

import polars as pl
from fairspec import normalize_table
from fairspec_metadata import (
    TableSchema,
    IntegerColumnProperty,
    NumberColumnProperty,
    BooleanColumnProperty,
    DateColumnProperty,
)
table = pl.DataFrame({
    "id": ["1", "2", "3"],
    "price": ["10.50", "25.00", "15.75"],
    "active": ["true", "false", "true"],
    "date": ["2023-01-15", "2023-02-20", "2023-03-25"],
}).lazy()
schema = TableSchema(properties={
    "id": IntegerColumnProperty(),
    "price": NumberColumnProperty(),
    "active": BooleanColumnProperty(),
    "date": DateColumnProperty(),
})
normalized = normalize_table(table, schema)
result = normalized.collect()
# Result has properly typed columns:
# { id: 1, price: 10.50, active: True, date: Date("2023-01-15") }

Convert normalized data back to raw format (for saving):

from fairspec import denormalize_table
denormalized = denormalize_table(normalized, schema, native_types=["string", "number", "boolean"])
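
Because the result is still a Polars LazyFrame, it can be collected and written out with Polars' own I/O; a small sketch (the CSV path is illustrative):

import polars as pl
# Materialize the lazy result, then save it with a standard Polars writer.
raw_frame: pl.DataFrame = denormalized.collect()
raw_frame.write_csv("output.csv")  # illustrative output path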

Define schemas with column properties and constraints:

from fairspec_metadata import TableSchema, IntegerColumnProperty, StringColumnProperty
schema = TableSchema(
    properties={
        "id": IntegerColumnProperty(minimum=1),
        "name": StringColumnProperty(minLength=1, maxLength=100),
        "email": StringColumnProperty(pattern=r"^[^@]+@[^@]+\.[^@]+$"),
        "age": IntegerColumnProperty(minimum=0, maximum=150),
        "status": StringColumnProperty(enum=["active", "inactive", "pending"]),
    },
    required=["id", "name", "email"],
    primaryKey=["id"],
)
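
A schema defined this way is applied with normalize_table just like in the earlier example; the sample data below is illustrative:

import polars as pl
from fairspec import normalize_table
# Illustrative raw input; values arrive as strings, as they would from a CSV.
users = pl.DataFrame({
    "id": ["1", "2"],
    "name": ["Ada", "Grace"],
    "email": ["ada@example.com", "grace@example.com"],
    "age": ["36", "45"],
    "status": ["active", "pending"],
}).lazy()
typed_users = normalize_table(users, schema).collect()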

Customize how schemas are inferred:

from fairspec import infer_table_schema_from_table
schema = infer_table_schema_from_table(
    table,
    sample_rows=100,
    confidence=0.9,
    keep_strings=False,
    column_types={"id": "integer", "status": "categorical"},
)

Define missing value indicators:

from fairspec_metadata import TableSchema, NumberColumnProperty
schema = TableSchema(
properties={"value": NumberColumnProperty()},
missingValues=["", "N/A", "null", -999],
)
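
Presumably values matching these indicators are turned into nulls when the table is normalized; a sketch under that assumption:

import polars as pl
from fairspec import normalize_table
values = pl.DataFrame({"value": ["1.5", "N/A", "", "-999", "2.0"]}).lazy()
cleaned = normalize_table(values, schema).collect()
# Assumption: rows holding "", "N/A", or -999 end up as nulls in the typed column.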

Define table-level constraints:

from fairspec_metadata import TableSchema, IntegerColumnProperty, StringColumnProperty, UniqueKey
schema = TableSchema(
    properties={
        "user_id": IntegerColumnProperty(),
        "email": StringColumnProperty(),
    },
    primaryKey=["user_id"],
    uniqueKeys=[UniqueKey(columnNames=["email"])],
)

Supported column types:

  • string - Text data
  • integer - Whole numbers
  • number - Decimal numbers
  • boolean - True/false values
  • date - Calendar dates
  • datetime - Date and time
  • time - Time of day
  • duration - Time spans
  • geojson - GeoJSON geometries
  • wkt - Well-Known Text geometries
  • wkb - Well-Known Binary geometries
  • array - Fixed-length arrays
  • list - Variable-length lists
  • object - JSON objects
  • email - Email addresses
  • url - URLs
  • categorical - Categorical data
  • base64 - Base64 encoded data
  • hex - Hexadecimal data

The package uses LazyFrame from Polars for efficient processing:

import polars as pl
from fairspec_table import Table
# Table is an alias for pl.LazyFrame
table: Table = pl.DataFrame({"id": [1, 2, 3]}).lazy()
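
Because Table is just a Polars LazyFrame, ordinary Polars operations can be chained before collecting:

import polars as pl
from fairspec_table import Table
table: Table = pl.DataFrame({"id": [1, 2, 3], "price": [10.5, 25.0, 15.75]}).lazy()
# Standard lazy operations apply; nothing executes until collect() is called.
result = table.filter(pl.col("price") > 12).select("id").collect()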