Curate DataFrames and AnnDatas
Curating a dataset with LaminDB means three things:
Validate that the dataset matches a desired schema
In case the dataset doesn’t validate, standardize it, e.g., by fixing typos or mapping synonyms
Annotate the dataset by linking it against metadata entities so that it becomes queryable
Curate a DataFrame
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty
→ initialized lamindb: testuser1/test-curate
Let’s start with a DataFrame that we’d like to validate.
import lamindb as ln
import bionty as bt
import pandas as pd
df = pd.DataFrame(
{
"cell_medium": pd.Categorical(["DMSO", "IFNG", "DMSO"]),
"temperature": [37.2, 36.3, 38.2],
"cell_type": pd.Categorical(
[
"cerebral pyramidal neuron",
"astrocytic glia",
"oligodendrocyte",
]
),
"assay_ontology_id": pd.Categorical(
["EFO:0008913", "EFO:0008913", "EFO:0008913"]
),
"donor": ["D0001", "D0002", "D0003"],
},
index=["obs1", "obs2", "obs3"],
)
df
→ connected lamindb: testuser1/test-curate
| | cell_medium | temperature | cell_type | assay_ontology_id | donor |
|---|---|---|---|---|---|
| obs1 | DMSO | 37.2 | cerebral pyramidal neuron | EFO:0008913 | D0001 |
| obs2 | IFNG | 36.3 | astrocytic glia | EFO:0008913 | D0002 |
| obs3 | DMSO | 38.2 | oligodendrocyte | EFO:0008913 | D0003 |
Define a schema to validate this dataset.
schema = ln.Schema(
name="My example schema",
features=[
ln.Feature(name="cell_medium", dtype=ln.ULabel).save(),
ln.Feature(name="temperature", dtype=float).save(),
ln.Feature(name="cell_type", dtype=bt.CellType).save(),
ln.Feature(
name="assay_ontology_id", dtype=bt.ExperimentalFactor.ontology_id
).save(),
ln.Feature(name="donor", dtype=str).save(),
],
).save()
# look at the schema
schema.features.df()
id | uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | _expect_many | _curation | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | nYZllzQv3t10 | cell_medium | cat[ULabel] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-02-20 07:27:55.121000+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
2 | uAWtVzxIjNiQ | temperature | float | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-02-20 07:27:55.128000+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
3 | XkQE9we6nWew | cell_type | cat[bionty.CellType] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-02-20 07:27:55.551000+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
4 | MTroVI1sIY6A | assay_ontology_id | cat[bionty.ExperimentalFactor.ontology_id] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-02-20 07:27:55.557000+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
5 | krNOWxd8QnGT | donor | str | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-02-20 07:27:55.562000+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
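To make the idea of a schema more concrete, here is a minimal pandas-only sketch of a column and dtype check. It is not how LaminDB implements validation: `EXPECTED` and `check_columns` are hypothetical names introduced for illustration, extra columns are tolerated (an assumption loosely modeled on the `minimal_set=True` flag that appears in the schema output later on), and registry lookups for categorical values are left out entirely.

```python
import pandas as pd

# Hypothetical, simplified stand-in for a schema: column name -> expected kind.
EXPECTED = {
    "cell_medium": "cat",
    "temperature": "float",
    "cell_type": "cat",
    "assay_ontology_id": "cat",
    "donor": "str",
}


def check_columns(frame: pd.DataFrame, expected: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    for name, kind in expected.items():
        if name not in frame.columns:
            problems.append(f"missing column: {name}")
            continue
        col = frame[name]
        if kind == "cat":
            ok = isinstance(col.dtype, pd.CategoricalDtype)
        elif kind == "float":
            ok = pd.api.types.is_float_dtype(col)
        else:  # "str": accept object or string dtype
            ok = pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)
        if not ok:
            problems.append(f"{name}: expected {kind}, got {col.dtype}")
    return problems


example = pd.DataFrame(
    {
        "cell_medium": pd.Categorical(["DMSO", "IFNG", "DMSO"]),
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": pd.Categorical(["astrocyte", "astrocyte", "oligodendrocyte"]),
        "assay_ontology_id": pd.Categorical(["EFO:0008913"] * 3),
        "donor": ["D0001", "D0002", "D0003"],
    }
)

# The well-formed frame passes; a frame with a dropped column and a
# stringified temperature column does not.
problems = check_columns(example, EXPECTED)
broken = check_columns(
    example.drop(columns=["donor"]).astype({"temperature": str}), EXPECTED
)
```

A real schema additionally ties categorical columns to registries, which is what the curator below takes care of.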
curator = ln.curators.DataFrameCurator(df, schema)
The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).
try:
curator.validate()
except ln.errors.ValidationError as error:
print(error)
• saving validated records of 'cell_type'
✓ added 2 records from public with CellType.name for "cell_type": 'oligodendrocyte', 'astrocyte'
• saving validated records of 'assay_ontology_id'
✓ added 1 record from public with ExperimentalFactor.ontology_id for "assay_ontology_id": 'EFO:0008913'
• mapping "cell_medium" on ULabel.name
! 2 terms are not validated: 'DMSO', 'IFNG'
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_medium")
• mapping "cell_type" on CellType.name
! 2 terms are not validated: 'cerebral pyramidal neuron', 'astrocytic glia'
1 synonym found: "astrocytic glia" → "astrocyte"
→ curate synonyms via .standardize("cell_type") for remaining terms:
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
2 terms are not validated: 'cerebral pyramidal neuron', 'astrocytic glia'
1 synonym found: "astrocytic glia" → "astrocyte"
→ curate synonyms via .standardize("cell_type") for remaining terms:
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
# check the non-validated terms
curator.cat.non_validated
{'cell_medium': ['DMSO', 'IFNG'],
'cell_type': ['cerebral pyramidal neuron', 'astrocytic glia']}
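Conceptually, a report like this is a per-column set difference between the values in the dataset and the names known to a registry. The following is a rough plain-Python sketch of that idea, not LaminDB's implementation; the `non_validated` function and the `registries` dict are hypothetical stand-ins, with registry contents chosen to mirror the log output above.

```python
def non_validated(
    columns: dict[str, list[str]], registries: dict[str, set[str]]
) -> dict[str, list[str]]:
    """For each column, list unique values missing from its registry (first-seen order)."""
    result = {}
    for name, values in columns.items():
        known = registries.get(name, set())
        # dict.fromkeys de-duplicates while preserving order
        missing = [v for v in dict.fromkeys(values) if v not in known]
        if missing:
            result[name] = missing
    return result


columns = {
    "cell_medium": ["DMSO", "IFNG", "DMSO"],
    "cell_type": ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
}
# Hypothetical registry state: no media saved yet, two cell types and one
# assay already imported from the public ontology.
registries = {
    "cell_medium": set(),
    "cell_type": {"oligodendrocyte", "astrocyte"},
    "assay_ontology_id": {"EFO:0008913"},
}

report = non_validated(columns, registries)
```

Columns whose values are all known, such as `assay_ontology_id` here, simply do not appear in the report.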
For cell_type, we saw that “cerebral pyramidal neuron” and “astrocytic glia” are not validated.
First, let’s standardize the synonym “astrocytic glia” as suggested:
curator.cat.standardize("cell_type")
✓ standardized 1 synonym in "cell_type": "astrocytic glia" → "astrocyte"
# now we have only one non-validated cell type left
curator.cat.non_validated
{'cell_medium': ['DMSO', 'IFNG'], 'cell_type': ['cerebral pyramidal neuron']}
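Standardization can be pictured as a lookup in a synonym table that leaves unknown values untouched, which is why “cerebral pyramidal neuron” remains non-validated: it is a near miss, not a registered synonym. A minimal sketch under that assumption follows; `SYNONYMS` and `standardize` are hypothetical names, and the real synonym data comes from the ontology rather than a hand-written dict.

```python
# Hypothetical synonym table: synonym -> canonical name.
SYNONYMS = {"astrocytic glia": "astrocyte"}


def standardize(values: list[str], synonyms: dict[str, str]) -> list[str]:
    """Replace known synonyms by their canonical name; keep everything else as-is."""
    return [synonyms.get(value, value) for value in values]


cell_type_values = ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"]
standardized = standardize(cell_type_values, SYNONYMS)
```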
For “cerebral pyramidal neuron”, let’s understand which cell type in the public ontology might be the actual match.
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup
Lookup objects from the public:
.cell_medium
.cell_type
.assay_ontology_id
.columns
Example:
→ categories = curator.lookup()["cell_type"]
→ categories.alveolar_type_1_fibroblast_cell
To look up public ontologies, use .lookup(public=True)
# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(ontology_id='CL:4023111', name='cerebral cortex pyramidal neuron', definition='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', synonyms=None, parents=array(['CL:0010012', 'CL:0000598'], dtype=object))
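The attribute names on a lookup object, such as `cerebral_cortex_pyramidal_neuron`, are tab-completable versions of the record names. The sketch below shows one way such keys could be derived; the exact normalization LaminDB applies is not documented here, so `to_attr` and `build_lookup` are hypothetical and only assumed to behave similarly for simple names.

```python
import re
from types import SimpleNamespace


def to_attr(name: str) -> str:
    """Turn a free-text name into an attribute-friendly key (assumed rule)."""
    key = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
    # Python attribute names cannot start with a digit
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(names: list[str]) -> SimpleNamespace:
    """Map attribute-friendly keys back to the original names."""
    return SimpleNamespace(**{to_attr(name): name for name in names})


lookup_sketch = build_lookup(
    [
        "cerebral cortex pyramidal neuron",
        "alveolar type 1 fibroblast cell",
        "astrocyte",
    ]
)
match = lookup_sketch.cerebral_cortex_pyramidal_neuron
```

The benefit of this pattern is that typing `lookup_sketch.cereb` and pressing Tab in an interactive session surfaces the candidate names, which helps find the correct spelling of a near miss.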
# fix the cell type
df.cell_type = df.cell_type.replace(
{"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name}
)
/tmp/ipykernel_3431/471877978.py:2: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
df.cell_type = df.cell_type.replace(
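The FutureWarning suggests `ser.cat.rename_categories` for this kind of fix. A small standalone sketch of that alternative on a categorical Series follows; the Series here is a made-up stand-in for `df.cell_type`, and the target name is written out as a string instead of being taken from the lookup object.

```python
import pandas as pd

# Stand-in for the categorical `cell_type` column after synonym standardization.
cell_type = pd.Series(
    pd.Categorical(["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]),
    index=["obs1", "obs2", "obs3"],
)

# Rename one category; categories that are not in the mapping are left unchanged.
fixed = cell_type.cat.rename_categories(
    {"cerebral pyramidal neuron": "cerebral cortex pyramidal neuron"}
)
```

Note that renaming only works when the new name is not already a category, since categories must stay unique; merging two existing categories needs a different approach.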
For cell_medium, we want to add the new labels: “DMSO”, “IFNG”
# this adds cell_medium labels that were _not_ validated
curator.cat.add_new_from("cell_medium")
✓ added 2 records with ULabel.name for "cell_medium": 'DMSO', 'IFNG'
# validate again
curator.validate()
• saving validated records of 'cell_type'
✓ added 1 record from public with CellType.name for "cell_type": 'cerebral cortex pyramidal neuron'
✓ "cell_medium" is validated against ULabel.name
✓ "cell_type" is validated against CellType.name
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
Save a curated artifact.
artifact = curator.save_artifact(key="my_datasets/my_curated_dataset.parquet")
! no run & transform got linked, call `ln.track()` & re-run
• path content will be copied to default storage upon `save()` with key 'my_datasets/my_curated_dataset.parquet'
✓ storing artifact 'UB00wwHM8t0ZXrmd0000' at '/home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/UB00wwHM8t0ZXrmd0000.parquet'
! run input wasn't tracked, call `ln.track()` and re-run
✓ 5 unique terms (100.00%) are validated for name
→ returning existing schema with same hash: Schema(uid='DuVOvYiVxNMLL9URVvzS', name='My example schema', n=5, itype='Feature', is_type=False, hash='rpA3KqTt2WVzAU95xEMxAw', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=2025-02-20 07:27:55 UTC)
! updated otype from None to DataFrame
artifact.describe()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'UB00wwHM8t0ZXrmd0000'
│   ├── .key = 'my_datasets/my_curated_dataset.parquet'
│   ├── .size = 4752
│   ├── .hash = '2NOTv-2Lu54mWj8GrSgNeQ'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/UB00wwHM8t0ZXrmd0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-02-20 07:28:01
├── Dataset features/schema
│   └── columns • 5 [Feature]
│       assay_ontology_id cat[bionty.ExperimentalF… single-cell RNA sequencing
│       cell_medium cat[ULabel] DMSO, IFNG
│       cell_type cat[bionty.CellType] astrocyte, cerebral cortex pyramidal neu…
│       temperature float
│       donor str
└── Labels
    └── .cell_types bionty.CellType oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors bionty.ExperimentalFactor single-cell RNA sequencing
        .ulabels ULabel DMSO, IFNG
Curate an AnnData
Here we additionally specify which var_index to validate against.
import anndata as ad
X = pd.DataFrame(
{
"ENSG00000081059": [1, 2, 3],
"ENSG00000276977": [4, 5, 6],
"ENSG00000198851": [7, 8, 9],
"ENSG00000010610": [10, 11, 12],
"ENSG00000153563": [13, 14, 15],
"ENSGcorrupted": [16, 17, 18],
},
index=df.index, # because we already curated the dataframe above, it will validate
)
adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 6
obs: 'cell_medium', 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
# define var schema
var_schema = ln.Schema(
name="my_var_schema",
itype=bt.Gene.ensembl_gene_id,
dtype=int,
).save()
# define composite schema
anndata_schema = ln.Schema(
name="small_dataset1_anndata_schema",
otype="AnnData",
components={"obs": schema, "var": var_schema},
).save()
var_schema.itype
'bionty.Gene.ensembl_gene_id'
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
try:
curator.validate()
except ln.errors.ValidationError as error:
print(error)
✓ "cell_medium" is validated against ULabel.name
✓ "cell_type" is validated against CellType.name
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
✓ created 1 Organism record from Bionty matching name: 'human'
Invalid identifiers for bionty.Gene.ensembl_gene_id: ['ENSGcorrupted']
Subset the AnnData to validated genes only:
adata_validated = adata[:, ~adata.var.index.isin(["ENSGcorrupted"])].copy()
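The subsetting step relies on a boolean mask over the variable index. Here is a pandas-only sketch of the same masking, with a plain DataFrame standing in for the AnnData object; `invalid_ids` is a hypothetical variable holding whatever identifiers the curator reported as invalid.

```python
import pandas as pd

# Stand-in for the expression matrix: rows are observations, columns are genes.
X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSG00000198851": [7, 8, 9],
        "ENSG00000010610": [10, 11, 12],
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18],
    },
    index=["obs1", "obs2", "obs3"],
)

# Hypothetical list of identifiers that failed validation.
invalid_ids = ["ENSGcorrupted"]

# True for every column to keep, False for every invalid identifier.
keep = ~X.columns.isin(invalid_ids)
X_valid = X.loc[:, keep]
```

Keeping the invalid identifiers in a list rather than hard-coding them in the mask makes the same line work when more than one gene fails validation.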
Now let’s validate the subsetted object:
curator = ln.curators.AnnDataCurator(adata_validated, anndata_schema)
try:
curator.validate()
except ln.errors.ValidationError as error:
print(error)
✓ "cell_medium" is validated against ULabel.name
✓ "cell_type" is validated against CellType.name
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
The validated object can be subsequently saved as an Artifact:
artifact = curator.save_artifact(key="my_datasets/my_curated_anndata.h5ad")
! no run & transform got linked, call `ln.track()` & re-run
• path content will be copied to default storage upon `save()` with key 'my_datasets/my_curated_anndata.h5ad'
✓ storing artifact 'KHSOmqXsc7qT3h690000' at '/home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/KHSOmqXsc7qT3h690000.h5ad'
! run input wasn't tracked, call `ln.track()` and re-run
• parsing feature names of X stored in slot 'var'
✓ 5 unique terms (100.00%) are validated for ensembl_gene_id
✓ linked: Schema(uid='cCJNHsCVHpSWGNKabGDI', n=5, dtype='int', itype='bionty.Gene', is_type=False, hash='nmFTQkXy239ruKDl8gDLSw', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=<django.db.models.expressions.DatabaseDefault object at 0x7efc9808c1d0>)
• parsing feature names of slot 'obs'
✓ 5 unique terms (100.00%) are validated for name
→ returning existing schema with same hash: Schema(uid='DuVOvYiVxNMLL9URVvzS', name='My example schema', n=5, itype='Feature', is_type=False, hash='rpA3KqTt2WVzAU95xEMxAw', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=2025-02-20 07:27:55 UTC)
! updated otype from None to DataFrame
✓ linked: Schema(uid='DuVOvYiVxNMLL9URVvzS', name='My example schema', n=5, itype='Feature', is_type=False, otype='DataFrame', hash='rpA3KqTt2WVzAU95xEMxAw', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=2025-02-20 07:27:55 UTC)
✓ saved 1 feature set for slot: 'var'
The saved artifact has been annotated with validated features and labels:
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'KHSOmqXsc7qT3h690000'
│   ├── .key = 'my_datasets/my_curated_anndata.h5ad'
│   ├── .size = 24048
│   ├── .hash = 'le9mfXgkyLtqJCDZdLMCwQ'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/KHSOmqXsc7qT3h690000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-02-20 07:28:07
├── Dataset features/schema
│   ├── var • 5 [bionty.Gene]
│   │   TCF7 int
│   │   PDCD1 int
│   │   CD3E int
│   │   CD4 int
│   │   CD8A int
│   └── obs • 5 [Feature]
│       assay_ontology_id cat[bionty.ExperimentalF… single-cell RNA sequencing
│       cell_medium cat[ULabel] DMSO, IFNG
│       cell_type cat[bionty.CellType] astrocyte, cerebral cortex pyramidal neu…
│       temperature float
│       donor str
└── Labels
    └── .cell_types bionty.CellType oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors bionty.ExperimentalFactor single-cell RNA sequencing
        .ulabels ULabel DMSO, IFNG
We’ve walked through validating, standardizing, and annotating datasets in these key steps:
Defining validation criteria
Validating data against existing registries
Adding new validated entries to registries
Annotating artifacts with validated metadata
By following these steps, you can ensure your data is standardized and well-curated.
If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.
!rm -rf ./test-curate
!lamin delete --force test-curate
• deleting instance testuser1/test-curate