MLOps Setup – OVHcloud + DVC + MLflow (Audio Dataset)
This guide is written for a first MLOps project with audio data. The goal is simple: make sure that at any point in time, you can reproduce exactly which audio files and labels were used to train a given model — even six months later, even after corrections.
Architecture
Audio + Labels → OVHcloud Object Storage (S3)
Versioning → DVC
Tracking → MLflow
Code → Git
Why this split? Audio files are large binary files — Git was not designed for them. Git handles your code and configuration. DVC (Data Version Control — an open-source tool that brings Git-like versioning to data files) handles your data, using the same mental model (add, push, pull) but storing files on a cloud backend (here OVHcloud S3). MLflow (an open-source platform for tracking ML experiments) records what happened during each training run: which dataset version, which hyperparameters, which metrics.
Together, these three tools give you full traceability: code + data + results.
1. Create Object Storage on OVHcloud
OVHcloud is a European cloud provider whose Object Storage is compatible with the S3 protocol — the standard storage API originally created by AWS. This means any tool that supports S3 (DVC, boto3, the AWS CLI…) works with OVHcloud out of the box, just by pointing it to a different endpoint URL.
1.1 Create Public Cloud Project
- Log into OVHcloud Manager
- Create a Public Cloud project
- Activate billing
OVHcloud separates products into "Public Cloud" projects. Object Storage lives inside one of these projects. You need to create the project before you can create a bucket.
1.2 Create Object Storage
- Go to Object Storage
- Create a container (bucket), e.g. ml-audio-datasets
- Select region (EU recommended)
⚠️ Endpoint URL
Depending on your region and OVHcloud offer, the endpoint format may vary — e.g. s3.<region>.io.cloud.ovh.net or s3.<region>.perf.cloud.ovh.net. Always verify the exact URL in your OVHcloud Manager under Object Storage → your bucket → S3 endpoint. Using the wrong URL is a common source of connection errors.
The right bucket configuration depends on your project phase:
| | R&D phase | Production phase |
|---|---|---|
| Versioning | ✅ Enable | ✅ Enable |
| Object Lock (WORM) | ❌ Skip | ✅ Enable at creation |
R&D phase — versioning only is sufficient. You will frequently delete mislabeled files, reorganize the dataset structure, or correct errors in bulk. Object Lock would prevent all of this, leaving you with files you cannot clean up. Versioning alone already protects you against accidental deletions — deleted files remain recoverable.
Production phase — when your dataset is stable and used to train models that go to clinical or production use, create a dedicated bucket with Object Lock enabled. This guarantees that any dataset version linked to a deployed model can never be altered or deleted. Note that Object Lock also requires setting a retention policy (a minimum duration during which files are protected) to be fully effective — without it, the lock has no enforcement period.
⚠️ Important
Object Lock must be enabled at bucket creation — it cannot be activated afterwards. Plan your bucket naming accordingly, e.g. ml-audio-datasets-rd and ml-audio-datasets-prod.
Once the bucket is created, enable versioning:
aws s3api put-bucket-versioning \
  --bucket ml-audio-datasets \
  --versioning-configuration Status=Enabled \
  --endpoint-url https://s3.<region>.io.cloud.ovh.net
Versioning keeps a history of every file modification inside the bucket. Even if a file is overwritten, the previous version is preserved. This is your safety net against accidental data loss.
1.3 Generate S3 Credentials
In OVHcloud, S3 credentials are created via OpenStack Users (not IAM as in AWS — the concept is equivalent but the interface differs). Create a dedicated user with restricted permissions following the principle of least privilege: only grant the permissions your pipeline actually needs.
You will obtain:
BUCKET_NAME=ml-audio-datasets
ACCESS_KEY=xxxxxxxx
SECRET_KEY=xxxxxxxx
ENDPOINT_URL=https://s3.<region>.io.cloud.ovh.net
Save these securely (password manager, never in Git).
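One way to keep these values out of the repository is to read them from environment variables at runtime. The sketch below assumes the four names listed above are exported in the shell under exactly those names. The function name load_s3_settings is hypothetical and not part of DVC or MLflow.

```python
import os
from typing import Mapping, Optional

# Variable names mirror the values listed above; exporting them in the shell
# under these exact names is an assumption of this sketch.
REQUIRED_VARS = ("BUCKET_NAME", "ACCESS_KEY", "SECRET_KEY", "ENDPOINT_URL")


def load_s3_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the S3 settings, failing early if any value is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing at startup with the list of missing names is usually easier to debug than a connection error halfway through a pipeline.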
2. Install Dependencies
Install all required packages in one command:
pip install "dvc[s3]" mlflow PyYAML
dvc[s3] installs DVC with its S3 backend (the quotes stop shells such as zsh from treating the brackets as a glob pattern). mlflow is the experiment tracker. PyYAML is required to parse dataset.dvc in your training script. It is not always installed by default with DVC or MLflow, so it is safer to declare it explicitly.
Initialize your Git + DVC repository:
git init
dvc init
git commit -m "init dvc"
dvc init creates a .dvc/ folder inside your Git repo. This folder contains DVC's configuration. It must be committed to Git, and it is not sensitive.
3. Configure DVC Remote (OVH S3)
# Public config — committed to Git
dvc remote add -d ovh s3://ml-audio-datasets
dvc remote modify ovh endpointurl https://s3.<region>.io.cloud.ovh.net
# Credentials — local only, never committed
dvc remote modify ovh --local access_key_id <ACCESS_KEY>
dvc remote modify ovh --local secret_access_key <SECRET_KEY>
# Commit the public part of the configuration
git add .dvc/config
git commit -m "configure DVC remote"
A remote in DVC is the cloud location where your data files are actually stored, here your OVHcloud bucket. The name ovh is just a label you choose freely.
The --local flag writes the credentials to .dvc/config.local, a file that DVC automatically excludes from Git. This way, your bucket address is shared with your team, but your secret keys never leave your machine.
.dvc/config.local is listed in .dvc/.gitignore, which dvc init generates, so Git never sees it. Only the endpoint and bucket name are committed, never your keys.
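As a safeguard before committing, the shared config file can be scanned for credential options that should only ever live in config.local. This is a minimal sketch: the helper name find_secret_options is hypothetical, and the check only looks at option names, not at values.

```python
from pathlib import Path

# Option names DVC uses for S3 credentials; none should appear in the committed file.
SECRET_KEYS = ("access_key_id", "secret_access_key", "session_token")


def find_secret_options(config_text: str) -> list:
    """Return the credential option names that appear as keys in the config text."""
    found = []
    for line in config_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in SECRET_KEYS and key not in found:
            found.append(key)
    return found


if __name__ == "__main__":
    config_path = Path(".dvc/config")  # the committed file, not config.local
    if config_path.exists():
        leaked = find_secret_options(config_path.read_text())
        print("Leaked options:", leaked if leaked else "none")
```

A check like this could run in a pre-commit hook, so a credential added without --local is caught before it reaches Git history.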
4. Structure Dataset
dataset/
│
├── audio/
│   ├── file1.wav
│   └── file2.wav
│
└── labels/
    ├── file1.txt
    └── file2.txt
Keeping audio and labels in a single dataset/ folder lets DVC version them together as one atomic unit. If you version them separately, you risk a mismatch between audio v2 and labels v1.
Each audio file and its label file must share the same base name (file1.wav pairs with file1.txt). This makes pairing deterministic and avoids ambiguity during training.
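A quick check before running dvc add can confirm that every audio file has a matching label and vice versa. The sketch below assumes the .wav and .txt extensions and the audio/ and labels/ subfolders shown in the tree above, and pairs files by base name. The function name find_unpaired is hypothetical.

```python
from pathlib import Path


def find_unpaired(dataset_dir: str) -> dict:
    """Compare base names in audio/ and labels/ and report files with no counterpart."""
    root = Path(dataset_dir)
    audio = {p.stem for p in (root / "audio").glob("*.wav")}
    labels = {p.stem for p in (root / "labels").glob("*.txt")}
    return {
        "audio_without_label": sorted(audio - labels),
        "label_without_audio": sorted(labels - audio),
    }


if __name__ == "__main__":
    print(find_unpaired("dataset"))
```

Two empty lists mean the dataset is consistent. Anything else is worth fixing before a new version is recorded.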
Create a .dvcignore file at the root of your repository to exclude temporary or system files from DVC tracking:
# .dvcignore
# macOS system files
.DS_Store
# Windows thumbnail cache
Thumbs.db
*.tmp
*.log
.dvcignore works like .gitignore but for DVC, which also means comments must sit on their own line: a trailing "# ..." would be read as part of the pattern. Without these rules, files like .DS_Store get included in the dataset hash, so a Mac user simply opening the folder would silently change the hash without touching any actual audio or label file, breaking reproducibility.
5. Add Dataset to DVC
dvc add dataset/
git add dataset.dvc .gitignore
git commit -m "dataset v1"
dvc push
dvc add computes a hash of your dataset and creates a dataset.dvc file: a small text file that describes the dataset without containing the actual data. This .dvc file is your "pointer" to the data, and it is the only thing committed to Git.
dvc push uploads the actual files to OVHcloud. Your collaborators can later run dvc pull to download the exact same files.
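To build intuition for why a single stray file changes the dataset version, here is a simplified model of a directory hash: hash every file, then hash the sorted list of (path, hash) pairs. This is an illustration under stated assumptions, not DVC's actual manifest format or algorithm. The function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 of a file's bytes, read in chunks so large audio files fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_fingerprint(root: str) -> str:
    """Hash a sorted manifest of (relative path, file hash) pairs.

    Simplified illustration only: DVC's real manifest format differs.
    """
    base = Path(root)
    manifest = [
        {"relpath": p.relative_to(base).as_posix(), "md5": file_md5(p)}
        for p in sorted(base.rglob("*"))
        if p.is_file()
    ]
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

In this model, the fingerprint is stable as long as no file is added, removed, renamed, or modified, which is why system files such as .DS_Store need to be excluded.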
6. Run MLflow Locally
mlflow ui
Access at: http://localhost:5000
MLflow is your experiment logbook. Every time you train a model, you open a run and record: which dataset version, which hyperparameters, which metrics, and which model file was produced. Running it locally is fine to start — for a team, you would deploy it on a shared server so everyone sees the same runs.
7. Log Dataset Version During Training
The key idea:
Never rely on memory or convention to know which dataset was used for a given model. Log it explicitly and automatically inside every training run.
import subprocess

import dvc.api
import mlflow
import yaml

# --- Dataset version (DVC) ---
with open("dataset.dvc") as f:
    dvc_file = yaml.safe_load(f)

# When versioning a directory (dvc add dataset/), DVC creates an internal
# manifest file listing all files with their individual hashes.
# The hash stored in dataset.dvc points to that manifest and ends with .dir
out = dvc_file["outs"][0]
# dvc add records an md5; the etag fallback covers outputs tracked directly on S3
dvc_hash = out.get("md5") or out.get("etag")
if dvc_hash is None:
    raise ValueError("No md5 or etag found in dataset.dvc")
if not dvc_hash.endswith(".dir"):
    # This may indicate you versioned a single file instead of a directory
    print(f"Warning: unexpected hash format: {dvc_hash}")

# Optional: resolve the full S3 URL of the dataset on the remote.
# Specify the remote explicitly to avoid ambiguity.
# Wrapped in try/except so that a failure to resolve the remote
# (for example in offline or CI environments) does not abort training.
try:
    dataset_url = dvc.api.get_url("dataset/", remote="ovh")
except Exception:
    dataset_url = "remote-unavailable"

# --- Code version (Git) ---
try:
    git_commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]
    ).decode().strip()
except (subprocess.CalledProcessError, FileNotFoundError):
    git_commit = "no-git-commit"  # safety: no commit yet, or git not installed

with mlflow.start_run():
    # Reproducibility triangle: data + code + results
    mlflow.log_param("dataset_hash", dvc_hash)    # which data (version identifier)
    mlflow.log_param("dataset_url", dataset_url)  # where data lives on the remote
    mlflow.log_param("git_commit", git_commit)    # which code
    mlflow.log_artifact("dataset.dvc")            # full descriptor, enables exact dvc pull

    # Training parameters (placeholder values)
    mlflow.log_param("model", "my_model_name")
    mlflow.log_param("learning_rate", 0.001)

    # Results (placeholder values; model.pkl must already exist on disk)
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_artifact("model.pkl")
From any MLflow run, you can restore the exact state of both your code (git checkout <git_commit>) and your data (dvc pull, run after the checkout so that dataset.dvc matches that commit) that produced a given model. This is the full reproducibility triangle: dataset version + code version + experiment results.
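The hash lookup in the script can also be isolated into a small function, so that a malformed dataset.dvc fails loudly before training starts. A minimal sketch, assuming its input is the dictionary returned by yaml.safe_load. The function names extract_dataset_hash and is_directory_hash are hypothetical.

```python
def extract_dataset_hash(dvc_file: dict) -> str:
    """Return the hash of the first output in a parsed .dvc file, or raise ValueError."""
    outs = dvc_file.get("outs") or []
    if not outs:
        raise ValueError("no 'outs' entry found in the .dvc file")
    out = outs[0]
    dvc_hash = out.get("md5") or out.get("etag")
    if not dvc_hash:
        raise ValueError("first output has neither 'md5' nor 'etag'")
    return dvc_hash


def is_directory_hash(dvc_hash: str) -> bool:
    """Directory outputs are recorded with a hash ending in '.dir'."""
    return dvc_hash.endswith(".dir")
```

Keeping this logic in a function also makes it easy to unit-test without a real dataset or a running MLflow server.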
8. Dataset Update Workflow
When labels are corrected or new audio is added:
# Modify your files locally, then:
dvc add dataset/ # recompute the hash, update dataset.dvc
git add dataset.dvc # stage the updated pointer file
git commit -m "dataset v2 – corrected labels for session 3"
dvc push # upload new/changed files to OVHcloud
DVC only uploads files that have actually changed; unchanged audio files are not re-uploaded.
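Old versions remain accessible on OVHcloud: DVC stores each file under its content hash, so pushing a new version never overwrites an older one (bucket versioning is an additional safety net). You can always restore a previous version with git checkout <commit> -- dataset.dvc && dvc pull.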
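The "only changed files are uploaded" behaviour follows from content addressing: a file is identified by the hash of its bytes, not by its name. The sketch below models two dataset versions as path-to-hash dictionaries with made-up placeholder hashes. It is a simplified model of the idea, not DVC's implementation, and files_to_upload is a hypothetical name.

```python
def files_to_upload(old_manifest: dict, new_manifest: dict) -> list:
    """Return paths whose content hash is not already present in the old version."""
    already_stored = set(old_manifest.values())
    return sorted(
        path for path, digest in new_manifest.items() if digest not in already_stored
    )


if __name__ == "__main__":
    v1 = {"audio/file1.wav": "h1", "labels/file1.txt": "h2"}
    v2 = {"audio/file1.wav": "h1", "labels/file1.txt": "h2-corrected"}
    print(files_to_upload(v1, v2))
```

In this model a corrected label is uploaded again, while an untouched or merely renamed audio file is not, because its content hash is already stored.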
Cleaning up old versions with dvc gc
Over time, old dataset versions accumulate in your OVHcloud bucket and can significantly increase storage costs. DVC provides a garbage collector to remove unused versions:
# Preview what would be deleted (dry run — safe to run)
dvc gc --workspace --cloud --remote ovh --dry
# Actually delete unused versions from the remote
dvc gc --workspace --cloud --remote ovh
⚠️ Destructive operation
dvc gc on the remote permanently deletes data files from OVHcloud. There is no undo. Always run with --dry first to review what will be removed. With --workspace alone, only the dataset version referenced by your current workspace is kept. To also keep versions referenced by Git tags or by earlier commits, tag them and add --all-tags (or use --all-commits) before running.
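If the cleanup is scripted, it can help to make the dry run the default so that a real deletion always requires an explicit choice. The sketch below only assembles the argument list. The helper name build_gc_command is hypothetical, and --all-tags is the dvc gc option that additionally keeps data referenced by Git tags.

```python
def build_gc_command(remote: str, dry_run: bool = True, keep_all_tags: bool = False) -> list:
    """Assemble the argument list for `dvc gc`, defaulting to a dry run."""
    command = ["dvc", "gc", "--workspace", "--cloud", "--remote", remote]
    if keep_all_tags:
        command.append("--all-tags")  # also keep data referenced by Git tags
    if dry_run:
        command.append("--dry")  # preview only, nothing is deleted
    return command


if __name__ == "__main__":
    # Print the preview command; pass the list to subprocess.run to execute it.
    print(" ".join(build_gc_command("ovh")))
```

Building the command as a list rather than a string avoids shell quoting issues if it is later passed to subprocess.run.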
9. Sharing Access
| Method | Use case |
|---|---|
| Read-only OpenStack credentials | Collaborators who need dvc pull |
| Pre-signed URLs | One-off file sharing without credentials |
| Repo clone + dvc pull | Full reproducibility for reviewers |
Recommended: create a dedicated read-only OpenStack user in OVHcloud for collaborators.
In OVHcloud, access control is managed via OpenStack Users (equivalent to IAM users in AWS). Apply the principle of least privilege: a collaborator who only needs to download data should have read-only access to the bucket — they should not be able to upload, overwrite, or delete files. This protects your dataset integrity while allowing team access.
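To make "read-only" concrete, the sketch below builds an AWS-style S3 policy document that allows listing the bucket and downloading objects, and nothing else. This is an assumption-laden illustration: whether OVHcloud accepts this exact document, and where it is attached to the user, should be checked against the OVHcloud documentation. The function name read_only_policy and the Sid value are hypothetical.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build an S3-style policy document that allows listing and downloading only."""
    return {
        "Statement": [
            {
                "Sid": "ReadOnlyDataset",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(read_only_policy("ml-audio-datasets"), indent=2))
```

No write or delete action is granted, so a collaborator holding these credentials can run dvc pull but cannot alter the dataset.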
Result
| What | How |
|---|---|
| EU cloud storage | OVHcloud Object Storage |
| Dataset versioning | DVC |
| Experiment tracking | MLflow |
| Reproducibility | dataset.dvc hash + git commit logged in every MLflow run |
| Security | Credentials local-only, never in Git |
| Data integrity | Bucket versioning (+ optional Object Lock with retention policy in production) |
| Sharing | Read-only OpenStack user |
At this point, your MLOps foundation is solid. The next steps typically include: automating training pipelines (DVC pipelines or Airflow), deploying MLflow on a shared server, and adding CI/CD to trigger retraining on dataset updates.