Hidacs Sàrl
MLOps
February 2026 · 14 min read

MLOps Setup – OVHcloud + DVC + MLflow (Audio Dataset)

This guide is written for a first MLOps project with audio data. The goal is simple: make sure that at any point in time, you can reproduce exactly which audio files and labels were used to train a given model — even six months later, even after corrections.

Architecture

Audio + Labels → OVHcloud Object Storage (S3)
Versioning     → DVC
Tracking       → MLflow
Code           → Git

Why this split? Audio files are large binary files — Git was not designed for them. Git handles your code and configuration. DVC (Data Version Control — an open-source tool that brings Git-like versioning to data files) handles your data, using the same mental model (add, push, pull) but storing files on a cloud backend (here OVHcloud S3). MLflow (an open-source platform for tracking ML experiments) records what happened during each training run: which dataset version, which hyperparameters, which metrics.

Together, these three tools give you full traceability: code + data + results.

1. Create Object Storage on OVHcloud

OVHcloud is a European cloud provider whose Object Storage is compatible with the S3 protocol — the standard storage API originally created by AWS. This means any tool that supports S3 (DVC, boto3, the AWS CLI…) works with OVHcloud out of the box, just by pointing it to a different endpoint URL.

1.1 Create Public Cloud Project

  • Log into OVHcloud Manager
  • Create a Public Cloud project
  • Activate billing

OVHcloud separates products into "Public Cloud" projects. Object Storage lives inside one of these projects. You need to create the project before you can create a bucket.

1.2 Create Object Storage

  • Go to Object Storage
  • Create a container (bucket), e.g.: ml-audio-datasets
  • Select region (EU recommended)

⚠️ Endpoint URL

Depending on your region and OVHcloud offer, the endpoint format may vary — e.g. s3.<region>.io.cloud.ovh.net or s3.<region>.perf.cloud.ovh.net. Always verify the exact URL in your OVHcloud Manager under Object Storage → your bucket → S3 endpoint. Using the wrong URL is a common source of connection errors.
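To keep the endpoint in one place instead of retyping it in several commands, a small helper can assemble and sanity-check it. This is a minimal sketch: build_endpoint and KNOWN_TIERS are hypothetical names, it only covers the two URL formats quoted above, and the "gra" region code is an example value. The URL shown in your OVHcloud Manager remains the reference.

```python
import re

# The two URL infixes mentioned in the text above; other offers may differ.
KNOWN_TIERS = ("io", "perf")


def build_endpoint(region: str, tier: str = "io") -> str:
    """Return a URL of the form https://s3.<region>.<tier>.cloud.ovh.net."""
    region = region.strip().lower()
    if not re.fullmatch(r"[a-z0-9-]+", region):
        raise ValueError(f"suspicious region name: {region!r}")
    if tier not in KNOWN_TIERS:
        raise ValueError(f"unknown tier {tier!r}, expected one of {KNOWN_TIERS}")
    return f"https://s3.{region}.{tier}.cloud.ovh.net"


if __name__ == "__main__":
    # "gra" is only an example region code; use the one shown in your Manager.
    print(build_endpoint("gra"))
```

Rejecting unknown tiers early turns a typo into an immediate error rather than a connection timeout later.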

The right bucket configuration depends on your project phase:

                     R&D phase    Production phase
Versioning           ✅ Enable    ✅ Enable
Object Lock (WORM)   ❌ Skip      ✅ Enable at creation

R&D phase — versioning only is sufficient. You will frequently delete mislabeled files, reorganize the dataset structure, or correct errors in bulk. Object Lock would prevent all of this, leaving you with files you cannot clean up. Versioning alone already protects you against accidental deletions — deleted files remain recoverable.

Production phase — when your dataset is stable and used to train models that go to clinical or production use, create a dedicated bucket with Object Lock enabled. This guarantees that any dataset version linked to a deployed model can never be altered or deleted. Note that Object Lock also requires setting a retention policy (a minimum duration during which files are protected) to be fully effective — without it, the lock has no enforcement period.

⚠️ Important

Object Lock must be enabled at bucket creation — it cannot be activated afterwards. Plan your bucket naming accordingly, e.g. ml-audio-datasets-rd and ml-audio-datasets-prod.

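Once the bucket is created, enable versioning. The AWS CLI needs S3 credentials to do this (see 1.3 Generate S3 Credentials), supplied for example through aws configure: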

aws s3api put-bucket-versioning \
  --bucket ml-audio-datasets \
  --versioning-configuration Status=Enabled \
  --endpoint-url https://s3.<region>.io.cloud.ovh.net

Versioning keeps a history of every file modification inside the bucket. Even if a file is overwritten, the previous version is preserved. This is your safety net against accidental data loss.
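It can be worth confirming that versioning is really active before relying on it. The sketch below assumes boto3 is installed (it is not part of the install command in this guide) and uses the standard S3 get_bucket_versioning call. versioning_status and make_client are hypothetical helper names, and some endpoints may also require a region_name argument.

```python
def versioning_status(s3_client, bucket: str) -> str:
    """Return 'Enabled', 'Suspended', or 'Disabled' for the given bucket."""
    response = s3_client.get_bucket_versioning(Bucket=bucket)
    # S3 omits the Status key entirely when versioning was never turned on.
    return response.get("Status", "Disabled")


def make_client(endpoint_url: str, access_key: str, secret_key: str):
    """Build a boto3 S3 client for an S3-compatible endpoint."""
    import boto3  # imported lazily so versioning_status stays dependency-free

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
```

Typical use would be versioning_status(make_client(url, key, secret), "ml-audio-datasets"), with the values taken from your own configuration.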

1.3 Generate S3 Credentials

In OVHcloud, S3 credentials are created via OpenStack Users (not IAM as in AWS — the concept is equivalent but the interface differs). Create a dedicated user with restricted permissions following the principle of least privilege: only grant the permissions your pipeline actually needs.

You will obtain:

BUCKET_NAME=ml-audio-datasets
ACCESS_KEY=xxxxxxxx
SECRET_KEY=xxxxxxxx
ENDPOINT_URL=https://s3.<region>.io.cloud.ovh.net

Save these securely (password manager, never in Git).
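One way to keep these values out of the repository is to export them as environment variables and read them at runtime. The sketch below assumes the four variable names listed above. load_s3_settings is a hypothetical helper, not part of DVC or MLflow.

```python
import os

REQUIRED = ("BUCKET_NAME", "ACCESS_KEY", "SECRET_KEY", "ENDPOINT_URL")


def load_s3_settings(env=None) -> dict:
    """Read the four settings from the environment; fail fast if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```

Failing at startup with the list of missing names is usually easier to debug than an authentication error raised deep inside a training run.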

2. Install Dependencies

Install all required packages in one command:

# Quotes stop shells such as zsh from treating [s3] as a glob pattern
pip install "dvc[s3]" mlflow PyYAML

dvc[s3] installs DVC with its S3 backend. mlflow is the experiment tracker. PyYAML is required to parse dataset.dvc in your training script — it is not always installed by default with DVC or MLflow, so it is safer to declare it explicitly.

Initialize your Git + DVC repository:

git init
dvc init
git commit -m "init dvc"

dvc init creates a .dvc/ folder inside your Git repo. This folder contains DVC's configuration. It must be committed to Git — it is not sensitive.

3. Configure DVC Remote (OVH S3)

# Public config — committed to Git
dvc remote add -d ovh s3://ml-audio-datasets
dvc remote modify ovh endpointurl https://s3.<region>.io.cloud.ovh.net
git add .dvc/config
git commit -m "configure DVC remote"

# Credentials — local only, never committed
dvc remote modify ovh --local access_key_id <ACCESS_KEY>
dvc remote modify ovh --local secret_access_key <SECRET_KEY>

A remote in DVC is the cloud location where your data files are actually stored — here your OVHcloud bucket. The name ovh is just a label you choose freely.

The --local flag writes the credentials to .dvc/config.local, a file that DVC automatically excludes from Git. This way, your bucket address is shared with your team, but your secret keys never leave your machine.

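The exclusion comes from .dvc/.gitignore, which dvc init generates and which lists config.local. After adding the credentials, git status should not show .dvc/config.local. Only the endpoint and bucket name are committed, never your keys.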
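A forgotten --local flag is the usual way keys end up in the committed config. As a guard, a short check can scan .dvc/config for credential options before each commit. This is a sketch under assumptions: find_leaked_secrets is a hypothetical helper, and the option list covers only the S3 credential names used in this guide plus session_token.

```python
from pathlib import Path

# Option names that should only ever appear in .dvc/config.local
SECRET_KEYS = ("access_key_id", "secret_access_key", "session_token")


def find_leaked_secrets(config_text: str) -> list:
    """Return the secret option names found in the text of a DVC config file."""
    leaked = []
    for line in config_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in SECRET_KEYS:
            leaked.append(key)
    return leaked


if __name__ == "__main__":
    config = Path(".dvc/config")
    if config.exists():
        leaked = find_leaked_secrets(config.read_text())
        print("leaked:", leaked if leaked else "none")
```

Wired into a pre-commit hook, a non-empty result would be the signal to abort the commit and rerun the command with --local.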

4. Structure Dataset

dataset/
│
├── audio/
│   ├── file1.wav
│   └── file2.wav
│
└── labels/
    ├── file1.txt
    └── file2.txt

Keeping audio and labels in a single dataset/ folder lets DVC version them together as one atomic unit. If you version them separately, you risk a mismatch between audio v2 and labels v1.

Audio and label files must share identical filenames — this makes pairing deterministic and avoids ambiguity during training.
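The pairing rule is easy to verify automatically before each dvc add. The sketch below assumes the layout shown above (dataset/audio/*.wav and dataset/labels/*.txt). unpaired_files is a hypothetical helper name.

```python
from pathlib import Path


def unpaired_files(dataset_dir):
    """Return (audio stems without a label, label stems without audio)."""
    root = Path(dataset_dir)
    audio = {p.stem for p in (root / "audio").glob("*.wav")}
    labels = {p.stem for p in (root / "labels").glob("*.txt")}
    return sorted(audio - labels), sorted(labels - audio)


if __name__ == "__main__":
    missing_labels, missing_audio = unpaired_files("dataset")
    print("audio without label:", missing_labels)
    print("label without audio:", missing_audio)
```

Two empty lists mean every recording has exactly one label file with the same base name.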

Add the following patterns to the .dvcignore file at the root of your repository (dvc init normally creates an empty one; create it if it is missing) to exclude temporary or system files from DVC tracking:

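# .dvcignore
# Comments must sit on their own line: a trailing "# ..." becomes part of the pattern.
# macOS system files
.DS_Store
# Windows thumbnail cache
Thumbs.db
# Temporary files and logs
*.tmp
*.log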

.dvcignore works exactly like .gitignore but for DVC. Without it, files like .DS_Store get included in the dataset hash — meaning a Mac user simply opening the folder would silently change the hash without touching any actual audio or label file, breaking reproducibility.
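The effect of a stray file on a directory-level hash can be reproduced with a toy fingerprint. This is an illustration only, not DVC's actual algorithm: directory_fingerprint is a hypothetical function that hashes the sorted list of relative paths and per-file digests, and it matches ignore patterns against file names only.

```python
import fnmatch
import hashlib
from pathlib import Path


def directory_fingerprint(root, ignore_patterns=()):
    """Hash the sorted (relative path, file digest) pairs found under root."""
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if any(fnmatch.fnmatch(path.name, pat) for pat in ignore_patterns):
            continue  # skipped, in the way a .dvcignore match would be
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        entries.append(f"{path.relative_to(root).as_posix()}:{digest}")
    return hashlib.sha256("\n".join(entries).encode()).hexdigest()
```

Calling it on a folder before and after a .DS_Store file appears gives two different fingerprints. Passing ignore_patterns=(".DS_Store",) brings the fingerprint back to the original value.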

5. Add Dataset to DVC

dvc add dataset/
git add dataset.dvc .gitignore
git commit -m "dataset v1"
dvc push

dvc add computes a hash of your dataset and creates a dataset.dvc file — a small text file that describes the dataset without containing the actual data. This .dvc file is your "pointer" to the data, and it is the only thing committed to Git.

dvc push uploads the actual files to OVHcloud. Your collaborators can later run dvc pull to download the exact same files.

6. Run MLflow Locally

mlflow ui

Access at: http://localhost:5000

MLflow is your experiment logbook. Every time you train a model, you open a run and record: which dataset version, which hyperparameters, which metrics, and which model file was produced. Running it locally is fine to start — for a team, you would deploy it on a shared server so everyone sees the same runs.

7. Log Dataset Version During Training

The key idea:

Never rely on memory or convention to know which dataset was used for a given model. Log it explicitly and automatically inside every training run.

import subprocess
import yaml
import mlflow
import dvc.api

# --- Dataset version (DVC) ---
with open("dataset.dvc") as f:
    dvc_file = yaml.safe_load(f)

# When versioning a directory (dvc add dataset/), DVC creates an internal
# manifest file listing all files with their individual hashes.
# The hash stored in dataset.dvc points to that manifest — it ends with .dir
out = dvc_file["outs"][0]
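# dvc add on local files records an md5; an etag appears only for data tracked directly from S3
dvc_hash = out.get("md5") or out.get("etag")
if dvc_hash is None:
    raise ValueError("no md5 or etag hash found in dataset.dvc")

if not dvc_hash.endswith(".dir"):
    print(f"Warning: unexpected hash format: {dvc_hash}")
    # This may indicate you versioned a single file instead of a directory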

# Optional: resolve the full S3 URL of the dataset on the remote.
# Specify the remote explicitly to avoid ambiguity.
# Wrapped in try/except so that a missing or misconfigured remote
# (for example in a CI environment) does not abort the training run.
try:
    dataset_url = dvc.api.get_url("dataset/", remote="ovh")
except Exception:
    dataset_url = "remote-unavailable"

# --- Code version (Git) ---
try:
    git_commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]
    ).decode().strip()
except (subprocess.CalledProcessError, FileNotFoundError):
    git_commit = "no-git-commit"  # safety: no commit yet, or git not installed

with mlflow.start_run():
    # Reproducibility triangle: data + code + results
    mlflow.log_param("dataset_hash", dvc_hash)     # which data (version identifier)
    mlflow.log_param("dataset_url", dataset_url)    # where data lives on the remote
    mlflow.log_param("git_commit", git_commit)      # which code
    mlflow.log_artifact("dataset.dvc")              # full descriptor — enables exact dvc pull

    # Training parameters
    mlflow.log_param("model", "my_model_name")
    mlflow.log_param("learning_rate", 0.001)

    # Results (placeholder values: log your real metric and saved model)
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_artifact("model.pkl")  # the file must already exist on disk

From any MLflow run, you can restore the exact state that produced a given model: git checkout <git_commit> brings back the code together with the matching dataset.dvc, and dvc pull then downloads the data that file points to. This is the full reproducibility triangle: dataset version + code version + experiment results.
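The restore procedure can be written down as a small helper that turns a logged git_commit value into the commands to run. This is a sketch: restore_commands is a hypothetical name, and it assumes dataset.dvc was committed in the same Git commit that the run recorded.

```python
import shlex


def restore_commands(git_commit: str) -> list:
    """Return the shell commands that restore code and data for one logged run."""
    if not git_commit or git_commit == "no-git-commit":
        raise ValueError("run has no usable git_commit parameter")
    return [
        # Restores the code and the dataset.dvc committed alongside it
        f"git checkout {shlex.quote(git_commit)}",
        # Downloads the data files that dataset.dvc points to
        "dvc pull",
    ]
```

The commit value itself can be read from the MLflow UI, or programmatically with mlflow.get_run(run_id).data.params["git_commit"].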

8. Dataset Update Workflow

When labels are corrected or new audio is added:

# Modify your files locally, then:
dvc add dataset/       # recompute the hash, update dataset.dvc
git add dataset.dvc    # stage the updated pointer file
git commit -m "dataset v2 – corrected labels for session 3"
dvc push               # upload new/changed files to OVHcloud

DVC only uploads files that have actually changed — unchanged audio files are not re-uploaded.
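The reason is that files are identified by their content rather than by their name or date. The toy model below illustrates the idea and is not DVC's implementation: content_hash and objects_to_upload are hypothetical names, and the file contents are made-up bytes.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Identify a file by a digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


def objects_to_upload(local_files: dict, remote_hashes: set) -> dict:
    """Map hash -> filename for local files whose content is not on the remote."""
    pending = {}
    for name, data in sorted(local_files.items()):
        digest = content_hash(data)
        if digest not in remote_hashes:
            pending[digest] = name
    return pending


# Version 1 is already on the remote.
v1 = {"audio/file1.wav": b"AAAA", "labels/file1.txt": b"cat"}
remote = {content_hash(data) for data in v1.values()}

# Version 2 corrects one label and leaves the audio untouched.
v2 = {"audio/file1.wav": b"AAAA", "labels/file1.txt": b"dog"}
print(list(objects_to_upload(v2, remote).values()))  # → ['labels/file1.txt']
```

Only the corrected label is queued for upload, because the audio bytes hash to a value the remote already holds.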

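Old versions remain accessible on OVHcloud: DVC stores each file under its content hash and dvc push only adds objects, so a new dataset version never overwrites an earlier one. You can always restore a previous version with git checkout <commit> -- dataset.dvc && dvc pull.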

Cleaning up old versions with dvc gc

Over time, old dataset versions accumulate in your OVHcloud bucket and can significantly increase storage costs. DVC provides a garbage collector to remove unused versions:

# Preview what would be deleted (dry run — safe to run)
dvc gc --workspace --cloud --remote ovh --dry

# Actually delete unused versions from the remote
dvc gc --workspace --cloud --remote ovh

⚠️ Destructive operation

dvc gc on the remote permanently deletes data files from OVHcloud, and DVC offers no undo. Always run with --dry first to review what will be removed. With --workspace, only the dataset version referenced by your current workspace is kept. To also keep versions referenced elsewhere in Git history, tag them and add --all-tags (or use --all-branches or --all-commits) before running.

9. Sharing Access

Method                            Use case
Read-only OpenStack credentials   Collaborators who need dvc pull
Pre-signed URLs                   One-off file sharing without credentials
Repo clone + dvc pull             Full reproducibility for reviewers

Recommended: create a dedicated read-only OpenStack user in OVHcloud for collaborators.

In OVHcloud, access control is managed via OpenStack Users (equivalent to IAM users in AWS). Apply the principle of least privilege: a collaborator who only needs to download data should have read-only access to the bucket — they should not be able to upload, overwrite, or delete files. This protects your dataset integrity while allowing team access.

Result

What                  How
EU cloud storage      OVHcloud Object Storage
Dataset versioning    DVC
Experiment tracking   MLflow
Reproducibility       dataset.dvc hash + git commit logged in every MLflow run
Security              Credentials local-only, never in Git
Data integrity        Bucket versioning (+ optional Object Lock with retention policy in production)
Sharing               Read-only OpenStack user

At this point, your MLOps foundation is solid. The next steps typically include: automating training pipelines (DVC pipelines or Airflow), deploying MLflow on a shared server, and adding CI/CD to trigger retraining on dataset updates.
