
azure-storage-file-datalake-py skill


This skill helps you manage Azure Data Lake Storage Gen2 with Python by guiding file system, directory, and file operations.

npx playbooks add skill sickn33/antigravity-awesome-skills --skill azure-storage-file-datalake-py


---
name: azure-storage-file-datalake-py
description: |
  Azure Data Lake Storage Gen2 SDK for Python. Use for hierarchical file systems, big data analytics, and file/directory operations.
  Triggers: "data lake", "DataLakeServiceClient", "FileSystemClient", "ADLS Gen2", "hierarchical namespace".
package: azure-storage-file-datalake
---

# Azure Data Lake Storage Gen2 SDK for Python

Hierarchical file system for big data analytics workloads.

## Installation

```bash
pip install azure-storage-file-datalake azure-identity
```

## Environment Variables

```bash
AZURE_STORAGE_ACCOUNT_URL=https://<account>.dfs.core.windows.net
```

## Authentication

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()
account_url = "https://<account>.dfs.core.windows.net"

service_client = DataLakeServiceClient(account_url=account_url, credential=credential)
```

## Client Hierarchy

| Client | Purpose |
|--------|---------|
| `DataLakeServiceClient` | Account-level operations |
| `FileSystemClient` | Container (file system) operations |
| `DataLakeDirectoryClient` | Directory operations |
| `DataLakeFileClient` | File operations |
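
Because the namespace is hierarchical, paths compose across these levels: the same file can be reached from a `FileSystemClient` with a full path, or from a `DataLakeDirectoryClient` with a relative name. A minimal sketch (the client calls are commented out, since they need a live account; the clients are as created in the sections below):

```python
from posixpath import join as dfs_join  # ADLS Gen2 paths use forward slashes

directory = "mydir/sub"
file_name = "file.txt"
full_path = dfs_join(directory, file_name)
print(full_path)  # mydir/sub/file.txt

# Equivalent ways to reach the same file:
# file_client = file_system_client.get_file_client(full_path)
# file_client = file_system_client.get_directory_client(directory).get_file_client(file_name)
```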

## File System Operations

```python
# Create file system (container)
file_system_client = service_client.create_file_system("myfilesystem")

# Get existing
file_system_client = service_client.get_file_system_client("myfilesystem")

# Delete
service_client.delete_file_system("myfilesystem")

# List file systems
for fs in service_client.list_file_systems():
    print(fs.name)
```

## Directory Operations

```python
file_system_client = service_client.get_file_system_client("myfilesystem")

# Create directory
directory_client = file_system_client.create_directory("mydir")

# Create nested directories
directory_client = file_system_client.create_directory("path/to/nested/dir")

# Get directory client
directory_client = file_system_client.get_directory_client("mydir")

# Delete directory
directory_client.delete_directory()

# Rename/move directory
directory_client.rename_directory(new_name="myfilesystem/newname")
```

## File Operations

### Upload File

```python
# Get file client
file_client = file_system_client.get_file_client("path/to/file.txt")

# Upload from local file
with open("local-file.txt", "rb") as data:
    file_client.upload_data(data, overwrite=True)

# Upload bytes
file_client.upload_data(b"Hello, Data Lake!", overwrite=True)

# Append data in chunks (for large files; the file must exist first)
file_client.create_file()
file_client.append_data(data=b"chunk1", offset=0, length=6)
file_client.append_data(data=b"chunk2", offset=6, length=6)
file_client.flush_data(12)  # Commit all 12 bytes
```
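
The offset bookkeeping for chunked uploads is easy to get wrong, so it can help to compute the `(offset, length)` pairs separately. Below, `chunk_ranges` is a plain helper; the Azure calls are a commented sketch that assumes the `file_client` from above and a hypothetical local file `big-file.bin`:

```python
import os

def chunk_ranges(total_size: int, chunk_size: int):
    """Yield (offset, length) pairs that cover total_size bytes in order."""
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length

# Sketch: stream a local file up in 4 MiB chunks.
# size = os.path.getsize("big-file.bin")
# file_client.create_file()
# with open("big-file.bin", "rb") as src:
#     for offset, length in chunk_ranges(size, 4 * 1024 * 1024):
#         file_client.append_data(src.read(length), offset=offset, length=length)
# file_client.flush_data(size)
```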

### Download File

```python
file_client = file_system_client.get_file_client("path/to/file.txt")

# Download all content
download = file_client.download_file()
content = download.readall()

# Download to file
with open("downloaded.txt", "wb") as f:
    download = file_client.download_file()
    download.readinto(f)

# Download range
download = file_client.download_file(offset=0, length=100)
```

### Delete File

```python
file_client.delete_file()
```

## List Contents

```python
# List paths (files and directories)
for path in file_system_client.get_paths():
    print(f"{'DIR' if path.is_directory else 'FILE'}: {path.name}")

# List paths under a directory (recursive by default)
for path in file_system_client.get_paths(path="mydir"):
    print(path.name)

# Immediate children only
for path in file_system_client.get_paths(path="mydir", recursive=False):
    print(path.name)
```
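
`get_paths` yields path objects exposing `name` and `is_directory`, which makes filtering straightforward. A small helper, sketched here against that shape (the commented call assumes the `file_system_client` from above):

```python
def files_with_suffix(paths, suffix):
    """Return names of non-directory paths ending in suffix.

    `paths` is any iterable of objects with `.name` and `.is_directory`
    attributes, such as the result of `get_paths()`.
    """
    return [p.name for p in paths if not p.is_directory and p.name.endswith(suffix)]

# Sketch:
# parquet_files = files_with_suffix(file_system_client.get_paths(path="mydir"), ".parquet")
```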

## File/Directory Properties

```python
# Get properties
properties = file_client.get_file_properties()
print(f"Size: {properties.size}")
print(f"Last modified: {properties.last_modified}")

# Set metadata
file_client.set_metadata(metadata={"processed": "true"})
```

## Access Control (ACL)

```python
# Get ACL
acl = directory_client.get_access_control()
print(f"Owner: {acl['owner']}")
print(f"Permissions: {acl['permissions']}")

# Set ACL
directory_client.set_access_control(
    owner="user-id",
    permissions="rwxr-x---"
)

# Update ACL entries recursively
result = directory_client.update_access_control_recursive(acl="user:user-id:rwx")
# result.counters summarizes how many directories/files succeeded or failed
```

## Async Client

```python
import asyncio

from azure.identity.aio import DefaultAzureCredential
from azure.storage.filedatalake.aio import DataLakeServiceClient

async def datalake_operations():
    async with DefaultAzureCredential() as credential:
        async with DataLakeServiceClient(
            account_url="https://<account>.dfs.core.windows.net",
            credential=credential
        ) as service_client:
            file_system_client = service_client.get_file_system_client("myfilesystem")
            file_client = file_system_client.get_file_client("test.txt")

            await file_client.upload_data(b"async content", overwrite=True)

            download = await file_client.download_file()
            content = await download.readall()

asyncio.run(datalake_operations())
```

## Best Practices

1. **Use hierarchical namespace** for file system semantics
2. **Use `append_data` + `flush_data`** for large file uploads
3. **Set default ACLs at the directory level** so new children inherit them (use `update_access_control_recursive` for existing items)
4. **Use async client** for high-throughput scenarios
5. **Use `get_paths` with `recursive=True`** for full directory listing
6. **Set metadata** for custom file attributes
7. **Consider Blob API** for simple object storage use cases

Overview

This skill provides a concise, practical guide to using the Azure Data Lake Storage Gen2 SDK for Python to manage hierarchical file systems for big data workloads. It explains authentication, client hierarchy, common file and directory operations, and async usage patterns. Ideal for developers building analytics pipelines, ETL jobs, or automated file workflows on ADLS Gen2.

How this skill works

The skill shows how to authenticate with DefaultAzureCredential and instantiate DataLakeServiceClient against an account endpoint. It details the client hierarchy (service, filesystem, directory, file) and demonstrates creating, listing, uploading, downloading, renaming, deleting, and setting properties/ACLs. It also covers append/flush semantics for large uploads and the async client for high-throughput scenarios.

When to use it

  • You need hierarchical file semantics for big data analytics and distributed processing.
  • Uploading large files in chunks or appending data incrementally.
  • Automating ETL pipelines that require metadata and ACL management.
  • Listing and recursively traversing large directory trees.
  • When you need async, high-throughput file operations in Python.

Best practices

  • Use DefaultAzureCredential to centralize authentication across environments.
  • Prefer append_data + flush_data for large or streaming uploads to avoid memory spikes.
  • Set default ACLs at the directory level, or use update_access_control_recursive, to propagate access control to children.
  • Use the async client when performing many concurrent operations or high-throughput transfers.
  • Store important attributes in metadata rather than embedding them into filenames for easier querying.
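
The "avoid memory spikes" point above can be made concrete with a generator that holds at most one block in memory at a time. This is a sketch: the helper is plain Python, and the commented Azure calls assume a `DataLakeFileClient` and a hypothetical local file, as shown in the earlier sections:

```python
def read_in_blocks(fileobj, block_size=4 * 1024 * 1024):
    """Yield successive fixed-size blocks from a binary file object (last block may be shorter)."""
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        yield block

# Sketch: feed blocks to append_data, then commit with flush_data.
# offset = 0
# with open("big-file.bin", "rb") as src:
#     for block in read_in_blocks(src):
#         file_client.append_data(block, offset=offset, length=len(block))
#         offset += len(block)
# file_client.flush_data(offset)
```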

Example use cases

  • ETL pipeline writes intermediate parquet files into a hierarchical filesystem and sets metadata for downstream consumers.
  • A log ingestion service appends event batches to large files and flushes periodically to commit data.
  • A data science job recursively lists a project directory to discover input datasets and versions.
  • A serverless function downloads a small range from a large file for fast preview without reading the whole object.
  • Automated provisioning script creates file systems and directories, sets ACLs, and seeds initial control files.
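
The range-download preview case from the list above can be sketched with a small pure helper plus one hedged client call (the `file_client` is assumed from the earlier sections; a byte-range read can split a multi-byte UTF-8 character, so the trailing fragment is dropped):

```python
def preview_text(raw: bytes, limit: int = 120) -> str:
    """Decode the first bytes of a download for display, ignoring any partially-read trailing character."""
    return raw[:limit].decode("utf-8", errors="ignore")

# Sketch:
# head = file_client.download_file(offset=0, length=256).readall()
# print(preview_text(head))
```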

FAQ

How do I authenticate from local development?

Use DefaultAzureCredential (Azure CLI, Visual Studio, or environment variables are picked up automatically). Set AZURE_STORAGE_ACCOUNT_URL to your dfs endpoint.

When should I use the Blob API instead?

Use Blob API for simple object storage without hierarchical semantics; choose ADLS Gen2 when you need directories, ACLs, and big-data-friendly operations.