home / skills / adaptationio / skrillz / railway-troubleshooting

railway-troubleshooting skill

/skills/railway-troubleshooting

npx playbooks add skill adaptationio/skrillz --skill railway-troubleshooting

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
10.9 KB
---
name: railway-troubleshooting
description: Railway debugging and issue resolution. Use when deployments fail, builds error, services crash, performance degrades, or networking issues occur.
---

# Railway Troubleshooting

Systematic debugging and issue resolution for Railway.com deployments.

## Overview

This skill provides decision trees, diagnostic workflows, and recovery procedures for Railway platform issues. It covers build failures, runtime crashes, networking problems, database issues, and performance degradation.

## Quick Start

Use this decision tree to diagnose and resolve Railway issues:

```
Railway Issue?
│
├── Deployment Failed?
│   ├── Build Error → Operation 1: Diagnose Build Failures
│   ├── Deploy Error → Operation 1: Diagnose Deployment Failures
│   ├── Health Check Failed → Check service health endpoint
│   └── Timeout → Check build/deploy timeouts in settings
│
├── Service Crashing?
│   ├── Immediate crash → Operation 2: Debug Runtime Crashes
│   ├── Crash after time → Check memory limits, memory leaks
│   ├── Restart loop → Check startup command, dependencies
│   └── Exit code errors → Check application logs for specifics
│
├── Networking Issues?
│   ├── Service unreachable → Operation 3: Troubleshoot Networking
│   ├── Intermittent connectivity → Check DNS, service discovery
│   ├── SSL errors → Check domain configuration, certificates
│   └── Timeout errors → Check port configuration, firewalls
│
├── Build Issues?
│   ├── Nixpacks detection wrong → Operation 4: Fix Build Errors
│   ├── Dependencies failing → Check package.json, requirements.txt
│   ├── Build commands failing → Verify build scripts
│   └── Cache issues → Clear build cache, force rebuild
│
└── Database Problems?
    ├── Connection refused → Operation 5: Resolve Database Issues
    ├── Timeout errors → Check connection pools, query performance
    ├── Performance slow → Check indices, query optimization
    └── Data corruption → Check backups, recovery procedures
```

## Operations

### Operation 1: Diagnose Deployment Failures

Identify and resolve deployment failures through systematic log analysis.

**When to use**: Deployment status shows failed, builds succeed but deploys fail, health checks failing.

**Workflow**:

1. **Check Deployment Status**
   ```bash
   # CLI approach
   railway status
   railway logs --deployment

   # API approach (see references/debug-workflow.md for GraphQL)
   # Query deployment status and recent deploys
   ```

2. **Analyze Deploy Logs**
   - Check for port binding issues (Railway expects PORT env var)
   - Verify health check endpoint responding
   - Check startup command execution
   - Identify timeout issues

3. **Common Deploy Failures**
   - Port not bound: App must listen on `process.env.PORT`
   - Health check timeout: Increase timeout or fix endpoint
   - Missing environment variables: Check service variables
   - Startup command wrong: Verify start command in settings

4. **Fix and Redeploy**
   - Apply fix to code/configuration
   - Trigger new deployment
   - Monitor deployment logs
   - Verify service healthy

**See**: `references/common-errors.md` for specific error messages and solutions.

### Operation 2: Debug Runtime Crashes

Investigate and resolve service crashes and restart loops.

**When to use**: Service shows restarting, exit codes in logs, OOM errors, crash reports.

**Workflow**:

1. **Gather Crash Information**
   ```bash
   # Get runtime logs
   railway logs --tail 500

   # Check service metrics
   railway metrics

   # Use diagnostic script
   ./scripts/diagnose.sh [service-id] --verbose
   ```

2. **Identify Crash Pattern**
   - Immediate crash: Startup issue (missing deps, config error)
   - Crash after time: Memory leak, resource exhaustion
   - Intermittent crash: Race condition, external dependency
   - Exit code 137: Out of Memory (OOM) killed

3. **Check Resource Limits**
   - Memory usage trending up → Memory leak
   - CPU at 100% → Infinite loop, CPU-intensive operation
   - Disk full → Log rotation issue, temp files
   - Connection limits → Database pool exhausted

4. **Common Crash Causes**
   - OOM: Increase memory limit or fix memory leak
   - Missing dependencies: Check package installation
   - Uncaught exceptions: Add error handling
   - External service down: Add retry logic, circuit breakers

**See**: `references/debug-workflow.md` for systematic debugging steps.

### Operation 3: Troubleshoot Networking

Resolve networking issues including service discovery, DNS, and connectivity.

**When to use**: Services can't reach each other, DNS resolution fails, external access issues, SSL errors.

**Workflow**:

1. **Verify Service Discovery**
   ```bash
   # Check private networking enabled
   # Services use: [service-name].[project-name].railway.internal

   # Test DNS resolution
   railway run nslookup [service-name].[project-name].railway.internal
   ```

2. **Check Network Configuration**
   - Private networking enabled in project settings
   - Service names correct (use Railway-provided names)
   - Port configuration matches application
   - Environment variables for service URLs set

3. **Debug External Access**
   - Domain configured correctly in service settings
   - DNS records pointing to Railway
   - SSL certificate provisioned (check domain settings)
   - Generate domain option enabled for public access

4. **Common Network Issues**
   - Service discovery: Use full internal domain name
   - Port mismatch: App must listen on PORT env var
   - SSL not working: Allow time for cert provisioning (5-10 min)
   - Timeout: Check for firewall rules, rate limiting

**See**: `references/common-errors.md` Network Errors section.

### Operation 4: Fix Build Errors

Resolve build failures, nixpacks configuration issues, and dependency problems.

**When to use**: Build fails, wrong builder detected, dependencies not installing, build commands fail.

**Workflow**:

1. **Check Build Logs**
   ```bash
   railway logs --build

   # Identify build phase failure:
   # - Detection phase: Nixpacks provider detection
   # - Install phase: Dependencies installation
   # - Build phase: Build commands execution
   ```

2. **Verify Builder Configuration**
   - Check nixpacks.toml or railway.toml for custom config
   - Verify build command in service settings
   - Check for language version specification
   - Ensure correct provider detected (Node, Python, Go, etc.)

3. **Fix Dependency Issues**
   - Lock file present (package-lock.json, yarn.lock, requirements.txt)
   - Dependencies compatible with build environment
   - Private packages have auth configured
   - Build dependencies vs runtime dependencies separated

4. **Force Rebuild if Needed**
   ```bash
   # Clear cache and rebuild
   ./scripts/force-rebuild.sh [service-id] --no-cache

   # Or via CLI
   railway up --detach
   ```

**Common Build Errors**:
- Wrong nixpacks provider: Add nixpacks.toml with correct provider
- Dependency resolution: Update lock files, fix version conflicts
- Build timeout: Optimize build, increase timeout in settings
- Cache issues: Clear build cache with force rebuild

**See**: `references/common-errors.md` Build Errors section.

### Operation 5: Resolve Database Issues

Debug database connection problems, timeouts, and performance issues.

**When to use**: Connection refused, database timeouts, slow queries, connection pool exhausted.

**Workflow**:

1. **Verify Database Connection**
   ```bash
   # Check database service status
   railway status

   # Test connection with database URL
   railway run psql $DATABASE_URL -c "SELECT 1"
   ```

2. **Check Connection Configuration**
   - DATABASE_URL environment variable set correctly
   - Connection pool size appropriate for service plan
   - Connection timeout settings reasonable
   - SSL mode configured if required

3. **Debug Connection Issues**
   - Connection refused: Database not started, wrong host/port
   - Timeout: Network issue, slow queries, pool exhausted
   - Auth failed: Wrong credentials, user permissions
   - Too many connections: Pool size exceeded, connection leak

4. **Performance Troubleshooting**
   - Slow queries: Check query plans, add indices
   - High CPU: Identify expensive queries, optimize
   - Connection pool exhausted: Increase pool size or fix leaks
   - Disk space: Clean up old data, increase storage

**Emergency Recovery**:
- Restart database service: `railway restart [service-id]`
- Check backups: Railway auto-backups available
- Scale vertically: Upgrade database plan if needed
- Connection leak: Restart application services

**See**: `references/recovery-procedures.md` for emergency procedures.

## Related Skills

- `railway-auth`: Authentication setup for Railway CLI/API
- `railway-logs`: Advanced log querying and analysis
- `railway-deployment`: Deployment workflows and strategies
- `railway-api`: GraphQL API queries and operations

## When to Use This Skill

Use railway-troubleshooting when you encounter:
- ❌ Deployment failures or build errors
- 🔄 Service restart loops or crashes
- 🌐 Networking or connectivity issues
- 🐛 Runtime errors or performance problems
- 💾 Database connection or query issues
- ⚡ Performance degradation
- 🔧 Configuration or environment issues

## Quick Diagnostic

Run the diagnostic script for automated issue detection:

```bash
cd /mnt/c/data/github/skrillz/.claude/skills/railway-troubleshooting/scripts
./diagnose.sh [service-id] --verbose
```

The script will:
- Check service health status
- Analyze recent deployment logs
- Scan for common error patterns
- Check resource utilization
- Provide specific recommendations

## Additional Resources

- **Common Errors Guide**: `references/common-errors.md` - 20+ documented errors with solutions
- **Debug Workflow**: `references/debug-workflow.md` - Systematic debugging methodology
- **Recovery Procedures**: `references/recovery-procedures.md` - Emergency recovery steps
- **Diagnostic Script**: `scripts/diagnose.sh` - Automated diagnostics
- **Force Rebuild**: `scripts/force-rebuild.sh` - Clear cache and rebuild

## Best Practices

1. **Always check logs first**: Build logs, deploy logs, runtime logs
2. **Verify environment variables**: Missing vars cause most deployment failures
3. **Check resource limits**: Memory/CPU limits appropriate for workload
4. **Test locally first**: Reproduce issues locally when possible
5. **Monitor metrics**: Use Railway dashboard for trends
6. **Document solutions**: Update common-errors.md with new patterns
7. **Use private networking**: For inter-service communication
8. **Enable health checks**: Catch deployment issues early

## Support

For issues not covered by this skill:
- Railway Documentation: https://docs.railway.com
- Railway Discord: Active community support
- Railway Status: https://status.railway.com
- GitHub Issues: https://github.com/railwayapp/railway/issues