home / skills / transilienceai / communitytools / html_content_analysis
/projects/techstack_identification/.claude/skills/html_content_analysis
This skill analyzes HTML to extract technology signals from meta tags, comments, script patterns, and structured data for security research.
npx playbooks add skill transilienceai/communitytools --skill html_content_analysisReview the files below or copy the command above to add this skill to your agents.
---
name: html-content-analysis
description: Parses HTML for meta tags, generator comments, and script URL patterns
tools: Bash, WebFetch, Grep
model: inherit
hooks:
PreToolUse:
- matcher: "WebFetch"
hooks:
- type: command
command: "../../../hooks/skills/pre_network_skill_hook.sh"
PostToolUse:
- matcher: "WebFetch"
hooks:
- type: command
command: "../../../hooks/skills/post_evidence_capture_hook.sh"
---
# HTML Content Analysis Skill
## Purpose
Parse HTML documents to extract technology signals from meta tags, generator comments, script URLs, CSS frameworks, and structural patterns.
## Operations
### 1. extract_meta_generator
Find `<meta name="generator">` tags.
**Command (conceptual):**
```bash
curl -s {url} | grep -oP '<meta[^>]*name=["\']generator["\'][^>]*content=["\'][^"\']+["\']'
```
**Generator Patterns:**
```json
{
"WordPress": {
"pattern": "WordPress[\\s]?([\\d.]+)?",
"tech": "WordPress",
"extract_version": true,
"confidence": 95
},
"Drupal": {
"pattern": "Drupal[\\s]?([\\d.]+)?",
"tech": "Drupal",
"extract_version": true,
"confidence": 95
},
"Joomla": {
"pattern": "Joomla!?[\\s]?([\\d.]+)?",
"tech": "Joomla",
"extract_version": true,
"confidence": 95
},
"Shopify": {
"pattern": "Shopify",
"tech": "Shopify",
"confidence": 95
},
"Wix.com": {
"pattern": "Wix\\.com",
"tech": "Wix",
"confidence": 95
},
"Squarespace": {
"pattern": "Squarespace",
"tech": "Squarespace",
"confidence": 95
},
"Ghost": {
"pattern": "Ghost[\\s]?([\\d.]+)?",
"tech": "Ghost",
"extract_version": true,
"confidence": 95
},
"Hugo": {
"pattern": "Hugo[\\s]?([\\d.]+)?",
"tech": "Hugo",
"extract_version": true,
"confidence": 95
},
"Jekyll": {
"pattern": "Jekyll[\\s]?([\\d.]+)?",
"tech": "Jekyll",
"extract_version": true,
"confidence": 95
},
"Gatsby": {
"pattern": "Gatsby[\\s]?([\\d.]+)?",
"tech": "Gatsby",
"extract_version": true,
"implies": ["React"],
"confidence": 95
}
}
```
### 2. scan_html_comments
Look for technology hints in HTML comments.
**Comment Patterns:**
```json
{
"Powered by": {
"pattern": "<!--[^>]*[Pp]owered by ([^>-]+)-->",
"extract_group": 1,
"confidence": 80
},
"Generated by": {
"pattern": "<!--[^>]*[Gg]enerated by ([^>-]+)-->",
"extract_group": 1,
"confidence": 80
},
"Built with": {
"pattern": "<!--[^>]*[Bb]uilt with ([^>-]+)-->",
"extract_group": 1,
"confidence": 75
},
"WordPress": {
"pattern": "<!--[^>]*wp-content[^>]*-->",
"tech": "WordPress",
"confidence": 85
},
"Drupal": {
"pattern": "<!--[^>]*drupal[^>]*-->",
"tech": "Drupal",
"confidence": 85
},
"Magento": {
"pattern": "<!--[^>]*Magento[^>]*-->",
"tech": "Magento",
"confidence": 90
}
}
```
### 3. analyze_script_urls
Identify framework-specific paths in script tags.
**Script URL Patterns:**
```json
{
"/wp-content/": {"tech": "WordPress", "confidence": 95},
"/wp-includes/": {"tech": "WordPress", "confidence": 95},
"/sites/default/files/": {"tech": "Drupal", "confidence": 90},
"/misc/drupal.js": {"tech": "Drupal", "confidence": 95},
"/_next/": {"tech": "Next.js", "confidence": 95, "implies": ["React"]},
"/_nuxt/": {"tech": "Nuxt.js", "confidence": 95, "implies": ["Vue.js"]},
"/static/js/main.": {"tech": "Create React App", "confidence": 85},
"/assets/application-": {"tech": "Ruby on Rails", "confidence": 80},
"/bundles/": {"tech": "ASP.NET", "confidence": 75},
"/Scripts/": {"tech": "ASP.NET", "confidence": 75},
"jquery": {"tech": "jQuery", "confidence": 90},
"bootstrap": {"tech": "Bootstrap", "confidence": 90},
"angular": {"tech": "Angular", "confidence": 85},
"react": {"tech": "React", "confidence": 85},
"vue": {"tech": "Vue.js", "confidence": 85}
}
```
### 4. detect_css_frameworks
Find CSS framework classes in HTML.
**CSS Framework Patterns:**
```json
{
"Bootstrap": {
"classes": ["btn btn-", "container", "navbar", "row", "col-", "card", "modal"],
"min_matches": 3,
"confidence": 85
},
"Tailwind CSS": {
"classes": ["bg-", "text-", "flex", "p-", "m-", "w-", "h-", "rounded-"],
"min_matches": 5,
"confidence": 85
},
"Foundation": {
"classes": ["button", "callout", "top-bar", "grid-x", "cell"],
"min_matches": 3,
"confidence": 85
},
"Bulma": {
"classes": ["button is-", "columns", "column", "hero", "box"],
"min_matches": 3,
"confidence": 85
},
"Material UI": {
"classes": ["MuiButton", "MuiGrid", "MuiPaper", "MuiTypography"],
"min_matches": 2,
"confidence": 90
},
"Chakra UI": {
"classes": ["chakra-", "css-"],
"min_matches": 2,
"confidence": 80
},
"Ant Design": {
"classes": ["ant-btn", "ant-card", "ant-table", "ant-form"],
"min_matches": 2,
"confidence": 90
}
}
```
### 5. extract_structured_data
Find JSON-LD and structured data.
**Process:**
1. Find `<script type="application/ld+json">` tags
2. Parse JSON content
3. Extract @type and properties
4. Note e-commerce, organization, or product data
**Structured Data Signals:**
```json
{
"Product": {"indicates": "E-commerce site"},
"Organization": {"indicates": "Business website"},
"LocalBusiness": {"indicates": "Local business"},
"Article": {"indicates": "Blog/News site"},
"WebApplication": {"indicates": "Web app"}
}
```
## Output
```json
{
"skill": "html_content_analysis",
"domain": "string",
"results": {
"pages_analyzed": "number",
"meta_generators": [
{
"url": "string",
"generator": "WordPress 6.4.2",
"tech": "WordPress",
"version": "6.4.2",
"confidence": 95
}
],
"html_comments": [
{
"url": "string",
"comment": "Powered by Django",
"tech": "Django",
"confidence": 80
}
],
"script_analysis": {
"urls_analyzed": "number",
"frameworks_detected": [
{
"tech": "Next.js",
"source": "/_next/ path pattern",
"confidence": 95
}
],
"cdns_used": ["cdnjs.cloudflare.com", "unpkg.com"]
},
"css_frameworks": [
{
"name": "Bootstrap",
"classes_matched": ["btn", "container", "navbar", "row"],
"confidence": 85
},
{
"name": "Tailwind CSS",
"classes_matched": ["bg-white", "text-gray-500", "flex"],
"confidence": 85
}
],
"structured_data": [
{
"type": "Organization",
"url": "string",
"signals": "Business website"
}
],
"technologies_summary": [
{
"name": "string",
"category": "CMS|Framework|CSS|Library",
"confidence": "number",
"sources": ["array"]
}
]
},
"evidence": [
{
"type": "meta_generator",
"url": "string",
"value": "string",
"timestamp": "ISO-8601"
},
{
"type": "script_url",
"url": "string",
"path": "string"
}
]
}
```
## Rate Limiting
- Page fetches: 10/minute per domain
- HTML parsing: No limit (local processing)
## Error Handling
- Malformed HTML: Use lenient parser
- Large pages: Limit to first 5MB
- Encoding issues: Detect and handle charset
- Continue on parse errors
## Security Considerations
- Only fetch public pages
- Do not execute scripts
- Sanitize extracted content
- Log all fetches for audit
This skill parses HTML to extract technology signals from meta generator tags, HTML comments, script and asset URLs, CSS classes, and JSON-LD structured data. It turns noisy page content into concise technology indicators and confidence scores for security research, bug bounty triage, and reconnaissance. The skill is tuned to detect CMS, frameworks, libraries, CSS frameworks, and common e-commerce or organization structured data.
The skill fetches HTML (respecting rate limits) and uses a lenient parser to locate <meta name="generator"> tags, HTML comments, <script> and <link> URLs, class attributes, and application/ld+json blocks. Pattern dictionaries map matched values to technologies with confidence scores and optional version extraction. Results include per-page evidence, a technologies summary, and detected structured-data types.
How reliable are the detections?
Detections are heuristic and include confidence scores; treat them as leads and validate with additional probes before taking action.
Does the skill execute JavaScript or fetch linked resources?
No. It only parses static HTML. It does not execute scripts or fetch resource bodies beyond basic URL scanning to avoid side effects.