Schema.org Crawler REST API Reference
curl https://testing.nlweb.ai/api/status \
-H "X-API-Key: YOUR_API_KEY_HERE"
All API endpoints (except login/logout) require authentication. You can authenticate in two ways:
Include your API key in the X-API-Key header:
curl https://testing.nlweb.ai/api/sites \
-H "X-API-Key: YOUR_API_KEY_HERE"
Log in via GitHub OAuth at /login. The session cookie will then authenticate your requests.
To get your API key after logging in, call GET /api/me (works with session cookie) and copy api_key from the response.
# Get your API key (after OAuth login)
curl https://testing.nlweb.ai/api/me \
--cookie "session=YOUR_SESSION_COOKIE"
# Response:
{
"user_id": "github:12345",
"email": "user@example.com",
"name": "Your Name",
"provider": "github",
"api_key": "Abc123XyZ_YourApiKeyHere...",
"created_at": "2024-01-15T10:30:00",
"last_login": "2024-01-20T14:22:00"
}
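For scripted access, the API-key method can be sketched in Python using only the standard library. The helper name `build_request` and the `BASE_URL` constant are illustrative assumptions, not part of the API; the request is built but not sent.

```python
import urllib.request

# Host taken from the curl examples in this reference.
BASE_URL = "https://testing.nlweb.ai"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Return a GET request for `path` that carries the X-API-Key header."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={"X-API-Key": api_key},
        method="GET",
    )


req = build_request("/api/sites", "YOUR_API_KEY")
# To actually send it: urllib.request.urlopen(req).read()
```

Keeping construction separate from sending makes it easy to inspect the URL and headers before any network call happens.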
List all sites you're monitoring.
curl https://testing.nlweb.ai/api/sites \
-H "X-API-Key: YOUR_API_KEY"
[
{
"site_url": "https://example.com",
"process_interval_hours": 24,
"last_processed": "2024-01-20T10:30:00",
"is_active": true,
"created_at": "2024-01-15T08:00:00"
}
]
Add a new site to monitor. The crawler will automatically discover schema maps from robots.txt.
| Parameter | Type | Required | Description |
|---|---|---|---|
| site_url | string | Required | Full URL of the site (e.g., https://example.com) |
| interval_hours | integer | Optional | How often to reprocess (default: 24 hours) |
curl -X POST https://testing.nlweb.ai/api/sites \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"site_url": "https://example.com",
"interval_hours": 24
}'
{
"success": true,
"site_url": "https://example.com"
}
The crawler will fetch https://example.com/robots.txt, look for schemaMap: directives, download the schema maps, and queue all JSON-LD files for processing.
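The robots.txt discovery step can be illustrated with a short Python sketch. It assumes one `schemaMap:` directive per line and matches the directive name case-insensitively; the crawler's actual parsing rules are not documented here, and the function name `find_schema_maps` is hypothetical.

```python
def find_schema_maps(robots_txt: str) -> list[str]:
    """Collect the URLs that follow schemaMap: in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" is preserved.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "schemamap":
            urls.append(value.strip())
    return urls


sample = """User-agent: *
Disallow: /private
schemaMap: https://example.com/schema_map.xml
"""
schema_maps = find_schema_maps(sample)
print(schema_maps)
```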
Remove a site from monitoring. This deletes all files, IDs, and vector DB entries.
| Parameter | Description |
|---|---|
| url | URL-encoded site URL (e.g., https%3A%2F%2Fexample.com) |
curl -X DELETE "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com" \
-H "X-API-Key: YOUR_API_KEY"
{
"success": true,
"schema_maps_removed": 2,
"files_queued_for_removal": 150
}
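Because the site URL travels inside the request path, every reserved character (including `/` and `:`) has to be percent-encoded. A minimal Python sketch, with the hypothetical helper name `site_path`:

```python
from urllib.parse import quote


def site_path(site_url: str) -> str:
    """Return the /api/sites/{url} path with the site URL percent-encoded."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return "/api/sites/" + quote(site_url, safe="")


print(site_path("https://example.com"))
# /api/sites/https%3A%2F%2Fexample.com
```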
Manually trigger processing for a site (bypasses scheduled interval).
curl -X POST "https://testing.nlweb.ai/api/process/https%3A%2F%2Fexample.com" \
-H "X-API-Key: YOUR_API_KEY"
{
"success": true,
"message": "Processing started for https://example.com"
}
Manually add a specific schema map to a site (useful if not in robots.txt).
| Parameter | Type | Required | Description |
|---|---|---|---|
| schema_map_url | string | Required | Full URL to the schema_map.xml file |
curl -X POST "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com/schema-files" \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"schema_map_url": "https://example.com/custom_schema_map.xml"
}'
{
"success": true,
"site_url": "https://example.com",
"schema_map_url": "https://example.com/custom_schema_map.xml",
"files_added": 50,
"files_queued": 50
}
Remove a specific schema map and all its files from a site.
| Parameter | Type | Required | Description |
|---|---|---|---|
| schema_map_url | string | Required | URL of the schema map to remove |
curl -X DELETE "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com/schema-files" \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"schema_map_url": "https://example.com/custom_schema_map.xml"
}'
{
"success": true,
"deleted_count": 50,
"files_queued_for_removal": 50
}
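The DELETE call above is unusual in that it carries a JSON body. One way to build the same request in Python is sketched below; the function name is an assumption, and the request is constructed but not sent.

```python
import json
import urllib.request
from urllib.parse import quote


def build_remove_schema_map_request(api_key, site_url, schema_map_url):
    """Build DELETE /api/sites/{url}/schema-files with a JSON body."""
    endpoint = (
        "https://testing.nlweb.ai/api/sites/"
        + quote(site_url, safe="")  # encode "/" and ":" in the path segment
        + "/schema-files"
    )
    body = json.dumps({"schema_map_url": schema_map_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="DELETE",
    )


req = build_remove_schema_map_request(
    "YOUR_API_KEY",
    "https://example.com",
    "https://example.com/custom_schema_map.xml",
)
```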
Get all files for a specific site.
curl "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com/files" \
-H "X-API-Key: YOUR_API_KEY"
[
{
"file_url": "https://example.com/schema/products/1.json",
"schema_map": "https://example.com/schema_map.xml",
"last_read_time": "2024-01-20T10:30:00",
"number_of_items": 25,
"is_manual": false,
"is_active": true
}
]
Get all files across all your sites.
curl https://testing.nlweb.ai/api/files \
-H "X-API-Key: YOUR_API_KEY"
[
{
"site_url": "https://example.com",
"file_url": "https://example.com/schema/products/1.json",
"schema_map": "https://example.com/schema_map.xml",
"is_active": true,
"is_manual": false,
"number_of_items": 25,
"last_read_time": "2024-01-20T10:30:00",
"id_count": 25
}
]
Get all @id values from a specific JSON-LD file.
curl "https://testing.nlweb.ai/api/files/https%3A%2F%2Fexample.com%2Fschema%2Fproducts%2F1.json/ids" \
-H "X-API-Key: YOUR_API_KEY"
{
"file_url": "https://example.com/schema/products/1.json",
"ids": [
"https://example.com/product/123",
"https://example.com/product/124",
"https://example.com/offer/456"
],
"count": 3
}
Get overall system status and statistics.
curl https://testing.nlweb.ai/api/status \
-H "X-API-Key: YOUR_API_KEY"
{
"master_started_at": "2024-01-20T08:00:00",
"master_uptime_seconds": 43200,
"sites": [
{
"site_url": "https://example.com",
"is_active": true,
"last_processed": "2024-01-20T10:30:00",
"total_files": 150,
"manual_files": 0,
"total_ids": 3750
}
]
}
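As a rough illustration of consuming this response, the sketch below rolls the per-site counters up into overall totals. It relies only on the fields shown in the sample; the function name `summarize_status` is hypothetical.

```python
def summarize_status(status: dict) -> dict:
    """Aggregate the per-site counters from a GET /api/status response."""
    sites = status["sites"]
    return {
        "active_sites": sum(1 for s in sites if s["is_active"]),
        "total_files": sum(s["total_files"] for s in sites),
        "total_ids": sum(s["total_ids"] for s in sites),
        "uptime_hours": status["master_uptime_seconds"] / 3600,
    }


# The sample response from above, already parsed from JSON.
sample = {
    "master_started_at": "2024-01-20T08:00:00",
    "master_uptime_seconds": 43200,
    "sites": [
        {
            "site_url": "https://example.com",
            "is_active": True,
            "last_processed": "2024-01-20T10:30:00",
            "total_files": 150,
            "manual_files": 0,
            "total_ids": 3750,
        }
    ],
}
summary = summarize_status(sample)
```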
Get queue processing status (pending, processing, failed jobs).
curl https://testing.nlweb.ai/api/queue/status \
-H "X-API-Key: YOUR_API_KEY"
{
"queue_type": "servicebus",
"pending_jobs": 42,
"processing_jobs": 3,
"failed_jobs": 0,
"total_jobs": 45,
"jobs": [
{
"id": "job-20240120-103045-123456",
"status": "processing",
"type": "process_file",
"site": "https://example.com",
"file_url": "https://example.com/schema/1.json",
"queued_at": "2024-01-20T10:30:45",
"processing_time": 15
}
],
"error": null
}
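A common use of this endpoint is waiting for the queue to drain after triggering processing. The sketch below shows one possible polling loop. `fetch_status` is a caller-supplied function returning the parsed JSON, and treating a non-null `error` field as a failure is an assumption, since the field's semantics are not described here.

```python
import time
from typing import Callable


def wait_until_idle(fetch_status: Callable[[], dict],
                    poll_seconds: float = 5.0,
                    max_polls: int = 60) -> dict:
    """Poll until no jobs are pending or processing; return the last status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("error"):  # assumption: non-null means the check failed
            raise RuntimeError(status["error"])
        if status["pending_jobs"] == 0 and status["processing_jobs"] == 0:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("queue did not drain within the polling budget")


# Demo with canned responses instead of real HTTP calls.
responses = iter([
    {"pending_jobs": 2, "processing_jobs": 1, "failed_jobs": 0, "error": None},
    {"pending_jobs": 0, "processing_jobs": 0, "failed_jobs": 0, "error": None},
])
final = wait_until_idle(lambda: next(responses), poll_seconds=0)
```

Since there are no documented rate limits, the polling interval here is a courtesy choice rather than a requirement.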
Get status of all worker pods (Kubernetes only).
curl https://testing.nlweb.ai/api/workers \
-H "X-API-Key: YOUR_API_KEY"
[
{
"name": "crawler-worker-5d7c8f9-abc12",
"ip": "10.244.1.5",
"phase": "Running",
"status": {
"worker_id": "crawler-worker-5d7c8f9-abc12",
"started_at": "2024-01-20T08:00:00",
"current_job": {
"type": "process_file",
"file_url": "https://example.com/schema/1.json"
},
"total_jobs_processed": 1250,
"total_jobs_failed": 3,
"last_job_at": "2024-01-20T10:30:00",
"last_job_status": "success",
"status": "processing"
},
"error": null
}
]
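For monitoring, a script might flag pods that need attention. The sketch below treats a worker as unhealthy if its Kubernetes phase is not `Running` or if its `error` field is set; that health rule and the function name are assumptions for illustration.

```python
def unhealthy_workers(workers: list[dict]) -> list[str]:
    """Return names of pods that are not Running or that report an error."""
    return [
        w["name"]
        for w in workers
        if w["phase"] != "Running" or w["error"]
    ]


# Trimmed-down entries using the field names from the sample response.
workers = [
    {"name": "crawler-worker-5d7c8f9-abc12", "phase": "Running", "error": None},
    {"name": "crawler-worker-5d7c8f9-def34", "phase": "Pending", "error": None},
]
flagged = unhealthy_workers(workers)
```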
Get your user information including your API key.
curl https://testing.nlweb.ai/api/me \
-H "X-API-Key: YOUR_API_KEY"
{
"user_id": "github:12345",
"email": "user@example.com",
"name": "Your Name",
"provider": "github",
"api_key": "Abc123XyZ_YourApiKeyHere...",
"created_at": "2024-01-15T10:30:00",
"last_login": "2024-01-20T14:22:00"
}
{
"error": "Authentication required. Provide X-API-Key header or login via OAuth."
}
{
"error": "User not found"
}
{
"error": "site_url is required"
}
{
"error": "Internal server error message"
}
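All of the error bodies above share a single `error` key, so a client can extract the message uniformly. A minimal sketch, with the hypothetical helper name `error_message`:

```python
import json
from typing import Optional


def error_message(body: bytes) -> Optional[str]:
    """Return the "error" string from a JSON response body, or None."""
    try:
        payload = json.loads(body)
    except ValueError:  # not valid JSON (JSONDecodeError is a ValueError)
        return None
    # List responses (for example GET /api/sites) carry no error key.
    if isinstance(payload, dict):
        return payload.get("error")
    return None


msg = error_message(b'{"error": "site_url is required"}')
```

Successful responses that include `"error": null`, such as the queue status, also yield `None` with this helper.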
Currently there are no rate limits, but we recommend:
Schema maps should follow this XML format:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url contentType="structuredData/schema.org">
<loc>https://example.com/schema/products/1.json</loc>
</url>
<url contentType="structuredData/schema.org">
<loc>https://example.com/schema/products/2.json</loc>
</url>
</urlset>
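To check a schema map before registering it, the file can be parsed with Python's standard XML library. The sketch below extracts each `<loc>` whose `<url>` element is marked with the `structuredData/schema.org` content type. Filtering on that attribute mirrors the format shown above but is an assumption about how strictly the crawler applies it.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the schema map format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def schema_file_urls(xml_bytes: bytes) -> list[str]:
    """Return the <loc> URLs of schema.org entries in a schema map."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for url in root.findall("sm:url", SITEMAP_NS):
        if url.get("contentType") != "structuredData/schema.org":
            continue
        loc = url.find("sm:loc", SITEMAP_NS)
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
    return urls


sample = b"""<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url contentType="structuredData/schema.org">
<loc>https://example.com/schema/products/1.json</loc>
</url>
<url contentType="structuredData/schema.org">
<loc>https://example.com/schema/products/2.json</loc>
</url>
</urlset>"""
files = schema_file_urls(sample)
```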
robots.txt Discovery:
Add to your robots.txt:
schemaMap: https://example.com/schema_map.xml