📚 API Documentation

Schema.org Crawler REST API Reference

🚀 Quick Start

Step 1: Get Your API Key

  1. Go to /login and sign in with GitHub
  2. After login, go to Dashboard
  3. Click on your profile/user menu
  4. Copy your API key (auto-generated on first login)

Step 2: Make Your First API Call

curl https://testing.nlweb.ai/api/status \
  -H "X-API-Key: YOUR_API_KEY_HERE"
✓ That's it! You're ready to use all API endpoints.

🔐 Authentication

All API endpoints (except login/logout) require authentication. You can authenticate in two ways:

Method 1: API Key (Recommended for Programmatic Access)

Include your API key in the X-API-Key header:

curl https://testing.nlweb.ai/api/sites \
  -H "X-API-Key: YOUR_API_KEY_HERE"
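
For scripted access, the same header can be attached from Python. The following is a minimal sketch using only the standard library. The helper name `build_request` and the `BASE_URL` constant are illustrative assumptions, not part of the API. The sketch only constructs the request; nothing is sent over the network.

```python
import urllib.request

# Host taken from the curl examples in this document.
BASE_URL = "https://testing.nlweb.ai"


def build_request(path, api_key, method="GET"):
    """Build (but do not send) a request authenticated with X-API-Key."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={"X-API-Key": api_key},
        method=method,
    )


req = build_request("/api/sites", "YOUR_API_KEY_HERE")
print(req.get_method(), req.full_url)
```

To send the request, pass `req` to `urllib.request.urlopen` and parse the JSON body of the response.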

Method 2: OAuth Session (For Web UI)

Log in via GitHub OAuth at /login. The session cookie then authenticates your requests.

Getting Your API Key

  1. Log in via OAuth: Go to /login and sign in with GitHub
  2. Retrieve your key: Make a request to GET /api/me (works with the session cookie)
  3. Use the key: Copy the api_key value from the response
# Get your API key (after OAuth login)
curl https://testing.nlweb.ai/api/me \
  --cookie "session=YOUR_SESSION_COOKIE"

# Response:
{
  "user_id": "github:12345",
  "email": "user@example.com",
  "name": "Your Name",
  "provider": "github",
  "api_key": "Abc123XyZ_YourApiKeyHere...",
  "created_at": "2024-01-15T10:30:00",
  "last_login": "2024-01-20T14:22:00"
}
Note: An API key is auto-generated the first time you log in with OAuth. Keys never expire and are unique per user.

🌐 Sites API

GET /api/sites

List all sites you're monitoring.

Request

curl https://testing.nlweb.ai/api/sites \
  -H "X-API-Key: YOUR_API_KEY"

Response

[
  {
    "site_url": "https://example.com",
    "process_interval_hours": 24,
    "last_processed": "2024-01-20T10:30:00",
    "is_active": true,
    "created_at": "2024-01-15T08:00:00"
  }
]
POST /api/sites

Add a new site to monitor. The crawler will automatically discover schema maps from robots.txt.

Request Body

Parameter        Type      Required   Description
site_url         string    Required   Full URL of the site (e.g., https://example.com)
interval_hours   integer   Optional   How often to reprocess (default: 24 hours)

Example

curl -X POST https://testing.nlweb.ai/api/sites \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "site_url": "https://example.com",
    "interval_hours": 24
  }'

Response

{
  "success": true,
  "site_url": "https://example.com"
}
What happens next: The crawler will fetch https://example.com/robots.txt, look for schemaMap: directives, download the schema maps, and queue all JSON-LD files for processing.
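
The POST request above can also be assembled in Python. This is a sketch under stated assumptions: the function name `build_add_site_request` is hypothetical, and the client-side default of 24 simply mirrors the documented server default. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://testing.nlweb.ai"  # host from the curl examples


def build_add_site_request(api_key, site_url, interval_hours=24):
    """Build (but do not send) the POST /api/sites request."""
    body = json.dumps({"site_url": site_url, "interval_hours": interval_hours})
    return urllib.request.Request(
        BASE_URL + "/api/sites",
        data=body.encode("utf-8"),  # JSON body must be sent as bytes
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_add_site_request("YOUR_API_KEY", "https://example.com")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```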
DELETE /api/sites/{url}

Remove a site from monitoring. This deletes all files, IDs, and vector DB entries.

URL Parameters

Parameter   Description
url         URL-encoded site URL (e.g., https%3A%2F%2Fexample.com)
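
Because the site URL travels inside the request path, every reserved character in it, including `/` and `:`, has to be percent-encoded. A minimal Python sketch of that encoding follows; the helper name `encode_url_param` is an illustrative assumption.

```python
from urllib.parse import quote


def encode_url_param(url):
    """Percent-encode a full URL so it fits in a single path segment."""
    # safe="" makes quote() encode "/" too; by default "/" is left as-is.
    return quote(url, safe="")


print(encode_url_param("https://example.com"))
# -> https%3A%2F%2Fexample.com
```

The same helper applies to any endpoint in this reference that takes a `{url}` path parameter.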

Example

curl -X DELETE "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com" \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "success": true,
  "schema_maps_removed": 2,
  "files_queued_for_removal": 150
}
POST /api/process/{url}

Manually trigger processing for a site, bypassing the scheduled interval.

Example

curl -X POST "https://testing.nlweb.ai/api/process/https%3A%2F%2Fexample.com" \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "success": true,
  "message": "Processing started for https://example.com"
}

📄 Schema Files API

POST /api/sites/{url}/schema-files

Manually add a specific schema map to a site (useful if the schema map is not listed in robots.txt).

Request Body

Parameter        Type     Required   Description
schema_map_url   string   Required   Full URL to the schema_map.xml file

Example

curl -X POST "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com/schema-files" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schema_map_url": "https://example.com/custom_schema_map.xml"
  }'

Response

{
  "success": true,
  "site_url": "https://example.com",
  "schema_map_url": "https://example.com/custom_schema_map.xml",
  "files_added": 50,
  "files_queued": 50
}
DELETE /api/sites/{url}/schema-files

Remove a specific schema map and all its files from a site.

Request Body

Parameter        Type     Required   Description
schema_map_url   string   Required   URL of the schema map to remove

Example

curl -X DELETE "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com/schema-files" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schema_map_url": "https://example.com/custom_schema_map.xml"
  }'

Response

{
  "success": true,
  "deleted_count": 50,
  "files_queued_for_removal": 50
}
GET /api/sites/{url}/files

Get all files for a specific site.

Example

curl "https://testing.nlweb.ai/api/sites/https%3A%2F%2Fexample.com/files" \
  -H "X-API-Key: YOUR_API_KEY"

Response

[
  {
    "file_url": "https://example.com/schema/products/1.json",
    "schema_map": "https://example.com/schema_map.xml",
    "last_read_time": "2024-01-20T10:30:00",
    "number_of_items": 25,
    "is_manual": false,
    "is_active": true
  }
]

📁 Files API

GET /api/files

Get all files across all your sites.

Example

curl https://testing.nlweb.ai/api/files \
  -H "X-API-Key: YOUR_API_KEY"

Response

[
  {
    "site_url": "https://example.com",
    "file_url": "https://example.com/schema/products/1.json",
    "schema_map": "https://example.com/schema_map.xml",
    "is_active": true,
    "is_manual": false,
    "number_of_items": 25,
    "last_read_time": "2024-01-20T10:30:00",
    "id_count": 25
  }
]
GET /api/files/{url}/ids

Get all @id values from a specific JSON-LD file.

Example

curl "https://testing.nlweb.ai/api/files/https%3A%2F%2Fexample.com%2Fschema%2Fproducts%2F1.json/ids" \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "file_url": "https://example.com/schema/products/1.json",
  "ids": [
    "https://example.com/product/123",
    "https://example.com/product/124",
    "https://example.com/offer/456"
  ],
  "count": 3
}

📊 Monitoring API

GET /api/status

Get overall system status and statistics.

Example

curl https://testing.nlweb.ai/api/status \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "master_started_at": "2024-01-20T08:00:00",
  "master_uptime_seconds": 43200,
  "sites": [
    {
      "site_url": "https://example.com",
      "is_active": true,
      "last_processed": "2024-01-20T10:30:00",
      "total_files": 150,
      "manual_files": 0,
      "total_ids": 3750
    }
  ]
}
GET /api/queue/status

Get queue processing status (pending, processing, and failed jobs).

Example

curl https://testing.nlweb.ai/api/queue/status \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "queue_type": "servicebus",
  "pending_jobs": 42,
  "processing_jobs": 3,
  "failed_jobs": 0,
  "total_jobs": 45,
  "jobs": [
    {
      "id": "job-20240120-103045-123456",
      "status": "processing",
      "type": "process_file",
      "site": "https://example.com",
      "file_url": "https://example.com/schema/1.json",
      "queued_at": "2024-01-20T10:30:45",
      "processing_time": 15
    }
  ],
  "error": null
}
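
A client can reduce this payload to a few values, for example to decide whether to keep polling. The sketch below is one possible reading, not documented behavior. It assumes that a non-null `error` means the queue could not be inspected, and that "idle" means no pending and no processing jobs. The name `summarize_queue` is hypothetical.

```python
import json

# Sample payload copied from the response above, with the "jobs" list emptied.
SAMPLE = json.loads("""
{
  "queue_type": "servicebus",
  "pending_jobs": 42,
  "processing_jobs": 3,
  "failed_jobs": 0,
  "total_jobs": 45,
  "jobs": [],
  "error": null
}
""")


def summarize_queue(status):
    """Return (in_flight, is_idle, has_failures) for a queue status payload."""
    if status.get("error"):
        # Assumption: a non-null "error" means the counts are unreliable.
        raise RuntimeError(f"queue status unavailable: {status['error']}")
    in_flight = status["pending_jobs"] + status["processing_jobs"]
    return in_flight, in_flight == 0, status["failed_jobs"] > 0


print(summarize_queue(SAMPLE))
# -> (45, False, False)
```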
GET /api/workers

Get status of all worker pods (Kubernetes only).

Example

curl https://testing.nlweb.ai/api/workers \
  -H "X-API-Key: YOUR_API_KEY"

Response

[
  {
    "name": "crawler-worker-5d7c8f9-abc12",
    "ip": "10.244.1.5",
    "phase": "Running",
    "status": {
      "worker_id": "crawler-worker-5d7c8f9-abc12",
      "started_at": "2024-01-20T08:00:00",
      "current_job": {
        "type": "process_file",
        "file_url": "https://example.com/schema/1.json"
      },
      "total_jobs_processed": 1250,
      "total_jobs_failed": 3,
      "last_job_at": "2024-01-20T10:30:00",
      "last_job_status": "success",
      "status": "processing"
    },
    "error": null
  }
]
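
One way to use this list is to flag workers that need attention. The sketch below treats a worker as unhealthy when its pod `phase` is anything other than `Running` or when its `error` field is set. That rule is an assumption made for illustration, as are the function name and the second worker in the sample data.

```python
def find_unhealthy_workers(workers):
    """Return names of workers that are not Running or that report an error."""
    return [
        w["name"]
        for w in workers
        if w.get("phase") != "Running" or w.get("error")
    ]


workers = [
    {"name": "crawler-worker-5d7c8f9-abc12", "phase": "Running", "error": None},
    # Hypothetical second pod, still waiting to be scheduled.
    {"name": "crawler-worker-5d7c8f9-def34", "phase": "Pending", "error": None},
]
print(find_unhealthy_workers(workers))
# -> ['crawler-worker-5d7c8f9-def34']
```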

👤 User API

GET /api/me

Get your user information including your API key.

Example

curl https://testing.nlweb.ai/api/me \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "user_id": "github:12345",
  "email": "user@example.com",
  "name": "Your Name",
  "provider": "github",
  "api_key": "Abc123XyZ_YourApiKeyHere...",
  "created_at": "2024-01-15T10:30:00",
  "last_login": "2024-01-20T14:22:00"
}
Important: This is the only API endpoint that returns your API key. Keep the key secure!

❌ Error Responses

Authentication Error (401)

{
  "error": "Authentication required. Provide X-API-Key header or login via OAuth."
}

Not Found (404)

{
  "error": "User not found"
}

Bad Request (400)

{
  "error": "site_url is required"
}

Server Error (500)

{
  "error": "Internal server error message"
}
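
All of the error bodies above share a single `error` field, so a client can handle them uniformly. The following is a minimal sketch of that idea. The `ApiError` class and `parse_error` helper are hypothetical names, and the fallback for non-JSON bodies is an assumption about what a proxy or gateway might return.

```python
import json


class ApiError(Exception):
    """Hypothetical exception carrying the HTTP status and the API message."""

    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def parse_error(status, body):
    """Turn an error response body (bytes or str) into an ApiError."""
    try:
        message = json.loads(body).get("error", "unknown error")
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object; keep the raw text.
        message = body.decode("utf-8", "replace") if isinstance(body, bytes) else str(body)
    return ApiError(status, message)


print(parse_error(400, b'{"error": "site_url is required"}'))
# -> HTTP 400: site_url is required
```

With `urllib`, this would typically be called from an `except urllib.error.HTTPError as exc:` block as `parse_error(exc.code, exc.read())`.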

⚡ Rate Limits

Currently there are no rate limits, but we recommend:

📋 Schema Map XML Format

Schema maps should follow this XML format:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url contentType="structuredData/schema.org">
    <loc>https://example.com/schema/products/1.json</loc>
  </url>
  <url contentType="structuredData/schema.org">
    <loc>https://example.com/schema/products/2.json</loc>
  </url>
</urlset>
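
To check a schema map before registering it, the file can be parsed with Python's standard XML library. This sketch only illustrates the format shown above and is not the crawler's actual parser. Skipping entries with a different `contentType` is an assumption, and `extract_schema_file_urls` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
SCHEMA_ORG_TYPE = "structuredData/schema.org"

SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url contentType="structuredData/schema.org">
    <loc>https://example.com/schema/products/1.json</loc>
  </url>
  <url contentType="structuredData/schema.org">
    <loc>https://example.com/schema/products/2.json</loc>
  </url>
</urlset>
"""


def extract_schema_file_urls(xml_bytes):
    """Return the <loc> URLs of entries marked as schema.org structured data."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for url_el in root.iter(SITEMAP_NS + "url"):
        if url_el.get("contentType") != SCHEMA_ORG_TYPE:
            continue  # assumption: entries of other content types are skipped
        loc = url_el.find(SITEMAP_NS + "loc")
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
    return urls


print(extract_schema_file_urls(SAMPLE))
```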

robots.txt Discovery:

Add this line to your robots.txt so the crawler can discover the schema map:

schemaMap: https://example.com/schema_map.xml
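
To confirm that a robots.txt file exposes the directive as intended, its `schemaMap:` lines can be extracted with a few lines of Python. This is a sketch, not the crawler's own logic. Matching the directive name case-insensitively and stripping `#` comments are assumptions borrowed from common robots.txt conventions, and `find_schema_maps` is a hypothetical name.

```python
def find_schema_maps(robots_txt):
    """Return the URLs from `schemaMap:` lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop robots.txt comments
        key, sep, value = line.partition(":")  # split at the first colon only
        # Assumption: the directive name is matched case-insensitively.
        if sep and key.strip().lower() == "schemamap" and value.strip():
            urls.append(value.strip())
    return urls


ROBOTS = """User-agent: *
Disallow: /private/
schemaMap: https://example.com/schema_map.xml
"""
print(find_schema_maps(ROBOTS))
# -> ['https://example.com/schema_map.xml']
```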