
Self-Hosting: Run Your Own LLM with DiscoGen

Run your own LLM for DiscoGen on your hardware. This guide sets up a proof-of-concept inference stack on a single GPU using Docker: Ollama for inference, LiteLLM as an API proxy, and Cloudflared for tunnel access.

| | Cloud LLM (OpenAI, Anthropic) | Self-hosted |
| --- | --- | --- |
| Cost | Pay per token. Scales linearly with volume. | Fixed GPU cost. Process unlimited tokens for a flat hourly rate. |
| Data privacy | Prompts and domain data sent to third-party APIs | Everything stays on your network |
| Latency | Depends on provider load and rate limits | Dedicated GPU, no queue, no throttling |
| Model control | Limited to provider's model catalog | Run any open-source model, swap anytime |
  • GPU: NVIDIA with 96GB+ VRAM recommended (16GB minimum for smaller models). Don’t have one? Rent on-demand from Lambda (used in this guide), RunPod, Vast.ai, or CoreWeave. Most providers come with NVIDIA drivers pre-installed.
  • Docker with nvidia-container-toolkit. The setup script installs both if missing.
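
The stack described above (Ollama behind LiteLLM, exposed through a Cloudflare quick tunnel) can be sketched as a compose file like this. This is our reconstruction, not the file you download below — service names, images, and options in the real `docker-compose.yml` may differ:

```yaml
# Sketch of the three-service stack (the downloaded compose file will differ)
services:
  ollama:
    image: ollama/ollama
    volumes: ["ollama:/root/.ollama"]   # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]       # requires nvidia-container-toolkit
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}   # the API key setup.sh generates
    volumes: ["./litellm-config.yaml:/app/config.yaml"]
    ports: ["4000:4000"]
    depends_on: [ollama]
  cloudflared:
    image: cloudflare/cloudflared
    command: tunnel --no-autoupdate --url http://litellm:4000   # quick tunnel
    depends_on: [litellm]
volumes:
  ollama: {}
```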

The setup script detects your VRAM and offers these defaults:

| VRAM | Model | Notes |
| --- | --- | --- |
| 80GB+ | Llama 4 Scout | MoE, 17B active params. Best single-GPU option for structured output |
| 40-80GB | Llama 3.1 70B | Dense 70B. Reliable JSON generation, well-tested |
| 20-40GB | Qwen3 32B | Comparable to 72B models at half the VRAM |
| 8-20GB | Llama 3.1 8B | Lightweight. Good for testing the pipeline |

Or select Custom model to enter any model name from ollama.com/library.
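
The selection logic above can be sketched roughly as follows. This is our reconstruction, not the script's actual code — the function name and thresholds are illustrative:

```sh
# pick_model: map total VRAM (GB) to a default Ollama model tag,
# mirroring the tiers in the table above (hypothetical helper).
pick_model() {
  vram_gb=$1
  if   [ "$vram_gb" -ge 80 ]; then echo "llama4:scout"
  elif [ "$vram_gb" -ge 40 ]; then echo "llama3.1:70b"
  elif [ "$vram_gb" -ge 20 ]; then echo "qwen3:32b"
  else                             echo "llama3.1:8b"
  fi
}

# Detect VRAM with nvidia-smi when a GPU is present (MiB -> GB)
if command -v nvidia-smi >/dev/null 2>&1; then
  vram_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
  pick_model $(( vram_mib / 1024 ))
fi

pick_model 24   # -> qwen3:32b
```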

```sh
mkdir discolike-self-hosting && cd discolike-self-hosting
curl -LO https://api.discolike.com/v1/docs/self-hosting/docker-compose.yml
curl -LO https://api.discolike.com/v1/docs/self-hosting/setup.sh
chmod +x setup.sh && ./setup.sh
```

The script detects your GPU and lets you pick a model that fits:

Setup script detecting NVIDIA GH200 GPU and showing model selection

It then generates API keys, starts the Docker stack, pulls the model (this can take a while depending on your connection), and verifies inference end-to-end:

Inference verification, test prompt returns successfully

When complete, you get your tunnel URL, model name, and API key:

Final screen with tunnel URL, model name, and API key

  1. Go to Settings → Integrations → AI Providers → Add

  2. Fill in the form:

    | Field | Value |
    | --- | --- |
    | Provider | Bring Your Own Model |
    | Integration Name | e.g. `llama4_scout` |
    | Base URL | Your tunnel URL, without the `/v1` suffix |
    | Model Name | `openai/llama4:scout` |
    | API Key | The LiteLLM key from setup |

    Add AI Integration form with Bring Your Own Model provider, tunnel URL, and model name
  3. Click Save. DiscoLike sends a test prompt to validate the connection.
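
A common slip with the Base URL field is pasting the tunnel URL with a trailing `/v1`, which breaks validation. A tiny helper to normalize it (the function name is ours, not part of the setup):

```sh
# normalize_base_url: strip a trailing /v1 (and any trailing slash)
# so the value matches what the Base URL field expects.
normalize_base_url() {
  printf '%s\n' "$1" | sed -E 's#/v1/?$##; s#/$##'
}

normalize_base_url "https://abc.trycloudflare.com/v1"
# -> https://abc.trycloudflare.com
```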

Serper gives DiscoGen access to live web results alongside your domain data. Useful for enriching companies with recent news, funding rounds, or hiring signals.

  1. Grab an API key from serper.dev. A paid plan is recommended as the free tier has strict rate limits that will slow down DiscoGen batch processing.

  2. Go to Settings → Integrations → Search Providers → Add

    | Field | Value |
    | --- | --- |
    | Provider | Serper |
    | Integration Name | e.g. `serper` |
    | API Key | Your Serper key |
    | Search Model | `serper/search` (auto-populated) |

    Add Search Integration form with Serper provider and API key
  3. Click Save. DiscoLike validates with a test search.

Open DiscoGen and select your self-hosted model from the Model dropdown. If you added Serper, toggle Enable web search, pick a Search Depth, and select your provider under Search Source.

DiscoGen showing self-hosted model selected with Serper web search enabled

Hit Submit. DiscoGen sends each domain through your model and streams results into the table.

DiscoGen results table showing enriched company data from self-hosted Llama 4 Scout

For a persistent subdomain that survives restarts, replace the quick tunnel with a named one:

```sh
cloudflared tunnel login
cloudflared tunnel create discolike-llm
cloudflared tunnel route dns discolike-llm llm.yourdomain.com
cloudflared tunnel run --url http://localhost:4000 discolike-llm
```
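
For unattended restarts, a named tunnel is typically driven by a config file rather than CLI flags. A sketch, with the tunnel ID and credential path as placeholders you'd fill in from `cloudflared tunnel create`:

```yaml
# ~/.cloudflared/config.yml — sketch; substitute your own tunnel ID
tunnel: discolike-llm
credentials-file: /root/.cloudflared/<TUNNEL-ID>.json
ingress:
  - hostname: llm.yourdomain.com
    service: http://localhost:4000    # LiteLLM proxy
  - service: http_status:404          # catch-all rule cloudflared requires
```

With this in place, `cloudflared tunnel run discolike-llm` needs no `--url` flag.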

Then update your Base URL in the DiscoLike integration to https://llm.yourdomain.com.

```sh
docker compose up -d                       # Start everything
docker compose down                        # Stop everything
docker compose logs -f                     # Follow logs
docker compose logs cloudflared            # Get current tunnel URL
docker exec ollama ollama list             # List installed models
docker exec ollama ollama pull qwen3:32b   # Pull another model
```

Switching models? Edit litellm-config.yaml, run docker compose restart litellm, and update the Model Name in DiscoLike (with the openai/ prefix).
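
For orientation, a LiteLLM model entry looks roughly like this — a sketch, not the generated file, so keys and names may differ from what `setup.sh` wrote:

```yaml
# litellm-config.yaml (sketch): register the model the proxy serves
model_list:
  - model_name: qwen3:32b          # name clients send; DiscoLike prefixes it with openai/
    litellm_params:
      model: ollama/qwen3:32b      # route requests to the local Ollama backend
      api_base: http://ollama:11434
```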

| Symptom | Fix |
| --- | --- |
| Validation fails | Model name must start with `openai/`. Base URL must not end with `/v1` |
| Model won't load | Check `nvidia-smi`. If VRAM is full, re-run `./setup.sh` and pick a smaller model |
| LiteLLM returns 500 | `docker compose logs litellm --tail 50`. Usually Ollama is still loading |
| Tunnel unreachable | `docker compose logs cloudflared`. If the URL changed, update DiscoLike |

Quick validation test from your terminal:

```sh
curl https://your-tunnel-url/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama4:scout", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 5}'
```

If this returns a response, the stack is working. Check your DiscoLike configuration if validation still fails.