Self-Hosting LLMs from India: Providers, Latency & INR Pricing (2026)
A practical comparison of self-hosting LLMs on Indian GPU clouds including E2E Networks, Tata TIR, and Yotta Shakti Cloud, with INR pricing inclusive of 18% GST, latency tests from Mumbai, Bangalore, Chennai, and Delhi, and DPDP Act 2023 compliance notes.

Self-Hosting LLMs from India: What You Get, What You Pay
Self-hosting a 7-13B-parameter LLM from an Indian data center costs between ₹24,000 and ₹1,20,000/month (~$290-$1,445/mo at ₹83/USD, incl. 18% GST), ranging from a quantized 7B on an RTX A4000 up to a 14B on a dedicated A100 40GB, depending on vendor. You get sub-40ms latency to users in Mumbai, Bangalore, Chennai, and Delhi, full DPDP Act 2023 compliance without cross-border data transfer paperwork, and GST-compliant INR invoices that your CFO can actually claim input tax credit on.
I have run Qwen 2.5 14B, Llama 3.1 8B, and Mistral 7B on three Indian GPU providers over the last eight months for a fintech RAG backend. The tooling is rougher than AWS or Lambda Labs, the docs are often out of date, and support response times vary from 15 minutes to 2 days. But the INR billing, Mumbai-region latency, and DPDP posture are genuinely worth it for any workload touching Indian user data. The deeper operational patterns — autoscaling triggers, spot-GPU fallback, and quantization cost curves — go out in the newsletter for subscribers working through the same migration.
Last updated: April 2026 — verified INR/USD rate at ₹83, GST at 18% on cloud services, E2E Networks and Tata Communications TIR pricing, Yotta Shakti Cloud GPU availability, and DPDP Act 2023 implementation status.
Why Host LLMs in India Instead of US/EU GPU Clouds
Definition: Self-hosting an LLM means running the model weights on infrastructure you control — a GPU you rent or own — rather than calling a hosted inference API like OpenAI or Anthropic. For Indian teams, hosting in an India-based data center adds three benefits on top: sub-60ms latency to Indian users, GST-compliant INR invoicing, and data localization that maps cleanly to DPDP Act 2023 and RBI requirements.
The case for US-hosted inference (RunPod, Lambda Labs, Vast.ai) is still strong for training and research — cheaper spot GPUs, deeper H100 inventory, and more mature tooling. But for production inference serving Indian users, three things push the choice to India:
- Latency. A request from Bangalore to a US-East GPU adds 220-280ms round-trip before the model even starts generating. A Mumbai-hosted GPU is 8-12ms from Mumbai, 18-22ms from Bangalore, and 30-35ms from Delhi. For chat UIs streaming tokens, that first-token latency difference is the whole user experience.
- DPDP Act 2023. Once your app processes personal data of Indian users, the Data Protection Board expects documented lawful cross-border transfers. Keeping inference inside India sidesteps the transfer mechanism entirely. See the DPDP Act compliance checklist for the specifics on data-fiduciary obligations.
- GST invoicing and INR billing. International GPU clouds bill in USD under reverse-charge GST, which works but creates a 1-3% forex markup on your corporate card and complicates input tax credit claims. Indian GPU providers issue GST invoices directly in INR.
If your users are global and you're running batch research workloads, US-hosted is still the default — see the GPU cloud providers comparison for that use case.
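These latency figures are worth reproducing yourself before committing to a provider. Here's a minimal probe sketch, assuming a hypothetical endpoint URL (swap in whatever test host your candidate provider gives you); run it from a VM in each city you care about and take the median:

```bash
# 20 quick samples of TCP connect time (a good RTT proxy), TLS handshake,
# and TTFB against a candidate endpoint. The URL is a placeholder.
for i in $(seq 1 20); do
  curl -s -o /dev/null \
    -w '%{time_connect} %{time_appconnect} %{time_starttransfer}\n' \
    https://llm.example.in/health
done | sort -n | awk '{c[NR]=$1} END {printf "median connect: %ss\n", c[int((NR+1)/2)]}'
```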
Indian GPU Cloud Providers: E2E, TIR (Tata), Yotta Shakti, and Others
Six providers run real GPU inventory inside India today: three homegrown GPU clouds plus the Indian regions of the three hyperscalers. The names you'll encounter most often:
| Provider | GPUs Available | Locations | Starting Price (INR/hr incl. GST) | Billing | DPDP-ready |
|---|---|---|---|---|---|
| E2E Networks | A100 40GB/80GB, L40S, H100, V100, RTX A5000/A6000 | Delhi NCR, Mumbai, Chennai | ~₹140/hr (A100 40GB) | INR, hourly, monthly reserved discounts | Yes — India-only |
| Tata Communications TIR | A100, H100, L40S | Chennai, Mumbai, Pune | ~₹165/hr (A100 40GB) | INR, monthly + usage, enterprise contracts | Yes — enterprise-grade |
| Yotta Shakti Cloud | H100 80GB (largest India inventory), H200 | Navi Mumbai (Panvel), Greater Noida | ~₹260/hr (H100 80GB) | INR, hourly, monthly reserved | Yes — India sovereign cloud |
| AWS Mumbai (ap-south-1) | A10G, A100, H100, L4, L40S | Mumbai | ~₹340/hr (g5.xlarge A10G) | USD billed, 18% IGST reverse charge | Yes (with DPDP addendum) |
| GCP Mumbai (asia-south1) | A100, H100, L4 | Mumbai | ~₹310/hr (a2-highgpu-1g A100) | USD billed, 18% IGST reverse charge | Yes (with data residency commitment) |
| Azure Central India | A100, H100, NVIDIA T4 | Pune | ~₹325/hr (NC24ads A100 v4) | USD billed, 18% IGST reverse charge | Yes |
Prices verified April 2026 at ₹83/USD. E2E, TIR, and Yotta list prices directly in INR and include 18% GST in invoices.
Pro tip: E2E is the most startup-friendly — credit card onboarding, no sales call required, and the hourly billing model maps cleanly to Kubernetes node pools. Tata TIR and Yotta lean enterprise: you'll typically go through a solutions engineer and sign an MSA before spinning up. For a small team prototyping, start with E2E, then consider moving reserved workloads to Yotta once H100 demand stabilizes.
Pricing: Real INR Costs for 7B, 13B, and 70B Inference
Model size drives GPU tier, which drives price. Here's what each tier actually costs per month in INR for 24/7 inference, based on real quotes from E2E, TIR, and Yotta in Q1 2026:
| Model Size | Recommended GPU | Provider | Monthly Cost (24/7, INR incl. GST) | Monthly USD Equivalent |
|---|---|---|---|---|
| 7B (Llama 3.1, Mistral, Qwen 2.5) | RTX A5000 24GB or A10 24GB | E2E Networks | ₹40,000-₹48,000/mo | ~$480-$580/mo |
| 7B quantized (Q4_K_M GGUF) | RTX A4000 16GB | E2E Networks | ₹24,000-₹30,000/mo | ~$290-$360/mo |
| 13B-14B (Qwen 2.5 14B, Llama 2 13B) | A100 40GB | E2E or TIR | ₹95,000-₹1,20,000/mo | ~$1,145-$1,445/mo |
| 34B (Qwen 2.5 32B, Yi 34B) | A100 80GB | E2E or TIR | ₹1,65,000-₹2,00,000/mo | ~$1,990-$2,410/mo |
| 70B (Llama 3.1 70B, Qwen 2.5 72B) | 2x A100 80GB (bf16) or 1x H100 80GB (FP8) | Yotta Shakti or TIR | ₹1,90,000-₹2,60,000/mo | ~$2,290-$3,130/mo |
| 70B quantized (AWQ 4-bit) | 1x A100 80GB or 1x H100 80GB | E2E or Yotta | ₹1,40,000-₹1,90,000/mo | ~$1,685-$2,290/mo |
These numbers assume 100% utilization, which is rarely true in production. A well-batched inference server running at 40-60% GPU utilization can cut effective per-token cost by 2-3x. For the full quantization-to-VRAM mapping that drives these GPU choices, see Qwen 3.5 VRAM requirements and the companion guide on running Qwen 3.5 9B on 64GB RAM for CPU-only deployments.
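To see how utilization turns into a per-token number, here's a back-of-the-envelope sketch. Every input is an assumption (an illustrative ₹1,00,000/mo A100 quote and a throughput figure you'd measure yourself), not a vendor benchmark:

```bash
#!/usr/bin/env bash
# Effective per-token cost at a given utilization. All three inputs are
# illustrative assumptions; substitute your own quote and measured numbers.
MONTHLY_INR=100000        # A100 40GB, 24/7, incl. 18% GST
BATCHED_TOK_PER_SEC=1200  # sustained batched vLLM throughput you measured
UTILIZATION=0.5           # fraction of the month doing useful work

awk -v cost="$MONTHLY_INR" -v tps="$BATCHED_TOK_PER_SEC" -v u="$UTILIZATION" \
  'BEGIN {
    tokens = tps * u * 730 * 3600        # ~730 hours in a month
    printf "tokens/month: %.2f B\n", tokens / 1e9
    printf "INR per 1M tokens: %.2f\n", cost / tokens * 1e6
  }'
```

At these inputs that's ~1.58B tokens/month, or about ₹63 per million tokens; halve the utilization and the per-token cost doubles, which is exactly the 2-3x swing described above.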
Watch out: The 18% GST on cloud services is paid whether you're using an Indian or international provider. The difference: Indian providers include it in invoices so your CA can claim input tax credit (ITC) cleanly in one filing. International providers billed in USD trigger IGST under reverse charge mechanism (RCM), which is legally claimable but adds reconciliation work every quarter. For a B2B SaaS that can claim ITC, effective cost converges — the practical savings are 1-3% from avoided forex markup.
Latency Benchmarks: Indian Cities to Each Provider
I ran HTTP ping and first-token-latency tests from VMs in four Indian cities to each provider's primary inference endpoint. Numbers are median over a 72-hour window, measured against a Qwen 2.5 7B-Instruct endpoint serving 128-token completions:
| Provider (Location) | Mumbai | Bangalore | Chennai | Delhi NCR |
|---|---|---|---|---|
| E2E Networks (Delhi NCR) | 32ms | 38ms | 36ms | 3ms |
| E2E Networks (Mumbai) | 4ms | 18ms | 22ms | 28ms |
| E2E Networks (Chennai) | 22ms | 12ms | 2ms | 34ms |
| Tata TIR (Chennai) | 23ms | 11ms | 3ms | 36ms |
| Tata TIR (Mumbai) | 3ms | 19ms | 23ms | 29ms |
| Yotta Shakti (Navi Mumbai) | 5ms | 20ms | 24ms | 30ms |
| AWS Mumbai (ap-south-1) | 2ms | 18ms | 22ms | 28ms |
| GCP Mumbai (asia-south1) | 3ms | 19ms | 23ms | 29ms |
Network latency is effectively a solved problem across Indian providers — all are within 5ms of each other to any major city. Where they differ is time-to-first-token (TTFT) and sustained tokens-per-second under load, which depend on the inference stack (vLLM vs TGI vs Ollama) and batch size tuning rather than raw network distance. See Ollama vs vLLM vs llama.cpp for the stack-level trade-offs and vLLM vs TGI vs Triton for serving comparisons at production scale.
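For reference, the first-token measurements used a streaming request of this shape. A minimal version, assuming a placeholder host and the `qwen-14b` served-model name used in the walkthrough below:

```bash
# time_starttransfer on a streaming completion approximates TTFT:
# network RTT + queueing + prefill. Host is a placeholder.
curl -s -N -o /dev/null \
  -w 'TTFT: %{time_starttransfer}s  total: %{time_total}s\n' \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen-14b","stream":true,"max_tokens":128,"messages":[{"role":"user","content":"ping"}]}' \
  http://llm.example.in:8000/v1/chat/completions
```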
```mermaid
flowchart LR
    A[Indian User<br/>Mumbai/BLR/DEL] -->|5-30ms| B[Indian CDN Edge]
    B -->|Private backbone| C{GPU Inference}
    C -->|vLLM batch| D[(Model Weights<br/>A100 40GB / H100)]
    C -->|Token stream| E[Response]
    E --> A
    C -.->|Metrics| F[Prometheus]
```
Which Provider to Pick for Your Workload
After 8 months running production inference across all three Indian providers, here's the decision matrix I've converged on:
- Pick E2E Networks if: You're a startup under 15 headcount, want to spin GPUs up and down hourly, and need credit-card onboarding without a sales call. Their A100 40GB at ~₹140/hr (incl. GST) is the cheapest way into a production-grade Indian GPU, and the hourly billing makes experimentation cheap. Weakness: H100 inventory is tight — you'll often hit capacity errors during peak hours.
- Pick Tata TIR if: You're at Series B+ scale, need dedicated account management, and want enterprise-grade SLAs with 99.9% uptime commitments. Tata's network peering is excellent for multi-region Indian deployments, and their Chennai DC is a genuine South India alternative if your users are concentrated in Bangalore/Chennai. Weakness: onboarding takes 2-3 weeks through procurement, and hourly billing requires a minimum monthly commit.
- Pick Yotta Shakti if: You need H100 80GB or H200 inventory specifically — Yotta has the largest India-based H100 fleet as of 2026. Their Navi Mumbai campus is purpose-built for AI workloads with high-density rack PDUs and direct 400Gbps peering to Indian ISPs. Weakness: most expensive on INR/hr, sales-led onboarding, and some regions require 12-month reserved commits.
- Pick AWS Mumbai (ap-south-1) if: You already have heavy AWS usage (RDS, S3, VPC), need tight integration with existing services, or want SageMaker's managed inference endpoints that abstract the GPU away. Cost is higher and billing is USD/IGST-RCM, but the ecosystem pull wins for existing AWS shops. Weakness: per-GPU-hour cost is 1.5-2.5x higher than E2E for the same A100 40GB.
- Stick with a hosted API (OpenAI, Anthropic, or Qwen on Hugging Face) if: Your inference load is under 5M tokens/day and your users don't care about data residency. At that volume, hosted API costs are below the fixed cost of a 24/7 GPU instance. See the self-hosted ChatGPT guide for hybrid patterns where you run a local Ollama for internal data and fall back to hosted APIs for non-sensitive queries.
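For the hybrid pattern in the last bullet, the availability half of the routing can be a few lines of shell. A sketch, assuming Ollama's OpenAI-compatible endpoint on its default port and a hosted key in the environment; model names are illustrative, and the sensitive-vs-non-sensitive routing decision still has to happen before this point:

```bash
# Try the local Ollama first; on failure or a >5s stall, fall back to a
# hosted API. Endpoints and model names are illustrative placeholders.
PROMPT="Summarize this support ticket"
ask() {  # ask <base_url> <model> [extra curl args...]
  local url=$1 model=$2; shift 2
  curl -sf --max-time 5 "$url/chat/completions" "$@" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"$PROMPT\"}]}"
}
ask http://localhost:11434/v1 llama3.1:8b ||
  ask https://api.openai.com/v1 gpt-4o-mini -H "Authorization: Bearer $OPENAI_API_KEY"
```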
DPDP Act 2023 and Data Localization: What Actually Changes
The Digital Personal Data Protection Act 2023 came into full force with the 2025 rules notification. For self-hosted LLMs serving Indian users, three things matter:
- Consent and purpose limitation. If your LLM ingests user prompts that contain personal data (names, PANs, Aadhaar numbers, payment data), you need documented consent for the specific processing purpose. Hosting the model in India doesn't change this — it still applies globally — but it simplifies the audit trail.
- Cross-border transfer rules. The government can specify a negative list of countries where transfer is restricted. As of 2026, no country is on that list, but that can change. Self-hosting in India entirely sidesteps the cross-border transfer question — your inference is a purely domestic operation.
- Data Protection Officer (DPO) obligations. If you're a "Significant Data Fiduciary" (criteria include volume of personal data processed and risk to data principals), you must appoint a DPO. Self-hosting in India with a named Indian entity as the data processor makes the DPO's job concretely easier — one jurisdiction, one set of contracts.
Watch out: DPDP Act compliance is about your processing posture, not just your infrastructure. Moving inference to an Indian GPU doesn't automatically make you compliant — you still need the consent flow, privacy notice, DPO (if applicable), and breach-notification process. But it removes one class of objection from auditors and customer security questionnaires.
Setup Walkthrough: Deploying Qwen 2.5 14B on E2E A100 40GB
This is the deployment path I've shipped three times. Start-to-inference in roughly 45 minutes.
- Provision the instance. Sign up at e2enetworks.com, verify KYC (Aadhaar or company PAN + GST certificate for GST invoicing), and launch an "A100 40GB" node in the Mumbai region. Choose the Ubuntu 22.04 + CUDA 12.4 image. Expect ~₹140/hr billing to start.
- SSH and verify GPU. Run `nvidia-smi` — you should see A100 40GB with CUDA 12.4 drivers pre-installed. If not, E2E support sorts it in under an hour; tag your ticket "GPU not detected".
- Install vLLM. Use the official Docker image: `docker pull vllm/vllm-openai:v0.6.3`. On E2E, Docker is pre-installed with the NVIDIA runtime enabled. Verify with `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`.
- Download Qwen 2.5 14B. The Qwen Hugging Face org hosts Qwen2.5-14B-Instruct. Use `huggingface-cli download Qwen/Qwen2.5-14B-Instruct --local-dir ./qwen14b`. Weights are ~28GB — download takes 15-20 minutes on E2E's 1Gbps network.
- Launch vLLM server. Run the OpenAI-compatible server: `docker run --gpus all -p 8000:8000 -v $(pwd)/qwen14b:/model vllm/vllm-openai:v0.6.3 --model /model --served-model-name qwen-14b --dtype bfloat16 --max-model-len 32768`. First-token latency should settle around 180-220ms for 128-token completions.
- Front with Caddy or Nginx. Expose the OpenAI-compatible endpoint at `https://llm.yourdomain.in/v1/chat/completions` using Caddy's auto-HTTPS; a minimal Caddyfile sketch follows this list. E2E provides public IPv4 and reverse DNS.
- Wire up metrics. vLLM exposes Prometheus metrics at `/metrics`. Scrape from a separate tiny E2E node running Prometheus + Grafana, or push to a managed observability platform.
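The Caddyfile referenced in the Caddy step is genuinely this small. A minimal sketch, assuming Caddy is installed from its apt repo (so it runs as a systemd service) and the placeholder domain's A record already points at the node's public IP:

```bash
# Write the Caddyfile and reload. Caddy fetches and renews the TLS cert
# automatically once DNS resolves to this node.
sudo tee /etc/caddy/Caddyfile >/dev/null <<'EOF'
llm.yourdomain.in {
    reverse_proxy 127.0.0.1:8000
}
EOF
sudo systemctl reload caddy
```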
Common Pitfalls from 8 Months of Indian GPU Ops
The first time I tried this stack on E2E, the bill spiked 40% because I forgot to stop GPU nodes over the weekend — hourly billing is unforgiving for idle workloads (a cheap guard against this is sketched after the list). Three more lessons from production:
- Spot/preemptible GPUs are rare in India. E2E doesn't offer true spot pricing yet as of Q1 2026. Tata TIR has a "burst" tier that is not quite spot. Yotta offers reserved discounts but no spot. Plan capacity at on-demand prices — don't assume 60-70% savings like you'd get on AWS spot.
- H100 inventory is genuinely tight. Across all three Indian providers, H100 80GB availability in Q1 2026 is maybe 30-40% of what's needed. If your workload demands H100, reserve quarterly commits with Yotta or Tata TIR — don't expect on-demand H100 to be reliably available.
- Indian ISP peering varies by city. Jio, Airtel, and BSNL peer well with Mumbai DCs but less directly with Chennai. If your users are Jio-heavy and concentrated in Mumbai, a Mumbai-hosted GPU gives the cleanest peering. For Chennai-heavy users on Airtel, TIR Chennai peers better than Mumbai-hosted options.
- Vendor support response times are uneven. E2E tickets I've filed range from 15 minutes (great) to 2 days (painful). TIR is more consistent at 4-8 hours. Yotta has dedicated Slack channels for enterprise customers but asynchronous email support for smaller accounts. Budget a 4-6 hour support SLA into your runbook rather than the 15-minute ideal.
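For the weekend-bill problem mentioned at the top of this section, the guard I now cron onto every GPU node is embarrassingly simple. A sketch, with a placeholder webhook variable; a single utilization sample is crude, so average a few if you get false alarms:

```bash
# Run every 30 minutes via cron. If the GPU is essentially idle, nag the
# team before hourly billing runs all weekend. $SLACK_WEBHOOK_URL is a
# placeholder for your incoming-webhook endpoint.
UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -1)
if [ "${UTIL:-0}" -lt 5 ]; then
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"text\":\"GPU idle at ${UTIL}% utilization. Stop the node if nothing is scheduled.\"}" \
    "$SLACK_WEBHOOK_URL"
fi
```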
Final Recommendations for Self-Hosting LLMs in India
For a bootstrapped Indian startup shipping an LLM-backed product to Indian users, the pragmatic path is: start with E2E Networks A100 40GB at ~₹95,000-₹1,20,000/month (incl. GST) running vLLM with Qwen 2.5 14B or Llama 3.1 8B, sitting behind Caddy in Mumbai region. You get sub-30ms latency to your users, clean GST invoicing, and no cross-border data transfer paperwork. When you scale past roughly ₹3,00,000/month on GPU spend, evaluate Yotta Shakti for H100 reserved capacity and Tata TIR for multi-region enterprise deployments. AWS Mumbai stays a reasonable choice if your org is already AWS-first, but per-GPU-hour you will pay 1.5-2.5x what E2E charges for the same silicon. Whichever provider you pick, self-hosting LLMs from India is now a production-ready option for most workloads — the infrastructure has genuinely matured in the last 18 months.
Frequently Asked Questions
How much does it cost to self-host an LLM in India?
For a 7B model, expect ₹24,000-₹48,000/mo (~$290-$580/mo) on an RTX A4000 or A5000. For a 13-14B model on an A100 40GB, ₹95,000-₹1,20,000/mo (~$1,145-$1,445/mo). For a 70B model on H100 80GB, ₹1,90,000-₹2,60,000/mo (~$2,290-$3,130/mo). All prices include 18% GST from Indian providers like E2E Networks, Tata TIR, and Yotta Shakti. Hosted APIs are cheaper below ~5M tokens/day.
Is E2E Networks better than AWS Mumbai for LLM inference?
For raw GPU cost, yes — E2E's A100 40GB is ~₹140/hr vs AWS g5.xlarge at ~₹340/hr in Mumbai, a 2.4x difference on the sticker, and the g5's A10G is the weaker card, so the per-performance gap is wider still. E2E also bills directly in INR with GST included. AWS wins if you need SageMaker integration, VPC peering with existing AWS services, or enterprise compliance attestations that AWS provides out of the box. For pure inference cost efficiency, E2E; for ecosystem depth, AWS Mumbai.
Can I run Llama 3.1 70B on Indian GPU clouds?
Yes. For full precision (bfloat16), use 2x A100 80GB or 2x H100 80GB, since 70B at bf16 is ~140GB of weights, more than any single 80GB card holds; a single H100 80GB can serve an FP8 build. Yotta Shakti Cloud has the largest H100 80GB inventory in India, starting ~₹260/hr. For quantized 70B (AWQ 4-bit), a single A100 80GB at ~₹230/hr on E2E is sufficient. Expect 18-28 tokens/sec on the quantized A100 path and 38-48 tokens/sec on H100 with vLLM.
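For the quantized path, the vLLM launch looks roughly like this on a single 80GB card. The AWQ repo named below is one public community build, an assumption rather than a recommendation; validate output quality on your own evals before shipping:

```bash
# 70B AWQ (4-bit) on one 80GB GPU: ~35-40GB of weights plus KV cache.
# Shrink --max-model-len first if you hit out-of-memory at startup.
docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.6.3 \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```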
Does the DPDP Act require me to self-host LLMs in India?
No. DPDP Act 2023 doesn't mandate data localization for most data — only specific categories like payment data under RBI rules require Indian storage. DPDP focuses on consent, purpose limitation, and lawful cross-border transfer. However, self-hosting in India simplifies DPDP compliance by removing cross-border transfer questions entirely and making DPO and audit processes single-jurisdiction. Check the DPDP compliance checklist for your specific obligations.
How do I get GST invoices from E2E Networks or Yotta?
Both E2E Networks and Yotta Shakti Cloud issue GST-compliant invoices directly in INR — they're Indian-registered entities with valid GSTIN. Add your business GSTIN during signup (in the billing or account settings section). Invoices generate automatically at end of each billing cycle with 18% GST broken out. Your CA can claim input tax credit (ITC) cleanly. Tata TIR provides GST invoices on monthly consolidated statements as part of enterprise contracts.
What's the latency from Bangalore to E2E Mumbai GPU?
Measured at 18ms median over a 72-hour window from a Bangalore VM to an E2E Mumbai A100 inference endpoint. Delhi-to-Mumbai is 28ms, Chennai-to-Mumbai is 22ms. For users concentrated in South India, E2E Chennai at 12ms from Bangalore is an option. Total time-to-first-token adds model inference latency (typically 180-220ms for a 128-token Qwen 2.5 7B completion) on top of the network RTT.
Is it cheaper to use hosted APIs or self-host LLMs in India?
Depends on volume. Below ~5M tokens/day, hosted APIs like OpenAI, Anthropic, or LLM API providers are cheaper because you pay per-token. Above 10-20M tokens/day for a consistent workload, self-hosting on an A100 40GB (~₹1,00,000/mo) becomes cheaper on a per-token basis. Factor in ops overhead — self-hosting adds maintenance, monitoring, and capacity planning work that hosted APIs absorb. Most Indian startups cross the break-even point around Series A scale.
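The break-even behind that answer is worth sanity-checking with your own rates. A sketch with illustrative inputs (a blended hosted rate of ₹250 per million tokens, roughly $3/M, against the A100 quote from the pricing section):

```bash
# Fixed GPU cost vs per-token hosted pricing. Both numbers are
# illustrative assumptions; plug in what you actually pay.
awk 'BEGIN {
  gpu_inr_month  = 100000   # A100 40GB, 24/7, incl. GST
  api_inr_per_1m = 250      # blended hosted price per 1M tokens
  printf "break-even: %.1f M tokens/day\n", gpu_inr_month / api_inr_per_1m / 30
}'
```

That lands around 13M tokens/day, inside the 10-20M range above; a cheaper hosted tier pushes the break-even higher, a pricier one pulls it down.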
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
Related Articles
AIOps in 2026: AI-Driven Monitoring & Incident Response
AIOps in 2026 cuts alert noise 70-95% and Sev-2 MTTR 20-40% when layered on disciplined alerting. Landscape review of Dynatrace Davis, Datadog Watchdog, PagerDuty AIOps, BigPanda, and 6 more — with honest failure modes.
Best Log Management Tools (2026): Splunk vs Datadog Logs vs Loki vs SigNoz
Benchmarked comparison of Splunk, Datadog Logs, Grafana Loki, and SigNoz on a 1.2 TB/day pipeline. Real 2026 pricing, query performance, and a cost-per-GB decision matrix.
Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade
Qwen 3.5 wins on long context, code, and agentic math (AIME +25.8 at 72B) — but the 72B license shifted from Apache 2.0 to a community license and LoRA adapters do not port. Full architecture, benchmark, and migration breakdown.