Error "api.thegraph.com | 502: Bad gateway" while deploying subgraphs

Incident Report for The Graph

Update

We are continuing to monitor the changes made to address the recent IPFS deployment issues, and the system is now much more reliable. Users have reported successful deployments after far fewer retries (2-3, compared to 10-20 previously), and there have been no widespread complaints in the last few hours.

Our SRE team has recently implemented the following enhancements:

- Restarted IPFS with optimized connection settings.
- Modified IPFS endpoints to better manage traffic.
- Created new dashboards to monitor errors and connection timeouts in real-time.
- Reviewed and tweaked rules to ensure community node traffic is handled efficiently.

These changes have led to a noticeable improvement in deployment success rates. However, some users may still experience occasional connection timeouts, which we are actively addressing. We’re continuing to monitor the system closely and will make additional adjustments as needed. If you encounter any issues, please let us know.

Thank you for your patience and support!
Posted May 14, 2025 - 18:23 UTC

Monitoring

We’ve made substantial progress with the recent IPFS deployment issues, and the system is now demonstrating significantly improved reliability.

Our Site Reliability Engineering team has implemented several key enhancements, including:
- Applied targeted rules to block suspicious traffic and reduce system load.
- Upgraded IPFS Kubo on both testnet and mainnet to include critical stability improvements.
- Adjusted nginx connection limits to eliminate "Cannot assign requested address" errors, improving proxy stability.
- Resolved a misconfigured nginx caching rule that was returning incorrect IPFS hashes for different files.

These improvements have resulted in more consistent and successful IPFS deployments. We continue to actively monitor system performance and are working on further optimizations to maintain long-term stability.

Thank you for your continued patience and support.
Posted May 13, 2025 - 22:42 UTC

Update

We've upgraded our IPFS nodes to address deployment issues caused by memory limits being exceeded. This update includes fixes for resource leaks that were contributing to the problem. We've also blocked several suspicious IP addresses that may have been overloading the system. While IPFS stability has improved, the root issue is not yet fully resolved. We appreciate your continued patience as we work toward a complete fix.
Posted May 12, 2025 - 19:26 UTC

Identified

We've identified an issue in which our internal IPFS proxy's in-memory cache and aggressive retry logic are causing elevated load and intermittent timeouts. Our engineering team is working to implement improved exponential back-off in our fetch workflows and is evaluating more durable, decoupled caching solutions to ensure continued stability. We appreciate your continued patience as we work to resolve this.
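
For illustration only, the sketch below shows one way such back-off could look. The fetchWithBackoff helper, the attempt limit, and the delay bounds are assumptions made for this example, not our actual implementation.

```typescript
// Minimal sketch of exponential back-off with full jitter for an IPFS fetch.
// Helper name, attempt limit, and delay bounds are illustrative assumptions.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url); // requires Node 18+ (global fetch)
      if (res.ok) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network error or timeout
    }
    // Delay grows 1s, 2s, 4s, ... capped at 30s, with full jitter so that
    // many clients do not retry in lock-step against the proxy.
    const cap = Math.min(30_000, 1_000 * 2 ** attempt);
    await sleep(Math.random() * cap);
  }
  throw lastError;
}
```

Randomizing the delay ("full jitter") keeps clients from retrying in lock-step, which is what turns a brief hiccup into sustained load.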
Posted May 07, 2025 - 14:29 UTC

Investigating

We are currently investigating the issue. In the meantime, deployments should succeed after multiple retries.
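
As a stopgap, you can wrap the deployment in a simple retry loop. The sketch below assumes the graph-cli `graph deploy` command with the hosted-service --node and --ipfs endpoints; the subgraph name, retry count, and delay are placeholders, so adjust them to your setup and CLI version.

```typescript
// Stopgap only: retry `graph deploy` until the gateway stops returning 502s.
// Subgraph name, endpoints, retry count, and delay are placeholders.
import { spawnSync } from "node:child_process";
import { setTimeout as sleep } from "node:timers/promises";

async function deployWithRetries(maxRetries = 10): Promise<void> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const result = spawnSync(
      "graph",
      [
        "deploy",
        "--node", "https://api.thegraph.com/deploy/",
        "--ipfs", "https://api.thegraph.com/ipfs/",
        "example/my-subgraph", // placeholder subgraph name
      ],
      { stdio: "inherit" },
    );
    if (result.status === 0) {
      console.log(`Deployment succeeded on attempt ${attempt}`);
      return;
    }
    console.log(`Attempt ${attempt} failed; retrying in 10s...`);
    await sleep(10_000);
  }
  throw new Error("Deployment did not succeed within the retry budget");
}

deployWithRetries().catch((err) => {
  console.error(err);
  process.exit(1);
});
```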
Posted May 06, 2025 - 13:20 UTC
This incident affects: Upgrade Indexer - Miscellaneous (IPFS).