## Issue Summary When models are deleted from a SQLMesh project, the `OptimizedQueryCache` retains references to cache files for those deleted models. On subsequent `Context` initialization, SQLMesh attempts to `stat()` these stale cache files, resulting in a `FileNotFoundError` that prevents the Context from being created. ## Environment Details - **SQLMesh Version**: 0.227.1 - **Python Version**: 3.12 - **Deployment**: Airflow on AWS ECS with persistent volumes - **Cache Location**: `/tmp/airflow/sqlmesh_cache/my_project/.cache` - **State Backend**: PostgreSQL (shared between multiple projects) ## Error Details ```python Traceback (most recent call last): File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/context.py", line 468, in __init__ self.load() File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/context.py", line 696, in load update_model_schemas( File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/model/schema.py", line 28, in update_model_schemas optimized_query_cache: OptimizedQueryCache = OptimizedQueryCache(cache_dir) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/model/cache.py", line 97, in __init__ self._file_cache: FileCache[OptimizedQueryCacheEntry] = FileCache( ^^^^^^^^^^ File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/utils/cache.py", line 66, in __init__ if not file.stem.startswith(self._cache_version) or file.stat().st_atime < threshold: ^^^^^^^^^^^ File "/usr/local/lib/python3.12/pathlib.py", line 840, in stat return os.stat(self, follow_symlinks=follow_symlinks) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/sqlmesh/my_project/.cache/optimized_query/0_227_27_28_100__my_project_my_model_name_2986343564' ``` ## Root Cause Analysis ### The Problem 1. **Model Deletion Workflow**: - When a model is removed from the SQLMesh project, the snapshot is removed from the state database - The physical table is dropped - **However**, the corresponding `optimized_query` cache files are NOT automatically cleaned up 2. **Cache Initialization**: - During `Context.__init__()`, the `OptimizedQueryCache` is initialized - `FileCache.__init__()` scans the cache directory and calls `file.stat()` on every file it finds - This happens BEFORE any model selection or filtering logic runs 3. **Container Lifecycle**: - In environments with persistent volumes (non-ephemeral containers), the cache accumulates over time - Unlike environments using `/tmp` that gets cleared on container restart, persistent cache directories retain stale entries indefinitely ### Why Existing Solutions Don't Work 1. **`sqlmesh clean`**: Requires creating a `Context` to run, which fails with the same `FileNotFoundError` 2. **`no_auto_upstream=True`**: This flag only affects model selection during execution, not cache initialization 3. **Janitor process**: Handles database state cleanup (interval compaction), not filesystem cache ## Reproduction Steps 1. Create a SQLMesh project with a model (e.g., `my_project.my_model_name`) 2. Run the model at least once to generate cache entries 3. Remove the model from the project 4. In an environment with persistent cache (`cache_dir` configured to non-tmp location): - Try to create a new `Context`: `Context(paths=project_path)` 5. Observe `FileNotFoundError` on stale cache file ## Impact ### Affected Scenarios - **CI/CD Pipelines**: Failed deployments when models are removed - **Development**: Developers unable to run SQLMesh locally after pulling changes that remove models - **Production**: Airflow DAGs failing when models are deprecated - **Shared State Databases**: When multiple projects share a state DB but have separate cache directories, cross-project model references can cause issues ### Severity **High** - Prevents Context initialization entirely, blocking all SQLMesh operations until manual intervention. ## Current Workaround Manual deletion of the `optimized_query` cache directory before Context initialization: ```python import shutil from pathlib import Path # Clear stale cache files to prevent FileNotFoundError during Context initialization. # This happens when models are removed but cache references still exist. # Manual deletion is required because `sqlmesh clean` itself needs Context initialization. cache_path = Path(SQLMESH_CACHE_DIR) if cache_path.exists(): optimized_query_cache = cache_path / "optimized_query" if optimized_query_cache.exists(): shutil.rmtree(optimized_query_cache) # Now safe to create Context context = Context(paths=SQLMESH_PROJECT_PATH) ``` ### Limitations of Workaround - Requires code changes in every place Context is created - Loses all cached query optimizations on every run - Not discoverable - developers hit the error before finding the solution ## Suggested Fixes ### Option 1: Graceful Degradation (Recommended) Modify `FileCache.__init__()` to handle missing files gracefully: ```python # In sqlmesh/utils/cache.py for file in cache_dir.glob(f"{self._cache_version}*"): try: stat_result = file.stat() if stat_result.st_atime < threshold: file.unlink(missing_ok=True) except FileNotFoundError: # File was deleted between glob and stat - this is expected for stale entries continue ``` **Pros**: - No breaking changes - Handles race conditions naturally - Self-healing behavior **Cons**: - Leaves stale files until next access ### Option 2: Automatic Cleanup on Model Deletion Hook into the model deletion workflow to clean up associated cache files: ```python # When a snapshot is invalidated/removed def _cleanup_model_cache(self, model_name: str) -> None: """Remove cache entries for a deleted model.""" if self.config.cache_dir: cache_pattern = f"*__{model_name.replace('.', '_')}*" for cache_file in self.config.cache_dir.glob(cache_pattern): cache_file.unlink(missing_ok=True) ``` **Pros**: - Proactive cleanup - No stale files accumulate **Cons**: - More complex implementation - Need to track model name → cache file mapping ### Option 3: Validate Cache on Context Load Add a validation step during Context initialization that removes invalid cache entries: ```python def validate_and_clean_cache(cache_dir: Path, valid_models: set[str]) -> None: """Remove cache entries for models that no longer exist.""" for cache_file in cache_dir.glob("*"): model_name = extract_model_name(cache_file.name) if model_name not in valid_models: cache_file.unlink(missing_ok=True) ``` **Pros**: - Ensures cache consistency - Opportunity for other validation checks **Cons**: - Adds overhead to Context initialization - Requires model discovery before cache is usable ## Related Issues - Issue #5187: Custom materialization type cache clearing workaround - Shared state database scenarios where multiple projects can create cross-project cache contamination (see Additional Context section below) ## Additional Context ### Multi-Project Shared State Database When multiple SQLMesh projects share a PostgreSQL state database but maintain separate cache directories, the following can occur: 1. Project A and Project B share state DB 2. During `plan()`, the virtual layer update promotes ALL snapshots in the environment (including other project's models) 3. If Project A's cache directory is under the project path (e.g., `/opt/airflow/sqlmesh/project_a/.cache`), it can accumulate references to Project B's models 4. When Project B removes a model, Project A's cache still has references to it This issue can be mitigated with isolated `/tmp` cache directories per project: ```python SQLMESH_CACHE_DIR = f"/tmp/airflow/sqlmesh_cache/{project_name}/.cache" ``` However, the core issue remains: cache entries are not tied to model lifecycle. ## References - **Error Location**: `sqlmesh/utils/cache.py:66` in `FileCache.__init__()` - **Related Issue**: #5187 (Custom materialization type cache clearing workaround)