# Metrics Update Process

## Overview
Drill4J uses an ETL (Extract, Transform, Load) process to transform raw data collected by agents into actionable metrics. The system maintains two separate database schemas:
- `raw_data`: stores data sent by agents in its original format, without any processing.
- `metrics`: contains processed data that powers the dashboards and API responses. This schema is populated and maintained by the ETL pipeline.
The ETL process runs automatically on a schedule and can also be triggered manually.
It reads from raw_data, performs necessary transformations and calculations, and updates the metrics schema.
This architecture allows for data reprocessing if needed and separates data collection from data analysis concerns.
## Scheduled Run
The ETL process runs automatically as a cron job.
The schedule is controlled by the `DRILL_SCHEDULER_ETL_JOB_CRON` environment variable passed to the Drill4J Backend.
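For illustration, the variable might be set like this in the Backend's environment (a sketch: the shell-export form and the schedule shown are assumptions; verify the cron expression format expected by your Drill4J version):

```sh
# Run the ETL job every 15 minutes (standard five-field cron shown;
# confirm the expected cron format for your Drill4J version).
export DRILL_SCHEDULER_ETL_JOB_CRON="*/15 * * * *"
```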
Note: Applying changes to the environment variable requires restarting the Drill4J Backend instance for the new schedule to take effect.
Best Practice: Adjust the ETL schedule frequency so that job execution finishes before the next run time. Although the job won't start a new run until the previous one finishes, we recommend leaving extra buffer time.
## On-Demand Run
You can manually trigger the ETL process without waiting for the scheduled execution. The ETL process supports two modes of operation:
### Incremental Updates
By default, the ETL process performs incremental updates, processing only new data from raw_data that hasn't been transformed yet. This is the most efficient approach for day-to-day operations.
When to use:
- Regular scheduled updates
- Quick catch-up after a brief period of downtime
- Before querying impacted tests/methods API endpoints
API Request:
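A minimal sketch of the call, assuming the Backend is reachable at `http://localhost:8090`; the route is a placeholder, so substitute the metrics refresh endpoint from your API reference:

```sh
# Trigger an incremental ETL run (processes only new raw_data).
# Host and path are placeholders -- check your Drill4J API reference.
curl -X POST "http://localhost:8090/api/metrics/refresh"
```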
### Complete Restart
A complete restart clears all data from the metrics schema and reprocesses everything from scratch based on available data in raw_data.
When to use:
- When settings affecting the metrics have changed (e.g., the metrics period was expanded, or rules for ignoring methods and classes were modified)
- When retrospective changes have occurred
- When data integrity issues are suspected
- After schema migrations or updates
API Request:
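As above, a sketch with a placeholder host and route; the `reset=true` query parameter is the flag this guide references for complete restarts:

```sh
# Clear the metrics schema and reprocess everything from raw_data.
# Host and path are placeholders -- check your Drill4J API reference.
curl -X POST "http://localhost:8090/api/metrics/refresh?reset=true"
```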
Warning: This operation is resource-intensive and can take a long time to complete (from hours to days, depending on the amount of data).
Best Practice: Schedule complete restarts during maintenance windows when metrics access is not critical.
## Data Retention and Cleanup
Drill4J automatically manages data retention for both raw_data and metrics schemas using dedicated cleanup jobs.
This prevents unlimited data growth and maintains optimal database performance.
### Configuring Retention Periods
Each agent group has its own retention settings that control how long data is kept in each schema.
Viewing Current Settings:
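A sketch using a placeholder host, route, and group ID (`my-group`); consult your API reference for the exact settings endpoint:

```sh
# Fetch the retention settings for an agent group.
# Host, path, and group ID are placeholders.
curl "http://localhost:8090/api/groups/my-group/settings"
```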
Example Response:
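A hypothetical payload illustrating the two settings discussed in this guide (a real response may contain additional fields):

```json
{
  "retentionPeriodDays": 60,
  "metricsPeriodDays": 30
}
```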
Updating Retention Settings:
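A sketch of the update call with the same placeholder host and route; note that the full payload is sent, per the warning below:

```sh
# Overwrite the group's retention settings (the payload is not merged,
# so every parameter must be present). Host and path are placeholders.
curl -X PUT "http://localhost:8090/api/groups/my-group/settings" \
  -H "Content-Type: application/json" \
  -d '{"retentionPeriodDays": 60, "metricsPeriodDays": 30}'
```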
Warning: The request body must contain all parameters with appropriate values. The payload is not merged; it overwrites all existing settings.
### Configuring Cleanup Schedule
The cleanup jobs run on a schedule controlled by the `DRILL_SCHEDULER_DATA_RETENTION_JOB_CRON` environment variable.
Example Configuration:
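A sketch in shell-export form, matching the documented default of a daily run at 01:00:

```sh
# Run the cleanup job daily at 01:00 (standard five-field cron shown;
# confirm the expected cron format for your Drill4J version).
export DRILL_SCHEDULER_DATA_RETENTION_JOB_CRON="0 1 * * *"
```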
By default, the cleanup job is set to run daily at 01:00.
Note: Changes to the environment variable require restarting the Drill4J Backend instance.
Best Practice:
The retentionPeriodDays for raw_data should be greater than or equal to the metricsPeriodDays. This ensures that all data required for a complete ETL reprocessing is available, allowing safe full metric recalculation if needed.
## Fine-Tuning Performance
The ETL pipeline can be tuned for optimal performance based on your infrastructure and data volume. These parameters control memory usage, database interaction, and throughput.
### Buffer Size
- Environment Variable: `DRILL_ETL_BUFFER_SIZE`
- Purpose: Size of the in-memory buffer between the data extractor and loaders
- Behavior: Prevents unbounded memory growth. When the buffer is full, the extractor suspends, giving loaders time to process
- Impact: Affects throughput and memory usage
- Default: 2000
- Tuning Guidance:
  - Increase for faster processing if memory allows (4000-8000)
  - Decrease if experiencing memory pressure (500-1000)
### Fetch Size
- Environment Variable: `DRILL_ETL_FETCH_SIZE`
- Purpose: JDBC fetch size hint for SQL queries used by the data extractor
- Behavior: Determines how many rows are fetched from the database per round trip
- Impact: Network latency and database load
- Default: 2000
- Tuning Guidance:
  - Increase for better throughput on fast networks (5000-10000)
  - Decrease for slower networks or smaller result sets (500-1000)
### Batch Size
- Environment Variable: `DRILL_ETL_BATCH_SIZE`
- Purpose: Number of items grouped into a single write batch/transaction used by the data loaders
- Behavior: Controls commit frequency and transaction size
- Impact: Write performance and transaction overhead
- Default: 1000
- Tuning Guidance:
  - Increase for better write performance (2000-5000)
  - Decrease to reduce transaction lock time (100-500)
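As a combined illustration, the three parameters might be set together like this (the values are examples, not recommendations; tune against your own workload and memory budget):

```sh
# Larger buffer and fetch sizes for throughput, a moderate batch size
# to limit transaction lock time. Defaults: 2000 / 2000 / 1000.
export DRILL_ETL_BUFFER_SIZE=4000
export DRILL_ETL_FETCH_SIZE=5000
export DRILL_ETL_BATCH_SIZE=2000
```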
Note: Applying changes to these environment variables requires restarting the Drill4J Backend instance for the new values to take effect.
## Tracking and Monitoring
The ETL process provides comprehensive logging to help you monitor execution, troubleshoot issues, and optimize performance.
### Logging Levels
The ETL logging supports multiple levels:
- INFO: Logs only ETL start and completion events.
- DEBUG: In addition to INFO, logs when each extractor and loader starts and finishes.
- TRACE: In addition to DEBUG, logs every batch commit during loading.
### Tracking Progress
You can track the real-time progress of ETL executions by calling the Metrics Refresh Status API:
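A sketch with a placeholder host and route; substitute the Metrics Refresh Status endpoint from your API reference:

```sh
# Poll the current ETL refresh status.
# Host and path are placeholders -- check your Drill4J API reference.
curl "http://localhost:8090/api/metrics/refresh-status"
```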
Example Response:
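A hypothetical response shape, using one of the status values listed below (a real response may include additional detail):

```json
{
  "status": "LOADING"
}
```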
ETL Statuses:
- `EXTRACTING`: ETL process is extracting data from the `raw_data` schema.
- `LOADING`: ETL is actively loading data.
- `SUCCESS`: ETL completed successfully.
- `FAILED`: ETL run ended with an error. Check the Drill4J Admin container logs for details. The ETL will retry on the next scheduled run, but it will continue to fail until the underlying issue is fixed.
## Troubleshooting

### ETL Pipeline Fails
Symptoms:
- ETL log shows errors
- Manual refresh API returns errors
- Metrics not updating
Solutions:
Check Database Connectivity:
- Verify database connection credentials
- Test database accessibility from the Backend instance
- Check firewall rules and network connectivity
Verify Schema Existence:
- Confirm that the raw_data and metrics schemas exist
- Check that the database user has the necessary permissions (SELECT, INSERT, UPDATE, DELETE)
### ETL Running Slowly
Symptoms:
- ETL process execution time keeps increasing
- Data processing delay continues to grow
- Metrics stop reflecting the most recent data
Solutions:
Data Volume:
- Review retention settings - older data may not be needed
- Review methods ignore rules - consider excluding unneeded classes and methods
Database Performance:
- Consider increasing database resources (CPU, memory, IOPS)
- Consider database maintenance (VACUUM, ANALYZE, etc.)
Review Performance Parameters:
- Consider increasing `bufferSize`, `fetchSize`, or `batchSize`
- Monitor memory usage when adjusting parameters
### Metrics Data Inconsistency
Symptoms:
- Dashboard showing unexpected values
- API results don't match raw data
Solutions:
Perform Complete Refresh:
- Use the `reset=true` API call to reprocess all data
Check Data Retention:
- Review the `retentionPeriodDays` and `metricsPeriodDays` settings
- Verify that cleanup jobs haven't removed needed data
Investigate Errors:
- Review ETL logs for failure counts
- Review ETL Metadata table for error messages