Compliance Data Synthesis: From Raw Records to RegTech Edge

Every compliance team we've worked with has faced the same frustration: the data is there, but it's not telling a coherent story. Transaction logs sit in one database, customer onboarding documents in another, and audit trail entries are scattered across spreadsheets that no one remembers how to update. Raw records are abundant, but insight is scarce. The gap between having data and being able to act on it is exactly where compliance data synthesis comes in—a structured approach to transforming messy, heterogeneous records into a unified, analysis-ready foundation for RegTech tools.

This guide is for compliance analysts, data engineers, and risk managers who already understand the basics of regulatory reporting and are looking to move beyond manual aggregation. We'll assume you've dealt with inconsistent date formats, missing fields, and conflicting identifiers. Our goal is to show you how to build a synthesis pipeline that turns raw records into a genuine edge—not just for passing audits, but for detecting patterns that matter before they become problems.

Why Synthesis Matters Now: The Cost of Raw Data

The regulatory environment has shifted. Regulators expect faster, more granular reporting, and they're using their own analytics to spot anomalies. A bank that takes weeks to reconcile its transaction data is already behind. The stakes are higher because the volume of data is growing—AML alerts, KYC updates, trade surveillance logs—and the signal-to-noise ratio is dropping. Without deliberate synthesis, teams drown in false positives or, worse, miss real risks because the relevant signals are buried across silos.

Consider a typical mid-sized financial institution. It runs a legacy core banking system, a separate CRM for customer onboarding, and a third-party AML monitoring tool. Each system exports data in different formats: CSV exports from the core system, JSON from the CRM, and a proprietary XML from the AML tool. The compliance team spends 40% of its time just mapping fields and reconciling discrepancies. That time could be spent investigating real alerts. Synthesis is not a nice-to-have; it's the only way to keep pace with both regulatory demands and internal risk appetite.

We've seen teams that try to skip synthesis by feeding raw data directly into machine learning models. The results are predictable: garbage in, garbage out. Models trained on inconsistent data amplify biases, miss rare events, and generate false positives that erode trust. Synthesis is the prerequisite for any advanced analytics. It's the step that turns data from a liability into an asset.

The Hidden Costs of Fragmented Data

Beyond the obvious time sink, fragmented data creates blind spots. A customer might be flagged as high-risk in one system but appear clean in another because the linking identifier (e.g., tax ID vs. internal customer number) doesn't match. Synthesis forces a common identity resolution. It also surfaces data quality issues early—missing fields, outliers, or impossible values—before they cascade into reporting errors. Every compliance team has a story about a regulatory filing that was rejected due to a format mismatch. Synthesis prevents that by enforcing a canonical schema at the ingestion layer.

Why Now, Not Later

Two trends are accelerating the need for synthesis. First, regulators are adopting RegTech themselves, using AI to analyze submissions. A regulator that can spot inconsistencies across thousands of filings will penalize institutions with sloppy data. Second, the cost of storage and compute has dropped, making it feasible to keep and process more raw data. But volume without structure is noise. Synthesis is the filter that turns volume into actionable intelligence. Teams that invest in synthesis now will have a head start when regulators demand real-time data feeds.

The Core Mechanism: From Raw to Unified

At its simplest, compliance data synthesis is a three-step process: ingest, normalize, enrich. But the devil is in the details. Ingestion connects to each source system and pulls data via API, batch export, or manual upload. Normalization maps each source's fields to a common schema: dates become ISO 8601, currencies become three-letter ISO 4217 codes (USD, EUR), and customer names are split into structured first-name and last-name fields. Enrichment adds context: linking transactions to customer profiles, calculating rolling averages, or tagging entities with risk scores.

The key insight is that synthesis is not just about merging data; it's about building a coherent representation of the underlying reality. A transaction is not just a row of numbers; it's an event involving a customer, a counterparty, a product, and a context. Synthesis reconstructs that event by joining data from multiple sources. For example, a wire transfer from the core system might need the customer's KYC status from the CRM and the counterparty's sanction status from a screening tool. Only when all three are combined can you assess whether the transaction is suspicious.

Schema Design: The Foundation

The most critical decision in synthesis is choosing the canonical schema. Too narrow, and you'll lose important nuance. Too broad, and you'll spend forever populating optional fields. A good approach is to start with the reporting requirements (e.g., what fields are needed for SAR filings?) and then add fields that support internal risk models. We recommend a schema that includes a unique event ID, timestamps with timezone, source system identifier, and a flexible key-value store for source-specific fields that don't fit the core schema.

One common mistake is to design the schema around the easiest source system, forcing all others to conform. This creates a 'lowest common denominator' dataset that loses critical details from richer sources. Instead, design the schema to capture the union of all fields that matter, with clear rules for how to handle missing data (null, default value, or flag). For example, if one system has a 'transaction purpose' field and another doesn't, the schema should include it, and the missing values should be flagged as 'unavailable' rather than assumed to be empty.
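
To make the schema recommendations concrete, here is a minimal sketch of a canonical record in Python. The class name, the field names, and the `UNAVAILABLE` marker are illustrative assumptions, not a prescribed standard; the point is only to show a generated event ID, a timezone-aware timestamp, a source identifier, a key-value store for source-specific fields, and an explicit flag for data a source does not carry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional
import uuid

# Assumed marker: "this source does not provide the field", distinct from empty.
UNAVAILABLE = "unavailable"


@dataclass
class CanonicalEvent:
    source_system: str                      # which system produced the record
    occurred_at: datetime                   # must be timezone-aware
    amount: Optional[str] = None            # kept as text until parsed downstream
    currency: Optional[str] = None
    transaction_purpose: str = UNAVAILABLE  # flagged, not silently blank
    extras: dict[str, Any] = field(default_factory=dict)  # source-specific fields
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        # Reject naive timestamps so every record carries an explicit offset.
        if self.occurred_at.tzinfo is None or self.occurred_at.utcoffset() is None:
            raise ValueError("occurred_at must carry a timezone")


event = CanonicalEvent(
    source_system="core_banking",
    occurred_at=datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
    amount="9500.00",
    currency="USD",
    extras={"branch_code": "0042"},  # hypothetical field with no core-schema slot
)
```

Because the core system in this sketch has no purpose field, `event.transaction_purpose` stays at `"unavailable"`, which downstream users can tell apart from a purpose that was recorded as blank.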

Normalization Rules: The Engine

Normalization is where most of the work happens. It's not just about formatting; it's about resolving conflicts. When two sources report different values for the same field (e.g., customer address), which one do you trust? The answer depends on source reliability and timeliness. A good practice is to assign a confidence score to each source and use it to resolve conflicts. For example, the CRM might be the authoritative source for customer name, while the transaction system is authoritative for transaction amount. These rules should be explicit and documented, so that downstream users understand the lineage.
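
One way to keep those conflict rules explicit is to store them as data rather than burying them in code. The sketch below assumes a hypothetical per-field authority table (the scores are made up for illustration) and ignores timeliness for brevity; it returns the winning value together with its source, so lineage travels with the result.

```python
# Hypothetical authority scores per field and source; higher wins.
FIELD_AUTHORITY = {
    "customer_name": {"crm": 0.9, "core_banking": 0.6},
    "amount": {"core_banking": 0.95, "crm": 0.3},
}
DEFAULT_CONFIDENCE = 0.5  # assumed score for sources without an explicit rule


def resolve_field(field_name, candidates):
    """Pick one value from a list of (source, value) pairs.

    Returns (value, source, confidence). Missing values (None) never win.
    """
    scores = FIELD_AUTHORITY.get(field_name, {})
    present = [(src, val) for src, val in candidates if val is not None]
    if not present:
        return None, None, 0.0
    source, value = max(
        present, key=lambda c: scores.get(c[0], DEFAULT_CONFIDENCE)
    )
    return value, source, scores.get(source, DEFAULT_CONFIDENCE)


name, name_source, name_confidence = resolve_field(
    "customer_name",
    [("core_banking", "ACME CORP"), ("crm", "Acme Corporation")],
)
```

Here the CRM value wins for the customer name, while the same call for `amount` would favor the core banking system.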

Another normalization challenge is handling synonyms and abbreviations. A customer might be listed as 'Acme Corp.' in one system and 'ACME Corporation' in another. Simple string matching won't work. A synthesis pipeline should include a fuzzy matching step, using techniques like Levenshtein distance or phonetic encoding, to link records. But fuzzy matching introduces false positives, so it's essential to have a manual review queue for uncertain matches.
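
As a rough illustration of the fuzzy matching step, the sketch below combines a small abbreviation table with a plain-Python Levenshtein distance and routes uncertain scores to review. The abbreviation list and both thresholds are assumptions chosen for the example; a production pipeline would tune them against labeled matches.

```python
import re

# Assumed abbreviation table; extend it from your own data.
ABBREVIATIONS = {"corp": "corporation", "inc": "incorporated", "ltd": "limited"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1], where 1.0 means identical after normalization."""
    a, b = normalize_name(a), normalize_name(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def classify(a: str, b: str, auto_link: float = 0.95, review: float = 0.80) -> str:
    """Link confident matches, queue uncertain ones, reject the rest."""
    score = similarity(a, b)
    if score >= auto_link:
        return "link"
    if score >= review:
        return "review"
    return "no_match"
```

With these settings, `classify("Acme Corp.", "ACME Corporation")` links automatically because the two names normalize to the same string, while a one-character typo in a long name lands in the review queue.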

How It Works Under the Hood: A Practical Pipeline

Let's walk through a concrete pipeline architecture that we've seen work in production. The pipeline is event-driven, processing records as they arrive from source systems. The core components are an ingestion layer, a transformation engine, a storage layer, and an API for downstream consumption.

The ingestion layer listens for changes in source systems—either via webhooks, polling, or scheduled batch jobs. Each source has a connector that translates its native format into a standardized event envelope (source, event type, timestamp, payload). The payload is still raw at this point; no normalization has happened. The event envelope is published to a message queue (e.g., Kafka) for decoupling.
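
The envelope described above might look like the following sketch. The class and function names are hypothetical, and the actual publish call is left out because it depends on the queue client you use; the sketch stops at the serialized bytes a producer would send.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EventEnvelope:
    source: str      # which connector produced the event
    event_type: str  # e.g. "transaction.created" (assumed naming)
    timestamp: str   # ISO 8601 time at which the connector saw the record
    payload: dict    # raw source fields, deliberately not normalized yet


def wrap(source, event_type, payload, now=None):
    """Wrap a raw source record without touching its contents."""
    now = now or datetime.now(timezone.utc)
    return EventEnvelope(source, event_type, now.isoformat(), payload)


def serialize(envelope: EventEnvelope) -> bytes:
    """Produce the bytes a queue producer would publish."""
    return json.dumps(asdict(envelope), sort_keys=True).encode("utf-8")


raw_row = {"Date": "03/05/2024", "Amt": "9,500.00", "Ccy": "US$"}
envelope = wrap(
    "core_banking",
    "transaction.created",
    raw_row,
    now=datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
)
message = serialize(envelope)
```

Keeping the payload raw at this stage means a normalization bug can be fixed and the original events replayed, without going back to the source system.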

The transformation engine consumes events from the queue and applies a series of transformation functions. Each function is a small, testable unit that handles one aspect of normalization—date parsing, currency conversion, ID resolution, etc. Functions are chained in a DAG (directed acyclic graph) so that the output of one function feeds into the next. This modular design makes it easy to add or update rules without rebuilding the entire pipeline.
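
A small sketch of that chaining, using the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to derive the execution order from declared dependencies. The three step functions and their names are placeholders invented for the example; real pipelines would register many more.

```python
from decimal import Decimal
from graphlib import TopologicalSorter


def trim_fields(rec):
    """Strip surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}


def parse_amount(rec):
    """Turn a formatted amount such as '12,500.00' into a Decimal."""
    return {**rec, "amount": Decimal(rec["amount"].replace(",", ""))}


def tag_large_transfer(rec):
    """Add a boolean flag; relies on the amount already being a Decimal."""
    return {**rec, "is_large": rec["amount"] >= Decimal("10000")}


STEPS = {
    "trim_fields": trim_fields,
    "parse_amount": parse_amount,
    "tag_large_transfer": tag_large_transfer,
}

# Each step maps to the set of steps that must run before it.
DEPENDS_ON = {
    "trim_fields": set(),
    "parse_amount": {"trim_fields"},
    "tag_large_transfer": {"parse_amount"},
}


def run_pipeline(record):
    """Apply every step in an order that respects the declared dependencies."""
    out = dict(record)
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        out = STEPS[name](out)
    return out


result = run_pipeline({"amount": " 12,500.00 ", "currency": "USD "})
```

Adding a rule then means writing one function and one dependency entry; `TopologicalSorter` raises `CycleError` if a new entry introduces a circular dependency.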

Identity Resolution: The Hardest Part

The most complex transformation is identity resolution: determining that two records from different sources refer to the same real-world entity. This is a classic data matching problem. We've found that a two-pass approach works well. The first pass uses deterministic rules (exact match on tax ID, or match on name + date of birth). The second pass uses probabilistic matching (weighted scoring on multiple fields) to catch records that don't have a perfect match but are likely the same. The probabilistic pass generates a match score; records that score above a threshold are linked automatically, and those that fall below it are sent to a human review queue.
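
The two passes can be sketched as follows. The field weights and the auto-link threshold are assumptions picked for the example (integer points out of 100, to avoid floating-point surprises); in practice they would be calibrated against reviewed matches.

```python
# Assumed weights in points; they sum to 100.
WEIGHTS = {"name": 40, "address": 25, "phone": 20, "incorporation_date": 15}
AUTO_LINK_THRESHOLD = 55  # assumed cut-off for automatic linking


def _same(a, b, field):
    """True only when both records carry the field and the values agree."""
    return a.get(field) is not None and a.get(field) == b.get(field)


def deterministic_match(a, b):
    """Pass 1: exact tax ID, or exact name plus date of birth."""
    return _same(a, b, "tax_id") or (
        _same(a, b, "name") and _same(a, b, "date_of_birth")
    )


def match_score(a, b):
    """Pass 2: sum the weights of every field on which the records agree."""
    return sum(w for f, w in WEIGHTS.items() if _same(a, b, f))


def resolve(a, b):
    """Return the linking decision together with how it was reached."""
    if deterministic_match(a, b):
        return {"decision": "link", "method": "deterministic", "score": 100}
    score = match_score(a, b)
    decision = "link" if score >= AUTO_LINK_THRESHOLD else "review"
    return {"decision": decision, "method": "probabilistic", "score": score}


crm_record = {"tax_id": None, "name": "Acme Holdings", "address": "1 Main St",
              "phone": "555-0100", "incorporation_date": "1999-04-01"}
core_record = {"tax_id": None, "name": "Acme Corporation", "address": "1 Main St",
               "phone": "555-0100", "incorporation_date": "1999-04-01"}
outcome = resolve(crm_record, core_record)
```

In this made-up pair the names differ and no tax ID is available, so the deterministic pass fails, but address, phone, and incorporation date agree for a score of 60, which clears the assumed threshold.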

One team we worked with had a customer who changed their legal name after a merger. The deterministic pass missed the link, but the probabilistic pass gave a high score because the address, phone number, and incorporation date all matched. The human reviewer confirmed the link, and the rule was added to the deterministic set. As confirmed matches like this are folded back into the deterministic rules, fewer records need manual review.

Enrichment and Scoring

After normalization and identity resolution, the unified record is enriched with additional context. This might include adding a risk score from a model, flagging transactions that exceed a threshold, or attaching recent news sentiment about the counterparty. Enrichment is usually done by calling external services or running batch computations. The enriched record is then stored in a data warehouse (e.g., Snowflake or Redshift) with a schema optimized for analytical queries.

The final step is making the synthesized data available to RegTech applications via a REST API or direct database access. The API should support filtering, aggregation, and export in standard formats (JSON, CSV, XML). It's crucial to include provenance metadata with each field, so downstream users know where the data came from and how it was transformed. This transparency builds trust and helps debug issues.

Worked Example: Transaction Monitoring in a Mid-Sized Bank

Let's ground the concepts with a composite scenario. A mid-sized bank, let's call it 'Meridian Bank', operates in three countries and uses a legacy core system, a Salesforce CRM for commercial clients, and a third-party AML screening tool. The compliance team wants to build a unified view of all wire transfers so it can detect structuring: multiple smaller transfers, each kept under the $10,000 reporting threshold, that together exceed it.

The raw data comes in three streams: daily CSV exports from the core system (containing transaction date, amount, currency, sender account, receiver account), API calls to Salesforce (customer name, industry, risk rating, KYC status), and XML files from the AML tool (sanction matches, flags). The first step is to build connectors for each source. For the CSV exports, a Python script parses the file and publishes each row as a Kafka event. For Salesforce, a scheduled job pulls all customers updated in the last hour. For the AML tool, a webhook sends alerts in real time.

Next, the transformation engine normalizes the data. The core system uses 'MM/DD/YYYY' dates; the pipeline converts them to ISO 8601. The currency field uses three-letter codes, but some rows have 'US$' instead of 'USD'—a regex handles that. The sender and receiver accounts are alphanumeric strings; the pipeline looks up the account in a reference table to get the customer ID from Salesforce. This is where identity resolution kicks in: the account number in the core system might not match the account ID in Salesforce. A deterministic rule maps account numbers to customer IDs using a cross-reference table that was manually built during onboarding. For new accounts, a probabilistic match on account holder name and address is used.
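
The normalization rules in this step could be sketched like this. Note two liberties: the 'US$' fix uses a lookup table with a regex only for validation, rather than a regex substitution, and the cross-reference table, column names, and customer ID format are hypothetical stand-ins.

```python
import re
from datetime import datetime

CURRENCY_ALIASES = {"US$": "USD"}                 # the one alias named in the text
ACCOUNT_TO_CUSTOMER = {"AC00012345": "CUST-001"}  # hypothetical cross-reference


def to_iso_date(value: str) -> str:
    """Convert the core system's MM/DD/YYYY into an ISO 8601 date."""
    return datetime.strptime(value.strip(), "%m/%d/%Y").date().isoformat()


def normalize_currency(value: str) -> str:
    """Map known aliases, then insist on a three-letter code."""
    cleaned = value.strip().upper()
    cleaned = CURRENCY_ALIASES.get(cleaned, cleaned)
    if not re.fullmatch(r"[A-Z]{3}", cleaned):
        raise ValueError(f"unrecognized currency: {value!r}")
    return cleaned


def normalize_row(row: dict) -> dict:
    """Normalize one CSV row; customer_id is None when no mapping exists."""
    sender = row["sender_account"].strip()
    return {
        "transaction_date": to_iso_date(row["transaction_date"]),
        "currency": normalize_currency(row["currency"]),
        "sender_account": sender,
        # None signals that the probabilistic match must take over.
        "customer_id": ACCOUNT_TO_CUSTOMER.get(sender),
    }


normalized = normalize_row({
    "transaction_date": "03/05/2024",
    "currency": "US$",
    "sender_account": "AC00012345",
})
```

Raising on an unknown currency, instead of passing it through, keeps bad values out of the unified record and makes them visible at the point where they enter.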

Once the transaction is linked to a customer, enrichment adds the customer's risk rating and KYC status. If the customer is rated high-risk, the transaction is flagged for review. The pipeline also calculates a rolling 7-day sum of all transactions from the same sender, to detect structuring. If the sum exceeds $10,000, an alert is generated.
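
A minimal sketch of the rolling-sum check, assuming transactions arrive as `(sender, date, amount)` tuples and that the window covers the current day plus the six days before it. The tuple layout and the sender labels are illustrative only.

```python
from collections import defaultdict
from datetime import date, timedelta
from decimal import Decimal

THRESHOLD = Decimal("10000")
WINDOW = timedelta(days=7)


def structuring_alerts(transactions):
    """Return (sender, date, rolling_sum) wherever the 7-day sum exceeds the threshold."""
    by_sender = defaultdict(list)
    for sender, day, amount in transactions:
        by_sender[sender].append((day, amount))

    alerts = []
    for sender, rows in by_sender.items():
        rows.sort()                # chronological order per sender
        start = 0                  # index of the oldest row still in the window
        running = Decimal("0")
        for day, amount in rows:
            running += amount
            # Drop rows that are seven or more days older than the current one.
            while rows[start][0] <= day - WINDOW:
                running -= rows[start][1]
                start += 1
            if running > THRESHOLD:
                alerts.append((sender, day, running))
    return alerts


sample = [
    ("A", date(2024, 3, 1), Decimal("4000")),
    ("A", date(2024, 3, 3), Decimal("4000")),
    ("A", date(2024, 3, 5), Decimal("3000")),
    ("A", date(2024, 3, 20), Decimal("4000")),
    ("B", date(2024, 3, 1), Decimal("9000")),
    ("B", date(2024, 3, 9), Decimal("9000")),
]
alerts = structuring_alerts(sample)
```

Sender A triggers one alert on 5 March, when three transfers inside the window total 11,000. Sender B's two transfers are eight days apart, so they never share a window.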

What We Learned from Meridian Bank

The implementation revealed several pitfalls. First, the cross-reference table for account-to-customer mapping was incomplete—about 5% of accounts had no match. The team had to set up a manual review process for those, which slowed down the pipeline. Second, the AML tool's XML format changed without notice, breaking the parser. They added a schema validation step that logs errors and alerts the engineering team. Third, the rolling sum calculation was initially done in the pipeline using a window function, but it caused performance issues for high-volume accounts. They moved the aggregation to a separate batch job that runs every hour.
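
The schema validation step added after the parser broke might look something like the sketch below. The AML tool's real XML format is proprietary and not shown in this article, so the element names here are invented purely for illustration.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("aml_ingest")

# Hypothetical required elements; substitute the vendor's actual schema.
REQUIRED_FIELDS = ("alert_id", "entity_name", "match_type")


def validate_alert(xml_text: str):
    """Return (record, errors); record is None when validation fails."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        errors = [f"malformed XML: {exc}"]
        logger.error("AML alert rejected: %s", errors)
        return None, errors

    record, errors = {}, []
    for name in REQUIRED_FIELDS:
        node = root.find(name)
        if node is None or not (node.text or "").strip():
            errors.append(f"missing field: {name}")
        else:
            record[name] = node.text.strip()

    if errors:
        # Logging is the hook for alerting engineering about format drift.
        logger.error("AML alert rejected: %s", errors)
        return None, errors
    return record, []


good = ("<alert><alert_id>42</alert_id><entity_name>Acme</entity_name>"
        "<match_type>sanction</match_type></alert>")
drifted = "<alert><alert_id>42</alert_id><entity_name>Acme</entity_name></alert>"
```

Calling `validate_alert(drifted)` returns no record and a list naming the missing element, which is the signal that the vendor format has changed rather than that the alert is empty.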

The most valuable outcome was not the alerts themselves, but the ability to trace every alert back to the raw data. When a regulator asked for evidence that a particular transaction was reviewed, the compliance team could pull up the unified record showing the transaction, the customer's KYC status, and the AML screening result—all linked by a common ID. That traceability is the real edge.

Edge Cases and Exceptions

No synthesis pipeline survives contact with reality unscathed. Here are the edge cases we've seen trip up even experienced teams.

Legacy System Integration

Legacy systems often don't have APIs, and their export formats are brittle. One team we know had to parse a mainframe report that was a fixed-width text file with no delimiters. The file format was undocumented, and the only person who understood it had retired. The solution was to use a pattern-matching tool to reverse-engineer the format, then add a manual validation step where a human checks a sample of records each month. It's ugly, but it works. The lesson: always budget extra time for legacy integration, and plan for format changes.
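
Once a fixed-width layout has been reverse-engineered, it helps to write it down as data so that a format change is a one-line edit. The column positions and names below are hypothetical; they stand in for whatever the pattern-matching exercise recovers.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    start: int  # inclusive, zero-based
    end: int    # exclusive


# Hypothetical layout recovered from sample files.
LAYOUT = [
    Column("account", 0, 10),
    Column("posted", 10, 18),        # YYYYMMDD
    Column("amount_cents", 18, 30),  # zero-padded integer
]


def parse_line(line: str) -> dict:
    """Slice one fixed-width line into named fields."""
    body = line.rstrip("\n")
    expected = LAYOUT[-1].end
    if len(body) < expected:
        # A short line usually means the layout changed or the file is truncated.
        raise ValueError(f"line has {len(body)} chars, expected at least {expected}")
    return {col.name: body[col.start:col.end].strip() for col in LAYOUT}


parsed = parse_line("AC00012345" + "20240305" + "000000950000" + "\n")
```

The length check is deliberately strict: failing loudly on a short line is what feeds the monthly human validation step with something concrete to look at.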

Unstructured Data: Emails and Chat Logs

Many compliance-relevant communications happen in emails, chat messages, and voice transcripts. These are unstructured and require NLP to extract entities, sentiment, and intent. The challenge is that NLP models are imperfect, especially for domain-specific jargon. A message that says 'Please wire the funds to the usual account' might be innocent or might be a red flag. We've found that the best approach is to use NLP to pre-tag messages with potential risk indicators (e.g., mentions of 'urgent', 'offshore', 'cash'), then send those to human reviewers. The synthesis pipeline should store both the raw text and the extracted entities, so reviewers can see the full context.
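
The pre-tagging idea can be illustrated with plain keyword matching, shown below as a deliberately simple stand-in for a real NLP model. The term list reuses the three examples from the text; the output field names are assumptions.

```python
import re

RISK_TERMS = ("urgent", "offshore", "cash")

# Whole-word, case-insensitive match on any listed term.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, RISK_TERMS)) + r")\b",
    re.IGNORECASE,
)


def pre_tag(message: str) -> dict:
    """Attach risk indicators while keeping the raw text for reviewers."""
    hits = sorted({m.group(1).lower() for m in PATTERN.finditer(message)})
    return {
        "raw_text": message,           # preserved so reviewers see full context
        "risk_indicators": hits,
        "needs_review": bool(hits),
    }


tagged = pre_tag("URGENT: move the cash offshore today")
```

Keyword matching will miss exactly the ambiguous cases the text describes: "Please wire the funds to the usual account" contains none of the terms and passes untagged, which is why the tags should prioritize human review rather than replace it.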

Data Retention and Jurisdictional Conflicts

Different jurisdictions have different rules for how long data must be retained and when it must be deleted. A synthesis pipeline that stores unified records in a central warehouse might violate GDPR or local banking secrecy laws. The solution is to implement data tagging at the record level: each record is tagged with its source jurisdiction and the applicable retention policy. The pipeline then applies a retention policy engine that automatically purges or anonymizes records when they expire. This is non-trivial because a single unified record might combine data from multiple jurisdictions with conflicting rules. In that case, the most restrictive rule should apply, or the data should be partitioned so that each jurisdiction's data is stored separately.
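
Record-level tagging and a simple retention check might be sketched as follows. The jurisdiction codes and retention periods are placeholders, not real legal requirements, and the sketch reads "most restrictive" as "shortest permitted retention", which is itself an assumption: where a jurisdiction imposes a minimum retention period, the partitioning approach is the safer route.

```python
from datetime import date, timedelta

# Placeholder jurisdictions and periods; not legal guidance.
RETENTION_DAYS = {"JUR_A": 3650, "JUR_B": 1825}


def purge_date(created: date, jurisdictions: list) -> date:
    """Date on which a record tagged with these jurisdictions expires."""
    unknown = [j for j in jurisdictions if j not in RETENTION_DAYS]
    if unknown or not jurisdictions:
        # Refuse to guess: an untagged record needs a human decision.
        raise ValueError(f"no retention policy for: {unknown or 'untagged record'}")
    shortest = min(RETENTION_DAYS[j] for j in jurisdictions)
    return created + timedelta(days=shortest)


def is_expired(record: dict, today: date) -> bool:
    """True when the record should be purged or anonymized."""
    return today >= purge_date(record["created"], record["jurisdictions"])


unified = {"created": date(2020, 1, 1), "jurisdictions": ["JUR_A", "JUR_B"]}
```

In this example the unified record inherits the 1,825-day period from `JUR_B`, so it expires well before the `JUR_A` data alone would.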

Handling Data Quality Degradation

Over time, source systems change, and data quality can degrade. A field that was always populated might start showing nulls after a system upgrade. The synthesis pipeline should include data quality monitors that track metrics like completeness, uniqueness, and timeliness per source. When a metric drops below a threshold, an alert is sent to the data engineering team. Without these monitors, bad data can silently propagate and corrupt downstream models.
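
As one example of such a monitor, the sketch below measures per-field completeness for a batch and reports any field that falls under a threshold. The 98% default and the field names are assumptions; uniqueness and timeliness checks would follow the same pattern.

```python
def completeness(records, field):
    """Share of records in which the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def check_source(source, records, required_fields, minimum=0.98):
    """Return one alert per field whose completeness is below the minimum."""
    alerts = []
    for field in required_fields:
        ratio = completeness(records, field)
        if ratio < minimum:
            alerts.append({"source": source, "field": field, "completeness": ratio})
    return alerts


batch = [
    {"customer_id": "C1", "kyc_status": "verified"},
    {"customer_id": "C2", "kyc_status": "verified"},
    {"customer_id": "C3", "kyc_status": None},  # e.g. nulls after an upgrade
    {"customer_id": "C4", "kyc_status": "pending"},
]
quality_alerts = check_source("crm", batch, ["customer_id", "kyc_status"])
```

An empty batch scores 0.0 on purpose, so a source that silently stops sending data raises an alert instead of looking healthy.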

Limits of the Approach

Synthesis is powerful, but it's not a silver bullet. Being honest about its limits will save you from over-investing in the wrong solution.

Data Quality Is Still King

No amount of clever synthesis can fix fundamentally bad data. If the raw records are missing critical fields, or if the source systems have systematic errors (e.g., a CRM that allows free-text entry for country names), the synthesized data will be unreliable. Synthesis can flag quality issues, but it cannot invent missing data. Always start with a data quality audit before building the pipeline. If the raw data is too poor, consider investing in source system improvements first.

Scale and Latency Trade-offs

Synthesis adds latency. Every transformation step takes time, and identity resolution can be computationally expensive. For real-time use cases like transaction screening, a synchronous pipeline that waits for enrichment might be too slow. The alternative is to use a two-speed architecture: a fast path that does minimal normalization (just enough to make a screening decision) and a slow path that does full enrichment for later analysis. This adds complexity but is necessary for time-sensitive applications.

Human Judgment Is Irreplaceable

Synthesis can organize and enrich data, but it cannot replace the judgment of an experienced compliance officer. The pipeline will generate false positives and miss edge cases. The goal is not to automate compliance decisions, but to give humans better information to make decisions. A pipeline that tries to fully automate alert disposition is likely to produce brittle rules that miss novel patterns. Instead, treat synthesis as a decision support tool, not a decision maker.

Regulatory Interpretation Changes

Regulations evolve, and a synthesis pipeline built for today's rules may need significant rework when rules change. For example, if a regulator starts requiring a new field (e.g., beneficial ownership information), the pipeline must be updated to ingest that data. This is not a one-time project; it's an ongoing investment. Teams should budget for continuous maintenance and have a process for responding to regulatory changes quickly.

Given these limits, we recommend a pragmatic approach: start with a minimal viable pipeline that addresses your highest-priority use case, measure its impact, and iterate. Don't try to build a perfect system from day one. The edge comes not from having the most sophisticated pipeline, but from having a pipeline that is reliable, transparent, and adaptable. That is the real RegTech advantage.
