Input Sanitization
Data enrichment systems ingest user-provided data from sources like:
- CRMs
- ATSs
- Web forms
User-provided data can contain mistakes or invalid values. Pipe0 has a robust sanitization layer to clean and regenerate data.
Cleanup
The following request payload contains common errors but will be processed successfully.
```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    {
      "id": 1,
      "name": "Susi Jui",
      "company_website_url": "pipe0.com",      // missing protocol "https://"
      "email": "mailto:susi@pipe0.com",        // unwanted "mailto:" prefix
      "personal_website_url": "wwww.susi.com"  // "wwww" instead of "www"
    },
    {
      "id": 2,
      "name": "Tom Schmidt",
      "company_name": "Pipe0",
      "company_website_url": "not today"       // invalid: expected a URL
    }
  ]
}
```
Here’s how we clean this request:
- Parse URLs into a consistent format and fix common mistakes
- Parse email addresses into a consistent format and fix common mistakes
- Fix obvious typos and remove invalid characters
- Convert data types on demand (int → float, float → int, int → string)
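To make these rules concrete, here is a minimal sketch of the kind of normalization described above. This is an illustration only, not Pipe0's actual implementation; the exact rules Pipe0 applies may differ.

```python
import re

def sanitize_url(value: str) -> str:
    """Illustrative URL cleanup: fix a mistyped "www" prefix and add a missing protocol."""
    value = value.strip()
    # Collapse a mistyped "wwww."/"wwwww." prefix down to "www."
    value = re.sub(r"^(https?://)?w{4,}\.", r"\1www.", value)
    # Prepend the protocol if it is missing
    if not re.match(r"^https?://", value):
        value = "https://" + value
    return value

def sanitize_email(value: str) -> str:
    """Illustrative email cleanup: strip a "mailto:" prefix and lowercase."""
    value = value.strip()
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
    return value.lower()
```

Applied to the payload above, these helpers would turn `"pipe0.com"` into `"https://pipe0.com"` and `"mailto:susi@pipe0.com"` into `"susi@pipe0.com"`.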
Regeneration
In our example, `"company_website_url": "not today"` is not a valid URL. Because `company_website_url` is an output field of `company:identity@1`, we can find the correct value. During processing, `company:identity@1` detects that `company_website_url` has an invalid format and replaces it with the correct value.
The result may look like this:
```json
{
  "id": 2,
  "name": "Tom Schmidt",
  "company_name": "Pipe0",
  "company_website_url": "https://valid-url.com" // healed
}
```
Valid input values are not regenerated. Instead, they are copied from the input to the record.
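The copy-or-regenerate decision can be sketched as follows. This is illustrative only; `is_valid` and `regenerate` are hypothetical stand-ins for the pipe's internal validation and lookup logic.

```python
from typing import Callable

def resolve_field(value: object,
                  is_valid: Callable[[object], bool],
                  regenerate: Callable[[], object]) -> object:
    """Copy a valid input value to the record; otherwise regenerate it.

    `is_valid` and `regenerate` stand in for the pipe's own validation
    and lookup logic (assumptions, not Pipe0's API).
    """
    if value is not None and is_valid(value):
        return value       # valid input is copied, never regenerated
    return regenerate()    # invalid input is replaced by the pipe's output
```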
Incomplete data
It is common for input data to be incomplete.
Failing the entire task because one input object cannot be processed is impractical and annoying.
Partially missing input fields
If we find at least one input object that can be processed, pipeline validation will pass.
Let’s look at the following request payload:
```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    {
      "id": 1,
      "name": "Susi Jui",
      "company_name": "Pipe0",
      "company_website_url": "pipe0.com"
    },
    { // CANNOT be processed by "company:identity@1"
      "id": 2
      // required `company_name` missing
    }
  ]
}
```
The pipe `company:identity@1` requires the input field `company_website_url`, which is not present in record `id=2`. In this case:

- Pipeline validation passes
- Record `id=1` is processed in full
- Record `id=2` has failed fields
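The validation behavior above reduces to a simple rule: validation passes as long as at least one input object carries every field the pipe requires. A minimal sketch of that rule, assuming a hypothetical `required_fields` list (not Pipe0's actual API):

```python
def can_process(record: dict, required_fields: list[str]) -> bool:
    """True if the record carries a non-null value for every required field."""
    return all(record.get(field) is not None for field in required_fields)

def pipeline_validation_passes(records: list[dict], required_fields: list[str]) -> bool:
    """Validation passes if at least one record can be processed."""
    return any(can_process(r, required_fields) for r in records)
```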
No input object has the required input fields
Let’s look at another example:
```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    { // CANNOT be processed by "company:identity@1"
      "id": 1,
      "name": "Susi Jui"
    },
    { // CANNOT be processed by "company:identity@1"
      "id": 2,
      "name": "Tom Schmidt"
    }
  ]
}
```
No input object has the required field `company_website_url`. The request will fail during pipeline validation, and the entire task fails before processing starts.
Never fail a task
In practice, dealing with failing tasks can be annoying. If you don't want to deal with failing tasks, there's an escape hatch: if you define the expected input fields and set them to `null`, pipeline validation will pass. The task will not fail; instead, only individual fields fail.
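For example, a payload that declares `company_name` on every record and sets it to `null` (the field name is taken from the examples above) passes pipeline validation even though no record carries a value:

```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    { "id": 1, "name": "Susi Jui", "company_name": null },
    { "id": 2, "name": "Tom Schmidt", "company_name": null }
  ]
}
```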