Input Sanitization
Data enrichment systems ingest user-provided data from sources like:
- CRMs
- ATSs
- Web forms
User-provided data can contain mistakes or invalid values. Pipe0 has a robust sanitization layer to clean and regenerate data.
Cleanup
The following request payload contains common errors but will be processed successfully.
```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    {
      "id": 1,
      "name": "Susi Jui",
      "company_website_url": "pipe0.com",      // missing protocol "https://"
      "email": "mailto:susi@pipe0.com",        // unwanted "mailto:" prefix
      "personal_website_url": "wwww.susi.com"  // "wwww" instead of "www"
    },
    {
      "id": 2,
      "name": "Tom Schmidt",
      "company_name": "Pipe0",
      "company_website_url": "not today"       // invalid: expected a URL
    }
  ]
}
```
Here’s how we clean this request:
- Parse URLs into a consistent format and fix common mistakes
- Parse email addresses into a consistent format and fix common mistakes
- Fix obvious typos and remove invalid characters
- Convert data types on demand (int → float, float → int, int → string)
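To make these rules concrete, here is a minimal sketch of the kind of normalization described above. This is an illustration only, not Pipe0's actual implementation; the exact rules Pipe0 applies may differ.

```python
import re

def sanitize_url(value: str) -> str:
    """Illustrative URL cleanup: fix a mistyped "www" prefix and add a missing protocol."""
    value = value.strip()
    # Collapse a mistyped "wwww."/"wwwww." prefix down to "www."
    value = re.sub(r"^(https?://)?w{4,}\.", r"\1www.", value)
    # Prepend the protocol if it is missing
    if not re.match(r"^https?://", value):
        value = "https://" + value
    return value

def sanitize_email(value: str) -> str:
    """Illustrative email cleanup: strip a "mailto:" prefix and lowercase."""
    value = value.strip()
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
    return value.lower()
```

Applied to the payload above, these helpers would turn `"pipe0.com"` into `"https://pipe0.com"` and `"mailto:susi@pipe0.com"` into `"susi@pipe0.com"`.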
Regeneration
In our example, `"company_website_url": "not today"` is not a valid URL. Because `company_website_url` is an output field of `company:identity@1`, we can find the correct value. During processing, `company:identity@1` detects that `company_website_url` has an invalid format and replaces it with the correct value.
The result may look like this:
```json
{
  "id": 2,
  "name": "Tom Schmidt",
  "company_name": "Pipe0",
  "company_website_url": "https://valid-url.com" // healed
}
```
Valid input values are not regenerated. Instead, they are copied from the input to the record.
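The copy-or-regenerate decision can be sketched as follows. This is illustrative only; `is_valid` and `regenerate` are hypothetical stand-ins for the pipe's internal validation and lookup logic.

```python
from typing import Callable

def resolve_field(value: object,
                  is_valid: Callable[[object], bool],
                  regenerate: Callable[[], object]) -> object:
    """Copy a valid input value to the record; otherwise regenerate it.

    `is_valid` and `regenerate` stand in for the pipe's own validation
    and lookup logic (assumptions, not Pipe0's API).
    """
    if value is not None and is_valid(value):
        return value       # valid input is copied, never regenerated
    return regenerate()    # invalid input is replaced by the pipe's output
```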
Incomplete data
It is common for input data to be incomplete.
Failing the entire task because one input object cannot be processed is impractical and annoying.
Partially missing input fields
If we find at least one input object that can be processed, pipeline validation will pass.
Let’s look at the following request payload:
```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    {
      "id": 1,
      "name": "Susi Jui",
      "company_name": "Pipe0",
      "company_website_url": "pipe0.com"
    },
    { // CANNOT be processed by "company:identity@1"
      "id": 2
      // required `company_name` missing
    }
  ]
}
```
The pipe `company:identity@1` requires the input field `company_website_url`, which is not present in record `id=2`. In this case:

- Pipeline validation passes
- Record `id=1` is processed in full
- Record `id=2` has failed fields
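The validation behavior above reduces to a simple rule: validation passes as long as at least one input object carries every field the pipe requires. A minimal sketch of that rule, assuming a hypothetical `required_fields` list (not Pipe0's actual API):

```python
def can_process(record: dict, required_fields: list[str]) -> bool:
    """True if the record carries a non-null value for every required field."""
    return all(record.get(field) is not None for field in required_fields)

def pipeline_validation_passes(records: list[dict], required_fields: list[str]) -> bool:
    """Validation passes if at least one record can be processed."""
    return any(can_process(r, required_fields) for r in records)
```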
No input object has the required input fields
Let’s look at another example:
```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    { // CANNOT be processed by "company:identity@1"
      "id": 1,
      "name": "Susi Jui"
    },
    { // CANNOT be processed by "company:identity@1"
      "id": 2,
      "name": "Tom Schmidt"
    }
  ]
}
```
No input object has the required field `company_website_url`. The request will fail during pipeline validation, and the entire task fails before processing starts.
Never fail a task
In practice, dealing with failing tasks can be annoying. If you don't want to deal with failing tasks, there's an escape hatch: if you define the expected input fields and set them to `null`, pipeline validation will pass. The task will not fail; instead, only individual fields fail.
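For example, a payload that declares `company_name` on every record and sets it to `null` (the field name is taken from the examples above) passes pipeline validation even though no record carries a value:

```json
{
  "pipes": [
    { "pipe_id": "company:identity@1" }
  ],
  "input": [
    { "id": 1, "name": "Susi Jui", "company_name": null },
    { "id": 2, "name": "Tom Schmidt", "company_name": null }
  ]
}
```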