Query Language

This article is aimed both at developers who will use our unmanaged backend and at external collaborators who will join our supply chain to help us scale the managed process.

1. Clauses:

1.1. FETCH

Coordinates the web page acquisition process by defining the sequence of actions to perform, chosen from those available.
It is the starting clause of an acquisition pipeline, and its expressions can be interpolated with values from the reference input dataset.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = CLICK WITH ARGS (selector = '{expression}'))
)

Each action is delimited by round brackets, and the sequence is marked by the keyword THEN.
The parameters of each action are defined within the WITH ARGS directive, using
the keyword AND to separate multiple parameters. For more details on the
parameters of a single action, see the Actions section.
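The composition rules above (round brackets around each action, THEN between actions, AND between parameters) can be sketched as a small Python helper that renders the query text. This is only an illustration: `render_action` and `render_fetch` are hypothetical names, not part of the product, and the sketch assumes the parser does not care whether the clause is written on one line or several.

```python
def render_action(name, args):
    """Render one action: (action = NAME WITH ARGS (k = 'v' AND ...))."""
    rendered_args = " AND ".join(f"{key} = '{value}'" for key, value in args.items())
    return f"(action = {name} WITH ARGS ({rendered_args}))"


def render_fetch(actions):
    """Join the rendered actions with THEN inside a FETCH clause."""
    body = " THEN ".join(render_action(name, args) for name, args in actions)
    return f"FETCH WHERE ACTIONS ARE ({body})"


# Hypothetical values, used only to show the resulting text.
query = render_fetch([
    ("VISIT", {"url": "https://example.com"}),
    ("TEXTINPUT", {"selector": "input[name=q]", "value": "shoes"}),
])
print(query)
```

Building the text from a list of actions keeps the brackets balanced by construction, which is the most common source of errors when a clause is typed by hand.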

1.2. JOIN

Manages the navigation of detail pages by defining the segmentation filter (splitter), the actions to run on each segment and the fields to extract. Together with the CURRENT and PIVOTED clauses, it lets you refer to fields of the parent page and fields of the child page using CSS selectors.

Example:

JOIN (
    (PIVOTED('div[class="{class}"] span'))
)
WHERE (
    splitter = 'div[class="{class}"]'
    AND
    ACTIONS ARE
    (
        (action = {action} WITH ARGS (param = '{expression}'))
    )
)

1.3. SELECT

Corresponds to the selection and projection operators of SQL and defines the fields to be extracted.

Example:

SELECT (
    (CURRENT('div[class="{class}"]')) AS field1
    THEN
    (CURRENT('div[class="{class}"]')) AS field2
)

1.4. FLATSELECT

Combines the features of FLATTEN and SELECT. Selection and projection apply to the specific segment identified by the splitter attribute, which must be set accordingly.

Example:

FLATSELECT (
    (PIVOTED('div[class="{class}"]')) AS field1
    THEN
    (PIVOTED('div[class="{class}"]')) AS field2
)
WHERE (
    splitter = 'div[class="{class}"]'
)

1.5. FLATTEN

Example:

FLATTEN WHERE (
    splitter = 'div[class="{class}"]'
)

1.6. EXPLORE

Manages a recursive traversal of pages according to the defined selection operator, performing the actions and selecting the fields on each detail page.

Example:

EXPLORE (
    (PIVOTED('div[class="{class}"]')) AS field1
    THEN
    (PIVOTED('div[class="{class}"]')) AS field2
)
WHERE (
    splitter = 'div[class="{class}"]'
    AND
    ACTIONS ARE
    (
        (action = {action} WITH ARGS (param = '{expression}'))
    )
)

1.7. VISITJOIN

Shorthand JOIN operator that executes the VISIT navigation action without an explicit selection of fields.

Example:

VISITJOIN WHERE
(
    splitter = 'div[class={class}]'
)

1.8. WGETJOIN

Shorthand JOIN operator that executes the WGET navigation action (an HTTP request that does not involve a headless browser) without an explicit selection of fields.

Example:

WGETJOIN WHERE
(
    splitter = 'div[class={class}]'
)

1.9. VISITEXPLORE

Shorthand EXPLORE operator that executes the VISIT navigation action without an explicit selection of fields.

Example:

VISITEXPLORE WHERE
(
    splitter = 'div[class="{class}"]'
)

2. Actions:

Actions that can be defined in the FETCH, JOIN and EXPLORE clauses.

2.1. VISIT

Browsing action that uses a headless browser.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
)

The expression in curly brackets can be a dynamic interpolating expression with variables and fields from the input dataset.

2.2. WGET

Navigation action that performs a simple HTTP request. It is faster than its VISIT counterpart, but it is limited to static HTML pages that do not involve complex JavaScript interactions.

Example:

FETCH WHERE ACTIONS ARE (
    (action = WGET WITH ARGS (url = '{expression}'))
)

2.3. CLICK

Using the headless browser, automatically clicks the element identified by the given CSS selector.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = CLICK WITH ARGS (selector = 'div[class="{class}"]'))
)

2.4. CLICKNEXT

Using the headless browser, automatically clicks the next element that has not yet been clicked during the user session.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = CLICKNEXT WITH ARGS (selector = 'div[class="{class}"]'))
)

2.5. TEXTINPUT

Using the headless browser, fills in a text field with the given value.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = TEXTINPUT WITH ARGS (selector = 'div[class="{class}"]' AND value = '{expression}'))
)

2.6. DELAY

Suspends execution for the time, in milliseconds, given by the duration parameter.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = DELAY WITH ARGS (duration = '10'))
)

2.7. RANDOMDELAY

Suspends execution for a randomly selected time, in milliseconds, between the min and max values.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = RANDOMDELAY WITH ARGS (min = '2' AND max = '10'))
)

2.8. SCREENSHOT

Takes a screenshot of the current page and saves it as an image.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = SCREENSHOT WITH ARGS (filter = '{MustHaveTitle|NoFilter}'))
)

2.9. SUBMIT

Executes the submit command of a data submission form.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = SUBMIT WITH ARGS (selector = '{expression}'))
)

2.10. DROPDOWNSELECT

Selects a value in a drop-down control.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = DROPDOWN WITH ARGS (selector = '{expression}' AND value = '{expression}'))
)

2.11. EXESCRIPT

Executes injected client-side JavaScript code.

{idScript} refers to the JavaScript code defined in the database or in the XML configuration file of the internal tool in managed mode.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = EXESCRIPT WITH ARGS (selector = '{div[class="{class}"]}' AND idClient = '{idScript}'))
)

The second form passes {script} through the value parameter instead of idClient:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = EXESCRIPT WITH ARGS (selector = '{div[class="{class}"]}' AND value = '{script}'))
)

2.12. DRAGSLIDER

Sets the scroll bar to the specified percentage.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = DRAGSLIDER WITH ARGS (selector = '{div[class="{class}"]}' AND percentage = '{percentage}'))
)

2.13. LOOP

Repeats the listed subactions according to a specific condition.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (
        action = LOOP WHERE SUBACTIONS ARE (
            (action = CLICK WITH ARGS (selector = '{div[class="{class}"]}'))
        )
        WITH ARGS (limit = '{limit}')
    )
)

2.14. WAITFOR

Waits until the element identified by the CSS selector has been loaded on the page before continuing with the flow.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (action = WAITFOR WITH ARGS (selector = '{cssselector}'))
)
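The waiting behaviour described for WAITFOR can be pictured as a polling loop. The Python sketch below is a generic illustration and not the engine's implementation: the function name `wait_for`, the timeout and the polling interval are all assumptions, since this article does not say how long WAITFOR waits or how often it checks.

```python
import time


def wait_for(is_loaded, timeout_s=5.0, poll_interval_s=0.1,
             clock=time.monotonic, sleep=time.sleep):
    """Poll is_loaded() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_loaded():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Simulated page: the element appears on the third check.
checks = {"count": 0}


def element_present():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_for(element_present, timeout_s=1.0, poll_interval_s=0.01)
print(found, checks["count"])
```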

2.15. TRY

Manages a retry policy according to the defined number of attempts.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (url = '{expression}'))
    THEN
    (
        action = TRY
        WHERE SUBACTIONS ARE (
            (action = CLICK WITH ARGS (selector = '{selector}'))
        )
        WITH ARGS (selector = '{cssselector}')
    )
)
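As a rough mental model of a retry policy bounded by a number of attempts, the following Python sketch reruns a list of subactions until they all succeed or the attempts run out. It is a generic illustration under stated assumptions: `run_with_retries` and `max_attempts` are hypothetical names, and this article does not specify how the engine counts attempts or what it treats as a failure.

```python
def run_with_retries(subactions, max_attempts):
    """Run the subactions in order; on any error, start over, up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            for subaction in subactions:
                subaction()
            return attempt  # number of attempts actually used
        except Exception as error:
            last_error = error
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Simulated click that fails twice before succeeding.
state = {"calls": 0}


def flaky_click():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("element not clickable yet")


attempts_used = run_with_retries([flaky_click], max_attempts=5)
print(attempts_used)
```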

3. Dynamic Expressions:

You can associate variables with selection expressions.

Example:

FETCH WHERE ACTIONS ARE (
    (action = VISIT WITH ARGS (
        url = (('http://{url}') + $(key_url))
    ))
)

key_url is a variable that is either assigned by the AS projection operator or retrieved from a field of the input dataset associated with the query script.
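To make the idea of interpolation concrete, the Python sketch below resolves `$(name)` references against one row of an input dataset. It is an illustration only: `interpolate` is a hypothetical helper, the rows and the base URL are made-up values, and the sketch simplifies the `+` concatenation of the query language into a single template string.

```python
import re

# Matches references of the form $(variable_name).
VARIABLE = re.compile(r"\$\((\w+)\)")


def interpolate(template, row):
    """Replace every $(name) in template with the value of row[name]."""
    def lookup(match):
        name = match.group(1)
        if name not in row:
            raise KeyError(f"variable {name!r} is not defined for this row")
        return str(row[name])
    return VARIABLE.sub(lookup, template)


# Hypothetical input dataset with a key_url field.
rows = [{"key_url": "/products/1"}, {"key_url": "/products/2"}]
urls = [interpolate("http://example.com$(key_url)", row) for row in rows]
print(urls)
```

Each row of the input dataset yields its own resolved URL, which is how a single query script can drive many page visits.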

4. Parameters

Parameters can be defined at action level with the WITH ARGS directive, already seen in the examples, or at clause level with the WITH PARAMETERS ARE directive.

The parameters of each clause are listed below. In general, it is not necessary to use them.

4.1. FETCH

numPartitions: number of partitions allocated by Spark when segmenting the dataset

4.2. JOIN

numPartitions: number of partitions allocated by Spark during segmentation

flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}

4.3. VISITJOIN

numPartitions: number of partitions allocated by Spark during segmentation

flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}

4.4. WGETJOIN

numPartitions: number of partitions allocated by Spark during segmentation

flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}

4.5. EXPLORE

numPartitions: number of partitions allocated by Spark during segmentation

maxDepth: maximum level of traversal during the recursive crawling phase
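The effect of a depth limit on recursive crawling can be sketched with a breadth-first traversal over a small link graph. This is a generic illustration, not the engine's algorithm: the function `explore`, the choice of breadth-first order, counting the start page as depth 0 and skipping already visited pages are all assumptions.

```python
from collections import deque


def explore(start, links, max_depth):
    """Return each reachable page with its depth, never going deeper than max_depth."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = depth_of[page]
        if depth == max_depth:
            continue  # do not follow links from pages at the limit
        for child in links.get(page, []):
            if child not in depth_of:
                depth_of[child] = depth + 1
                queue.append(child)
    return depth_of


# Hypothetical site structure: page -> pages it links to.
links = {"home": ["a", "b"], "a": ["a1"], "a1": ["a2"], "b": ["home"]}
reached = explore("home", links, max_depth=2)
print(reached)
```

With a limit of 2, the page a2 is never reached because it sits three links away from the start page.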

4.6. VISITEXPLORE

numPartitions: number of partitions allocated by Spark during segmentation

maxDepth: maximum level of traversal during the recursive crawling phase

4.7. WGETEXPLORE

numPartitions: number of partitions allocated by Spark during segmentation

maxDepth: maximum level of traversal during the recursive crawling phase

4.8. FLATTEN

alias: association of a variable to which each segment refers, possibly projectable on the output dataset

4.9. FLATSELECT

alias: association of a variable to which each segment refers, possibly projectable on the output dataset