Query Language
The article is aimed both at developers who will use our unmanaged backend and at external collaborators who will enter our supply chain to help us scale the managed process.
1. Clauses:
1.1. FETCH
Coordinates web page acquisition by defining the sequence of actions to run, chosen from the available actions.
It is the opening clause of an acquisition pipeline, and its expressions can be interpolated with fields from the reference input dataset.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = CLICK WITH ARGS (selector = '{expression}'))
)
Each action is enclosed in round brackets, and consecutive actions are separated by the keyword THEN.
The parameters of each action are defined within the WITH ARGS directive; multiple
parameters are joined with the keyword AND. For details on the parameters of each
action, see the Actions section.
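The bracket, THEN and AND rules can be illustrated with a small Python sketch that assembles a FETCH clause from a list of actions. This is not part of the product: render_action and render_fetch are hypothetical helper names, and the selector value is a made-up example.

```python
from typing import Mapping, Sequence, Tuple

# One action is a name plus its WITH ARGS parameters (hypothetical representation).
Action = Tuple[str, Mapping[str, str]]


def render_action(name: str, args: Mapping[str, str]) -> str:
    """Render one action in round brackets, joining its parameters with AND."""
    rendered_args = " AND ".join(f"{key} = '{value}'" for key, value in args.items())
    return f"(action = {name} WITH ARGS ({rendered_args}))"


def render_fetch(actions: Sequence[Action]) -> str:
    """Wrap the rendered actions in a FETCH clause, separated by THEN."""
    body = "\nTHEN\n".join(render_action(name, args) for name, args in actions)
    return f"FETCH WHERE ACTIONS ARE (\n{body}\n)"


query = render_fetch([
    ("VISIT", {"url": "{expression}"}),
    ("TEXTINPUT", {"selector": "input[name=q]", "value": "{expression}"}),
])
print(query)
```

Generating the text this way keeps the brackets balanced, which is easy to get wrong when a long action sequence is written by hand.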
1.2. JOIN
Manages navigation to detail pages by defining the segmentation filter (splitter), the actions to run on each segment and the fields to extract. Together with the CURRENT and PIVOTED clauses, it lets you refer to fields of the parent page and fields of the child page using CSS selectors.
Example:
JOIN (
(PIVOTED('div[class="{class}"] span'))
)
WHERE (
splitter = 'div[class="{class}"]'
AND
ACTIONS ARE
(
(action = {action} WITH ARGS (param = '{expression}'))
)
)
1.3. SELECT
Corresponds to the selection and projection operators in SQL and defines the fields to extract.
Example:
SELECT (
(CURRENT('div[class="{class}"]')) AS field1
THEN
(CURRENT('div[class="{class}"]')) AS field2
)
1.4. FLATSELECT
Combines the features of FLATTEN and SELECT. Selection and projection apply to the specific segment identified by the splitter attribute, which must be set accordingly.
Example:
FLATSELECT (
(PIVOTED('div[class="{class}"]')) AS field1
THEN
(PIVOTED('div[class="{class}"]')) AS field2
)
WHERE (
splitter = 'div[class="{class}"]'
)
1.5. FLATTEN
Example:
FLATTEN WHERE (
splitter = 'div[class="{class}"]'
)
1.6. EXPLORE
Manages a recursive traversal of pages according to the defined selection operator, running the actions and selecting the fields on each detail page.
Example:
EXPLORE (
(PIVOTED('div[class="{class}"]')) AS field1
THEN
(PIVOTED('div[class="{class}"]')) AS field2
)
WHERE (
splitter = 'div[class="{class}"]'
AND
ACTIONS ARE
(
(action = {action} WITH ARGS (param = '{expression}'))
)
)
1.7. VISITJOIN
Shorthand JOIN operator that runs the VISIT navigation action without an explicit selection of fields.
Example:
VISITJOIN WHERE
(
splitter = 'div[class={class}]'
)
1.8. WGETJOIN
Shorthand JOIN operator that runs the WGET navigation action (a plain HTTP request that does not involve a headless browser) without an explicit selection of fields.
Example:
WGETJOIN WHERE
(
splitter = 'div[class={class}]'
)
1.9. VISITEXPLORE
Shorthand EXPLORE operator that runs the VISIT navigation action without an explicit selection of fields.
Example:
VISITEXPLORE WHERE
(
splitter = 'div[class="{class}"]'
)
2. Actions:
Actions that can be used in the FETCH, JOIN and EXPLORE clauses.
2.1. VISIT
Navigation action performed with a headless browser.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
)
The expression in curly brackets can be a dynamic expression that interpolates variables and fields from the input dataset.
2.2. WGET
Navigation action performed as a plain HTTP request. It is faster than its VISIT counterpart, but it is limited to static HTML pages that do not require complex JavaScript interactions.
Example:
FETCH WHERE ACTIONS ARE (
(action = WGET WITH ARGS (url = '{expression}'))
)
2.3. CLICK
Using the headless browser, clicks the element identified by the given CSS selector.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = CLICK WITH ARGS (selector = 'div[class="{class}"]'))
)
2.4. CLICKNEXT
Using the headless browser, clicks the next matching element that has not yet been clicked during the user session.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = CLICKNEXT WITH ARGS (selector = 'div[class="{class}"]'))
)
2.5. TEXTINPUT
Using the headless browser, fills in a text field with the given value.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = TEXTINPUT WITH ARGS (selector = 'div[class="{class}"]' AND value = '{expression}'))
)
2.6. DELAY
Suspends execution for the time given by the duration parameter, in milliseconds.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = DELAY WITH ARGS (duration = '10'))
)
2.7. RANDOMDELAY
Suspends execution for a randomly selected time, in milliseconds, between the min and max values.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = RANDOMDELAY WITH ARGS (min = '2' AND max = '10'))
)
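As a rough illustration of what RANDOMDELAY does, the following Python sketch picks a delay between a minimum and a maximum and sleeps for that long. It assumes a uniform distribution over the range, which the text above does not specify; pick_delay_ms and random_delay are hypothetical names, not engine functions.

```python
import random
import time
from typing import Optional


def pick_delay_ms(min_ms: int, max_ms: int, rng: Optional[random.Random] = None) -> float:
    """Pick a delay in milliseconds between min_ms and max_ms (uniform, by assumption)."""
    if min_ms > max_ms:
        raise ValueError("min_ms must not exceed max_ms")
    source = rng if rng is not None else random
    return source.uniform(min_ms, max_ms)


def random_delay(min_ms: int, max_ms: int) -> float:
    """Sleep for a randomly chosen time and return the delay that was used."""
    delay_ms = pick_delay_ms(min_ms, max_ms)
    time.sleep(delay_ms / 1000.0)  # time.sleep expects seconds
    return delay_ms


# Same values as the RANDOMDELAY example: min = 2, max = 10.
used = random_delay(2, 10)
print(f"slept for about {used:.1f} ms")
```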
2.8. SCREENSHOT
Takes a screenshot of the current page and saves it as an image.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = SCREENSHOT WITH ARGS (filter = '{MustHaveTitle|NoFilter}'))
)
2.9. SUBMIT
Executes the submit command of a data submission form.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = SUBMIT WITH ARGS (selector = '{expression}'))
)
2.10. DROPDOWNSELECT
Selects a value in a drop-down control.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = DROPDOWN WITH ARGS (selector = '{expression}' AND value = '{expression}'))
)
2.11. EXESCRIPT
Executes injected client-side JavaScript code.
{idScript} refers to a JavaScript script defined in the database or in the XML configuration file of the internal tool in managed mode.
Examples:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = EXESCRIPT WITH ARGS (selector = '{div[class="{class}"]}' AND idClient = '{idScript}'))
)
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = EXESCRIPT WITH ARGS (selector = '{div[class="{class}"]}' AND value = '{script}'))
)
2.12. DRAGSLIDER
Sets the scroll bar identified by the selector to a specific percentage.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = DRAGSLIDER WITH ARGS (selector = '{div[class="{class}"]}' AND percentage = '{percentage}'))
)
2.13. LOOP
Repeats the listed sub-actions according to a specific condition.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = LOOP
WHERE SUBACTIONS ARE (
(action = CLICK WITH ARGS (selector = '{div[class="{class}"]}'))
)
WITH ARGS (limit = '{limit}')
)
)
2.14. WAITFOR
Waits until the element identified by the CSS selector has loaded on the page before continuing with the flow.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = WAITFOR WITH ARGS (selector = '{cssselector}'))
)
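The idea behind WAITFOR can be sketched in Python as a polling loop that checks for the element until it appears. This is only an illustration: wait_for, its timeout and its polling interval are assumptions (the text above does not say whether the engine applies a timeout), and the element check is simulated with a counter rather than a real browser.

```python
import time
from typing import Callable


def wait_for(is_present: Callable[[], bool], timeout_s: float = 5.0, poll_s: float = 0.05) -> bool:
    """Poll is_present() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_present():
            return True
        time.sleep(poll_s)
    # One last check once the deadline has passed.
    return is_present()


# Simulated page: the element "appears" on the third check.
checks = {"count": 0}


def element_loaded() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_for(element_loaded, timeout_s=1.0, poll_s=0.01)
print(found, checks["count"])
```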
2.15. TRY
Manages a retry policy for the listed sub-actions, according to the defined number of attempts.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (url = '{expression}'))
THEN
(action = TRY
WHERE SUBACTIONS ARE (
(action = CLICK WITH ARGS (selector = '{selector}'))
)
WITH ARGS (selector = '{cssselector}')
)
)
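A retry policy of this kind can be sketched in Python as follows. The sketch is not the engine's implementation: try_actions and max_attempts are hypothetical names (the TRY example above passes a selector, and the name of the attempts parameter is not documented here), and the failing click is simulated.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def try_actions(run: Callable[[], T], max_attempts: int) -> T:
    """Call run() until it succeeds or max_attempts calls have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return run()
        except Exception as error:  # remember the failure and try again
            last_error = error
    assert last_error is not None
    raise last_error


# Simulated sub-action: the click fails twice, then succeeds.
calls = {"count": 0}


def flaky_click() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("element not ready")
    return "clicked"


result = try_actions(flaky_click, max_attempts=5)
print(result, calls["count"])
```

If every attempt fails, the last error is raised again so that the caller can decide how to handle it.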
3. Dynamic Expressions:
You can combine variables with expressions.
Example:
FETCH WHERE ACTIONS ARE (
(action = VISIT WITH ARGS (
url = (('http://{url}') + $(key_url))
))
)
key_url is a variable assigned by the AS projection operator or read from a field of the input dataset associated with the query script.
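To make the substitution more concrete, the following Python sketch resolves {field} placeholders from an input row and $(name) references from a set of variables. It only illustrates the idea; the engine's actual resolution rules are not described here, and resolve, row and variables are hypothetical names with made-up sample values.

```python
import re
from typing import Mapping


def resolve(template: str, row: Mapping[str, str], variables: Mapping[str, str]) -> str:
    """Replace {field} with values from row and $(name) with values from variables."""
    with_fields = re.sub(r"\{(\w+)\}", lambda m: row[m.group(1)], template)
    return re.sub(r"\$\((\w+)\)", lambda m: variables[m.group(1)], with_fields)


row = {"url": "example.com/items/"}   # a row of the input dataset (sample data)
variables = {"key_url": "42"}         # e.g. a value projected earlier with AS
resolved = resolve("http://{url}$(key_url)", row, variables)
print(resolved)  # http://example.com/items/42
```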
4. Parameters
Parameters can be defined at action level with the WITH ARGS directive, as in the examples above, or at clause level with the WITH PARAMETERS ARE directive.
The parameters for each clause are listed below. They are generally not needed.
4.1. FETCH
numPartitions: number of partitions allocated by Spark when segmenting the dataset
4.2. JOIN
numPartitions: number of partitions allocated by Spark during segmentation
flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}
4.3. VISITJOIN
numPartitions: number of partitions allocated by Spark during segmentation
flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}
4.4. WGETJOIN
numPartitions: number of partitions allocated by Spark during segmentation
flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}
4.5. EXPLORE
numPartitions: number of partitions allocated by Spark during segmentation
maxDepth: maximum level of traversal during the recursive crawling phase
4.6. VISITEXPLORE
numPartitions: number of partitions allocated by Spark during segmentation
maxDepth: maximum level of traversal during the recursive crawling phase
4.7. WGETEXPLORE
numPartitions: number of partitions allocated by Spark during segmentation
maxDepth: maximum level of traversal during the recursive crawling phase
4.8. FLATTEN
alias: name of a variable that refers to each segment and can optionally be projected onto the output dataset
4.9. FLATSELECT
alias: name of a variable that refers to each segment and can optionally be projected onto the output dataset