Combine, filter, group: SQL scripts for obtaining samples

The SQL language is an integral tool when working with data. Magento developers or testers use it to store, change and process large amounts of information.

Selection, sorting, and limits

How do we obtain samples from several data sources, applying the necessary restrictions in a specific situation? Consider the process using the user's table as an example. It is located in the product schema (folder) and contains user registration data.

The same data is sorted by age and country code:

reg_dt id gender age countre_code app
2014-01-02 17:30:01 234533 f 34 CH mobile
2014-03-02 17:54:01 343422 m 56 US mobile
2014-03-10 18:44:01 222334 f 43 US dekstop
2014-03-12 11:54:01 345675 m 23 GE mobile
2014-03-12 19:52:01 234456 m 65 US dekstop

SELECT reg_dt, id, gender, age, country_code, app

FROM product.users

ORDER BY age DESC, country_code

The ORDER BY command will sort the field in ascending order, and the DESC command will sort the area in descending order. To select all fields, you need to specify * instead of column names SELECT * FROM product.users.

Query formation, use of logical operators, and WHERE conditions

Let's say we need to sort female users over 45.

SELECT reg_dt, id, gender, age, country_code, app

FROM product.users

WHERE gender = 'f' AND age > 45 – specify the conditions for the output fields.

ORDER BY age DESC, country_code

In the WHERE condition, you can use many criteria that AND and OR logical operators connect. When using OR, be careful and use parentheses so that the alternatives are clearly defined. In SQL, you can use line wraps and indents for formatting; they do not affect the execution of the query.

When we have completed the request, we will get a list of users. Some are registered outside the time range we have highlighted - 2012-2014.

It is because the OR condition specified at the end cancels all previous conditions connected by the AND operator. We interpret this condition as "give the specified fields to users registered on the specified date and have a .us domain or just a .com domain (without date conditions)." It is because AND is executed before OR.

If you want the date condition to be preserved for both domains, you must highlight the alternatives for the OR operator using parentheses:

WHERE reg_dt BETWEEN '2012-01-01' AND '2015-01-01' AND (site LIKE '%.us' OR site LIKE '%.com')

The condition can also be formulated as a result of transformations:

SELECT age, site, LENGTH(site)

FROM product.users

WHERE age%2 != 0 OR LENGTH(site) > 17

Samples combination

To get data from additional tables, we use LEFT JOIN. Our source table is on the left, and it's like we're adding the corresponding rows from the orders table.

For the gender field, we used the alias u.gender AS sex. You can use the word AS or write an alias separated by a space to do this.

Aliases can also be applied to table names with the same syntax (with or without AS).

After executing the query, we got the following table:

Users who made several purchases were duplicated in the table. And for users who do not have payments, the value NULL was applied.

Next, we work with the received data set according to the same principle we used to work with one table — we select the fields we are interested in, filter and sort.

What is the difference between conditions in WHERE and LEFT JOIN

The condition in LEFT JOIN affects only the fields from the table we are joining.

SELECT u.id, op.service_id, op.user_id

FROM product.users u

LEFT JOIN product.orders_paid op ON u.id = op.user_id AND op.service_id = 14

First, we filtered the orders_paid table, and only all orders of service #14 remained. Then they were combined, and all registered users were obtained from users. For those with an order for service No. 14, separate lines from the order table were output for all other users - NULL.

When the condition is in WHERE, it filters the selection already after joining the tables.

Before filtering:

SELECT u.id, op.service_id, op.user_id

FROM product.users u

LEFT JOIN product.orders_paid op ON u.id = op.user_id

After filtering:

SELECT u.id, op.service_id, op.user_id

FROM product.users u

LEFT JOIN product.orders_paid op ON u.id = op.user_id

WHERE op.service_id = 14

We can add the same table several times or add it to itself. Our orders table has a parent_order_id field, representing the order that initiated the user's subscription. This way, the first payments, and subscription renewals are stored in the same table, and we link the orders_paid table to itself by the parent-payment ID.

We can also add a reference table with types of services. We use INNER JOIN to connect to orders_paid so that only users with orders remain.

SELECT u.id, opp.order_id parent_order, opp.dt AS parent_dt, op.order_id, op.dt AS order_dt, ps.service_name

FROM product.users u

INNER JOIN product.orders_paid op ON u.id = op.user_id

LEFT JOIN product.orders_paid opp ON op.parent_order_id = opp.order_id

LEFT JOIN product.pay_services ps ON ps.id = op.service_id

ORDER BY op.order_id

Here we see that parent_order = order_id for the payment that triggers the subscription, parent_order = order_id. And the VIP service does not have a parent_order because it is a different type of service. So parent_dt is the subscription start date, and order_dt is the specific payment date.