Apache Phoenix performance best practices
The most important aspect of Apache Phoenix performance is to optimize the underlying Apache HBase. Phoenix creates a relational data model atop HBase that converts SQL queries into HBase operations, such as scans. The design of your table schema, the selection and ordering of the fields in your primary key, and your use of indexes all affect Phoenix performance.
Table schema design
When you create a table in Phoenix, that table is stored in an HBase table. The HBase table contains groups of columns (column families) that are accessed together. A row in the Phoenix table is a row in the HBase table, where each row consists of versioned cells associated with one or more columns. Logically, a single HBase row is a collection of key-value pairs, each having the same rowkey value. That is, each key-value pair has a rowkey attribute, and the value of that rowkey attribute is the same for a particular row.
The schema design of a Phoenix table includes the primary key design, column family design, individual column design, and how the data is partitioned.
Primary key design
The primary key defined on a table in Phoenix determines how data is stored within the rowkey of the underlying HBase table. In HBase, the only way to access a particular row is with the rowkey. In addition, data stored in an HBase table is sorted by the rowkey. Phoenix builds the rowkey value by concatenating the values of each of the columns in the row, in the order they're defined in the primary key.
For example, a table for contacts has the first name, last name, phone number, and address, all in the same column family. You could define a primary key based on an increasing sequence number:
rowkey | address | phone | firstName | lastName |
---|---|---|---|---|
1000 | 1111 San Gabriel Dr. | 1-425-000-0002 | John | Dole |
8396 | 5415 San Gabriel Dr. | 1-230-555-0191 | Calvin | Raji |
However, if you frequently query by lastName this primary key may not perform well, because each query requires a full table scan to read the value of every lastName. Instead, you can define a primary key on the lastName, firstName, and social security number columns. This last column is to disambiguate two residents at the same address with the same name, such as a father and son.
rowkey | address | phone | firstName | lastName | socialSecurityNum |
---|---|---|---|---|---|
1000 | 1111 San Gabriel Dr. | 1-425-000-0002 | John | Dole | 111 |
8396 | 5415 San Gabriel Dr. | 1-230-555-0191 | Calvin | Raji | 222 |
With this new primary key the row keys generated by Phoenix would be:
rowkey | address | phone | firstName | lastName | socialSecurityNum |
---|---|---|---|---|---|
Dole-John-111 | 1111 San Gabriel Dr. | 1-425-000-0002 | John | Dole | 111 |
Raji-Calvin-222 | 5415 San Gabriel Dr. | 1-230-555-0191 | Calvin | Raji | 222 |
In the first row of given table, the data for the rowkey is represented as shown:
rowkey | key | value |
---|---|---|
Dole-John-111 | address | 1111 San Gabriel Dr. |
Dole-John-111 | phone | 1-425-000-0002 |
Dole-John-111 | firstName | John |
Dole-John-111 | lastName | Dole |
Dole-John-111 | socialSecurityNum | 111 |
This rowkey now stores a duplicate copy of the data. Consider the size and number of columns you include in your primary key, because this value is included with every cell in the underlying HBase table.
Also, if the primary key has values that are monotonically increasing, you should create the table with salt buckets to help avoid creating write hotspots - see Partition data.
Column family design
If some columns are accessed more frequently than others, you should create multiple column families to separate the frequently accessed columns from rarely accessed columns.
Also, if certain columns tend to be accessed together, put those columns in the same column family.
Column design
- Keep VARCHAR columns under about 1 MB because of the I/O costs of large columns. When processing queries, HBase materializes cells in full before sending them over to the client, and the client receives them in full before handing them off to the application code.
- Store column values using a compact format such as protobuf, Avro, msgpack, or BSON. JSON isn't recommended, as it's larger.
- Consider compressing data before storage to cut latency and I/O costs.
Partition data
Phoenix enables you to control the number of regions where your data is distributed, which can significantly increase read/write performance. When creating a Phoenix table, you can either salt or pre-split your data.
To salt a table during creation, specify the number of salt buckets:
CREATE TABLE CONTACTS (...) SALT_BUCKETS = 16
This salting splits the table along the values of primary keys, choosing the values automatically.
To control where the table splits occur, you can pre-split the table by providing the range values along which the splitting occurs. For example, to create a table split along three regions:
CREATE TABLE CONTACTS (...) SPLIT ON ('CS','EU','NA')
Index design
A Phoenix index is an HBase table that stores a copy of some or all of the data from the indexed table. An index improves performance for specific types of queries.
When you have multiple indexes defined and then query a table, Phoenix automatically selects the best index for the query. The primary index is created automatically based on the primary keys you select.
For anticipated queries, you can also create secondary indexes by specifying their columns.
When designing your indexes:
- Only create the indexes you need.
- Limit the number of indexes on frequently updated tables. Updates to a table translate into writes to both the main table and the index tables.
Create secondary indexes
Secondary indexes can improve read performance by turning what would be a full table scan into a point lookup, at the cost of storage space and write speed. Secondary indexes can be added or removed after table creation and don’t require changes to existing queries – queries just run faster. Depending on your needs, consider creating covered indexes, functional indexes, or both.
Use covered indexes
Covered indexes are indexes that include data from the row in addition to the values that are indexed. After finding the desired index entry, there's no need to access the primary table.
For example, in the example contact table you could create a secondary index on just the socialSecurityNum column. This secondary index would speed up queries that filter by socialSecurityNum values, but retrieving other field values require another read against the main table.
rowkey | address | phone | firstName | lastName | socialSecurityNum |
---|---|---|---|---|---|
Dole-John-111 | 1111 San Gabriel Dr. | 1-425-000-0002 | John | Dole | 111 |
Raji-Calvin-222 | 5415 San Gabriel Dr. | 1-230-555-0191 | Calvin | Raji | 222 |
However, if you typically want to look up the firstName and lastName given the socialSecurityNum, you could create a covered index that includes the firstName and lastName as actual data in the index table:
CREATE INDEX ssn_idx ON CONTACTS (socialSecurityNum) INCLUDE(firstName, lastName);
This covered index enables the following query to acquire all data just by reading from the table containing the secondary index:
SELECT socialSecurityNum, firstName, lastName FROM CONTACTS WHERE socialSecurityNum > 100;
Use functional indexes
Functional indexes allow you to create an index on an arbitrary expression that you expect to be used in queries. Once you have a functional index in place and a query uses that expression, the index may be used to retrieve the results rather than the data table.
For example, you could create an index to allow you to do case-insensitive searches on the combined first name and last name of a person:
CREATE INDEX FULLNAME_UPPER_IDX ON "Contacts" (UPPER("firstName"||' '||"lastName"));
Query design
The main considerations in query design are:
- Understand the query plan and verify its expected behavior.
- Join efficiently.
Understand the query plan
In SQLLine, use EXPLAIN followed by your SQL query to view the plan of operations that Phoenix performs. Check that the plan:
- Uses your primary key when appropriate.
- Uses appropriate secondary indexes, rather than the data table.
- Uses RANGE SCAN or SKIP SCAN whenever possible, rather than TABLE SCAN.
Plan examples
As an example, say you have a table called FLIGHTS that stores flight delay information.
To select all the flights with an airlineid
of 19805
, where airlineid
is a field that isn't in the primary key or in any index:
select * from "FLIGHTS" where airlineid = '19805';
Run the explained command as follows:
explain select * from "FLIGHTS" where airlineid = '19805';
The query plan looks like this:
CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER FLIGHTS
SERVER FILTER BY AIRLINEID = '19805'
In this plan, note the phrase FULL SCAN OVER FLIGHTS. This phrase indicates the execution does a TABLE SCAN over all rows in the table, rather than using the more efficient RANGE SCAN or SKIP SCAN option.
Now, say you want to query for flights on January 2, 2014 for the carrier AA
where its flightnum was greater than 1. Let's assume that the columns year, month, dayofmonth, carrier, and flightnum exist in the example table, and are all part of the composite primary key. The query would look as follows:
select * from "FLIGHTS" where year = 2014 and month = 1 and dayofmonth = 2 and carrier = 'AA' and flightnum > 1;
Let's examine the plan for this query with:
explain select * from "FLIGHTS" where year = 2014 and month = 1 and dayofmonth = 2 and carrier = 'AA' and flightnum > 1;
The resulting plan is:
CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER FLIGHTS [2014,1,2,'AA',2] - [2014,1,2,'AA',*]
The values in square brackets are the range of values for the primary keys. In this case, the range values are fixed with year 2014, month 1, and dayofmonth 2, but allow values for flightnum starting with 2 and on up (*
). This query plan confirms that the primary key is being used as expected.
Next, create an index on the FLIGHTS table named carrier2_idx
that is on the carrier field only. This index also includes flightdate
, tailnum
, origin
, and flightnum
as covered columns whose data is also stored in the index.
CREATE INDEX carrier2_idx ON FLIGHTS (carrier) INCLUDE(FLIGHTDATE,TAILNUM,ORIGIN,FLIGHTNUM);
Say you want to get the carrier along with the flightdate
and tailnum
, as in the following query:
explain select carrier,flightdate,tailnum from "FLIGHTS" where carrier = 'AA';
You should see this index being used:
CLIENT 1-CHUNK PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER CARRIER2_IDX ['AA']
For a complete listing of the items that can appear in explain plan results, see the Explain Plans section in the Apache Phoenix Tuning Guide.
Join efficiently
Generally, you want to avoid joins unless one side is small, especially on frequent queries.
If necessary, you can do large joins with the /*+ USE_SORT_MERGE_JOIN */
hint, but a large join is an expensive operation over huge numbers of rows. If the overall size of all right-hand-side tables would exceed the available memory, use the /*+ NO_STAR_JOIN */
hint.
Scenarios
The following guidelines describe some common patterns.
Read-heavy workloads
For read-heavy use cases, make sure you're using indexes. Additionally, to save read-time overhead, consider creating covered indexes.
Write-heavy workloads
For write-heavy workloads where the primary key is monotonically increasing, create salt buckets to help avoid write hotspots, at the expense of overall read throughput because of the additional scans needed. Also, when using UPSERT to write a large number of records, turn off autoCommit and batch up the records.
Bulk deletes
When deleting a large data set, turn on autoCommit before issuing the DELETE query, so that the client doesn't need to remember the row keys for all deleted rows. AutoCommit prevents the client from buffering the rows affected by the DELETE, so that Phoenix can delete them directly on the region servers without the expense of returning them to the client.
Immutable and Append-only
If your scenario favors write speed over data integrity, consider disabling the write-ahead log when creating your tables:
CREATE TABLE CONTACTS (...) DISABLE_WAL=true;
For details on this and other options, see Apache Phoenix Grammar.
Next steps
Feedback
https://aka.ms/ContentUserFeedback.
Coming soon: Throughout 2024 we will be phasing out GitHub Issues as the feedback mechanism for content and replacing it with a new feedback system. For more information see:Submit and view feedback for