Note: Parquet Support is now obsolete.
The IBM FHIR Server has early support for Bulk Data export to the Apache Parquet format using the Apache Spark libraries. New as of version 4.4.0, the export to parquet feature requires:
- Apache Spark v3.0 and the IBM Stocator adapter (version 1.1)
- the configuration property /fhirServer/bulkdata/storageProviders/(source)/enableParquet set to true
The Parquet Bulk Data export is activated using a custom _outputFormat in the export request:
{
"name": "_outputFormat",
"valueString": "application/fhir+parquet"
},
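For context, the fragment above is one entry in the parameter array of a FHIR Parameters resource, which is the body a POST-based $export request carries. A minimal sketch of assembling such a body follows. The function name build_export_parameters and the extra _type entry are illustrative assumptions, not part of the server's tooling.

```shell
# Sketch: print a FHIR Parameters body that requests Parquet output.
# build_export_parameters is a hypothetical helper; the _type entry is an
# illustrative assumption (defaults to Patient).
build_export_parameters() {
  cat <<EOF
{
  "resourceType": "Parameters",
  "parameter": [
    {
      "name": "_outputFormat",
      "valueString": "application/fhir+parquet"
    },
    {
      "name": "_type",
      "valueString": "${1:-Patient}"
    }
  ]
}
EOF
}
```

Calling build_export_parameters Patient > export-request.json writes a body that could then be posted to the $export endpoint with Content-Type: application/fhir+json.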
Let me show you how to build a custom IBM FHIR Server container with Parquet support, based on the Docker image ibmcom/ibm-fhir-server. Version 4.9.0 or higher is recommended.
Recipe
- For versions prior to 4.9.0, build the Maven projects and the Docker image. You should see [INFO] BUILD SUCCESS after each Maven build, and docker.io/ibmcom/ibm-fhir-server:latest when the Docker build is successful.
mvn clean install -f fhir-examples -B -DskipTests -ntp
mvn clean install -f fhir-parent -B -DskipTests -ntp
docker build -t ibmcom/ibm-fhir-server:latest fhir-install
- Download the dependency files for parquet and stocator.
export WORKSPACE=~/git/wffh/2021/fhir
bash ${WORKSPACE}/fhir-bulkdata-webapp/src/main/sh/cache-parquet-deps.sh
- Download the fhir-server-config.json
curl -L -o fhir-server-config.json \
https://raw.githubusercontent.com/IBM/FHIR/main/fhir-server/liberty-config/config/default/fhir-server-config.json
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 8423 100 8423 0 0 40495 0 --:--:-- --:--:-- --:--:-- 40301
- Update the fhir-server-config.json to use an IBM COS storage provider with Parquet support. You’ll need to supply your HMAC credentials and your internal and external endpoint URLs, and set enableParquet to true.
"storageProviders": {
    "default": {
        "type": "ibm-cos",
        "bucketName": "fhir-performance",
        "location": "us-east",
        "endpointInternal": "https://s3.us-east.cloud-object-storage.appdomain.cloud",
        "endpointExternal": "https://s3.us-east.cloud-object-storage.appdomain.cloud",
        "auth": {
            "type": "hmac",
            "accessKeyId": "key",
            "secretAccessKey": "secret"
        },
        "enableParquet": true,
        "disableOperationOutcomes": true,
        "duplicationCheck": false,
        "validateResources": false,
        "create": false
    }
}
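Before starting the container, it can help to confirm that the edited file is still valid JSON and that the flag landed at the expected path. The sketch below assumes jq is installed and that the storage provider sits under fhirServer/bulkdata/storageProviders, as the configuration path given earlier suggests. The function name check_parquet_enabled is hypothetical.

```shell
# Sketch: succeed only if enableParquet is true for the given storage provider.
# check_parquet_enabled is a hypothetical helper; it assumes jq is available.
#   $1: path to fhir-server-config.json
#   $2: storage provider name (defaults to "default")
check_parquet_enabled() {
  # jq -e exits non-zero when the expression evaluates to false or null,
  # which also covers a missing provider or a missing flag.
  jq -e --arg provider "${2:-default}" \
    '.fhirServer.bulkdata.storageProviders[$provider].enableParquet == true' \
    "$1" > /dev/null
}
```

For example, check_parquet_enabled fhir-server-config.json || echo "enableParquet is not set" reports a problem before the server is started.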
- Start the Docker container, and capture the container id. It’s going to take a few moments to start up as it lays down the test database.
docker run -d -p 9443:9443 -e BOOTSTRAP_DB=true \
-v $(pwd)/fhir-server-config.json:/config/config/default/fhir-server-config.json \
-v $(pwd)/deps:/config/userlib/ \
ibmcom/ibm-fhir-server
3f8e90f20cd42129adc58df8a0295efc3fb2a0f4507350589f71939a072999ae
- Check the logs until you see the server-ready message (CWWKF0011I):
docker logs 3f8e90f20cd42129adc58df8a0295efc3fb2a0f4507350589f71939a072999ae
...
[6/16/21, 15:31:34:533 UTC] 0000002a FeatureManage A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 17.665 seconds.
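Rather than re-running docker logs by hand, the wait can be scripted. The following is a rough sketch that polls the container logs for the CWWKF0011I message code shown above. The function names, the 5-second delay, and the default of 60 attempts are all assumptions chosen for illustration.

```shell
# Sketch: wait until the container log contains the server-ready message.
# log_shows_ready and wait_for_server are hypothetical helpers.

# Reads log text on stdin; succeeds if the ready message code is present.
log_shows_ready() {
  grep -q 'CWWKF0011I'
}

# $1: container id, $2: maximum number of attempts (assumed default: 60)
wait_for_server() {
  container_id="$1"
  attempts="${2:-60}"
  while [ "$attempts" -gt 0 ]; do
    if docker logs "$container_id" 2>&1 | log_shows_ready; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 5   # assumed polling interval
  done
  echo "server did not report ready in time" >&2
  return 1
}
```

With the container id captured in a variable, wait_for_server "$CONTAINER_ID" blocks until the server is ready or the attempts run out.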
- Download the Sample Data
curl -L https://raw.githubusercontent.com/IBM/FHIR/main/fhir-server-test/src/test/resources/testdata/everything-operation/Antonia30_Acosta403.json \
-o Antonia30_Acosta403.json
- Load the Sample Data bundle to the IBM FHIR Server
curl -k --location --request POST 'https://localhost:9443/fhir-server/api/v4' \
--header 'Content-Type: application/fhir+json' \
--user "fhiruser:${DUMMY_PASSWORD}" \
--data-binary "@Antonia30_Acosta403.json" -o response.json
Note: DUMMY_PASSWORD must be set beforehand.
- Scan the response.json for any status that is not "status": "201". For example, a status in the 4xx (client error) or 5xx (server error) family means that entry failed to load.
cat response.json | jq -r '.entry[].response.status' | sort -u
201
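The same check can be turned into a pass/fail step for a script. The sketch below reuses the jq expression from the command above and fails when any entry reports something other than 201. The function name check_bundle_statuses is hypothetical, and the check assumes every entry in the bundle is expected to be created.

```shell
# Sketch: fail if any entry in a transaction response has a non-201 status.
# check_bundle_statuses is a hypothetical helper; it assumes jq is available.
#   $1: path to the response bundle (for example response.json)
check_bundle_statuses() {
  # Return 2 if the file cannot be parsed or has no entry array.
  statuses="$(jq -r '.entry[].response.status' "$1")" || return 2
  # Keep only the distinct statuses that do not start with 201.
  failures="$(printf '%s\n' "$statuses" | sort -u | grep -v '^201' || true)"
  if [ -n "$failures" ]; then
    echo "entries failed with status: $failures" >&2
    return 1
  fi
}
```

Running check_bundle_statuses response.json && echo "all entries created" gives a one-line confirmation before moving on to the export.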
- Call the export to Parquet operation, and grab the content-location.
curl --location --request GET 'https://localhost:9443/fhir-server/api/v4/$export?_outputFormat=application/fhir%2Bparquet&_type=Patient' \
--header 'X-FHIR-TENANT-ID: default' \
--user "fhiruser:${DUMMY_PASSWORD}" \
--header 'Content-Type: application/json' -k -v
< content-location: https://localhost:9443/fhir-server/api/v4/$bulkdata-status?job=LqzauvqtHSmkpChVHo%2B1MQ
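To avoid copying the status URL out of the verbose output by hand, the header can be parsed from the response. This is a sketch only: extract_content_location is a hypothetical helper that reads response headers on stdin and accepts both the "< " prefixed lines that curl -v prints and the plain header lines that curl -D - prints.

```shell
# Sketch: print the value of the Content-Location header found on stdin.
# extract_content_location is a hypothetical helper.
extract_content_location() {
  # Strip carriage returns and the "< " prefix used by curl -v, then match
  # the header name case-insensitively and print its value.
  tr -d '\r' \
    | sed 's/^< //' \
    | awk 'tolower($1) == "content-location:" { print $2; exit }'
}

# Possible usage (same request as above, with headers dumped to stdout):
#   STATUS_URL="$(curl -k -s -D - -o /dev/null \
#     --header 'X-FHIR-TENANT-ID: default' \
#     --user "fhiruser:${DUMMY_PASSWORD}" \
#     'https://localhost:9443/fhir-server/api/v4/$export?_outputFormat=application/fhir%2Bparquet&_type=Patient' \
#     | extract_content_location)"
```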
- Check the export status using the previous URL. Once you see a 200 response, the export is complete and the response body lists the URLs of your exported Parquet data.
curl --location --request GET 'https://localhost:9443/fhir-server/api/v4/$bulkdata-status?job=LqzauvqtHSmkpChVHo%2B1MQ' \
--header 'X-FHIR-TENANT-ID: default' \
--user "fhiruser:${DUMMY_PASSWORD}" \
--header 'Content-Type: application/json' -k
{
"transactionTime": "2021-08-09T00:34:11.594Z",
"request": "https://localhost:9443/fhir-server/api/v4/$export?_outputFormat=application/fhir%2Bparquet&_type=Patient",
"requiresAccessToken": false,
"output": [
{
"type": "Patient",
"url": "https://s3.us-east.cloud-object-storage.appdomain.cloud/fhir-performance/AZ0gsQS05_RqZnHPhj57AfhYSIHU8VzwmnWjDCQdi2I/Patient_1.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=fc85bf9cc1ac49e99e40085f9ba00f77%2F20210809%2Fus-east%2Fs3%2Faws4_request&X-Amz-Date=20210809T003601Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=6d54f677b91d92304caf889eb0a1efbc2b3ebe3d24cefd9c17169b21816d1cdf",
"count": 1
}
]
}
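The status check usually has to be repeated until the job finishes. Under the FHIR Bulk Data specification, a status request returns 202 Accepted while the export is still in progress and 200 once it is complete. The sketch below polls on that basis. The function names, the output file export-status.json, and the default attempt count and delay are assumptions for illustration.

```shell
# Sketch: poll a bulk data status URL until the export completes.
# fetch_status_code and wait_for_export are hypothetical helpers.

# Prints the HTTP status code and saves the response body to export-status.json.
fetch_status_code() {
  curl -k -s -o export-status.json -w '%{http_code}' \
    --header 'X-FHIR-TENANT-ID: default' \
    --user "fhiruser:${DUMMY_PASSWORD}" \
    "$1"
}

# $1: status URL, $2: max attempts (assumed default 30), $3: delay in seconds (assumed default 10)
wait_for_export() {
  status_url="$1"
  max_attempts="${2:-30}"
  delay="${3:-10}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    code="$(fetch_status_code "$status_url")"
    case "$code" in
      200) return 0 ;;                # export finished
      202) sleep "$delay" ;;          # still in progress, try again
      *)   echo "unexpected status: $code" >&2
           return 1 ;;
    esac
    attempt=$((attempt + 1))
  done
  echo "export did not finish after $max_attempts attempts" >&2
  return 2
}
```

After wait_for_export returns 0, export-status.json holds the completed response, including the output array with the signed URLs.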


You now have a working Parquet output.