JuiceFS provides a Hadoop-compatible FileSystem through its Hadoop Java SDK to support a variety of components in the Hadoop ecosystem.
NOTICE:
JuiceFS uses a local mapping of user names and UIDs, so you should sync all the needed users and their UIDs across the whole Hadoop cluster to avoid permission errors.
The JuiceFS Hadoop Java SDK is compatible with Hadoop 2.x and Hadoop 3.x, as well as a variety of components in the Hadoop ecosystem.
In order to make JuiceFS work with other components, it usually takes two steps:

1. Put the JAR file into the classpath of each Hadoop ecosystem component.
2. Add JuiceFS related configurations to the configuration file of the corresponding component (usually `core-site.xml`).

To compile the JAR file, you first need to install Go 1.13+, JDK 8+ and Maven, then run the following commands:
$ cd sdk/java
$ make
After compiling, you can find the JAR file in the `sdk/java/target` directory, e.g. `juicefs-hadoop-0.10.0.jar`. Beware of the file with the `original-` prefix: it doesn't contain third-party dependencies. It's recommended to use the JAR file that bundles the third-party dependencies.
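For example, you can list the build artifacts to confirm the build succeeded (a sketch; the exact version suffix depends on your checkout):

$ ls sdk/java/target/juicefs-hadoop-*.jar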
Note: The SDK can only be deployed on the same operating system it was compiled on. For example, if you compile the SDK on Linux, you must deploy it on Linux.
Then put the JAR file and `$JAVA_HOME/lib/tools.jar` into the classpath of each Hadoop ecosystem component. It's recommended to create a symbolic link to the JAR file (a sketch follows the tables below). The following tables describe where the JAR file should be placed.
Name | Installing Paths |
---|---|
CDH | `/opt/cloudera/parcels/CDH/lib/hadoop/lib`<br>`/opt/cloudera/parcels/CDH/spark/jars`<br>`/var/lib/impala` |
HDP | `/usr/hdp/current/hadoop-client/lib`<br>`/usr/hdp/current/hive-client/auxlib`<br>`/usr/hdp/current/spark2-client/jars` |
Amazon EMR | `/usr/lib/hadoop/lib`<br>`/usr/lib/spark/jars`<br>`/usr/lib/hive/auxlib` |
Alibaba Cloud EMR | `/opt/apps/ecm/service/hadoop/*/package/hadoop*/share/hadoop/common/lib`<br>`/opt/apps/ecm/service/spark/*/package/spark*/jars`<br>`/opt/apps/ecm/service/presto/*/package/presto*/plugin/hive-hadoop2`<br>`/opt/apps/ecm/service/hive/*/package/apache-hive*/lib`<br>`/opt/apps/ecm/service/impala/*/package/impala*/lib` |
Tencent Cloud EMR | `/usr/local/service/hadoop/share/hadoop/common/lib`<br>`/usr/local/service/presto/plugin/hive-hadoop2`<br>`/usr/local/service/spark/jars`<br>`/usr/local/service/hive/auxlib` |
UCloud UHadoop | `/home/hadoop/share/hadoop/common/lib`<br>`/home/hadoop/hive/auxlib`<br>`/home/hadoop/spark/jars`<br>`/home/hadoop/presto/plugin/hive-hadoop2` |
Baidu Cloud EMR | `/opt/bmr/hadoop/share/hadoop/common/lib`<br>`/opt/bmr/hive/auxlib`<br>`/opt/bmr/spark2/jars` |
Name | Installing Paths |
---|---|
Spark | `${SPARK_HOME}/jars` |
Presto | `${PRESTO_HOME}/plugin/hive-hadoop2` |
Flink | `${FLINK_HOME}/lib` |
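For example, a minimal sketch of linking the SDK into a CDH cluster (the `/opt/juicefs` source directory and the JAR version are assumptions; adjust them to your environment):

# symlink the compiled JAR and tools.jar into the Hadoop classpath
ln -s /opt/juicefs/juicefs-hadoop-0.10.0.jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/juicefs-hadoop.jar
ln -s $JAVA_HOME/lib/tools.jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/tools.jar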
Configuration | Default Value | Description |
---|---|---|
`fs.jfs.impl` | `io.juicefs.JuiceFileSystem` | The FileSystem implementation for `jfs://` URIs. If you want to use a different scheme (e.g. `cfs://`), rename this configuration to `fs.cfs.impl`. |
`fs.AbstractFileSystem.jfs.impl` | `io.juicefs.JuiceFS` | |
`juicefs.meta` | | Redis URL. Its format is `redis://<user>:<password>@<host>:<port>/<db>`. |
`juicefs.accesskey` | | Access key of object storage. See this document to learn how to get the access key for different object storages. |
`juicefs.secretkey` | | Secret key of object storage. See this document to learn how to get the secret key for different object storages. |
Configuration | Default Value | Description |
---|---|---|
`juicefs.cache-dir` | | Directory paths of local cache. Use colons to separate multiple paths; wildcards are also supported. It's recommended to create these directories manually and set permission 0777 so that different applications can share the cache data. |
`juicefs.cache-size` | 0 | Maximum size of local cache in MiB. It's the total size when multiple cache directories are set. |
`juicefs.discover-nodes-url` | | The URL to discover cluster nodes, refreshed every 10 minutes.<br>YARN: `yarn`<br>Spark Standalone: `http://spark-master:web-ui-port/json/`<br>Spark ThriftServer: `http://thrift-server:4040/api/v1/applications/`<br>Presto: `http://coordinator:discovery-uri-port/v1/service/presto/` |
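As an illustration, a minimal sketch for preparing the directories named in `juicefs.cache-dir` (the `/data1` and `/data2` data disks are assumptions):

# one cache directory per data disk, world-writable so different
# applications can share the cached data
mkdir -p /data1/jfs /data2/jfs
chmod 0777 /data1/jfs /data2/jfs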
Configuration | Default Value | Description |
---|---|---|
`juicefs.access-log` | | Access log path. Ensure the Hadoop application has write permission, e.g. `/tmp/juicefs.access.log`. The log file rotates automatically to keep at most 7 files. |
`juicefs.superuser` | `hdfs` | The superuser. |
`juicefs.no-usage-report` | `false` | Whether to disable usage reporting. JuiceFS only collects anonymous usage data (e.g. version number); no user data or any other sensitive data will be collected. |
When you use multiple JuiceFS file systems, all of these configurations can be set for a specific file system alone. Put the file system name in the middle of the configuration key, for example (replace `{JFS_NAME}` with the appropriate value):
<property>
<name>juicefs.{JFS_NAME}.meta</name>
<value>redis://host:port/1</value>
</property>
Note: Replace `{HOST}`, `{PORT}` and `{DB}` in `juicefs.meta` with appropriate values.
<property>
<name>fs.jfs.impl</name>
<value>io.juicefs.JuiceFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.jfs.impl</name>
<value>io.juicefs.JuiceFS</value>
</property>
<property>
<name>juicefs.meta</name>
<value>redis://{HOST}:{PORT}/{DB}</value>
</property>
<property>
<name>juicefs.cache-dir</name>
<value>/data*/jfs</value>
</property>
<property>
<name>juicefs.cache-size</name>
<value>1024</value>
</property>
<property>
<name>juicefs.access-log</name>
<value>/tmp/juicefs.access.log</value>
</property>
Add the configurations to `core-site.xml`.
Besides `core-site.xml`, you also need to edit the `mapreduce.application.classpath` configuration of the YARN component and append:
$HADOOP_COMMON_HOME/lib/juicefs-hadoop.jar
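For illustration, the resulting property might look like the following (a sketch; the entries before the JuiceFS JAR stand in for whatever your cluster already has on the classpath):

<property>
  <name>mapreduce.application.classpath</name>
  <!-- keep the existing entries as-is; only the last entry is new -->
  <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_COMMON_HOME/lib/juicefs-hadoop.jar</value>
</property>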
Besides `core-site.xml`, you also need to edit the `mapreduce.application.classpath` configuration of the MapReduce2 component and append (the variables do not need to be replaced):
/usr/hdp/${hdp.version}/hadoop/lib/juicefs-hadoop.jar
Add the configurations to `conf/flink-conf.yaml`. You only need to set up the Flink client; the Hadoop configurations do not need to be modified.
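For illustration, a sketch of the corresponding `conf/flink-conf.yaml` entries, assuming Flink accepts the same keys as `core-site.xml` in its `key: value` form (replace the Redis placeholders as above):

fs.jfs.impl: io.juicefs.JuiceFileSystem
juicefs.meta: redis://{HOST}:{PORT}/{DB}
juicefs.cache-dir: /data*/jfs
juicefs.cache-size: 1024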
When the following components need to access JuiceFS, they should be restarted.
Note: Before restarting, confirm that the JuiceFS related configuration has been written to the configuration file of each component, usually `core-site.xml` on the machine where the component's service is deployed.
Components | Services |
---|---|
Hive | HiveServer<br>Metastore |
Spark | ThriftServer |
Presto | Coordinator<br>Worker |
Impala | Catalog Server<br>Daemon |
HBase | Master<br>RegionServer |
HDFS, Hue, ZooKeeper and other services don't need to be restarted.
If a `Class io.juicefs.JuiceFileSystem not found` or `No FileSystem for scheme: jfs` exception occurs after the restart, refer to the FAQ below.
Verify that JuiceFS is accessible from the Hadoop shell:
$ hadoop fs -ls jfs://{JFS_NAME}/
You can also create a Hive table located on JuiceFS:
CREATE TABLE IF NOT EXISTS person
(
name STRING,
age INT
) LOCATION 'jfs://{JFS_NAME}/tmp/person';
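As a hypothetical smoke test of the new table (the sample row is made up), write and read back one record:

INSERT INTO person VALUES ('tom', 25);
SELECT * FROM person;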
JuiceFS provides some benchmark tools you can run once JuiceFS has been deployed:
create
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBenchWithoutMR -operation create -numberOfFiles 10000 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench_local
It creates 10000 empty files without writing data.
open
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBenchWithoutMR -operation open -numberOfFiles 10000 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench_local
It opens 10000 files without reading data.
rename
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBenchWithoutMR -operation rename -numberOfFiles 10000 -bytesPerBlock 134217728 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench_local
delete
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBenchWithoutMR -operation delete -numberOfFiles 10000 -bytesPerBlock 134217728 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench_local
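Between runs you can clean up whatever the benchmark left behind (a sketch; the path matches the -baseDir used above):

hadoop fs -rm -r jfs://{JFS_NAME}/benchmarks/nnbench_local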
For reference:
operation | tps | delay(ms) |
---|---|---|
create | 546 | 1.83 |
open | 1135 | 0.88 |
rename | 364 | 2.75 |
delete | 289 | 3.46 |
sequential write
hadoop jar juicefs-hadoop.jar io.juicefs.bench.TestFSIO -write -fileSize 20000 -baseDir jfs://{JFS_NAME}/benchmarks/fsio
sequential read
hadoop jar juicefs-hadoop.jar io.juicefs.bench.TestFSIO -read -fileSize 20000 -baseDir jfs://{JFS_NAME}/benchmarks/fsio
If you run the command a second time, the result may be much better than the first run. That's because the data was cached in memory; clean the local disk cache before re-running.
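A minimal sketch for clearing both caches between runs (requires root; `/data*/jfs` is the example cache directory configured earlier in this document):

# drop the OS page cache so the second run is not served from memory
sync && echo 3 > /proc/sys/vm/drop_caches
# remove the JuiceFS local disk cache (example path from juicefs.cache-dir)
rm -rf /data*/jfs/*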
For reference:
operation | throughput(MB/s) |
---|---|
write | 453 |
read | 141 |
The distributed benchmark uses a MapReduce program to test metadata and IO throughput performance.
Enough resources should be provided to make sure all map tasks can be started at the same time.
We used three 4c32g ECS instances (5 Gbit/s) and an Aliyun Redis 5.0 4G instance for this benchmark.
create
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBench -operation create -threadsPerMap 10 -maps 10 -numberOfFiles 1000 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench
10 map tasks, each with 10 threads; each thread creates 1000 empty files, 100000 files in total.
open
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBench -operation open -threadsPerMap 10 -maps 10 -numberOfFiles 1000 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench
10 map tasks, each with 10 threads; each thread opens 1000 files, 100000 files in total.
rename
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBench -operation rename -threadsPerMap 10 -maps 10 -numberOfFiles 1000 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench
10 map tasks, each with 10 threads; each thread renames 1000 files, 100000 files in total.
delete
hadoop jar juicefs-hadoop.jar io.juicefs.bench.NNBench -operation delete -threadsPerMap 10 -maps 10 -numberOfFiles 1000 -baseDir jfs://{JFS_NAME}/benchmarks/nnbench
10 map tasks, each with 10 threads; each thread deletes 1000 files, 100000 files in total.
For reference:
operation | tps | delay(ms) |
---|---|---|
create | 2307 | 3.6 |
open | 3215 | 2.3 |
rename | 1700 | 5.22 |
delete | 1378 | 6.7 |
operation | tps | delay(ms) |
---|---|---|
create | 8375 | 11.5 |
open | 12691 | 7.5 |
rename | 5343 | 18.4 |
delete | 3576 | 27.6 |
sequential write
hadoop jar juicefs-hadoop.jar io.juicefs.bench.TestDFSIO -write -nrFiles 10 -fileSize 10000 -baseDir jfs://{JFS_NAME}/benchmarks/fsio
10 map tasks; each task sequentially writes 10000 MB of random data.
sequential read
hadoop jar juicefs-hadoop.jar io.juicefs.bench.TestDFSIO -read -nrFiles 10 -fileSize 10000 -baseDir jfs://{JFS_NAME}/benchmarks/fsio
10 map tasks; each task sequentially reads 10000 MB of random data.
For reference:
operation | total throughput(MB/s) |
---|---|
write | 1792 |
read | 1409 |
`Class io.juicefs.JuiceFileSystem not found` exception
It means the JAR file was not loaded; you can verify this with `lsof -p {pid} | grep juicefs`.
You should check whether the JAR file is placed properly and whether other users have read permission on it.
Some Hadoop distributions also need `mapred-site.xml` modified, appending the JAR file path to the end of the `mapreduce.application.classpath` parameter.
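For example, a sketch of running that check against a component's JVM (`jps` ships with the JDK; HiveServer2 is just an example process):

# find the pid of the component's JVM, then inspect its open files
jps | grep -i hiveserver
lsof -p {pid} | grep juicefs   # replace {pid} with the pid printed above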
`No FileSystem for scheme: jfs` exception
It means the JuiceFS Hadoop Java SDK was not configured properly; check whether the JuiceFS related configuration is present in the `core-site.xml` of the component.