There are a few optional keyword arguments that are useful only for S3 access. They are passed through to boto.s3_connect() as keyword arguments. The S3 reader supports gzipped content, as long as the key clearly identifies a gzipped file (e.g. its name ends with ".gz").
s3 hdfs webhdfs boto streaming file streaming-data gzip-stream bz2
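The tags above suggest this is the smart_open library; a minimal sketch under that assumption, with a placeholder bucket and key, showing how a gzipped S3 key would be streamed line by line:

    from smart_open import smart_open

    # The ".gz" suffix is what triggers transparent decompression.
    for line in smart_open('s3://my-bucket/logs/2018-01-01.log.gz'):
        print(line)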
Ibis is a toolbox to bridge the gap between local Python environments, remote storage, execution systems like Hadoop components (HDFS, Impala, Hive, Spark) and SQL databases. Its goal is to simplify analytical workflows and make you more productive. Learn more about using the library at http://ibis-project.org.
hadoop impala pandas hdfs ibis
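A minimal sketch of the kind of workflow this enables, assuming the older ibis.impala.connect / ibis.hdfs_connect entry points, with placeholder host names and table:

    import ibis

    hdfs = ibis.hdfs_connect(host='namenode.example.com', port=50070)
    con = ibis.impala.connect(host='impala.example.com', port=21050,
                              hdfs_client=hdfs)

    # Build the expression locally; execution happens on the cluster and the
    # result comes back as a pandas DataFrame.
    table = con.table('my_table')
    expr = table.group_by('key').size()
    print(expr.execute())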
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
cluster cluster-computing data-analytics analytics hdfs map-reduce big-data
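A minimal PySpark sketch of the in-memory model described above; the HDFS path and application name are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="hdfs-example")
    lines = sc.textFile("hdfs://namenode:8020/logs/app.log")
    lines.cache()  # keep the dataset in memory across both queries below

    print(lines.filter(lambda l: "ERROR" in l).count())
    print(lines.filter(lambda l: "WARN" in l).count())
    sc.stop()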
Snakebite is a Python library that provides a pure Python HDFS client and a wrapper around Hadoop's minicluster. The client uses protobuf for communicating with the NameNode and comes in the form of a library and a command line interface. Currently, the snakebite client supports most actions that involve the NameNode and reading data from DataNodes. Note: all methods that read data from a DataNode are able to check the CRC during transfer, but this is disabled by default for performance reasons. This is the opposite behaviour from the stock Hadoop client.
hdfs python-hdfs-client
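A minimal sketch of the client API, with a placeholder NameNode host and RPC port:

    from snakebite.client import Client

    client = Client('namenode.example.com', 8020)

    # Metadata calls go to the NameNode over protobuf RPC.
    for entry in client.ls(['/user']):
        print(entry['path'], entry['length'])

    # Reads stream from the DataNodes; most read methods return generators.
    for file_contents in client.cat(['/user/example.txt']):
        for chunk in file_contents:
            print(chunk)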
This is a native golang client for hdfs. It connects directly to the namenode using the protocol buffers API. It tries to be idiomatic by aping the stdlib os package, where possible, and implements the interfaces from it, including os.FileInfo and os.PathError.
hdfs commandline
Array data management made fast and easy. TileDB allows you to manage the massive dense and sparse multi-dimensional array data that frequently arise in many important scientific applications.
tiledb arrays storage-engine scientific-computing data-analysis hdfs s3 s3-storage
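A minimal sketch using the TileDB Python API to write and read a small dense array; the array URI is a local placeholder, though given the supported backends it could equally point at HDFS or S3:

    import numpy as np
    import tiledb

    uri = "example_dense_array"
    dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 3), tile=2, dtype=np.int32))
    schema = tiledb.ArraySchema(domain=dom, sparse=False,
                                attrs=[tiledb.Attr(name="a", dtype=np.float64)])
    tiledb.Array.create(uri, schema)

    with tiledb.open(uri, mode="w") as arr:
        arr[:] = np.arange(4, dtype=np.float64)
    with tiledb.open(uri, mode="r") as arr:
        print(arr[:]["a"])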
kafka-connect-hdfs is a Kafka Connector for copying data between Kafka and Hadoop HDFS. Documentation for this connector can be found here.
confluent kafka apache-kafka kafka-connect-hdfs kafka-connector hadoop hdfs big-data streaming
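A minimal sketch of registering the sink through the Kafka Connect REST API; the Connect endpoint, topic and HDFS URL are placeholders, and the config keys follow the Confluent HDFS sink quickstart:

    import json
    import requests

    connector = {
        "name": "hdfs-sink-example",
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "tasks.max": "1",
            "topics": "test_hdfs",
            "hdfs.url": "hdfs://namenode:8020",
            "flush.size": "3",
        },
    }
    resp = requests.post("http://localhost:8083/connectors",
                         data=json.dumps(connector),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()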
Dynamometer is a tool to performance test Hadoop's HDFS NameNode. The intent is to provide a real-world environment by initializing the NameNode against a production file system image and replaying a production workload collected via e.g. the NameNode's audit logs. This allows for replaying a workload which is not only similar in characteristic to that experienced in production, but actually identical. Dynamometer will launch a YARN application which starts a single NameNode and a configurable number of DataNodes, simulating an entire HDFS cluster as a single application. There is an additional workload job run as a MapReduce job which accepts audit logs as input and uses the information contained within to submit matching requests to the NameNode, inducing load on the service.
hadoop hadoop-filesystem hdfs hdfs-dfs testing testing-tools scale scale-up performance-testing performance-test performance-analysis performance-metrics hadoop-framework hadoop-hdfs
ArcGIS 10.4 GeoEvent Extension for Server sample Hadoop Output Connector for storing GeoEvents in HDFS. Find a bug or want to request a new feature? Please let us know by submitting an issue.
arcgis geoevent hadoop connector hdfs transport arcgis-geoevent-server big-data bigdata server
I am currently following and testing against the WebHDFS REST API documentation for the 1.2.1 release, by Apache. Make sure you enable WebHDFS in the hdfs site configuration file. I use Mocha and should.js for unit testing. They will be required if you want to run the unit tests. To execute the tests, simply run npm test, but install the requirements first. You will also likely need to adjust the constants in the test file first (or have a username "ryan" set up for hosts "endpoint1" and "endpoint2").
hdfs webhdfs http
RosbagInputFormat is an open source splittable Hadoop InputFormat for the ROS bag file format. For an example of a rosbag file larger than 2 GB, see doc/Rosbag larger than 2 GB.ipynb, which solves the issue https://github.com/valtech/ros_hadoop/issues/6. That issue was due to ByteBuffer being limited by the JVM Integer size and has nothing to do with Spark or how the RosbagMapInputFormat works within Spark. It was only problematic to extract the conf index with the jar.
hadoop ros robotics spark hdfs rosbag bag hadoop-inputformat machine-learning
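A minimal PySpark sketch of reading a bag file through the input format; the fully qualified class names, the conf key and the idx file path are assumptions based on the project's examples, and the HDFS path is a placeholder:

    from pyspark import SparkContext

    sc = SparkContext(appName="rosbag-example")
    rdd = sc.newAPIHadoopFile(
        path="hdfs://namenode:8020/data/example.bag",
        inputFormatClass="de.valtech.foss.RosbagMapInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf={"RosbagInputFormat.chunkIdx": "/data/example.bag.idx.bin"})
    print(rdd.count())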
API and command line interface for HDFS. See the documentation to learn more.
hdfs cli
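The description and tags match the HdfsCLI package; a minimal sketch under that assumption, with a placeholder WebHDFS URL and user:

    from hdfs import InsecureClient

    client = InsecureClient('http://namenode.example.com:50070', user='alice')
    print(client.list('/user/alice'))

    with client.write('/user/alice/hello.txt', encoding='utf-8') as writer:
        writer.write('hello from HdfsCLI\n')
    with client.read('/user/alice/hello.txt', encoding='utf-8') as reader:
        print(reader.read())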
N.B. we currently run Python 2.7 on the Hadoop cluster, so streaming Hadoop tasks need to stick to that version. Other code should be written in Python 3 but be compatible with both where possible.
warc cdx hdfs wayback webarchive web-archiving
Hadoop HDFS FSImage Exporter exports HDFS statistics from Hadoop HDFS FSImage file snapshots for Prometheus.
prometheus-exporter hadoop-fsimage hadoop hdfs monitoring hdfs-metrics
Tools for working with Hadoop written with performance in mind. By default, hh will behave the same as hdfs dfs or hadoop fs in terms of which user name to use for HDFS, or which namenodes to use.
hadoop haskell hdfs
Because the world needs yet another way to talk to HDFS from Python. This library provides a Python client for WebHDFS. NameNode HA is supported by passing in both NameNodes. Responses are returned as nice Python classes, and any failed operation will raise some subclass of HdfsException matching the Java exception.
hdfs webhdfs hadoop hadoop-filesystem
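This reads like the pyhdfs package; a minimal sketch under that assumption, passing both NameNodes for HA with placeholder host names:

    import pyhdfs

    fs = pyhdfs.HdfsClient(hosts='nn1.example.com:50070,nn2.example.com:50070',
                           user_name='alice')
    print(fs.listdir('/user/alice'))

    fs.create('/user/alice/greeting.txt', b'hello')
    try:
        fs.create('/user/alice/greeting.txt', b'again')  # path already exists
    except pyhdfs.HdfsFileAlreadyExistsException as exc:
        print('caught', type(exc).__name__)  # a subclass of HdfsException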
A Java API for creating unified big-data processing flows, providing an engine-independent programming model which can express both batch and stream transformations.
big-data apache-flink apache-spark java-api hadoop kafka hdfs unified-bigdata-processing streaming-data batch-processing
A few of the Hadoop, NoSQL, Web & Linux tools I've written over the years. All programs have --help to list the available options. For many more tools see DevOps Python Tools and the Advanced Nagios Plugins Collection which contains many more Hadoop, NoSQL and Linux/Web tools.
ambari kerberos hadoop hdfs hbase sql anonymize solr solrcloud nginx hive cassandra pig docker neo4j apache-drill mysql oracle recaser