DataSphereStudio - DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling

  •        52

DataSphere Studio (DSS for short) is WeDataSphere, a big data platform of WeBank, a self-developed one-stop data application development management portal. Based on Linkis computation middleware, DSS can easily integrate upper-level data application systems, making data application development simple and easy to use.

https://github.com/WeBankFinTech/DataSphereStudio


Dependencies:

org.scala-lang:scala-library:2.11.8
org.scala-lang:scala-compiler:2.11.8
org.scala-lang:scala-reflect:2.11.8
org.scala-lang:scalap:2.11.8
commons-lang:commons-lang:2.6

Tags
Implementation
License
Platform

   




Related Projects

Scriptis - Scriptis is for interactive data analysis with script development(SQL, Pyspark, HiveQL), task submission(Spark, Hive), UDF, function, resource management and intelligent diagnosis

  •    Vue

Scriptis is for interactive data analysis with script development(SQL, Pyspark, HiveQL), task submission(Spark, Hive), UDF, function, resource management and intelligent diagnosis. Script editor: Support multi-language, auto-completion, syntax highlighting and SQL syntax error-correction.

WeDataSphere - WeDataSphere is a financial level one-stop open-source suitcase for big data platforms

  •    

DataSphere Studio, Linkis, Scriptis, Qualitis, Schedulis, Exchangis. DataSphere Studio is positioned as a data application development portal, and the closed loop covers the entire process of data application development. With a unified UI, the workflow-like graphical drag-and-drop development experience meets the entire lifecycle of data application development from data import, desensitization cleaning, data analysis, data mining, quality inspection, visualization, scheduling to data output applications, etc.

Azkaban - Batch workflow job scheduler created at LinkedIn to run Hadoop jobs

  •    Java

Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows. Azkaban was designed primarily with usability in mind. It has been running at LinkedIn for several years, and drives many of their Hadoop and data warehouse processes.

Schedulis - Schedulis is a high performance workflow task scheduling system that supports high availability and multi-tenant financial level features, Linkis computing middleware, and has been integrated into data application development portal DataSphere Studio

  •    Java

Schedulis is a high performance workflow task scheduling system that supports high availability and multi-tenant financial level features, Linkis computing middleware, and has been integrated into data application development portal DataSphere Studio


Luigi - Python module that helps you build complex pipelines of batch jobs

  •    Python

The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else.

Hue - The open source Apache Hadoop UI

  •    Java

Hue is a Web application for interacting with Apache Hadoop. It supports a FileBrowser for accessing HDFS, JobBrowser for accessing MapReduce jobs (MR1/MR2-YARN), Job Designer for creating MapReduce/Streaming/Java jobs, HBase Browser for exploring and modifying HBase tables and data, Oozie App for submitting and scheduling workflows and bundles, A Pig/HBase/Sqoop2 shell, Beeswax application for executing Hive queries, Search app for querying Solr and Solr Cloud.

Shark - Hive on Spark

  •    Scala

Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users. It runs Hive queries up to 100x faster in memory, or 10x on disk. it is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

Apache Oozie - Workflow Scheduler for Hadoop

  •    Java

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Its jobs are Directed Acyclical Graphs (DAGs) of actions and triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).

coral - Coral is a translation, analysis, and query rewrite engine for SQL and other relational languages

  •    Java

Coral is a library for analyzing, processing, and rewriting views defined in the Hive Metastore, and sharing them across multiple execution engines. It performs SQL translations to enable views expressed in HiveQL (and potentially other languages) to be accessible in engines such as Trino (formerly PrestoSQL), Apache Spark, and Apache Pig. Coral not only translates view definitions between different SQL/non-SQL dialects, but also rewrites expressions to produce semantically equivalent ones, taking into account the semantics of the target language or engine. For example, it automatically composes new built-in expressions that are equivalent to each built-in expression in the source view definition. Additionally, it integrates with Transport UDFs to enable translating and executing user-defined functions (UDFs) across Hive, Trino, Spark, and Pig. Coral is under active development. Currently, we are looking into expanding the set of input view language APIs beyond HiveQL, and implementing query rewrite algorithms for data governance and query optimization. The project is under active development and we welcome contributions of different forms. Please see the Contribution Agreement.

Apache Hive - The Apache Hive (TM) data warehouse software facilitates querying and managing large d

  •    Java

The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage.

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  •    Python

Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

spring-hadoop - Spring for Apache Hadoop is a framework for application developers to take advantage of the features of both Hadoop and Spring

  •    Java

The Spring for Apache Hadoop project provides extensions to Spring, Spring Batch, and Spring Integration to build manageable and robust pipeline solutions around Hadoop.Spring for Apache Hadoop extends Spring Batch by providing support for reading from and writing to HDFS, running various types of Hadoop jobs (Java MapReduce, Streaming, Hive, Spark, Pig) and using HBase. An important goal is to provide excellent support for non-Java based developers to be productive using Spring Hadoop and not have to write any Java code to use the core feature set.

Quicksql - Simpler, Safer, Faster Unified SQL Analytics Engine for Multi-Datasources

  •    Java

Quicksql is a SQL query product which can be used for specific datastore queries or multiple datastores correlated queries. It supports relational databases, non-relational databases and even datastore which does not support SQL (such as Elasticsearch, Druid) . In addition, a SQL query can join or union data from multiple datastores in Quicksql. For example, you can perform unified SQL query on one situation that a part of data stored on Elasticsearch, but the other part of data stored on Hive. The most important is that QSQL is not dependent on any intermediate compute engine, users only need to focus on data and unified SQL grammar to finished statistics and analysis. An architecture diagram helps you access Quicksql more easily.

azkaban - Azkaban workflow manager.

  •    Java

Azkaban builds use Gradle and requires Java 8 or higher. The following set of commands run on *nix platforms like Linux, OS X.

pipeline - PipelineAI: Real-Time Enterprise AI Platform

  •    HTML

Each model is built into a separate Docker image with the appropriate Python, C++, and Java/Scala Runtime Libraries for training or prediction. Use the same Docker Image from Local Laptop to Production to avoid dependency surprises.

Trino - A query engine that runs at ludicrous speed

  •    Java

Trino is a highly parallel and distributed query engine, that is built from the ground up for efficient, low latency analytics. It is an ANSI SQL compliant query engine, that works with BI tools such as R, Tableau, Power BI, Superset and many others. It helps to natively query data in Hadoop, S3, Cassandra, MySQL, and many others, without the need for complex, slow, and error-prone processes for copying the data.






We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.